Author | Commit | Message | Date
Lianmin Zheng | 0cb099e20a | set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) | 2024-08-16 03:47:39 +10:00
Lianmin Zheng | 326df4bab2 | Use a single workspace for flashinfer (#1077) | 2024-08-14 19:25:37 -07:00
Ying Sheng | 6767e2229f | Support jinja as chat template file (#1104) | 2024-08-14 17:43:14 -07:00
rainred | 616b59f384 | [Feature] modify Runtime to support skip_tokenizer_init (#1088) (Co-authored-by: lzhang <zhanglei@modelbest.cn>) | 2024-08-14 00:28:04 -07:00
Lianmin Zheng | c877292cc1 | Re-organize CI tests (#1052) | 2024-08-12 03:39:01 -07:00
Yineng Zhang | 94752ac811 | feat: use FlashInfer rmsnorm and silu (#907) | 2024-08-11 14:57:13 +10:00
gryffindor-rr | 9cf0a5bada | Add skip_tokenizer_init args. (#959) (Co-authored-by: lzhang <zhanglei@modelbest.cn>) | 2024-08-09 12:14:13 -07:00
Ying Sheng | b16e856f11 | Add openai embedding API (#997) | 2024-08-09 11:19:18 -07:00
liuyhwangyh | b91a4cb1b1 | support models from www.modelscope.cn (#994) (Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>) | 2024-08-09 02:52:14 -07:00
Ying Sheng | e040a2450b | Add e5-mistral embedding model - step 3/3 (#988) | 2024-08-08 16:31:19 -07:00
Ying Sheng | 9f662501a3 | Move torch.compile configs into cuda_graph_runner.py (#993) | 2024-08-08 13:20:30 -07:00
Ying Sheng | ff68ae857a | Show more error messages for warmup errors (#932) | 2024-08-06 23:57:06 -07:00
yichuan~ | 795eab6dda | Add support for Batch API test (#936) | 2024-08-06 23:52:10 -07:00
Ying Sheng | 0d4f3a9fcd | Make API Key OpenAI-compatible (#917) | 2024-08-04 13:35:44 -07:00
Ying Sheng | 70cc0749ce | Add model accuracy test - step 1 (#866) | 2024-08-03 18:20:50 -07:00
任嘉 | 4013a4e1b0 | Implement served_model_name to customize model id when use local mode… (#749) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-01 17:13:51 -07:00
Ying Sheng | 60340a3643 | Improve the coverage of the openai api server test (#878) | 2024-08-01 16:01:30 -07:00
Ying Sheng | 72b6ea88b4 | Make scripts under /test/srt as unit tests (#875) | 2024-08-01 14:34:55 -07:00
Ying Sheng | 6f221d4ca0 | Fix unit tests for the frontend language part (#872) | 2024-08-01 12:39:12 -07:00
Yineng Zhang | bc3eaac2b8 | chore: update flashinfer to v0.1.3 (#850) | 2024-08-01 04:37:05 +10:00
Liangsheng Yin | cdcbde5fc3 | Code structure refactor (#807) | 2024-07-29 23:04:48 -07:00
yichuan~ | 084fa54d37 | Add support for OpenAI API : offline batch(file) processing (#699) (Co-authored-by: hnyls2002 <hnyls2002@gmail.com>) | 2024-07-29 13:07:18 -07:00
Ying Sheng | eba458bd19 | Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" (#806) | 2024-07-29 12:20:42 -07:00
Ying Sheng | 7d352b4fdd | Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" (#805) | 2024-07-29 11:39:12 -07:00
Yineng Zhang | 87064015d9 | fix: update flashinfer to 0.1.2 to fix sampling for cu118 (#803) | 2024-07-29 11:00:52 -07:00
Liangsheng Yin | 7cd4f244a4 | Chunked prefill (#800) | 2024-07-29 03:32:58 -07:00
Ying Sheng | 98111fbe3e | Revert "Chunked prefill support" (#799) | 2024-07-29 02:38:31 -07:00
Liangsheng Yin | 2ec39ab712 | Chunked prefill support (#797) | 2024-07-29 02:21:50 -07:00
Yineng Zhang | dd7e8b9421 | chore: add copyright for srt (#790) | 2024-07-28 23:07:12 +10:00
Lianmin Zheng | 752e643007 | Allow disabling flashinfer sampling kernel (#778) | 2024-07-27 20:18:56 -07:00
Mingyi | e4db4e5ba5 | minor refactor: move check server args to server_args.py (#774) | 2024-07-27 19:03:40 -07:00
Ying Sheng | 8fbba3de3d | Fix bugs (fp8 checkpoints, triton cache manager) (#729) | 2024-07-25 07:42:00 -07:00
Liangsheng Yin | 04ec6ba2ac | Fix dockerfile and triton cache manager (#720) | 2024-07-25 03:04:21 -07:00
Lianmin Zheng | 01d66ae2e8 | Fix multi-node deadlock (#709) | 2024-07-23 21:53:36 -07:00
Ying Sheng | 444a02441a | Update vllm version to support llama3.1 (#705) | 2024-07-23 13:49:34 -07:00
Liangsheng Yin | eedc12e12e | Support Deepseek MoE Model (#689) | 2024-07-21 03:09:29 -07:00
Liangsheng Yin | caaad53b52 | Support gpt-bigcode model class (#681) | 2024-07-20 18:34:37 -07:00
Liangsheng Yin | 69d19188fc | Decouple kv (#679) | 2024-07-20 14:16:45 -07:00
Mingyi | e3046ea3a8 | Update OpenAI API (#667) | 2024-07-19 23:20:54 -07:00
Ying Sheng | e87c7fd501 | Improve docs (#662) | 2024-07-19 10:58:03 -07:00
Ying Sheng | 51fda1439f | Update Readme (#660) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-07-19 09:54:01 -07:00
zhyncs | c126a6ccba | feat: add benchmark serving (#657) | 2024-07-19 09:15:21 -07:00
Lianmin Zheng | e1792cca24 | Remove cached triton launcher (#656) | 2024-07-18 23:28:40 -07:00
zhyncs | b050d9283f | fix: set ulimit -n 65535 (#647) | 2024-07-18 02:35:45 -07:00
Mingyi | d774acad5c | Remove the dependency of rpyc (#646) | 2024-07-18 02:13:54 -07:00
Liangsheng Yin | abd5385ac5 | Move global_server_args_dict (#642) | 2024-07-17 13:49:15 -07:00
Liangsheng Yin | 3de2f30a27 | Flashinfer sample kernel (#617) | 2024-07-17 13:24:43 -07:00
zhyncs | 2e341cd493 | misc: add pre-commit config (#637) | 2024-07-17 11:55:39 -07:00
Ying Sheng | 0aa189f150 | Disable NCCL_NVLS by default (#631) | 2024-07-16 09:05:10 -07:00
Ying Sheng | 6a2941f4d0 | Improve tensor parallel performance (#625) (Co-authored-by: Mingyi <wisclmy0611@gmail.com>) | 2024-07-15 07:10:51 -07:00