115 Commits

Author SHA1 Message Date
Lianmin Zheng
0cb099e20a set CUDA_DEVICE_MAX_CONNECTIONS=1 (#1113) 2024-08-16 03:47:39 +10:00
Lianmin Zheng
326df4bab2 Use a single workspace for flashinfer (#1077) 2024-08-14 19:25:37 -07:00
Ying Sheng
6767e2229f Support jinja as chat template file (#1104) 2024-08-14 17:43:14 -07:00
rainred
616b59f384 [Feature] modify Runtime to support skip_tokenizer_init (#1088)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-14 00:28:04 -07:00
Lianmin Zheng
c877292cc1 Re-organize CI tests (#1052) 2024-08-12 03:39:01 -07:00
Yineng Zhang
94752ac811 feat: use FlashInfer rmsnorm and silu (#907) 2024-08-11 14:57:13 +10:00
gryffindor-rr
9cf0a5bada Add skip_tokenizer_init args. (#959)
Co-authored-by: lzhang <zhanglei@modelbest.cn>
2024-08-09 12:14:13 -07:00
Ying Sheng
b16e856f11 Add openai embedding API (#997) 2024-08-09 11:19:18 -07:00
liuyhwangyh
b91a4cb1b1 support models from www.modelscope.cn (#994)
Co-authored-by: mulin.lyh <mulin.lyh@taobao.com>
2024-08-09 02:52:14 -07:00
Ying Sheng
e040a2450b Add e5-mistral embedding model - step 3/3 (#988) 2024-08-08 16:31:19 -07:00
Ying Sheng
9f662501a3 Move torch.compile configs into cuda_graph_runner.py (#993) 2024-08-08 13:20:30 -07:00
Ying Sheng
ff68ae857a Show more error messages for warmup errors (#932) 2024-08-06 23:57:06 -07:00
yichuan~
795eab6dda Add support for Batch API test (#936) 2024-08-06 23:52:10 -07:00
Ying Sheng
0d4f3a9fcd Make API Key OpenAI-compatible (#917) 2024-08-04 13:35:44 -07:00
Ying Sheng
70cc0749ce Add model accuracy test - step 1 (#866) 2024-08-03 18:20:50 -07:00
任嘉
4013a4e1b0 Implement served_model_name to customize model id when use local mode… (#749)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2024-08-01 17:13:51 -07:00
Ying Sheng
60340a3643 Improve the coverage of the openai api server test (#878) 2024-08-01 16:01:30 -07:00
Ying Sheng
72b6ea88b4 Make scripts under /test/srt as unit tests (#875) 2024-08-01 14:34:55 -07:00
Ying Sheng
6f221d4ca0 Fix unit tests for the frontend language part (#872) 2024-08-01 12:39:12 -07:00
Yineng Zhang
bc3eaac2b8 chore: update flashinfer to v0.1.3 (#850) 2024-08-01 04:37:05 +10:00
Liangsheng Yin
cdcbde5fc3 Code structure refactor (#807) 2024-07-29 23:04:48 -07:00
yichuan~
084fa54d37 Add support for OpenAI API : offline batch(file) processing (#699)
Co-authored-by: hnyls2002 <hnyls2002@gmail.com>
2024-07-29 13:07:18 -07:00
Ying Sheng
eba458bd19 Revert "Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118"" (#806) 2024-07-29 12:20:42 -07:00
Ying Sheng
7d352b4fdd Revert "fix: update flashinfer to 0.1.2 to fix sampling for cu118" (#805) 2024-07-29 11:39:12 -07:00
Yineng Zhang
87064015d9 fix: update flashinfer to 0.1.2 to fix sampling for cu118 (#803) 2024-07-29 11:00:52 -07:00
Liangsheng Yin
7cd4f244a4 Chunked prefill (#800) 2024-07-29 03:32:58 -07:00
Ying Sheng
98111fbe3e Revert "Chunked prefill support" (#799) 2024-07-29 02:38:31 -07:00
Liangsheng Yin
2ec39ab712 Chunked prefill support (#797) 2024-07-29 02:21:50 -07:00
Yineng Zhang
dd7e8b9421 chore: add copyright for srt (#790) 2024-07-28 23:07:12 +10:00
Lianmin Zheng
752e643007 Allow disabling flashinfer sampling kernel (#778) 2024-07-27 20:18:56 -07:00
Mingyi
e4db4e5ba5 minor refactor: move check server args to server_args.py (#774) 2024-07-27 19:03:40 -07:00
Ying Sheng
8fbba3de3d Fix bugs (fp8 checkpoints, triton cache manager) (#729) 2024-07-25 07:42:00 -07:00
Liangsheng Yin
04ec6ba2ac Fix dockerfile and triton cache manager (#720) 2024-07-25 03:04:21 -07:00
Lianmin Zheng
01d66ae2e8 Fix multi-node deadlock (#709) 2024-07-23 21:53:36 -07:00
Ying Sheng
444a02441a Update vllm version to support llama3.1 (#705) 2024-07-23 13:49:34 -07:00
Liangsheng Yin
eedc12e12e Support Deepseek MoE Model (#689) 2024-07-21 03:09:29 -07:00
Liangsheng Yin
caaad53b52 Support gpt-bigcode model class (#681) 2024-07-20 18:34:37 -07:00
Liangsheng Yin
69d19188fc Decouple kv (#679) 2024-07-20 14:16:45 -07:00
Mingyi
e3046ea3a8 Update OpenAI API (#667) 2024-07-19 23:20:54 -07:00
Ying Sheng
e87c7fd501 Improve docs (#662) 2024-07-19 10:58:03 -07:00
Ying Sheng
51fda1439f Update Readme (#660)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
2024-07-19 09:54:01 -07:00
zhyncs
c126a6ccba feat: add benchmark serving (#657) 2024-07-19 09:15:21 -07:00
Lianmin Zheng
e1792cca24 Remove cached triton launcher (#656) 2024-07-18 23:28:40 -07:00
zhyncs
b050d9283f fix: set ulimit -n 65535 (#647) 2024-07-18 02:35:45 -07:00
Mingyi
d774acad5c Remove the dependency of rpyc (#646) 2024-07-18 02:13:54 -07:00
Liangsheng Yin
abd5385ac5 Move global_server_args_dict (#642) 2024-07-17 13:49:15 -07:00
Liangsheng Yin
3de2f30a27 Flashinfer sample kernel (#617) 2024-07-17 13:24:43 -07:00
zhyncs
2e341cd493 misc: add pre-commit config (#637) 2024-07-17 11:55:39 -07:00
Ying Sheng
0aa189f150 Disable NCCL_NVLS by default (#631) 2024-07-16 09:05:10 -07:00
Ying Sheng
6a2941f4d0 Improve tensor parallel performance (#625)
Co-authored-by: Mingyi <wisclmy0611@gmail.com>
2024-07-15 07:10:51 -07:00