Commit Graph

284 Commits

Author SHA1 Message Date
Lianmin Zheng
bbc07c4197 Move sampling logits to float32 (#773) 2024-07-27 17:30:12 -07:00
Lianmin Zheng
a036d41980 Fix max new tokens (#772) 2024-07-27 17:22:18 -07:00
Lianmin Zheng
f95e661757 Fix max_tokens for OpenAI chat completion API (#766) 2024-07-27 15:44:27 -07:00
Lianmin Zheng
0736b27020 [Minor] Improve the code style in TokenizerManager (#767) 2024-07-27 05:05:15 -07:00
Ke Bao
3fdab91912 Fix TransformerTokenizer init for chatglm2 & 3 (#761) 2024-07-27 02:44:46 -07:00
Liangsheng Yin
d9fccfefe2 Fix context length (#757) 2024-07-26 18:13:13 -07:00
Liangsheng Yin
679ebcbbdc Deepseek v2 support (#693) 2024-07-26 17:10:07 -07:00
Yineng Zhang
6b32bb1c0b misc: format (#741) 2024-07-26 21:00:51 +10:00
Toshiki Kataoka
40facad5f1 feat: support token ids in /v1/completions (#736) 2024-07-26 02:53:17 -07:00
Toshiki Kataoka
da504445dc fix /generate without sampling_params (#734) 2024-07-26 01:27:56 -07:00
Ying Sheng
252e0f7bbd fix: small bug for llama-405b fp16 (#733) 2024-07-25 21:14:54 -07:00
Ying Sheng
8fbba3de3d Fix bugs (fp8 checkpoints, triton cache manager) (#729) 2024-07-25 07:42:00 -07:00
Ying Sheng
ae0f6130cb Revert "fix: fp8 config" (#728) 2024-07-25 07:25:33 -07:00
Liangsheng Yin
04ec6ba2ac Fix dockerfile and triton cache manager (#720) 2024-07-25 03:04:21 -07:00
Ying Sheng
d63f13c13b fix: fp8 config (#723) 2024-07-25 02:01:56 -07:00
Ying Sheng
30d8e130e7 Improve benchmark scripts (#717) 2024-07-24 14:44:14 -07:00
Yineng Zhang
e17deb27b5 fix: llama 3.1 405b fp8 (#714) 2024-07-24 09:37:41 -07:00
Ying Sheng
83d2b30d75 format 2024-07-24 10:53:07 +00:00
Ying Sheng
4367f4bb8d Fix prefill size (#711) 2024-07-24 03:41:15 -07:00
Lianmin Zheng
00e4baa728 Update schedule_heuristic.py 2024-07-24 01:22:30 -07:00
Liangsheng Yin
4cd64b8ee6 Auto adjust new ratio (#708) 2024-07-23 22:06:02 -07:00
Lianmin Zheng
01d66ae2e8 Fix multi-node deadlock (#709) 2024-07-23 21:53:36 -07:00
Mingyi
a523a3c13a Reduce hardcoded logic of kernel usage (#707) 2024-07-23 16:42:21 -07:00
Ying Sheng
444a02441a Update vllm version to support llama3.1 (#705) 2024-07-23 13:49:34 -07:00
Liangsheng Yin
268684439b Use min new token ratio at start (#701) 2024-07-23 11:52:50 -07:00
Ke Bao
824a77d04d Fix hf config loading (#702) 2024-07-23 11:39:08 -07:00
Ying Sheng
cf99eab7d5 Fix flashinfer (#700) 2024-07-23 01:27:01 -07:00
Ying Sheng
c3f1aac811 Tune params (#696) 2024-07-22 03:19:24 -07:00
Ke Bao
5303c1ed22 Support Mistral-Nemo (#691) 2024-07-22 03:36:53 +10:00
Liangsheng Yin
eedc12e12e Support Deepseek MoE Model (#689) 2024-07-21 03:09:29 -07:00
Lianmin Zheng
77e592e8e0 support non-streaming benchmark (#682) 2024-07-20 18:36:42 -07:00
Liangsheng Yin
caaad53b52 Support gpt-bigcode model class (#681) 2024-07-20 18:34:37 -07:00
Liangsheng Yin
69d19188fc Decouple kv (#679) 2024-07-20 14:16:45 -07:00
Ke Bao
0ac94c36cb Fallback when sampling failed (#678) 2024-07-20 10:44:54 -07:00
Liangsheng Yin
f424e76d96 Fix illegal tokens during sampling (#676) 2024-07-20 03:11:15 -07:00
Lianmin Zheng
490a1f39dd Fix cuda graph with flashinfer (#675) 2024-07-20 02:43:55 -07:00
Ying Sheng
06487f126e refactor model loader: initial refactor (#664) 2024-07-20 02:18:22 -07:00
Liangsheng Yin
39c57317e1 Revert "Temporary fix invalid sample results" (#673) 2024-07-20 02:06:31 -07:00
Lianmin Zheng
35759efa91 Support random dataset in bench_serving.py (#669) 2024-07-20 01:06:43 -07:00
Liangsheng Yin
8f4b1559e7 Temporary fix invalid sample results (#668) 2024-07-20 00:51:05 -07:00
Mingyi
e3046ea3a8 Update OpenAI API (#667) 2024-07-19 23:20:54 -07:00
yichuan~
49c5e0eca9 Add support for OpenAI API parallel sampling (#640) 2024-07-19 23:10:01 -07:00
Ke Bao
ec2150b294 Fix kill process util (#666) 2024-07-19 21:43:11 -07:00
Liangsheng Yin
7620cd37dd Fix jump forward when streaming (#665) 2024-07-19 16:42:06 -07:00
Ying Sheng
e87c7fd501 Improve docs (#662) 2024-07-19 10:58:03 -07:00
Ying Sheng
51fda1439f Update Readme (#660); Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> 2024-07-19 09:54:01 -07:00
Ying Sheng
2d96da813e refactor model loader [unreachable code]: initial refactor (#655) 2024-07-19 09:27:06 -07:00
zhyncs
c126a6ccba feat: add benchmark serving (#657) 2024-07-19 09:15:21 -07:00
zhyncs
ac971ff633 perf: reduce ttft and itl with stream_interval 1 (#658) 2024-07-19 09:14:22 -07:00
Lianmin Zheng
e1792cca24 Remove cached triton launcher (#656) 2024-07-18 23:28:40 -07:00