Commit Graph

51 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Ke Bao | c6b6d2e71b | Enable MLA by default (#1447) | 2024-09-17 11:42:48 +00:00 |
| Lianmin Zheng | 2fa5cec775 | Simplify sampler and its error handling (#1441) | 2024-09-16 21:23:31 -07:00 |
| Lianmin Zheng | 27b557aea7 | Clean up model loader (#1440) | 2024-09-16 18:16:27 -07:00 |
| Liangsheng Yin | 70b6802982 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392) | 2024-09-13 20:27:53 -07:00 |
| Ying Sheng | 712216928f | [Feature] Initial support for multi-LoRA serving (#1307) | 2024-09-12 16:46:14 -07:00 |
| Lianmin Zheng | 3efa798116 | Support cuda graph in the triton attention backend (#1401) | 2024-09-12 00:36:55 -07:00 |
| Lianmin Zheng | fec185ce0c | Refactor attention backend (#1381) | 2024-09-11 11:44:26 -07:00 |
| Lianmin Zheng | 46094e0c1b | Deprecate --disable-flashinfer and introduce --attention-backend (#1380) | 2024-09-10 17:11:16 -07:00 |
| Lianmin Zheng | 3a6e8b6d78 | [Minor] move triton attention kernels into a separate folder (#1379) | 2024-09-10 15:15:08 -07:00 |
| Liangsheng Yin | 69b3bb9ae1 | Unify forward mode (#1360) | 2024-09-09 13:49:29 -07:00 |
| Jerry Zhang | a7c47e0f02 | Add torchao quant (int4/int8/fp8) to llama models (#1341) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>) | 2024-09-09 05:32:41 -07:00 |
| Lianmin Zheng | e4d68afcf0 | [Minor] Many cleanup (#1357) | 2024-09-09 04:14:11 -07:00 |
| Yineng Zhang | dc67d97693 | misc: speedup load safetensors (#1319) (Co-authored-by: ispobock <ISPObaoke@163.com>) | 2024-09-04 04:29:53 +10:00 |
| Lianmin Zheng | f64eae3a29 | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308) | 2024-09-02 21:44:45 -07:00 |
| Liangsheng Yin | a5a134f39f | Fix bugs in sampler with CUDA graph / torch.compile (#1306) | 2024-09-02 23:18:48 +00:00 |
| Kai-Hsun Chen | 0836055324 | [Chore] Rename model_overide_args to model_override_args (#1284) (Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 03:14:56 -07:00 |
| Ke Bao | 6cb32ef92c | Support Triton fp8 e5m2 kv cache (#1286) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-09-01 19:46:40 +10:00 |
| Liangsheng Yin | 381dd57bd6 | Sampler cudagraph (#1253) | 2024-08-28 18:58:52 -07:00 |
| Lianmin Zheng | bf53bf5142 | [Fix] Fix llava on multi images (#1247) | 2024-08-28 06:33:05 -07:00 |
| Yineng Zhang | f25f4dfde5 | hotfix: revert sampler CUDA Graph (#1242) | 2024-08-28 21:16:47 +10:00 |
| Yineng Zhang | 198974cd1a | feat: support sm75 with FlashInfer v0.1.6 (#1233) | 2024-08-28 18:39:12 +10:00 |
| Yineng Zhang | c5fe11a8e1 | chore: bump v0.2.14 (#1155) | 2024-08-27 00:28:24 +10:00 |
| Liangsheng Yin | 75ce37f401 | Move sampler into CUDA graph (#1201) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-26 07:02:50 -07:00 |
| Ke Bao | 2c615d120f | [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) (Co-authored-by: Yineng Zhang <me@zhyncs.com>) | 2024-08-25 17:38:11 -07:00 |
| Lianmin Zheng | 902278008a | [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208) | 2024-08-25 14:46:34 -07:00 |
| Chayenne | 30b4f771b0 | Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2024-08-25 10:29:12 -07:00 |
| Lianmin Zheng | f6af3a6561 | Cleanup readme, llava examples, usage examples and nccl init (#1194) | 2024-08-24 08:02:23 -07:00 |
| Liangsheng Yin | 83e23c69b3 | Improve code style of sampler (#1168) | 2024-08-21 16:48:24 -07:00 |
| Lianmin Zheng | bea2bb9eea | Improve multi-node stability (#1171) | 2024-08-20 22:35:05 -07:00 |
| Shan Yu | cd10654e7e | [Feat] Support update weights without restart server (#1157) | 2024-08-20 13:48:24 -07:00 |
| Xu-Chen | ff2cfdb1a2 | [Feature] add disable-custom-all-reduce (#1148) (Co-authored-by: chenxu02 <chenxu02@zhihu.com>; Yineng Zhang <me@zhyncs.com>) | 2024-08-20 08:44:12 -07:00 |
| Lianmin Zheng | a8ae640328 | Improve docs and warnings (#1164) | 2024-08-20 08:31:29 -07:00 |
| Yineng Zhang | 9208591f05 | fix: use fp16 dtype for sm75 (#1136) | 2024-08-18 00:45:42 +10:00 |
| Lianmin Zheng | 5a261bd055 | Fix the deadlock in multi-node tp (#1122) | 2024-08-16 01:39:24 -07:00 |
| Ying Sheng | 93d4e354d8 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112) | 2024-08-15 10:33:20 -07:00 |
| Yineng Zhang | 9195d1362a | misc: rm unused model_loader (#1110) | 2024-08-15 08:29:35 -07:00 |
| Ying Sheng | 14cb544d56 | [Fix] fix flashinfer usage for window attention (#1107) | 2024-08-15 00:53:24 -07:00 |
| Ying Sheng | 8d2d876fc8 | [Fix] fix the typo bug for window attention (#1106) | 2024-08-14 21:56:01 -07:00 |
| Lianmin Zheng | 326df4bab2 | Use a single workspace for flashinfer (#1077) | 2024-08-14 19:25:37 -07:00 |
| Ying Sheng | 96a2093ef0 | [Fix] Compatibility of window attention and cuda graph (#1090) | 2024-08-14 10:37:01 -07:00 |
| Lianmin Zheng | a59636bb5e | Update grok 1 model (#1095) | 2024-08-14 04:40:44 -07:00 |
| Ying Sheng | 0909bb0d2f | [Feat] Add window attention for gemma-2 (#1056) | 2024-08-13 17:01:26 -07:00 |
| Ying Sheng | e040a2450b | Add e5-mistral embedding model - step 3/3 (#988) | 2024-08-08 16:31:19 -07:00 |
| Liangsheng Yin | 1ac304eeb4 | Adjust InputeMetadata and ScheduleBatch (#981) | 2024-08-08 01:11:22 -07:00 |
| Liangsheng Yin | f724f1f1e9 | PrefillAdder abstraction (#968) | 2024-08-07 13:47:28 -07:00 |
| Liangsheng Yin | 87e8c090e9 | Organize code (rename, movement) (#953) | 2024-08-06 20:50:32 -07:00 |
| Ke Bao | e1eae1fd15 | Support MLA for DeepSeek-V2 with Triton - step 1 (#905) | 2024-08-05 03:40:33 +10:00 |
| Yineng Zhang | 6b8f66efe1 | misc: update cuda graph capture exception log (#894) | 2024-08-03 00:40:52 +10:00 |
| Liangsheng Yin | 6b0f2e9088 | Add --max-total-tokens (#840) | 2024-07-30 13:33:55 -07:00 |
| Ying Sheng | e7487b08bc | Adjust default mem fraction to avoid OOM (#823) | 2024-07-30 01:58:31 -07:00 |