Hash       | Author         | Date                       | Commit message
-----------|----------------|----------------------------|------------------------------------------------------------
c6b6d2e71b | Ke Bao         | 2024-09-17 11:42:48 +00:00 | Enable MLA by default (#1447)
2fa5cec775 | Lianmin Zheng  | 2024-09-16 21:23:31 -07:00 | Simplify sampler and its error handling (#1441)
27b557aea7 | Lianmin Zheng  | 2024-09-16 18:16:27 -07:00 | Clean up model loader (#1440)
70b6802982 | Liangsheng Yin | 2024-09-13 20:27:53 -07:00 | Optimize conflicts between CUDA graph and vocab mask tensors (#1392)
712216928f | Ying Sheng     | 2024-09-12 16:46:14 -07:00 | [Feature] Initial support for multi-LoRA serving (#1307)
3efa798116 | Lianmin Zheng  | 2024-09-12 00:36:55 -07:00 | Support cuda graph in the triton attention backend (#1401)
fec185ce0c | Lianmin Zheng  | 2024-09-11 11:44:26 -07:00 | Refactor attention backend (#1381)
46094e0c1b | Lianmin Zheng  | 2024-09-10 17:11:16 -07:00 | Deprecate --disable-flashinfer and introduce --attention-backend (#1380)
3a6e8b6d78 | Lianmin Zheng  | 2024-09-10 15:15:08 -07:00 | [Minor] move triton attention kernels into a separate folder (#1379)
69b3bb9ae1 | Liangsheng Yin | 2024-09-09 13:49:29 -07:00 | Unify forward mode (#1360)
a7c47e0f02 | Jerry Zhang    | 2024-09-09 05:32:41 -07:00 | Add torchao quant (int4/int8/fp8) to llama models (#1341) [Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>]
e4d68afcf0 | Lianmin Zheng  | 2024-09-09 04:14:11 -07:00 | [Minor] Many cleanup (#1357)
dc67d97693 | Yineng Zhang   | 2024-09-04 04:29:53 +10:00 | misc: speedup load safetensors (#1319) [Co-authored-by: ispobock <ISPObaoke@163.com>]
f64eae3a29 | Lianmin Zheng  | 2024-09-02 21:44:45 -07:00 | [Fix] Reduce memory usage for loading llava model & Remove EntryClassRemapping (#1308)
a5a134f39f | Liangsheng Yin | 2024-09-02 23:18:48 +00:00 | Fix bugs in sampler with CUDA graph / torch.compile (#1306)
0836055324 | Kai-Hsun Chen  | 2024-09-01 03:14:56 -07:00 | [Chore] Rename model_overide_args to model_override_args (#1284) [Signed-off-by: Kai-Hsun Chen <kaihsun@anyscale.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>]
6cb32ef92c | Ke Bao         | 2024-09-01 19:46:40 +10:00 | Support Triton fp8 e5m2 kv cache (#1286) [Co-authored-by: Yineng Zhang <me@zhyncs.com>]
381dd57bd6 | Liangsheng Yin | 2024-08-28 18:58:52 -07:00 | Sampler cudagraph (#1253)
bf53bf5142 | Lianmin Zheng  | 2024-08-28 06:33:05 -07:00 | [Fix] Fix llava on multi images (#1247)
f25f4dfde5 | Yineng Zhang   | 2024-08-28 21:16:47 +10:00 | hotfix: revert sampler CUDA Graph (#1242)
198974cd1a | Yineng Zhang   | 2024-08-28 18:39:12 +10:00 | feat: support sm75 with FlashInfer v0.1.6 (#1233)
c5fe11a8e1 | Yineng Zhang   | 2024-08-27 00:28:24 +10:00 | chore: bump v0.2.14 (#1155)
75ce37f401 | Liangsheng Yin | 2024-08-26 07:02:50 -07:00 | Move sampler into CUDA graph (#1201) [Co-authored-by: Yineng Zhang <me@zhyncs.com>]
2c615d120f | Ke Bao         | 2024-08-25 17:38:11 -07:00 | [Feature] Support fp8 e5m2 kv cache with flashinfer (#1204) [Co-authored-by: Yineng Zhang <me@zhyncs.com>]
902278008a | Lianmin Zheng  | 2024-08-25 14:46:34 -07:00 | [Minor] Improve the function organization in TokenizerManager & improve loggers (#1208)
30b4f771b0 | Chayenne       | 2024-08-25 10:29:12 -07:00 | Support Alibaba-NLP/gte-Qwen2-7B-instruct embedding Model (#1186) [Co-authored-by: Ying Sheng <sqy1415@gmail.com>]
f6af3a6561 | Lianmin Zheng  | 2024-08-24 08:02:23 -07:00 | Cleanup readme, llava examples, usage examples and nccl init (#1194)
83e23c69b3 | Liangsheng Yin | 2024-08-21 16:48:24 -07:00 | Improve code style of sampler (#1168)
bea2bb9eea | Lianmin Zheng  | 2024-08-20 22:35:05 -07:00 | Improve multi-node stability (#1171)
cd10654e7e | Shan Yu        | 2024-08-20 13:48:24 -07:00 | [Feat] Support update weights without restart server (#1157)
ff2cfdb1a2 | Xu-Chen        | 2024-08-20 08:44:12 -07:00 | [Feature] add disable-custom-all-reduce (#1148) [Co-authored-by: chenxu02 <chenxu02@zhihu.com>; Co-authored-by: Yineng Zhang <me@zhyncs.com>]
a8ae640328 | Lianmin Zheng  | 2024-08-20 08:31:29 -07:00 | Improve docs and warnings (#1164)
9208591f05 | Yineng Zhang   | 2024-08-18 00:45:42 +10:00 | fix: use fp16 dtype for sm75 (#1136)
5a261bd055 | Lianmin Zheng  | 2024-08-16 01:39:24 -07:00 | Fix the deadlock in multi-node tp (#1122)
93d4e354d8 | Ying Sheng     | 2024-08-15 10:33:20 -07:00 | [Fix] Window attention compatible with RadixAttention and chunked prefill (#1112)
9195d1362a | Yineng Zhang   | 2024-08-15 08:29:35 -07:00 | misc: rm unused model_loader (#1110)
14cb544d56 | Ying Sheng     | 2024-08-15 00:53:24 -07:00 | [Fix] fix flashinfer usage for window attention (#1107)
8d2d876fc8 | Ying Sheng     | 2024-08-14 21:56:01 -07:00 | [Fix] fix the typo bug for window attention (#1106)
326df4bab2 | Lianmin Zheng  | 2024-08-14 19:25:37 -07:00 | Use a single workspace for flashinfer (#1077)
96a2093ef0 | Ying Sheng     | 2024-08-14 10:37:01 -07:00 | [Fix] Compatibility of window attention and cuda graph (#1090)
a59636bb5e | Lianmin Zheng  | 2024-08-14 04:40:44 -07:00 | Update grok 1 model (#1095)
0909bb0d2f | Ying Sheng     | 2024-08-13 17:01:26 -07:00 | [Feat] Add window attention for gemma-2 (#1056)
e040a2450b | Ying Sheng     | 2024-08-08 16:31:19 -07:00 | Add e5-mistral embedding model - step 3/3 (#988)
1ac304eeb4 | Liangsheng Yin | 2024-08-08 01:11:22 -07:00 | Adjust InputeMetadata and ScheduleBatch (#981)
f724f1f1e9 | Liangsheng Yin | 2024-08-07 13:47:28 -07:00 | PrefillAdder abstraction (#968)
87e8c090e9 | Liangsheng Yin | 2024-08-06 20:50:32 -07:00 | Organize code (rename, movement) (#953)
e1eae1fd15 | Ke Bao         | 2024-08-05 03:40:33 +10:00 | Support MLA for DeepSeek-V2 with Triton - step 1 (#905)
6b8f66efe1 | Yineng Zhang   | 2024-08-03 00:40:52 +10:00 | misc: update cuda graph capture exception log (#894)
6b0f2e9088 | Liangsheng Yin | 2024-07-30 13:33:55 -07:00 | Add --max-total-tokens (#840)
e7487b08bc | Ying Sheng     | 2024-07-30 01:58:31 -07:00 | Adjust default mem fraction to avoid OOM (#823)