| Author | Commit | Message | Date |
|---|---|---|---|
| Lianmin Zheng | 5493c3343e | Fix data parallel + tensor parallel (#4499) | 2025-03-17 05:13:16 -07:00 |
| 萝卜菜 | d6d21640d3 | [Feature] Support Deepseek-VL2 (#2798)<br>Co-authored-by: Edenzzzz <wtan45@wisc.edu><br>Co-authored-by: Chayenne <zhaochen20@outlook.com><br>Co-authored-by: Yi Zhang <1109276519@qq.com> | 2025-03-16 23:07:59 -07:00 |
| Mick | 9d02bb3e2a | Urgent model support: support gemma-3-it (#4424) | 2025-03-16 17:37:32 -07:00 |
| lukec | a53fe428f9 | Support FlashMLA backend (#4472)<br>Co-authored-by: yinfan98 <1106310035@qq.com> | 2025-03-16 09:07:06 -07:00 |
| Ying Sheng | 1b859295f4 | [Eagle] Remove the greedy branch and some redundant code (#4363)<br>Co-authored-by: Sehoon Kim <sehoon@x.ai> | 2025-03-16 02:48:55 -07:00 |
| JieXin Liang | 1a3fa75f2f | [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) | 2025-03-16 00:02:47 -07:00 |
| wangyu | 1ce4878d31 | feat(remote_model): support variable remote backend for model loader (#3964)<br>Signed-off-by: wangyu <wangyu.steph@bytedance.com> | 2025-03-14 00:40:44 -07:00 |
| Lianmin Zheng | 8e66fbecee | Improve DP attention (#4390)<br>Co-authored-by: dhou-xai <dhou@x.ai><br>Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-03-13 08:23:56 -07:00 |
| Lianmin Zheng | 45de89719c | Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) | 2025-03-12 23:45:52 -07:00 |
| Meng, Hengyu | 71046fcd71 | [XPU][CPU] Enable the native path of DeepSeek (#4086)<br>Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com> | 2025-03-12 22:26:29 -07:00 |
| Lianmin Zheng | c76040e31b | Support page size > 1 (#4356) | 2025-03-12 22:22:39 -07:00 |
| Lianmin Zheng | e35a93fa8a | Move output processing logic from scheduler.py into a separate file (#4354) | 2025-03-12 16:21:49 -07:00 |
| Lianmin Zheng | d40ee62b5d | Update nightly tests (#4352) | 2025-03-12 15:36:13 -07:00 |
| Yineng Zhang | d1da58e275 | unify is_cuda and is_hip (#4321) | 2025-03-11 18:12:56 -07:00 |
| Lianmin Zheng | 00d25a7f5e | Fix quantization and nightly tests (#4258) | 2025-03-10 03:06:21 -07:00 |
| Lianmin Zheng | 08c4d764a5 | lazy import attn backends (#4200) | 2025-03-08 00:41:35 -08:00 |
| Lianmin Zheng | d4017a6b63 | [EAGLE] many fixes for eagle (#4195)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com><br>Co-authored-by: Sehoon Kim <sehoon@x.ai> | 2025-03-07 22:12:13 -08:00 |
| Yineng Zhang | eb61f5c9af | Revert "ROCm: Flex Attention Enablement with custom backends (#4178)" (#4186) | 2025-03-07 10:27:52 -08:00 |
| HAI | 0beea4503f | ROCm: Flex Attention Enablement with custom backends (#4178)<br>Co-authored-by: linsun12 <linsun12@amd.com> | 2025-03-07 04:38:53 -08:00 |
| Lianmin Zheng | bc1534ff32 | Fix a draft model accuracy bug in eagle; support step=1; return logprob in eagle (#4134)<br>Co-authored-by: Sehoon Kim <kssteven418@gmail.com><br>Co-authored-by: SangBin Cho <rkooo567@gmail.com><br>Co-authored-by: Sehoon Kim <sehoon@x.ai> | 2025-03-06 06:13:59 -08:00 |
| Lianmin Zheng | 98c73d71cb | [Minor] make the __init__ function of model_runner.py shorter (#4132) | 2025-03-06 01:51:12 -08:00 |
| Zhiqiang Xie | aee30630d8 | Add a pointer to the real KV cache pool (#4113) | 2025-03-05 21:39:07 -08:00 |
| Ke Bao | ef9d3b3c2c | Fix triton kernel illegal memory issue for eagle (#4100) | 2025-03-05 11:23:53 -08:00 |
| Baizhou Zhang | fc91d08a8f | [Revision] Add fast decode plan for flashinfer mla (#4012) | 2025-03-05 11:20:41 -08:00 |
| Ying Sheng | d3d4d76758 | [Eagle] Refactor eagle speculative decoding (#3986)<br>Co-authored-by: Ke Bao <ISPObaoke@163.com> | 2025-03-05 08:06:07 -08:00 |
| Lianmin Zheng | e074d84e5b | [Minor] more code cleanup (#4077) | 2025-03-04 21:23:47 -08:00 |
| Chen Shengzhi | 61261b3996 | [XCCL] Use xccl for xpu backend since xccl is ready in latest PyTorch. (#3954) | 2025-03-04 04:05:56 -08:00 |
| Ke Bao | 9fafa62db7 | Share target model embed and head weights for nextn (#4033) | 2025-03-03 13:30:04 -08:00 |
| Lianmin Zheng | 935cda944b | Misc clean up; Remove the support of jump forward (#4032) | 2025-03-03 07:02:14 -08:00 |
| Lianmin Zheng | 66301e124f | Improve code styles (#4021) | 2025-03-03 03:20:23 -08:00 |
| Lianmin Zheng | ac2387279e | Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com><br>Co-authored-by: dhou-xai <dhou@x.ai><br>Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu> | 2025-03-03 00:12:04 -08:00 |
| Lianmin Zheng | 9e1014cf99 | Revert "Add fast decode plan for flashinfer mla" (#4008) | 2025-03-02 19:29:10 -08:00 |
| Baizhou Zhang | fa56106731 | Add fast decode plan for flashinfer mla (#3987) | 2025-03-02 19:16:37 -08:00 |
| Qiaolin Yu | 40782f05d7 | Refactor: Move return_hidden_states to the generate input (#3985)<br>Co-authored-by: Beichen-Ma <mabeichen12@gmail.com> | 2025-03-01 17:51:29 -08:00 |
| Baizhou Zhang | 90a4b7d98a | [Feature]Support ragged prefill in flashinfer mla backend (#3967)<br>Co-authored-by: Yineng Zhang <me@zhyncs.com><br>Co-authored-by: pankajroark <pankajroark@users.noreply.github.com> | 2025-02-28 18:13:56 -08:00 |
| fzyzcjy | e3e0bc50a9 | [Feature] SPMD for SGLang + Verl (#3852) | 2025-02-28 09:53:10 -08:00 |
| Qiaolin Yu | d6898dd253 | Add return hidden state in the native API (#3897)<br>Co-authored-by: Beichen-Ma <mabeichen12@gmail.com><br>Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-02-26 22:06:54 -08:00 |
| who who who | 4606e2a3fe | Bug: fix capture_bs (#3857) | 2025-02-25 08:40:35 -08:00 |
| Shenggui Li | c0bb9eb3b3 | [improve] made timeout configurable (#3803) | 2025-02-25 00:26:08 -08:00 |
| Baizhou Zhang | b110084654 | Refactor flashinfer logic for deepseek v3 and fix accuracy bug (#3785) | 2025-02-24 04:07:25 -08:00 |
| Yineng Zhang | 714f3e6362 | feat: support flashinfer mla with prefix cache (#3643) | 2025-02-18 02:06:43 +08:00 |
| Yineng Zhang | dfce926921 | fix high qps crash when enable mtp (#3592)<br>Co-authored-by: ispobock <ispobaoke@hotmail.com> | 2025-02-15 23:11:28 +08:00 |
| Yineng Zhang | 70f894b810 | feat: support flashinfer mla attention for deepseek v3 (#3550) | 2025-02-14 08:50:14 +08:00 |
| HAI | d81ac4434e | MI30x: More graph captures for larger batch sizes and concurrencies (#3420) | 2025-02-12 03:04:38 +08:00 |
| Jackmin801 | 5f0e7de339 | [Feat] Return hidden states (experimental) (#3364)<br>Co-authored-by: Chayenne <zhaochen20@outlook.com> | 2025-02-10 15:54:37 -08:00 |
| Yineng Zhang | bc72e5bd32 | add cuda graph capture failure possible solution (#3430) | 2025-02-09 22:57:11 +08:00 |
| Yineng Zhang | fad315cb8e | fix EAGLE 2 non greedy case (#3407)<br>Co-authored-by: Ying Sheng <sqy1415@gmail.com> | 2025-02-09 07:28:34 +08:00 |
| Baizhou Zhang | 70817a7eae | [Feature] Define backends and add Triton backend for Lora (#3161)<br>Co-authored-by: Ying Sheng <sqy1415@gmail.com> | 2025-02-03 22:09:13 -08:00 |
| Yineng Zhang | 013021b6a1 | refactor EAGLE 2 (#3269)<br>Co-authored-by: Ying Sheng <sqy1415@gmail.com><br>Co-authored-by: merrymercy <lianminzheng@gmail.com><br>Co-authored-by: Ying1123 <sqy1415@gmail.com> | 2025-02-03 20:52:30 +08:00 |
| Yineng Zhang | 4eb4b401cc | update and simplify CustomOp (#3249) | 2025-02-01 18:56:44 +08:00 |