Commit Graph

294 Commits

Author SHA1 Message Date
Baizhou Zhang
a42736bbb8 Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113) 2025-04-15 22:01:22 -07:00
Lianmin Zheng
0769b14bf9 [Minor] Move torch.compile patch to a better place (#5397) 2025-04-15 18:37:07 -07:00
Yineng Zhang
fa909dc3c4 feat: update model_specific_adjustment (#5344)
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-15 14:45:15 -07:00
Zhaoyang Hao
5d13440162 [FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (#5412) 2025-04-15 01:42:27 -07:00
Yineng Zhang
57de7c6b5f feat: use fa3 mla by default on hopper (#5210)
Co-authored-by: yundai424 <yundai424@gmail.com>
Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
2025-04-12 01:09:25 -07:00
Mick
e53a0b3d5b [fix] fix mrope positions not picked up (#5265) 2025-04-11 01:29:45 -07:00
Cheng Wan
038bc5d521 Support --enable-llama4-multimodal (#5254) 2025-04-11 01:24:14 -07:00
Richard Zou
a879811c4b Fix torch.compile cacheing (#5259)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-04-10 18:08:45 -07:00
Yineng Zhang
4cb53ecd0c fix: log warning when disable cuda graph (#5209) 2025-04-09 14:16:13 -07:00
Jinyan Chen
bc3f6db2dd [Fix] DeepEP Compatibility with Low Latency (#5068)
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-08 20:31:31 -07:00
Chunan Zeng
a7c3f74bec [FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103) 2025-04-07 22:58:08 -07:00
Chang Su
f04c80dc42 Add Llama4 support (#5092)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: fzyzcjy <ch271828n@outlook.com>
Co-authored-by: ispobock <ispobaoke@163.com>
2025-04-07 00:29:36 -07:00
Baizhou Zhang
efbae697b3 [Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052) 2025-04-05 01:23:02 -07:00
Stefan He
ca8d02abd5 FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050)
Co-authored-by: Qingquan Song <ustcsqq@gmail.com>
Co-authored-by: Chunan Zeng <zcnrex@gmail.com>
2025-04-04 23:03:59 -07:00
Xiaoyu Zhang
924ca7c92c Add DeepSeek V3/R1 shared experts fusion (#4918) 2025-04-04 01:59:29 -07:00
Lianmin Zheng
74885a848b Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048) 2025-04-03 13:30:56 -07:00
Baizhou Zhang
e8999b13b7 Replace enable_flashinfer_mla argument with attention_backend (#5005) 2025-04-03 02:53:58 -07:00
Jinyan Chen
23c764b18a [Feature] Support DeepEP Low Latency (#4767)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: laixinn <xielx@shanghaitech.edu.cn>
Co-authored-by: ch-wan <cwan39@gatech.edu>
2025-04-01 09:23:25 -07:00
Mick
5cb552b1d4 refactor: multimodal data (#4754) 2025-03-31 09:57:51 -07:00
Baizhou Zhang
4a63bc32b7 [Fix] Add torch compile for torch.clamp back (#4936) 2025-03-30 20:46:07 -07:00
Baizhou Zhang
e62d60fe6d [Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932) 2025-03-30 13:53:44 -07:00
Lianmin Zheng
b26bc86b36 Support page size > 1 + eagle (#4908) 2025-03-30 00:46:23 -07:00
Baizhou Zhang
20c90be23d [Feature] Support FA3 backend for MLA (#4831) 2025-03-28 18:30:14 -07:00
tarinkk
7f19e083c1 Support (1 <= dp < tp) in the dp attention in DeepEP (#4770)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
2025-03-27 17:09:35 -07:00
AinL
17000d2b3a Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (#4638) 2025-03-27 08:41:33 -07:00
fzyzcjy
92bb49a7f9 Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device (#4565) 2025-03-27 00:22:33 -07:00
fzyzcjy
26f07294f1 Warn users when release_memory_occupation is called without memory saver enabled (#4566) 2025-03-26 00:18:14 -07:00
Mick
1e86457c90 model: Minicpmo (#3023) 2025-03-24 20:08:40 -07:00
Stefan He
5d7edc8e55 Support FA3 as Attention backend by using --attention-backend fa3 (#4680)
Co-authored-by: qsong <qsong@linkedin.com>
Co-authored-by: qingquansong <ustcsqq@gmail.com>
2025-03-23 23:28:11 -07:00
Mick
11577cedb7 refactor: bug fixes and refactor for vlm (#4661) 2025-03-22 22:48:49 -07:00
JieXin Liang
9e93ef3f8e [fix] fix illegal mem access and clean up triton attention backend (#4571) 2025-03-20 02:01:52 -07:00
Jinyan Chen
f44db16c8e [Feature] Integrate DeepEP into SGLang (#4232)
Co-authored-by: Cheng Wan <cwan39@gatech.edu>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
2025-03-19 08:16:31 -07:00
JieXin Liang
c0e9a36c5f Optimize Triton decoding kernel for dynamic workload (#4553) 2025-03-18 21:25:38 -07:00
aoshen524
588865f0e0 [Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274)
Co-authored-by: ShenAo1111 <1377693092@qq.com>
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-03-18 20:33:07 -07:00
James Liu
9e0186f352 [Feature] Support EAGLE 3 (#4247) 2025-03-18 07:35:23 -07:00
Mick
d373a48c98 fix: second_per_grid_ts should be used to get mrope position (#3682) 2025-03-17 18:12:38 -07:00
Lianmin Zheng
5493c3343e Fix data parallel + tensor parallel (#4499) 2025-03-17 05:13:16 -07:00
萝卜菜
d6d21640d3 [Feature] Support Deepseek-VL2 (#2798)
Co-authored-by: Edenzzzz <wtan45@wisc.edu>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Co-authored-by: Yi Zhang <1109276519@qq.com>
2025-03-16 23:07:59 -07:00
Mick
9d02bb3e2a Urgent model support: support gemma-3-it (#4424) 2025-03-16 17:37:32 -07:00
lukec
a53fe428f9 Support FlashMLA backend (#4472)
Co-authored-by: yinfan98 <1106310035@qq.com>
2025-03-16 09:07:06 -07:00
Ying Sheng
1b859295f4 [Eagle] Remove the greedy branch and some redundant code (#4363)
Co-authored-by: Sehoon Kim <sehoon@x.ai>
2025-03-16 02:48:55 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
wangyu
1ce4878d31 feat(remote_model): support variable remote backend for model loader (#3964)
Signed-off-by: wangyu <wangyu.steph@bytedance.com>
2025-03-14 00:40:44 -07:00
Lianmin Zheng
8e66fbecee Improve DP attention (#4390)
Co-authored-by: dhou-xai <dhou@x.ai>
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-03-13 08:23:56 -07:00
Lianmin Zheng
45de89719c Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367) 2025-03-12 23:45:52 -07:00
Meng, Hengyu
71046fcd71 [XPU][CPU] Enable the native path of DeepSeek (#4086)
Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>
2025-03-12 22:26:29 -07:00
Lianmin Zheng
c76040e31b Support page size > 1 (#4356) 2025-03-12 22:22:39 -07:00
Lianmin Zheng
e35a93fa8a Move output processing logic from scheduler.py into a separate file (#4354) 2025-03-12 16:21:49 -07:00
Lianmin Zheng
d40ee62b5d Update nightly tests (#4352) 2025-03-12 15:36:13 -07:00
Yineng Zhang
d1da58e275 unify is_cuda and is_hip (#4321) 2025-03-11 18:12:56 -07:00