sglang

Author	SHA1	Message	Date
Baizhou Zhang	a42736bbb8	Support MHA with chunked prefix cache for DeepSeek chunked prefill (#5113)	2025-04-15 22:01:22 -07:00
Lianmin Zheng	0769b14bf9	[Minor] Move torch.compile patch to a better place (#5397)	2025-04-15 18:37:07 -07:00
Yineng Zhang	fa909dc3c4	feat: update model_specific_adjustment (#5344) Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>	2025-04-15 14:45:15 -07:00
Zhaoyang Hao	5d13440162	[FIX] Fix concatenation error in capture_bs when open --disable-cuda-graph-padding and without MTP (#5412)	2025-04-15 01:42:27 -07:00
Yineng Zhang	57de7c6b5f	feat: use fa3 mla by default on hopper (#5210) Co-authored-by: yundai424 <yundai424@gmail.com> Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>	2025-04-12 01:09:25 -07:00
Mick	e53a0b3d5b	[fix] fix mrope positions not picked up (#5265)	2025-04-11 01:29:45 -07:00
Cheng Wan	038bc5d521	Support `--enable-llama4-multimodal` (#5254)	2025-04-11 01:24:14 -07:00
Richard Zou	a879811c4b	Fix torch.compile cacheing (#5259) Co-authored-by: zhyncs <me@zhyncs.com>	2025-04-10 18:08:45 -07:00
Yineng Zhang	4cb53ecd0c	fix: log warning when disable cuda graph (#5209)	2025-04-09 14:16:13 -07:00
Jinyan Chen	bc3f6db2dd	[Fix] DeepEP Compatibility with Low Latency (#5068) Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-04-08 20:31:31 -07:00
Chunan Zeng	a7c3f74bec	[FA3 Feature] Support multi modal Llama-3.2-11B-Vision-Instruct (#5103)	2025-04-07 22:58:08 -07:00
Chang Su	f04c80dc42	Add Llama4 support (#5092) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: fzyzcjy <ch271828n@outlook.com> Co-authored-by: ispobock <ispobaoke@163.com>	2025-04-07 00:29:36 -07:00
Baizhou Zhang	efbae697b3	[Revision] Replace enable_flashinfer_mla argument with attention_backend (#5052)	2025-04-05 01:23:02 -07:00
Stefan He	ca8d02abd5	FA3 Spec Decoding to support top k = 1 and add cuda graph support (#5050) Co-authored-by: Qingquan Song <ustcsqq@gmail.com> Co-authored-by: Chunan Zeng <zcnrex@gmail.com>	2025-04-04 23:03:59 -07:00
Xiaoyu Zhang	924ca7c92c	Add DeepSeek V3/R1 shared experts fusion (#4918)	2025-04-04 01:59:29 -07:00
Lianmin Zheng	74885a848b	Revert "Replace enable_flashinfer_mla argument with attention_backend" (#5048)	2025-04-03 13:30:56 -07:00
Baizhou Zhang	e8999b13b7	Replace enable_flashinfer_mla argument with attention_backend (#5005)	2025-04-03 02:53:58 -07:00
Jinyan Chen	23c764b18a	[Feature] Support DeepEP Low Latency (#4767) Co-authored-by: sleepcoo <sleepcoo@gmail.com> Co-authored-by: laixinn <xielx@shanghaitech.edu.cn> Co-authored-by: ch-wan <cwan39@gatech.edu>	2025-04-01 09:23:25 -07:00
Mick	5cb552b1d4	refactor: multimodal data (#4754)	2025-03-31 09:57:51 -07:00
Baizhou Zhang	4a63bc32b7	[Fix] Add torch compile for torch.clamp back (#4936)	2025-03-30 20:46:07 -07:00
Baizhou Zhang	e62d60fe6d	[Fix] avoid stream sync and torch compile in prefill for fa3 backend (#4932)	2025-03-30 13:53:44 -07:00
Lianmin Zheng	b26bc86b36	Support page size > 1 + eagle (#4908)	2025-03-30 00:46:23 -07:00
Baizhou Zhang	20c90be23d	[Feature] Support FA3 backend for MLA (#4831)	2025-03-28 18:30:14 -07:00
tarinkk	7f19e083c1	Support (1 <= dp < tp) in the dp attention in DeepEP (#4770) Co-authored-by: Cheng Wan <cwan39@gatech.edu>	2025-03-27 17:09:35 -07:00
AinL	17000d2b3a	Remove Unintended Capture Batch Sizes in AMD HIP Graph Runner (#4638)	2025-03-27 08:41:33 -07:00
fzyzcjy	92bb49a7f9	Patch PyTorch's bug that cross-process tensor transfer will lead to wrong device (#4565)	2025-03-27 00:22:33 -07:00
fzyzcjy	26f07294f1	Warn users when release_memory_occupation is called without memory saver enabled (#4566)	2025-03-26 00:18:14 -07:00
Mick	1e86457c90	model: Minicpmo (#3023)	2025-03-24 20:08:40 -07:00
Stefan He	5d7edc8e55	Support FA3 as Attention backend by using `--attention-backend fa3` (#4680) Co-authored-by: qsong <qsong@linkedin.com> Co-authored-by: qingquansong <ustcsqq@gmail.com>	2025-03-23 23:28:11 -07:00
Mick	11577cedb7	refactor: bug fixes and refactor for vlm (#4661)	2025-03-22 22:48:49 -07:00
JieXin Liang	9e93ef3f8e	[fix] fix illegal mem access and clean up triton attention backend (#4571)	2025-03-20 02:01:52 -07:00
Jinyan Chen	f44db16c8e	[Feature] Integrate DeepEP into SGLang (#4232) Co-authored-by: Cheng Wan <cwan39@gatech.edu> Co-authored-by: Xuting Zhou <xutingz@nvidia.com>	2025-03-19 08:16:31 -07:00
JieXin Liang	c0e9a36c5f	Optimize Triton decoding kernel for dynamic workload (#4553)	2025-03-18 21:25:38 -07:00
aoshen524	588865f0e0	[Feature] Support Tensor Parallelism and Weight Slicing for Lora (#4274) Co-authored-by: ShenAo1111 <1377693092@qq.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-03-18 20:33:07 -07:00
James Liu	9e0186f352	[Feature] Support EAGLE 3 (#4247)	2025-03-18 07:35:23 -07:00
Mick	d373a48c98	fix: second_per_grid_ts should be used to get mrope position (#3682)	2025-03-17 18:12:38 -07:00
Lianmin Zheng	5493c3343e	Fix data parallel + tensor parallel (#4499)	2025-03-17 05:13:16 -07:00
萝卜菜	d6d21640d3	[Feature] Support Deepseek-VL2 (#2798) Co-authored-by: Edenzzzz <wtan45@wisc.edu> Co-authored-by: Chayenne <zhaochen20@outlook.com> Co-authored-by: Yi Zhang <1109276519@qq.com>	2025-03-16 23:07:59 -07:00
Mick	9d02bb3e2a	Urgent model support: support gemma-3-it (#4424)	2025-03-16 17:37:32 -07:00
lukec	a53fe428f9	Support FlashMLA backend (#4472) Co-authored-by: yinfan98 <1106310035@qq.com>	2025-03-16 09:07:06 -07:00
Ying Sheng	1b859295f4	[Eagle] Remove the greedy branch and some redundant code (#4363) Co-authored-by: Sehoon Kim <sehoon@x.ai>	2025-03-16 02:48:55 -07:00
JieXin Liang	1a3fa75f2f	[Fix] use `torch.cat` instead of `torch.concat` to prevent entering the `Autograd` backends. (#4466)	2025-03-16 00:02:47 -07:00
wangyu	1ce4878d31	feat(remote_model): support variable remote backend for model loader (#3964) Signed-off-by: wangyu <wangyu.steph@bytedance.com>	2025-03-14 00:40:44 -07:00
Lianmin Zheng	8e66fbecee	Improve DP attention (#4390) Co-authored-by: dhou-xai <dhou@x.ai> Co-authored-by: SangBin Cho <rkooo567@gmail.com>	2025-03-13 08:23:56 -07:00
Lianmin Zheng	45de89719c	Revert "[XPU][CPU] Enable the native path of DeepSeek" (#4367)	2025-03-12 23:45:52 -07:00
Meng, Hengyu	71046fcd71	[XPU][CPU] Enable the native path of DeepSeek (#4086) Co-authored-by: Zhang, Liangang <liangang.zhang@intel.com>	2025-03-12 22:26:29 -07:00
Lianmin Zheng	c76040e31b	Support page size > 1 (#4356)	2025-03-12 22:22:39 -07:00
Lianmin Zheng	e35a93fa8a	Move output processing logic from scheduler.py into a separate file (#4354)	2025-03-12 16:21:49 -07:00
Lianmin Zheng	d40ee62b5d	Update nightly tests (#4352)	2025-03-12 15:36:13 -07:00
Yineng Zhang	d1da58e275	unify is_cuda and is_hip (#4321)	2025-03-11 18:12:56 -07:00

294 Commits