sglang

Author	SHA1	Message	Date
Rain H	71fc7b7fad	[Fix] KV-cache eviction mismatch across PP ranks in DeepSeek V3/R1 (#10214 )	2025-09-09 02:07:13 -07:00
Cao E	7577f0e40f	Add graph runner support with torch compile on CPU (#7843 )	2025-09-07 21:33:58 -07:00
Qiaolin Yu	8cda5a622c	Standalone speculative decoding (#10090 )	2025-09-07 20:55:09 -07:00
Lzhang-hub	37d83c6e6d	Qwen2.5-VL eagle3 infer (#8801 )	2025-09-07 20:44:34 -07:00
Cheng Wan	21af5c0404	[Fix] Compatibility between DP attention and pipeline parallelism (#10100 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-09-06 01:34:10 -07:00
Kaixi Hou	90dfe3de4c	[NVIDIA] disable chunked prefix cache when dp and blackwell is used (#9861 )	2025-09-06 14:05:16 +08:00
Chi-Chih Chang	ad26f298e2	fix double sparsity initialization (#6905 )	2025-09-06 11:45:24 +08:00
timmy-feng	4ed9053ecf	Remove mrope position sync (#9460 ) Co-authored-by: Nathan Wang <nathan.r.wang@gmail.com>	2025-09-03 11:40:53 -07:00
Lianmin Zheng	60e37f8028	Move parsers under a single folder (#9912 )	2025-09-02 18:25:04 -07:00
Guoyuan Lin	5e194b2143	[Model] Support Meituan LongCat-Flash && LongCat-Flash-MTP (#9824 )	2025-08-30 23:29:21 -07:00
Qiaolin Yu	4a4772ae03	Support speculative decoding in hybrid attention backend (#9573 )	2025-08-28 01:11:42 -07:00
Lianmin Zheng	fd71b11b1d	move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679 )	2025-08-27 03:34:29 -07:00
miter	a0b22f2f17	remove redundant rank0_log function. (#9560 ) Co-authored-by: linhuang <linhuang@ruijie.com.cn>	2025-08-24 23:17:55 -07:00
fzyzcjy	2600fc0d47	Overlapped weight offload (#8034 )	2025-08-23 02:06:46 -07:00
Yongfei Xu	9708d353b7	Support MHA with chunked prefix cache for flashinfer/flashmla backend, support page size > 1 for MHA chunked prefix (#8616 ) Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>	2025-08-21 18:19:44 -07:00
fzyzcjy	55d336cb08	Refactor weight offloading logic (#8521 )	2025-08-21 03:48:13 -07:00
VDV1985	2c4b4b786b	[feature] Ascend NPU graph support (#9399 ) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com> Co-authored-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>	2025-08-20 21:13:27 -07:00
Liangsheng Yin	eb19ccadae	[bug] fix errors related to context length in SD (#9388 )	2025-08-21 10:32:34 +08:00
Even Zhou	de2dd73831	Revert "[feature] Rework Ascend NPU graph support" (#9385 )	2025-08-20 00:35:10 -07:00
Even Zhou	3680d6f88b	[feature] Rework Ascend NPU graph support (#9350 ) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com> Co-authored-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>	2025-08-19 20:32:27 -07:00
Even Zhou	f4fafacc5d	Revert "[feature] Ascend NPU graph support (#8027 )" (#9348 )	2025-08-19 10:11:23 -07:00
fzyzcjy	4c0bb411e5	Further fix memory pool leak error (#9298 )	2025-08-18 00:58:06 -07:00
Shangming Cai	384f8ab5ce	[PD] Support PD disaggregation with Prefill PP (#8846 ) Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> Signed-off-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: root <huzhiyuan@xiaohongshu.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: Francis <38564764+ssssnow@users.noreply.github.com> Co-authored-by: zitto <zhjc1124@gmail.com>	2025-08-16 18:31:31 -07:00
VDV1985	94371dbbd6	[feature] Ascend NPU graph support (#8027 ) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: yezhifeng (D) <y00897525@china.huawei.com> Co-authored-by: anon189Ty <Stari_Falcon@outlook.com> Co-authored-by: Maksim <makcum888e@mail.ru> Co-authored-by: ssshinigami <44640852+ssshinigami@users.noreply.github.com>	2025-08-16 17:25:17 -07:00
Trevor Morris	eff4eb3fdd	Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667 )	2025-08-15 22:08:11 -07:00
Cheng Wan	295895120d	[6/N] MoE Refactor: Cleanup MoE-related configs (#8849 )	2025-08-14 21:14:53 -07:00
Cheng Wan	b87aacb5c5	[DP Attention] Refactor: adding some utility functions (#9136 )	2025-08-13 21:08:06 -07:00
Lianmin Zheng	9e426466af	Clean up allocators (#9134 ) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-13 13:56:04 -07:00
jacky.cheng	25caa7a8a9	[AMD] Support Wave attention backend with AMD GPU optimizations (#8660 ) Signed-off-by: Stanley Winata <stanley.winata@amd.com> Signed-off-by: Harsh Menon <harsh@nod-labs.com> Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Signed-off-by: xintin <gaurav.verma@amd.com> Co-authored-by: Harsh Menon <harsh@nod-labs.com> Co-authored-by: Stanley Winata <stanley.winata@amd.com> Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> Co-authored-by: Stanley Winata <stanley@nod-labs.com> Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com> Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com> Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com> Co-authored-by: Ivan Butygin <ibutygin@amd.com>	2025-08-12 13:49:11 -07:00
Cheng Wan	5f5b3b2449	[5/n] DP Enhancement: Correct `num_token_non_padded` (#9107 )	2025-08-12 12:23:46 -07:00
DarkSharpness	b4ac2b9c0c	[Fix] Fix dual chunk model default behavior (#9032 )	2025-08-11 23:50:23 -07:00
Lianmin Zheng	f2887498f0	Simplify memory pool (#9033 )	2025-08-10 17:32:28 -07:00
Stefan He	8ecf6b9d24	Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% (#8079 )	2025-08-10 16:08:59 -07:00
YiXR	0418b9d4ea	[Optimization] Update estimated_num_new_pages logic in TokenToKVPoolAllocator (#8794 ) Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com> Co-authored-by: Xingrui Yi <yixingrui@linux.alibaba.com>	2025-08-10 16:01:51 -07:00
JiLi	6b847a9a05	Optimize: Cache CUDA device to reduce redundant calls during tensor l… (#8996 )	2025-08-10 00:32:57 -07:00
DarkSharpness	7ba5ad5766	[Fix] Fix flashinfer cpu <-> gpu synchronization (#8340 )	2025-08-10 03:11:40 +00:00
DarkSharpness	19bc77f05c	[Fix] Fix hicache backend (#8991 )	2025-08-09 17:16:25 -07:00
Cheng Wan	5018809222	[DP] fix: engine crash when decode batch is padded (#8995 )	2025-08-09 01:29:29 -07:00
DarkSharpness	fc42ff7b63	[Fix] Fix wrong backend chosen in hybrid backend (#8989 )	2025-08-08 21:21:17 -07:00
PGFLMG	b7cd743038	[Feat] QWen-1M context support[2/2]: Update block sparse attention backend (#5949 )	2025-08-06 23:49:36 -07:00
eigen	6ad6c8c9e6	feat: openai oss attention sink support with trtllm-gen backend #8825 (#8834 ) Co-authored-by: averyhuang <averyh@nvidia.com>	2025-08-06 19:18:27 -07:00
Trevor Morris	c0e84297c2	Use reduce scatter for DP (#8539 )	2025-08-06 16:21:26 -07:00
HouseWest	ca47e24f5d	[Feature] improve TBO: two chunk overlap (#8144 )	2025-08-05 21:11:01 -07:00
eigen	40e3b2beeb	feat: add trtllm-gen mha from direct call (#8782 ) Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>	2025-08-05 03:28:39 -07:00
Baizhou Zhang	f2d68ded6d	Rename lora_path to lora_id in batches (#8437 )	2025-08-03 21:08:28 -07:00
Cheng Wan	0e0eef00ce	[DP] fix the compatibility issue between DP attention and `--attention-backend triton` (#8723 )	2025-08-03 03:06:57 -07:00
Cheng Wan	cb099d2095	[CUDA Graph] save cuda graph memory by using next_token_logits_buffer (#8579 )	2025-08-03 03:06:47 -07:00
Nicolas Castet	82e6c3a65a	Add support for NCCL symmetric memory for TP allreduces (#8238 )	2025-08-01 23:30:55 +00:00
Cheng Wan	6c88f6c8d9	[5/N] MoE Refactor: Update MoE parallelism arguments (#8658 )	2025-08-01 01:20:03 -07:00
Faraz	4b04998d38	TRTLLM Gen MLA Decode Kernel Integration (same as #7938 ) (#8632 ) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-07-31 16:03:40 -07:00

480 Commits