sglang

Author	SHA1	Message	Date
Kevin Xiang Li	3b3b3baf9f	Double vision prefill throughput by defaulting to optimal vision attention backend (#8484) Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>	2025-08-13 02:08:30 -07:00
fzyzcjy	9394ed6386	Fix gpt-oss ~2x memory consumption issue (#9146)	2025-08-13 00:11:43 -07:00
Stefan He	930fe467bd	Support Triton FP8 Gemm can handle hidden_dim not divisible by 16 (#9093) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>	2025-08-12 21:21:55 -07:00
Elfie Guo	8723b4f146	Use FlashInfer's TRTLLM FP8 Blockscale GEMM (#8588)	2025-08-12 20:08:40 -07:00
huangtingwei	0edda32001	Support page first layout zero copy for mooncake store (#8651) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-08-12 15:59:26 -07:00
jacky.cheng	25caa7a8a9	[AMD] Support Wave attention backend with AMD GPU optimizations (#8660) Signed-off-by: Stanley Winata <stanley.winata@amd.com> Signed-off-by: Harsh Menon <harsh@nod-labs.com> Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com> Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com> Signed-off-by: xintin <gaurav.verma@amd.com> Co-authored-by: Harsh Menon <harsh@nod-labs.com> Co-authored-by: Stanley Winata <stanley.winata@amd.com> Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com> Co-authored-by: Stanley Winata <stanley@nod-labs.com> Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com> Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com> Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com> Co-authored-by: Ivan Butygin <ibutygin@amd.com>	2025-08-12 13:49:11 -07:00
ichernob	83123f481e	[Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models (#8619) Co-authored-by: ronnie_zheng <zl19940307@163.com>	2025-08-12 13:31:18 -07:00
ronnie_zheng	48afa8f14f	[feat] Enable Ascend profiling on SGLang (#8610) Co-authored-by: liyou_b <2953090824@qq.com>	2025-08-12 13:28:31 -07:00
Jiaqi Gu	c9ee738515	Fuse writing KV buffer into rope kernel (part 2: srt) (#9014) Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>	2025-08-12 13:15:30 -07:00
Cheng Wan	5f5b3b2449	[5/n] DP Enhancement: Correct `num_token_non_padded` (#9107)	2025-08-12 12:23:46 -07:00
Chang Su	f2a5de284b	[Bugfix] Fix accuracy-test-1-gpu failure caused by `builtin_tools` (#9114)	2025-08-12 09:56:13 -07:00
fzyzcjy	9aea255522	Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077)	2025-08-12 01:46:40 -07:00
fzyzcjy	5190ba7f42	Fuse two kernels of hidden states padding into quantization kernel (#9005) Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>	2025-08-12 01:20:13 -07:00
Chang Su	9c83d74da3	bugfix: Fix the commentary msg extraction in GptOssDetector (#9097)	2025-08-11 23:53:10 -07:00
DarkSharpness	b4ac2b9c0c	[Fix] Fix dual chunk model default behavior (#9032)	2025-08-11 23:50:23 -07:00
Jianwei Dong	83262dcb29	Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766)	2025-08-11 23:44:40 -07:00
zixuanzhang226	c46c75f8c0	feat: add fused moe config for Qwen3-30B-A3B on B200 (#9087)	2025-08-11 23:25:36 -07:00
Makcum888e	2aaf22c46c	Optimization for AscendPagedTokenToKVPoolAllocator (#8293) Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: VDV1985 <vladdv85@mail.ru>	2025-08-11 23:06:39 -07:00
Lifu Huang	5ded39cab2	Fix race condition in async lora unload (#9084)	2025-08-11 22:59:29 -07:00
Chang Su	a218490136	(gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support (#9043)	2025-08-11 18:59:18 -07:00
Zhiqiang Xie	0eec4cb6cc	HiCache, add bench long context plus minor fixs (#9086) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-11 16:54:52 -07:00
Faradawn Yang	ff1f68252c	[fix] Set Radix tree root node hash to None - Nvidia Dynamo Integration (#9030) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-11 14:20:39 -07:00
Zhiqiang Xie	9f78f391ae	HiCache Storage: generate hash when inserting new nodes (#9053)	2025-08-11 14:18:59 -07:00
Faraz	f508cd3cb7	TRTLLM-MLA FP8 path (#8638) Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>	2025-08-11 14:02:13 -07:00
Xiaoyu Zhang	44e86480e8	fuse allreduce and residual_rmsnorm (#8731)	2025-08-11 13:50:53 -07:00
SijiaYang	90f44b74e6	fix: w4afp8 accuracy problem and rebase (#8752) Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com> Co-authored-by: Jinwu <ayrnb@users.noreply.github.com>	2025-08-11 13:41:19 -07:00
Liangsheng Yin	f9afa7dceb	Fix docs for clip max new tokens (#9082)	2025-08-11 13:15:21 -07:00
Jimmy	0d9e89ec69	[PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866)	2025-08-11 13:08:11 -07:00
633WHU	3bffe11279	Fix chunked prefill size validation for disabled state (#8973) Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>	2025-08-11 11:05:29 -07:00
Baizhou Zhang	75e6a7cde1	Support radix cache for Lora feature (#7216)	2025-08-11 10:14:11 -07:00
Chang Su	a6452b7188	bugfix: Fix output_ids extraction in detokenizer_manager (#9047)	2025-08-11 03:17:32 -07:00
zhyncs	f4ae50e97c	fix: use flashinfer v0.2.11.post1	2025-08-11 02:49:25 -07:00
Yineng Zhang	84cb449eec	Revert "chore: upgrade flashinfer 0.2.11 (#9036)" (#9057)	2025-08-11 00:16:39 -07:00
Yineng Zhang	9d834fdcc1	Revert "feat: update flashinfer ar oneshot params (#8687)" (#9054)	2025-08-10 23:24:42 -07:00
Lianmin Zheng	2449a0afe2	Refactor the docs (#9031)	2025-08-10 19:49:45 -07:00
Yineng Zhang	dd001a5477	chore: upgrade flashinfer 0.2.11 (#9036)	2025-08-10 17:35:37 -07:00
Lianmin Zheng	4ea9d74a3e	Simplify health check (#9034)	2025-08-10 17:35:05 -07:00
Yineng Zhang	dd949ace23	Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… (#9035)	2025-08-10 17:34:54 -07:00
Lianmin Zheng	f2887498f0	Simplify memory pool (#9033)	2025-08-10 17:32:28 -07:00
Stefan He	8ecf6b9d24	Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% (#8079)	2025-08-10 16:08:59 -07:00
YiXR	0418b9d4ea	[Optimization] Update estimated_num_new_pages logic in TokenToKVPoolAllocator (#8794) Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com> Co-authored-by: Xingrui Yi <yixingrui@linux.alibaba.com>	2025-08-10 16:01:51 -07:00
Lianmin Zheng	b58ae7a2a0	Simplify frontend language (#9029)	2025-08-10 10:59:30 -07:00
Lifu Huang	f8a173bb50	Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops (#8940)	2025-08-10 01:04:45 -07:00
JiLi	6b847a9a05	Optimize: Cache CUDA device to reduce redundant calls during tensor l… (#8996)	2025-08-10 00:32:57 -07:00
DarkSharpness	7ba5ad5766	[Fix] Fix flashinfer cpu <-> gpu synchronization (#8340)	2025-08-10 03:11:40 +00:00
DarkSharpness	19bc77f05c	[Fix] Fix hicache backend (#8991)	2025-08-09 17:16:25 -07:00
huangtingwei	86497d99f2	fix page first per layer pf2lf kernel (#8915) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>	2025-08-09 17:16:11 -07:00
cctry	5c31b35db2	[hicache] Optimization for DMA copy (#8245)	2025-08-09 17:16:07 -07:00
Lianmin Zheng	ef48d5547e	Fix CI (#9013)	2025-08-09 16:00:10 -07:00
Xiaoyu Zhang	a886564a18	fix flashinfer allreduce fusion import bug (#9007)	2025-08-09 13:47:05 -07:00

3131 Commits