Commit Graph

3131 Commits

Author SHA1 Message Date
Kevin Xiang Li
3b3b3baf9f Double vision prefill throughput by defaulting to optimal vision attention backend (#8484)
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
2025-08-13 02:08:30 -07:00
fzyzcjy
9394ed6386 Fix gpt-oss ~2x memory consumption issue (#9146) 2025-08-13 00:11:43 -07:00
Stefan He
930fe467bd Support Triton FP8 Gemm can handle hidden_dim not divisible by 16 (#9093)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-08-12 21:21:55 -07:00
Elfie Guo
8723b4f146 Use FlashInfer's TRTLLM FP8 Blockscale GEMM (#8588) 2025-08-12 20:08:40 -07:00
huangtingwei
0edda32001 Support page first layout zero copy for mooncake store (#8651)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-08-12 15:59:26 -07:00
jacky.cheng
25caa7a8a9 [AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Harsh Menon <harsh@nod-labs.com>
Co-authored-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
Co-authored-by: Stanley Winata <stanley@nod-labs.com>
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
Co-authored-by: Ivan Butygin <ibutygin@amd.com>
2025-08-12 13:49:11 -07:00
ichernob
83123f481e [Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models (#8619)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-08-12 13:31:18 -07:00
ronnie_zheng
48afa8f14f [feat] Enable Ascend profiling on SGLang (#8610)
Co-authored-by: liyou_b <2953090824@qq.com>
2025-08-12 13:28:31 -07:00
Jiaqi Gu
c9ee738515 Fuse writing KV buffer into rope kernel (part 2: srt) (#9014)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-08-12 13:15:30 -07:00
Cheng Wan
5f5b3b2449 [5/n] DP Enhancement: Correct num_token_non_padded (#9107) 2025-08-12 12:23:46 -07:00
Chang Su
f2a5de284b [Bugfix] Fix accuracy-test-1-gpu failure caused by builtin_tools (#9114) 2025-08-12 09:56:13 -07:00
fzyzcjy
9aea255522 Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) 2025-08-12 01:46:40 -07:00
fzyzcjy
5190ba7f42 Fuse two kernels of hidden states padding into quantization kernel (#9005)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-08-12 01:20:13 -07:00
Chang Su
9c83d74da3 bugfix: Fix the commentary msg extraction in GptOssDetector (#9097) 2025-08-11 23:53:10 -07:00
DarkSharpness
b4ac2b9c0c [Fix] Fix dual chunk model default behavior (#9032) 2025-08-11 23:50:23 -07:00
Jianwei Dong
83262dcb29 Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766) 2025-08-11 23:44:40 -07:00
zixuanzhang226
c46c75f8c0 feat: add fused moe config for Qwen3-30B-A3B on B200 (#9087) 2025-08-11 23:25:36 -07:00
Makcum888e
2aaf22c46c Optimization for AscendPagedTokenToKVPoolAllocator (#8293)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
2025-08-11 23:06:39 -07:00
Lifu Huang
5ded39cab2 Fix race condition in async lora unload (#9084) 2025-08-11 22:59:29 -07:00
Chang Su
a218490136 (gpt-oss, oai, chat): Remove Harmony Integration and Implement Native GPT-OSS Tool Call Support (#9043) 2025-08-11 18:59:18 -07:00
Zhiqiang Xie
0eec4cb6cc HiCache, add bench long context plus minor fixs (#9086)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-11 16:54:52 -07:00
Faradawn Yang
ff1f68252c [fix] Set Radix tree root node hash to None - Nvidia Dynamo Integration (#9030)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-11 14:20:39 -07:00
Zhiqiang Xie
9f78f391ae HiCache Storage: generate hash when inserting new nodes (#9053) 2025-08-11 14:18:59 -07:00
Faraz
f508cd3cb7 TRTLLM-MLA FP8 path (#8638)
Signed-off-by: Faraz Khoubsirat <58580514+farazkh80@users.noreply.github.com>
2025-08-11 14:02:13 -07:00
Xiaoyu Zhang
44e86480e8 fuse allreduce and residual_rmsnorm (#8731) 2025-08-11 13:50:53 -07:00
SijiaYang
90f44b74e6 fix: w4afp8 accuracy problem and rebase (#8752)
Signed-off-by: yangsijia.614 <yangsijia.614@bytedance.com>
Co-authored-by: Jinwu <ayrnb@users.noreply.github.com>
2025-08-11 13:41:19 -07:00
Liangsheng Yin
f9afa7dceb Fix docs for clip max new tokens (#9082) 2025-08-11 13:15:21 -07:00
Jimmy
0d9e89ec69 [PD]decode: add CLIP_MAX_NEW_TOKEN for pop_preallocated (#8866) 2025-08-11 13:08:11 -07:00
633WHU
3bffe11279 Fix chunked prefill size validation for disabled state (#8973)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-11 11:05:29 -07:00
Baizhou Zhang
75e6a7cde1 Support radix cache for Lora feature (#7216) 2025-08-11 10:14:11 -07:00
Chang Su
a6452b7188 bugfix: Fix output_ids extraction in detokenizer_manager (#9047) 2025-08-11 03:17:32 -07:00
zhyncs
f4ae50e97c fix: use flashinfer v0.2.11.post1 2025-08-11 02:49:25 -07:00
Yineng Zhang
84cb449eec Revert "chore: upgrade flashinfer 0.2.11 (#9036)" (#9057) 2025-08-11 00:16:39 -07:00
Yineng Zhang
9d834fdcc1 Revert "feat: update flashinfer ar oneshot params (#8687)" (#9054) 2025-08-10 23:24:42 -07:00
Lianmin Zheng
2449a0afe2 Refactor the docs (#9031) 2025-08-10 19:49:45 -07:00
Yineng Zhang
dd001a5477 chore: upgrade flashinfer 0.2.11 (#9036) 2025-08-10 17:35:37 -07:00
Lianmin Zheng
4ea9d74a3e Simplify health check (#9034) 2025-08-10 17:35:05 -07:00
Yineng Zhang
dd949ace23 Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… (#9035) 2025-08-10 17:34:54 -07:00
Lianmin Zheng
f2887498f0 Simplify memory pool (#9033) 2025-08-10 17:32:28 -07:00
Stefan He
8ecf6b9d24 Support Flatten Tensor Update Weights to speed up MOE Update Weights by 20% (#8079) 2025-08-10 16:08:59 -07:00
YiXR
0418b9d4ea [Optimization] Update estimated_num_new_pages logic in TokenToKVPoolAllocator (#8794)
Signed-off-by: Xingrui Yi <yixingrui@linux.alibaba.com>
Co-authored-by: Xingrui Yi <yixingrui@linux.alibaba.com>
2025-08-10 16:01:51 -07:00
Lianmin Zheng
b58ae7a2a0 Simplify frontend language (#9029) 2025-08-10 10:59:30 -07:00
Lifu Huang
f8a173bb50 Improve LoRA Perf by Deprecating FlashInfer and Eliminating Redundant Tensor Ops (#8940) 2025-08-10 01:04:45 -07:00
JiLi
6b847a9a05 Optimize: Cache CUDA device to reduce redundant calls during tensor l… (#8996) 2025-08-10 00:32:57 -07:00
DarkSharpness
7ba5ad5766 [Fix] Fix flashinfer cpu <-> gpu synchronization (#8340) 2025-08-10 03:11:40 +00:00
DarkSharpness
19bc77f05c [Fix] Fix hicache backend (#8991) 2025-08-09 17:16:25 -07:00
huangtingwei
86497d99f2 fix page first per layer pf2lf kernel (#8915)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-08-09 17:16:11 -07:00
cctry
5c31b35db2 [hicache] Optimization for DMA copy (#8245) 2025-08-09 17:16:07 -07:00
Lianmin Zheng
ef48d5547e Fix CI (#9013) 2025-08-09 16:00:10 -07:00
Xiaoyu Zhang
a886564a18 fix flashinfer allreduce fusion import bug (#9007) 2025-08-09 13:47:05 -07:00