4638 Commits

Author SHA1 Message Date
Hongbo Xu
a669bc2f74 Replace sglang.srt.layers.quantization.scalar_types with sgl_kernel.scalar_type (#8951) 2025-08-13 19:41:41 -07:00
Nicolas Castet
6b7c24712c Fix broken trtllm_mha attn backend with gpt-oss (#9161) 2025-08-13 16:40:55 -07:00
Sundara Raman Ramachandran
a027a9b4b3 [Generative Score API] Optimization to Remove Decode. (#8840) 2025-08-14 05:12:24 +08:00
Lianmin Zheng
9e426466af Clean up allocators (#9134)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-08-13 13:56:04 -07:00
DiweiSun
2f20f43026 Swap xeon ci to gnr server (#9042) 2025-08-13 12:39:19 -07:00
Zhihao Liu
65736dc524 [Model] Support Qwen3ForSequenceClassification for Qwen3-Embed Model (#7957) 2025-08-13 11:14:54 -07:00
Yineng Zhang
7b56e494be chore: bump v0.5.0rc1 (#9069) 2025-08-13 10:44:14 -07:00
Ke Bao
0ff6d1fce1 Support FA3 backend for gpt-oss (#9028) 2025-08-13 10:41:50 -07:00
Teng Ma
4a16a71c36 [PD] feat: mooncake use batch reg/dereg (#8910)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-08-13 09:54:34 -07:00
Francis
a16923efab [PD] optimize kv cache transfer directly using batch transfer (#9149)
Co-authored-by: Shangming Cai <csmthu@gmail.com>
2025-08-13 09:54:14 -07:00
Simo Lin
6337d9057c [router] optimize Rust compilation and development workflow (#9133) 2025-08-13 05:14:25 -07:00
Yineng Zhang
71fb8c9527 feat: update fa3 (#9126) 2025-08-13 20:07:08 +08:00
Ke Bao
94f44b88d1 Update fa3 interface and add unit test (#9150) 2025-08-13 20:05:02 +08:00
Kevin Xiang Li
3b3b3baf9f Double vision prefill throughput by defaulting to optimal vision attention backend (#8484)
Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>
2025-08-13 02:08:30 -07:00
kk
35e6bc92e3 Update docker file for MI35x base image update to support gpt-oss mxfp4 model (#9111)
Co-authored-by: wunhuang <wunhuang@amd.com>
2025-08-13 00:55:31 -07:00
fzyzcjy
9394ed6386 Fix gpt-oss ~2x memory consumption issue (#9146) 2025-08-13 00:11:43 -07:00
Stefan He
930fe467bd Support Triton FP8 Gemm can handle hidden_dim not divisible by 16 (#9093)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-08-12 21:21:55 -07:00
Trevor Morris
13c48dcf88 [1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (#9088) 2025-08-12 20:12:38 -07:00
Elfie Guo
8723b4f146 Use FlashInfer's TRTLLM FP8 Blockscale GEMM (#8588) 2025-08-12 20:08:40 -07:00
li chaoran
62f99e08b3 fix: wrong docker hub org name (#9137)
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
2025-08-12 19:26:19 -07:00
DarkSharpness
86a0be65d8 [Feature] Support custom set kv buffer kernel (#8884) 2025-08-12 16:56:51 -07:00
huangtingwei
0edda32001 Support page first layout zero copy for mooncake store (#8651)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-08-12 15:59:26 -07:00
Yineng Zhang
924827c3de chore: use cp310 (#9130) 2025-08-12 15:33:22 -07:00
Yineng Zhang
c81daf838d fix: update Dockerfile (#9129) 2025-08-12 15:01:29 -07:00
jacky.cheng
25caa7a8a9 [AMD] Support Wave attention backend with AMD GPU optimizations (#8660)
Signed-off-by: Stanley Winata <stanley.winata@amd.com>
Signed-off-by: Harsh Menon <harsh@nod-labs.com>
Signed-off-by: nithinsubbiah <nithinsubbiah@gmail.com>
Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
Signed-off-by: xintin <gaurav.verma@amd.com>
Co-authored-by: Harsh Menon <harsh@nod-labs.com>
Co-authored-by: Stanley Winata <stanley.winata@amd.com>
Co-authored-by: Stanley Winata <68087699+raikonenfnu@users.noreply.github.com>
Co-authored-by: Stanley Winata <stanley@nod-labs.com>
Co-authored-by: Ivan Butygin <ivan.butygin@gmail.com>
Co-authored-by: nithinsubbiah <nithinsubbiah@gmail.com>
Co-authored-by: Nithin Meganathan <18070964+nithinsubbiah@users.noreply.github.com>
Co-authored-by: Ivan Butygin <ibutygin@amd.com>
2025-08-12 13:49:11 -07:00
Hangzhi
03d114496f Fix typos in supported models documentation (#9119) 2025-08-12 13:35:24 -07:00
ichernob
83123f481e [Quantization] Supported w8a8 int8 quantized Gemma3 and Qwen-VL models (#8619)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-08-12 13:31:18 -07:00
ronnie_zheng
48afa8f14f [feat] Enable Ascend profiling on SGLang (#8610)
Co-authored-by: liyou_b <2953090824@qq.com>
2025-08-12 13:28:31 -07:00
li chaoran
2ecbd8b8bf [feat] add ascend readme and docker release (#8700)
Signed-off-by: mywaaagh_admin <pkwarcraft@gmail.com>
Signed-off-by: lichaoran <pkwarcraft@gmail.com>
Co-authored-by: Even Zhou <even.y.zhou@outlook.com>
Co-authored-by: ronnie_zheng <zl19940307@163.com>
2025-08-12 13:25:42 -07:00
Yineng Zhang
305b27c124 fix: update Dockerfile (#9125) 2025-08-12 13:23:10 -07:00
Simo Lin
1ce30dd13e [router] update router documentation (#9121) 2025-08-12 13:16:34 -07:00
Jiaqi Gu
c9ee738515 Fuse writing KV buffer into rope kernel (part 2: srt) (#9014)
Co-authored-by: fzyzcjy <5236035+fzyzcjy@users.noreply.github.com>
2025-08-12 13:15:30 -07:00
ishandhanani
1f9ec65374 fix(docker): update sgl_kernel version to 0.3.4 in Dockerfile.gb200 (#9118) 2025-08-12 13:12:33 -07:00
Chang Su
ad359d1c71 router: Fix user guide link README.md (#9122) 2025-08-12 12:29:10 -07:00
Cheng Wan
5f5b3b2449 [5/n] DP Enhancement: Correct num_token_non_padded (#9107) 2025-08-12 12:23:46 -07:00
Shangming Cai
4caca4f6b4 Fix typo in REVIEWERS (#9113)
Signed-off-by: Shangming Cai <csmthu@gmail.com>
2025-08-12 11:55:49 -07:00
Chang Su
f2a5de284b [Bugfix] Fix accuracy-test-1-gpu failure caused by builtin_tools (#9114) 2025-08-12 09:56:13 -07:00
Liangsheng Yin
445f9dca6e Runtime check CUDA driver version to avoid unresolved green context symbols (#9021) 2025-08-12 09:26:10 -07:00
Yineng Zhang
3a9afe2a42 chore: bump sgl-kernel v0.3.4 (#9103) 2025-08-12 01:48:47 -07:00
fzyzcjy
9aea255522 Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) 2025-08-12 01:46:40 -07:00
Yichao Cheng
fcc11e5ed5 update support new models doc (#9096)
Signed-off-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com>
2025-08-12 01:21:02 -07:00
fzyzcjy
5190ba7f42 Fuse two kernels of hidden states padding into quantization kernel (#9005)
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-08-12 01:20:13 -07:00
Hsiang-Yu Tsou
5438886c87 docs: fix broken links in README.md (#9075) 2025-08-12 00:03:35 -07:00
Chang Su
9c83d74da3 bugfix: Fix the commentary msg extraction in GptOssDetector (#9097) 2025-08-11 23:53:10 -07:00
DarkSharpness
b4ac2b9c0c [Fix] Fix dual chunk model default behavior (#9032) 2025-08-11 23:50:23 -07:00
Jianwei Dong
83262dcb29 Fix mismatch between padded_scales shape and reshape dimensions in modelopt quantization (#8766) 2025-08-11 23:44:40 -07:00
zixuanzhang226
c46c75f8c0 feat: add fused moe config for Qwen3-30B-A3B on B200 (#9087) 2025-08-11 23:25:36 -07:00
Makcum888e
2aaf22c46c Optimization for AscendPagedTokenToKVPoolAllocator (#8293)
Co-authored-by: ronnie_zheng <zl19940307@163.com>
Co-authored-by: VDV1985 <vladdv85@mail.ru>
2025-08-11 23:06:39 -07:00
Lifu Huang
29a610b4d9 Fix broken CI TestRequestLengthValidation (#9095) 2025-08-11 22:59:56 -07:00
Lifu Huang
5ded39cab2 Fix race condition in async lora unload (#9084) 2025-08-11 22:59:29 -07:00