Commit Graph

221 Commits

Author SHA1 Message Date
maxiao1
75cd34d172 change sgl_kernel WARP_SIZE to 64 2025-11-03 10:17:53 +08:00
maxiao
251235c229 Adapt to v0.5.4 2025-10-25 12:16:25 +08:00
blzheng
13fb8b5489 [CPU] Optimize FP16 decode_attention_cpu (#10652) 2025-10-22 21:39:51 -07:00
Fan Yin
23afdfd1c2 [sgl-kernel] support flashmla libtorch (#11717) 2025-10-21 21:17:50 -07:00
Serge Panev
2b1da821b5 [NVIDIA] Add new SMs support for Spark & Thor (#11287)
Signed-off-by: Serge Panev <spanev@nvidia.com>
2025-10-22 02:02:24 +08:00
hlu1
3b80232d06 [DeepseekV32] Add fast_topk_transform_ragged_fused kernel (#11815)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-10-19 17:13:39 -07:00
Johnny
252dc4e112 [NVIDIA] FA3/FA4 Fix (#11606)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-19 17:10:10 -07:00
Fan Yin
3289da5b41 [sgl-kernel] support hadamard (#11663) 2025-10-15 19:00:44 -07:00
Qi Yuhang
6c01844f45 [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674) 2025-10-15 13:39:31 -07:00
Yineng Zhang
f792e3c561 Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582) 2025-10-13 20:51:45 -07:00
Qi Yuhang
dc48c4c0e3 [sgl-kernel][2/N]Support Expert Specialization Grouped GEMM (#11534) 2025-10-13 16:24:48 -07:00
Johnny
b8c430f1ce [NVIDIA] BUMP FA3 (#11444)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
2025-10-13 09:30:57 -07:00
Qi Yuhang
9a30914e94 [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-12 20:19:21 -07:00
PGFLMG
8fdcd98efe [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019) 2025-10-11 14:04:57 -07:00
fzyzcjy
21337b22b9 Reland [1/2] Optimizations and refactors about quant kernel (#10312)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-10-11 15:59:03 +08:00
Yuan Luo
4f42c8cd3e [sgl-kernel] Support float64 moe_sum_reduce cuda kernel (#11068)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-10-07 14:31:11 +00:00
DarkSharpness
e0b2d3eebe [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (#11194)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-05 10:19:03 -07:00
Zhihao Zhang
24f7cb1ece [speculative decoding] rename lookahead to ngram (#11010)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
2025-09-28 21:06:59 -07:00
Yuan Luo
42245551ef [sgl-kernel] Optimize concat_mla_k kernel (#10543)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
2025-09-28 23:04:22 +08:00
Yuhao Yao
fe531d6f4e [Bug] Fix Issue#10215 (#10572) 2025-09-25 09:51:50 +08:00
Qi Yuhang
0f04a5f428 Optimize cutlass int8 gemm kernel for large M on SM89 Ada GPU (#10714) 2025-09-21 17:04:27 -07:00
Yuan Luo
616a3e20df [sgl-kernel] Support moe_sum_reduce cuda kernel (#10321)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-09-19 14:12:09 +08:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
Qi Yuhang
9b876889b7 Update CUTLASS. Refine KernelSchedule for fp8 (grouped) gemm. (#10491) 2025-09-16 02:47:37 -07:00
fzyzcjy
3b25dc127a [1/2] Speed up trtllm_mla attention backend (>10% e2e) (#10473) 2025-09-15 11:53:21 -07:00
fzyzcjy
258d02c86d Fix correction bias undefined behavior for nvfp4 models (#10426) 2025-09-14 18:41:09 -07:00
Lianmin Zheng
c9ec4cae5b Fix the style of sgl kernel (#10398) 2025-09-12 22:20:21 -07:00
Hubert Lu
fe68c1486f Fix errors of hicache kernels in sgl-kernel for ROCm (#10339) 2025-09-11 14:54:34 -07:00
Yineng Zhang
6d55f60e77 Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) 2025-09-10 18:24:23 -07:00
Rain Jiang
2286e85e77 pass a_scale from fp8 quant result instead of hard code to 1.0f (#10241)
Co-authored-by: Yichen Wang <yichen.wang@bytedance.com>
Co-authored-by: Jinwu Guo <641876696@qq.com>
2025-09-10 12:56:05 -07:00
huangtingwei
5be8c2f7f7 Page first direct IO kernel (#10060)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-10 13:35:34 +08:00
Yi Zhang
8cbe1538ef Add mamba kernel (#10234) 2025-09-09 12:58:43 -07:00
blzheng
d1d4074c4e [CPU] Add gelu_and_mul kernel in sgl-kernel and add ut (#9300) 2025-09-08 23:23:13 -07:00
fzyzcjy
0096798ed6 [1/2] Speed up prefill mla attention (#10156) 2025-09-08 09:00:33 -07:00
Yuhao Yao
ee0b3c5bad [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel, fixed) (#10108) 2025-09-07 21:39:07 -07:00
Cao E
7577f0e40f Add graph runner support with torch compile on CPU (#7843) 2025-09-07 21:33:58 -07:00
Qi Yuhang
85ed8e0a5e Optimize nvfp4 block scaled gemm kernel when M is small. (#10101) 2025-09-06 22:31:00 -07:00
Jianying
dd1e268938 CUTLASS fp8 blockwise gemm support of sm120 (#9969) 2025-09-06 22:28:54 -07:00
hlu1
5f1eb20484 [chore] Remove unused ep_moe cuda kernels (#9956) 2025-09-06 01:35:50 -07:00
hlu1
4c22ebe2e8 Disable kernel cutlass_mla_decode on SM103 (#10058)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-09-06 01:35:18 -07:00
DevashishLal-CB
dbb1235d58 [Fix] illegal sync based on undefined behaviour (#9620)
Signed-off-by: Devashish Lal <devashish@rivosinc.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-09-06 11:54:48 +08:00
Yineng Zhang
0e78c63c0e Revert "[1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953)" (#10097) 2025-09-05 19:57:53 -07:00
fzyzcjy
bd7f882142 Support copying tensor from cpu to gpu without using copy engines (#10007) 2025-09-05 20:07:19 +08:00
fzyzcjy
339f8eef09 [1/2] Optimizations and refactors about quant kernel (#9534) 2025-09-05 18:45:08 +08:00
Yuhao Yao
f78b7fd16d [1/N][Bug] Fix w4afp8 MoE NaN issue (sgl-kernel) (#9953) 2025-09-03 18:28:27 +08:00
Lianmin Zheng
d631290e32 Remove annoying warnings in sgl kernel build (#9905) 2025-09-02 20:18:25 -07:00
chenxj
d4a938417d [feat] Support tp mode for DeepSeek-R1-W4AFP8 (#8118)
Co-authored-by: yuhyao <827623970@qq.com>
2025-09-01 22:17:26 -07:00
hlu1
1e85589dc5 Make fp4_quantize kernels work on sm103 (#9807)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-29 21:15:08 -07:00
Kaixi Hou
5c34b4f1c7 [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf (#9556) 2025-08-29 17:17:03 -07:00
hlu1
7a16db9bd9 Make sm100 fp8 kernels available on sm103 (#9789)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-08-28 23:47:29 -07:00