Commit Graph

97 Commits

Author SHA1 Message Date
maxiao1
75cd34d172 change sgl_kernel WARP_SIZE to 64 2025-11-03 10:17:53 +08:00
maxiao
251235c229 适配v0.5.4 2025-10-25 12:16:25 +08:00
Fan Yin
23afdfd1c2 [sgl-kernel] support flashmla libtorch (#11717) 2025-10-21 21:17:50 -07:00
hlu1
3b80232d06 [DeepseekV32] Add fast_topk_transform_ragged_fused kernel (#11815)
Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>
2025-10-19 17:13:39 -07:00
Johnny
252dc4e112 [NVIDIA] FA3/FA4 Fix (#11606)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-19 17:10:10 -07:00
Fan Yin
3289da5b41 [sgl-kernel] support hadamard (#11663) 2025-10-15 19:00:44 -07:00
Qi Yuhang
6c01844f45 [sgl-kernel][3/N]Support Expert Specialization Grouped GEMM (#11674) 2025-10-15 13:39:31 -07:00
Yineng Zhang
f792e3c561 Revert "[NVIDIA] BUMP FA3 (#11444)" (#11582) 2025-10-13 20:51:45 -07:00
Johnny
b8c430f1ce [NVIDIA] BUMP FA3 (#11444)
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com>
2025-10-13 09:30:57 -07:00
Qi Yuhang
9a30914e94 [sgl-kernel][1/N]Support Expert Specialization Grouped GEMM (#11432)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: PGFLMG <1106310035@qq.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-10-12 20:19:21 -07:00
PGFLMG
8fdcd98efe [7/n] decouple quantization impl from vllm dependency - gguf kernel (#11019) 2025-10-11 14:04:57 -07:00
fzyzcjy
21337b22b9 Reland [1/2] Optimizations and refactors about quant kernel (#10312)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-10-11 15:59:03 +08:00
DarkSharpness
e0b2d3eebe [Feature] Add a fast-topk to sgl-kernel for DeepSeek v3.2 (#11194)
Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com>
2025-10-05 10:19:03 -07:00
Yuan Luo
616a3e20df [sgl-kernel] Support moe_sum_reduce cuda kernel (#10321)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-09-19 14:12:09 +08:00
Zhihao Zhang
e7bc600304 [Feature] Speculative decoding support lookahead (#9873)
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
2025-09-18 16:42:41 -07:00
fzyzcjy
3b25dc127a [1/2] Speed up trtllm_mla attention backend (>10% e2e) (#10473) 2025-09-15 11:53:21 -07:00
Lianmin Zheng
c9ec4cae5b Fix the style of sgl kernel (#10398) 2025-09-12 22:20:21 -07:00
Yineng Zhang
6d55f60e77 Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) 2025-09-10 18:24:23 -07:00
huangtingwei
5be8c2f7f7 Page first direct IO kernel (#10060)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-09-10 13:35:34 +08:00
Yi Zhang
8cbe1538ef Add mamba kernel (#10234) 2025-09-09 12:58:43 -07:00
fzyzcjy
0096798ed6 [1/2] Speed up prefill mla attention (#10156) 2025-09-08 09:00:33 -07:00
hlu1
5f1eb20484 [chore] Remove unused ep_moe cuda kernels (#9956) 2025-09-06 01:35:50 -07:00
fzyzcjy
bd7f882142 Support copying tensor from cpu to gpu without using copy engines (#10007) 2025-09-05 20:07:19 +08:00
fzyzcjy
339f8eef09 [1/2] Optimizations and refactors about quant kernel (#9534) 2025-09-05 18:45:08 +08:00
Kaixi Hou
5c34b4f1c7 [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf (#9556) 2025-08-29 17:17:03 -07:00
Hubert Lu
711390a971 [AMD] Support Hierarchical Caching on AMD GPUs (#8236) 2025-08-28 15:27:07 -07:00
Kaixi Hou
e5638573c1 [NVIDA] [1/N] Nvfp4 Masked Gemm: Add quant op for the flashinfer grouped gemm (#9200) 2025-08-22 12:19:45 -07:00
Hubert Lu
704ced1b2e [AMD] Remove the deprecated C10_WARP_SIZE (#9356) 2025-08-21 18:16:35 -07:00
fzyzcjy
42c8704560 Add PDL support for quant kernel and rope kernel (#9106) 2025-08-20 01:56:29 -07:00
Hubert Lu
c6c379ab31 [AMD] Reorganize hip-related header files in sgl-kernel (#9320) 2025-08-18 16:53:44 -07:00
Lianmin Zheng
c480a3f6ea Minor style fixes for sgl-kernel (#9289) 2025-08-18 09:38:35 -07:00
kousakawang
0fc54b971e [fix]: fix cutlass moe ut and and Opt H20 cutlass groupGemm performance (#9272)
Co-authored-by: wanghanpei <wanghanpei@bytedance.com>
2025-08-17 13:09:49 -07:00
Yuan Luo
53dcc750b6 [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits (#9060)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-08-14 10:56:36 -07:00
Peng Zhang
5aa1ebd242 [2/n]decouple quantization implementation from vLLM dependency (#8112)
Co-authored-by: walker-ai <yiyun.wyt@antgroup.com>
Co-authored-by: leoneo <1320612015@qq.com>
2025-08-14 03:19:03 -07:00
Ke Bao
94f44b88d1 Update fa3 interface and add unit test (#9150) 2025-08-13 20:05:02 +08:00
Trevor Morris
13c48dcf88 [1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (#9088) 2025-08-12 20:12:38 -07:00
DarkSharpness
86a0be65d8 [Feature] Support custom set kv buffer kernel (#8884) 2025-08-12 16:56:51 -07:00
fzyzcjy
9aea255522 Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) 2025-08-12 01:46:40 -07:00
Yineng Zhang
dd949ace23 Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… (#9035) 2025-08-10 17:34:54 -07:00
huangtingwei
86497d99f2 fix page first per layer pf2lf kernel (#8915)
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
2025-08-09 17:16:11 -07:00
Trevor Morris
591c232f7c [1/2][resubmit] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (select_experts) (#8770) 2025-08-08 17:55:06 -07:00
Liangsheng Yin
f9f0138f80 Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (#8706) 2025-08-02 20:14:30 +08:00
Trevor Morris
f642524fd9 [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (#8364) 2025-08-01 18:14:24 -07:00
Cheng Wan
a5f5ab4030 update sgl-kernel for EP: kernel part (#8514)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Co-authored-by: Ke Bao <ispobaoke@gmail.com>
2025-07-30 22:19:55 -07:00
Hubert Lu
af4b9bae95 [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135)
Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-07-24 23:44:28 -07:00
li haoyang
28d4d47280 [Feature] Integrate quick allreduce and select the best allreduce implementation (#6619)
Signed-off-by: Haoyang Li <Haoyang.Li@amd.com>
Co-authored-by: ilmarkov <imarkov@redhat.com>
2025-07-24 20:48:42 -07:00
Zhiqiang Xie
b43263307f Hicache IO kernel refactoring (#8264) 2025-07-23 16:49:03 +08:00
ykcombat
1ebec1a8b0 [Feature] CUDA Green Context Support (#7649) 2025-07-15 02:49:16 +08:00
Ke Bao
a3398d8478 Optimize moe align block size kernel (#7794) 2025-07-07 09:20:30 +08:00
Lianmin Zheng
5589b75024 Add treemask mode to build_eagle_tree & release sgl-kernel 0.2.3 (#7756)
Co-authored-by: Pranjal Shankhdhar <pranjal.ssh@gmail.com>
2025-07-05 12:17:05 -07:00