Commit Graph

63 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Yineng Zhang | 6d55f60e77 | Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) | 2025-09-10 18:24:23 -07:00 |
| hlu1 | 5f1eb20484 | [chore] Remove unused ep_moe cuda kernels (#9956) | 2025-09-06 01:35:50 -07:00 |
| fzyzcjy | 339f8eef09 | [1/2] Optimizations and refactors about quant kernel (#9534) | 2025-09-05 18:45:08 +08:00 |
| fzyzcjy | e85cb1ce9d | Fix quant kernel test errors and benchmark wrong output speeds (#7604) | 2025-08-21 03:48:41 -07:00 |
| Yuan Luo | 53dcc750b6 | [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits (#9060) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) [see sampling sketch below] | 2025-08-14 10:56:36 -07:00 |
| henryg | 841810f227 | [Perf] Tunings for SM100 FP8 CUTLASS kernel (#8818) | 2025-08-13 21:59:22 -07:00 |
| fzyzcjy | 9aea255522 | Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) | 2025-08-12 01:46:40 -07:00 |
| Yuan Luo | 1bd5316873 | fix benchmark fp8 blockwise group gemm (#8815) | 2025-08-06 21:02:21 +08:00 |
| Stefan He | db7343c992 | fix per token cuda kernel hidden dim cannot divide by 16 (#8543) | 2025-08-01 09:27:18 -07:00 |
| Peter Pan | 6bdd27861b | [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013) | 2025-08-01 22:01:24 +08:00 |
| Cheng Wan | a5f5ab4030 | update sgl-kernel for EP: kernel part (#8514) (Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>; Ke Bao <ispobaoke@gmail.com>) | 2025-07-30 22:19:55 -07:00 |
| Elfie Guo | 5c9c275bc8 | Use FlashInfer FP4 gemm. (#8241) | 2025-07-27 01:05:22 -07:00 |
| fzyzcjy | e34cf6ad75 | Fix bench script making input data on L2 cache (#7739) | 2025-07-27 00:30:24 -07:00 |
| Qi Yuhang | 426b74936a | Add nvfp4 scaled mm benchmark. (#8401) | 2025-07-26 23:18:04 -07:00 |
| Hubert Lu | af4b9bae95 | [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135) (Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>; HAI <hixiao@gmail.com>) [see silu_and_mul sketch below] | 2025-07-24 23:44:28 -07:00 |
| Peter Pan | 0f8b538614 | [fix] benchmark : routed_scaling_factor is None (#8059) (Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>) | 2025-07-22 08:55:35 -07:00 |
| Baizhou Zhang | 282eb59ff3 | Add bf16 output option for dsv3_router_gemm kernel (#7999) | 2025-07-20 09:49:37 +08:00 |
| Yi Zhang | 2998c4bdf4 | [optimize] fuse renormalize into moe_topk_softmax (#7744) (Co-authored-by: ispobock <ispobaoke@gmail.com>) | 2025-07-03 12:42:44 -07:00 |
| ayrnb | 2c4feaf308 | Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (#7278) (Co-authored-by: HydraQYH <QYH820@Outlook.com>; TianQiLin666666 <1834987979@qq.com>) | 2025-07-02 23:27:03 -07:00 |
| Baizhou Zhang | 7248272ccc | Add dsv3 router gemm kernel (#7627) | 2025-06-29 23:31:55 -07:00 |
| Ke Bao | 04b35190e2 | Add dsv3 fused a gemm to sgl-kernel (#7630) | 2025-06-29 02:52:24 -07:00 |
| Ke Bao | 57ab776910 | Fuse sorted_token_ids padding to moe_align_block_size kernel (#7437) | 2025-06-24 17:44:27 -07:00 |
| xutizhou | 506c4928f5 | feat: integrate deepgemm into EPMoE (#6821) (Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>; TianQiLin666666 <1834987979@qq.com>; Cheng Wan <54331508+ch-wan@users.noreply.github.com>) | 2025-06-23 01:38:58 -07:00 |
| JieXin Liang | ab1a4fa5cb | [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184) | 2025-06-14 12:45:41 -07:00 |
| fzyzcjy | aa46ed34d2 | Remove 200us slow concat kernel (part 1: kernel) (#7145) | 2025-06-13 01:58:29 -07:00 |
| Yuan Luo | 84727a5139 | [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul (#6919) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-11 20:43:08 -07:00 |
| JieXin Liang | 18efb5e8e0 | [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 (#6929) | 2025-06-08 19:37:34 -07:00 |
| Yuan Luo | 43baba649e | [EP] Add cuda kernel for moe_ep_post_reorder (#6837) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-05 00:33:47 -07:00 |
| Xiaoyu Zhang | bd75690f4e | fix ep_moe_reorder kernel bugs (#6858) (Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>) | 2025-06-04 19:13:59 +08:00 |
| Yuan Luo | 55444ed667 | [EP] Add cuda kernel for moe_ep_pre_reorder (#6699) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-01 20:49:01 -07:00 |
| ChangyiYang | 485a023bd8 | refactor apply_w8a8_block_fp8_linear in fp (#6545) | 2025-05-29 00:15:11 -07:00 |
| HandH1998 | 4d643f6c7a | [1/2] Support Qserve (#6457) (Co-authored-by: yych0745 <1398089567@qq.com>; sleepcoo <sleepcoo@gmail.com>) | 2025-05-21 19:48:59 -07:00 |
| Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| applesaucethebun | d738ab52f8 | fix some typos (#6209) (Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>) | 2025-05-13 01:42:38 +08:00 |
| applesaucethebun | 2ce8793519 | Add typo checker in pre-commit (#6179) (Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>) | 2025-05-11 12:55:00 +08:00 |
| Zhaoyi Li | 3c9740d200 | update variable naming and comments for rocm (#5299) | 2025-04-11 23:15:05 -07:00 |
| Yineng Zhang | 136b8e6afb | fix: remove cublas_grouped_gemm (#5307) | 2025-04-11 16:22:37 -07:00 |
| Xiaoyu Zhang | f730362ee2 | reduce moe_align_block_size_kernel small batch mode overhead (#5086) | 2025-04-09 17:59:35 -07:00 |
| Yi Zhang | ebf495f013 | sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207) | 2025-04-09 11:47:04 -07:00 |
| Xiaoyu Zhang | 2c8fd99363 | [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) | 2025-04-02 18:29:59 -07:00 |
| Qingquan Song | 45dcfc2e76 | Add deepseek style fused moe group gate selection kernel (#4530) | 2025-03-29 11:51:45 -07:00 |
| Chunan Zeng | 65c24c28f9 | [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) [see quantization sketch below] | 2025-03-23 23:44:17 -07:00 |
| JieXin Liang | 1a3fa75f2f | [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) | 2025-03-16 00:02:47 -07:00 |
| Qingquan Song | 61e4433caf | Add moe topk softmax templated from vllm (#4302) | 2025-03-14 12:03:33 -07:00 |
| Yineng Zhang | 2937387a50 | fix accuracy issue (#4376) | 2025-03-13 02:06:22 -07:00 |
| Qingquan Song | 4068e01292 | Fix per token fp8 quant precision (#4362) | 2025-03-12 21:19:05 -07:00 |
| Shi Shuai | 817d43705c | feat: support ep size < 32 for sgl kernel (#4348) | 2025-03-12 20:50:46 -07:00 |
| Rex | 07f944631e | Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) | 2025-03-12 00:10:02 -07:00 |
| Xiaoyu Zhang | 7130a7cea9 | refine sgl_moe_align_block_size_benchmark (#4327) | 2025-03-11 22:48:38 -07:00 |
| Stefan He | 95085d65e9 | [Refactor] Reducing code duplication across FP8 CUDA quantization kernels (#4163) | 2025-03-06 22:58:52 -08:00 |
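
For the sampling commit above (#9060): top-k/top-p sampling from logits filters to the k largest logits, then keeps the smallest high-probability prefix whose cumulative mass reaches p before drawing a token. The following is a minimal pure-PyTorch sketch of those semantics only; it is not the FlashInfer kernel or its API, and the function name is illustrative.

```python
import torch

def top_k_top_p_sample_ref(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # Top-k: mask out everything below the k-th largest logit.
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep the shortest descending-probability prefix whose
    # cumulative mass reaches p, then renormalize and sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

token_ids = top_k_top_p_sample_ref(torch.randn(2, 32000), k=50, p=0.9)  # (2, 1)
```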
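The activation kernels added for AMD GPUs in #7135 fuse a gated activation with an elementwise multiply. A sketch of the usual silu_and_mul semantics in plain PyTorch, assuming the conventional [gate, up] layout on the last dimension; the HIP kernel itself is not shown and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # The last dim holds the concatenated [gate, up] projections;
    # the fused kernel computes silu(gate) * up in a single pass.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

# Example: 4 tokens, gate+up halves of width 11008 each.
out = silu_and_mul_ref(torch.randn(4, 2 * 11008))  # shape: (4, 11008)
```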
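Several commits above touch per-token-group quantization (#4396, #4817, #4163). The reference semantics are simple: each group of elements within a token's hidden dimension is scaled independently by its own absolute maximum. A hedged sketch assuming a float8_e4m3fn target and a PyTorch build with float8 dtypes; the group size, clamping epsilon, and function name are illustrative, and the column-major variant from #4817 is not shown.

```python
import torch

def per_token_group_quant_fp8_ref(x: torch.Tensor, group_size: int = 128):
    # x: (num_tokens, hidden); hidden must be divisible by group_size.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    groups = x.float().view(x.shape[0], -1, group_size)
    # One scale per (token, group): max(|x_group|) / fp8_max.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10) / fp8_max
    q = (groups / scales).clamp(-fp8_max, fp8_max).view_as(x).to(torch.float8_e4m3fn)
    return q, scales.squeeze(-1)  # q: fp8 values, scales: (num_tokens, num_groups)

q, s = per_token_group_quant_fp8_ref(torch.randn(4, 1024))
```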