49 Commits

Author SHA1 Message Date
Hubert Lu
af4b9bae95 [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135)
Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-07-24 23:44:28 -07:00
Peter Pan
0f8b538614 [fix] benchmark : routed_scaling_factor is None (#8059)
Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
2025-07-22 08:55:35 -07:00
Baizhou Zhang
282eb59ff3 Add bf16 output option for dsv3_router_gemm kernel (#7999) 2025-07-20 09:49:37 +08:00
Yi Zhang
2998c4bdf4 [optimize] fuse renormalize into moe_topk_softmax (#7744)
Co-authored-by: ispobock <ispobaoke@gmail.com>
2025-07-03 12:42:44 -07:00
ayrnb
2c4feaf308 Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (#7278)
Co-authored-by: HydraQYH <QYH820@Outlook.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
2025-07-02 23:27:03 -07:00
Baizhou Zhang
7248272ccc Add dsv3 router gemm kernel (#7627) 2025-06-29 23:31:55 -07:00
Ke Bao
04b35190e2 Add dsv3 fused a gemm to sgl-kernel (#7630) 2025-06-29 02:52:24 -07:00
Ke Bao
57ab776910 Fuse sorted_token_ids padding to moe_align_block_size kernel (#7437) 2025-06-24 17:44:27 -07:00
xutizhou
506c4928f5 feat: integrate deepgemm into EPMoE (#6821)
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: TianQiLin666666 <1834987979@qq.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
2025-06-23 01:38:58 -07:00
JieXin Liang
ab1a4fa5cb [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184) 2025-06-14 12:45:41 -07:00
fzyzcjy
aa46ed34d2 Remove 200us slow concat kernel (part 1: kernel) (#7145) 2025-06-13 01:58:29 -07:00
Yuan Luo
84727a5139 [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul (#6919)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-06-11 20:43:08 -07:00
JieXin Liang
18efb5e8e0 [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 (#6929) 2025-06-08 19:37:34 -07:00
Yuan Luo
43baba649e [EP] Add cuda kernel for moe_ep_post_reorder (#6837)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-06-05 00:33:47 -07:00
Xiaoyu Zhang
bd75690f4e fix ep_moe_reorder kernel bugs (#6858)
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
2025-06-04 19:13:59 +08:00
Yuan Luo
55444ed667 [EP] Add cuda kernel for moe_ep_pre_reorder (#6699)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-06-01 20:49:01 -07:00
ChangyiYang
485a023bd8 refactor apply_w8a8_block_fp8_linear in fp (#6545) 2025-05-29 00:15:11 -07:00
HandH1998
4d643f6c7a [1/2] Support Qserve (#6457)
Co-authored-by: yych0745 <1398089567@qq.com>
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
2025-05-21 19:48:59 -07:00
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-13 01:42:38 +08:00
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179)
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
2025-05-11 12:55:00 +08:00
Zhaoyi Li
3c9740d200 update variable naming and comments for rocm (#5299) 2025-04-11 23:15:05 -07:00
Yineng Zhang
136b8e6afb fix: remove cublas_grouped_gemm (#5307) 2025-04-11 16:22:37 -07:00
Xiaoyu Zhang
f730362ee2 reduce moe_align_block_size_kernel small batch mode overhead (#5086) 2025-04-09 17:59:35 -07:00
Yi Zhang
ebf495f013 sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207) 2025-04-09 11:47:04 -07:00
Xiaoyu Zhang
2c8fd99363 [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) 2025-04-02 18:29:59 -07:00
Qingquan Song
45dcfc2e76 Add deepseek style fused moe group gate selection kernel (#4530) 2025-03-29 11:51:45 -07:00
Chunan Zeng
65c24c28f9 [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) 2025-03-23 23:44:17 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Qingquan Song
61e4433caf Add moe topk softmax templated from vllm (#4302) 2025-03-14 12:03:33 -07:00
Yineng Zhang
2937387a50 fix accuracy issue (#4376) 2025-03-13 02:06:22 -07:00
Qingquan Song
4068e01292 Fix per token fp8 quant precision (#4362) 2025-03-12 21:19:05 -07:00
Shi Shuai
817d43705c feat: support ep size < 32 for sgl kernel (#4348) 2025-03-12 20:50:46 -07:00
Rex
07f944631e Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) 2025-03-12 00:10:02 -07:00
Xiaoyu Zhang
7130a7cea9 refine sgl_moe_align_block_size_benchmark (#4327) 2025-03-11 22:48:38 -07:00
Stefan He
95085d65e9 [Refactor] Reducing code duplication across FP8 CUDA quantization kernels (#4163) 2025-03-06 22:58:52 -08:00
Stefan He
63ee26d162 Add sgl_per_token_quant_fp8 (#4089) 2025-03-06 20:53:05 -08:00
Xiaoyu Zhang
ad55f17182 [quant kernel] sgl-kernel support per_tensor_quant fp8 (#3786) 2025-03-06 18:05:43 -08:00
Xiaoyu Zhang
55a7ec388f use warp shuffle style reduce and flashinfer vectorize (#3628) 2025-02-19 20:53:51 +08:00
Baizhou Zhang
67fc595bb8 [Feature] Apply Cublas Grouped Gemm kernel (#3629) 2025-02-18 15:18:31 +08:00
yizhang2077
640363ad20 support blockwise fp8 matmul kernel (#3267) 2025-02-13 01:49:33 +08:00
Xiaoyu Zhang
bb418ced80 optimize per token group quant fp8 (#3490) 2025-02-11 22:19:05 +08:00
Xiaoyu Zhang
81262c7b72 clean up useless file (#3192) 2025-01-28 14:29:30 +08:00
HandH1998
82392da830 support w8a8 fp8 kernel with CUTLASS (#3047)
Co-authored-by: yych0745 <1398089567@qq.com>
2025-01-26 15:46:51 +08:00
Ke Bao
7bad7e75bf Add shapes for int8 gemm benchmark (#3093) 2025-01-24 12:27:30 +08:00
Xiaoyu Zhang
ac2dc35d0e support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 (#3030) 2025-01-23 15:29:20 +08:00
Yineng Zhang
b7f3fec13c minor: rename bench for sgl kernel (#2909) 2025-01-16 05:55:43 +08:00
Xiaoyu Zhang
d08c77c434 Sampling penalties memory interface (#2870) 2025-01-13 23:09:00 +08:00
Ke Bao
0f3eb1d294 Support cutlass Int8 gemm (#2752) 2025-01-06 22:51:22 +08:00