Commit Graph

63 Commits

| Author | SHA1 | Message | Date |
|---|---|---|---|
| Yineng Zhang | 6d55f60e77 | Revert "[1/2] Optimizations and refactors about quant kernel (#9534)" (#10292) | 2025-09-10 18:24:23 -07:00 |
| hlu1 | 5f1eb20484 | [chore] Remove unused ep_moe cuda kernels (#9956) | 2025-09-06 01:35:50 -07:00 |
| fzyzcjy | 339f8eef09 | [1/2] Optimizations and refactors about quant kernel (#9534) | 2025-09-05 18:45:08 +08:00 |
| fzyzcjy | e85cb1ce9d | Fix quant kernel test errors and benchmark wrong output speeds (#7604) | 2025-08-21 03:48:41 -07:00 |
| Yuan Luo | 53dcc750b6 | [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits (#9060) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) [see sampling sketch below] | 2025-08-14 10:56:36 -07:00 |
| henryg | 841810f227 | [Perf] Tunings for SM100 FP8 CUTLASS kernel (#8818) | 2025-08-13 21:59:22 -07:00 |
| fzyzcjy | 9aea255522 | Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077) | 2025-08-12 01:46:40 -07:00 |
| Yuan Luo | 1bd5316873 | fix benchmark fp8 blockwise group gemm (#8815) | 2025-08-06 21:02:21 +08:00 |
| Stefan He | db7343c992 | fix per token cuda kernel hidden dim cannot divide by 16 (#8543) | 2025-08-01 09:27:18 -07:00 |
| Peter Pan | 6bdd27861b | [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013) | 2025-08-01 22:01:24 +08:00 |
| Cheng Wan | a5f5ab4030 | update sgl-kernel for EP: kernel part (#8514) (Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>; Ke Bao <ispobaoke@gmail.com>) | 2025-07-30 22:19:55 -07:00 |
| Elfie Guo | 5c9c275bc8 | Use FlashInfer FP4 gemm. (#8241) | 2025-07-27 01:05:22 -07:00 |
| fzyzcjy | e34cf6ad75 | Fix bench script making input data on L2 cache (#7739) | 2025-07-27 00:30:24 -07:00 |
| Qi Yuhang | 426b74936a | Add nvfp4 scaled mm benchmark. (#8401) | 2025-07-26 23:18:04 -07:00 |
| Hubert Lu | af4b9bae95 | [AMD] Add silu_and_mul, gelu_and_mul, gelu_tanh_and_mul, and gelu_quick kernels for AMD GPUs (#7135) (Co-authored-by: yiakwy-xpu-ml-framework-team <961186938@qq.com>; HAI <hixiao@gmail.com>) [see silu_and_mul sketch below] | 2025-07-24 23:44:28 -07:00 |
| Peter Pan | 0f8b538614 | [fix] benchmark : routed_scaling_factor is None (#8059) (Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>) | 2025-07-22 08:55:35 -07:00 |
| Baizhou Zhang | 282eb59ff3 | Add bf16 output option for dsv3_router_gemm kernel (#7999) | 2025-07-20 09:49:37 +08:00 |
| Yi Zhang | 2998c4bdf4 | [optimize] fuse renormalize into moe_topk_softmax (#7744) (Co-authored-by: ispobock <ispobaoke@gmail.com>) | 2025-07-03 12:42:44 -07:00 |
| ayrnb | 2c4feaf308 | Add CUTLASS FP8 Blockscale MoE kernel for Hopper architecture (#7278) (Co-authored-by: HydraQYH <QYH820@Outlook.com>; TianQiLin666666 <1834987979@qq.com>) | 2025-07-02 23:27:03 -07:00 |
| Baizhou Zhang | 7248272ccc | Add dsv3 router gemm kernel (#7627) | 2025-06-29 23:31:55 -07:00 |
| Ke Bao | 04b35190e2 | Add dsv3 fused a gemm to sgl-kernel (#7630) | 2025-06-29 02:52:24 -07:00 |
| Ke Bao | 57ab776910 | Fuse sorted_token_ids padding to moe_align_block_size kernel (#7437) | 2025-06-24 17:44:27 -07:00 |
| xutizhou | 506c4928f5 | feat: integrate deepgemm into EPMoE (#6821) (Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>; TianQiLin666666 <1834987979@qq.com>; Cheng Wan <54331508+ch-wan@users.noreply.github.com>) | 2025-06-23 01:38:58 -07:00 |
| JieXin Liang | ab1a4fa5cb | [fix] fix cutlass_mla_backend with cuda_graph and add sm_scale for sgl-kernel cutlass_mla (#7184) | 2025-06-14 12:45:41 -07:00 |
| fzyzcjy | aa46ed34d2 | Remove 200us slow concat kernel (part 1: kernel) (#7145) | 2025-06-13 01:58:29 -07:00 |
| Yuan Luo | 84727a5139 | [sgl-kernel] Add cuda kernel for moe_ep_silu_and_mul (#6919) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-11 20:43:08 -07:00 |
| JieXin Liang | 18efb5e8e0 | [perf][sgl-kernel] extend cutlass_mla_decode to support num_head < 128 (#6929) | 2025-06-08 19:37:34 -07:00 |
| Yuan Luo | 43baba649e | [EP] Add cuda kernel for moe_ep_post_reorder (#6837) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-05 00:33:47 -07:00 |
| Xiaoyu Zhang | bd75690f4e | fix ep_moe_reorder kernel bugs (#6858) (Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>) | 2025-06-04 19:13:59 +08:00 |
| Yuan Luo | 55444ed667 | [EP] Add cuda kernel for moe_ep_pre_reorder (#6699) (Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>) | 2025-06-01 20:49:01 -07:00 |
| ChangyiYang | 485a023bd8 | refactor apply_w8a8_block_fp8_linear in fp (#6545) | 2025-05-29 00:15:11 -07:00 |
| HandH1998 | 4d643f6c7a | [1/2] Support Qserve (#6457) (Co-authored-by: yych0745 <1398089567@qq.com>; sleepcoo <sleepcoo@gmail.com>) | 2025-05-21 19:48:59 -07:00 |
| Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| applesaucethebun | d738ab52f8 | fix some typos (#6209) (Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>) | 2025-05-13 01:42:38 +08:00 |
| applesaucethebun | 2ce8793519 | Add typo checker in pre-commit (#6179) (Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>) | 2025-05-11 12:55:00 +08:00 |
| Zhaoyi Li | 3c9740d200 | update variable naming and comments for rocm (#5299) | 2025-04-11 23:15:05 -07:00 |
| Yineng Zhang | 136b8e6afb | fix: remove cublas_grouped_gemm (#5307) | 2025-04-11 16:22:37 -07:00 |
| Xiaoyu Zhang | f730362ee2 | reduce moe_align_block_size_kernel small batch mode overhead (#5086) | 2025-04-09 17:59:35 -07:00 |
| Yi Zhang | ebf495f013 | sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207) | 2025-04-09 11:47:04 -07:00 |
| Xiaoyu Zhang | 2c8fd99363 | [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) | 2025-04-02 18:29:59 -07:00 |
| Qingquan Song | 45dcfc2e76 | Add deepseek style fused moe group gate selection kernel (#4530) | 2025-03-29 11:51:45 -07:00 |
| Chunan Zeng | 65c24c28f9 | [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) [see quantization sketch below] | 2025-03-23 23:44:17 -07:00 |
| JieXin Liang | 1a3fa75f2f | [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) | 2025-03-16 00:02:47 -07:00 |
| Qingquan Song | 61e4433caf | Add moe topk softmax templated from vllm (#4302) | 2025-03-14 12:03:33 -07:00 |
| Yineng Zhang | 2937387a50 | fix accuracy issue (#4376) | 2025-03-13 02:06:22 -07:00 |
| Qingquan Song | 4068e01292 | Fix per token fp8 quant precision (#4362) | 2025-03-12 21:19:05 -07:00 |
| Shi Shuai | 817d43705c | feat: support ep size < 32 for sgl kernel (#4348) | 2025-03-12 20:50:46 -07:00 |
| Rex | 07f944631e | Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) | 2025-03-12 00:10:02 -07:00 |
| Xiaoyu Zhang | 7130a7cea9 | refine sgl_moe_align_block_size_benchmark (#4327) | 2025-03-11 22:48:38 -07:00 |
| Stefan He | 95085d65e9 | [Refactor] Reducing code duplication across FP8 CUDA quantization kernels (#4163) | 2025-03-06 22:58:52 -08:00 |
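
For the sampling commit above (#9060): top-k/top-p sampling from logits filters to the k largest logits, then keeps the smallest high-probability prefix whose cumulative mass reaches p before drawing a token. The following is a minimal pure-PyTorch sketch of those semantics only; it is not the FlashInfer kernel or its API, and the function name is illustrative.

```python
import torch

def top_k_top_p_sample_ref(logits: torch.Tensor, k: int, p: float) -> torch.Tensor:
    # Top-k: mask out everything below the k-th largest logit.
    kth = torch.topk(logits, k, dim=-1).values[..., -1:]
    logits = logits.masked_fill(logits < kth, float("-inf"))
    # Top-p: keep the shortest descending-probability prefix whose
    # cumulative mass reaches p, then renormalize and sample.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True, dim=-1)
    cum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cum - sorted_probs > p] = 0.0
    sorted_probs /= sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice)

token_ids = top_k_top_p_sample_ref(torch.randn(2, 32000), k=50, p=0.9)  # (2, 1)
```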
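The activation kernels added for AMD GPUs in #7135 fuse a gated activation with an elementwise multiply. A sketch of the usual silu_and_mul semantics in plain PyTorch, assuming the conventional [gate, up] layout on the last dimension; the HIP kernel itself is not shown and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def silu_and_mul_ref(x: torch.Tensor) -> torch.Tensor:
    # The last dim holds the concatenated [gate, up] projections;
    # the fused kernel computes silu(gate) * up in a single pass.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

# Example: 4 tokens, gate+up halves of width 11008 each.
out = silu_and_mul_ref(torch.randn(4, 2 * 11008))  # shape: (4, 11008)
```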
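Several commits above touch per-token-group quantization (#4396, #4817, #4163). The reference semantics are simple: each group of elements within a token's hidden dimension is scaled independently by its own absolute maximum. A hedged sketch assuming a float8_e4m3fn target and a PyTorch build with float8 dtypes; the group size, clamping epsilon, and function name are illustrative, and the column-major variant from #4817 is not shown.

```python
import torch

def per_token_group_quant_fp8_ref(x: torch.Tensor, group_size: int = 128):
    # x: (num_tokens, hidden); hidden must be divisible by group_size.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for e4m3fn
    groups = x.float().view(x.shape[0], -1, group_size)
    # One scale per (token, group): max(|x_group|) / fp8_max.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-10) / fp8_max
    q = (groups / scales).clamp(-fp8_max, fp8_max).view_as(x).to(torch.float8_e4m3fn)
    return q, scales.squeeze(-1)  # q: fp8 values, scales: (num_tokens, num_groups)

q, s = per_token_group_quant_fp8_ref(torch.randn(4, 1024))
```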