69 Commits

Author
SHA1 Message Date
yinfan98
d2e507df3c [Misc] clean up vllm in sgl-kernel test (#5189) 2025-04-09 01:22:13 -07:00
yinfan98
9798e72baa [Misc] Use pytest.mark.skipif in sgl-kernel test (#5137) 2025-04-07 21:35:14 -07:00
Yi Zhang
bcbbf519f9 sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (#5079) 2025-04-05 14:23:20 -07:00
Xiaoyu Zhang
924ca7c92c Add DeepSeek V3/R1 shared experts fusion (#4918) 2025-04-04 01:59:29 -07:00
yinfan98
b8b6008f47 [Fix] fix fa3 build at cu118 (#5036) 2025-04-03 11:52:35 -07:00
Xiaoyu Zhang
2c8fd99363 [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) 2025-04-02 18:29:59 -07:00
yinfan98
37c66ec856 [feat] add fa3 in sgl-kernel (#4902)
Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>
2025-03-30 12:57:10 -07:00
Adarsh Shirawalmath
9fccda3111 [Feature] use pytest for sgl-kernel (#4896) 2025-03-30 10:36:52 -07:00
Yi Zhang
5ec5eaf760 fix allreduce test (#4909) 2025-03-29 23:16:53 -07:00
Qingquan Song
45dcfc2e76 Add deepseek style fused moe group gate selection kernel (#4530) 2025-03-29 11:51:45 -07:00
yinfan98
ddf8981d91 Delete test_deep_gemm.py (#4891) 2025-03-29 10:46:11 -07:00
Trevor Morris
e9f8e42318 Support FP4 gemm (1/2) (#3899) 2025-03-24 19:50:23 -07:00
Chunan Zeng
65c24c28f9 [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) 2025-03-23 23:44:17 -07:00
AniZpZ
321ab756bc [1/3] fix dsv3 awq issue (#4556)
Co-authored-by: leoneo <1320612015@qq.com>
2025-03-22 01:07:17 -07:00
Wenbo Yang
75b656488a Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418) 2025-03-17 00:03:43 -07:00
Yineng Zhang
9971dc2283 Revert "feat: Add FlashMLA submodule (#4449)" (#4470) 2025-03-16 01:30:05 -07:00
Ying Sheng
52a34d7448 Add greedy verification kernel (#4383) 2025-03-16 00:58:26 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Shi Shuai
81f431eded feat: Add FlashMLA submodule (#4449)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
2025-03-15 23:30:25 -07:00
Qingquan Song
61e4433caf Add moe topk softmax templated from vllm (#4302) 2025-03-14 12:03:33 -07:00
Yineng Zhang
2937387a50 fix accuracy issue (#4376) 2025-03-13 02:06:22 -07:00
Qingquan Song
4068e01292 Fix per token fp8 quant precision (#4362) 2025-03-12 21:19:05 -07:00
Rex
07f944631e Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) 2025-03-12 00:10:02 -07:00
Xiaoyu Zhang
23308a9032 fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 (#4231) 2025-03-10 01:42:58 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
laixin
c553e1604c DeepGemm integrate to sgl-kernel (#4165)
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
2025-03-10 00:35:07 -07:00
Lianmin Zheng
8abf74e3c9 Rename files in sgl kernel to avoid nested folder structure (#4213)
Co-authored-by: zhyncs <me@zhyncs.com>
2025-03-08 22:54:51 -08:00
lukec
b93ef5e56d Remove the vllm dependency from the moe_align function (#4164)
Co-authored-by: Hongbosherlock <hongbosherlock@gmail.com>
2025-03-07 22:42:16 -08:00
Stefan He
63ee26d162 Add sgl_per_token_quant_fp8 (#4089) 2025-03-06 20:53:05 -08:00
Xiaoyu Zhang
ad55f17182 [quant kernel] sgl-kernel support per_tensor_quant fp8 (#3786) 2025-03-06 18:05:43 -08:00
Lianmin Zheng
110e006673 Reorganize python source files in sgl-kernel with multiple files (#4027) 2025-03-03 06:36:40 -08:00
Lianmin Zheng
6b45a21d16 Reorganize c++ source files in sgl-kernel with multiple folders (#4025) 2025-03-03 05:32:30 -08:00
Chayenne
18bb216c28 Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" (#3982) 2025-02-28 23:57:17 -08:00
yiakwy-xpu-ml-framework-team
1c96fa86cf [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) (#3613) 2025-02-27 19:42:48 -08:00
Baizhou Zhang
67fc595bb8 [Feature] Apply Cublas Grouped Gemm kernel (#3629) 2025-02-18 15:18:31 +08:00
yizhang2077
640363ad20 support blockwise fp8 matmul kernel (#3267) 2025-02-13 01:49:33 +08:00
Xiaoyu Zhang
bb418ced80 optimize per token group quant fp8 (#3490) 2025-02-11 22:19:05 +08:00
Yineng Zhang
f9905d59a8 support speculative decoding kernel in sgl-kernel (#3373)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-07 20:29:51 +08:00
Xiaoyu Zhang
ad3499858e clean moe align block kernel code and add acc test (#3332) 2025-02-06 16:42:36 +08:00
Yineng Zhang
827aa8730b cleanup sgl-kernel kernels (#3175) 2025-01-27 19:11:01 +08:00
Byron Hsu
514f37c32b [kernel] Fix position ids in rope (#3173) 2025-01-27 17:09:51 +08:00
Byron Hsu
fb11a43981 [kernel] Integrate flashinfer's rope with higher precision and better perf (#3134) 2025-01-27 15:28:00 +08:00
HandH1998
82392da830 support w8a8 fp8 kernel with CUTLASS (#3047)
Co-authored-by: yych0745 <1398089567@qq.com>
2025-01-26 15:46:51 +08:00
Yineng Zhang
95f789adb0 minor: cleanup sgl-kernel (#3143) 2025-01-26 14:29:58 +08:00
Xiaoyu Zhang
5d9d15e70f support fp32 in sampling_scaling_penalties kernel (#3121) 2025-01-25 16:52:17 +08:00
Yineng Zhang
5de4051bcf feat: integrate sampling kernels into sgl-kernel (#3086)
Co-authored-by: Zihao Ye <expye@outlook.com>
2025-01-24 01:54:47 +08:00
Xiaoyu Zhang
e0cd65c2b6 [hotfix] fix test_sampling_scaling_penalties.py ci test (#3084) 2025-01-24 00:33:59 +08:00
Xiaoyu Zhang
f1b6861828 use flashinfer vec_dtypes in sgl_kernel (#3083) 2025-01-23 22:19:04 +08:00
Yineng Zhang
0da0989ad4 sync flashinfer and update sgl-kernel tests (#3081) 2025-01-23 21:13:55 +08:00
Xiaoyu Zhang
ac2dc35d0e support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 (#3030) 2025-01-23 15:29:20 +08:00