Commit Graph

58 Commits

Author SHA1 Message Date
Trevor Morris
e9f8e42318 Support FP4 gemm (1/2) (#3899) 2025-03-24 19:50:23 -07:00
Chunan Zeng
65c24c28f9 [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) 2025-03-23 23:44:17 -07:00
AniZpZ
321ab756bc [1/3] fix dsv3 awq issue (#4556) 2025-03-22 01:07:17 -07:00
Co-authored-by: leoneo <1320612015@qq.com>
Wenbo Yang
75b656488a Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418) 2025-03-17 00:03:43 -07:00
Yineng Zhang
9971dc2283 Revert "feat: Add FlashMLA submodule (#4449)" (#4470) 2025-03-16 01:30:05 -07:00
Ying Sheng
52a34d7448 Add greedy verification kernel (#4383) 2025-03-16 00:58:26 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Shi Shuai
81f431eded feat: Add FlashMLA submodule (#4449) 2025-03-15 23:30:25 -07:00
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Qingquan Song
61e4433caf Add moe topk softmax templated from vllm (#4302) 2025-03-14 12:03:33 -07:00
Yineng Zhang
2937387a50 fix accuracy issue (#4376) 2025-03-13 02:06:22 -07:00
Qingquan Song
4068e01292 Fix per token fp8 quant precision (#4362) 2025-03-12 21:19:05 -07:00
Rex
07f944631e Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) 2025-03-12 00:10:02 -07:00
Xiaoyu Zhang
23308a9032 fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 (#4231) 2025-03-10 01:42:58 -07:00
Lianmin Zheng
aa957102a9 Simplify tests & Fix trtllm custom allreduce registration (#4252) 2025-03-10 01:24:22 -07:00
laixin
c553e1604c DeepGemm integrate to sgl-kernel (#4165) 2025-03-10 00:35:07 -07:00
Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Co-authored-by: HandH1998 <1335248067@qq.com>
Co-authored-by: shuaills <shishuaiuoe@gmail.com>
Co-authored-by: yinfan98 <1106310035@qq.com>
Co-authored-by: Yineng Zhang <me@zhyncs.com>
Lianmin Zheng
8abf74e3c9 Rename files in sgl kernel to avoid nested folder structure (#4213) 2025-03-08 22:54:51 -08:00
Co-authored-by: zhyncs <me@zhyncs.com>
lukec
b93ef5e56d Remove the vllm dependency from the moe_align function (#4164) 2025-03-07 22:42:16 -08:00
Co-authored-by: Hongbosherlock <hongbosherlock@gmail.com>
Stefan He
63ee26d162 Add sgl_per_token_quant_fp8 (#4089) 2025-03-06 20:53:05 -08:00
Xiaoyu Zhang
ad55f17182 [quant kernel] sgl-kernel support per_tensor_quant fp8 (#3786) 2025-03-06 18:05:43 -08:00
Lianmin Zheng
110e006673 Reorganize python source files in sgl-kernel with multiple files (#4027) 2025-03-03 06:36:40 -08:00
Lianmin Zheng
6b45a21d16 Reorganize c++ source files in sgl-kernel with multiple folders (#4025) 2025-03-03 05:32:30 -08:00
Chayenne
18bb216c28 Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" (#3982) 2025-02-28 23:57:17 -08:00
yiakwy-xpu-ml-framework-team
1c96fa86cf [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) (#3613) 2025-02-27 19:42:48 -08:00
Baizhou Zhang
67fc595bb8 [Feature] Apply Cublas Grouped Gemm kernel (#3629) 2025-02-18 15:18:31 +08:00
yizhang2077
640363ad20 support blockwise fp8 matmul kernel (#3267) 2025-02-13 01:49:33 +08:00
Xiaoyu Zhang
bb418ced80 optimize per token group quant fp8 (#3490) 2025-02-11 22:19:05 +08:00
Yineng Zhang
f9905d59a8 support speculative decoding kernel in sgl-kernel (#3373) 2025-02-07 20:29:51 +08:00
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Xiaoyu Zhang
ad3499858e clean moe align block kernel code and add acc test (#3332) 2025-02-06 16:42:36 +08:00
Yineng Zhang
827aa8730b cleanup sgl-kernel kernels (#3175) 2025-01-27 19:11:01 +08:00
Byron Hsu
514f37c32b [kernel] Fix position ids in rope (#3173) 2025-01-27 17:09:51 +08:00
Byron Hsu
fb11a43981 [kernel] Integrate flashinfer's rope with higher precision and better perf (#3134) 2025-01-27 15:28:00 +08:00
HandH1998
82392da830 support w8a8 fp8 kernel with CUTLASS (#3047) 2025-01-26 15:46:51 +08:00
Co-authored-by: yych0745 <1398089567@qq.com>
Yineng Zhang
95f789adb0 minor: cleanup sgl-kernel (#3143) 2025-01-26 14:29:58 +08:00
Xiaoyu Zhang
5d9d15e70f support fp32 in sampling_scaling_penalties kernel (#3121) 2025-01-25 16:52:17 +08:00
Yineng Zhang
5de4051bcf feat: integrate sampling kernels into sgl-kernel (#3086) 2025-01-24 01:54:47 +08:00
Co-authored-by: Zihao Ye <expye@outlook.com>
Xiaoyu Zhang
e0cd65c2b6 [hotfix] fix test_sampling_scaling_penalties.py ci test (#3084) 2025-01-24 00:33:59 +08:00
Xiaoyu Zhang
f1b6861828 use flashinfer vec_dtypes in sgl_kernel (#3083) 2025-01-23 22:19:04 +08:00
Yineng Zhang
0da0989ad4 sync flashinfer and update sgl-kernel tests (#3081) 2025-01-23 21:13:55 +08:00
Xiaoyu Zhang
ac2dc35d0e support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 (#3030) 2025-01-23 15:29:20 +08:00
Yineng Zhang
bf669606eb feat: integrate bmm_fp8 kernel into sgl-kernel (#3056) 2025-01-23 00:39:38 +08:00
Yineng Zhang
9d9b482a39 feat: integrate activation kernels into sgl-kernel (#3053) 2025-01-22 23:25:45 +08:00
Yineng Zhang
7353fb9b97 feat: integrate norm kernels into sgl-kernel (#3052) 2025-01-22 21:32:48 +08:00
Ke Bao
0ac019f171 Support sm90 Int8 gemm (#3035) 2025-01-21 22:21:54 +08:00
Yineng Zhang
5a0d680a14 feat: add flashinfer as 3rdparty and use rmsnorm as example (#3033) 2025-01-21 20:44:49 +08:00
Byron Hsu
b5caa22dfb [kernel] port rope cuda kernel to sgl-kernel (#2993) 2025-01-20 20:58:51 +08:00
Co-authored-by: Yineng Zhang <me@zhyncs.com>
yizhang2077
6cb3974e77 optimize custom allreduce kernel (#2904) 2025-01-16 03:04:25 +08:00
Xiaoyu Zhang
d08c77c434 Sampling penalties memory interface (#2870) 2025-01-13 23:09:00 +08:00
Xiaoyu Zhang
e2b16c4716 add sampling_scaling_penalties kernel (#2846) 2025-01-12 19:38:17 -08:00
Ke Bao
58f9060efe Update int8 gemm config (#2774) 2025-01-07 19:47:37 +08:00
Ke Bao
0f3eb1d294 Support cutlass Int8 gemm (#2752) 2025-01-06 22:51:22 +08:00