Commit Graph

221 Commits

Author SHA1 Message Date
Yi Zhang
bcbbf519f9 sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (#5079) 2025-04-05 14:23:20 -07:00
Yineng Zhang
3f287b8579 support sgl-kernel on blackwell (#5074) 2025-04-04 16:59:32 -07:00
Xiaoyu Zhang
924ca7c92c Add DeepSeek V3/R1 shared experts fusion (#4918) 2025-04-04 01:59:29 -07:00
Yineng Zhang
d7954b7682 bump sgl-kernel v0.0.7 (#5046) 2025-04-03 13:38:13 -07:00
yinfan98
b8b6008f47 [Fix] fix fa3 build at cu118 (#5036) 2025-04-03 11:52:35 -07:00
Zhiqiang Xie
9d0b36c47a fix deepgemm as well (#5030) 2025-04-03 02:41:37 -07:00
Yuhong Guo
7d8c0ce7ce [Build] Support build sgl-kernel with ccache (#5020) 2025-04-03 00:22:37 -07:00
Zhiqiang Xie
a2aea59b6e update cutlass tag (#5011) 2025-04-02 18:30:30 -07:00
Xiaoyu Zhang
2c8fd99363 [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) 2025-04-02 18:29:59 -07:00
Yuhong Guo
ee47a6c1c3 [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (#4953) 2025-03-31 12:00:34 -07:00
Yineng Zhang
6384d31776 bump sgl-kernel v0.0.6 (#4950) 2025-03-31 11:24:09 -07:00
yinfan98
c7457191a0 [Fix] revert clean m.def for cudagraph (#4944) 2025-03-31 02:08:55 -07:00
Yineng Zhang
4814ecaff9 cleanup sgl-kernel (#4933) 2025-03-30 14:12:30 -07:00
yinfan98
37c66ec856 [feat] add fa3 in sgl-kernel (#4902) 2025-03-30 12:57:10 -07:00
    Co-authored-by: Sleepcoo <Sleepcoo@gmail.com>
Yineng Zhang
195a09f57c fix bmm fp8 (#4926) 2025-03-30 12:15:20 -07:00
Adarsh Shirawalmath
9fccda3111 [Feature] use pytest for sgl-kernel (#4896) 2025-03-30 10:36:52 -07:00
Yi Zhang
5ec5eaf760 fix allreduce test (#4909) 2025-03-29 23:16:53 -07:00
yinfan98
0d7fe866f9 [Misc] Clean m.def and add Development Tips (#4890) 2025-03-29 23:06:18 -07:00
Yineng Zhang
54b9a2de0a remove setup for sgl-kernel (#4899) 2025-03-29 12:47:38 -07:00
yinfan98
8e7b31546c quick fix: add default for new kernel (#4898) 2025-03-29 12:31:59 -07:00
Qingquan Song
45dcfc2e76 Add deepseek style fused moe group gate selection kernel (#4530) 2025-03-29 11:51:45 -07:00
yinfan98
ddf8981d91 Delete test_deep_gemm.py (#4891) 2025-03-29 10:46:11 -07:00
yinfan98
05625b9792 [Docs] Update DeepGEMM at README.md (#4886) 2025-03-29 09:53:39 -07:00
Yineng Zhang
ec3ee0289d fix sgl-kernel cu118 build (#4872) 2025-03-28 17:23:51 -07:00
Yineng Zhang
92941ce7b5 bump sgl-kernel 0.0.5.post4 (#4768) 2025-03-28 14:40:53 -07:00
Yineng Zhang
2bb0e7cf43 fix sampling issue (#4871) 2025-03-28 14:07:21 -07:00
yinfan98
4db29e82ec [Feat] support deepgemm for cmake (#4864) 2025-03-28 10:51:44 -07:00
Yineng Zhang
6dea5c96bf Revert "get the python version from env (#4729)" (#4863) 2025-03-28 08:07:48 -07:00
DavidChan
5eae67cb1f get the python version from env (#4729) 2025-03-27 22:26:42 -07:00
Yineng Zhang
31dfff7da7 use default for torch.ops (#4835) 2025-03-27 19:09:58 -07:00
Yineng Zhang
8bf6d7f406 support cmake for sgl-kernel (#4706) 2025-03-27 01:42:28 -07:00
    Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>
    Co-authored-by: yinfan98 <1106310035@qq.com>
Yi Pan
45fdf1f7f3 Fix shared memory OOM on sm86 GPUs. (#4797) 2025-03-26 10:41:53 -07:00
Trevor Morris
e9f8e42318 Support FP4 gemm (1/2) (#3899) 2025-03-24 19:50:23 -07:00
Chunan Zeng
65c24c28f9 [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) 2025-03-23 23:44:17 -07:00
Alex Sun
af6535e7aa [ROCm] Enable MTP (NextN) on AMD GPU (#4631) 2025-03-23 22:58:05 -07:00
AniZpZ
321ab756bc [1/3] fix dsv3 awq issue (#4556) 2025-03-22 01:07:17 -07:00
    Co-authored-by: leoneo <1320612015@qq.com>
Chunan Zeng
6a384d5c01 Speed up per token and per tensor quant by 15% (#4639) 2025-03-22 00:37:57 -07:00
Shu Wang
ad4e58bf67 Support fp8 gemm for blackwell (#4558) 2025-03-20 12:40:28 -07:00
strgrb
f9c53cbb42 Create col-major and tma-aligned x_scale for deep_gemm.gemm_fp8_fp8_bf16_nt (#4515) 2025-03-19 00:02:43 -07:00
    Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
Yineng Zhang
988ab646ec bump v0.0.5.post3 (#4520) 2025-03-17 13:05:59 -07:00
Wenbo Yang
75b656488a Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418) 2025-03-17 00:03:43 -07:00
yiakwy-xpu-ml-framework-team
9b8333d992 [ROCm] enable moe topk softmax in amd (#4448) 2025-03-16 18:16:55 -07:00
Yi Zhang
25e1816eff fix custom allreduce performance/accuracy problem (#4477) 2025-03-16 12:16:30 -07:00
Ying Sheng
1b859295f4 [Eagle] Remove the greedy branch and some redundant code (#4363) 2025-03-16 02:48:55 -07:00
    Co-authored-by: Sehoon Kim <sehoon@x.ai>
Yineng Zhang
9971dc2283 Revert "feat: Add FlashMLA submodule (#4449)" (#4470) 2025-03-16 01:30:05 -07:00
Lianmin Zheng
3db35c1af4 Release sgl-kernel v0.0.5.post2 (#4469) 2025-03-16 01:01:53 -07:00
Ying Sheng
52a34d7448 Add greedy verification kernel (#4383) 2025-03-16 00:58:26 -07:00
JieXin Liang
1a3fa75f2f [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466) 2025-03-16 00:02:47 -07:00
Shi Shuai
81f431eded feat: Add FlashMLA submodule (#4449) 2025-03-15 23:30:25 -07:00
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
Yineng Zhang
862fe52241 bump v0.0.5.post1 (#4437) 2025-03-14 15:00:26 -07:00