| Author | Commit | Message | Date |
|---|---|---|---|
| DefTruth | 12ef7e3bc3 | bugfix: fix merge_state_v2 cuda graph (#5419) | 2025-04-15 10:18:47 -07:00 |
| DefTruth | 388e15c0db | kernel: support slightly faster merge_state_v2 cuda kernel (#5381) | 2025-04-14 21:28:23 -07:00 |
| Yineng Zhang | b62e7e99b8 | feat: adapt merge_state (#5337) | 2025-04-12 21:14:04 -07:00 |
| Yineng Zhang | 812e82f35e | fix: solve cu118 issue for cutlass mla (#5331) | 2025-04-12 12:51:09 -07:00 |
| PGFLMG | 4879e50c6d | [Feat] Add sparse attn to sgl-kernel (#5327) | 2025-04-12 11:36:36 -07:00 |
| Zhaoyi Li | 3c9740d200 | update variable naming and comments for rocm (#5299) | 2025-04-11 23:15:05 -07:00 |
| Trevor Morris | f65b8d5c89 | Blackwell Cutlass MLA kernel (#5142) | 2025-04-11 22:16:51 -07:00 |
| Yineng Zhang | 136b8e6afb | fix: remove cublas_grouped_gemm (#5307) | 2025-04-11 16:22:37 -07:00 |
| Richard Zou | 76f44c2a8d | Fix deepseek-v3 with torch.compile in PyTorch 2.6. (#5213) | 2025-04-10 09:14:38 -07:00 |
| Xiaoyu Zhang | f730362ee2 | reduce moe_align_block_size_kernel small batch mode overhead (#5086) | 2025-04-09 17:59:35 -07:00 |
| Yi Zhang | ebf495f013 | sgl-kernel use cutlass latest version for fp8 blockwise gemm (#5207) | 2025-04-09 11:47:04 -07:00 |
| Ma Mingfei | a73c4df438 | Add optimized native kernels in sgl-kernel (#5150); Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>; Co-authored-by: YanbingJiang <yanbing.jiang@intel.com>; Co-authored-by: blzheng <beilei.zheng@intel.com> | 2025-04-08 09:37:46 -07:00 |
| Yi Zhang | bcbbf519f9 | sgl-kernel transfer custom allreduce from trt kernel to vllm kernel (#5079) | 2025-04-05 14:23:20 -07:00 |
| yinfan98 | b8b6008f47 | [Fix] fix fa3 build at cu118 (#5036) | 2025-04-03 11:52:35 -07:00 |
| Xiaoyu Zhang | 2c8fd99363 | [sgl-kernel] per token group quant support COLUMN MAJOR (#4817) | 2025-04-02 18:29:59 -07:00 |
| Yuhong Guo | ee47a6c1c3 | [Build] Fix cuda12.8 build error in nvfp4_scaled_mm_kernels.cu (#4953) | 2025-03-31 12:00:34 -07:00 |
| yinfan98 | c7457191a0 | [Fix] revert clean m.def for cudagraph (#4944) | 2025-03-31 02:08:55 -07:00 |
| yinfan98 | 37c66ec856 | [feat] add fa3 in sgl-kernel (#4902); Co-authored-by: Sleepcoo <Sleepcoo@gmail.com> | 2025-03-30 12:57:10 -07:00 |
| Yineng Zhang | 195a09f57c | fix bmm fp8 (#4926) | 2025-03-30 12:15:20 -07:00 |
| yinfan98 | 0d7fe866f9 | [Misc] Clean m.def and add Development Tips (#4890) | 2025-03-29 23:06:18 -07:00 |
| Qingquan Song | 45dcfc2e76 | Add deepseek style fused moe group gate selection kernel (#4530) | 2025-03-29 11:51:45 -07:00 |
| Yineng Zhang | ec3ee0289d | fix sgl-kernel cu118 build (#4872) | 2025-03-28 17:23:51 -07:00 |
| Yineng Zhang | 8bf6d7f406 | support cmake for sgl-kernel (#4706); Co-authored-by: hebiao064 <hebiaobuaa@gmail.com>; Co-authored-by: yinfan98 <1106310035@qq.com> | 2025-03-27 01:42:28 -07:00 |
| Yi Pan | 45fdf1f7f3 | Fix shared memory OOM on sm86 GPUs. (#4797) | 2025-03-26 10:41:53 -07:00 |
| Trevor Morris | e9f8e42318 | Support FP4 gemm (1/2) (#3899) | 2025-03-24 19:50:23 -07:00 |
| Chunan Zeng | 65c24c28f9 | [Quant Kernel] refactored per token group quant fp8 to support int8 up-to 2x faster (#4396) | 2025-03-23 23:44:17 -07:00 |
| Alex Sun | af6535e7aa | [ROCm] Enable MTP (NextN) on AMD GPU (#4631) | 2025-03-23 22:58:05 -07:00 |
| AniZpZ | 321ab756bc | [1/3] fix dsv3 awq issue (#4556); Co-authored-by: leoneo <1320612015@qq.com> | 2025-03-22 01:07:17 -07:00 |
| Chunan Zeng | 6a384d5c01 | Speed up per token and per tensor quant by 15% (#4639) | 2025-03-22 00:37:57 -07:00 |
| Shu Wang | ad4e58bf67 | Support fp8 gemm for blackwell (#4558) | 2025-03-20 12:40:28 -07:00 |
| Wenbo Yang | 75b656488a | Support serving DeepSeek-R1-Channel-INT8 with 32 L40S. (#4418) | 2025-03-17 00:03:43 -07:00 |
| yiakwy-xpu-ml-framework-team | 9b8333d992 | [ROCm] enable moe topk softmax in amd (#4448) | 2025-03-16 18:16:55 -07:00 |
| Yi Zhang | 25e1816eff | fix custom allreduce performance/accuracy problem (#4477) | 2025-03-16 12:16:30 -07:00 |
| Ying Sheng | 1b859295f4 | [Eagle] Remove the greedy branch and some redundant code (#4363); Co-authored-by: Sehoon Kim <sehoon@x.ai> | 2025-03-16 02:48:55 -07:00 |
| Ying Sheng | 52a34d7448 | Add greedy verification kernel (#4383) | 2025-03-16 00:58:26 -07:00 |
| Qingquan Song | 61e4433caf | Add moe topk softmax templated from vllm (#4302) | 2025-03-14 12:03:33 -07:00 |
| Yineng Zhang | 2937387a50 | fix accuracy issue (#4376) | 2025-03-13 02:06:22 -07:00 |
| Qingquan Song | 4068e01292 | Fix per token fp8 quant precision (#4362) | 2025-03-12 21:19:05 -07:00 |
| Shi Shuai | 817d43705c | feat: support ep size < 32 for sgl kernel (#4348) | 2025-03-12 20:50:46 -07:00 |
| Elfie Guo | 7c86671131 | Support Blackwell Block Scale FP8 Gemm (#4278) | 2025-03-12 14:17:11 -07:00 |
| Rex | 07f944631e | Add awq dequantize kernel to sgl with 1x to 3x speedup (#4104) | 2025-03-12 00:10:02 -07:00 |
| Stefan He | e0917e6bd0 | Remove vllm ops scaled fp8 quant and accelerate per token quant by 20-28% (#4215); Co-authored-by: Stefan He <bhe@linkedin.com> | 2025-03-12 00:08:03 -07:00 |
| yigex | 690e1f2371 | [AMD] Fix rocm sgl-kernel missing modules error (#4311); Co-authored-by: yiakwy-xpu-ml-framework-team <leiwang2@amd.com> | 2025-03-11 10:35:28 -07:00 |
| Lianmin Zheng | cf0ccd406e | Optimize rope in sgl kernel (#4267) | 2025-03-10 10:07:45 -07:00 |
| Xiaoyu Zhang | 23308a9032 | fix per_token_group_quant_fp8 illegal memory when num_groups % 16 != 0 (#4231) | 2025-03-10 01:42:58 -07:00 |
| Lianmin Zheng | aa957102a9 | Simplify tests & Fix trtllm custom allreduce registration (#4252) | 2025-03-10 01:24:22 -07:00 |
| Lianmin Zheng | 7c0541b385 | Move activation.cu to sgl-kernel/elementwise (#4250) | 2025-03-09 22:41:13 -07:00 |
| Lianmin Zheng | 730d084f2a | Minor style fix for sgl-kernel (#4243) | 2025-03-09 20:15:13 -07:00 |
| Lianmin Zheng | eb06dbcbf8 | Move rope and bmm into sgl-kernel (#4241) | 2025-03-09 18:38:15 -07:00 |
| Lianmin Zheng | 8abf74e3c9 | Rename files in sgl kernel to avoid nested folder structure (#4213); Co-authored-by: zhyncs <me@zhyncs.com> | 2025-03-08 22:54:51 -08:00 |