1e85589dc5  hlu1  2025-08-29 21:15:08 -07:00
    Make fp4_quantize kernels work on sm103 (#9807)
    Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>

5c34b4f1c7  Kaixi Hou  2025-08-29 17:17:03 -07:00
    [NVIDIA] [2/N] Optimize silu_and_mul_scaled_fp4_grouped_quant perf (#9556)

7a16db9bd9  hlu1  2025-08-28 23:47:29 -07:00
    Make sm100 fp8 kernels available on sm103 (#9789)
    Signed-off-by: Hao Lu <14827759+hlu1@users.noreply.github.com>

5ad296bda1  Ma Mingfei  2025-08-28 17:21:55 -07:00
    Optimize prefill performance on cpu backend (#8750)

711390a971  Hubert Lu  2025-08-28 15:27:07 -07:00
    [AMD] Support Hierarchical Caching on AMD GPUs (#8236)

6b39f9cf8c  Rain Jiang  2025-08-28 10:18:03 -07:00
    Support compile sgl-kernel on cuda 13.0 (#9721)

aa3eba8eb4  PGFLMG  2025-08-27 12:01:30 -07:00
    [sgl-kernel] misc: update deepgemm version for sgl-kernel (#9340)
    Co-authored-by: Yineng Zhang <me@zhyncs.com>
    Co-authored-by: fzyzcjy <ch271828n@outlook.com>

79e6a8a6ac  Rain Jiang  2025-08-26 23:13:27 -07:00
    support cuda 13.0 and trtllm kernel by Aug 25 2025 (#9495)

fda4792620  Qi Yuhang  2025-08-24 23:24:43 -07:00
    Update CUTLASS 4.2 & Enable K-Major Scale Factor for SM90 FP8 Blockwise Group GEMM (#9559)

e5638573c1  Kaixi Hou  2025-08-22 12:19:45 -07:00
    [NVIDA] [1/N] Nvfp4 Masked Gemm: Add quant op for the flashinfer grouped gemm (#9200)

5fd311d33e  kousakawang  2025-08-21 19:23:29 -07:00
    [code clean] add H20 cutlass groupGemm default config (#9333)
    Co-authored-by: wanghanpei <wanghanpei@bytedance.com>

de4990a5b2  Yuhao Yao  2025-08-21 03:45:18 -07:00
    [Bug] Fix w4afp8 moe kernel (#9392)

42c8704560  fzyzcjy  2025-08-20 01:56:29 -07:00
    Add PDL support for quant kernel and rope kernel (#9106)

c6c379ab31  Hubert Lu  2025-08-18 16:53:44 -07:00
    [AMD] Reorganize hip-related header files in sgl-kernel (#9320)

c480a3f6ea  Lianmin Zheng  2025-08-18 09:38:35 -07:00
    Minor style fixes for sgl-kernel (#9289)

0fc54b971e  kousakawang  2025-08-17 13:09:49 -07:00
    [fix]: fix cutlass moe ut and and Opt H20 cutlass groupGemm performance (#9272)
    Co-authored-by: wanghanpei <wanghanpei@bytedance.com>

9c3e95d98b  Hubert Lu  2025-08-15 12:32:51 -07:00
    [AMD] Expand test coverage for AMD CI and enable apply_token_bitmask_inplace_cuda in sgl-kernel (#8268)

0c8594e67d  Liangsheng Yin  2025-08-15 21:33:52 +08:00
    Optional extension for green context (#9231)

4fc09e0df0  jy-song-hub  2025-08-15 01:46:16 -07:00
    Fp4 MOE quant kernel optimization (#8777)
    Co-authored-by: Rain Jiang <96632942+rainj-me@users.noreply.github.com>

53dcc750b6  Yuan Luo  2025-08-14 10:56:36 -07:00
    [sgl-kernel] Support FlashInfer top_k_top_p_sampling_from_logits (#9060)
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

432f2053dd  Yuan Luo  2025-08-14 10:55:54 -07:00
    [sgl-kernel] 1/N Refactor sglang cutlass 3x - gemm fp8 blockwise sm90 (#8913)
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

5aa1ebd242  Peng Zhang  2025-08-14 03:19:03 -07:00
    [2/n]decouple quantization implementation from vLLM dependency (#8112)
    Co-authored-by: walker-ai <yiyun.wyt@antgroup.com>
    Co-authored-by: leoneo <1320612015@qq.com>

841810f227  henryg  2025-08-13 21:59:22 -07:00
    [Perf] Tunings for SM100 FP8 CUTLASS kernel (#8818)

9e426466af  Lianmin Zheng  2025-08-13 13:56:04 -07:00
    Clean up allocators (#9134)
    Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

94f44b88d1  Ke Bao  2025-08-13 20:05:02 +08:00
    Update fa3 interface and add unit test (#9150)

13c48dcf88  Trevor Morris  2025-08-12 20:12:38 -07:00
    [1/2][resubmit again] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (#9088)

86a0be65d8  DarkSharpness  2025-08-12 16:56:51 -07:00
    [Feature] Support custom set kv buffer kernel (#8884)

445f9dca6e  Liangsheng Yin  2025-08-12 09:26:10 -07:00
    Runtime check CUDA driver version to avoid unresolved green context symbols (#9021)

9aea255522  fzyzcjy  2025-08-12 01:46:40 -07:00
    Fuse writing KV buffer into rope kernel (part 1: sgl-kernel) (#9077)

dd949ace23  Yineng Zhang  2025-08-10 17:34:54 -07:00
    Revert "[1/2][resubmit] sgl-kernel: Fuse routed scaling factor into m… (#9035)

86497d99f2  huangtingwei  2025-08-09 17:16:11 -07:00
    fix page first per layer pf2lf kernel (#8915)
    Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>

5c31b35db2  cctry  2025-08-09 17:16:07 -07:00
    [hicache] Optimization for DMA copy (#8245)

591c232f7c  Trevor Morris  2025-08-08 17:55:06 -07:00
    [1/2][resubmit] sgl-kernel: Fuse routed scaling factor into moe_fused_gate (select_experts) (#8770)

444013585d  triple-mu  2025-08-08 00:18:08 -07:00
    Fix typos and unify size(s)/stride(s) API calls (#8799)

08f8f49016  Chunyuan WU  2025-08-04 18:28:31 -07:00
    [CPU][sgl-kernel] biased_grouped_topk: fix correction_bias dtype to float32 (#8212)
    Co-authored-by: jianan-gu <jianan.gu@intel.com>
    Co-authored-by: YanbingJiang <yanbing.jiang@intel.com>

f57d2dc162  Xiaoyu Zhang  2025-08-04 12:55:57 +08:00
    [sgl-kernel] avoid per_token_quant_fp8.cu hardcode sm_count (#8738)

d9def43dcd  Qi Yuhang  2025-08-02 21:13:47 -07:00
    [Perf]Use Cooperative Schedule for H100 & H200 & H800 in fp8_blockwise_scaled_grouped_mm (#8722)

603f5ce020  Liangsheng Yin  2025-08-02 15:23:11 -07:00
    [Bug] fix green context's incompatibility with cuda < 12.4 (#8701)

f9f0138f80  Liangsheng Yin  2025-08-02 20:14:30 +08:00
    Revert "[1/2] sgl-kernel: Fuse routed scaling factor into select_experts" (#8706)

f642524fd9  Trevor Morris  2025-08-01 18:14:24 -07:00
    [1/2] sgl-kernel: Fuse routed scaling factor into select_experts (#8364)

1fe691a429  YanbingJiang  2025-08-01 15:57:19 -07:00
    Fix FP8 block quantization when N or K is not multiples of 128 (#8648)

db7343c992  Stefan He  2025-08-01 09:27:18 -07:00
    fix per token cuda kernel hidden dim cannot divide by 16 (#8543)

6bdd27861b  Peter Pan  2025-08-01 22:01:24 +08:00
    [Kimi K2] dsv3_router_gemm supports NUM_EXPERTS == 384 (#8013)

5d15fb8c9d  Tao He  2025-07-31 22:41:39 +08:00
    [bugifx] QWen-1M context support[2/3] using current cuda stream in the DCA's kernel for bugfix. (#8611)
    Signed-off-by: Tao He <linzhu.ht@alibaba-inc.com>
    Co-authored-by: sa-buc <linzhu.ht@w32d09270.cloud.sqa.na131>

a5f5ab4030  Cheng Wan  2025-07-30 22:19:55 -07:00
    update sgl-kernel for EP: kernel part (#8514)
    Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
    Co-authored-by: Ke Bao <ispobaoke@gmail.com>

9b9e82539b  Qi Yuhang  2025-07-30 19:49:35 -07:00
    [Fix]Fix index oob in get_group_gemm_starts kernel. (#8564)

3bdcdd134b  Yuan Luo  2025-07-31 00:28:32 +08:00
    [Hot-Fix] moe_aligned_block_size CI failed in AMD (#8461)
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
    Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
    Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>

7a4309cc8a  Xiaoyu Zhang  2025-07-29 23:31:54 +08:00
    [sgl-kernel performace] fix fp8 quant kernels dispatch __nv_fp8_e4m3 bug to improve performance 10%-20% (#8499)
    Co-authored-by: Ke Bao <ispobaoke@gmail.com>

2262369905  Xiaoyu Zhang  2025-07-28 01:35:43 -07:00
    Revert "[kernel] opt moe align block kernel by block/warp scan algorithm" (#8457)

fb4ce17de6  strgrb  2025-07-28 01:32:46 -07:00
    Fix per_token_group_quant_8bit when hidden_dim // group_size is not divided by 4. (#8449)
    Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>