Commit Graph

533 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| Yineng Zhang | 5f1a485d9e | Revert "[ROCm] Use tl.range() in block GEMM kernels with `num_stages` set by host." (#3632) | 2025-02-17 18:01:21 +08:00 |
| Wen-Heng (Jack) Chung | 03caefeb51 | [ROCm] Use tl.range() in block GEMM kernels with num_stages set by host. (#3535) (Co-authored-by: HAI <hixiao@gmail.com>) | 2025-02-16 01:40:38 -08:00 |
| Yineng Zhang | 70f894b810 | feat: support flashinfer mla attention for deepseek v3 (#3550) | 2025-02-14 08:50:14 +08:00 |
| Wen-Heng (Jack) Chung | 871a4aa1bf | [ROCm] Add ROCm tuning configs for AMD Instinct MI325X. (#3536) | 2025-02-12 20:09:36 -08:00 |
| yizhang2077 | 98eecbda54 | integrate blockwise fp8 kernel (#3529) | 2025-02-13 04:39:33 +08:00 |
| Liangsheng Yin | 8616357a97 | Fix deepseek awq v3 (#3450) | 2025-02-12 22:09:52 +08:00 |
| Xiaoyu Zhang | 45e3a7bc41 | use sgl_per_token_group_quant_fp8 kernel (#3493) | 2025-02-12 18:40:42 +08:00 |
| Ke Bao | 7e6d5fc694 | Support Eagle cuda graph for Triton backend (#3500) | 2025-02-12 02:27:45 +08:00 |
| Wen-Heng (Jack) Chung | cadd5dbe6a | Tune MI300X fused MoE Triton kernel JSON config. (#3492) | 2025-02-11 10:27:25 -08:00 |
| yigex | fdf04a1426 | [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418) (Co-authored-by: Bruce Xue <yigex@xilinx.com>, HAI <hixiao@gmail.com>) | 2025-02-10 23:55:04 -08:00 |
| Xiaoyu Zhang | 2f47d710ae | refine some typo (#3473) | 2025-02-10 23:35:44 +08:00 |
| Ying Sheng | d23cb9a01e | [Eagle] reduce one draft forward (#3468) | 2025-02-10 20:21:49 +08:00 |
| Ke Bao | 2d61132374 | Support Eagle2 for Triton backend (#3466) | 2025-02-10 20:00:42 +08:00 |
| Yineng Zhang | 27c4c9cf52 | remove _grouped_size_compiled_for_decode_kernels (#3453) | 2025-02-10 13:01:21 +08:00 |
| Yineng Zhang | 36f6fc5093 | feat: enable ragged fa3 by default on hopper 12.4+ (#3442) | 2025-02-10 07:43:01 +08:00 |
| Yineng Zhang | 64c8713573 | remove activation dependency in fused_moe (#3433) | 2025-02-10 01:18:57 +08:00 |
| Yineng Zhang | 014cab4dd2 | update forward_return_lse (#3425) | 2025-02-09 20:18:44 +08:00 |
| Ying Sheng | 7b4e61fff3 | [Fix] Fix eagle with disable cuda graph (#3411) | 2025-02-09 08:40:00 +08:00 |
| lukec | 4530136e61 | Add H20 fp8 w8a8 gemm config (#3386) | 2025-02-08 15:36:31 +08:00 |
| lizamd | e868d0b60e | update waves_per_eu to 1 (#3356) | 2025-02-07 13:08:06 +08:00 |
| Xiaoyu Zhang | cdae77b03d | optimize moe_align_kernel cuda (#3347) | 2025-02-07 00:53:46 +08:00 |
| Wen-Heng (Jack) Chung | 32de54ed1a | [ROCm] Fix fp8 unrolledx4 matmul kernel. (#3325) (Co-authored-by: HAI <hixiao@gmail.com>) | 2025-02-05 18:44:15 -08:00 |
| Ke Bao | a322051e31 | Support custom mask for Triton attention (#3317) | 2025-02-06 01:16:02 +08:00 |
| Ke Bao | de5533341e | Update Triton extend backend interface (#3309) | 2025-02-05 18:12:22 +08:00 |
| Wen-Heng (Jack) Chung | 7ab84948d8 | [ROCm] Logic to decide whether to used manually unrolled kernel. (#3306) | 2025-02-04 19:12:20 -08:00 |
| Wen-Heng (Jack) Chung | c2723a42a5 | [ROCm] Manually unroll _w8a8_block_fp8_matmul kernel on AMD GPU. (#3299) | 2025-02-05 07:15:40 +08:00 |
| Wen-Heng (Jack) Chung | c7256ca836 | [ROCm] Add tuning configs for AMD Radeon Graphics. (#3294) | 2025-02-04 10:34:57 -08:00 |
| Ke Bao | a07364ccc5 | Update Triton decode backend interface (#3292) | 2025-02-04 23:26:04 +08:00 |
| HAI | 2c1a695ff1 | ROCm: sgl-kernel enablement starting with sgl_moe_align_block (#3287) | 2025-02-04 21:44:44 +08:00 |
| Yineng Zhang | d39899e85c | upgrade flashinfer v0.2.0.post2 (#3288) (Co-authored-by: pankajroark <pankajroark@users.noreply.github.com>) | 2025-02-04 21:41:40 +08:00 |
| kushanam | d54cee1441 | adding Triton configs for DeepSeekV3 on Blackwell (#3272) | 2025-02-04 04:12:09 +08:00 |
| Yineng Zhang | 013021b6a1 | refactor EAGLE 2 (#3269) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>, merrymercy <lianminzheng@gmail.com>, Ying1123 <sqy1415@gmail.com>) | 2025-02-03 20:52:30 +08:00 |
| zifeitong | 28b0a62bb3 | Bug: Fix min_p sampling crash when using flashinfer backend (#3207) (Co-authored-by: zhaochenyang20 <zhaochen20@outlook.com>) | 2025-02-02 15:36:07 -08:00 |
| Wen-Heng (Jack) Chung | d9eb9358cc | Tune paged attention parameters for AMD GPU. (#3255) | 2025-02-01 17:29:45 -08:00 |
| Yineng Zhang | 8db776f049 | support QuickGELU (#3250) | 2025-02-01 19:31:47 +08:00 |
| Yineng Zhang | 4eb4b401cc | update and simplify CustomOp (#3249) | 2025-02-01 18:56:44 +08:00 |
| Ke Bao | 1ebe1d6de5 | Optimize MoE topk with torch compile (#3236) | 2025-02-01 01:36:50 +08:00 |
| Yineng Zhang | 7811bfdaa7 | compatible with flashinfer v0.2 (#3235) | 2025-02-01 01:32:18 +08:00 |
| Ke Bao | c02e313914 | Fix block wise fp8 torch compile (#3232) | 2025-01-31 19:56:02 +08:00 |
| Byron Hsu | 734daedd8f | [fix] Clamp logprob with dtype min to prevent -inf (#3224) | 2025-01-31 17:04:04 +08:00 |
| Mick | 9f635ea50d | [Fix] Address remaining issues of supporting MiniCPMV (#2977) | 2025-01-28 00:22:13 -08:00 |
| Byron Hsu | 988d0a4bfc | [kernel] Use sgl_kernel rope (#3169) (Co-authored-by: zhyncs <me@zhyncs.com>) | 2025-01-28 14:33:11 +08:00 |
| Yineng Zhang | 2f79f58873 | feat: use sgl-kernel 0.0.3 in sglang (#3179) | 2025-01-27 21:39:52 +08:00 |
| Lianmin Zheng | 53cef81587 | Improve weight loading and code style (#3174) | 2025-01-27 03:00:41 -08:00 |
| yigex | 351a72d40b | add dsv3 mi300 triton config for block scale (#3146) | 2025-01-27 17:25:53 +08:00 |
| Lianmin Zheng | 52c03f16b9 | Add activation parameters to fused_moe (#3170) | 2025-01-27 00:23:37 -08:00 |
| Lianmin Zheng | 1dda8c5e4c | Return more infos for computing average acceptance length (#3152) | 2025-01-26 04:51:54 -08:00 |
| Ke Wen | 862bcff833 | Support loading of larger models with on-the-fly quantization (#3061) | 2025-01-22 21:33:17 -08:00 |
| Lianmin Zheng | 8b84e69f25 | Fix tp token sync for dp attention (#3062) | 2025-01-22 18:51:40 -08:00 |
| Lianmin Zheng | 022614d26e | Add some flags to allow sync token ids across TP ranks (#3060) | 2025-01-22 15:05:51 -08:00 |