Commit history (newest first): hash | date | author | message

0475448ee3 | 2025-08-06 21:37:50 +08:00 | Ke Bao | Optimize triton swa kernel by skipping computation (#8860)
1466c1b896 | 2025-07-28 14:32:58 -07:00 | Yineng Zhang | feat: support glm4 tuning (#8473)
6d6a8bc278 | 2025-07-27 22:54:07 -07:00 | Yuxuan Zhang | GLM-4.5 Model Support (#8224)
    Co-authored-by: Lifu Huang <lifu.hlf@gmail.com>
    Co-authored-by: Binyao Jiang <byjiang1996@gmail.com>
    Co-authored-by: Stefan He <hebiaobuaa@gmail.com>
abda2542d5 | 2025-07-19 17:33:50 -07:00 | Cheng Wan | Fix tuning_fused_moe_triton.py (#8175)
253454de9b | 2025-07-06 20:05:49 -07:00 | Yuan Luo | Integrate triton moe kernel (#7689)
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
0ae1e9a755 | 2025-06-15 21:21:32 -07:00 | Xiaoyu Zhang | refine fused_moe benchmark (#7221)
ef32677444 | 2025-06-11 18:31:13 -07:00 | Quanfeng Li | Fix positional argument (#7093)
3712abfaf9 | 2025-06-08 15:24:24 -07:00 | Xiaoyu Zhang | Fuse routed scaling factor in deepseek (#6970)
fa3592cfeb | 2025-06-08 05:01:34 -07:00 | Xiaoyu Zhang | rebase h20 fused_moe config (#6966)
1fb76ebb93 | 2025-06-07 21:02:49 -07:00 | Yineng Zhang | Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968)
515ef4facb | 2025-06-07 11:06:50 -07:00 | Xiaoyu Zhang | Fuse routed scaling factor in topk_reduce kernel (#6220)
8e3797be1c | 2025-06-04 22:11:24 -07:00 | zyksir | support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277)
81964328b7 | 2025-06-04 15:53:22 -07:00 | Cheng Wan | Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736)
8a5480528d | 2025-06-03 17:48:24 -07:00 | Cheng Wan | [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735)
d9d35def3d | 2025-05-29 11:47:21 -07:00 | JieXin Liang | [test] add ut and bm for get_last_loc (#6746)
6df81e8a39 | 2025-05-29 08:12:22 -07:00 | fzyzcjy | Support tuning DeepEP configs (#6742)
485a023bd8 | 2025-05-29 00:15:11 -07:00 | ChangyiYang | refactor apply_w8a8_block_fp8_linear in fp (#6545)
076103535c | 2025-05-28 00:20:01 -07:00 | Xiaoyu Zhang | fix log_info_on_rank0 error when run benchmark (#6260)
c087ddd686 | 2025-05-28 00:15:23 -07:00 | Yuan Luo | Refine pre_reorder_triton_kernel slightly to improve performance (#6627)
    Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
ef8ec07b2c | 2025-05-12 15:47:01 -07:00 | fzyzcjy | Support tuning moe for llama 4 model (#6042)
e8e18dcdcc | 2025-05-12 12:53:26 -07:00 | Lianmin Zheng | Revert "fix some typos" (#6244)
d738ab52f8 | 2025-05-13 01:42:38 +08:00 | applesaucethebun | fix some typos (#6209)
    Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
6e2da51561 | 2025-05-11 14:32:49 -07:00 | Lifu Huang | Replace time.time() to time.perf_counter() for benchmarking. (#6178)
    Signed-off-by: Lifu Huang <lifu.hlf@gmail.com>
1cc326032d | 2025-04-28 17:04:54 -07:00 | Xiaoyu Zhang | simplify fused_moe config logging (#5801)
a0251a3fd6 | 2025-04-28 11:55:52 -07:00 | Yi Zhang | add fused moe config for qwen3moe fp8/bf16 (#5849)
e132cba2a8 | 2025-04-28 09:13:04 -07:00 | Xiaoyu Zhang | fused moe triton tuning script support qwen3 (#5842)
0045f4b2af | 2025-04-28 08:37:13 -07:00 | XinyuanTong | feat: Add fused moe triton config for qwen3 moe on h100 (#5833)
c555d794f7 | 2025-04-19 23:45:27 -07:00 | Zhaoyi Li | Minor update for ROCm variable style (#5562)
61e7c4dd21 | 2025-04-14 18:39:44 -07:00 | lambert0312 | Add A800 shared experts fused MoE kernel tuning configs for DeepSeek V3/R1 (#5368)
3e4794aad8 | 2025-04-12 10:01:13 -07:00 | Xiaoyu Zhang | refine fused_moe tuning docs (#5294)
924ca7c92c | 2025-04-04 01:59:29 -07:00 | Xiaoyu Zhang | Add DeepSeek V3/R1 shared experts fusion (#4918)
b149b39353 | 2025-03-27 19:45:02 -07:00 | Brayden Zhong | [CI] Remove unused imports with Ruff to pre-commit config, only to benchmarks/docs/examples folder (#3969)
14269198e3 | 2025-03-24 20:56:31 -07:00 | Chunan Zeng | [Benchmark] tilelang vs deepgemm vs w8a8_block_fp8_matmul (#4735)
3980ff1be6 | 2025-03-23 23:35:20 -07:00 | Tongbao Zhang | rename benchmark_deepgemm_fp8_group_gemm.py (#4605)
38f25e87fc | 2025-03-22 00:52:34 -07:00 | penguin_wwy | Correcting default configuration when benchmarking fused_moe (#4665)
1a3fa75f2f | 2025-03-16 00:02:47 -07:00 | JieXin Liang | [Fix] use torch.cat instead of torch.concat to prevent entering the Autograd backends. (#4466)
7130a7cea9 | 2025-03-11 22:48:38 -07:00 | Xiaoyu Zhang | refine sgl_moe_align_block_size_benchmark (#4327)
6a02b32d07 | 2025-03-11 00:49:06 -07:00 | yych0745 | Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4287)
    Co-authored-by: HandH1998 <1335248067@qq.com>
66301e124f | 2025-03-03 03:20:23 -08:00 | Lianmin Zheng | Improve code styles (#4021)
ac2387279e | 2025-03-03 00:12:04 -08:00 | Lianmin Zheng | Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988)
    Co-authored-by: SangBin Cho <rkooo567@gmail.com>
    Co-authored-by: dhou-xai <dhou@x.ai>
    Co-authored-by: Hanming Lu <hanming_lu@berkeley.edu>
0194948fd9 | 2025-03-02 23:29:55 -08:00 | Stefan He | Optimize Triton Kernel of Group GEMM in DeepGEMM Benchmark (#4014)
b7e274f2d9 | 2025-03-02 17:47:21 -08:00 | Stefan He | Add Benchmark for DeepGEMM Group GEMM (#3993)
50f28f65a0 | 2025-03-02 00:34:00 -08:00 | Xiaoyu Zhang | fix typo in deep gemm benchmarking (#3991)
90a55e2566 | 2025-03-01 23:01:58 -08:00 | Xiaoyu Zhang | add deepgemm and sglang fp8 block-wise gemm benchmark (#3893)
18bb216c28 | 2025-02-28 23:57:17 -08:00 | Chayenne | Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" (#3982)
1c96fa86cf | 2025-02-27 19:42:48 -08:00 | yiakwy-xpu-ml-framework-team | [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) (#3613)
b0df5d240b | 2025-02-27 10:59:46 +00:00 | laixin | Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3922)
    Co-authored-by: sleepcoo <sleepcoo@gmail.com>
ddf39d3fce | 2025-02-17 17:54:10 -08:00 | yigex | [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567)
c38f3aed24 | 2025-02-18 00:00:35 +08:00 | Xiaoyu Zhang | support multi-gpu block-gemm tuning (#3639)
fdf04a1426 | 2025-02-10 23:55:04 -08:00 | yigex | [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418)
    Co-authored-by: Bruce Xue <yigex@xilinx.com>
    Co-authored-by: HAI <hixiao@gmail.com>