| Author | Commit | Message | Date |
| --- | --- | --- | --- |
| Xiaoyu Zhang | 924ca7c92c | Add DeepSeek V3/R1 shared experts fusion (#4918) | 2025-04-04 01:59:29 -07:00 |
| penguin_wwy | 38f25e87fc | Correcting default configuration when benchmarking fused_moe (#4665) | 2025-03-22 00:52:34 -07:00 |
| Xiaoyu Zhang | 7130a7cea9 | refine sgl_moe_align_block_size_benchmark (#4327) | 2025-03-11 22:48:38 -07:00 |
| yych0745 | 6a02b32d07 | Add A100 tuning configs for DeepSeek R1/V3 channel-wise INT8 (#4287) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2025-03-11 00:49:06 -07:00 |
| Lianmin Zheng | 66301e124f | Improve code styles (#4021) | 2025-03-03 03:20:23 -08:00 |
| Lianmin Zheng | ac2387279e | Support penalty in overlap mode; return logprob with chunked prefill; improve benchmark scripts (#3988) (Co-authored-by: SangBin Cho <rkooo567@gmail.com>, dhou-xai <dhou@x.ai>, Hanming Lu <hanming_lu@berkeley.edu>) | 2025-03-03 00:12:04 -08:00 |
| Chayenne | 18bb216c28 | Revert "[MOE] enable efficient moe_alignment multi-blocks execution (3x~6x)" (#3982) | 2025-02-28 23:57:17 -08:00 |
| yiakwy-xpu-ml-framework-team | 1c96fa86cf | [MOE] enable efficient moe_alignment multi-blocks execution (3x~6x) (#3613) | 2025-02-27 19:42:48 -08:00 |
| laixin | b0df5d240b | Tuning Script for Feature DeepSeek V3/R1 INT8 Quantization (block-wise) (#3922) (Co-authored-by: sleepcoo <sleepcoo@gmail.com>) | 2025-02-27 10:59:46 +00:00 |
| yigex | ddf39d3fce | [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567) | 2025-02-17 17:54:10 -08:00 |
| Xiaoyu Zhang | 2f47d710ae | refine some typo (#3473) | 2025-02-10 23:35:44 +08:00 |
| Yineng Zhang | fad315cb8e | fix EAGLE 2 non greedy case (#3407) (Co-authored-by: Ying Sheng <sqy1415@gmail.com>) | 2025-02-09 07:28:34 +08:00 |
| GaoYuYang | 849f58d617 | Update fused_moe's benchmark (#3346) | 2025-02-08 21:58:21 +08:00 |
| yiakwy-xpu-ml-framework-team | 64480df495 | [BUG] fix moe benchmark when bs*seq is small (#3382) | 2025-02-08 15:39:44 +08:00 |
| Xiaoyu Zhang | cdae77b03d | optimize moe_align_kernel cuda (#3347) | 2025-02-07 00:53:46 +08:00 |
| Xiaoyu Zhang | ad3499858e | clean moe align block kernel code and add acc test (#3332) | 2025-02-06 16:42:36 +08:00 |
| yigex | 351a72d40b | add dsv3 mi300 triton config for block scale (#3146) | 2025-01-27 17:25:53 +08:00 |
| Lianmin Zheng | 27acf63bbd | Use torch.compile for scaling penalty (#3133) | 2025-01-25 18:27:33 -08:00 |
| yiakwy-xpu-ml-framework-team | 10bfce71b3 | fix moe align blocks benchmark (#3003) | 2025-01-20 19:33:29 +08:00 |
| Xiaoyu Zhang | d08c77c434 | Sampling penalties memory interface (#2870) | 2025-01-13 23:09:00 +08:00 |
| Lianmin Zheng | bdc1acf6cd | Misc fix for min_p_sampling, --cuda-graph-bs (#2761) | 2025-01-07 02:52:53 -08:00 |
| Xiaoyu Zhang | 380930a959 | add benchmark_moe_align_blocks (#2767) | 2025-01-07 14:20:50 +08:00 |
| HandH1998 | afa0341e57 | Update Triton configs for block fp8 kernels (#2641) | 2024-12-29 22:53:47 +08:00 |
| Yineng Zhang | 7863e4368a | add configs for block fp8 related kernels (#2628) (Co-authored-by: HandH1998 <1335248067@qq.com>) | 2024-12-28 23:12:04 +08:00 |
| Xiaoyu Zhang | 9a23c48456 | h100 tuning fused_moe_triton for qwen2 moe (#2560) | 2024-12-26 03:13:31 -08:00 |
| Ke Bao | e835a50021 | Reorg moe code (#2563) | 2024-12-24 01:10:22 +08:00 |
| Xiaoyu Zhang | 3844feb9bb | Add a unittest for fused_moe (#2416) | 2024-12-08 22:46:10 -08:00 |
| Lianmin Zheng | 07ec07ad1f | Improve torch compile for fused moe (#2327) | 2024-12-03 01:58:25 -08:00 |
| Lianmin Zheng | 33deca81b5 | Add more fused moe benchmark utilities (#2314) | 2024-12-02 04:26:55 -08:00 |
| Xiaoyu Zhang | 262e370f78 | [benchmark] Add fused_moe_triton benchmark and tuning tools (#2225) (Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>, HAI <hixiao@gmail.com>) | 2024-11-29 13:36:45 -08:00 |