158 Commits

Author SHA1 Message Date
simveit
bb121214c2 Variance measure for reasoning benchmark (#3677) 2025-02-20 03:49:49 +08:00
Zhanghao Wu
f93e915817 [Docs] Add SkyPilot DeepSeek example (#3706) 2025-02-20 02:10:23 +08:00
Yineng Zhang
fe0673f1cc set NCCL_IB_GID_INDEX=3 for multi node NVIDIA InfiniBand if needed (#3698) 2025-02-19 20:50:22 +08:00
yigex
ddf39d3fce [ROCm] Optimal MOE Tuning for AMD Radeon Graphics (#3567) 2025-02-17 17:54:10 -08:00
Xiaoyu Zhang
c38f3aed24 support multi-gpu block-gemm tuning (#3639) 2025-02-18 00:00:35 +08:00
Shenggui Li
c9565e49e7 [docker] added rdma support (#3619) 2025-02-17 15:36:16 +08:00
simveit
3d4a8f9bc0 Benchmark for reasoning models (#3532)
Co-authored-by: Chayenne <zhaochen20@outlook.com>
2025-02-17 03:07:30 +08:00
Yineng Zhang
ac963be234 update flashinfer-python (#3557) 2025-02-14 09:52:56 +08:00
Yineng Zhang
e0b9a423c8 chore: bump v0.4.3 (#3556) 2025-02-14 09:43:14 +08:00
Yineng Zhang
20de05a753 update README (#3543) 2025-02-13 17:22:11 +08:00
Jhin
bf2a70872e Update DeepSeek V3 Doc (#3541) 2025-02-12 23:15:37 -08:00
Xiaoyu Zhang
693c2600e0 refine deepseek_v3 launch server doc (#3522) 2025-02-12 17:27:07 +08:00
yigex
fdf04a1426 [ROCm] Add ROCm tuning config to block gemm and Re-tune for AMD Radeon Graphics (#3418)
Co-authored-by: Bruce Xue <yigex@xilinx.com>
Co-authored-by: HAI <hixiao@gmail.com>
2025-02-10 23:55:04 -08:00
Xiaoyu Zhang
2f47d710ae refine some typo (#3473) 2025-02-10 23:35:44 +08:00
Yineng Zhang
cddb1cdf8f chore: bump v0.4.2.post4 (#3459) 2025-02-10 14:12:16 +08:00
Yineng Zhang
fad315cb8e fix EAGLE 2 non greedy case (#3407)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-09 07:28:34 +08:00
Yineng Zhang
f90db8bc07 fix typo 2025-02-08 22:16:42 +08:00
Ke Bao
d8ad597048 Add deepseek-v3 a100 serving example (#3404) 2025-02-08 22:13:52 +08:00
GaoYuYang
849f58d617 Update fused_moe's benchmark (#3346) 2025-02-08 21:58:21 +08:00
yiakwy-xpu-ml-framework-team
64480df495 [BUG] fix moe benchmark when bs*seq is small (#3382) 2025-02-08 15:39:44 +08:00
Yineng Zhang
c1f5f99f60 chore: bump v0.4.2.post3 (#3369) 2025-02-07 08:20:03 -08:00
Xiaoyu Zhang
cdae77b03d optimize moe_align_kernel cuda (#3347) 2025-02-07 00:53:46 +08:00
Ke Bao
6792411e7f [Doc] Add optimization option guide for deepseek v3 (#3349) 2025-02-06 23:28:09 +08:00
Yineng Zhang
7348d9627e add AMD guide for DeepSeek-R1 (#3338) 2025-02-06 16:54:40 +08:00
Xiaoyu Zhang
ad3499858e clean moe align block kernel code and add acc test (#3332) 2025-02-06 16:42:36 +08:00
Yineng Zhang
07e58a2dcb update README (#3324) 2025-02-06 07:13:05 +08:00
Baizhou Zhang
70817a7eae [Feature] Define backends and add Triton backend for Lora (#3161)
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
2025-02-03 22:09:13 -08:00
Yineng Zhang
7b020cca2d add tuning block wise fp8 (#3242)
Co-authored-by: HandH1998 <007aabbcc411@gmail.com>
2025-02-01 03:58:18 +08:00
yigex
351a72d40b add dsv3 mi300 triton config for block scale (#3146) 2025-01-27 17:25:53 +08:00
Lianmin Zheng
27acf63bbd Use torch.compile for scaling penalty (#3133) 2025-01-25 18:27:33 -08:00
Xiaoyu Zhang
ac2dc35d0e support lightning_attention_decode in sgl-kernel for MiniMax-Text-01 (#3030) 2025-01-23 15:29:20 +08:00
yiakwy-xpu-ml-framework-team
10bfce71b3 fix moe align blocks benchmark (#3003) 2025-01-20 19:33:29 +08:00
Xiaoyu Zhang
83452dbb4a fix file name spelling mistake and useless variable in minmax-text-01-lightning_attention (#2971) 2025-01-18 18:56:13 -08:00
Xiaoyu Zhang
c2f212d672 optimize MiniMax-Text-01 lightning_attn_decode triton (#2966) 2025-01-18 23:41:01 +08:00
Zhiqiang Xie
13387e6b7a Multi-turn benchmark for hierarchical caching (#2942) 2025-01-17 16:17:24 -08:00
Xiaoyu Zhang
78e974b2a5 [kernel] MiniMax-Text-01 decode lightning_attn with triton (#2920) 2025-01-16 12:51:38 -08:00
Lianmin Zheng
bc6915e3b9 Improve type annotation and styles (#2926) 2025-01-16 12:51:11 -08:00
Xiaoyu Zhang
ab31793661 [kernel] MiniMax-Text-01 prefill lightning_attn with triton (#2911) 2025-01-16 14:18:29 +08:00
Yineng Zhang
80002562a8 docs: update README (#2878) 2025-01-14 12:48:17 +08:00
Lianmin Zheng
46d4431889 Add a new api configure_logging to allow dumping the requests (#2875) 2025-01-13 14:24:00 -08:00
Xiaoyu Zhang
d08c77c434 Sampling penalties memory interface (#2870) 2025-01-13 23:09:00 +08:00
Yineng Zhang
41d7e5b7e6 docs: update link (#2857) 2025-01-13 18:40:48 +08:00
Lianmin Zheng
72c7776355 Fix linear.py and improve weight loading (#2851)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-01-13 01:39:14 -08:00
Ke Bao
85b2e05770 Add int8 quant kernel (#2848) 2025-01-13 13:16:58 +08:00
Yineng Zhang
197cbf9bab docs: update README (#2841) 2025-01-11 23:11:38 +08:00
Yineng Zhang
f624901cdd chore: bump v0.4.1.post5 (#2840) 2025-01-11 23:10:02 +08:00
sleepcoo
4f077c01b8 minor: support specifying local dataset path for gsm8k and hellaswag (#2816) 2025-01-09 22:24:42 +08:00
Lianmin Zheng
bdc1acf6cd Misc fix for min_p_sampling, --cuda-graph-bs (#2761) 2025-01-07 02:52:53 -08:00
Xiaoyu Zhang
380930a959 add benchmark_moe_align_blocks (#2767) 2025-01-07 14:20:50 +08:00
Rodrigo Garcia
a990daff9c Included multi-node DeepSeekv3 example (#2707) 2025-01-02 22:17:03 +08:00