Commit Graph

172 Commits

Author SHA1 Message Date
fzyzcjy
c49c1d9226 Remove 200us slow concat kernel (part 2: srt) (#7020) 2025-06-13 15:19:31 -07:00
pansicheng
2f4ec752bc filter by num_hidden_layers (#7056) 2025-06-13 00:53:09 -07:00
fzyzcjy
f6ebba537a Support both approximate and exact expert distribution collection (#6964) 2025-06-09 20:56:17 -07:00
fzyzcjy
de1350ea20 Minor remove one kernel for DeepSeek (#6977) 2025-06-08 17:41:35 -07:00
Xiaoyu Zhang
3712abfaf9 Fuse routed scaling factor in deepseek (#6970) 2025-06-08 15:24:24 -07:00
Yineng Zhang
1fb76ebb93 Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968) 2025-06-07 21:02:49 -07:00
Pavani Majety
c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) 2025-06-07 17:24:35 -07:00
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Xiaoyu Zhang
515ef4facb Fuse routed scaling factor in topk_reduce kernel (#6220) 2025-06-07 11:06:50 -07:00
JieXin Liang
22fe787852 [sgl-kernel] update deepgemm (#6942) 2025-06-06 23:24:41 -07:00
miter
f8eaaab817 [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. (#6767) 2025-06-06 21:32:33 -07:00
Signed-off-by: miter <miterv@outlook.com>
HAI
b819381fec AITER backend extension and workload optimizations (#6838) 2025-06-05 23:00:18 -07:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
fzyzcjy
0ea330ca34 Fix wrong weight reference in dynamic EPLB (#6818) 2025-06-02 23:26:04 -07:00
Li Hui
69dd878b51 Fix shared experts fusion error (#6289) 2025-05-30 01:16:11 -07:00
Zilin Zhu
51cdd81f97 [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265) 2025-05-29 16:28:10 -07:00
fzyzcjy
31589e177e Speed up when having padding tokens two-batch overlap (#6668) 2025-05-28 16:00:58 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
fzyzcjy
541a985f85 Fuse routed_scaling_factor in DeepSeek (#6710) 2025-05-28 15:53:37 -07:00
HAI
183d9f969c DeepSeek: enable none block-quant FP8 quantizations (#6638) 2025-05-27 09:06:40 -07:00
fzyzcjy
32cd707002 Support TP in attention for two batch overlap (#6634) 2025-05-26 20:28:12 -07:00
fzyzcjy
0ca3e56802 Tiny fix missing expert location dispatch info (#6620) 2025-05-26 08:58:31 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
fzyzcjy
b2388433be Add back DeepSeek non-TBO branches (#6578) 2025-05-24 17:34:00 -07:00
fzyzcjy
a38376fa99 Refactor attention into multiple stages (#6477) 2025-05-24 17:33:25 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
fzyzcjy
d6e1d28c8a Refactor DeepSeek attention dispatching (#6476) 2025-05-21 02:03:39 -07:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
fzyzcjy
13feffd082 Fix master CI for DeepSeek (#6447) 2025-05-20 00:31:42 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
HAI
6317c5c61f Address performance regression: disable multiple streams on ROCm (#6412) 2025-05-19 21:16:20 -07:00
fzyzcjy
d0443275f0 Refactor DeepSeek logic into atomic operations (#6326) 2025-05-19 21:05:30 -07:00
fzyzcjy
1b19df4b2a Refactor communication logic of DeepSeek for extensibility and understandability (#6321) 2025-05-19 20:14:48 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
fzyzcjy
72bfb0baf0 Refactor DeepSeek MoE layer to unify the two forward branches (#6325) 2025-05-18 15:34:36 -07:00
fzyzcjy
2716830802 Speed up when having padding tokens in DeepEP (#6175) 2025-05-17 16:44:05 -07:00
fzyzcjy
2df9d40aa6 Minor code cleanup refactor for DeepSeek models (#6324) 2025-05-16 19:06:03 -07:00
fzyzcjy
8dc191f237 Fix one wasted kernel in DeepSeek and minor refactor (#6316) 2025-05-16 19:05:33 -07:00
fzyzcjy
f194e14fb7 Reduce MoE memory usage (#6147) 2025-05-15 09:38:28 -07:00
Cheng Wan
b2e95f62b4 Fix two issues related to --moe-dense-tp-size=1 (#5657) 2025-05-12 23:51:39 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Co-authored-by: 颉沆 <xiehang.lsy@alibaba-inc.com>
Cheng Wan
25c83fff6a Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558) 2025-05-11 23:36:29 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
applesaucethebun
2ce8793519 Add typo checker in pre-commit (#6179) 2025-05-11 12:55:00 +08:00
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
JieXin Liang
c178abdabc [fix] fix determine_n_share_experts_fusion (#6118) 2025-05-10 01:19:09 -07:00
xu-yfei
e30c273bc9 opt flashinfer mla cat (#5822) 2025-05-08 23:17:14 -07:00
Co-authored-by: xuyongfei.xyf <xuyongfei.xyf@antgroup.com>
JieXin Liang
5e02330137 [perf] dsv3 bmm fallback to bf16 (#5662) 2025-05-08 11:43:39 -07:00
lukec
acc816d8a2 DeepEP normal support deepgemm-contiguous (#5626) 2025-05-08 01:20:32 -07:00
Co-authored-by: Yingyi Huang <yingyihuang2000@outlook.com>
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
Co-authored-by: Xuting Zhou <xutingz@nvidia.com>
Co-authored-by: ZhengHSI <zhenghsi@qq.com>
Baizhou Zhang
73600673bb Clean logs for DeepSeek-V3 launching (#6079) 2025-05-07 18:54:50 -07:00
JieXin Liang
b70957fcf8 [refactor] slightly tidy fp8 module (#5993) 2025-05-07 17:28:24 -07:00