Commit Graph

187 Commits

Author SHA1 Message Date
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
Li Hui
a06912ad8b Fix judgment condition for enabling Deepseek V3/R1 shared expert fusion optimization (#7371) 2025-06-19 21:58:00 -07:00
YanbingJiang
094c116f7d Update python API of activation, topk, norm and rope and remove vllm dependency (#6614) 2025-06-17 22:11:50 -07:00
Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
Co-authored-by: jianan-gu <jianan.gu@intel.com>
Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
Yineng Zhang
4f204db57c fix: resolve b200 dsv3 mtp issue (#7286) 2025-06-17 16:22:46 -07:00
AniZpZ
3eb4a800e8 Fix AWQ Dequant and Weight Loading of deepseek v2 (#6842) 2025-06-17 13:45:10 -07:00
Charles Chen
8c16da334e Fix Deepseek R1 0528 FP4 tensor name mismatch issue during weights loading. (#7164) 2025-06-17 11:26:23 -07:00
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081) 2025-06-17 00:33:28 -07:00
Co-authored-by: austindeng <austindeng@tencent.com>
Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
Co-authored-by: ch-wan <cwan39@gatech.edu>
JieXin Liang
5ca07eed90 [fix] fix DeepGEMM blackwell input quant & ut & fix style and log (#7247) 2025-06-16 11:45:54 -07:00
fzyzcjy
349bb2c92a Fix error when disabling new DeepGEMM (#7198) 2025-06-14 19:24:54 -07:00
JieXin Liang
55561e2553 [fix] fix determine_num_fused_shared_experts (#7180) 2025-06-14 17:41:22 -07:00
JieXin Liang
ed54bf9d19 [fix] fix dsv3 weight loader tqdm and simplify shared experts fusion (#7181) 2025-06-14 11:56:29 -07:00
fzyzcjy
b57d87c297 Fix shared experts fusion + weight requant (#7177) 2025-06-14 02:35:18 -07:00
fzyzcjy
93cec4335f Support new DeepGEMM (#7172) 2025-06-13 23:00:17 -07:00
fzyzcjy
b4c41f7276 Refactor DeepGEMM integration (#7150) 2025-06-13 20:41:03 -07:00
fzyzcjy
5b1afa7814 Re-quantize DeepSeek model weights to support DeepGEMM new input format (#7156) 2025-06-13 15:57:45 -07:00
fzyzcjy
c49c1d9226 Remove 200us slow concat kernel (part 2: srt) (#7020) 2025-06-13 15:19:31 -07:00
pansicheng
2f4ec752bc filter by num_hidden_layers (#7056) 2025-06-13 00:53:09 -07:00
fzyzcjy
f6ebba537a Support both approximate and exact expert distribution collection (#6964) 2025-06-09 20:56:17 -07:00
fzyzcjy
de1350ea20 Minor remove one kernel for DeepSeek (#6977) 2025-06-08 17:41:35 -07:00
Xiaoyu Zhang
3712abfaf9 Fuse routed scaling factor in deepseek (#6970) 2025-06-08 15:24:24 -07:00
Yineng Zhang
1fb76ebb93 Revert "Fuse routed scaling factor in topk_reduce kernel (#6220)" (#6968) 2025-06-07 21:02:49 -07:00
Pavani Majety
c2c4f57f63 [DeepseekR1-FP4] Add Support for nvidia/DeepSeekR1-FP4 model (#6853) 2025-06-07 17:24:35 -07:00
Signed-off-by: Pavani Majety <pmajety@nvidia.com>
Xiaoyu Zhang
515ef4facb Fuse routed scaling factor in topk_reduce kernel (#6220) 2025-06-07 11:06:50 -07:00
JieXin Liang
22fe787852 [sgl-kernel] update deepgemm (#6942) 2025-06-06 23:24:41 -07:00
miter
f8eaaab817 [fix] logical_to_all_physical_map index 256 is out of bounds in EP parallel. (#6767) 2025-06-06 21:32:33 -07:00
Signed-off-by: miter <miterv@outlook.com>
HAI
b819381fec AITER backend extension and workload optimizations (#6838) 2025-06-05 23:00:18 -07:00
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
fzyzcjy
0ea330ca34 Fix wrong weight reference in dynamic EPLB (#6818) 2025-06-02 23:26:04 -07:00
Li Hui
69dd878b51 Fix shared experts fusion error (#6289) 2025-05-30 01:16:11 -07:00
Zilin Zhu
51cdd81f97 [fix][RL] Fix DeepSeekV3ForCausalLM.post_load_weights for multiple update weight (#6265) 2025-05-29 16:28:10 -07:00
fzyzcjy
31589e177e Speed up when having padding tokens two-batch overlap (#6668) 2025-05-28 16:00:58 -07:00
Co-authored-by: Cheng Wan <54331508+ch-wan@users.noreply.github.com>
fzyzcjy
541a985f85 Fuse routed_scaling_factor in DeepSeek (#6710) 2025-05-28 15:53:37 -07:00
HAI
183d9f969c DeepSeek: enable none block-quant FP8 quantizations (#6638) 2025-05-27 09:06:40 -07:00
fzyzcjy
32cd707002 Support TP in attention for two batch overlap (#6634) 2025-05-26 20:28:12 -07:00
fzyzcjy
0ca3e56802 Tiny fix missing expert location dispatch info (#6620) 2025-05-26 08:58:31 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
fzyzcjy
b2388433be Add back DeepSeek non-TBO branches (#6578) 2025-05-24 17:34:00 -07:00
fzyzcjy
a38376fa99 Refactor attention into multiple stages (#6477) 2025-05-24 17:33:25 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
fzyzcjy
d6e1d28c8a Refactor DeepSeek attention dispatching (#6476) 2025-05-21 02:03:39 -07:00
Lianmin Zheng
03886917bd Disable all two stream overlap on amd (#6475) 2025-05-20 19:06:59 -07:00
fzyzcjy
13feffd082 Fix master CI for DeepSeek (#6447) 2025-05-20 00:31:42 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
HAI
6317c5c61f Address performance regression: disable multiple streams on ROCm (#6412) 2025-05-19 21:16:20 -07:00
fzyzcjy
d0443275f0 Refactor DeepSeek logic into atomic operations (#6326) 2025-05-19 21:05:30 -07:00
fzyzcjy
1b19df4b2a Refactor communication logic of DeepSeek for extensibility and understandability (#6321) 2025-05-19 20:14:48 -07:00