Commit Graph

31 Commits

Author SHA1 Message Date

kk
e96973742c Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: HAI <hixiao@gmail.com>
Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com>
2025-09-04 15:11:22 -07:00

Yuan Luo
ec15c8360e Optimize Qwen3-moe model by using flashinfer fused allreduce (#9973)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-09-04 20:48:53 +08:00

Yineng Zhang
1b2ff4fb7f Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959)
2025-09-03 00:50:04 -07:00

kk
0dfd54d11d Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)
Co-authored-by: wunhuang <wunhuang@amd.com>
Co-authored-by: wghuang <wghuang@amd.com>
2025-09-02 22:26:28 -07:00

Lianmin Zheng
fd71b11b1d move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679)
2025-08-27 03:34:29 -07:00

strgrb
88fbc31b50 Support trtllm_allreduce_fusion in flashinfer for cuda<12.8 (#9339)
Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com>
2025-08-20 16:54:30 -07:00

Xiaoyu Zhang
f96413c444 Refactor allreduce add rmsnorm pattern (#9278)
2025-08-20 02:03:08 -07:00

Trevor Morris
eff4eb3fdd Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667)
2025-08-15 22:08:11 -07:00

Cheng Wan
295895120d [6/N] MoE Refactor: Cleanup MoE-related configs (#8849)
2025-08-14 21:14:53 -07:00

Philo
004f7f1972 [typo fix] Fix a typo in communicator.py (#9183)
Signed-off-by: Philo <lul16@foxmail.com>
2025-08-14 17:29:38 -07:00

Cheng Wan
b87aacb5c5 [DP Attention] Refactor: adding some utility functions (#9136)
2025-08-13 21:08:06 -07:00

Xiaoyu Zhang
44e86480e8 fuse allreduce and residual_rmsnorm (#8731)
2025-08-11 13:50:53 -07:00

Yineng Zhang
9d834fdcc1 Revert "feat: update flashinfer ar oneshot params (#8687)" (#9054)
2025-08-10 23:24:42 -07:00

Cheng Wan
5018809222 [DP] fix: engine crash when decode batch is padded (#8995)
2025-08-09 01:29:29 -07:00

eigen
faa25df1ae feat: update flashinfer ar oneshot params (#8687)
2025-08-09 00:51:27 -07:00

Cheng Wan
a47baff12c [hotfix] use the original implementation in 8785 (#8994)
2025-08-08 21:47:25 -07:00

Trevor Morris
c0e84297c2 Use reduce scatter for DP (#8539)
2025-08-06 16:21:26 -07:00

Trevor Morris
32f2815451 Do layernorm before allgather for DP attention (#8631)
2025-08-03 00:53:08 -07:00

Cheng Wan
6c88f6c8d9 [5/N] MoE Refactor: Update MoE parallelism arguments (#8658)
2025-08-01 01:20:03 -07:00

Cheng Wan
c0fb25e949 DP Enhancement (#8280)
2025-07-24 21:36:21 -07:00

Xiaoyu Zhang
49a5915f53 [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (#7775)
2025-07-10 15:12:39 -07:00

Xiaoyu Zhang
2e7ab862e3 Fix illegal memory in trtllm allreduce fusion (#7864)
2025-07-08 11:47:17 -07:00

Xiaoyu Zhang
8e64140e35 [b200] support trt-llm allreduce fuse rms_norm_add kernel (#7621)
2025-07-02 19:36:20 -07:00

Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389)
2025-06-20 14:43:11 -07:00

Cheng Wan
3c2274fbee Implement gather before attn (#6378)
2025-06-15 21:08:56 -07:00

Yineng Zhang
fa6723f08f Revert "fix communicator for non-dp lm head (#6662)" (#6677)
2025-05-27 12:22:59 -07:00

Cheng Wan
a3d7f4b673 fix communicator for non-dp lm head (#6662)
2025-05-27 02:31:12 -07:00

fzyzcjy
32cd707002 Support TP in attention for two batch overlap (#6634)
2025-05-26 20:28:12 -07:00

fzyzcjy
ebd1ed49d4 Tiny refactor communicator (#6646)
2025-05-26 20:24:17 -07:00

fzyzcjy
f456037396 Utilize static dispatching for communicator (#6577)
2025-05-24 17:34:35 -07:00

fzyzcjy
1b19df4b2a Refactor communication logic of DeepSeek for extensibility and understandability (#6321)
2025-05-19 20:14:48 -07:00
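Several commits in this history (44e86480e8, 49a5915f53, f96413c444, 8e64140e35) fuse the allreduce + residual-add + RMSNorm sequence that follows each tensor-parallel layer into a single kernel. As a rough illustration only, the unfused reference pattern looks like the sketch below; the NumPy sum stands in for the tensor-parallel all-reduce, and all names here are illustrative, not sglang's actual API.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale x by the reciprocal root-mean-square of its last axis.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

rng = np.random.default_rng(0)

# Simulated partial layer outputs from 4 tensor-parallel ranks.
partials = [rng.random((2, 8)).astype(np.float32) for _ in range(4)]
residual = rng.random((2, 8)).astype(np.float32)
weight = np.ones(8, dtype=np.float32)

# Unfused pattern the commits above collapse into one kernel:
hidden = np.sum(partials, axis=0)        # stands in for dist.all_reduce(hidden)
new_residual = hidden + residual         # residual connection
out = rms_norm(new_residual, weight)     # normalization before the next module
```

Fusing these three steps avoids two extra round trips through GPU memory between the communication kernel and the elementwise ones, which is where the performance win in these commits comes from.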