Author | Commit | Message | Date
kk | e96973742c | Optimized deepseek-v3/r1 model performance on mxfp4 run (#10008); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: HAI <hixiao@gmail.com>; Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> | 2025-09-04 15:11:22 -07:00
Yuan Luo | ec15c8360e | Optimize Qwen3-moe model by using flashinfer fused allreduce (#9973); Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> | 2025-09-04 20:48:53 +08:00
Yineng Zhang | 1b2ff4fb7f | Revert "Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671)" (#9959) | 2025-09-03 00:50:04 -07:00
kk | 0dfd54d11d | Optimized deepseek-v3/r1 model performance on mxfp4 run (#9671); Co-authored-by: wunhuang <wunhuang@amd.com>; Co-authored-by: wghuang <wghuang@amd.com> | 2025-09-02 22:26:28 -07:00
Lianmin Zheng | fd71b11b1d | move is_sm90_supported/is_sm100_supported to python/sglang/srt/utils.py (#9679) | 2025-08-27 03:34:29 -07:00
strgrb | 88fbc31b50 | Support trtllm_allreduce_fusion in flashinfer for cuda<12.8 (#9339); Co-authored-by: Zhang Kaihong <zhangkaihong.zkh@alibaba-inc.com> | 2025-08-20 16:54:30 -07:00
Xiaoyu Zhang | f96413c444 | Refactor allreduce add rmsnorm pattern (#9278) | 2025-08-20 02:03:08 -07:00
Trevor Morris | eff4eb3fdd | Add fp4 quantize before all-gather for Flashinfer cutlass MoE DP (max throughput) (#7667) | 2025-08-15 22:08:11 -07:00
Cheng Wan | 295895120d | [6/N] MoE Refactor: Cleanup MoE-related configs (#8849) | 2025-08-14 21:14:53 -07:00
Philo | 004f7f1972 | [typo fix] Fix a typo in communicator.py (#9183); Signed-off-by: Philo <lul16@foxmail.com> | 2025-08-14 17:29:38 -07:00
Cheng Wan | b87aacb5c5 | [DP Attention] Refactor: adding some utility functions (#9136) | 2025-08-13 21:08:06 -07:00
Xiaoyu Zhang | 44e86480e8 | fuse allreduce and residual_rmsnorm (#8731) | 2025-08-11 13:50:53 -07:00
Yineng Zhang | 9d834fdcc1 | Revert "feat: update flashinfer ar oneshot params (#8687)" (#9054) | 2025-08-10 23:24:42 -07:00
Cheng Wan | 5018809222 | [DP] fix: engine crash when decode batch is padded (#8995) | 2025-08-09 01:29:29 -07:00
eigen | faa25df1ae | feat: update flashinfer ar oneshot params (#8687) | 2025-08-09 00:51:27 -07:00
Cheng Wan | a47baff12c | [hotfix] use the original implementation in 8785 (#8994) | 2025-08-08 21:47:25 -07:00
Trevor Morris | c0e84297c2 | Use reduce scatter for DP (#8539) | 2025-08-06 16:21:26 -07:00
Trevor Morris | 32f2815451 | Do layernorm before allgather for DP attention (#8631) | 2025-08-03 00:53:08 -07:00
Cheng Wan | 6c88f6c8d9 | [5/N] MoE Refactor: Update MoE parallelism arguments (#8658) | 2025-08-01 01:20:03 -07:00
Cheng Wan | c0fb25e949 | DP Enhancement (#8280) | 2025-07-24 21:36:21 -07:00
Xiaoyu Zhang | 49a5915f53 | [ready b200] fuse allreduce+add_rmsnorm in prepare_attention + mlp module (#7775) | 2025-07-10 15:12:39 -07:00
Xiaoyu Zhang | 2e7ab862e3 | Fix illegal memory in trtllm allreduce fusion (#7864) | 2025-07-08 11:47:17 -07:00
Xiaoyu Zhang | 8e64140e35 | [b200] support trt-llm allreduce fuse rms_norm_add kernel (#7621) | 2025-07-02 19:36:20 -07:00
Cheng Wan | e879d8b7a8 | [Feature] Comprehensive Hybrid Parallelism Support (#6389) | 2025-06-20 14:43:11 -07:00
Cheng Wan | 3c2274fbee | Implement gather before attn (#6378) | 2025-06-15 21:08:56 -07:00
Yineng Zhang | fa6723f08f | Revert "fix communicator for non-dp lm head (#6662)" (#6677) | 2025-05-27 12:22:59 -07:00
Cheng Wan | a3d7f4b673 | fix communicator for non-dp lm head (#6662) | 2025-05-27 02:31:12 -07:00
fzyzcjy | 32cd707002 | Support TP in attention for two batch overlap (#6634) | 2025-05-26 20:28:12 -07:00
fzyzcjy | ebd1ed49d4 | Tiny refactor communicator (#6646) | 2025-05-26 20:24:17 -07:00
fzyzcjy | f456037396 | Utilize static dispatching for communicator (#6577) | 2025-05-24 17:34:35 -07:00
fzyzcjy | 1b19df4b2a | Refactor communication logic of DeepSeek for extensibility and understandability (#6321) | 2025-05-19 20:14:48 -07:00