| Author | Commit | Message | Date |
|---|---|---|---|
| Trevor Morris | e806f708c9 | [PD] Make bootstrap code common between NIXL and Mooncake (#6473) | 2025-05-27 12:47:38 -07:00 |
| Yineng Zhang | fa6723f08f | Revert "fix communicator for non-dp lm head (#6662)" (#6677) | 2025-05-27 12:22:59 -07:00 |
| fzyzcjy | 673ff668f7 | Speed up expert location update (#6661) | 2025-05-27 10:00:09 -07:00 |
| fzyzcjy | 447be24228 | Fix OOM when updating expert locations (#6660) | 2025-05-27 09:59:53 -07:00 |
| HAI | 183d9f969c | DeepSeek: enable none block-quant FP8 quantizations (#6638) | 2025-05-27 09:06:40 -07:00 |
| Ke Bao | 631950280a | Support EAGLE draft extend CUDA graph (#6606); Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu> | 2025-05-27 02:35:17 -07:00 |
| Cheng Wan | a3d7f4b673 | fix communicator for non-dp lm head (#6662) | 2025-05-27 02:31:12 -07:00 |
| Yi Zhang | b18416fbf8 | Fix qwen3 tbo/dp-lm-head (#6652) | 2025-05-27 00:38:27 -07:00 |
| Mick | ce9d690ef4 | fix: fix nightly test from updating transformers (#6658) | 2025-05-27 00:28:11 -07:00 |
| Baizhou Zhang | bdaefbbfbd | Add environment flag for disabling message queue broadcaster (#6403) | 2025-05-26 22:32:41 -07:00 |
| Chang Su | ae33584235 | [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py (#6650) | 2025-05-26 21:57:10 -07:00 |
| Lifu Huang | 477a101cbd | Refactor LoRA handling to support adapter tensors in fused format (#6585) | 2025-05-26 21:51:54 -07:00 |
| fzyzcjy | 1a8f5f6836 | Super tiny rename environment variable (#6648) | 2025-05-26 21:01:16 -07:00 |
| fzyzcjy | 32cd707002 | Support TP in attention for two batch overlap (#6634) | 2025-05-26 20:28:12 -07:00 |
| fzyzcjy | ebd1ed49d4 | Tiny refactor communicator (#6646) | 2025-05-26 20:24:17 -07:00 |
| Xinyuan Tong | d6864ce6d6 | [New Model] Devstral support (#6547); Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-05-26 19:27:48 -07:00 |
| Shi Shuai | 755a36614b | fix: added "\n" to qwen25 tool parser structural tags (#6631) | 2025-05-26 19:25:45 -07:00 |
| Lifu Huang | 79a39ac0cc | follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. (#6603) | 2025-05-26 19:23:59 -07:00 |
| shangmingc | 3ce94f71f9 | [PD] Handle P/D failure and reconnect without affecting other instances (#6263); Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com> | 2025-05-26 19:21:01 -07:00 |
| fzyzcjy | ca95556c76 | Tiny fix sampler error when prob is not contiguous (#6639) | 2025-05-26 19:19:08 -07:00 |
| fzyzcjy | 0ca3e56802 | Tiny fix missing expert location dispatch info (#6620) | 2025-05-26 08:58:31 -07:00 |
| fzyzcjy | 5c7aa00976 | Fix EPLB algorithm fail to run when using 3 nodes for prefill (#6629) | 2025-05-26 08:43:24 -07:00 |
| fzyzcjy | fe386acae6 | Automatically configure for EPLB-related args (#6628) | 2025-05-26 08:42:49 -07:00 |
| Yi Zhang | 14d1075f2c | fix qwen3moe eplb prefill bug (#6617) | 2025-05-26 02:15:21 -07:00 |
| Lifu Huang | 0d503090aa | Supported precomputed feature for Kimi VL (#6599) | 2025-05-26 01:24:13 -07:00 |
| fzyzcjy | 501efc3d36 | Tiny fix CI (#6611) | 2025-05-25 23:36:34 -07:00 |
| Yi Zhang | f9bab3d591 | qwen3moe support two batch overlap (#6598) | 2025-05-25 23:08:16 -07:00 |
| Chang Su | 16f69b1f65 | feat: Improve Mistral and Qwen25 function call parsing (#6597) | 2025-05-25 23:07:23 -07:00 |
| Yi Zhang | 65f091310c | refactor qwen moe code, use communicator to support tp+dp (#6581) | 2025-05-25 23:01:10 -07:00 |
| Yineng Zhang | 7eb9d8e594 | chore: upgrade transformers 4.52.3 (#6575); Co-authored-by: Mick <mickjagger19@icloud.com> | 2025-05-25 22:49:58 -07:00 |
| fzyzcjy | 6bebef60a7 | Support accurate length control for bench serving (#6594) | 2025-05-25 22:46:23 -07:00 |
| fzyzcjy | 25be63d0b2 | Auto handle PD disaggregation in bench_serving (#6587); Co-authored-by: yizhang2077 <1109276519@qq.com> | 2025-05-25 22:41:27 -07:00 |
| fzyzcjy | 93e53f6e0b | Logging and minor fixes to two batch overlap and EPLB (#6595) | 2025-05-25 22:36:40 -07:00 |
| fzyzcjy | a191a0e47c | Improve performance of two batch overlap in some imbalanced cases (#6593) | 2025-05-25 22:36:18 -07:00 |
| fzyzcjy | 8c7279c24e | Fix profiling will crash the server when using num_steps (#6586) | 2025-05-25 22:36:02 -07:00 |
| fzyzcjy | 0ca1811715 | Support fake perfectly balanced EP dispatch algorithm (#6571) | 2025-05-25 22:35:51 -07:00 |
| fzyzcjy | 2c3a6fe1de | Fix bench_serving does not support changing warmup requests (#6439) | 2025-05-25 22:35:36 -07:00 |
| wangxiyu191 | 8b33d8df90 | [PD] Fix prefill_servers in mini_lb (#6527) | 2025-05-26 10:38:41 +08:00 |
| fzyzcjy | 5ccf8fe1a0 | Hint users when weight update timeouts (#6570) | 2025-05-25 09:13:17 -07:00 |
| Shenggui Li | 3f23d8cdf1 | added support for tied weights in qwen pipeline parallelism (#6546) | 2025-05-25 00:00:56 -07:00 |
| Lifu Huang | 022012aae8 | Support Phi-4 Multi-Modal (text + vision only) (#6494) | 2025-05-24 21:43:38 -07:00 |
| Chang Su | 681e7af32b | [OAI] Support non-normalized logprobs in OpenAI server (#5961) | 2025-05-24 21:35:55 -07:00 |
| Xinyuan Tong | 681fdc264b | Refactor vlm embedding routine to use precomputed feature (#6543); Signed-off-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-05-24 18:39:21 -07:00 |
| fzyzcjy | 0d47788025 | Support overlapping two batches (#4068) | 2025-05-24 17:39:07 -07:00 |
| fzyzcjy | f456037396 | Utilize static dispatching for communicator (#6577) | 2025-05-24 17:34:35 -07:00 |
| fzyzcjy | b2388433be | Add back DeepSeek non-TBO branches (#6578) | 2025-05-24 17:34:00 -07:00 |
| fzyzcjy | a38376fa99 | Refactor attention into multiple stages (#6477) | 2025-05-24 17:33:25 -07:00 |
| kk | 7a5e6ce1cb | Fix GPU OOM (#6564); Co-authored-by: michael <michael.zhang@amd.com> | 2025-05-24 16:38:39 -07:00 |
| Yineng Zhang | 7e257cd666 | chore: bump v0.4.6.post5 (#6566) | 2025-05-24 00:48:05 -07:00 |
| fzyzcjy | c4831e2fcf | Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP (#6567) | 2025-05-24 00:27:10 -07:00 |