2298 Commits

Author SHA1 Message Date
shangmingc
3ce94f71f9 [PD] Handle P/D failure and reconnect without affecting other instances (#6263)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-26 19:21:01 -07:00
fzyzcjy
ca95556c76 Tiny fix sampler error when prob is not contiguous (#6639) 2025-05-26 19:19:08 -07:00
fzyzcjy
0ca3e56802 Tiny fix missing expert location dispatch info (#6620) 2025-05-26 08:58:31 -07:00
fzyzcjy
5c7aa00976 Fix EPLB algorithm fail to run when using 3 nodes for prefill (#6629) 2025-05-26 08:43:24 -07:00
fzyzcjy
fe386acae6 Automatically configure for EPLB-related args (#6628) 2025-05-26 08:42:49 -07:00
Yi Zhang
14d1075f2c fix qwen3moe eplb prefill bug (#6617) 2025-05-26 02:15:21 -07:00
Lifu Huang
0d503090aa Supported precomputed feature for Kimi VL (#6599) 2025-05-26 01:24:13 -07:00
fzyzcjy
501efc3d36 Tiny fix CI (#6611) 2025-05-25 23:36:34 -07:00
Yi Zhang
f9bab3d591 qwen3moe support two batch overlap (#6598) 2025-05-25 23:08:16 -07:00
Chang Su
16f69b1f65 feat: Improve Mistral and Qwen25 function call parsing (#6597) 2025-05-25 23:07:23 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
Yineng Zhang
7eb9d8e594 chore: upgrade transformers 4.52.3 (#6575)
Co-authored-by: Mick <mickjagger19@icloud.com>
2025-05-25 22:49:58 -07:00
fzyzcjy
6bebef60a7 Support accurate length control for bench serving (#6594) 2025-05-25 22:46:23 -07:00
fzyzcjy
25be63d0b2 Auto handle PD disaggregation in bench_serving (#6587)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-25 22:41:27 -07:00
fzyzcjy
93e53f6e0b Logging and minor fixes to two batch overlap and EPLB (#6595) 2025-05-25 22:36:40 -07:00
fzyzcjy
a191a0e47c Improve performance of two batch overlap in some imbalanced cases (#6593) 2025-05-25 22:36:18 -07:00
fzyzcjy
8c7279c24e Fix profiling will crash the server when using num_steps (#6586) 2025-05-25 22:36:02 -07:00
fzyzcjy
0ca1811715 Support fake perfectly balanced EP dispatch algorithm (#6571) 2025-05-25 22:35:51 -07:00
fzyzcjy
2c3a6fe1de Fix bench_serving does not support changing warmup requests (#6439) 2025-05-25 22:35:36 -07:00
wangxiyu191
8b33d8df90 [PD] Fix prefill_servers in mini_lb (#6527) 2025-05-26 10:38:41 +08:00
fzyzcjy
5ccf8fe1a0 Hint users when weight update timeouts (#6570) 2025-05-25 09:13:17 -07:00
Shenggui Li
3f23d8cdf1 added support for tied weights in qwen pipeline parallelism (#6546) 2025-05-25 00:00:56 -07:00
Lifu Huang
022012aae8 Support Phi-4 Multi-Modal (text + vision only) (#6494) 2025-05-24 21:43:38 -07:00
Chang Su
681e7af32b [OAI] Support non-normalized logprobs in OpenAI server (#5961) 2025-05-24 21:35:55 -07:00
Xinyuan Tong
681fdc264b Refactor vlm embedding routine to use precomputed feature (#6543)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-24 18:39:21 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
fzyzcjy
f456037396 Utilize static dispatching for communicator (#6577) 2025-05-24 17:34:35 -07:00
fzyzcjy
b2388433be Add back DeepSeek non-TBO branches (#6578) 2025-05-24 17:34:00 -07:00
fzyzcjy
a38376fa99 Refactor attention into multiple stages (#6477) 2025-05-24 17:33:25 -07:00
kk
7a5e6ce1cb Fix GPU OOM (#6564)
Co-authored-by: michael <michael.zhang@amd.com>
2025-05-24 16:38:39 -07:00
Yineng Zhang
7e257cd666 chore: bump v0.4.6.post5 (#6566) 2025-05-24 00:48:05 -07:00
fzyzcjy
c4831e2fcf Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP (#6567) 2025-05-24 00:27:10 -07:00
Neo
2e37fa07ba [FIX]remove ServerArgs duplicate code (#6485) 2025-05-23 22:54:41 -07:00
Byron Hsu
2d831c6ef9 [PD] Support structured output (#6560) 2025-05-23 21:49:00 -07:00
Chang Su
ed0c3035cd feat(Tool Calling): Support required and specific function mode (#6550) 2025-05-23 21:00:37 -07:00
Yi Zhang
e6f113569e support eplb for qwen3 (#6533) 2025-05-23 18:31:30 -07:00
Chang Su
7b02c32679 [Bugfix](gemma3_mm): handle flatten_batch constraint for multiple images (#6562) 2025-05-23 18:11:54 -07:00
miter
fefa19fec0 Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524)
Signed-off-by: miter <miterv@outlook.com>
2025-05-23 15:20:21 -07:00
Byron Hsu
8233cc10fd [PD] Support logprob & Add failure test (#6558) 2025-05-23 14:29:20 -07:00
HandH1998
1b2e8f76d9 [2/2] Support Qserve (#6521) 2025-05-23 12:39:18 -07:00
Byron Hsu
d2e0881a34 [PD] support spec decode (#6507)
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
2025-05-23 12:03:05 -07:00
Li Hui
2f42749184 Fix topk inference performance reduce (#6474) 2025-05-23 02:58:31 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-22 20:32:41 -07:00
Byron Hsu
0a4fc73b48 [PD] Fix failure abort (#6535) 2025-05-22 20:32:03 -07:00
Yineng Zhang
a6970a17f3 misc: fix accept_length (#6536) 2025-05-22 14:27:10 -07:00
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
Yineng Zhang
0b07c4a99f chore: upgrade sgl-kernel v0.1.4 (#6532) 2025-05-22 13:28:16 -07:00
lukec
fc0e3b9174 Support qwen3 deepep (#6120) 2025-05-22 11:04:45 -07:00
shangmingc
58f10679e1 Fix missing http status import for PD failure handler (#6520)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-22 15:23:54 +08:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00