Commit Graph

2321 Commits

Author SHA1 Message Date
Yuan Luo
c087ddd686 Refine pre_reorder_triton_kernel slightly to improve performance (#6627)
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
2025-05-28 00:15:23 -07:00
Chang Su
41ba767f0c feat: Add warnings for invalid tool_choice and UTs (#6582) 2025-05-27 16:53:19 -07:00
Chang Su
bdb962d755 fix(tool call): Fix tool_index in PythonicDetector and issues with mixed output in non-streaming (#6678) 2025-05-27 16:18:42 -07:00
fzyzcjy
87068b5cc7 Support gathering expert distribution details (#6665) 2025-05-27 15:32:59 -07:00
fzyzcjy
a564e001b5 Fix DeepEP error in Qwen 3 MoE models (#6673) 2025-05-27 15:12:54 -07:00
Trevor Morris
e806f708c9 [PD] Make bootstrap code common between NIXL and Mooncake (#6473) 2025-05-27 12:47:38 -07:00
Yineng Zhang
fa6723f08f Revert "fix communicator for non-dp lm head (#6662)" (#6677) 2025-05-27 12:22:59 -07:00
fzyzcjy
673ff668f7 Speed up expert location update (#6661) 2025-05-27 10:00:09 -07:00
fzyzcjy
447be24228 Fix OOM when updating expert locations (#6660) 2025-05-27 09:59:53 -07:00
HAI
183d9f969c DeepSeek: enable none block-quant FP8 quantizations (#6638) 2025-05-27 09:06:40 -07:00
Ke Bao
631950280a Support EAGLE draft extend CUDA graph (#6606)
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
2025-05-27 02:35:17 -07:00
Cheng Wan
a3d7f4b673 fix communicator for non-dp lm head (#6662) 2025-05-27 02:31:12 -07:00
Yi Zhang
b18416fbf8 Fix qwen3 tbo/dp-lm-head (#6652) 2025-05-27 00:38:27 -07:00
Mick
ce9d690ef4 fix: fix nightly test from updating transformers (#6658) 2025-05-27 00:28:11 -07:00
Baizhou Zhang
bdaefbbfbd Add environment flag for disabling message queue broadcaster (#6403) 2025-05-26 22:32:41 -07:00
Chang Su
ae33584235 [Bugfix]: Fix call for function_call_parser.multi_format_detector in adapter.py (#6650) 2025-05-26 21:57:10 -07:00
Lifu Huang
477a101cbd Refactor LoRA handling to support adapter tensors in fused format (#6585) 2025-05-26 21:51:54 -07:00
fzyzcjy
1a8f5f6836 Super tiny rename environment variable (#6648) 2025-05-26 21:01:16 -07:00
fzyzcjy
32cd707002 Support TP in attention for two batch overlap (#6634) 2025-05-26 20:28:12 -07:00
fzyzcjy
ebd1ed49d4 Tiny refactor communicator (#6646) 2025-05-26 20:24:17 -07:00
Xinyuan Tong
d6864ce6d6 [New Model] Devstral support (#6547)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-26 19:27:48 -07:00
Shi Shuai
755a36614b fix: added "\n" to qwen25 tool parser structural tags (#6631) 2025-05-26 19:25:45 -07:00
Lifu Huang
79a39ac0cc follow-up: move Idefics2 to a shared location to eliminate unexpected dependency. (#6603) 2025-05-26 19:23:59 -07:00
shangmingc
3ce94f71f9 [PD] Handle P/D failure and reconnect without affecting other instances (#6263)
Signed-off-by: Shangming Cai <caishangming@linux.alibaba.com>
2025-05-26 19:21:01 -07:00
fzyzcjy
ca95556c76 Tiny fix sampler error when prob is not contiguous (#6639) 2025-05-26 19:19:08 -07:00
fzyzcjy
0ca3e56802 Tiny fix missing expert location dispatch info (#6620) 2025-05-26 08:58:31 -07:00
fzyzcjy
5c7aa00976 Fix EPLB algorithm fail to run when using 3 nodes for prefill (#6629) 2025-05-26 08:43:24 -07:00
fzyzcjy
fe386acae6 Automatically configure for EPLB-related args (#6628) 2025-05-26 08:42:49 -07:00
Yi Zhang
14d1075f2c fix qwen3moe eplb prefill bug (#6617) 2025-05-26 02:15:21 -07:00
Lifu Huang
0d503090aa Supported precomputed feature for Kimi VL (#6599) 2025-05-26 01:24:13 -07:00
fzyzcjy
501efc3d36 Tiny fix CI (#6611) 2025-05-25 23:36:34 -07:00
Yi Zhang
f9bab3d591 qwen3moe support two batch overlap (#6598) 2025-05-25 23:08:16 -07:00
Chang Su
16f69b1f65 feat: Improve Mistral and Qwen25 function call parsing (#6597) 2025-05-25 23:07:23 -07:00
Yi Zhang
65f091310c refactor qwen moe code, use communicator to support tp+dp (#6581) 2025-05-25 23:01:10 -07:00
Yineng Zhang
7eb9d8e594 chore: upgrade transformers 4.52.3 (#6575)
Co-authored-by: Mick <mickjagger19@icloud.com>
2025-05-25 22:49:58 -07:00
fzyzcjy
6bebef60a7 Support accurate length control for bench serving (#6594) 2025-05-25 22:46:23 -07:00
fzyzcjy
25be63d0b2 Auto handle PD disaggregation in bench_serving (#6587)
Co-authored-by: yizhang2077 <1109276519@qq.com>
2025-05-25 22:41:27 -07:00
fzyzcjy
93e53f6e0b Logging and minor fixes to two batch overlap and EPLB (#6595) 2025-05-25 22:36:40 -07:00
fzyzcjy
a191a0e47c Improve performance of two batch overlap in some imbalanced cases (#6593) 2025-05-25 22:36:18 -07:00
fzyzcjy
8c7279c24e Fix profiling will crash the server when using num_steps (#6586) 2025-05-25 22:36:02 -07:00
fzyzcjy
0ca1811715 Support fake perfectly balanced EP dispatch algorithm (#6571) 2025-05-25 22:35:51 -07:00
fzyzcjy
2c3a6fe1de Fix bench_serving does not support changing warmup requests (#6439) 2025-05-25 22:35:36 -07:00
wangxiyu191
8b33d8df90 [PD] Fix prefill_servers in mini_lb (#6527) 2025-05-26 10:38:41 +08:00
fzyzcjy
5ccf8fe1a0 Hint users when weight update timeouts (#6570) 2025-05-25 09:13:17 -07:00
Shenggui Li
3f23d8cdf1 added support for tied weights in qwen pipeline parallelism (#6546) 2025-05-25 00:00:56 -07:00
Lifu Huang
022012aae8 Support Phi-4 Multi-Modal (text + vision only) (#6494) 2025-05-24 21:43:38 -07:00
Chang Su
681e7af32b [OAI] Support non-normalized logprobs in OpenAI server (#5961) 2025-05-24 21:35:55 -07:00
Xinyuan Tong
681fdc264b Refactor vlm embedding routine to use precomputed feature (#6543)
Signed-off-by: Xinyuan Tong <justinning0323@outlook.com>
2025-05-24 18:39:21 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
fzyzcjy
f456037396 Utilize static dispatching for communicator (#6577) 2025-05-24 17:34:35 -07:00