Commit Graph

393 Commits

Author SHA1 Message Date
Cheng Wan
ac5010e0ba Fix CUDA Graph Check under Deepep with DP FFN (#7451) 2025-06-22 20:35:58 -07:00
Liangsheng Yin
05c9bc8956 [minor] simplify the TokenToKVPoolAllocator (#7414) 2025-06-22 12:37:18 +08:00
Lifu Huang
1998ce4046 Refactor LoRAManager and LoRAMemoryPool state management logic for dynamic LoRA loading support (#7412) 2025-06-21 16:09:19 -07:00
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
Stefan He
3774f07825 Multi-Stage Awake: Support Resume and Pause KV Cache and Weights separately (#7099) 2025-06-19 00:56:37 -07:00
YanbingJiang
094c116f7d Update python API of activation, topk, norm and rope and remove vllm dependency (#6614) 2025-06-17 22:11:50 -07:00
    Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
    Co-authored-by: jianan-gu <jianan.gu@intel.com>
    Co-authored-by: sdp <sdp@gnr799219.jf.intel.com>
u4lr451
10d60cd41b feat: mtp support dp-attention (#6081) 2025-06-17 00:33:28 -07:00
    Co-authored-by: austindeng <austindeng@tencent.com>
    Co-authored-by: tianqilin.99 <tianqilin.99@bytedance.com>
    Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
    Co-authored-by: ch-wan <cwan39@gatech.edu>
KavioYu
873ae12cee support custom weight loader for model runner (#7122) 2025-06-16 16:28:15 -07:00
    Co-authored-by: kavioyu <kavioyu@tencent.com>
Lianmin Zheng
c64290dcb5 Use seq_len_fill_value in the cuda graph runners (#7233) 2025-06-16 15:57:07 -07:00
woodx
e30ef368ab Feat/support rerank (#6058) 2025-06-16 10:50:01 -07:00
fzyzcjy
b4c41f7276 Refactor DeepGEMM integration (#7150) 2025-06-13 20:41:03 -07:00
Lianmin Zheng
6b12d6a8d5 Simplify the heuristics for setting --mem-fraction-static (#7054) 2025-06-10 19:01:39 -07:00
kyle-pena-kuzco
b56de8f943 Open AI API hidden states (#6716) 2025-06-10 14:37:29 -07:00
Lianmin Zheng
4a102a2b02 Minor style fix in cuda_graph_runner.py (#7053) 2025-06-10 06:32:41 -07:00
Lianmin Zheng
6406408a70 Clean up server_args.py (#7037) 2025-06-10 05:34:29 -07:00
Byron Hsu
c2b16795b5 Add decode req pool (#6980) 2025-06-09 21:23:36 -07:00
Baizhou Zhang
6716b41786 Update default settings for blackwell (#7023) 2025-06-09 20:37:47 -07:00
Lianmin Zheng
dc0705a504 Simplify prepare_extend_after_decode (#6987) 2025-06-09 16:39:21 -07:00
Baizhou Zhang
971a0dfa32 Extend cuda graph capture bs for B200 (#6937) 2025-06-08 05:13:22 -07:00
fzyzcjy
2fc1299562 Remove unnecessary kernels of num_token_non_padded (#6965) 2025-06-08 05:09:17 -07:00
Lianmin Zheng
20d3ad3b58 Fix CI and triton moe Configs (#6974) 2025-06-08 05:06:46 -07:00
fzyzcjy
f5599ef124 Refactor global_server_args_dict (#6866) 2025-06-07 03:10:35 -07:00
fzyzcjy
c499591ac8 Add canary for EPLB rebalancing (#6895) 2025-06-07 03:09:33 -07:00
Lianmin Zheng
60fdad7cf3 Sync the changes on cuda graph runners (#6932) 2025-06-06 18:23:52 -07:00
HAI
b819381fec AITER backend extension and workload optimizations (#6838) 2025-06-05 23:00:18 -07:00
    Co-authored-by: wunhuang <wunhuang@amd.com>
    Co-authored-by: Hubert Lu <Hubert.Lu@amd.com>
fzyzcjy
0de5e7d40f Support layerwise rebalancing experts (#6851) 2025-06-05 00:05:52 -07:00
zyksir
8e3797be1c support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) 2025-06-04 22:11:24 -07:00
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
fzyzcjy
b6d0ce9f78 Minor add metrics to expert location updater (#6816) 2025-06-02 23:59:11 -07:00
fzyzcjy
e05e29d178 Refactor CustomOp to avoid confusing bugs (#5382) 2025-06-02 10:27:36 -07:00
Lianmin Zheng
20fd53b8f6 Correctly abort the failed grammar requests & Improve the handling of abort (#6803) 2025-06-01 19:00:07 -07:00
Lianmin Zheng
2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) 2025-05-31 15:53:55 -07:00
YanbingJiang
888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408) 2025-05-30 21:37:42 -07:00
    Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
    Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Jianan Ji
22630ca242 Support sliding window in triton backend (#6509) 2025-05-30 01:11:53 -07:00
Baizhou Zhang
f2bd3515fb Tune memory arguments on B200 (#6718) 2025-05-29 00:03:22 -07:00
fzyzcjy
ae6a5b2950 Minor refactor two-batch overlap (#6682) 2025-05-28 15:54:17 -07:00
fzyzcjy
673ff668f7 Speed up expert location update (#6661) 2025-05-27 10:00:09 -07:00
fzyzcjy
447be24228 Fix OOM when updating expert locations (#6660) 2025-05-27 09:59:53 -07:00
Yi Zhang
14d1075f2c fix qwen3moe eplb prefill bug (#6617) 2025-05-26 02:15:21 -07:00
fzyzcjy
5ccf8fe1a0 Hint users when weight update timeouts (#6570) 2025-05-25 09:13:17 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
kk
7a5e6ce1cb Fix GPU OOM (#6564) 2025-05-24 16:38:39 -07:00
    Co-authored-by: michael <michael.zhang@amd.com>
fzyzcjy
c4831e2fcf Fix accuracy is zero when enabling moe-dense-tp-size as in large scale EP (#6567) 2025-05-24 00:27:10 -07:00
Chang Su
4685fbb888 [VLM] Support chunk prefill for VLM (#6355) 2025-05-22 20:32:41 -07:00
    Co-authored-by: yizhang2077 <1109276519@qq.com>
ryang
a6ae3af15e Support XiaomiMiMo inference with mtp (#6059) 2025-05-22 14:14:49 -07:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
fc992a09f9 Support updating expert locations dynamically (#6388) 2025-05-21 21:59:33 -07:00
Baizhou Zhang
d4c038daed [Fix]Fix capture fail bug for DeepSeek (#6275) 2025-05-21 11:11:20 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
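A listing in this Author / SHA1 / Message / Date shape can be regenerated from any local clone with `git log` pretty-format placeholders. The sketch below builds a throwaway repository so it is self-contained; the author name and commit message are illustrative, not taken from this log.

```shell
set -e
# Build a throwaway repo so the formatting command has something to show.
repo=$(mktemp -d)
cd "$repo"
git init -q
git -c user.name='Example Author' -c user.email='author@example.com' \
    commit --allow-empty -q -m 'Example commit message (#0000)'
# %an = author name, %h = abbreviated SHA1, %s = subject, %ad = author date.
# --date=format: matches the "2025-06-22 20:35:58 -0700"-style timestamps above.
git log -n 1 --date=format:'%Y-%m-%d %H:%M:%S %z' \
    --pretty=format:'%an%n%h %s %ad'
```

Co-authored-by trailers, when present, can be shown with `%(trailers:key=Co-authored-by)` in the pretty format.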