Commit Graph

347 Commits

Author SHA1 Message Date
Trevor Morris
5962e70d8d FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327) 2025-06-22 13:38:47 -07:00
Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>
Co-authored-by: alcanderian <alcanderian@gmail.com>
Cheng Wan
e879d8b7a8 [Feature] Comprehensive Hybrid Parallelism Support (#6389) 2025-06-20 14:43:11 -07:00
KavioYu
873ae12cee support custom weight loader for model runner (#7122) 2025-06-16 16:28:15 -07:00
Co-authored-by: kavioyu <kavioyu@tencent.com>
Byron Hsu
7d316991b2 [PD] Update prefill.py (#7190) 2025-06-14 15:59:54 -07:00
Povilas Kanapickas
bd7cfbd2f8 [Fix] Reduce busy polling when scheduler is idle (#6026) 2025-06-12 14:58:22 -07:00
Lianmin Zheng
dbdf76ca98 Clean up docs for server args and sampling parameters (generated by grok) (#7076) 2025-06-10 19:55:42 -07:00
Lianmin Zheng
6b12d6a8d5 Simplify the heuristics for setting --mem-fraction-static (#7054) 2025-06-10 19:01:39 -07:00
kyle-pena-kuzco
b56de8f943 Open AI API hidden states (#6716) 2025-06-10 14:37:29 -07:00
Lianmin Zheng
6406408a70 Clean up server_args.py (#7037) 2025-06-10 05:34:29 -07:00
fzyzcjy
f6ebba537a Support both approximate and exact expert distribution collection (#6964) 2025-06-09 20:56:17 -07:00
Lianmin Zheng
dc0705a504 Simplify prepare_extend_after_decode (#6987) 2025-06-09 16:39:21 -07:00
Swipe4057
e1ce44cdb1 Disabling mixed chunked prefill when eagle is enabled (#6874) 2025-06-07 03:06:58 -07:00
Lianmin Zheng
60fdad7cf3 Sync the changes on cuda graph runners (#6932) 2025-06-06 18:23:52 -07:00
fzyzcjy
0de5e7d40f Support layerwise rebalancing experts (#6851) 2025-06-05 00:05:52 -07:00
zyksir
8e3797be1c support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) 2025-06-04 22:11:24 -07:00
Cheng Wan
81964328b7 Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) 2025-06-04 15:53:22 -07:00
Marc Sun
37f1547587 [FEAT] Add transformers backend support (#5929) 2025-06-03 21:05:29 -07:00
Cheng Wan
8a5480528d [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) 2025-06-03 17:48:24 -07:00
Lianmin Zheng
2d72fc47cf Improve profiler and integrate profiler in bench_one_batch_server (#6787) 2025-05-31 15:53:55 -07:00
YanbingJiang
888cb175a6 Add intel_amx backend for Radix Attention for CPU (#6408) 2025-05-30 21:37:42 -07:00
Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Baizhou Zhang
73def253b5 Fix mem_fraction_static for AMD CI (#6748) 2025-05-29 12:37:30 -07:00
fzyzcjy
3ab7d9b55e Support picking variants of EPLB algorithms (#6728) 2025-05-29 08:12:01 -07:00
fzyzcjy
7e5071c92a Super tiny enable sole usage of expert distribution metrics and update doc (#6680) 2025-05-29 08:11:38 -07:00
Baizhou Zhang
f2bd3515fb Tune memory arguments on B200 (#6718) 2025-05-29 00:03:22 -07:00
JieXin Liang
535c838674 [fix] more mem for draft_extend cuda_graph (#6726) 2025-05-28 23:25:18 -07:00
Ke Bao
631950280a Support EAGLE draft extend CUDA graph (#6606) 2025-05-27 02:35:17 -07:00
Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu>
Lifu Huang
477a101cbd Refactor LoRA handling to support adapter tensors in fused format (#6585) 2025-05-26 21:51:54 -07:00
fzyzcjy
fe386acae6 Automatically configure for EPLB-related args (#6628) 2025-05-26 08:42:49 -07:00
fzyzcjy
0ca1811715 Support fake perfectly balanced EP dispatch algorithm (#6571) 2025-05-25 22:35:51 -07:00
fzyzcjy
0d47788025 Support overlapping two batches (#4068) 2025-05-24 17:39:07 -07:00
Neo
2e37fa07ba [FIX]remove ServerArgs duplicate code (#6485) 2025-05-23 22:54:41 -07:00
miter
fefa19fec0 Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524) 2025-05-23 15:20:21 -07:00
Signed-off-by: miter <miterv@outlook.com>
HandH1998
1b2e8f76d9 [2/2] Support Qserve (#6521) 2025-05-23 12:39:18 -07:00
fzyzcjy
7a80f56513 Support dynamically rebalancing experts using EPLB (#6469) 2025-05-21 23:13:21 -07:00
fzyzcjy
9484eba4ad Support logging expert balancedness metrics (#6482) 2025-05-21 23:05:33 -07:00
fzyzcjy
ccfe5c009d Support redundant experts in expert parallel (#6461) 2025-05-21 02:05:53 -07:00
HAI
5c0b38f369 aiter attention-backend (default enabled on AMD/ROCm) (#6381) 2025-05-20 22:52:41 -07:00
fzyzcjy
e98afbe042 Support dispatching logical to physical experts (#6385) 2025-05-19 22:13:55 -07:00
fzyzcjy
f0653886a5 Expert distribution recording without overhead for EPLB (#4957) 2025-05-19 20:07:43 -07:00
Trevor Morris
7adf245ba2 [Metrics] Add KV events publishing (#6098) 2025-05-19 14:19:54 -07:00
fzyzcjy
fd08c04821 Support custom DeepEP tuning config (#6257) 2025-05-17 17:09:42 -07:00
Lianmin Zheng
e07a6977e7 Minor improvements of TokenizerManager / health check (#6327) 2025-05-15 15:29:25 -07:00
Yi Liu
cfc9f9ab8d Fix gpu mem check on CPU (#6317) 2025-05-15 09:37:45 -07:00
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Lianmin Zheng
d18c6b3358 Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225) 2025-05-12 14:33:38 -07:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Lianmin Zheng
e8e18dcdcc Revert "fix some typos" (#6244) 2025-05-12 12:53:26 -07:00
Ying Sheng
bad7c26fdc [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) 2025-05-12 12:38:09 -07:00
applesaucethebun
d738ab52f8 fix some typos (#6209) 2025-05-13 01:42:38 +08:00
Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca>
Lianmin Zheng
fba8eccd7e Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201) 2025-05-12 00:17:33 -07:00
Co-authored-by: SangBin Cho <rkooo567@gmail.com>
Cheng Wan
25c83fff6a Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558) 2025-05-11 23:36:29 -07:00
Co-authored-by: liusy58 <liusy58@linux.alibaba.com>
Xiaoyu Zhang
9f2c9568f0 [doc] add a note for --n-share-experts-fusion args (#6154) 2025-05-11 23:18:38 -07:00