| Author | Commit | Message | Date |
|---|---|---|---|
| Trevor Morris | 5962e70d8d | FlashInfer NVFP4 MoE with EP & 2-stream shared expert (#7327); Co-authored-by: JieXin Liang <Alcanderian@users.noreply.github.com>; Co-authored-by: alcanderian <alcanderian@gmail.com> | 2025-06-22 13:38:47 -07:00 |
| Cheng Wan | e879d8b7a8 | [Feature] Comprehensive Hybrid Parallelism Support (#6389) | 2025-06-20 14:43:11 -07:00 |
| KavioYu | 873ae12cee | support custom weight loader for model runner (#7122); Co-authored-by: kavioyu <kavioyu@tencent.com> | 2025-06-16 16:28:15 -07:00 |
| Byron Hsu | 7d316991b2 | [PD] Update prefill.py (#7190) | 2025-06-14 15:59:54 -07:00 |
| Povilas Kanapickas | bd7cfbd2f8 | [Fix] Reduce busy polling when scheduler is idle (#6026) | 2025-06-12 14:58:22 -07:00 |
| Lianmin Zheng | dbdf76ca98 | Clean up docs for server args and sampling parameters (generated by grok) (#7076) | 2025-06-10 19:55:42 -07:00 |
| Lianmin Zheng | 6b12d6a8d5 | Simplify the heuristics for setting --mem-fraction-static (#7054) | 2025-06-10 19:01:39 -07:00 |
| kyle-pena-kuzco | b56de8f943 | Open AI API hidden states (#6716) | 2025-06-10 14:37:29 -07:00 |
| Lianmin Zheng | 6406408a70 | Clean up server_args.py (#7037) | 2025-06-10 05:34:29 -07:00 |
| fzyzcjy | f6ebba537a | Support both approximate and exact expert distribution collection (#6964) | 2025-06-09 20:56:17 -07:00 |
| Lianmin Zheng | dc0705a504 | Simplify prepare_extend_after_decode (#6987) | 2025-06-09 16:39:21 -07:00 |
| Swipe4057 | e1ce44cdb1 | Disabling mixed chunked prefill when eagle is enabled (#6874) | 2025-06-07 03:06:58 -07:00 |
| Lianmin Zheng | 60fdad7cf3 | Sync the changes on cuda graph runners (#6932) | 2025-06-06 18:23:52 -07:00 |
| fzyzcjy | 0de5e7d40f | Support layerwise rebalancing experts (#6851) | 2025-06-05 00:05:52 -07:00 |
| zyksir | 8e3797be1c | support 1 shot allreduce in 1-node and 2-node using mscclpp (#6277) | 2025-06-04 22:11:24 -07:00 |
| Cheng Wan | 81964328b7 | Set num_fused_shared_experts as num_shared_experts when shared_experts fusion is not disabled (#6736) | 2025-06-04 15:53:22 -07:00 |
| Marc Sun | 37f1547587 | [FEAT] Add transformers backend support (#5929) | 2025-06-03 21:05:29 -07:00 |
| Cheng Wan | 8a5480528d | [Refactor] Rename n_share_experts_fusion as num_fused_shared_experts (#6735) | 2025-06-03 17:48:24 -07:00 |
| Lianmin Zheng | 2d72fc47cf | Improve profiler and integrate profiler in bench_one_batch_server (#6787) | 2025-05-31 15:53:55 -07:00 |
| YanbingJiang | 888cb175a6 | Add intel_amx backend for Radix Attention for CPU (#6408); Co-authored-by: Chunyuan WU <chunyuan.wu@intel.com>; Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> | 2025-05-30 21:37:42 -07:00 |
| Baizhou Zhang | 73def253b5 | Fix mem_fraction_static for AMD CI (#6748) | 2025-05-29 12:37:30 -07:00 |
| fzyzcjy | 3ab7d9b55e | Support picking variants of EPLB algorithms (#6728) | 2025-05-29 08:12:01 -07:00 |
| fzyzcjy | 7e5071c92a | Super tiny enable sole usage of expert distribution metrics and update doc (#6680) | 2025-05-29 08:11:38 -07:00 |
| Baizhou Zhang | f2bd3515fb | Tune memory arguments on B200 (#6718) | 2025-05-29 00:03:22 -07:00 |
| JieXin Liang | 535c838674 | [fix] more mem for draft_extend cuda_graph (#6726) | 2025-05-28 23:25:18 -07:00 |
| Ke Bao | 631950280a | Support EAGLE draft extend CUDA graph (#6606); Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu> | 2025-05-27 02:35:17 -07:00 |
| Lifu Huang | 477a101cbd | Refactor LoRA handling to support adapter tensors in fused format (#6585) | 2025-05-26 21:51:54 -07:00 |
| fzyzcjy | fe386acae6 | Automatically configure for EPLB-related args (#6628) | 2025-05-26 08:42:49 -07:00 |
| fzyzcjy | 0ca1811715 | Support fake perfectly balanced EP dispatch algorithm (#6571) | 2025-05-25 22:35:51 -07:00 |
| fzyzcjy | 0d47788025 | Support overlapping two batches (#4068) | 2025-05-24 17:39:07 -07:00 |
| Neo | 2e37fa07ba | [FIX]remove ServerArgs duplicate code (#6485) | 2025-05-23 22:54:41 -07:00 |
| miter | fefa19fec0 | Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524); Signed-off-by: miter <miterv@outlook.com> | 2025-05-23 15:20:21 -07:00 |
| HandH1998 | 1b2e8f76d9 | [2/2] Support Qserve (#6521) | 2025-05-23 12:39:18 -07:00 |
| fzyzcjy | 7a80f56513 | Support dynamically rebalancing experts using EPLB (#6469) | 2025-05-21 23:13:21 -07:00 |
| fzyzcjy | 9484eba4ad | Support logging expert balancedness metrics (#6482) | 2025-05-21 23:05:33 -07:00 |
| fzyzcjy | ccfe5c009d | Support redundant experts in expert parallel (#6461) | 2025-05-21 02:05:53 -07:00 |
| HAI | 5c0b38f369 | aiter attention-backend (default enabled on AMD/ROCm) (#6381) | 2025-05-20 22:52:41 -07:00 |
| fzyzcjy | e98afbe042 | Support dispatching logical to physical experts (#6385) | 2025-05-19 22:13:55 -07:00 |
| fzyzcjy | f0653886a5 | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00 |
| Trevor Morris | 7adf245ba2 | [Metrics] Add KV events publishing (#6098) | 2025-05-19 14:19:54 -07:00 |
| fzyzcjy | fd08c04821 | Support custom DeepEP tuning config (#6257) | 2025-05-17 17:09:42 -07:00 |
| Lianmin Zheng | e07a6977e7 | Minor improvements of TokenizerManager / health check (#6327) | 2025-05-15 15:29:25 -07:00 |
| Yi Liu | cfc9f9ab8d | Fix gpu mem check on CPU (#6317); Signed-off-by: yiliu30 <yi4.liu@intel.com> | 2025-05-15 09:37:45 -07:00 |
| Lianmin Zheng | d18c6b3358 | Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 14:33:38 -07:00 |
| Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| Ying Sheng | bad7c26fdc | [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) | 2025-05-12 12:38:09 -07:00 |
| applesaucethebun | d738ab52f8 | fix some typos (#6209); Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00 |
| Lianmin Zheng | fba8eccd7e | Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 00:17:33 -07:00 |
| Cheng Wan | 25c83fff6a | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558); Co-authored-by: liusy58 <liusy58@linux.alibaba.com> | 2025-05-11 23:36:29 -07:00 |
| Xiaoyu Zhang | 9f2c9568f0 | [doc] add a note for --n-share-experts-fusion args (#6154) | 2025-05-11 23:18:38 -07:00 |