# Commit Graph

325 commits
| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| fzyzcjy | `7e5071c92a` | Super tiny enable sole usage of expert distribution metrics and update doc (#6680) | 2025-05-29 08:11:38 -07:00 |
| Baizhou Zhang | `f2bd3515fb` | Tune memory arguments on B200 (#6718) | 2025-05-29 00:03:22 -07:00 |
| JieXin Liang | `535c838674` | [fix] more mem for draft_extend cuda_graph (#6726) | 2025-05-28 23:25:18 -07:00 |
| Ke Bao | `631950280a` | Support EAGLE draft extend CUDA graph (#6606)<br>Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu> | 2025-05-27 02:35:17 -07:00 |
| Lifu Huang | `477a101cbd` | Refactor LoRA handling to support adapter tensors in fused format (#6585) | 2025-05-26 21:51:54 -07:00 |
| fzyzcjy | `fe386acae6` | Automatically configure for EPLB-related args (#6628) | 2025-05-26 08:42:49 -07:00 |
| fzyzcjy | `0ca1811715` | Support fake perfectly balanced EP dispatch algorithm (#6571) | 2025-05-25 22:35:51 -07:00 |
| fzyzcjy | `0d47788025` | Support overlapping two batches (#4068) | 2025-05-24 17:39:07 -07:00 |
| Neo | `2e37fa07ba` | [FIX]remove ServerArgs duplicate code (#6485) | 2025-05-23 22:54:41 -07:00 |
| miter | `fefa19fec0` | Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524)<br>Signed-off-by: miter <miterv@outlook.com> | 2025-05-23 15:20:21 -07:00 |
| HandH1998 | `1b2e8f76d9` | [2/2] Support Qserve (#6521) | 2025-05-23 12:39:18 -07:00 |
| fzyzcjy | `7a80f56513` | Support dynamically rebalancing experts using EPLB (#6469) | 2025-05-21 23:13:21 -07:00 |
| fzyzcjy | `9484eba4ad` | Support logging expert balancedness metrics (#6482) | 2025-05-21 23:05:33 -07:00 |
| fzyzcjy | `ccfe5c009d` | Support redundant experts in expert parallel (#6461) | 2025-05-21 02:05:53 -07:00 |
| HAI | `5c0b38f369` | aiter attention-backend (default enabled on AMD/ROCm) (#6381) | 2025-05-20 22:52:41 -07:00 |
| fzyzcjy | `e98afbe042` | Support dispatching logical to physical experts (#6385) | 2025-05-19 22:13:55 -07:00 |
| fzyzcjy | `f0653886a5` | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00 |
| Trevor Morris | `7adf245ba2` | [Metrics] Add KV events publishing (#6098) | 2025-05-19 14:19:54 -07:00 |
| fzyzcjy | `fd08c04821` | Support custom DeepEP tuning config (#6257) | 2025-05-17 17:09:42 -07:00 |
| Lianmin Zheng | `e07a6977e7` | Minor improvements of TokenizerManager / health check (#6327) | 2025-05-15 15:29:25 -07:00 |
| Yi Liu | `cfc9f9ab8d` | Fix gpu mem check on CPU (#6317)<br>Signed-off-by: yiliu30 <yi4.liu@intel.com> | 2025-05-15 09:37:45 -07:00 |
| Lianmin Zheng | `d18c6b3358` | Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 14:33:38 -07:00 |
| Lianmin Zheng | `e8e18dcdcc` | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00 |
| Ying Sheng | `bad7c26fdc` | [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) | 2025-05-12 12:38:09 -07:00 |
| applesaucethebun | `d738ab52f8` | fix some typos (#6209)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00 |
| Lianmin Zheng | `fba8eccd7e` | Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201)<br>Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 00:17:33 -07:00 |
| Cheng Wan | `25c83fff6a` | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558)<br>Co-authored-by: liusy58 <liusy58@linux.alibaba.com> | 2025-05-11 23:36:29 -07:00 |
| Xiaoyu Zhang | `9f2c9568f0` | [doc] add a note for --n-share-experts-fusion args (#6154) | 2025-05-11 23:18:38 -07:00 |
| Lianmin Zheng | `03227c5fa6` | [CI] Reorganize the 8 gpu tests (#6192) | 2025-05-11 10:55:06 -07:00 |
| applesaucethebun | `2ce8793519` | Add typo checker in pre-commit (#6179)<br>Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-11 12:55:00 +08:00 |
| Zhu Chen | `fa7d7fd9e5` | [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764)<br>Co-authored-by: othame <chenzhu_912@zju.edu.cn><br>Co-authored-by: Mick <mickjagger19@icloud.com><br>Co-authored-by: Yi Zhang <1109276519@qq.com> | 2025-05-08 10:01:19 -07:00 |
| fzyzcjy | `4c7b42424c` | Hint users DeepEP normal mode is incompatible with CUDA Graph (#5014) | 2025-05-07 22:40:59 +08:00 |
| Song Zhang | `00c2c1f08b` | [Feature] Support for Ascend NPU backend (#3853)<br>Signed-off-by: Song Zhang <gepin.zs@antgroup.com><br>Co-authored-by: 22dimensions <waitingwind@foxmail.com> | 2025-05-06 20:32:53 -07:00 |
| Liangsheng Yin | `a3e4e9bf9e` | Better PD initialization (#5751) | 2025-05-07 01:12:57 +08:00 |
| shangmingc | `56f6589ecb` | [PD] Optimize disaggregation ib device help info (#5781) | 2025-05-05 13:47:37 +08:00 |
| Ke Bao | `ebaba85655` | Update ci test and doc for MTP api change (#5952) | 2025-05-01 09:30:27 -07:00 |
| Ying Sheng | `11383cec3c` | [PP] Add pipeline parallelism (#5724) | 2025-04-30 18:18:07 -07:00 |
| Chang Su | `2b06484bd1` | feat: support pythonic tool call and index in tool call streaming (#5725) | 2025-04-29 17:30:44 -07:00 |
| JieXin Liang | `e4b6133b78` | [fix] relax mem_fraction_static for h200 (#5893)<br>Co-authored-by: alcanerian <alcanerian@gmail.com> | 2025-04-29 17:01:12 -07:00 |
| Ke Bao | `dd408ee481` | Auto set draft model path for MTP (#5793) | 2025-04-29 16:25:40 -07:00 |
| Qiaolin Yu | `8c0cfca87d` | Feat: support cuda graph for LoRA (#4115)<br>Co-authored-by: Beichen Ma <mabeichen12@gmail.com> | 2025-04-28 23:30:44 -07:00 |
| Lianmin Zheng | `849c83a0c0` | [CI] test chunked prefill more (#5798) | 2025-04-28 10:57:17 -07:00 |
| Trevor Morris | `84810da4ae` | Add Cutlass MLA attention backend (#5390) | 2025-04-27 20:58:53 -07:00 |
| Lianmin Zheng | `621e96bf9b` | [CI] Fix ci tests (#5769) | 2025-04-27 07:18:10 -07:00 |
| Kebe | `7e944246c3` | Add memory_saver check (#4986)<br>Signed-off-by: Kebe <mail@kebe7jun.com> | 2025-04-26 20:20:05 -07:00 |
| vzed | `df2cf583ce` | we fix the non existent access of decrypted_config_file (#5685) | 2025-04-26 18:32:37 -07:00 |
| Mick | `c998d04b46` | vlm: enable radix cache for qwen-vl models (#5349)<br>Co-authored-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-04-23 20:35:05 -07:00 |
| Yineng Zhang | `04f2abcb34` | fix: gemma 3 not use softcap (#5622) | 2025-04-22 01:16:08 -07:00 |
| Lianmin Zheng | `1343200299` | Clean up mem settings (#5610) | 2025-04-21 17:19:00 -07:00 |
| Liangsheng Yin | `e69a219074` | Enhance GPU memory settings (#5604) | 2025-04-21 15:15:00 -07:00 |