fzyzcjy | 7e5071c92a | Super tiny enable sole usage of expert distribution metrics and update doc (#6680) | 2025-05-29 08:11:38 -07:00
Baizhou Zhang | f2bd3515fb | Tune memory arguments on B200 (#6718) | 2025-05-29 00:03:22 -07:00
JieXin Liang | 535c838674 | [fix] more mem for draft_extend cuda_graph (#6726) | 2025-05-28 23:25:18 -07:00
Ke Bao | 631950280a | Support EAGLE draft extend CUDA graph (#6606); Co-authored-by: Sehoon Kim <sehoonkim@berkeley.edu> | 2025-05-27 02:35:17 -07:00
Lifu Huang | 477a101cbd | Refactor LoRA handling to support adapter tensors in fused format (#6585) | 2025-05-26 21:51:54 -07:00
fzyzcjy | fe386acae6 | Automatically configure for EPLB-related args (#6628) | 2025-05-26 08:42:49 -07:00
fzyzcjy | 0ca1811715 | Support fake perfectly balanced EP dispatch algorithm (#6571) | 2025-05-25 22:35:51 -07:00
fzyzcjy | 0d47788025 | Support overlapping two batches (#4068) | 2025-05-24 17:39:07 -07:00
Neo | 2e37fa07ba | [FIX]remove ServerArgs duplicate code (#6485) | 2025-05-23 22:54:41 -07:00
miter | fefa19fec0 | Update cmdline --enable-dp-attention help string for Qwen 2/3 Moe models. (#6524); Signed-off-by: miter <miterv@outlook.com> | 2025-05-23 15:20:21 -07:00
HandH1998 | 1b2e8f76d9 | [2/2] Support Qserve (#6521) | 2025-05-23 12:39:18 -07:00
fzyzcjy | 7a80f56513 | Support dynamically rebalancing experts using EPLB (#6469) | 2025-05-21 23:13:21 -07:00
fzyzcjy | 9484eba4ad | Support logging expert balancedness metrics (#6482) | 2025-05-21 23:05:33 -07:00
fzyzcjy | ccfe5c009d | Support redundant experts in expert parallel (#6461) | 2025-05-21 02:05:53 -07:00
HAI | 5c0b38f369 | aiter attention-backend (default enabled on AMD/ROCm) (#6381) | 2025-05-20 22:52:41 -07:00
fzyzcjy | e98afbe042 | Support dispatching logical to physical experts (#6385) | 2025-05-19 22:13:55 -07:00
fzyzcjy | f0653886a5 | Expert distribution recording without overhead for EPLB (#4957) | 2025-05-19 20:07:43 -07:00
Trevor Morris | 7adf245ba2 | [Metrics] Add KV events publishing (#6098) | 2025-05-19 14:19:54 -07:00
fzyzcjy | fd08c04821 | Support custom DeepEP tuning config (#6257) | 2025-05-17 17:09:42 -07:00
Lianmin Zheng | e07a6977e7 | Minor improvements of TokenizerManager / health check (#6327) | 2025-05-15 15:29:25 -07:00
Yi Liu | cfc9f9ab8d | Fix gpu mem check on CPU (#6317); Signed-off-by: yiliu30 <yi4.liu@intel.com> | 2025-05-15 09:37:45 -07:00
Lianmin Zheng | d18c6b3358 | Support incremental streaming of logprob/token_ids between scheduler and detokenizer (#6225); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 14:33:38 -07:00
Lianmin Zheng | e8e18dcdcc | Revert "fix some typos" (#6244) | 2025-05-12 12:53:26 -07:00
Ying Sheng | bad7c26fdc | [PP] Fix init_memory_pool desync & add PP for mixtral (#6223) | 2025-05-12 12:38:09 -07:00
applesaucethebun | d738ab52f8 | fix some typos (#6209); Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-13 01:42:38 +08:00
Lianmin Zheng | fba8eccd7e | Log if cuda graph is used & extend cuda graph capture to cuda-graph-max-bs (#6201); Co-authored-by: SangBin Cho <rkooo567@gmail.com> | 2025-05-12 00:17:33 -07:00
Cheng Wan | 25c83fff6a | Performing Vocabulary Parallelism for LM Head across Attention TP Groups (#5558); Co-authored-by: liusy58 <liusy58@linux.alibaba.com> | 2025-05-11 23:36:29 -07:00
Xiaoyu Zhang | 9f2c9568f0 | [doc] add a note for --n-share-experts-fusion args (#6154) | 2025-05-11 23:18:38 -07:00
Lianmin Zheng | 03227c5fa6 | [CI] Reorganize the 8 gpu tests (#6192) | 2025-05-11 10:55:06 -07:00
applesaucethebun | 2ce8793519 | Add typo checker in pre-commit (#6179); Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> | 2025-05-11 12:55:00 +08:00
Zhu Chen | fa7d7fd9e5 | [Feature] Add FlashAttention3 as a backend for VisionAttention (#5764); Co-authored-by: othame <chenzhu_912@zju.edu.cn>; Co-authored-by: Mick <mickjagger19@icloud.com>; Co-authored-by: Yi Zhang <1109276519@qq.com> | 2025-05-08 10:01:19 -07:00
fzyzcjy | 4c7b42424c | Hint users DeepEP normal mode is incompatible with CUDA Graph (#5014) | 2025-05-07 22:40:59 +08:00
Song Zhang | 00c2c1f08b | [Feature] Support for Ascend NPU backend (#3853); Signed-off-by: Song Zhang <gepin.zs@antgroup.com>; Co-authored-by: 22dimensions <waitingwind@foxmail.com> | 2025-05-06 20:32:53 -07:00
Liangsheng Yin | a3e4e9bf9e | Better PD initialization (#5751) | 2025-05-07 01:12:57 +08:00
shangmingc | 56f6589ecb | [PD] Optimize disaggregation ib device help info (#5781) | 2025-05-05 13:47:37 +08:00
Ke Bao | ebaba85655 | Update ci test and doc for MTP api change (#5952) | 2025-05-01 09:30:27 -07:00
Ying Sheng | 11383cec3c | [PP] Add pipeline parallelism (#5724) | 2025-04-30 18:18:07 -07:00
Chang Su | 2b06484bd1 | feat: support pythonic tool call and index in tool call streaming (#5725) | 2025-04-29 17:30:44 -07:00
JieXin Liang | e4b6133b78 | [fix] relax mem_fraction_static for h200 (#5893); Co-authored-by: alcanerian <alcanerian@gmail.com> | 2025-04-29 17:01:12 -07:00
Ke Bao | dd408ee481 | Auto set draft model path for MTP (#5793) | 2025-04-29 16:25:40 -07:00
Qiaolin Yu | 8c0cfca87d | Feat: support cuda graph for LoRA (#4115); Co-authored-by: Beichen Ma <mabeichen12@gmail.com> | 2025-04-28 23:30:44 -07:00
Lianmin Zheng | 849c83a0c0 | [CI] test chunked prefill more (#5798) | 2025-04-28 10:57:17 -07:00
Trevor Morris | 84810da4ae | Add Cutlass MLA attention backend (#5390) | 2025-04-27 20:58:53 -07:00
Lianmin Zheng | 621e96bf9b | [CI] Fix ci tests (#5769) | 2025-04-27 07:18:10 -07:00
Kebe | 7e944246c3 | Add memory_saver check (#4986); Signed-off-by: Kebe <mail@kebe7jun.com> | 2025-04-26 20:20:05 -07:00
vzed | df2cf583ce | we fix the non existent access of decrypted_config_file (#5685) | 2025-04-26 18:32:37 -07:00
Mick | c998d04b46 | vlm: enable radix cache for qwen-vl models (#5349); Co-authored-by: Xinyuan Tong <justinning0323@outlook.com> | 2025-04-23 20:35:05 -07:00
Yineng Zhang | 04f2abcb34 | fix: gemma 3 not use softcap (#5622) | 2025-04-22 01:16:08 -07:00
Lianmin Zheng | 1343200299 | Clean up mem settings (#5610) | 2025-04-21 17:19:00 -07:00
Liangsheng Yin | e69a219074 | Enhance GPU memory settings (#5604) | 2025-04-21 15:15:00 -07:00