Commit Graph

147 Commits

wangxiaoteng888
fff5df3efe [P/D] Fix node crash caused by force-free requests being released a second time (#5968)
### What this PR does / why we need it?
Force-freeing a request and then releasing it a second time causes the node
to crash. When requests are pulled too quickly, they should not be added to
the delay-free queue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By CI.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-17 18:49:27 +08:00
lidenghui1110
48e10de8c9 [Bugfix] fix cpu offload hang with tp=1 (#5963)
### What this PR does / why we need it?
As reported in issue #5948, when using cpu_offload_connector with TP=1, the
server hangs on startup; we found several bugs here to fix.
1. Some crashes were caused by code changes across vLLM version updates;
some of them can be fixed as in #5948, and this PR fixes all of them.
2. The hang described in #5948: the direct cause is that in
cpu_offload_connector, the RPC client uses the same client id in the
scheduler and the worker when tensor_parallel_size is 1. This PR forces the
client ids to be different, which fixes the hang (see the sketch below).

- Why didn't we find this hang before?
Because our tests used --distributed-executor-backend mp or
tensor_parallel_size > 1; in those old test cases the scheduler and workers
were separate processes, so client ids built from `worker-{os.getpid()}`
differed. But with tensor_parallel_size=1, vLLM defaults to the uniproc
distributed-executor-backend, so the scheduler and worker run in the same
process, the client ids collide, and the server hangs.
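
A minimal sketch of the collision and the fix (names are illustrative, not the connector's actual code): including the role in the id keeps scheduler and worker distinct even inside one process.

```python
import os

# Illustrative sketch only: with the uniproc executor (tensor_parallel_size=1)
# the scheduler and worker share one process, so ids derived from the pid
# alone collide. Adding the role keeps them unique.
def make_client_id(role: str) -> str:
    return f"{role}-{os.getpid()}"

assert make_client_id("scheduler") != make_client_id("worker")
```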

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-17 11:50:13 +08:00
lty
3cb0af0bcf [Refactor]Refactor of vllm_ascend/distributed module (#5910)
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed.
### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 16:26:53 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which have never
been verified:
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, whose failures are caused by
https://github.com/vllm-project/vllm/pull/32148
where vLLM also fails:
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will re-enable those tests when main2main reaches
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR refactors vllm_ascend/distributed, moving all kv_transfer-related
code into a dedicated folder, as has already been done in vLLM.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00
liziyu
e1bed43cff [P/D] Bugfix for P-node force-free request (#5431)
### What this PR does / why we need it?
Fix the bug where the P-node's scheduler dies after it force-frees a
request due to timeout and then receives the completed KV cache pulled
by the D-node again. Fixed by adding a list to record all requests.


- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-14 08:51:31 +08:00
liziyu
eed9e366a7 [Bugfix][P/D] fix layerwise connector for decoder tp size > num kv heads (#5846)
### What this PR does / why we need it?
Fix the layerwise connector for decoder TP size > num KV heads. In this case
the prefiller should push the KV cache to all decoder NPUs.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-13 17:30:33 +08:00
DreamerLeader
db7cf9b0ca [bugfix] A2 Environment Pooling for Memcache Compatibility (#5601)
### What this PR does / why we need it?
When running memcache in the A2 environment, the logic for registering
memory needs to be added. Additionally, there is a link establishment
conflict between memcache and HCCS during initialization in A2, so the
link should be established in advance.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-01-13 09:07:38 +08:00
zzhxxx
db12c1e2c8 [Perf] Supports compute-communication overlap in the forward of sfa_v1 in the Sharded-CP feature. (#5701)
### What this PR does / why we need it?
> Extracted from PR #5513
Based on the Sharded-CP feature PR: #4702;
RFC: https://github.com/vllm-project/vllm/issues/30055

### All-gather KV Cache for Communication Overlap:
- This PR adjusts the calculation order in SFA.
- Splits `index_select` into `indexer_select_pre_process` and
`indexer_select_post_process`.
- Combines `nope`, `rope`, and `index-k` into one tensor to perform an
asynchronous all-gather (see the sketch below).
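
A minimal sketch of the overlap idea, assuming torch.distributed with an initialized process group; the names are illustrative, not the sfa_v1 code itself.

```python
import torch
import torch.distributed as dist

def gather_kv_async(nope: torch.Tensor, rope: torch.Tensor,
                    index_k: torch.Tensor, group=None):
    # Pack the three tensors into one so a single collective covers them all.
    packed = torch.cat([nope, rope, index_k], dim=-1)
    out = [torch.empty_like(packed) for _ in range(dist.get_world_size(group))]
    # async_op=True returns a handle; the caller overlaps other compute
    # before calling handle.wait() on the communication.
    handle = dist.all_gather(out, packed, group=group, async_op=True)
    return out, handle
```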

### benchmark:
input=40k && num_batch_token=20k
- before:
```
Mean TTFT (ms):                          2614.52
Median TTFT (ms):                        3148.03
P50 TTFT (ms):                           3148.03
P90 TTFT (ms):                           3163.48
P99 TTFT (ms):                           3170.20
```

- after:
```
Mean TTFT (ms):                          2529.92
Median TTFT (ms):                        3051.69
P50 TTFT (ms):                           3051.69
P90 TTFT (ms):                           3067.31
P99 TTFT (ms):                           3072.15
```

### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
2026-01-11 09:47:27 +08:00
zxr2333
78b554dda9 [P/D] layerwise connector supports DeepSeek-V3.2 sparse attention && Distribute transfer tasks to redundant kv_head cards (#5722)
### What this PR does / why we need it?
Add new functionality to the mooncake layerwise connector, including:
1. Support sparse attention for DeepSeek-V3.2.
2. Distribute transfer tasks to redundant kv_head cards.

This PR is related to [[RFC]: CDCP Scheduling for Disaggregated
Prefilling with KV Cache Layerwise Push
Support](https://github.com/vllm-project/vllm-ascend/issues/4842)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-01-10 23:04:16 +08:00
wangxiaoteng888
aa987ffe87 [P/D][bugfix]Fix the PCP port mapping error issue (#5706)
### What this PR does / why we need it?
Fix the PCP port mapping issue. In a multi-node PD-disaggregation scenario,
when the PCP feature is enabled, there is an issue with the ZMQ transmission
port: specifically, the IP and port received by the D side do not match. The
cause is an error in the port-mapping update strategy logic.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By CI.
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-10 22:43:52 +08:00
fems14
ff4c1a47b3 [bugfix] Fixing KV Pool Memory Retention and Performance Degradation Issues (#5751)
### What this PR does / why we need it?
1. Fixed memory retention on certain GPUs caused by missing PUT operations.

2. Fixed performance degradation resulting from architectural
incompatibilities in the underlying refactor.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-09 17:46:23 +08:00
lidenghui1110
481138e1d2 [bugfix] adapt to new implemented get_kv_cache_spec in cpuoffload connector (#4311)
### What this PR does / why we need it?
The `get_kv_cache_spec` function in model_runner changed significantly and
caused errors in the CPU-offloading connector, which is copied from
model_runner. This PR adapts to the newly implemented `get_kv_cache_spec`
to fix it.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main:
2918c1b49c

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-08 09:15:09 +08:00
zzhxxx
f7db812ed7 [refactor] Refactor the interface for shard weight and remove the flashcomm2 o_shared interface. (#5181)
### What this PR does / why we need it?
- Delete the environment variable
`VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED`
- Introduce layer_sharding as a configurable feature in
additional_config
- Revise the term "shared weight" to "shard weight."

Configuration: the feature is opt-in via the `additional_config` argument:
```
--additional-config '{
  "layer_sharding": ["o_proj", "q_b_proj"]
}'
```

This is orthogonal to standard tensor parallelism and weight-replication
strategies and is treated as a separate, explicit feature. It can be used
in any scenario, combined with the flashcomm2 feature
(https://github.com/vllm-project/vllm-ascend/pull/3232) or the ShardedCP
feature (#4702), to achieve significant performance gains.



- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2026-01-08 09:05:02 +08:00
zxr2333
20a8cf061b [BugFix][P/D] Fix pre-create link parameter error (#5694)
### What this PR does / why we need it?
Fix a pre-created-link parameter error: `batch_transfer_sync_write`
requires lists, as sketched below.
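
A hypothetical illustration of the fix; the stub below stands in for the real transfer engine, whose exact signature is an assumption here. The point is only that the batch API takes parallel lists, so a single transfer over the pre-created link must be wrapped in one-element lists.

```python
def batch_transfer_sync_write(session_id: str, srcs: list[int],
                              dsts: list[int], lengths: list[int]) -> None:
    # The batch API expects parallel lists, never bare scalars.
    assert len(srcs) == len(dsts) == len(lengths)
    for s, d, n in zip(srcs, dsts, lengths):
        pass  # a real engine would issue the write of n bytes from s to d

src_addr, dst_addr, block_len = 0x1000, 0x2000, 4096
batch_transfer_sync_write("session-0", [src_addr], [dst_addr], [block_len])
```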

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-01-08 08:41:10 +08:00
Mengqing Cao
3f4f2b4ae6 [Refactor] Import global var from vllm instead of overwriting it (#5469)
### What this PR does / why we need it?
Import the global var from vllm instead of overwriting it, so that we
use the correct global variable value.

- vLLM version: v0.13.0
- vLLM main:
5326c89803
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
2026-01-07 18:41:45 +08:00
UnifiedCacheManager
d6bb17f10e [Bugfix]Add register_kv_cache in ucm_connector (#5657)
### What this PR does / why we need it?
To adapt to different KV cache shapes, UCM moved the store initialization
into `register_kv_caches`. This update therefore adds the
`register_kv_caches` interface to UCMConnectorV1.
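
A minimal sketch of the added hook, assuming the vLLM v1 connector method shape `register_kv_caches(kv_caches: dict[str, torch.Tensor])`; this is illustrative, not the actual UCMConnectorV1 code.

```python
import torch

class UCMConnectorSketch:
    """Illustrative only: store setup is deferred to register_kv_caches."""

    def __init__(self) -> None:
        self.store = None  # real store init waits for the cache shapes

    def register_kv_caches(self, kv_caches: dict[str, torch.Tensor]) -> None:
        # The concrete per-layer shapes are only known here, after allocation.
        shapes = {name: tuple(t.shape) for name, t in kv_caches.items()}
        self.store = shapes  # placeholder for the real UCM store setup
```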

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2026-01-07 11:30:33 +08:00
liziyu
330e25ab1d [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540)
### What this PR does / why we need it?
[P/D] Performance enhancement of Layerwise connector in TP asymmetric
scenarios
1. Session fusion: for each layer's transmission tasks, aggregate tasks
with the same destination and merge them into a single task for assignment
(see the sketch after this list).
2. Alltoall aggregation: for TP-asymmetric scenarios, perform all alltoall
operations at once at block granularity across all requests.

[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache
Layerwise Push Support
https://github.com/vllm-project/vllm-ascend/issues/4842
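
A minimal sketch of the session-fusion idea under assumed task tuples (not the connector's real data structures):

```python
from collections import defaultdict

def fuse_tasks(tasks: list[tuple[str, int, int, int]]):
    """Merge per-layer transfer tasks that share a destination into one
    batched task keyed by destination: (src_addrs, dst_addrs, lengths)."""
    fused = defaultdict(lambda: ([], [], []))
    for dst, src_addr, dst_addr, length in tasks:
        srcs, dsts, lens = fused[dst]
        srcs.append(src_addr)
        dsts.append(dst_addr)
        lens.append(length)
    return fused  # one batched transfer per destination, not many small ones
```
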
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-06 20:25:36 +08:00
Shanshan Shen
b94d589769 [MM][Bugfix] Update hf_config to hf_text_config (#5319)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/5205, update
`hf_config` to `hf_text_config`.

Find more details at
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534
and
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-06 16:41:39 +08:00
Chao Lei
473431e7e2 [P/D] Remove mooncake kvpool unused parameter local_hostname (#5574)
### What this PR does / why we need it?
In mooncake kvpool, `local_hostname` is not used. Instead, the local IP
is obtained directly via `get_ip()`. Therefore, remove this parameter to
avoid confusion.

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: LCAIZJ <leichao139636@163.com>
2026-01-05 20:18:59 +08:00
baxingpiaochong
46c2fc6a3c [KVPOOL] Decode saves kvcache (#5168)
### What this PR does / why we need it?

The KV pool now saves the KV cache on decode; currently only MLA is
supported.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Chao Lei <leichao139636@163.com>
2026-01-04 22:22:01 +08:00
lidenghui1110
d462577504 [Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892) (revert in #4981) (#5511)
PR #4892 was reverted in #4981; we recover it now. As for the potential bug
that breaks DeepSeek-V3.2 in the PD case, we will track it down and fix it.

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-04 16:49:33 +08:00
Chao Lei
d193316ded [P/D] Bugfix zmq send/receive failed (#5503)
### What this PR does / why we need it?
Currently, when the MooncakeConnector interacts via ZeroMQ, send/receive
failures cause the following issues:
**Issue 1:** The currently used `zmq.REQ` socket follows a strict
request-reply pattern, requiring an alternating sequence of send →
receive → send → receive... If either a send() or receive() operation
fails, the ZeroMQ socket becomes unusable.
**Solution:** When a send() or receive() exception occurs, close and
delete the ZeroMQ socket, and recreate it upon next use.
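
A minimal sketch of that recovery strategy with pyzmq; the class and names are illustrative, not the MooncakeConnector code.

```python
import zmq

class ReqChannel:
    """REQ socket that is dropped on failure and recreated on next use."""

    def __init__(self, endpoint: str) -> None:
        self.ctx = zmq.Context.instance()
        self.endpoint = endpoint
        self.sock = None

    def _socket(self) -> zmq.Socket:
        if self.sock is None:
            self.sock = self.ctx.socket(zmq.REQ)
            self.sock.connect(self.endpoint)
        return self.sock

    def request(self, payload: bytes) -> bytes:
        try:
            sock = self._socket()
            sock.send(payload)   # REQ requires strict send -> recv alternation
            return sock.recv()
        except zmq.ZMQError:
            # A failed send/recv leaves a REQ socket in a bad state:
            # close and forget it so the next call recreates a fresh one.
            self.sock.close(linger=0)
            self.sock = None
            raise
```
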

**Issue 2:** In `_handle_request`, if `_send_done_recv_signal` raises an
exception, the exception is thrown immediately and subsequent code is
not executed, causing the decode logic to fail to properly release the
request.
**Solution:** Move the call to `_send_done_recv_signal` to the end of
the function.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-31 19:17:08 +08:00
zxr2333
46a1614387 [P/D] Improve the performance of Layerwise Connector (#5303)
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector, mainly via the
following points:
1. Use event synchronization instead of stream synchronization (see the
sketch after this list).
2. Access the metaserver when scheduling.
3. Transfer the KV cache per chunked-prefill segment.
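
A minimal sketch of point 1, shown with torch.cuda streams and events for portability; the connector actually targets NPU streams, so treat the device API as an assumption. It needs a CUDA (or NPU) device to run.

```python
import torch

transfer_stream = torch.cuda.Stream()
done = torch.cuda.Event()

with torch.cuda.stream(transfer_stream):
    # ... enqueue the kv-cache copy kernels on the transfer stream ...
    done.record(transfer_stream)

# Blocks only until the recorded event completes, instead of draining the
# whole stream with transfer_stream.synchronize().
done.synchronize()
```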

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-12-31 15:09:01 +08:00
Jade Zheng
38570cfeb6 [Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072)
By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in the disagg-prefill scenario.
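
A minimal sketch of the conversion, assuming torch_npu's format cast (in CANN, ACL format id 29 is FRACTAL_NZ); the helper name is illustrative and the snippet requires an Ascend device.

```python
import torch_npu  # Ascend PyTorch adapter; only importable on Ascend hosts

ACL_FORMAT_FRACTAL_NZ = 29  # CANN's NZ layout id

def nd_to_nz(kv_block):
    # Cast the received ND-layout KV block to NZ before the decode node uses it.
    return torch_npu.npu_format_cast(kv_block, ACL_FORMAT_FRACTAL_NZ)
```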

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
2025-12-31 14:24:04 +08:00
wangxiaochao6
a539ae753a [feature] Mooncake supports pcp/dcp in common conditions (#5224)
### What this PR does / why we need it?
1. This PR supports complicated pcp/dcp parallelisms on the Prefill and
Decode nodes in Mooncake, such as Prefill: TP8/PCP2/DCP8 and Decode:
TP8/DCP4/DP2, which are not supported today. We establish the link mappings
to transfer the KV cache between prefill and decode nodes. The main logic
is in the `_get_kv_split_metadata` function in mooncake_connector.py.
2. After a decode rank pulls the KV cache from a prefill rank, the decode
rank sends `DONE_RECVING_MSG` to the prefill rank, and the prefill rank
frees its KV cache blocks. If a prefill rank's KV cache is pulled more than
once by several decode ranks, which can certainly happen in complicated
pcp/dcp parallelisms, the prefill rank would free its KV cache blocks
several times, which could cause memory issues. This PR solves the issue by
counting how many times a prefill rank will be pulled from, and freeing the
prefill rank's KV cache blocks only on the last pull (see the sketch after
this list). The related code is in the `run_busy_loop` function in
mooncake_connector.py.
3. If a prefill rank is not pulled from by any decode rank, the first rank
on the decode node sends "DONE_RECVING_MSG" to free its blocks. The related
code is in the `_send_done_signal_to_free_remote_port` function in
mooncake_connector.py.
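
A minimal sketch of the pull counting from point 2; the data structures are illustrative, not the connector's actual state.

```python
from collections import defaultdict

expected_pulls: dict[str, int] = {}        # req_id -> pulls required,
                                           # populated at schedule time
done_pulls = defaultdict(int)              # req_id -> pulls seen so far

def on_done_recving(req_id: str) -> bool:
    """Return True only on the final DONE_RECVING_MSG, when it is safe
    to free the prefill rank's KV cache blocks."""
    done_pulls[req_id] += 1
    return done_pulls[req_id] == expected_pulls[req_id]
```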

### How was this patch tested?
This PR was tested with many pcp/dcp parallelisms, and the accuracy is
correct in all cases.
MLA model:
Prefill node:  TP8/DP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP8, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2
Prefill node:  TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4
Prefill node:  TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4
Prefill node:  TP8/PCP2, Decode node: TP4/DCP2

GQA model:
Prefill node:  TP8/DP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP4/DP4
Prefill node:  TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
- Co-authored-by: Daishixun dsxtsteven@sina.com

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-31 09:53:03 +08:00
fems14
2ef4d1979e [bugfix][main]KV Pool for KV Transfer in PD Disaggregation Scenarios (#5398)
### What this PR does / why we need it?
1. Resolve errors with the KV Pool for KV transfer in PD-disaggregation
scenarios.
2. Update the KV Pool documentation.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
254f6b9867

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-27 09:53:57 +08:00
ApsarasX
3d9954eff0 [Bugfix] Use hf_text_config instead of hf_config to support multimodal PD-Disaggregated (#5205)
### What this PR does / why we need it?
In code files such as `mooncake_connector.py`,
`vllm_config.model_config.hf_config` is used to get the LLM configs.
This approach works for LLMs, but not for multi-modal models. For
multi-modal models, `vllm_config.model_config.hf_text_config` must be
used instead to get the LLM configs.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ApsarasX <apsarax@outlook.com>
2025-12-22 20:21:45 +08:00
shaopeng-666
fd9a47c04d fix vl pd smoke error (#5103)
### What this PR does / why we need it?
Fix the VL model mooncake PD smoke test error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: 李少鹏 <lishaopeng21@huawei.com>
2025-12-18 22:20:45 +08:00
zzhxxx
a74a1196c5 [Feat] Support MLP_TP feature, exclude MOE layer (#4999)
#4257: This PR implements dense-FFN TP for the first three layers of the
DeepSeek model. I have refactored this PR and used very little code to
support this feature.
This PR adds an `is_moe_layer` function to mlp_tp, which supports MLP TP in
models with both MLP and MoE layers, such as DeepSeek or ChatGLM (a
hypothetical helper is sketched below).
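
A hypothetical sketch of such a helper, assuming a DeepSeek-style config with `first_k_dense_replace`; the real function may differ.

```python
def is_moe_layer(config, layer_idx: int) -> bool:
    # DeepSeek-style models keep the first k layers dense; only later
    # layers are MoE, so only dense layers take the MLP-TP path.
    first_k_dense = getattr(config, "first_k_dense_replace", 0)
    return layer_idx >= first_k_dense
```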


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: 子潜 <ziqian@U-DMKXH32D-2015.local>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-18 20:06:53 +08:00
dsxsteven
97537709ae [BugFix] Fix mooncake bug in PCP scenario (#5055)
### What this PR does / why we need it?
The mooncake_connector.py file was importing the wrong arguments, which
could cause errors when using PCP; this issue has been corrected.

### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: daishixun <dsxsteven@sina.com>
2025-12-17 16:32:16 +08:00
weiguihua2
bf97048bce [feat] PD disaggregation supports cross-machine (#5008)
### What this PR does / why we need it?
PD disaggregation now supports cross-machine deployment.
We send the primary- and secondary-node information of node P to node D.
When node D pulls the KV data, it retrieves the corresponding primary or
secondary node information from the mapping.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2025-12-17 09:28:03 +08:00
Chao Lei
9c02fa9867 [bugfix] Fix mooncake kvpool accuracy issue (#4976)
### What this PR does / why we need it?

The current KVPool has an accuracy issue:
https://github.com/vllm-project/vllm-ascend/issues/4412. This PR aims to
fix the precision problem without impacting prefill performance.

Note: Due to a bug in ADXL, calling `current_event.synchronize()` may
occasionally hang. This issue will be fixed in CANN version 8.5.RC1. You
can manually build the master branch of the project at
https://gitcode.com/cann/hixl to resolve this issue before the 8.5.RC1
release.


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: LCAIZJ <leichao139636@163.com>
2025-12-16 11:33:16 +08:00
UnifiedCacheManager
195eac665b [Core][Worker] Add UCMConnector for KV Cache Offloading (#4411)
### What this PR does / why we need it?

This PR introduces the initial integration of **UCM (Unified Cache
Management)** into the vllm-ascend distributed KV-cache system.

Specifically, it adds:
- A new `UCMConnector` implementation under the distributed KV-transfer
framework.
- Support for offloading KV-cache blocks to external UCM backends (DRAM /
NFS / local disk), depending on UCM configuration.
- Integration with vLLM V1 KV connector interface, including metadata
handling and role registration.

**Why it is needed:**
- UCM provides a unified, high-performance storage layer for KV-cache
externalization.
- This enables vllm-ascend to support out-of-core KV-cache workloads,
improve memory efficiency, and leverage hardware-accelerated storage
paths (RDMA / NFS / hybrid modes).
- This connector is a required component to allow future work on
multi-node inference + UCM-based scaling.

---

### Does this PR introduce _any_ user-facing change?

Yes, but limited:

- A new `kv_connector=UCMConnector` option becomes available through the
configuration interface.
- When selected, vllm-ascend workers may initialize UCM and offload
KV-cache blocks externally.
- No default behaviors are changed. Users must explicitly enable this
connector.

This PR does **not** modify:
- existing APIs,
- default execution paths,
- model runner behavior,
- user workflow unless `UCMConnector` is configured.

---

### How was this patch tested?

---

### Prefix Caching Benchmark

We provide preliminary measurements of TTFT (ms) under the vLLM benchmark.
Tests ran on 2 * Ascend 910B3, vllm-ascend 0.11.0, tensor-parallel size 2,
with UCM (local disk) enabled.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2025-12-16 10:53:30 +08:00
fems14
b662d914a4 [bugfix] [main] Fix KV cache query inconsistency across different TP ranks in the KV Pool (#5030)
### What this PR does / why we need it?
In the current KV Pool scenario for attention types like MLA and GQA, where
different TP ranks generate identical KV caches, the system is designed
to store only a single copy. The previous approach let each card query
storage requirements dynamically, but inconsistent query results
across cards led to incorrect storage. To fix this, the new solution
pre-allocates storage responsibilities: each card now simply stores its
pre-assigned blocks, bypassing the inconsistent query step and ensuring
data correctness.
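
A minimal sketch of the pre-assignment idea; the hash-based layout is an assumption, and the point is only that ownership is a pure function every rank computes identically, so no runtime query can disagree.

```python
def owner_rank(block_hash: int, tp_size: int) -> int:
    # Deterministic: every TP rank computes the same owner locally.
    return block_hash % tp_size

def should_store(block_hash: int, my_rank: int, tp_size: int) -> bool:
    return owner_rank(block_hash, tp_size) == my_rank
```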

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: fems14 <1804143737@qq.com>
2025-12-15 21:56:05 +08:00
baxingpiaochong
95e6400128 [KVPool]Fix PP get bug (#5007)
### What this PR does / why we need it?

When kv caches are evicted from the key-value pool, it's possible that
the kv cache for pp0 is still active, but the kv cache for pp1 has
already been evicted. Therefore, a unified check is needed during the
get operation.
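
A minimal sketch of that unified check, under an assumed key scheme where a block id is suffixed with its PP stage:

```python
def block_available(block_id: str, pp_size: int, stored_keys: set) -> bool:
    # Usable only if every pipeline stage's piece is still present.
    return all(f"{block_id}-pp{stage}" in stored_keys
               for stage in range(pp_size))

# pp1's piece was evicted, so the whole block must be treated as a miss:
assert not block_available("blk0", 2, {"blk0-pp0"})
```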


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: baxingpiaochong <771405853@qq.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 20:27:57 +08:00
zzhxxx
e16444f21f [Bugfix] Fix the bug in initializing the shared_weight communication domain in sfa-cp, and fix the mtp weight load in pp>1 situation (#4913)
### What this PR does / why we need it?
In PR #4188, a small bug was introduced that made sfa-cp unable to find the
global_pp_size parameter during initialization; this PR fixes the issue.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-15 16:21:49 +08:00
AlvisGong
ba28d54f35 [Perf] Enable prefill flashcommon3 (#4065)
### What this PR does / why we need it?
MoE multistream overlap to improve performance.

### How was this patch tested?
--additional-config '{"multistream_overlap_gate": true}'

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: AlvisGong <gwly0401@163.com>
Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
2025-12-14 09:34:13 +08:00
wangxiyuan
5211e991ad Revert "[Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892)" (#4981)
This reverts commit 332b547728.

The reverted change breaks DeepSeek-V3.2 in the PD case.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
2025-12-13 18:58:55 +08:00
lty
0cdf98ac48 [usability] Modify the default value of the protocol to ascend (#4959)
### What this PR does / why we need it?
The recommended configuration in the kv_pool.md document is ascend. This PR
modifies the default value of the protocol to ascend to improve usability.

#### 1. Configure mooncake.json

The environment variable **MOONCAKE_CONFIG_PATH** is set to the full path
of mooncake.json.

```
{
    "local_hostname": "xx.xx.xx.xx",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "alloc_in_same_node": true,
    "master_server_address": "xx.xx.xx.xx:50088",
    "global_segment_size": "1GB" (1024MB/1048576KB/1073741824B/1073741824)
}
```

**local_hostname**: Configured as the IP address of the current master
node.
**metadata_server**: Configured as **P2PHANDSHAKE**.  
**protocol**: Configured as `ascend` to use Mooncake's HCCL
communication.
**device_name**: ""  
**alloc_in_same_node**: Indicator for preferring local buffer allocation
strategy.
**master_server_address**: Configured with the IP and port of the master
service.
**global_segment_size**: Expands the kvcache size registered by the PD
node to the master.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Launched the pooled vLLM service without setting a protocol in the Mooncake
config and verified that the pooling function works.

Signed-off-by: lty <linhebiwen@gmail.com>
2025-12-12 16:56:18 +08:00
lidenghui1110
d65fb194d9 [Feat] Add custom Embedding tensor model parallel (#2616)
Similar to #2309, this PR introduces embedding tensor model parallelism to
reduce memory consumption. It supports both eager mode and graph mode.

This PR also refactors the module tensor-parallel configurations supported
in #2309, #2167, and #2120, merging them all into `finegrained_tp_config` in
`additional_config`, including:
`lmhead_tensor_parallel_size`
`oproj_tensor_parallel_size`
`embedding_tensor_parallel_size`
`mlp_tensor_parallel_size`
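
An illustrative invocation combining these keys (the values are examples only, not recommendations from this PR):

```
--additional-config '{
  "finegrained_tp_config": {
    "lmhead_tensor_parallel_size": 2,
    "oproj_tensor_parallel_size": 2,
    "embedding_tensor_parallel_size": 2,
    "mlp_tensor_parallel_size": 2
  }
}'
```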

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: zzhx1 <zzh_201018@outlook.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zzhxx <zhangzihang23@mails.ucas.ac.cn>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-12 14:41:20 +08:00
Slightwind
b8a317caac [main][Bugfix] Remove the ZMQ communication setup on the D node (#4926)
In the PD separation scenario, the D node does not need to perform get
operations, and therefore does not need to create ZeroMQ (ZMQ)
communication.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: SlightwindSec <slightwindsec@gmail.com>
2025-12-12 14:37:26 +08:00
Jade Zheng
3fade30275 [Bugfix] Prevent engine hang during KVCacheSendingThread startup (#4754)
Previously, if the KVCacheSendingThread couldn't create a socket because
of port conflicts or other problems, the main thread would wait
endlessly for the ready_event signal, causing the entire engine
initialization to freeze. This update fixes the issue by adding timeouts
for thread startup and handling unexpected thread exits, so the
initialization process no longer gets stuck indefinitely.
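
A minimal sketch of the startup guard described above; the sending loop below is a hypothetical stand-in for the real thread body.

```python
import threading

def sending_loop(ready: threading.Event) -> None:
    # Real code would create the ZMQ socket here; on failure, `ready`
    # stays unset and the thread exits.
    ready.set()

ready = threading.Event()
t = threading.Thread(target=sending_loop, args=(ready,), daemon=True)
t.start()
if not ready.wait(timeout=30.0):
    # Covers both a slow start and a thread that already died unexpectedly.
    raise RuntimeError(f"KV cache sending thread not ready (alive={t.is_alive()})")
```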

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-11 18:39:25 +08:00
lidenghui1110
332b547728 [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892)
### What this PR does / why we need it?
The current mooncake connector has the following problems when PP and MTP
are enabled:
1. MTP-layer KV caches are not transferred, which may decrease the accept
ratio. This PR adds the MTP layer indices for the last PP stage after
calculating end_layer in transfer_kv_cache.
2. With MTP enabled, the default PP layer division may cause imbalance
between stages; we use the `VLLM_PP_LAYER_PARTITION` environment variable
to balance it by hand, but in mooncake-connector KV transfer the decode
node doesn't know the prefill node's partition. This PR adds a
`pp_layer_partition` config in `kv_connector_extra_config` so the decode
node can acquire the prefill node's partition information.

### Does this PR introduce _any_ user-facing change?
When prefill uses the `VLLM_PP_LAYER_PARTITION` environment variable, add
`pp_layer_partition` in `kv_connector_extra_config` as below:
```
export VLLM_PP_LAYER_PARTITION=33,28
"kv_connector_extra_config": {
    "use_ascend_direct": true,
    "prefill": {
            "dp_size": 1,
            "tp_size": 8,
            "pp_size": 2,
            "pp_layer_partition": "33,28"
     },
     "decode": {
            "dp_size": 16,
            "tp_size": 1,
            "pp_size": 1
     }
}
```

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2025-12-11 17:23:21 +08:00
zzhxxx
eac72f5f23 [Feat] Flashcomm2 use o_shared linear (#4188)
### What this PR does / why we need it?

It is mentioned in the [flashcomm2 technical
report](https://gitcode.com/ascend-tribe/ascend-inference-cluster/blob/main/FlashComm/FlashComm2%E5%A4%A7%E6%A8%A1%E5%9E%8B%E6%8E%A8%E7%90%86%E4%B8%AD%E4%BB%A5%E5%AD%98%E6%8D%A2%E4%BC%A0%E7%9A%84%E9%80%9A%E4%BF%A1%E4%BC%98%E5%8C%96%E6%8A%80%E6%9C%AF.pdf)
that FC2 will introduce fully redundant storage of the o_proj matrix,
which puts pressure on memory. Therefore, the technical report
proposed a compromise solution using otp2, but it introduces
additional reduce-scatter communication.

We propose a shared-linear feature (#2931) that supports distributing
weights layer by layer to each card, avoiding the need for TP splitting,
which solves the memory issue.

This PR depends on #3232 and #2931

### Flashcomm2 flowchart
<img width="1142" height="878" alt="PixPin_2025-11-14_13-37-39"
src="https://github.com/user-attachments/assets/d45ea8db-d8ef-4d45-8e18-abd4d82ce3e0"
/>

### Does this PR introduce _any_ user-facing change?

Use environment variables
```bash
export VLLM_ASCEND_FLASHCOMM2_PARALLEL_SIZE=1
export VLLM_ASCEND_ENABLE_FLASHCOMM2_OSHARED=1
```


- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zzhx1 <zzh_201018@outlook.com>
Signed-off-by: zzhxx <2783294813@qq.com>
Co-authored-by: zzh02232027 <zzh02232027@antgroup.com>
Co-authored-by: clrs97 <524936896@qq.com>
Co-authored-by: Levi-JQ <yujinqi2@huawei.com>
2025-12-11 12:43:04 +08:00
lidenghui1110
a82b0fa70e mooncake connector support pipeline parallel & fix pp with flashcomm1 (#4054)
### What this PR does / why we need it?
To support pipeline parallelism with PD disaggregation, this PR supports PP
in the mooncake connector and fixes other bugs that appear when enabling PP
with other optimization params, including the following changes:
- mooncake connector supports PP in prefill; we do not support decode PP
currently
- fix bugs when enabling both PP and flashcomm1
- optimize ascend-scheduler to support full batches across multiple
pipeline stages. In the original implementation, the batch sizes of all
pipeline stages summed to max_num_seq, which left the pipeline under-filled;
this optimization lets every stage run with a full batch_size = max_num_seq.
The same change will be contributed to the vLLM scheduler too.

### Does this PR introduce _any_ user-facing change?
add `pp_size` in mooncake connector kv_connector_extra_config
```
"kv_connector_extra_config": {
            "use_ascend_direct": true,
            "prefill": {
                    "dp_size": 1,
                    "tp_size": 4,
                    "pp_size": 4
             },
             "decode": {
                    "dp_size": 16,
                    "tp_size": 1
             }
        }
```

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: chenxiao <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <Jaychou1620@Gmail.com>
Signed-off-by: Kurumi5210 <jaychou1620@gmail.com>
Signed-off-by: 秋刀鱼 <jaychou1620@Gmail.com>
Co-authored-by: chenxiao <Jaychou1620@Gmail.com>
Co-authored-by: zss <zss@qq.com>
Co-authored-by: zss <3265779424@qq.com>
2025-12-10 16:01:43 +08:00
wangxiyuan
835b4c8f1d Drop torchair (#4814)
aclgraph is stable and fast now, so let's drop the torchair graph mode.

TODO: some logic adapting torchair should be cleaned up as well; we'll do
that in a follow-up PR.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2025-12-10 09:20:40 +08:00
wangxiaoteng888
a77045f355 [P/D][main]Offline the llmdatadist connector related parts of the code and files. (#4780)
### What this PR does / why we need it?
As support for the mooncake connector is now available, the llmdatadist
connector is no longer being maintained, so the llmdatadist-related
files need to be retired.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By CI.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2025-12-09 22:36:43 +08:00
lty
dee00d0de3 [Usability] local_buffer_size supports units: GB, MB, KB, B (#4829)
### What this PR does / why we need it?
Improve usability: `local_buffer_size` now supports the units GB, MB, KB,
and B. For example, "2GB":

```
{
    "local_hostname": "XXX.XXX.XXX.XXX",
    "metadata_server": "P2PHANDSHAKE",
    "protocol": "ascend",
    "device_name": "",
    "use_ascend_direct": true,
    "master_server_address": "XXX.XXX.XXX.XXX:50088",
    "global_segment_size": 60000000000,
    "local_buffer_size": "2GB"
}
```
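
A minimal sketch of the unit handling this describes; the helper name is illustrative, not the connector's actual code.

```python
import re

_UNITS = {"GB": 1024**3, "MB": 1024**2, "KB": 1024, "B": 1}

def parse_size(value) -> int:
    """Accept a raw byte count or a string like '2GB' and return bytes."""
    if isinstance(value, int):
        return value
    m = re.fullmatch(r"(\d+)\s*(GB|MB|KB|B)", value.strip(), re.IGNORECASE)
    if not m:
        raise ValueError(f"invalid size: {value!r}")
    return int(m.group(1)) * _UNITS[m.group(2).upper()]

assert parse_size("2GB") == 2 * 1024**3
```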

### Does this PR introduce _any_ user-facing change?
`local_buffer_size` now accepts values with GB, MB, KB, or B units.

### How was this patch tested?
By configuring `local_buffer_size` in the Mooncake config with GB, MB, KB,
and B values.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

Signed-off-by: lty <linxianchong1@huawei.com>
2025-12-09 17:52:24 +08:00
baxingpiaochong
dda027e680 [KVPool] Support PP (#4761)
### What this PR does / why we need it?
Support PP for the KV pool.

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2025-12-09 16:15:26 +08:00