32 Commits

Author SHA1 Message Date
zxr2333
ef9964389f [v0.18.0][BugFix][P/D]Fix layerwise connector out of memory during large buffer transfer (#7752)
### What this PR does / why we need it?
Fix layerwise connector out of memory during large buffer transfer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By nightly CI.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
2026-03-31 22:16:53 +08:00
wangxiaoteng888
82e26b5a6e [BugFix][v0.18.0]Adjust request map pop time (#7857)
### What this PR does / why we need it?
Adjust the timing of when entries are popped from the request map. This pull
request optimizes the KV cache transfer mechanism by streamlining how requests
are tracked and cleaned up: by removing unnecessary mapping structures and
adjusting the timing of request removal, the system achieves more efficient
state management during the transfer process.
pick-from: https://github.com/vllm-project/vllm-ascend/pull/7855


### How was this patch tested?
By ci

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-31 18:55:36 +08:00
zxr2333
ab928ed586 [v0.18.0][P/D][Feature]Layerwise connector supports Mamba prefill prefix caching (#7796)
### What this PR does / why we need it?
Mooncake layerwise connector supports Mamba prefix caching on prefiller
nodes.

### Does this PR introduce _any_ user-facing change?
Yes. Use `--enable-prefix-caching` and `--mamba-cache-mode align` to
enable Mamba align-mode prefix caching on P/D prefill nodes. This
feature is not supported on decode nodes yet.

### How was this patch tested?
By P/D E2E test.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-31 09:25:22 +08:00
zouyida2052
0210cc0b07 lower log level in PD Disaggregation (#7589)
### What this PR does / why we need it?
This log is printed too frequently and is unnecessary, so its level is
lowered from INFO to DEBUG.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: zouyida2052 <zouyida2002@gmail.com>
2026-03-24 18:03:17 +08:00
liziyu
73cadecfb4 [P/D] [Bugfix] fix mooncake layerconnector dead when update_decoder_info fail (#7514)
### What this PR does / why we need it?
Fix the Mooncake layerwise connector dying when update_decoder_info fails.
In the scenario where node D is dead, node P failing in update_decoder_info
should not cause node P itself to become dead.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
by CI

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-03-24 15:49:46 +08:00
zxr2333
67aad1fce8 [BugFix][P/D] fix padding error on FullGraph mode && fix layerwise connector mamba accuracy (#7506)
### What this PR does / why we need it?
1. In FullGraph mode, the branches of the Triton operator are compiled and
fixed during graph capture, so the branch condition in the
`fused_recurrent_gated_delta_rule` operator, which checks
`ssm_state_indices >= 0` before writing to the SSM cache, becomes
ineffective: the write is performed regardless of the value. As a result,
after the vLLM GDN backend pads indices with -1, the operator computes
address offsets from that -1 and writes to the SSM cache at the wrong
location. Since the conv cache and the SSM cache in the vLLM Ascend
implementation are actually a single contiguous tensor split into two
parts, this overwrites data and produces NaN values.
This PR handles the two places in the GDN metadata builder where -1 padding
is required, replacing the padding with 0 in both to avoid the memory
overwrite, because block 0 is a reserved block (see the sketch after this
list).
2. Fix a layerwise connector bug in Mamba cache sending under heterogeneous
TP.
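
A minimal sketch of the padding replacement described in item 1, assuming a torch tensor of state indices; the helper name and call site are illustrative, not the actual vllm-ascend code:

```python
import torch

def sanitize_state_indices(ssm_state_indices: torch.Tensor) -> torch.Tensor:
    # Replace -1 padding with block 0. Block 0 is a reserved block, so a
    # stray write there is harmless, whereas an unconditional write at the
    # -1 offset (as happens in captured FullGraph mode) corrupts the
    # adjacent conv/SSM cache region and produces NaNs.
    return torch.where(
        ssm_state_indices < 0,
        torch.zeros_like(ssm_state_indices),
        ssm_state_indices,
    )
```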

- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-24 15:15:55 +08:00
wangxiaoteng888
c7157af8f7 [P/D] LayerwiseConnector supports the virtual push functionality on node D. (#7361)
### What this PR does / why we need it?
LayerwiseConnector supports the virtual push functionality on node D. By
adding a do_virtual flag to request metadata, the system can now identify
and process certain requests virtually, bypassing the actual KV cache
transfer. Such requests complete immediately from the consumer's
perspective, enabling optimizations or specific testing scenarios where
physical data transfer is not required.
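
A minimal sketch of the do_virtual handling described above, with assumed method and field names; the actual connector code differs:

```python
def handle_request_meta(self, meta) -> None:
    # Decode-side handling of one request's transfer metadata (sketch).
    if getattr(meta, "do_virtual", False):
        # Virtual push: report the request as finished immediately,
        # skipping the physical KV cache transfer.
        self._mark_recv_finished(meta.request_id)
    else:
        self._pull_kv_cache(meta.request_id, meta.remote_block_ids)
```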
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-18 10:50:02 +08:00
Chao Lei
d9ac7e8539 [Bugfix] Assertion error when decode prefix cache fully hits (#7236)
### What this PR does / why we need it?
#### Problem
When decode node enables prefix cache and the local prefix cache fully
hits, the following assertion error occurs:
```
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/engine/core.py", line 520, in step_with_batch_queue
(EngineCore_DP3 pid=34912)     engine_core_outputs = self.scheduler.update_from_output(
(EngineCore_DP3 pid=34912)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 1520, in update_from_output
(EngineCore_DP3 pid=34912)     self._update_from_kv_xfer_finished(kv_connector_output)
(EngineCore_DP3 pid=34912)   File "/usr/local/python3.11.14/lib/python3.11/site-packages/vllm/v1/core/sched/scheduler.py", line 2120, in _update_from_kv_xfer_finished
(EngineCore_DP3 pid=34912)     assert RequestStatus.is_finished(req.status)
(EngineCore_DP3 pid=34912)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP3 pid=34912) AssertionError
```

The error is triggered in scheduler.py at _update_from_kv_xfer_finished:
```
  if req.status == RequestStatus.WAITING_FOR_REMOTE_KVS:
      self.finished_recving_kv_req_ids.add(req_id)
  else:
      assert RequestStatus.is_finished(req.status)
```

#### Root Cause

When the decode node has prefix cache enabled and the local prefix cache fully
hits:

1. get_num_new_matched_tokens returns ext_tokens=0 and load_kv_async=False.
2. The request status becomes RUNNING (not WAITING_FOR_REMOTE_KVS).
3. However, update_state_after_alloc still adds the request to
_reqs_need_recv because remote_block_ids exists in kv_transfer_params.
4. The worker processes the request in _handle_request:
   - _transfer_kv_cache returns immediately (no actual transfer,
     local_block_ids is empty)
   - the finally block still calls update_done_task_count(request_id)
5. finished_recving contains this request.
6. When _update_from_kv_xfer_finished processes finished_recving, the
request status is RUNNING.
7. The assertion fails.

#### Solution

In _handle_request, only notify the scheduler (update_done_task_count) when
an actual KV transfer happened (local_block_ids is not empty). The signals
that notify Prefill to release its KV cache
(_send_done_signal_to_free_remote_port and _send_done_recv_signal) are
still sent regardless.
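
A minimal sketch of this guard, reusing the method names quoted above; the signature and surrounding class are assumptions, not the actual connector code:

```python
def _handle_request(self, request_id, local_block_ids, remote_meta):
    try:
        if local_block_ids:
            # Only pull KV blocks when there is something to transfer.
            self._transfer_kv_cache(request_id, local_block_ids, remote_meta)
    finally:
        # Always tell the prefill side it may release its KV cache.
        self._send_done_signal_to_free_remote_port(request_id, remote_meta)
        self._send_done_recv_signal(request_id, remote_meta)
        # Notify the scheduler only when a real transfer happened, so a
        # fully prefix-cached RUNNING request never lands in finished_recving.
        if local_block_ids:
            self.update_done_task_count(request_id)
```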

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: LCAIZJ <leichao139636@163.com>
2026-03-17 15:17:45 +00:00
zxr2333
5645ca8392 [BugFix]A2 MOE method&& layerwise MTP bugfix && Mamba gdn_metadata bugfix (#7364)
### What this PR does / why we need it?
Some bug fixes, mainly including:
1. On A2, the number of experts per card cannot exceed 16 when using MC2.
This PR fixes the A2 MoE communication-method selection, which previously
chose an incorrect method when the number of model experts exceeds 256; for
example, loading the PD-disaggregation D node with Qwen3.5-series models on
a 16-card A2 setup would incorrectly pick MC2 (see the sketch after this
list).
2. Fixed the issue where the layerwise connector sends the KV cache of the
MTP layer multiple times when `num_spec_tokens` > 1. Now the KV cache is
sent only on the MTP layer's first forward pass.
3. Fix an accuracy issue of Qwen3.5 when using MTP with PD disaggregation.
The cause is that `num_decode_draft_tokens` does not account for the fact
that `spec_tokens` do not yet exist during the first inference under PD
disaggregation (they are generated during that first inference), while
`spec_tokens_padding` is still added by `recomputed_scheduler`. As a
result, `gdn_metadata` incorrectly treats it as a prefill of length 2.
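
A minimal sketch of the constraint from item 1, with a placeholder helper name and a placeholder fallback method; only the "at most 16 experts per card for MC2 on A2" rule comes from the description above:

```python
def select_a2_moe_comm_method(num_experts: int, ep_world_size: int) -> str:
    # Hypothetical selection helper: MC2 is only valid on A2 when each card
    # holds at most 16 experts; otherwise choose another communication method.
    experts_per_card = num_experts // ep_world_size
    return "mc2" if experts_per_card <= 16 else "fallback_comm_method"
```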
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: zxr2333 <64738772+nwpu-zxr@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-03-17 23:03:45 +08:00
pichangping
3f39ac9c8d [Feature]Supports DSv3.1 PD separation and C8 quantization (#7222)
Co-authored-by: kunpengW-code <1289706727@qq.com>
Co-authored-by: linsheng1 <1950916997@qq.com>

### What this PR does / why we need it?
Currently, chunked prefill is forcibly enabled. DeepSeek V3.1 W8A8C8
supports only the PD separation scenario. C8 refers to quantizing the KV
cache to int8, which aims to reduce the GPU memory usage of the KV cache
and improve the inference throughput.
Constraints:
1. Only the PD separation mode can be used, with MooncakeLayerwiseConnector
to run the model.
2. Currently, only the activation values support dynamic quantization, while
the KV cache uses static quantization. C8 quantization with MTP is not
supported.

You can use ModelSlim for quantization. The quantization procedure is as
follows:
```
pip install transformers==4.48.2
git clone https://gitcode.com/Ascend/msmodelslim.git
cd msmodelslim
bash install.sh
cd example/DeepSeek/
python3 quant_deepseek_w8a8.py --model_path <path/weight> --save_path <path/quant_weight> \
  --anti_dataset ../common/deepseek_anti_prompt_50_v3_1.json \
  --calib_dataset ../common/deepseek_calib_prompt_50_v3_1.json \
  --rot --trust_remote_code True --fa_quant --dynamic --anti_method m6
```

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: pichangping <1337510399@qq.com>
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-16 22:49:05 +08:00
zxr2333
239683c7a6 [P/D]Mooncake Layerwise Connector supports hybrid attention manager with multiple kvcache groups (#7022)
### What this PR does / why we need it?
Mooncake Layerwise Connector supports hybrid attention manager with
multiple kvcache groups.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By CI.

- vLLM version: v0.16.0
- vLLM main:
15d76f74e2

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-10 23:59:20 +08:00
Yuzhou Tong
9180dd6c51 [BugFix][PCP] Fix precision bugs for pcp/dcp in PD disaggregate (#6876)
### What this PR does / why we need it?
Fix a bug in PD disaggregation with PCP/DCP: some conditions only consider
MLA while ignoring DSA.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
15d76f74e2
- vLLM Ascend main: 81fb7d5779

Signed-off-by: tongyuzhou <tongyuzhou1@huawei.com>
Co-authored-by: tongyuzhou <tongyuzhou1@huawei.com>
2026-03-02 16:11:00 +08:00
SILONG ZENG
e2237819a9 [CI]Fixed the spell check function in typos.toml (#6753)
### What this PR does / why we need it?
The incorrect regular-expression syntax `.*[UE4M3|ue4m3].*` uses a character
class, so it actually ignores every word containing any of the characters
`U`, `E`, `4`, `M`, `3`, `|`, `u`, `e`, `m`

```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*[UE4M3|ue4m3].*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
===fix===>
```yaml
extend-ignore-identifiers-re = [".*Unc.*", ".*_thw",
    ".*UE8M0.*", ".*(UE4M3|ue4m3).*", ".*eles.*", ".*fo.*", ".*ba.*",
    ".*ot.*", ".*[Tt]h[rR].*"]
```
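
A quick standalone illustration of why the character class is too broad, using the corrected alternation form; this snippet is not part of the CI config:

```python
import re

broad = re.compile(r".*[UE4M3|ue4m3].*")    # character class: any one of U, E, 4, M, 3, |, u, e, m
narrow = re.compile(r".*(UE4M3|ue4m3).*")   # alternation: the whole token UE4M3 or ue4m3

print(bool(broad.match("resume")))          # True  -- contains 'u', 'e', 'm'
print(bool(narrow.match("resume")))         # False
print(bool(narrow.match("scale_ue4m3")))    # True
```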

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
9562912cea

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-14 11:57:26 +08:00
wangxiaoteng888
b881fab416 [P/D][PCP] mooncake layerwise support pcp function (#6627)
### What this PR does / why we need it?
The Mooncake layerwise connector now supports PCP.
PCP (Prefill Context Parallelism) support: introduces explicit support
for Prefill Context Parallelism (PCP) and Decode Context Parallelism
(DCP) in the Mooncake layerwise KV cache transfer mechanism, allowing
more granular control and awareness of parallel configurations during
data transfer.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
2026-02-12 11:02:25 +08:00
lty
c3db1aca2f [Refactor]refactor p2p connector (#6551)
### What this PR does / why we need it?
Redundant code is removed, and repeated logic is combined through the
p2p connector refactor, making the code easy to extend.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
P node:
```
vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host 0.0.0.0 \
  --port 8002 \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --seed 1024 \
  --served-model-name model \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 16 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.92 \
  --quantization ascend \
  --async-scheduling \
  --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \
  --kv-transfer-config \
  '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_producer",
        "kv_connector_extra_config": {
                "use_layerwise": false,
                "connectors": [
                        {
                                "kv_connector": "MooncakeConnectorV1",
                                "kv_role": "kv_producer",
                                "kv_port": "30000",
                                "kv_connector_extra_config": {
                                        "use_ascend_direct": true,
                                        "prefill": {
                                                "dp_size": 2,
                                                "tp_size": 8
                                        },
                                        "decode": {
                                                "dp_size": 4,
                                                "tp_size": 4
                                        }
                                }
                        },
			{
                                "kv_connector": "AscendStoreConnector",
                                "kv_role": "kv_producer",
                                "kv_connector_extra_config": {
                                        "backend": "mooncake",
                                        "mooncake_rpc_port":"0"
                                }
                        }

                ]
        }
  }'
```

D node:
```
vllm serve /mnt/share/DeepSeek-V3.2-Exp-W8A8 \
  --host 0.0.0.0 \
  --port 8003 \
  --data-parallel-size 4 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --seed 1024 \
  --served-model-name model \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --max-num-seqs 16 \
  --enforce-eager \
  --trust-remote-code \
  --gpu-memory-utilization 0.92  \
  --quantization ascend \
  --async-scheduling \
  --additional-config '{"ascend_scheduler_config":{"enabled":true}}' \
  --kv-transfer-config \
  '{
        "kv_connector": "MultiConnector",
        "kv_role": "kv_consumer",
        "kv_connector_extra_config": {
                "use_layerwise": false,
                "connectors": [
                        {
                                "kv_connector": "MooncakeConnectorV1",
                                "kv_role": "kv_consumer",
                                "kv_port": "30100",
                                "kv_connector_extra_config": {
                                        "use_ascend_direct": true,
                                        "prefill": {
                                                "dp_size": 2,
                                                "tp_size": 8
                                        },
                                        "decode": {
                                                "dp_size": 4,
                                                "tp_size": 4
                                        }
                                }
                        },{

                                "kv_connector": "AscendStoreConnector",
                                "kv_role": "kv_consumer",
                                "kv_connector_extra_config": {
                                        "backend": "mooncake",
                                        "mooncake_rpc_port":"1"
                                }
                        }

                ]
        }
  }'
```
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-07 09:27:15 +08:00
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describes, we need to transpose the KV cache layout after a GQA KV
transfer when the prefill and decode tensor-parallel sizes are heterogeneous.
In the previous implementation, we used `npu_paged_cache_load` +
`transpose` + `_npu_reshape_and_cache` to do this work.

But this is not an efficient plan: those ops must be called for each layer,
which introduces 3 * layer_num kernel launches and 6 * layer_num data
movements between L1 cache and HBM for one request on the decode node. The
decode node usually runs in graph mode, so these op kernels are launched
between decode forwards by an async thread in the mooncake connector; they
may span several decode forwards, and TTFT increases by roughly 3~4 decode
forward times.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` that does this with a single kernel launch
and moves data between L1 cache and HBM only once.

With this fused op, the time spent transposing the KV cache layout drops
from 7 ms to 0.24 ms in a UT on 910C, and in the PD disaggregation
scenario TTFT decreases by about 90 ~ 110 ms for qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
The fused op is used by default. In case the op has a bug in some scenario, a
fallback is provided: an environment variable can disable it.

**DISABLE the fused op by adding the following env:**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix the layerwise connector P/D request_id error caused by vLLM PR
https://github.com/vllm-project/vllm/pull/27987, which adds a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
JiangWeixiang
41a52beb26 [bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325)
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where the KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P side pass its final request_id as
remote_request_id to the D side via kv_transfer_params. However, during
KV cache cleanup, the D side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
left the P-side KV cache unreleased on certain ranks, leading to memory
leaks. This PR corrects the logic to use remote_request_id consistently
when determining the P-side rank.
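
A minimal sketch of the corrected cleanup path, assuming the per-request kv_transfer_params carry the id the P side knows the request by; the method names are taken from the description above, not from the code:

```python
def free_remote_kv_cache(self, req_id: str, kv_transfer_params: dict) -> None:
    # Use the P-side request id (remote_request_id), not the locally
    # suffixed req_id, when choosing which P-side rank to notify.
    remote_request_id = kv_transfer_params["remote_request_id"]
    p_rank = self._select_prefill_rank(remote_request_id)
    self._send_release_message(p_rank, remote_request_id)
```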
### Does this PR introduce _any_ user-facing change?
No. 
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: ghphotoframe <854746559@qq.com>
2026-01-29 16:05:56 +08:00
yuxinshan
0bb1f91c2c [Feature] Mooncake connector get remote ptp size (#5822)
### What this PR does / why we need it?
To support elastic scaling when using the Mooncake connector, we need to
support **configuring different TP sizes for different nodes**.
To that end, we transfer the prefill node information, such as the TP size,
through **the request's kv_transfer_params**.
The decode nodes **get the prefill TP size** through the request's
kv_transfer_params instead of from the Mooncake connector configuration.
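
An illustrative sketch only, assuming the prefill TP size travels as a field in kv_transfer_params; the field name is a placeholder based on the description above:

```python
def get_prefill_tp_size(kv_transfer_params: dict, configured_tp: int) -> int:
    # Prefer the TP size the prefill node reported with the request; fall
    # back to the connector's static configuration if it is absent.
    return int(kv_transfer_params.get("prefill_tp_size", configured_tp))
```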

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
2026-01-26 14:28:33 +08:00
SILONG ZENG
153da1a669 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #4) (#6200)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/distributed/kv_transfer/__init__.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 20:40:48 +08:00
liziyu
f66bcdfb29 [P/D] Mooncake connector add zmq socket fail log (#6155)
Mooncake connector add zmq socket fail log

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 12:06:42 +08:00
weiguihua2
4173255c0c [main][Bugfix] fix kv pcp+pooling+pd separation bug (#6153)
### What this PR does / why we need it?
Fix a problem in the PCP + PD separation + KV pooling scenario.

In the pooling scenario, multi_nodes_meta_mapping is empty. As a result,
an error is reported when the remote_host information is obtained
through the get_remote_port_send_num method.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-01-23 16:15:04 +08:00
wangxiaoteng888
82a2b3bcc7 [P/D]Add ssl cert for metaserver proxy (#5875)
### What this PR does / why we need it?
When the P node accesses the proxy metaserver, pass the SSL certificate
and the CA certificate path to improve security.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-23 11:11:44 +08:00
zhangxinyuehfad
819a4459ce Drop vLLM 0.13.0 support (#6069)
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:45:08 +08:00
wangxiaoteng888
f2c0ced06d [P/D][PCP]bugfix pcp force free twice caused logger error (#6124)
### What this PR does / why we need it?
Fixed an issue where the D node mistakenly sent the pull-end signal twice,
causing the P node to log spurious errors.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-22 16:24:33 +08:00
Li Wang
484e7c59dc [CI] optimize lint term (#5986)
### What this PR does / why we need it?
This patch optimizes the lint check job. The main idea is to
reduce unnecessary installation time.
1. Installing vLLM is not required; appending the path of the vLLM
src to `PYTHONPATH` is sufficient.
2. Installing `requirements-dev.txt` is not required; we have a
pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the
requirements installed in advance.
**NOTE**: the conditions for triggering image builds are: 1) daily
scheduled build; 2) build when requirements are modified; 3) manual
build. This ensures that the dependencies in our image stay as
up-to-date as possible.
3. `mypy` was separated from the `pre-commit` hook for performance
reasons; we found that integrating `mypy` into the `pre-commit` hook
resulted in poor performance.
4. Reduce the CPU core consumption from 16 to 8.

### Does this PR introduce _any_ user-facing change?
The end-to-end lint time was reduced from 20 min per PR to 8 min per PR.
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-22 15:46:59 +08:00
JiangWeixiang
cef04b3555 [bugfix] adapt_remote_request_id (#6051)
This PR addresses a request ID mismatch issue in the PD
(Prefill-Decoding) separation deployment scenario for vllm-ascend.
Upstream vLLM recently mitigated request ID collisions by appending a
random suffix to each request_id (e.g., req-123 → req-123-abc), refer to
[PR-27987](https://github.com/vllm-project/vllm/pull/27987) &
[PR-29665](https://github.com/vllm-project/vllm/pull/29665). While this
works in single-node deployments, it breaks compatibility in
PD-separated setups: the Producer (Prefill node) and Consumer (Decoding
node) end up with different request_id values, preventing the Consumer
from correctly retrieving the KV cache generated by the Producer.
To resolve this, this PR introduces a new field remote_request_id in the
metadata passed via mooncake_connector. The Producer preserves and
forwards the original (unmodified) request_id as remote_request_id. The
Consumer then uses this remote_request_id—instead of its locally
generated suffixed ID—to fetch the correct KV cache from the Prefill
node.
This ensures consistent request identification across PD nodes while
maintaining compatibility with upstream vLLM’s request ID deduplication
mechanism.
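
A minimal illustration of the metadata flow described above; the dictionary layout is an assumption, and only the remote_request_id field follows the PR description:

```python
# On the Prefill (Producer) side: forward the original, unmodified id
# alongside the locally suffixed one.
kv_transfer_params = {
    "remote_request_id": "req-123",   # original id, preserved for the consumer
    # ... block ids, host/port info, etc.
}

# On the Decode (Consumer) side: fetch the KV cache under the forwarded id
# instead of the locally generated, suffixed request id.
def resolve_kv_lookup_id(local_req_id: str, kv_transfer_params: dict) -> str:
    return kv_transfer_params.get("remote_request_id", local_req_id)
```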
<img width="1279" height="781" alt="image"
src="https://github.com/user-attachments/assets/274238c1-dab6-4d3a-9ee4-6e578679b762"
/>

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: ghphotoframe <854746559@qq.com>
Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>
2026-01-22 10:48:40 +08:00
wangxiaochao6
bc486d9530 [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (#5960)
### What this PR does / why we need it?
In PD disaggregation case, when P has multi nodes, mooncake fails to
send data. Fix the issue in this PR.

The details:
If a P rank does not need to transfer kv cache to any one D rank, D node
should send a message to P node to release the kv
cache in P node. If P has multi nodes, D node should know the
corresponding IP in each P node, then D node can send message to the
right P node. Otherwise, send data error will happen. This PR fix this
issue by providing P nodes IP to D node through Parameter
`remote_port_send_num`.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2026-01-19 16:35:13 +08:00
wangxiaoteng888
fff5df3efe [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968)
### What this PR does / why we need it?
A forced free that releases a request a second time can cause the node to
crash. When requests are pulled too quickly, they should not be added to
the delayed-free queue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-17 18:49:27 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never
verified
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, which are broken by
https://github.com/vllm-project/vllm/pull/32148
where vLLM itself also fails:
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will re-enable those tests when main2main reaches
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC: https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer related code into a dedicated folder, as has already
been done in vLLM.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00