Commit Graph

38 Commits

Author SHA1 Message Date
lty
33b8ca4e96 [Feature]KV pool supports sparse attention (#6339)
### What this PR does / why we need it?
The kv pooling feature is adapted to Sparse Attention to support models
such as Deepseek V3.2.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 1 \
  --tensor-parallel-size 8 \
  --prefill-context-parallel-size 2 \
  --decode-context-parallel-size 1 \
  --cp-kv-cache-interleave-size 128 \
  --block-size 128 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "mooncake",
              "lookup_rpc_port":"0",
              "use_layerwise": false
            }
    }'
```

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-05 10:36:52 +08:00
DreamerLeader
2dac18afea [Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (#6126)
### What this PR does / why we need it?
Fix of Pooling Code and Update of Pooling Usage Guide
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
pr:[[Bugfix]Fixed precision issues caused by pooled request
pooling](https://github.com/vllm-project/vllm-ascend/pull/6049)
readyhttps://github.com/vllm-project/vllm-ascend/pull/6049
read for review
- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Signed-off-by: fangjianwei <f30058701@china.huawei.com>
Signed-off-by: DreamerLeader <88812830+DreamerLeader@users.noreply.github.com>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: fangjianwei <f30058701@china.huawei.com>
2026-02-04 16:35:41 +08:00
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describe, we need to transpose kv cache layout after GQA kv
transfer when prefill and decode tensor parallel size are heterogeneous,
in the previous implementation, we use `npu_paged_cache_load ` +
`tranpose` + `_npu_reshape_and_cache` to do this work.

But obviously, it is not an efficient plan, the ops above need to be
called for each layer, which introduces 3 * layer_num kernel launch, and
6 * layer_num data movement between L1 Cache and HBM for one request on
decode node. Usually, decode node uses graph mode, so these op kernels
will be called between decode forward launched by an async thread in
mooncacke connector, this kernels maybe last for several decode forward
and TTFT will increase by 3~4 decode forward time.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` to do this with only once kernel launch
and move data between L1 Cache and HBM only once.

After using this fused op, the time cost in transpose kv cacke layout
can be decreased to 0.24ms from 7ms in UT on 910C, and in PD
disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in
qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
Use fused op by default, incase the op has bug in any scenario, provide
fallback choice using env to disable it.

**DISABLE fused op by add following env**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
lty
082aa2e5b7 [Bugfix]The service fails to be started when the memcache pool is enabled (#6229)
### What this PR does / why we need it?
The service fails to be started when the memcache pool is enabled
without configuring the mooncake path.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
#memcache
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf

vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "memcache",
                "lookup_rpc_port":"0"
            }
    }'
```

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-02 16:26:18 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix layerwise Connector P/D request_id error, due to vllm pr:
https://github.com/vllm-project/vllm/pull/27987, which will add a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
JiangWeixiang
41a52beb26 [bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325)
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param. However, during
KV cache cleanup, the D-side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
caused the P-side KV cache to remain unreleased on certain ranks,
leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
### Does this PR introduce _any_ user-facing change?
No. 
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: ghphotoframe <854746559@qq.com>
2026-01-29 16:05:56 +08:00
yuxinshan
0bb1f91c2c [Feature] Mooncake connector get remote ptp size (#5822)
### What this PR does / why we need it?
To support elastic scaling when using mooncake connector, we should
support to **configure different tp sizes for different nodes**.
As a result, we transfer the prefill node information, such as tp size,
through **the request's kv_transfer_params**.
The decode nodes **get the prefill tp size** through the request's
kv_transfer_params, instead of getting it from the configuration of the
mooncake connector .

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: yuxinshan <syx_ctyg@126.com>
Signed-off-by: CalvinXKY <kyxiezju@163.com>
2026-01-26 14:28:33 +08:00
wangxiyuan
4e3919e965 Reapply "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227) (#6231)
This reverts commit 95649344aa.

The CI failure doesn't related to this change. Let's reapply it.

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-26 09:04:54 +08:00
wangxiyuan
95649344aa Revert "[Refactor] Unify full-graph parameter update logic (#6041)" (#6227)
This reverts commit 8966a99710.

It breaks the test
`tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py::test_deepseek_mtp_correctness[True-FULL_DECODE_ONLY-2-wemaster/deepseek_mtp_main_random_bf16]`

- vLLM version: v0.14.0
- vLLM main:
d68209402d
2026-01-25 15:25:38 +08:00
SILONG ZENG
6ccccad102 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #5) (#5996)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|
`.../distributed/kv_transfer/kv_pool/ascend_store/ascend_store_connector.py`
|
|
`vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/backend/backend.py`
|
| `
.../distributed/kv_transfer/kv_pool/ascend_store/backend/memcache_backend.py`
|
| `
.../distributed/kv_transfer/kv_pool/ascend_store/backend/mooncake_backend.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/config_data.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/kv_transfer.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_scheduler.py`
|
| `
vllm_ascend/distributed/kv_transfer/kv_pool/ascend_store/pool_worker.py`
|
| `
.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_kv_cache_manager.py`
|
| `
.../distributed/kv_transfer/kv_pool/cpu_offload/cpu_offload_connector.py`
|
| ` vllm_ascend/distributed/kv_transfer/kv_pool/cpu_offload/metadata.py`
|
| ` vllm_ascend/distributed/kv_transfer/kv_pool/ucm_connector.py` |
| `
vllm_ascend/distributed/kv_transfer/utils/mooncake_transfer_engine.py` |
| ` vllm_ascend/distributed/kv_transfer/utils/utils.py` |
| ` vllm_ascend/kv_offload/cpu_npu.py` |
| ` vllm_ascend/kv_offload/npu.py` |
| ` vllm_ascend/lora/lora_ops.py` |
| ` vllm_ascend/lora/punica_npu.py` |
| ` vllm_ascend/lora/utils.py` |

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
Signed-off-by: SILONG ZENG <2609716663@qq.com>
2026-01-24 22:45:38 +08:00
SILONG ZENG
153da1a669 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #4) (#6200)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
| `vllm_ascend/distributed/kv_transfer/__init__.py` |
| `vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_connector.py` |
|
`vllm_ascend/distributed/kv_transfer/kv_p2p/mooncake_layerwise_connector.py`
|

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 20:40:48 +08:00
LICO67373
8966a99710 [Refactor] Unify full-graph parameter update logic (#6041)
### What this PR does / why we need it?

**Refactor: Unify full-graph parameter update logic**

This PR consolidates the scattered full-graph parameter update logic
into a unified approach, improving code architecture and eliminating
duplication.

**Key improvements:**

1. **Unified interface**
- Create `update_full_graph_params` as the single entry point for all
full-graph updates
   - Replace multiple scattered update calls with one unified function
- Remove ~50 lines of duplicated if-else logic across
`model_runner_v1.py` and `eagle_proposer.py`

2. **Better architecture**
- Move update logic to respective Backend classes
(`AscendAttentionBackend`, `AscendMLABackend`)
   - Each Backend manages its own parameter update logic internally
   - Simplify caller code to just dispatch to the appropriate Backend

3. **Cleaner parameter handling**
   - Remove unnecessary `pcp_size` and `dcp_size` parameter passing
   - Get parallel configuration directly from distributed groups
   - Consistent with how other parts of the codebase obtain these values

**Why we need it:**
- **Maintainability**: Future changes only need to be made in one place
per Backend
- **Code quality**: Follows DRY principle and Single Responsibility
Principle
- **Readability**: Cleaner, more intuitive code structure

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes - same
behavior, cleaner code.

### How was this patch tested?

- All existing unit tests pass with updated mocks
- No new tests needed (pure refactoring, no behavior changes)
- CI validates correctness

---

- vLLM version: v0.13.0

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2026-01-24 20:12:57 +08:00
liziyu
f66bcdfb29 [P/D] Mooncake connector add zmq socket fail log (#6155)
Mooncake connector add zmq socket fail log

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 12:06:42 +08:00
liziyu
14bef9af6f [P/D] Remove restrictions on mooncake for IPv6 (#5946)
### What this PR does / why we need it?
Remove restrictions on mooncake for IPv6
Dependencies: cann8.5、mooncake v0.3.8.post1

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-24 11:30:22 +08:00
UnifiedCacheManager
a2f022f9b6 [UCMConnector]Add has_connector_metadata (#6172)
### What this PR does / why we need it?
ucm_connector add has `has_connector_metadata` interface to adapt to the
latest KV connector in vLLM.

### Does this PR introduce _any_ user-facing change?
this PR doesn't introduce _any_ user-facing change.


### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: UnifiedCacheManager <unifiedcachem@163.com>
2026-01-23 21:16:48 +08:00
baxingpiaochong
8786412f5c [Bugfix]KV pool rank 0 consumes more HBM (#6113)
### What this PR does / why we need it?

before add_set_deivce
<img width="2354" height="674" alt="image"
src="https://github.com/user-attachments/assets/8b81ab5f-b9ba-4fd2-8546-8f36ac15d32b"
/>
after
<img width="1044" height="156" alt="image"
src="https://github.com/user-attachments/assets/996d845a-8abd-4aae-b894-4a9832b1f742"
/>

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: baxingpiaochong <771405853@qq.com>
2026-01-23 19:47:33 +08:00
weiguihua2
4173255c0c [main][Bugix] fix kv pcp+pooling+pd separation bug (#6153)
### What this PR does / why we need it?
Rectify the problem that the pcp and pd separation and kv pooling
scenario.

In the pooling scenario, multi_nodes_meta_mapping is empty. As a result,
an error is reported when the remote_host information is obtained
through the get_remote_port_send_num method.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-01-23 16:15:04 +08:00
wangxiaoteng888
82a2b3bcc7 [P/D]Add ssl cert for metaserver proxy (#5875)
### What this PR does / why we need it?
When the P node accesses the proxy meteserver, add the SSL certificate
and the CA certificate path to improve security.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-23 11:11:44 +08:00
zhangxinyuehfad
819a4459ce Drop vLLM 0.13.0 support (#6069)
### What this PR does / why we need it?
Drop vLLM 0.13.0 support, upgrade to 0.14.0

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-23 09:45:08 +08:00
wangxiaoteng888
f2c0ced06d [P/D][PCP]bugfix pcp force free twice caused logger error (#6124)
### What this PR does / why we need it?
The issue of the D node mistakenly sending the pull-end signal twice,
leading to the P node printing logger errors abnormally, has been
resolved.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By ci
- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-22 16:24:33 +08:00
Li Wang
484e7c59dc [CI] optimize lint term (#5986)
### What this PR does / why we need it?
This patch purpose to optimize the lint check term. The main idea is to
reduce unnecessary installation time.
1. The installation of vllm is not must, only append the path of vllm
src to the `PATHONPATH` is effective
2. This installation of `requirements-dev.txt` is not must, we have a
pre-built image `quay.io/ascend-ci/vllm-ascend:lint` with all the
requirements installed in advance.
**NOTE**: the conditions for triggering image builds are: 1).Daily
scheduled build; 2) Build when requirements are modified; 3) Manual
build. This ensures that the dependencies in our image are up-to-date to
the greatest extent possible.
3. The `mypy` was separated from the `pre-commit` hook for performance
reasons; we found that integrating `mypy` into the `pre-commit` hook
resulted in poor performance.
4. Reduce the CPU core consumption from 16 -> 8

### Does this PR introduce _any_ user-facing change?
The end-to-end lint time was optimized from 20min/per PR to 8min/per PR
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-22 15:46:59 +08:00
JiangWeixiang
cef04b3555 [bugfix] adapt_remote_request_id (#6051)
This PR addresses a request ID mismatch issue in the PD
(Prefill-Decoding) separation deployment scenario for vllm-ascend.
Upstream vLLM recently mitigated request ID collisions by appending a
random suffix to each request_id (e.g., req-123 → req-123-abc), refer to
[PR-27987](https://github.com/vllm-project/vllm/pull/27987 ) &
[PR-29665](https://github.com/vllm-project/vllm/pull/29665). While this
works in single-node deployments, it breaks compatibility in
PD-separated setups: the Producer (Prefill node) and Consumer (Decoding
node) end up with different request_id values, preventing the Consumer
from correctly retrieving the KV cache generated by the Producer.
To resolve this, this PR introduces a new field remote_request_id in the
metadata passed via mooncake_connector. The Producer preserves and
forwards the original (unmodified) request_id as remote_request_id. The
Consumer then uses this remote_request_id—instead of its locally
generated suffixed ID—to fetch the correct KV cache from the Prefill
node.
This ensures consistent request identification across PD nodes while
maintaining compatibility with upstream vLLM’s request ID deduplication
mechanism.
<img width="1279" height="781" alt="image"
src="https://github.com/user-attachments/assets/274238c1-dab6-4d3a-9ee4-6e578679b762"
/>

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: ghphotoframe <854746559@qq.com>
Co-authored-by: jiangweixiang <jwx02384838@antgroup.com>
2026-01-22 10:48:40 +08:00
DreamerLeader
b6d55fc48e [Bugfix]Fixed precision issues caused by pooled request pooling (#6049)
### What this PR does / why we need it?
Fixed precision issues caused by pooled request pooling
### Does this PR introduce _any_ user-facing change?
pr6045
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
Co-authored-by: 房建伟 <fangjianwei@fangjianweideMacBook-Air.local>
2026-01-20 23:51:31 +08:00
fems14
8b98d7a4e8 【main】【bugfix】Resolved memory deallocation failure in the pooling layer under re-computation workloads. (#6045)
### What this PR does / why we need it?
Resolved a double-free memory vulnerability in the pooling layer under
re-computation scenarios.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-20 22:56:04 +08:00
wangxiaochao6
bc486d9530 [main][bugfix] fix mooncake kv cache transfer when one P has multi nodes (#5960)
### What this PR does / why we need it?
In PD disaggregation case, when P has multi nodes, mooncake fails to
send data. Fix the issue in this PR.

The details:
If a P rank does not need to transfer kv cache to any one D rank, D node
should send a message to P node to release the kv
cache in P node. If P has multi nodes, D node should know the
corresponding IP in each P node, then D node can send message to the
right P node. Otherwise, send data error will happen. This PR fix this
issue by providing P nodes IP to D node through Parameter
`remote_port_send_num`.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
2026-01-19 16:35:13 +08:00
LICO67373
687df88151 [Refactor] Move AttentionSpec initialization to Attention module (#5834)
### What this PR does / why we need it?

This PR refactors `get_kv_cache_spec` method to delegate AttentionSpec
creation to each attention module's own `get_kv_cache_spec()` method,
aligning with the vllm source code structure.

**Changes:**
- Simplify `get_kv_cache_spec` in `model_runner_v1.py` and
`cpu_offload_connector.py`
- Remove manual `AttentionType` checks for `Attention` modules
- Delegate spec creation to each attention module's `get_kv_cache_spec`
method directly
- Let `MambaBase` layers use their own `get_kv_cache_spec` method
- Keep `use_sparse` hack for `MLAAttention` (DeepSeek DSA mode) as
Ascend-specific handling

This change follows RFC #5463 item 12: move AttentionSpec to Attention
module.

- Fixes #5463 (item 12)

### Does this PR introduce _any_ user-facing change?

No. This is an internal refactoring that simplifies code structure
without changing any external behavior.

### How was this patch tested?

- Syntax validation passed via `python -m py_compile`
- CI tests will verify the changes work correctly with existing test
cases
- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: lico67373 <918688502@qq.com>
2026-01-19 14:22:18 +08:00
wangxiaoteng888
fff5df3efe [P/D]The issue of solving the force-free secondary release request, which causes the node to crash. (#5968)
### What this PR does / why we need it?
The force-free secondary release request causes the node to crash. When
requests are pulled too quickly, they should not be added to the
delay-free queue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By ci

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-17 18:49:27 +08:00
lidenghui1110
48e10de8c9 [Bugfix] fix cpu offload hang with tp=1 (#5963)
### What this PR does / why we need it?
As issue #5948 reported,when using cpu_offload_connector with TP=1, the
server will hang on starting, we found several bugs here to fix.
1. some crash error encountered because of code changed with vllm
version updating, some of them can be fixed as #5948, and this PR fixed
all of them.
2. hang problem described in #5948, the direct reason is that in
cpu_offload_connector, RPC client using the same client id in scheduler
and worker when tensor_parrallel_size is 1, this PR force the client id
to be different, then it is fixed.

- Why we didn't find this hang problem before?
Because we using --distributed-executor-backend mp or
tensor_parrallel_size > 1 in our test, in our old test case, the
scheduler and workers are different procceses, then client ids build by
`worker-{os.getpid()}` are not the same. But when using
tensor_parrallel_size=1, vllm will use uniproc as
distributed-executor-backend by default, the scheduler and worker will
by in the same proccess, then client ids are the same and hang.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-17 11:50:13 +08:00
lty
3cb0af0bcf [Refactor]Refactor of vllm_ascend/distributed module (#5910)
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed.
### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 16:26:53 +08:00
wjunLu
c11a05c4e1 [Main2Main] Upgrade vllm commit to 0113 (#5839)
### What this PR does / why we need it?
Upgrade vllm commit to 0113 (11b6af5280d6d6dfb8953af16e67b25f819b3be9)

- Modify import paths due to the refactors
https://github.com/vllm-project/vllm/pull/31916
https://github.com/vllm-project/vllm/pull/32054

- Fix `TypeError: NPUOffloadingSpec.__init__() takes 2 positional
arguments but 3 were given` due to
https://github.com/vllm-project/vllm/pull/24498

- Skip the async-scheduling tests in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are never
verified
https://github.com/vllm-project/vllm/pull/31998

- Skip some pooling tests, which are caused by
https://github.com/vllm-project/vllm/pull/32148
where vllm is also failed
https://buildkite.com/vllm/ci/builds/46705/steps/canvas?jid=019bb329-3834-4685-862b-1613b8e0f5d4

We will reopen those tests when main2main reachs
https://github.com/vllm-project/vllm/pull/32243

- Skip some cases in
`tests/e2e/multicard/4-cards/long_sequence/test_mtp.py`, which are
broken by
https://github.com/vllm-project/vllm/pull/32118

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

Signed-off-by: wjunLu <wjunlu217@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-01-15 09:48:53 +08:00
lty
295018ec0f [Refactor]Refactor of vllm_ascend/distributed module (#5719)
### What this PR does / why we need it?
Based on the RFC:https://github.com/vllm-project/vllm-ascend/issues/5604

This PR is a refactoring of vllm_ascend/distributed, moving all
kv_transfer realtaed codes into a dedicated folder, which has already
been done in vLLM

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
2f4e6548ef

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-01-15 08:57:40 +08:00
wangxiyuan
0190b68f51 [Misc]Remove PD v0 code (#2047)
Cleanup V0 disaggregated prefill code for V0 Engine.

part of https://github.com/vllm-project/vllm-ascend/issues/1620

TODO: enable v1 e2e test.

- vLLM version: v0.10.0
- vLLM main:
2cc571199b

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 19:09:22 +08:00
wangxiyuan
34cfdf5520 [Misc] Fix logger bug (#2024)
1. Remove useless logger
2. Fix logger bug, same problem as
https://github.com/vllm-project/vllm-ascend/pull/515

- vLLM version: v0.10.0
- vLLM main:
18cc33dd60

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-07-28 15:59:09 +08:00
Li Wang
c7446438a9 [1/N][CI] Move linting system to pre-commits hooks (#1256)
### What this PR does / why we need it?

Follow vllm-project/vllm lint way:
https://github.com/vllm-project/vllm/blob/main/.pre-commit-config.yaml

Enable pre-commit to avoid some low level error  AMAP.

This pr is one step of #1241, The purpose is make linting system more
clear and convenient, on this step, Mainly did the following things:
yapf, actionlint, ruff, typos, isort, mypy, png-lint, signoff-commit,
enforce-import-regex-instead-of-re.

TODO: 
- clang-format(check for csrc with google style)
need clean code, disable for now 
- pymarkdown
need clean code, disable for now 
- shellcheck
need clean code, disable for now 

### Does this PR introduce _any_ user-facing change?

Only developer UX change:

https://vllm-ascend--1256.org.readthedocs.build/en/1256/developer_guide/contributing.html#run-lint-locally

```
pip install -r requirements-lint.txt && pre-commit install
bash format.sh
```

### How was this patch tested?

CI passed with new added/existing test.

Co-authored-by: Yikun [yikunkero@gmail.com](mailto:yikunkero@gmail.com)
Co-authored-by: wangli
[wangli858794774@gmail.com](mailto:wangli858794774@gmail.com)
- vLLM version: v0.9.1
- vLLM main:
5358cce5ff

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-07-10 14:17:15 +08:00
Mengqing Cao
6eddbd2521 [CI/UT][PD Disaggreate] Initialize PD Disaggreate UT (#889)
Initialize PD Disaggreate UT

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2025-05-29 10:17:12 +08:00
wangxiyuan
6193ba679b [CI] add codespell CI and fix format.sh (#827)
1. Fix format check error to make format.sh work
2. Add codespell check CI 
3. Add the missing required package for vllm-ascend.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-05-12 22:04:48 +08:00
whx
8b194ad12e [Disaggregated Prefill] P2P Disaggregated Prefill based on llm_datadist (#694)
### What this PR does / why we need it?
- This PR proposes a P2P version of Disaggregated Prefill based on
llm_datadist which manages data transfer.

- This solution reconstructs previous offline single-node Disaggregated
Prefill solution, and supports multi-node and online serveing now.

- Currently this solution supports 1P1D situation of Deepseek hybrid
parallelism (P: TP+EP, D: DP+EP). Note that xPyD situation is considered
in the solution design, and will be supported soon within v1 engine.

---------

Signed-off-by: hw_whx <wanghexiang7@huawei.com>
Signed-off-by: ganyi <pleaplusone.gy@gmail.com>
Co-authored-by: hw_whx <wanghexiang7@huawei.com>
Co-authored-by: ganyi <pleaplusone.gy@gmail.com>
2025-05-01 22:31:36 +08:00