Commit Graph

2309 Commits

Author SHA1 Message Date
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describes, we need to transpose the kv cache layout after GQA kv
transfer when the prefill and decode tensor parallel sizes are
heterogeneous. The previous implementation used `npu_paged_cache_load` +
`transpose` + `_npu_reshape_and_cache` to do this work.

But this is obviously not an efficient plan: the ops above must be called
for each layer, which introduces 3 * layer_num kernel launches and
6 * layer_num data movements between L1 cache and HBM for one request on
the decode node. The decode node usually runs in graph mode, so these op
kernels are launched between decode forward passes by an async thread in
the mooncake connector; they may last for several decode forward passes,
and TTFT will increase by 3~4 decode forward times.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` that does this with only one kernel launch
and only one round of data movement between L1 cache and HBM.

After using this fused op, the time spent transposing the kv cache layout
drops from 7 ms to 0.24 ms in UT on 910C, and in the PD disaggregation
scenario, TTFT decreases by about 90 ~ 110 ms for qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |
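
The launch-count arithmetic described above can be sketched as a tiny cost model. The function names below are illustrative, not from the codebase, and the fused-path cost of one launch and one data movement is per request as described:

```python
# Illustrative cost model for the two layout-transpose paths.
# Fallback: npu_paged_cache_load + transpose + _npu_reshape_and_cache per
# layer, i.e. 3 kernel launches and 6 L1<->HBM data movements per layer.
def fallback_cost(layer_num: int) -> tuple[int, int]:
    kernel_launches = 3 * layer_num
    data_movements = 6 * layer_num  # 2 movements per kernel
    return kernel_launches, data_movements

# Fused op: one kernel launch and one data movement, regardless of depth.
def fused_cost(layer_num: int) -> tuple[int, int]:
    return 1, 1

# For a hypothetical 61-layer model:
print(fallback_cost(61))  # -> (183, 366)
print(fused_cost(61))     # -> (1, 1)
```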

### Does this PR introduce _any_ user-facing change?
The fused op is used by default. In case the op has a bug in some
scenario, a fallback is provided: it can be disabled via an environment
variable.

**Disable the fused op by adding the following env:**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
SILONG ZENG
f4a72f0d16 [CI]Disable early exit to complete all tests (#6482)
### What this PR does / why we need it?
1. Disable early exit on the first error so that all tests run to
completion.
2. Within each partition, tests are re-sorted by `estimated_time` in
ascending order. This allows the CI to cover as many test cases as
possible in the early stages.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-03 11:25:51 +08:00
guanguan0308
dffac6db73 [Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)
### What this PR does / why we need it?
Add a new output for the expert token count. An additional output tensor
`expert_token_nums` is added to both operators to meet the requirement of
tracking token distribution among experts:

- Tensor name: `expert_token_nums`
- Dimension: 1-D tensor
- Shape: `(local_expert_num,)`
- Data type: int32
- Semantics: the number of tokens actually received by each expert on the
current card.
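The semantics of `expert_token_nums` can be illustrated with a small host-side stand-in (pure Python; the real tensor is produced on-device by the operator):

```python
def expert_token_nums(expert_ids: list[int], local_expert_num: int) -> list[int]:
    """Count how many tokens each local expert received on this card.

    expert_ids holds the local expert index assigned to each incoming
    token; the result mirrors the int32 output tensor of shape
    (local_expert_num,) described above.
    """
    counts = [0] * local_expert_num
    for e in expert_ids:
        counts[e] += 1
    return counts

# 6 tokens routed among 4 local experts:
print(expert_token_nums([0, 2, 2, 3, 0, 2], local_expert_num=4))  # -> [2, 0, 3, 1]
```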
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
2026-02-03 10:41:06 +08:00
zhangxinyuehfad
26b83f8bde [Bugfix] Improve Triton stability on Ascend for large grids (#6301)
### What this PR does / why we need it?
Improve Triton stability on Ascend for large grids: set
`TRITON_ALL_BLOCKS_PARALLEL=1` when grids > 65535.
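
A sketch of the guard this implies (the function name and call site are assumptions; only the env variable and the 65535 threshold come from the PR description):

```python
import os

# Grids larger than 65535 need TRITON_ALL_BLOCKS_PARALLEL=1 on Ascend,
# per the PR description above.
MAX_SERIAL_GRID = 65535

def maybe_enable_all_blocks_parallel(grid_size: int) -> bool:
    """Set the env variable for oversized grids; return whether it fired."""
    if grid_size > MAX_SERIAL_GRID:
        os.environ["TRITON_ALL_BLOCKS_PARALLEL"] = "1"
        return True
    return False

print(maybe_enable_all_blocks_parallel(70000))  # -> True
print(maybe_enable_all_blocks_parallel(4096))   # -> False
```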

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-03 10:32:27 +08:00
zhangxinyuehfad
05cc03d785 [Bugfix] fix hash conflict due to resetting incompatible configurations (#6368)
### What this PR does / why we need it?
[Bugfix] Fix a hash conflict caused by resetting incompatible configurations.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-03 10:32:02 +08:00
starmountain1997
b6256e8bc9 Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241)" (#6497)
# What this PR does / why we need it?
This PR reverts commit 8134146ab6, which
modified the DeepSeek V3.2 (W8A8) single-node nightly test
configuration, as there is no constraint between tp_size and MTP.
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
N/A for a revert PR. The changes restore the previously known working
configuration.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-03 08:42:58 +08:00
debuger
c1618a0427 [Bugfix]Fix the compatibility issue of may_reinitialize_input_batch (#6290)
### What this PR does / why we need it?
Added a check in the may_reinitialize_input_batch method to verify
whether the backend implements the get_supported_block_size method.

### Does this PR introduce _any_ user-facing change?
no user-facing change

### How was this patch tested?
Only a few lines of code within the methods were modified, and the
format check test has been passed.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Debuuuuger <huangzr@cmbchina.com>
Signed-off-by: debuger <102402761+huangazazaz@users.noreply.github.com>
Signed-off-by: Debuuuuger <12110718@mail.sustech.edu.cn>
Co-authored-by: Debuuuuger <huangzr@cmbchina.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2026-02-02 19:16:26 +08:00
lilinsiman
7932255c06 [Refactor][EAGLE] 6/N route mtp to eagle except pcp/dcp+mtp (#6349)
### What this PR does / why we need it?

Overview: This pull request refactors speculative decoding for Eagle and
MTP proposers on Ascend hardware. It fixes a bug related to
draft_attn_metadatas being lost, migrates the lmhead feature, and adds
routing logic in MtpProposer.

Details:
1. Migrated the lmhead feature from mtp to eagle and normalized it in
eagle_proposer.
2. Fixed the bug where draft_attn_metadatas was lost after enabling
eagle mode in the merge graph.
3. Added routing for pcp and disable-padded-drafter-batch: in mtp mode,
if pcp and disable padded drafter batch are not enabled, the normalized
eagle_proposer file is used.

RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
ut and test

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-02-02 19:15:31 +08:00
meihanc
c08364f761 [Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455)
### What this PR does / why we need it?

When using Mooncake on Ascend NPU, AscendDirectTransport randomly
allocates ports within range `[20000, 20000 + npu_per_node × 1000)`.
Reference:
[ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475)

If `kv_port` overlaps with this range, users may encounter intermittent
startup failures:
```bash
zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012')
RuntimeError: KV Cache sending/receiving thread failed to start.
```
This PR fixes the intermittent kv_port conflict with
AscendDirectTransport in `Qwen3-235B-W8A8-EPLB.yaml` and adds a `kv_port
Configuration Guide` section in `pd_disaggregation_mooncake_multi_node.md`.

Test results (tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml):
https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259
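
The conflict window can be checked up front. A minimal sketch using the range quoted above (the `npu_per_node=16` value in the example is an assumption):

```python
def kv_port_conflicts(kv_port: int, npu_per_node: int, base: int = 20000) -> bool:
    """True if kv_port falls inside the range AscendDirectTransport
    allocates from: [base, base + npu_per_node * 1000)."""
    return base <= kv_port < base + npu_per_node * 1000

# With 16 NPUs per node the transport may grab any port in [20000, 36000):
print(kv_port_conflicts(30012, npu_per_node=16))  # -> True (the failing port above)
print(kv_port_conflicts(36100, npu_per_node=16))  # -> False (a safe choice)
```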

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-02 17:31:21 +08:00
LHXuuu
45a573cff1 [Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the vLLM community
to generate quantized weights, the vLLM Ascend engine needs to be
adapted to support the compressed-tensors quantization format.

1. Support W4A8 dynamic weights for MoE models.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
2026-02-02 16:39:32 +08:00
lty
082aa2e5b7 [Bugfix]The service fails to be started when the memcache pool is enabled (#6229)
### What this PR does / why we need it?
The service fails to start when the memcache pool is enabled
without configuring the mooncake path.

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?
```
#memcache
echo 200000 > /proc/sys/vm/nr_hugepages
source /usr/local/memfabric_hybrid/set_env.sh
source /usr/local/memcache_hybrid/set_env.sh
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/nnal/atb/set_env.sh
export MMC_LOCAL_CONFIG_PATH=/usr/local/memcache_hybrid/latest/config/mmc-local.conf

vllm serve /mnt/weight/DeepSeek-V3.2-Exp-W8A8 \
  --host $local_ip \
  --port 8002 \
  --served-model-name model \
  --data-parallel-size 2 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --max-num-seqs 4 \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.9 \
  --trust-remote-code \
  --enforce-eager \
  --quantization ascend \
  --additional_config '{"ascend_scheduler_config":{"enabled":false}}' \
  --kv-transfer-config \
    '{
            "kv_connector": "AscendStoreConnector",
            "kv_role": "kv_both",
            "kv_connector_extra_config": {
	            "backend": "memcache",
                "lookup_rpc_port":"0"
            }
    }'
```

- vLLM version: v0.14.0
- vLLM main:
d68209402d

---------

Signed-off-by: lty <linhebiwen@gmail.com>
2026-02-02 16:26:18 +08:00
Shaoxu Cheng
460ea88276 [Refact.]: Refactor some leftover implementations of 300I DUO in the main branch. (#6425)
### What this PR does / why we need it?
- Replace the RoPE operator implementation.
- Refactor some leftover implementations of 300I DUO in the main branch.

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-02-02 16:12:04 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
zhangyiming
d53510b26d [Misc] Print triton info in collect_env.py (#6298)
### What this PR does / why we need it?
[Misc] Print triton info in collect_env.py; this helps us collect more
info when a user creates an issue.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: menogrey <1299267905@qq.com>
2026-02-02 15:53:42 +08:00
starmountain1997
8134146ab6 [CI] fix DS3.2 single node cudagraph_sizes config (#6241)
# What this PR does / why we need it?
This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8)
model to ensure CI stability. The changes include:
1. Simplified nightly test matrix (nightly_test_a3.yaml):
- Temporarily reduced to only run deepseek3_2-w8a8 test case for
debugging
- Changed trigger from schedule/workflow_dispatch to support
push/pull_request for faster iteration
2. Updated DeepSeek V3.2 test configuration
(test_deepseek_v3_2_w8a8.py):
- Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32]
for better performance
- Increased max-num-seqs from 4 to 8
- Increased gpu-memory-utilization from 0.92 to 0.98
- Increased num_speculative_tokens from 2 to 3
3. Added PR checkout step (_e2e_nightly_single_node.yaml):
- Added ability to checkout a specific PR (#6241) for testing
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
Mock nightly test has passed, see
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241).



- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-02 11:47:32 +08:00
LQLlulu
d1dcdfc408 [bugfix]fix some bug in dispatch_ffn_combine kernel (#6465)
### What this PR does / why we need it?
The kernel internals had an issue with maxoutputsize overflow in the
swiglu section, which has been fixed.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: LQLlulu <39671654+LQLlulu@users.noreply.github.com>
2026-02-02 08:32:42 +08:00
SILONG ZENG
347eb36a59 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #9) (#6135)
### What this PR does / why we need it?
**Scope of Changes**:
| File Path |
| :--- |
|`vllm_ascend/worker/model_runner_v1.py`|
|`vllm_ascend/worker/pcp_utils.py`|

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-02-01 23:20:20 +08:00
wangxiyuan
f7dc7d9b86 [CI] support build wheel and docker image by workflow (#6453)
Make the image and wheel build CI jobs work via workflow_dispatch.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:22 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.

### Does this PR introduce _any_ user-facing change?
`VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
fems14
775fbc4cd2 [main][bugfix] fix: restrict default MLAPO activation to Decode nodes only (#6451)
### What this PR does / why we need it?
There is an issue with the current default logic for MLAPO (MLA Policy
Optimization). By design, MLAPO should only be enabled by default on
Decode (D) nodes. However, in hybrid (collocated prefill and decode)
scenarios, the strategy is erroneously activated during the Prefill
stage.
This PR corrects the default behavior to ensure that MLAPO is
exclusively enabled for the Decoding phase. This prevents unexpected
policy interference during Prefill and ensures optimal performance in
hybrid deployment environments.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: fems14 <1804143737@qq.com>
2026-01-31 22:44:56 +08:00
wangxiyuan
ef02d20086 [CI] update gemini styleguide (#6463)
Let Gemini always add a PR title and commit message summary.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-31 18:02:49 +08:00
Li Wang
5b0a6bcfe9 [ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint" (#6459)
This reverts commit 56f5d3bd49.

### What this PR does / why we need it?
The patch https://github.com/vllm-project/vllm-ascend/pull/6357 broke
functionality in the spec_decode scenario; let's revert it and make CI
happy first.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-31 16:33:34 +08:00
wangxiyuan
96cbfebede [CI]Update gemini guide (#6458)
Make the output of Gemini more readable.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-31 15:17:39 +08:00
wangxiyuan
e3a1586fce [CI]Update gemini config (#6447)
Update gemini config to make it more useful

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-31 10:47:38 +08:00
Qiu
638cae824d [bugfix](CP) Fix and unify the PD request discrimination logic. (#5939)
### What this PR does / why we need it?
Since PR https://github.com/vllm-project/vllm/pull/32118 modified the
criteria for distinguishing Prefill and Decode requests in vLLM,
PCPManager needs to synchronize with this standard. As PCPManager
computes PD request counts in multiple places, this PR consolidates the
related logic and updates the PD request count once per batch.

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
```

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-31 10:26:02 +08:00
wubin58
4230bc8646 [Bugfix]Modify NPU rotary encoding parameter fields,fix RopeOperation setup failed in condition of self.rotary_dim < self.head_size (#6310)
### What this PR does / why we need it?
Change self.head_size to self.rotary_dim: only the rotary part is
processed here, so the dimension should be rotary_dim.

Fix bug #6060
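
The distinction behind the fix (rotating only `rotary_dim` of the `head_size` dimensions) can be sketched in plain Python; this is illustrative only, the real implementation is the NPU RoPE operator:

```python
import math

def apply_partial_rope(head: list[float], rotary_dim: int, theta: float) -> list[float]:
    """Rotate only the first rotary_dim elements of one head vector (neox
    style: element i pairs with element i + rotary_dim // 2); the tail
    head[rotary_dim:] passes through untouched. Sizing this by head_size
    instead of rotary_dim is the mistake the fix removes."""
    half = rotary_dim // 2
    out = list(head)
    for i in range(half):
        x, y = head[i], head[i + half]
        out[i] = x * math.cos(theta) - y * math.sin(theta)
        out[i + half] = x * math.sin(theta) + y * math.cos(theta)
    return out

# head_size=6, rotary_dim=4: the last two dims are never rotated.
rotated = apply_partial_rope([1, 0, 0, 1, 5, 6], rotary_dim=4, theta=math.pi / 2)
print(rotated[4:])  # -> [5, 6]
```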

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only a small section of code was modified to adjust the parameters, and
all standard tests were passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
Co-authored-by: fengshi666 <fengshi666@adsl-99-12-210-25.dsl.hstntx.sbcglobal.net>
2026-01-30 21:25:04 +08:00
xulei
77ea873224 fix: resolve sync bug in DispatchFFNCombine when expert num per card is 32 (#6416)
### What this PR does / why we need it?

Fix the synchronization deadlock in the DispatchFFNCombine module that
occurs on NPU cards when the number of experts per card exceeds 16 (the
bug manifests prominently at 32/128).

### Does this PR introduce _any_ user-facing change?
No, this is a bug fix for internal synchronization logic specific to NPU
expert dispatch, with no impact on external APIs, interfaces, or
end-user behaviors.


- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: xulei_ict <xulei292@huawei.com>
Co-authored-by: xulei_ict <xulei292@huawei.com>
2026-01-30 21:21:20 +08:00
Yizhou
56f5d3bd49 [Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357)
### What this PR does / why we need it?
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, and asserts the layout
constraint before building the attention metadata. Together, these
changes prevent kernel mismatches or failures and ensure correct shapes
for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.
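
A hedged sketch of what such a padding helper does (names and signature are hypothetical; the real helper also inserts a dummy request for mixed batches and copies the buffer to device):

```python
def pad_query_start_loc(query_start_loc: list[int], padded_num_reqs: int) -> list[int]:
    """Pad the cumulative query-start offsets to a fixed
    (padded_num_reqs + 1) length by repeating the last offset, so every
    padded slot is a zero-length dummy query and the FIA/TND layout
    constraint still holds."""
    pad = padded_num_reqs + 1 - len(query_start_loc)
    return query_start_loc + [query_start_loc[-1]] * pad

# 3 real requests with query lengths 4, 2, 5, padded to 5 request slots:
print(pad_query_start_loc([0, 4, 6, 11], padded_num_reqs=5))  # -> [0, 4, 6, 11, 11, 11]
```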

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-30 16:41:44 +08:00
ChenCangtao
f2990f7741 [e2e Test][npugraph_ex]add static kernel e2e test case (#6320)
### What this PR does / why we need it?
Added an E2E test case for enabling a static kernel with npugraph_ex,
monitoring its compilation and unloading process. Also fixed previously
existing spelling errors.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 16:24:48 +08:00
Li Wang
8969b94a14 [Nightly] Correct nightly image build ref (#6420)
### What this PR does / why we need it?
The underlying tags for nightly image builds have been corrected, and
some useless and confusing workflow fields have been removed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-30 15:55:58 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
Wang Kunpeng
70cc5f7969 [bugfix]fix rope_forward_triton error (#6404)
### What this PR does / why we need it?
The rope_forward_triton method reports an error.
For example:
```
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     q, k = rope_forward_triton(q, k, cos, sin, rope_dim=self.qk_rope_head_dim, is_neox_style=True)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]   File "/vllm-workspace/vllm-ascend/vllm_ascend/ops/triton/rope.py", line 155, in rope_forward_triton
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]     cos = cos.view(num_tokens, -1)
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822]           ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_DP0_TP1_EP1 pid=5298) ERROR 01-29 02:01:11 [multiproc_executor.py:822] RuntimeError: shape '[14, -1]' is invalid for input of size 768
```
This is because an incorrect num_tokens_padded was passed in.
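
The traceback can be reproduced with plain arithmetic: `view(num_tokens, -1)` only succeeds when the flattened size is divisible by `num_tokens`, which the padded count breaks (the unpadded count of 12 below is an assumption inferred from 768 elements):

```python
def infer_view_dim(total_elems: int, num_tokens: int) -> int:
    """Mimic cos.view(num_tokens, -1): the inferred dim only exists when
    total_elems is divisible by num_tokens."""
    if num_tokens <= 0 or total_elems % num_tokens != 0:
        raise RuntimeError(
            f"shape '[{num_tokens}, -1]' is invalid for input of size {total_elems}")
    return total_elems // num_tokens

print(infer_view_dim(768, 12))  # an unpadded token count works: 64 per token
try:
    infer_view_dim(768, 14)     # the padded count from the traceback fails
except RuntimeError as err:
    print(err)
```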

Related-RFC: https://github.com/vllm-project/vllm-ascend/issues/5449

Co-authored-by: @zhenwenqi2024
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-01-30 14:09:00 +08:00
ChenCangtao
46cee945b3 [doc][npugraph_ex]add npugraph_ex introduction doc (#6306)
### What this PR does / why we need it?
As part of the preparation work for the
[RFC](https://github.com/vllm-project/vllm-ascend/issues/6214)
We have added documentation for npugraph_ex, which mainly explains its
usage and FX graph optimization. The FX graph optimization section also
covers the default passes, how to implement custom fusion passes, and how
to capture the FX graph during optimization via environment variable
configuration.

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 11:21:37 +08:00
zhangxinyuehfad
1d661bb279 [Bugfix] Specify tensorflow version in accuracy test to avoid segmentation fault (#6292)
### What this PR does / why we need it?
Specify tensorflow version in accuracy test to avoid segmentation fault

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-01-30 09:28:24 +08:00
CodeCat
b2857de43f [ST]Add e2e test for Npugraphex_pass (#6388)
### What this PR does / why we need it?
We found that the custom passes of NPUGraphEX implement operator-fusion
features that still require E2E test validation as a guard. This PR adds
E2E test cases for the AddRMSNormQuant and SplitQKVNormRope operator
fusions under NPUGraphEX that are already in the codebase.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-30 09:14:07 +08:00
wjunLu
4970de4242 [CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195)
### What this PR does / why we need it?
Enable the tests that were skipped due to an outdated driver version:
- tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
- tests/e2e/multicard/4-cards/long_sequence/test_basic.py
- tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py

and some cases in
- tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
- tests/e2e/multicard/2-cards/test_external_launcher.py
- tests/e2e/multicard/2-cards/test_offline_weight_load.py
- tests/e2e/multicard/2-cards/test_quantization.py
- tests/e2e/multicard/4-cards/test_data_parallel_tp2.py

TODO:
- tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
- tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-29 22:41:41 +08:00
Li Wang
e35f304419 [CI] Auto partition for test cases (#6379)
### What this PR does / why we need it?
This patch adds an auto-partition feature for tests. Before this PR, the
e2e single-card test suite ran for 2h40min; with auto-partitioning, test
cases are automatically allocated into the required n partitions based on
their test durations (greedy strategy) and run in parallel, so the
overall test duration becomes roughly 1/n of the original.
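
The greedy strategy mentioned above can be sketched as follows: place the longest tests first, each into the currently least-loaded partition. This is a simplified stand-in for the allocation in run_suite.py, using estimated times taken from the sample output:

```python
import heapq

def greedy_partition(tests: dict[str, float], n: int) -> list[list[str]]:
    """Assign each test (longest estimated duration first) to the
    currently least-loaded partition, so the n partitions finish at
    roughly the same time."""
    parts: list[list[str]] = [[] for _ in range(n)]
    heap = [(0.0, i) for i in range(n)]  # (total_est_time, partition_id)
    heapq.heapify(heap)
    for name, est in sorted(tests.items(), key=lambda kv: -kv[1]):
        load, i = heapq.heappop(heap)
        parts[i].append(name)
        heapq.heappush(heap, (load + est, i))
    return parts

ests = {"test_v1_spec_decode.py": 1800, "test_mtp_eagle_correctness.py": 1500,
        "test_scoring.py": 500, "test_aclgraph_accuracy.py": 480,
        "test_vlm.py": 354, "test_guided_decoding.py": 354}
for part in greedy_partition(ests, 2):
    print(part)
```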

### Does this PR introduce _any_ user-facing change?
Before:
the e2e single-card test took 2h40min
After:
the e2e single-card test takes 1h13min

### How was this patch tested?

```shell
python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 0 
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=0, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite          | Partition          |
|----------------+--------------------|
| e2e-singlecard | 1/2 (0-based id=0) |
+----------------+--------------------+
 Enabled 13 test(s) (est total 4020.0s):
  - tests/e2e/singlecard/spec_decode/test_v1_spec_decode.py (est_time=1800)
  - tests/e2e/singlecard/test_aclgraph_accuracy.py (est_time=480)
  - tests/e2e/singlecard/test_guided_decoding.py (est_time=354)
  - tests/e2e/singlecard/test_batch_invariant.py (est_time=320)
  - tests/e2e/singlecard/pooling/test_embedding.py (est_time=270)
  - tests/e2e/singlecard/test_quantization.py (est_time=200)
  - tests/e2e/singlecard/test_llama32_lora.py (est_time=162)
  - tests/e2e/singlecard/test_cpu_offloading.py (est_time=132)
  - tests/e2e/singlecard/pooling/test_classification.py (est_time=120)
  - tests/e2e/singlecard/test_camem.py (est_time=77)
  - tests/e2e/singlecard/compile/test_norm_quant_fusion.py (est_time=70)
  - tests/e2e/singlecard/test_auto_fit_max_mode_len.py (est_time=25)
  - tests/e2e/singlecard/test_profile_execute_duration.py (est_time=10)

(base) wangli@Mac-mini vllm-ascend % python .github/workflows/scripts/run_suite.py --auto-partition-size 2 --auto-partition-id 1 
args=Namespace(timeout_per_file=2000, suite='e2e-singlecard', auto_partition_id=1, auto_partition_size=2, continue_on_error=False, enable_retry=False, max_attempts=2, retry_wait_seconds=60, retry_timeout_increase=600)
+----------------+--------------------+
| Suite          | Partition          |
|----------------+--------------------|
| e2e-singlecard | 2/2 (0-based id=1) |
+----------------+--------------------+
 Enabled 13 test(s) (est total 4025.0s):
  - tests/e2e/singlecard/spec_decode/test_mtp_eagle_correctness.py (est_time=1500)
  - tests/e2e/singlecard/pooling/test_scoring.py (est_time=500)
  - tests/e2e/singlecard/test_aclgraph_batch_invariant.py (est_time=410)
  - tests/e2e/singlecard/test_vlm.py (est_time=354)
  - tests/e2e/singlecard/test_models.py (est_time=300)
  - tests/e2e/singlecard/test_multistream_overlap_shared_expert.py (est_time=200)
  - tests/e2e/singlecard/test_sampler.py (est_time=200)
  - tests/e2e/singlecard/test_async_scheduling.py (est_time=150)
  - tests/e2e/singlecard/test_aclgraph_mem.py (est_time=130)
  - tests/e2e/singlecard/test_ilama_lora.py (est_time=95)
  - tests/e2e/singlecard/test_completion_with_prompt_embeds.py (est_time=76)
  - tests/e2e/singlecard/test_qwen3_multi_loras.py (est_time=65)
  - tests/e2e/singlecard/test_xlite.py (est_time=45)
```
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-29 20:28:10 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix the layerwise connector P/D request_id error caused by vLLM PR
https://github.com/vllm-project/vllm/pull/27987, which adds a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
Qiu
feab047084 [bugfix](pcp,gqa) set kv_inverse_idx_for_chunk and cp_kv_recover_idx_for_chunk to None when dcp only (#6317)
### What this PR does / why we need it?
We only do restore and recover for pcp, so we should set
`kv_inverse_idx_for_chunk` and `cp_kv_recover_idx_for_chunk` to `None`
when only using dcp.
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 19:35:52 +08:00
Qiu
50e0e87646 [bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)
### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, adopting a simpler
slicing approach. However, in the single-ops logic and in mixed P/D
batches, the decode slot_mapping still contained the -1 padding yet
shared the same slicing method, resulting in an incorrect slot_mapping.
This PR resolves the issue; the logic will be further consolidated in
subsequent refactoring PRs.
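The failure mode can be illustrated with a small sketch (all names are hypothetical; this is not the kernel's actual code): slicing a still-padded slot_mapping keeps the -1 entries, while filtering them out first yields the correct mapping.

```python
def decode_slot_mapping_sliced(slot_mapping, num_decode_tokens):
    # Buggy path: slicing assumes the -1 padding was already removed.
    return slot_mapping[:num_decode_tokens]

def decode_slot_mapping_filtered(slot_mapping, num_decode_tokens):
    # Correct path for mixed P/D batches: drop -1 padding before slicing.
    valid = [s for s in slot_mapping if s != -1]
    return valid[:num_decode_tokens]

padded = [10, -1, 11, 12, -1, 13]
assert decode_slot_mapping_sliced(padded, 4) == [10, -1, 11, 12]   # wrong
assert decode_slot_mapping_filtered(padded, 4) == [10, 11, 12, 13]  # right
```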

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 16:48:37 +08:00
Sergey-Zlobin
6a7b3bc29c Qwen3-VL-MoE EAGLE support for vLLM-Ascend (#6327)
### What this PR does / why we need it?
Qwen3-VL-MoE EAGLE support for vLLM-Ascend

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The patch was tested with the Qwen3-VL-30B-A3B-Instruct model.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Sergey_Zlobin <sirg_zlobin@mail.ru>
2026-01-29 16:44:30 +08:00
JiangWeixiang
41a52beb26 [bugfix] resolve kv cache leak on P-side due to incorrect req_id (#6325)
### What this PR does / why we need it?
This PR fixes a critical bug in the PD-separated inference pipeline
where KV cache on the Prefill (P) side was not being properly released.
The issue arises when multiple clients use the same x-request-id: to
avoid request ID collisions, both Prefill and Decode nodes append a
random suffix to the incoming x-request-id. A previous PR ensured
consistency by having the P-side pass its final request_id as
remote_request_id to the D-side via kv_transfer_param. However, during
KV cache cleanup, the D-side incorrectly used the local req_id (instead
of remote_request_id) to select the target P-side rank. This mismatch
caused the P-side KV cache to remain unreleased on certain ranks,
leading to memory leaks. This PR corrects the logic to use
remote_request_id consistently when determining the P-side rank.
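The corrected lookup can be sketched as follows (the rank table and parameter names are hypothetical stand-ins for the connector's real bookkeeping):

```python
def select_prefill_rank(kv_transfer_param, local_req_id, prefill_rank_table):
    # The P-side keyed its KV cache by its own (suffixed) request_id, which
    # it forwarded as remote_request_id. Looking up by the D-side local
    # req_id misses the entry, so the P-side cache is never freed.
    remote_id = kv_transfer_param.get("remote_request_id", local_req_id)
    return prefill_rank_table[remote_id]

table = {"req-1-abcd1234": 3}
param = {"remote_request_id": "req-1-abcd1234"}
# D-side local id differs from the P-side id, but the lookup still succeeds.
assert select_prefill_rank(param, "req-1-wxyz9999", table) == 3
```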
### Does this PR introduce _any_ user-facing change?
No. 
### How was this patch tested?
The fix was validated by running multiple concurrent benchmark instances.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: ghphotoframe <854746559@qq.com>
2026-01-29 16:05:56 +08:00
Nengjun Ma
597091be9f [Doc] Reranker guide remove deprecated task option (#6385)
### What this PR does / why we need it?
Remove the deprecated task option from the reranker guide.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-01-29 16:00:26 +08:00
wangxiyuan
7a5b345dc4 [Misc] Drop deepseek patch (#6288)
We previously patched DeepSeek because an AssertionError was raised by
transformers. After the transformers upgrade, the patch is no longer
needed, so let's remove it.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-29 14:45:50 +08:00
whx
39f8af9d96 [Main2Main][BugFix] Add shared_experts check for AscendSharedFusedMoE (#6335)
### What this PR does / why we need it?
PR https://github.com/vllm-project/vllm/pull/32082 in vLLM makes
Qwen3-MoE models also go through `SharedFusedMoE`, while the current
implementation of our `AscendSharedFusedMoE` assumes shared_experts
always exist. This PR adds checks to
`multistream_overlap_shared_expert` and `multistream_overlap_gate` so
that these features are only enabled when shared experts exist.
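A minimal sketch of the added guard (attribute and flag names are assumptions, not the module's actual signature):

```python
class AscendSharedFusedMoE:
    """Illustrative: overlap features are gated on shared experts existing."""
    def __init__(self, shared_experts=None,
                 multistream_overlap_shared_expert=False,
                 multistream_overlap_gate=False):
        self.shared_experts = shared_experts
        has_shared = shared_experts is not None
        # Qwen3-MoE now routes through SharedFusedMoE without shared
        # experts, so the overlap paths must check their presence.
        self.multistream_overlap_shared_expert = (
            multistream_overlap_shared_expert and has_shared)
        self.multistream_overlap_gate = (
            multistream_overlap_gate and has_shared)

moe = AscendSharedFusedMoE(shared_experts=None,
                           multistream_overlap_shared_expert=True,
                           multistream_overlap_gate=True)
assert not moe.multistream_overlap_shared_expert
assert not moe.multistream_overlap_gate
```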
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
All ci passed

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-01-29 08:47:20 +08:00
Li Wang
f0ff2cc22d [CI] hot fix for nightly image build tag (#6367)
### What this PR does / why we need it?
The base image of `releases/v0.13.0` should be tagged as
`releases/v0.13.0-**`

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-28 23:29:50 +08:00
InSec
86b6ecac4c [CI][BugFix] Import error fix. (#6293)
### What this PR does / why we need it?
Fix the **import error** in the qwen3-next nightly test.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: InSec <1790766300@qq.com>
2026-01-28 22:07:47 +08:00
hucong
df588ed488 [BugFix] Disable enable_shared_expert_dp by default if tensor_parallel_size=1 (#6361)
### What this PR does / why we need it?

Disable `enable_shared_expert_dp` by default when `tensor_parallel_size=1`.
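The default described above can be sketched like this (the resolver function and its parameters are illustrative, not the actual config code); an explicit user setting is assumed to win over the default:

```python
def resolve_enable_shared_expert_dp(user_value, tensor_parallel_size):
    """user_value: True/False if set explicitly, None to use the default."""
    if user_value is not None:
        return user_value               # explicit user choice wins
    return tensor_parallel_size > 1     # default: disabled for TP=1

assert resolve_enable_shared_expert_dp(None, 1) is False
assert resolve_enable_shared_expert_dp(None, 4) is True
assert resolve_enable_shared_expert_dp(True, 1) is True
```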


- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: underfituu <hzhucong@163.com>
2026-01-28 22:01:01 +08:00
Li Wang
8b0a7b6d80 [CI] Nightly tests use releases/v0.13.0 (#6355)
### What this PR does / why we need it?
The prerequisite PR is
https://github.com/vllm-project/vllm-ascend/pull/6353; this patch moves
the nightly tests to `releases/v0.13.0` by simply using the image built
from that branch.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-28 21:46:13 +08:00
Li Wang
501bb395b1 [CI] Fix image build (#6333)
Try to fix the scheduled image build CI.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-28 21:36:44 +08:00