Commit Graph

1152 Commits

Author SHA1 Message Date
zhenwenqi2024
ad9b711f89 [Bugfix] fix dcp_only bug and add e2e accuracy test for dcp only and pcp only (#5565)
### What this PR does / why we need it?
Fix the dcp_only bug and add e2e accuracy tests for the DCP-only and
PCP-only cases.
This PR fixes the accuracy-test failure that occurs when
decode_parallel_size>1 and prefill_context_parallel_size=1.
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main:
7157596103

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-06 22:48:21 +08:00
Fager10086
77a029979e Revert "[BugFix][Fusion] Fix graph fusion failure problem (#5253)" (#5667)
### What this PR does / why we need it?

Revert PR #5253 to fix the smoke test failure.

### Does this PR introduce _any_ user-facing change?

Does not.

### How was this patch tested?

It was tested in the failure case.

Signed-off-by: Rifa <865071616@qq.com>
2026-01-06 21:55:47 +08:00
liziyu
330e25ab1d [P/D] Performance enhancement of Layerwise connector in TP asymmetric scenarios (#5540)
### What this PR does / why we need it?
[P/D] Performance enhancement of Layerwise connector in TP asymmetric
scenarios
1. Session fusion: For transmission tasks at each layer, aggregate
transmission tasks with the same destination and merge them into a
single task for assignment.
2. Alltoall aggregation: For TP asymmetric scenarios, perform all
alltoall operations at once according to the block granularity for all
requests.
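
The session fusion step in item 1 can be sketched as a group-by on the destination. This is a minimal illustration only: `TransferTask`, its fields, and `fuse_by_destination` are hypothetical names, not the connector's actual types.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TransferTask:
    """Hypothetical per-layer transfer task (illustrative, not the real type)."""
    dest: str                    # destination endpoint
    block_ids: tuple[int, ...]   # KV cache blocks to send


def fuse_by_destination(tasks: list[TransferTask]) -> list[TransferTask]:
    """Merge all tasks that share a destination into one task per destination."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for task in tasks:
        grouped[task.dest].extend(task.block_ids)
    # dicts preserve insertion order, so destinations keep first-seen order.
    return [TransferTask(dest, tuple(blocks)) for dest, blocks in grouped.items()]


tasks = [
    TransferTask("decode-0", (0, 1)),
    TransferTask("decode-1", (2,)),
    TransferTask("decode-0", (3,)),
]
fused = fuse_by_destination(tasks)
```

Here three tasks collapse into two, one per destination, so each destination receives a single assignment per layer.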

[RFC]: CDCP Scheduling for Disaggregated Prefilling with KV Cache
Layerwise Push Support
https://github.com/vllm-project/vllm-ascend/issues/4842
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: nwpu-zxr <zhouxuerong2@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-01-06 20:25:36 +08:00
InSec
089ca2ddcc [Nightly][Test] Add Qwen3-Next-80B-A3B-Instruct-W8A8 nightly test (#5616)
### What this PR does / why we need it?
There was an accuracy issue with the **Qwen3-Next-80B-A3B-Instruct-W8A8**
model in older versions of **Triton-Ascend**, so we are adding a
nightly test to guard against regressions.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: IncSec <1790766300@qq.com>
2026-01-06 17:36:00 +08:00
Mercykid-bash
29e2f9a43e Bugfix: Align expert map shapes with redundant experts in EPLB adjustment (#5285)
#### Overview
This PR fixes a shape mismatch bug between `expert_placement_map` and
`log2phy_expert_map` when **redundant experts** are enabled in the
vLLM-Ascend platform. The issue occurred during the initialization of
expert maps and their updates via EPLB (Expert Load Balancer)
adjustment, leading to potential tensor shape errors and incorrect
expert routing in distributed MoE deployments.

#### Key Changes
1. **Unify expert map shape calculation logic**
- Ensure the shape of `expert_placement_map` and `log2phy_expert_map`
strictly aligns with the total number of experts (including redundant
experts) during initialization.
- Update the shape adjustment logic in EPLB dynamic update process to
match the initial expert map dimensions.

2. **Add shape consistency checks**
- Add assertion statements to verify the shape consistency of the two
maps after initialization and EPLB adjustment, preventing silent shape
mismatches in subsequent operations.
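
A shape consistency check of this kind might look like the sketch below. It assumes, purely for illustration, that both maps are laid out as `[num_layers][num_experts + num_redundant_experts]`; the helper names and the nested-list representation are hypothetical stand-ins for the real tensors.

```python
def shape_of(table):
    """Return (rows, cols) of a rectangular nested list."""
    rows = len(table)
    cols = len(table[0]) if rows else 0
    if any(len(row) != cols for row in table):
        raise ValueError("table is not rectangular")
    return rows, cols


def check_expert_maps(expert_placement_map, log2phy_expert_map,
                      num_experts, num_redundant_experts):
    """Assert both maps agree with each other and with the total expert count."""
    total = num_experts + num_redundant_experts
    placement_shape = shape_of(expert_placement_map)
    log2phy_shape = shape_of(log2phy_expert_map)
    assert placement_shape == log2phy_shape, (
        f"shape mismatch: {placement_shape} vs {log2phy_shape}")
    assert placement_shape[1] == total, (
        f"expected {total} expert slots, got {placement_shape[1]}")


# Assumed layout: 2 layers, 4 logical experts + 2 redundant experts.
placement = [[0] * 6 for _ in range(2)]
log2phy = [[0] * 6 for _ in range(2)]
check_expert_maps(placement, log2phy, num_experts=4, num_redundant_experts=2)
```

Running the same check after initialization and after each EPLB update turns a silent mismatch into an immediate assertion failure.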

#### Impact
- Resolves tensor shape errors when using redundant experts with EPLB on
Ascend platform.
- Ensures correct expert routing and load balancing for MoE models with
redundant expert configurations.
- No breaking changes to existing functionality; compatible with
non-redundant expert deployments.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-06 17:22:36 +08:00
Zetong Li
fe3f2c7702 [Refactor][EAGLE] 3/N delete redundant methods in mtp_proposer (#5420)
### What this PR does / why we need it?
This PR aims to delete redundant methods in mtp_proposer. All the
deleted methods now can be found in eagle_proposer. We also remove some
methods in eagle_proposer since they are identical to those in
vllm-eagle.

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2026-01-06 16:47:39 +08:00
Shanshan Shen
b94d589769 [MM][Bugfix] Update hf_config to hf_text_config (#5319)
### What this PR does / why we need it?

Following https://github.com/vllm-project/vllm-ascend/pull/5205, update
`hf_config` to `hf_text_config`.

Find more details at
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3675417534
and
https://github.com/vllm-project/vllm-ascend/pull/5205#issuecomment-3677920872.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: shen-shanshan <467638484@qq.com>
2026-01-06 16:41:39 +08:00
Qiu
e07938047e [UT][PCP&DCP] UT for block_table.py (#5032)
## Purpose
This PR adds unit tests for the `compute_slot_mapping` function in
`block_table.py`, covering various combinations of `pcp_size`,
`dcp_size`, and `cp_kv_cache_interleave_size`.

## Test Plan
```
pytest tests/ut/worker/test_block_table.py
```
## Test Result
```
==== 3 passed, 2 warnings in 0.20s ====
```
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-06 11:19:25 +08:00
wjunLu
3cf059a72b [Main2Main] Upgrade vllm commit to 0105 (#5595)
### What this PR does / why we need it?

Upgrade vllm commit to 0105 (8be6432bdaf6275664d857b1e5e9bf8ed1ce299e)

1. Remove `maybe_padded_num_tokens` arg in `model_runner_v1.py` since
https://github.com/vllm-project/vllm/pull/31517 deleted unused arg

2. Remove dense `Qwen/Qwen3-0.6B` in
`tests/e2e/multicard/test_aclgraph_capture_replay.py` and
`tests/e2e/multicard/test_data_parallel.py` due to
https://github.com/vllm-project/vllm/pull/30739
where offline data parallel mode will not be supported/useful for dense
models

3. Adapt `vllm_ascend/worker/worker.py` due to
https://github.com/vllm-project/vllm/pull/31584

4. Adapt `self.block_size` calling due to
https://github.com/vllm-project/vllm/pull/31540

5. Modify `test_mla_v1.py` due to
https://github.com/vllm-project/vllm/pull/28454 , which refactored
`get_head_size()`

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-06 08:44:29 +08:00
Li Wang
c5e2f48510 [CI] mv ops to correct path (#5615)
### What this PR does / why we need it?
Move ops to the correct path:
`tests/e2e/nightly/single_node/ops/singlecard_ops/triton`

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-05 23:17:07 +08:00
dsxsteven
129ba9fe1b [BugFix] Fix Smoke Testing Bug for DSR1 longseq (#5613)
### What this PR does / why we need it?
Fix Smoke Testing Bug for DSR1 longseq
We need this change because the daily smoke test case throws an error:
"max_tokens or max_completion_tokens is too large: 32768. This model's
maximum context length is 32768 tokens and your request has 128 input
tokens". The error occurs because max-out-len equals max-model-len,
which leaves no room for the input tokens. We fix it by increasing the
max-model-len argument in the script.
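
The arithmetic behind the error can be checked with a small sketch. The function name is hypothetical; the numbers come from the error message quoted above.

```python
def fits_context(num_input_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """A request fits only if prompt plus requested output stays within the context window."""
    return num_input_tokens + max_tokens <= max_model_len


# Values from the error message: 128 input tokens, 32768 requested, 32768 limit.
before = fits_context(128, 32768, 32768)        # 128 + 32768 exceeds 32768

# Raising max-model-len by at least the prompt length makes the request fit.
after = fits_context(128, 32768, 32768 + 128)
```

Any max-model-len of at least 32896 would satisfy this particular request; the value actually chosen in the script is not stated here.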
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?


- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-05 22:40:28 +08:00
Angazenn
11e75494b1 [TRITON][TEST]Add nightly test for triton split_qkv_rmsnorm_rope (#5267)
### What this PR does / why we need it?
Add nightly test for triton split_qkv_rmsnorm_rope

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-01-05 21:35:37 +08:00
ZT-AIA
58e8d19c35 [UT]add triton ops ut : test_fused_qkvzba_split_reshape_cat (#5474)
### What this PR does / why we need it?
[UT] Add Triton ops UT: test_fused_qkvzba_split_reshape_cat
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
pytest -sv tests/ut/ops/test_fused_qkvzba_split_reshape_cat.py
- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: ZT-AIA <1028681969@qq.com>
2026-01-05 20:05:07 +08:00
Icey
e7b623b363 [BugFix][Fusion] Fix graph fusion failure problem (#5253)
Currently, the vllm pull request
(https://github.com/vllm-project/vllm/pull/24252) is causing operator
fusion to fail. This issue was previously fixed by patching the backend.
The root cause has been identified, and the problem can be resolved with
this pull request.

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: wxsIcey <1790571317@qq.com>
2026-01-05 17:49:09 +08:00
wujinyuan1
4a3663327b [Refactor]7/N Extract common code to common_cp (#5490)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
Eliminate code duplicated between two files (mla_cp.py and
attention_cp.py) by extracting it into common_cp.py.

vLLM version: 0.13.0rc3
vLLM main:
ad32e3e19c

vLLM version: release/v0.13.0
vLLM main:
5fbfa8d9ef

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Signed-off-by: wujinyuan1 <wujinyuan1@huawei.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
2026-01-05 17:41:12 +08:00
Yizhou
755caeb06e [Feat][Spec] Optimize token index calculation in spec decode with Triton kernel (#5356)
### What this PR does / why we need it?
Replace multiple PyTorch operations with a fused Triton kernel to
determine token indices for sampling during speculative decoding. This
reduces kernel launch overhead and memory traffic, improving overall
performance on Ascend hardware.

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-05 16:51:29 +08:00
daniel
8ffe3f5d78 feat: implement high-performance Triton kernels for rejection sampling: optimization for rejection_random_sample_kernel (#5259)
### What this PR does / why we need it?

This PR introduces an optimized Triton implementation of
rejection_random_sample_kernel that outperforms the existing Triton
implementation. The new kernel maintains full functional accuracy while
delivering significant performance improvements across most batch sizes
and MTP configurations.

### Does this PR introduce _any_ user-facing change?

Yes, this PR modifies rejection_sampler.py to use optimized Triton
kernels:
rejection_random_sample_kernel is modified and optimized

### How was this patch tested?
performance benchmark results:

Batch Size | MTP | Original implementation (us) | Optimized version (us)
-- | -- | -- | --
1 | 1 | 2.934 | 3.64
8 | 1 | 4.467 | 4
32 | 1 | 6.98 | 4.54
64 | 1 | 11.087 | 6.42
128 | 1 | 13.414 | 7.84
256 | 1 | 19.66 | 8.487
512 | 1 | 39.908 | 11.62
1024 | 1 | 81.781 | 18.16
2048 | 1 | 137.923 | 32.934
1 | 2 | 3.4 | 4.02
8 | 2 | 3.74 | 4.24
32 | 2 | 6.373 | 7.394
64 | 2 | 9.747 | 6.46
128 | 2 | 12.98 | 7.76
256 | 2 | 20.834 | 9.787
512 | 2 | 39.314 | 13.56
1024 | 2 | 83.135 | 22.387
2048 | 2 | 157.563 | 40.607
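
To read the benchmark as speedups, one can divide the original time by the optimized time for each row. The sketch below does this for three rows copied from the table; the variable names are illustrative.

```python
# (batch size, MTP) -> (original us, optimized us), copied from the table above.
timings = {
    (1, 1): (2.934, 3.64),
    (2048, 1): (137.923, 32.934),
    (2048, 2): (157.563, 40.607),
}

# Speedup > 1 means the optimized kernel is faster.
speedups = {cfg: orig / opt for cfg, (orig, opt) in timings.items()}

for (batch, mtp), speedup in speedups.items():
    print(f"batch={batch} mtp={mtp}: {speedup:.2f}x")
```

The gain grows with batch size, while at batch size 1 the optimized kernel is slightly slower than the original.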


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: 1024daniel <xxltju324@gmail.com>
2026-01-05 16:03:02 +08:00
Trunrain
91bf524364 [BugFix][kernel] fix matmul_allreduce_add_rmsnorm_kernel (#5335)
### What this PR does / why we need it?
Fix matmul_allreduce_add_rmsnorm_kernel and add the HCCL Init and
SetCcTiling interfaces.
The test case uses multicard-4.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
pytest -sv tests/e2e/nightly/ops/test_matmul_allreduce_add_rmsnorm.py
multicard-4 pass

https://github.com/vllm-project/vllm-ascend/actions/runs/20502630658/job/58914474652?pr=5335



- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

Signed-off-by: tongrunze <t00574058@china.huawei.com>
Co-authored-by: tongrunze <t00574058@china.huawei.com>
2026-01-05 15:19:54 +08:00
weiguihua2
549be94397 [Bugfix] fix pcp + eplb error (#5561)
### What this PR does / why we need it?
Fix bugs that occur when PCP is combined with other features:

1. Fix the bug in the PCP and EPLB combination by including the PCP
size in the world_size calculation.
2. In the PCP pooling scenario, add a prompt for setting
cp_kv_cache_interleave_size.

- vLLM version: v0.13.0
- vLLM main:
7157596103

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-01-05 14:08:11 +08:00
lilinsiman
52863c4165 [Refactor][EAGLE] 2/N: load model and generate token (#5437)
### What this PR does / why we need it?
1. Refactor eagle and mtp function: load_model and generate_token_ids
2. Remove redundant code in mtp and eagle file
3. Refactor the UT of file

2/N of Refactor and merge mtp and eagle
Related RFC: https://github.com/vllm-project/vllm-ascend/issues/5467

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut and tests

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-01-05 14:07:54 +08:00
pichangping
50e7934415 MLA prefill performance optimization (#5456)
### What this PR does / why we need it?
Because the performance of the _npu_ring_mla operator deteriorates in
long-sequence scenarios, the long sequence is split into shorter
sequences before being passed to the operator.
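
The index bookkeeping for such a split can be sketched as follows. This is only an illustration: `split_sequence` and the chunk size are hypothetical, and the sketch does not cover how attention state is carried across chunks in the real operator call.

```python
def split_sequence(seq_len: int, max_chunk: int) -> list[tuple[int, int]]:
    """Split [0, seq_len) into consecutive half-open ranges of at most max_chunk tokens."""
    if max_chunk <= 0:
        raise ValueError("max_chunk must be positive")
    return [(start, min(start + max_chunk, seq_len))
            for start in range(0, seq_len, max_chunk)]


# A 10-token sequence with a hypothetical chunk limit of 4 tokens.
chunks = split_sequence(10, 4)
```

Each `(start, end)` pair selects one shorter sub-sequence; the last chunk may be shorter than the limit.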

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: pichangping <1337510399@qq.com>
2026-01-05 11:41:59 +08:00
Magnus
2b5536362a [CI] skip xlite-decode-only e2e test (#5407)
### What this PR does / why we need it?
skip xlite-decode-only e2e test, since it's unstable

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774

Signed-off-by: changdawei1 <changdawei3@huawei.com>
2026-01-05 11:05:26 +08:00
LookAround0301
d25a2c20c5 [Bugfix] Fix chunk prefill bug for long_sequence feature (#5444)
### What this PR does / why we need it?
Fix chunk prefill bug for long_sequence feature

When there are two requests with chunk prefill enabled in the
long-sequence scenario, if one request has only 1 token during
scheduling, it will be identified as a decode request and trigger an
error. This PR fixes the issue.
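
The misclassification can be reproduced in a small sketch. The `Req` fields and both predicates are hypothetical and are not taken from the PR; they only contrast a token-count heuristic with a check on prompt progress, which is one possible way to avoid the problem.

```python
from dataclasses import dataclass


@dataclass
class Req:
    """Hypothetical scheduling view of one request."""
    num_prompt_tokens: int
    num_computed_tokens: int    # tokens already processed in earlier steps
    num_scheduled_tokens: int   # tokens scheduled in this step


def is_decode_naive(req: Req) -> bool:
    # Heuristic that breaks: a 1-token prefill chunk looks like a decode step.
    return req.num_scheduled_tokens == 1


def is_decode(req: Req) -> bool:
    # Progress-based check: decode starts only once the whole prompt is computed.
    return req.num_computed_tokens >= req.num_prompt_tokens


# Last prefill chunk of a 9-token prompt: 8 tokens done, 1 token scheduled.
tail_chunk = Req(num_prompt_tokens=9, num_computed_tokens=8, num_scheduled_tokens=1)
# A genuine decode step: the prompt is fully computed.
decode_step = Req(num_prompt_tokens=9, num_computed_tokens=9, num_scheduled_tokens=1)
```

The naive predicate labels `tail_chunk` as decode, while the progress-based one keeps it in prefill and still recognizes `decode_step`.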
Closes: https://github.com/vllm-project/vllm-ascend/issues/5445

- vLLM version: release/v0.13.0
- vLLM main:
81786c8774
---------
Signed-off-by: LookAround <lixushi@huawei.com>
2026-01-05 09:16:36 +08:00
Qiu
96775a27a8 [refactor](UT,PCP,DCP) refactor pcp&dcp patches in UTs (#5505)
### What this PR does / why we need it?
Refactor PCP & DCP patches in UTs: Merge and reuse communication groups
and communication function patches to reduce code duplication.
### Does this PR introduce _any_ user-facing change?
No
- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-05 09:05:45 +08:00
lidenghui1110
d462577504 [Recover] [Bugfix] support mtp kv transfer and pp partition by hand in kv transfer (#4892) (revert in #4981) (#5511)
PR #4892 was reverted in #4981; this PR restores it. We will track down
and fix the potential bug that breaks deepseek3.2 in the PD case.

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-01-04 16:49:33 +08:00
Qiu
7c210225a2 [Perf][PCP][DCP] add multi-stream for GQA to enable computation-communication overlap (#5382)
### What this PR does / why we need it?
This PR adds multi-stream for GQA to enable computation-communication
overlap. For chunked prefill, we reduce TTFT by approximately 4%.

### Does this PR introduce _any_ user-facing change?
No

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-04 16:33:18 +08:00
dsxsteven
37fd48bee5 [CI] Move longseq Nightly CI (#5577)
### What this PR does / why we need it?
move longseq nightly CI to correct path due to #5479 [1/N] Refactor
nightly test structure

Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-04 15:42:43 +08:00
drslark
363ac1b80f [Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477)
### What this PR does / why we need it?

Support full-graph mode with Qwen3-Next-MTP.

In detail, we adapted `AscendAttentionState.ChunkedPrefill` in both the
main model and the MTP model.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

We changed the test of Qwen3-Next-MTP in
`tests/e2e/multicard/test_qwen3_next.py` to make it a test of
`FULL_DECODE_ONLY`. Then run `pytest -s
tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`.

And this test passed.

```text
.

================================================================================================================================= warnings summary =================================================================================================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) =====================================================================================================================
```
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: drslark <slarksblood@qq.com>
2026-01-04 12:03:21 +08:00
dsxsteven
3c7e6c6817 [CI] Add multi-nodes longseq configs of DeepSeek-R1-W8A8 & Qwen3-235B-W8A8 (#5381)
### What this PR does / why we need it?
add DeepSeek-R1-W8A8 and Qwen3-235B-W8A8 configs in multi-nodes and
longseq (PCP&DCP) scenario

- vLLM version: release/v0.13.0
- vLLM main:
bc0a5a0c08
---------
Signed-off-by: daishixun <dsxsteven@sina.com>
2026-01-04 10:38:40 +08:00
CodeCat
80fc0f5b9e [Graph][Fusion] Add AddRMSNorm(with bias) (#5491)
### What this PR does / why we need it?
This PR builds upon PR #5011 and aims to further enhance the
npu_graph_ex_passes module. Based on prior work, we have added graph
optimization support for the add_rms_quant fused operator in scenarios
where a bias term is present—ensuring the fusion pattern is correctly
registered and matched into the computation graph.

For validation, we switched to the Qwen3-235B-A22B-W8A8 model. Benchmark
results show that, compared to the unfused baseline, enabling this
fusion pass significantly improves inference throughput for W8A8
quantized models.
For more details, refer to the RFC:
https://github.com/vllm-project/vllm-ascend/issues/4715

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
```
# Excerpt from the test script; `model`, `prompts`, `sampling_params`,
# and the other free names are defined elsewhere in that script.
llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,
    enable_expert_parallel=enable_expert_parallel,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.98,
    max_num_batched_tokens=512,
    # load_format="dummy",
    max_model_len=2048,
    max_num_seqs=16,
    quantization="ascend",
    additional_config={
        "refresh": True,
        "enable_npugraph_ex": True
    },
    compilation_config={
        "cudagraph_capture_sizes": [8, 16],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
if profile_dir:
    llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
    llm.stop_profile()
for i, output in enumerate(outputs):
    if i >= 5:
        break
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
        f"Generated text: {generated_text!r}"
    )
```
- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: cjian <2318164299@qq.com>
2025-12-31 17:10:26 +08:00
zxr2333
46a1614387 [P/D] Improve the performance of Layerwise Connector (#5303)
### What this PR does / why we need it?
Improve the performance of the Layerwise Connector through the
following changes:
1. Use event synchronization instead of stream synchronization.
2. Access the metaserver during scheduling.
3. Transfer the KV cache after each chunked-prefill segment.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By CI.
- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Co-authored-by: liziyu <liziyu16@huawei.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>
2025-12-31 15:09:01 +08:00
Jade Zheng
7d5242faca [Refactor] Formatting output types related to FuseMoE (#5481)
Currently in the Fused MoE module, functions of classes like
MoECommMethod and MoETokenDispatcher output data in dictionary or tuple
format, which hampers code maintainability, readability, and
extensibility. This PR introduces dataclasses for these key output types
to address these issues.
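
The difference between tuple-style and dataclass-style outputs can be sketched as below. `TokenDispatchResult` and its fields are hypothetical names chosen for illustration, not the dataclasses introduced by this PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TokenDispatchResult:
    """Hypothetical typed result; field names are illustrative only."""
    hidden_states: list[list[float]]
    group_list: list[int]
    dynamic_scale: Optional[list[float]] = None


def dispatch_as_tuple(tokens):
    # Tuple style: callers must remember what each position means.
    return tokens, [len(tokens)], None


def dispatch_as_dataclass(tokens):
    # Dataclass style: fields are named, typed, and easy to extend with defaults.
    return TokenDispatchResult(hidden_states=tokens, group_list=[len(tokens)])


tokens = [[0.1, 0.2], [0.3, 0.4]]
legacy = dispatch_as_tuple(tokens)
result = dispatch_as_dataclass(tokens)
```

With the dataclass, `result.group_list` replaces the positional `legacy[1]`, and adding a new optional field does not break existing callers.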

- vLLM version: v0.13.0
- vLLM main:
5326c89803

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-31 14:24:37 +08:00
Jade Zheng
38570cfeb6 [Feature] Support kv nz feature for DeepSeek decode node in disagg-prefill scenario (#3072)
By converting the KV cache from ND to NZ format when the decode node
receives it, this PR ensures that the KV NZ feature works correctly
during the decoding phase in disagg-prefill scenario.

- vLLM version: v0.11.0
- vLLM main:
83f478bb19

---------

Signed-off-by: Jade Zheng <zheng.shoujian@outlook.com>
Co-authored-by: ghphotoframe <854746559@qq.com>
Co-authored-by: alex101-ops <alex1015718386@gmail.com>
2025-12-31 14:24:04 +08:00
wangxiaochao6
a539ae753a [feature] mooncake support pcp/dcp in common conditions (#5224)
### What this PR does / why we need it?
1. This PR supports complex pcp/dcp parallelism combinations across
Prefill and Decode nodes in Mooncake, such as Prefill: TP8/PCP2DCP8 and
Decode: TP8/DCP4/DP2, which were not supported before. We establish the
link mappings used to transfer KVCache between prefill and decode
nodes. The main logic is implemented in the `_get_kv_split_metadata`
function in Mooncake_connector.py
2. After a decode rank pulls KVCache from a prefill rank, the decode
rank sends `DONE_RECVING_MSG` to the prefill rank, and the prefill rank
frees its KVCache blocks. With complex pcp/dcp parallelism, several
decode ranks can pull KVCache from the same prefill rank, which would
make the prefill rank free its KVCache blocks several times and could
cause memory issues. This PR solves the issue by counting how many
times each prefill rank will be pulled and freeing its KVCache blocks
only after the last pull. The related code is in the `run_busy_loop`
function in Mooncake_connector.py
3. If no decode rank pulls KVCache from a prefill rank, the first rank
in the decode node sends "DONE_RECVING_MSG" so that the prefill rank
frees its blocks. The related code is in the
`_send_done_signal_to_free_remote_port` function in Mooncake_connector.py
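
The counting scheme in item 2 amounts to reference counting on the prefill side. The sketch below is a simplified model under that assumption; `PrefillBlockTracker` and its methods are hypothetical and do not mirror the code in `run_busy_loop`.

```python
class PrefillBlockTracker:
    """Free a request's KV cache blocks only after the last expected pull."""

    def __init__(self):
        self._remaining_pulls = {}   # request id -> pulls still expected
        self.freed_requests = []     # stand-in for actually freeing blocks

    def register(self, request_id: str, expected_pulls: int) -> None:
        # A never-pulled rank still gets one done message (item 3), so >= 1.
        if expected_pulls < 1:
            raise ValueError("expected_pulls must be at least 1")
        self._remaining_pulls[request_id] = expected_pulls

    def on_done_receiving(self, request_id: str) -> bool:
        """Handle one done message; return True if the blocks were freed."""
        remaining = self._remaining_pulls[request_id] - 1
        if remaining > 0:
            self._remaining_pulls[request_id] = remaining
            return False
        del self._remaining_pulls[request_id]
        self.freed_requests.append(request_id)
        return True


tracker = PrefillBlockTracker()
tracker.register("req-0", expected_pulls=2)   # two decode ranks will pull
first = tracker.on_done_receiving("req-0")
second = tracker.on_done_receiving("req-0")
```

Only the second done message frees the blocks, so the prefill rank never frees the same blocks twice.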

### How was this patch tested?
This PR was tested with many pcp/dcp parallelism combinations, and the
accuracy was correct in all of them.
MLA model:
Prefill node:  TP8/DP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP8, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP8, Decode node: TP8/DCP4/DP2
Prefill node:  TP8/PCP2/DCP4, Decode node: TP4/DCP2/DP4
Prefill node:  TP8/PCP2/DCP2, Decode node: TP4/DCP4/DP4
Prefill node:  TP8/PCP2, Decode node: TP4/DCP2

GQA model:
Prefill node:  TP8/DP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP8/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP8/DCP2/DP2
Prefill node:  TP8/PCP2/DCP2, Decode node: TP4/DP4
Prefill node:  TP16/DCP2/PCP1, Decode node: TP8/DCP2/DP2


- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
- Co-author by: Daishixun dsxtsteven@sina.com

---------

Signed-off-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: wangxiaochao <w00642655@china.huawei.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
2025-12-31 09:53:03 +08:00
wjunLu
3c2d3e52e5 [Main2Main] Upgrade vllm commit to 1230 (#5495)
### What this PR does / why we need it?

Upgrade vllm commit to 1230

Affected by https://github.com/vllm-project/vllm/pull/27614 (and the
core PR https://github.com/vllm-project/vllm/pull/26866), we have to
make the following changes:

1. Modify `tests/e2e/multicard/test_aclgraph_capture_replay.py` to stay
compatible with both vLLM `v0.13.0` and the latest main commit, now
that vLLM enables async scheduling by default
2. Skip `test_guided_decoding.py` due to xgrammar errors
(https://github.com/vllm-project/vllm-ascend/issues/5524)

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

---------

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2025-12-31 09:44:35 +08:00
zhenwenqi2024
5d9fde9819 [Feature] Refactor PCP &DCP related code (#5214)
### What this PR does / why we need it?
Refactor PCP & DCP related code. We use a pcp_manager class to manage
PCP & DCP in one place. This lets us delete a lot of code from
model_runner and keeps other development work from breaking PCP & DCP.
RFC: https://github.com/vllm-project/vllm-ascend/issues/5449
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
2025-12-31 09:29:57 +08:00
lilinsiman
46862ce1af [main][test] Refactor the mtp and eagle test case (#5326)
### What this PR does / why we need it?
1. Refactor the current test with mtp and eagle cases
2. Add new necessary cases with mtp and eagle

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
ut

- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2025-12-31 09:22:58 +08:00
Li Wang
2ee17e50a1 [2/N] Upgrade nightly doc (#5534)
### What this PR does / why we need it?
Follow up https://github.com/vllm-project/vllm-ascend/pull/5479, upgrade
the corresponding doc for developers

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-31 09:11:42 +08:00
Li Wang
e760aae1df [1/N] Refactor nightly test structure (#5479)
### What this PR does / why we need it?
This patch is a series of refactoring actions, including clarifying the
directory structure of nightly tests, refactoring the config retrieval
logic, and optimizing the workflow, etc. This is the first step:
refactoring the directory structure of nightly to make it more readable
and logical.

- vLLM version: v0.13.0
- vLLM main:
5326c89803

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-30 19:03:02 +08:00
zzzzwwjj
71f729a661 Revert "moe_gating_top_k" (#5512)
Reverts vllm-project/vllm-ascend#5271

It breaks e2e test

- vLLM version: v0.13.0
- vLLM main:
45c1ca1ca1
2025-12-30 15:05:47 +08:00
ZCG12345
45c3c279e2 moe_gating_top_k (#5271)
1. What this PR does / why we need it?
This PR supports the moe_gating_top_k operator, which enables
post-positioned renormalization (renorm) on the basis of softmax.
2. Does this PR introduce any user-facing change?
No user-facing changes are required.
3. How was this patch tested?
This patch was tested with the test_npu_moe_gating_top_k test case.
vLLM version: release/v0.13.0
vLLM main:
ad32e3e19c

---------

Signed-off-by: ZCG12345 <2097562023@qq.com>
Signed-off-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
Co-authored-by: zzzzwwjj <34335947+zzzzwwjj@users.noreply.github.com>
2025-12-30 09:28:01 +08:00
weiguihua2
15d73f248e [refactor] refactor model runner capture model (#5230)
### What this PR does / why we need it?
Refactor the `capture_model` method in model_runner to directly reuse
the method from vLLM.

Currently, most of the logic in the capture_model method is similar to
that in the vLLM code. Directly using the vLLM method reduces the
maintenance cost of the vllm-ascend code. The changes are as follows:
1. Refactor the capture_model function to directly inherit the
community method.
2. Refactor the initialize_aclgraph_capture function and move it into
initialize_attn_backend.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-30 08:32:14 +08:00
jiazhengyi
d5f72835e6 [OP] add custom op aclnnMoeInitRoutingCustom (#5251)
### What this PR does / why we need it?

This pull request introduces a new custom operator
`aclnnMoeInitRoutingCustom` for Mixture-of-Experts models.
It can be replaced by `aclnnMoeInitRoutingV3` once CANN 8.5 becomes
available.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

---------

Signed-off-by: jiazhengyi <jiazhengyi@huawei.com>
Signed-off-by: Chenxi Qian <chenxi.qian.cq@outlook.com>
Co-authored-by: jiazhengyi <jiazhengyi@huawei.com>
Co-authored-by: Chenxi Qian <chenxi.qian.cq@outlook.com>
2025-12-29 19:29:40 +08:00
Zetong Li
92353c0643 [Refactor][EAGLE] 1/N delete __init__ in mtp_proposer (#5176)
### What this PR does / why we need it?
This PR aims to refactor eagle-related modules in vllm-ascend.

This is the starting PR of eagle refactoring. Given vllm-eagle,
ascend-eagle and ascend-mtp, we first let ascend-mtp inherit from
ascend-eagle and let ascend-eagle inherit from vllm-eagle. As an
initial step, we just delete `__init__` in mtp_proposer and simplify
the corresponding logic in eagle_proposer.

Based on "vllm-eagle <----- ascend-eagle <----- ascend-mtp", our target
is to gradually delete ascend-mtp and enable ascend-eagle to converge to
vllm-eagle. So the main workspace is eagle_proposer. In this way, we
hope that contributors can concurrently refactor eagle.

Incoming changes:
1. delete common methods in vllm-eagle & ascend-eagle & ascend-mtp
2. delete `load_model` in mtp_proposer
3. delete `dummy_run` and `propose` in mtp_proposer
4. ......

RFC: #5467
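
The inheritance chain described above can be sketched in plain Python. This is a minimal illustration only: the class names `VllmEagleProposer`, `AscendEagleProposer` and `AscendMtpProposer`, their constructor arguments, and the placeholder `propose` logic are all hypothetical and are not taken from vllm or vllm-ascend. The sketch shows only why the MTP class no longer needs its own `__init__` once it inherits from the eagle class.

```python
class VllmEagleProposer:
    """Hypothetical stand-in for the upstream vLLM EAGLE proposer."""

    def __init__(self, num_speculative_tokens: int) -> None:
        self.num_speculative_tokens = num_speculative_tokens

    def propose(self, last_token: int) -> list:
        # Placeholder drafting logic: emit consecutive token ids.
        return [last_token + i + 1 for i in range(self.num_speculative_tokens)]


class AscendEagleProposer(VllmEagleProposer):
    """Hypothetical Ascend subclass: adds only what differs on NPU."""

    def __init__(self, num_speculative_tokens: int, device: str = "npu") -> None:
        super().__init__(num_speculative_tokens)
        self.device = device


class AscendMtpProposer(AscendEagleProposer):
    """Hypothetical MTP subclass with no __init__ of its own."""

    method = "mtp"


# Construction goes through the inherited __init__ of the eagle classes.
mtp = AscendMtpProposer(num_speculative_tokens=2)
print(mtp.propose(10), mtp.device)
```

With this layout, shared setup lives in one place, and each further method removed from the MTP class falls back to the eagle implementation.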

### Does this PR introduce _any_ user-facing change?
N/A

### How was this patch tested?
by ci

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: Zetong Li <slippersss@126.com>
2025-12-29 16:25:52 +08:00
whx
28b7614322 [Refactor][Triton] Move reject sample triton kernels into ops/triton (#5324)
### What this PR does / why we need it?
This PR moves reject sample related triton kernels into `ops/triton`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with existing test.


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: whx-sjtu <2952154980@qq.com>
2025-12-29 16:15:41 +08:00
Ronald
e7e1a7dc05 [Feature] support eager mode in model runner v2 (#5210)
### What this PR does / why we need it?
#5051 only implemented a basic framework for model runner v2, and there
are still some bugs in its e2e functionality. This PR aims to enable the
basic functionality.
model runner v2 plans:
https://github.com/vllm-project/vllm-ascend/issues/5208

- vLLM version: release/v0.13.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2025-12-29 15:28:34 +08:00
ZongYuan Zhan
d8e15dae6c Optimize some rejectsampler functions to make npu op launch non-blocking (#4587)
### What this PR does / why we need it?
- Vectorize the loops (without changing the output) in several
rejectsampler functions, including `expand_pytorch`,
`sample_recovered_tokens_pytorch`, `rejection_random_sample_pytorch` and
`sample_recovered_tokens`.
- Remove the synchronous-launch torchnpu operators in these functions to
accelerate sampling + MTP postprocessing.
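
As an illustration of the kind of loop removal described above, the sketch below rewrites a per-request expansion loop as a per-token index lookup. It is written in pure Python for readability and is not the code from this PR: the function names `expand_loop` and `expand_by_lookup`, and the assumption that the input is a non-empty list of cumulative token counts, are hypothetical. In a tensor library the per-token lookup would typically be one batched search and gather (for example `torch.searchsorted` followed by indexing) rather than a Python loop.

```python
from bisect import bisect_right


def expand_loop(values, cu_num_tokens):
    """Reference version: one slice assignment per request."""
    out = [None] * cu_num_tokens[-1]
    start = 0
    for req_idx, end in enumerate(cu_num_tokens):
        out[start:end] = [values[req_idx]] * (end - start)
        start = end
    return out


def expand_by_lookup(values, cu_num_tokens):
    """Per-token formulation: each token position finds its request index
    independently of the others, which is the shape of computation that a
    batched search plus gather can execute without a per-request loop."""
    num_tokens = cu_num_tokens[-1]
    # bisect_right returns the first request whose cumulative count exceeds t.
    req_indices = [bisect_right(cu_num_tokens, t) for t in range(num_tokens)]
    return [values[i] for i in req_indices]


values = [7, 8, 9]         # one value per request
cu_num_tokens = [2, 2, 5]  # request 1 contributes zero tokens
result = expand_by_lookup(values, cu_num_tokens)
print(result)
```

Both functions return the same list, so the rewrite changes how the result is computed but not the result itself.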

### Does this PR introduce _any_ user-facing change?
- No

### How was this patch tested?
- We tested this change with the serve&bench command:
```
# ===== serve =====
vllm serve $LOCAL_CKPT_DIR \
        --host 0.0.0.0 \
        --port 8000 \
        --data-parallel-size 4 \
        --data-parallel-size-local 2 \
        --data-parallel-address $MASTER_NODE_IP \
        --data-parallel-start-rank $((2*VC_TASK_INDEX)) \
        --data-parallel-rpc-port 13387 \
        --tensor-parallel-size 8 \
        --seed 1024 \
        --enable-expert-parallel \
        --served-model-name $NAME \
        --max-model-len 4096 \
        --max-num-seqs 16 \
        --trust-remote-code \
        --gpu-memory-utilization 0.90 \
        $headless \
        --speculative_config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
        --additional-config '{"ascend_scheduler_config":{"enabled":false, "enable_chunked_prefill":true, "chunked_prefill_enabled":true}}'

# ===== bench =====
vllm bench serve --model $LOCAL_CKPT_DIR  --served-model-name DeepseekV3ForCausalLM \
--dataset-name spec_bench --spec-bench-output-len 2048 \
--dataset-path question.jsonl \
--top-p 1.0 --temperature 0.8 \
--ignore-eos \
--num-prompts 64  --trust-remote-code --base-url "http://0.0.0.0:8000" --request-rate 64
```
- In this case, our rejectsampler optimization reduces TPOT from 84.94ms
to 64.61ms, a reduction of about 24%.
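
The two JSON-valued flags in the serve command above are easy to mis-quote by hand. As a hedged convenience sketch (not part of this PR), the snippet below builds the same two arguments from Python dictionaries using only the standard library. The variable names are illustrative; the flag names and values are copied from the command above.

```python
import json
import shlex

speculative_config = {"method": "deepseek_mtp", "num_speculative_tokens": 1}
additional_config = {
    "ascend_scheduler_config": {
        "enabled": False,
        "enable_chunked_prefill": True,
        "chunked_prefill_enabled": True,
    }
}

# json.dumps emits valid JSON (lowercase true/false), and shlex.join adds
# the shell quoting needed to pass each JSON string as a single argument.
json_args = [
    "--speculative_config", json.dumps(speculative_config),
    "--additional-config", json.dumps(additional_config),
]
print(shlex.join(json_args))
```

The printed string can be appended to the rest of the `vllm serve` command line.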

## before
<img width="1068" height="830" alt="image"
src="https://github.com/user-attachments/assets/278ac878-b49d-4588-b87c-316ca4d537f5"
/>

## after
<img width="781" height="756" alt="image"
src="https://github.com/user-attachments/assets/0c6d37ad-ed77-40b3-a1be-4933c468365c"
/>

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: ZongYuan Zhan <zhanzy178@gmail.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
2025-12-29 14:10:39 +08:00
anon189Ty
3e67e8276c [Feature] Support to use fullgraph with eagle (#5118)
### What this PR does / why we need it?
    
This PR adds support for running eagle in full-graph mode.

Change list:
1. Distinguish between processing graph_params and draft_graph_params in
attention_v1.
2. Adapt eagle_proposer to full-graph mode:
   1) If full graph is used, create the full-graph wrapper when loading
   the model.
   2) In dummy_run, build new metadata, set the running mode to FULL and
   mark the attention update when in full-graph mode.
   3) Fix and fill in attn_metadata fields, such as
   attn_metadata.slot_mapping.
   4) Add a descriptor.
   5) Set the running mode and trigger the metadata update.
3. Change is_mtp_model to is_draft_model, and add the update of the
workspace.

NOTE:
When async_scheduling=True is set, the draft model is forced to run in
eager mode.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: anon189Ty <Stari_Falcon@outlook.com>
Co-authored-by: Yizhou Liu <liu_yizhou@outlook.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
2025-12-29 09:54:51 +08:00
LI SHENGYONG
f81cf694b2 [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (#5311)
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when redundant experts are enabled is incorrect.
The community's map generation function only accepts the number of
global experts. When we pass in the number of logical experts plus
redundant experts, the local expert IDs of the last card index into
expert IDs that do not exist. We now ensure that every index points to
an existing expert ID and that each expert can be accessed. Moreover,
when redundant experts are not enabled, the output of our function
remains consistent with the community's function.
2. The map we generate is based on the number of physical experts, but
in reality we only need the number of logical experts. Later on, we
will need to pad it accordingly, so we can simply generate a map with
the length of the logical experts.
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.

**Before refactoring**

-   map path is not None:

expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

-   map path is None:

expert map: determine_expert_map from _'vllm.layer'_. The function does
not support the redundant experts of vllm-ascend.

log2phy: determine_default_log2phy_map from _'eplb_utils.py'_. The
function does not support the redundant experts of vllm-ascend.

**Refactoring**

- eplb_utils.py
  - init_eplb_config
    - generate placement
    - generate expert map
    - generate log2phy

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  1  2  3  4  5  6  7  8
  9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

Expert Mapping Test Generation:
ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
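
The rank-15 maps above can be reproduced with a small sketch. This is not the `init_eplb_config` implementation: the helper names `build_placement` and `expert_map_for_rank` are hypothetical, and the rule used to pick redundant experts (each rank also hosts copies of the experts that follow its own slice, wrapping around) is an assumption chosen only because it matches the rank-15 output shown here. It also assumes that the expert counts divide evenly by the EP size.

```python
def build_placement(num_logical, num_redundant, ep_size):
    """Return, for each rank, the logical expert ids it hosts, in local order."""
    base = num_logical // ep_size     # experts owned by each rank
    extra = num_redundant // ep_size  # redundant copies per rank
    placement = []
    for rank in range(ep_size):
        start = rank * base
        own = list(range(start, start + base))
        # Assumed rule: redundant copies come from the slice after this
        # rank's own experts, wrapping so every id is an existing expert.
        redundant = [(start + base + j) % num_logical for j in range(extra)]
        placement.append(own + redundant)
    return placement


def expert_map_for_rank(placement, rank, num_logical):
    """Map of length num_logical: local id if hosted on this rank, else -1."""
    expert_map = [-1] * num_logical
    for local_id, logical_id in enumerate(placement[rank]):
        expert_map[logical_id] = local_id
    return expert_map


with_redundancy = expert_map_for_rank(build_placement(256, 16, 16), 15, 256)
without_redundancy = expert_map_for_rank(build_placement(256, 0, 16), 15, 256)
print(with_redundancy[0], with_redundancy[240:])
```

Because the map has the length of the logical experts, every entry refers to an existing expert, and the case without redundant experts reduces to the plain contiguous split.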

dsr1 baseline:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |

dsr1 eplb:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-29 09:26:14 +08:00
wujinyuan1
23169021d9 [Refactor]6/N Extract common code of class AscendMLAImpl (#5314)
RFC: https://github.com/vllm-project/vllm-ascend/issues/4629
Reason:
Eliminate duplicate code between the Impl classes in two files
(mla_v1.py and mla_cp.py).

vLLM version: 0.13.0rc3
vLLM main:
ad32e3e19c


- vLLM version: release/v0.13.0
- vLLM main:
5fbfa8d9ef

---------

Signed-off-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: wujinyuan1 <wjy9595@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-28 10:40:45 +08:00