1120 Commits

Author SHA1 Message Date
LI SHENGYONG
7cf285a77a [MOE Refactor] Remove QuantType in prepare_finalize.py (#6534)
### What this PR does / why we need it?
To prevent confusion between different QuantType classes, we remove**
QuantType in prepare_finalize.py

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:59:58 +08:00
LI SHENGYONG
34eecacace [EPLB] Avoiding eplb's dependency on a specified model (#6528)
### What this PR does / why we need it?
1. Currently, eplb registers different attributes for different models,
but these attributes are not actually used. Now, these attributes are
directly deleted.
2. Add some log about eplb.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
#### Deepseek v3.1 chat
Of course! Here is a comprehensive explanation of deep learning, broken
down for clarity.\n\n### The Simple Analogy: A Child Learning to
Recognize a Cat\n\nImagine teaching a child what a cat is. You don't
give them a rulebook with instructions like \"has pointy ears, whiskers,
and a tail.\" Instead, you show them many pictures, saying \"this is a
cat\" or \"this is not a cat.\" The child's brain gradually learns to
identify the complex patterns—the combination of shapes, colors, and
textures—that define \"cat-ness.\"\n\n**Deep learning is essentially
this, but for computers.** It's a method for teaching computers to learn
from examples and recognize patterns directly from data (like images,
sound, or text) without being explicitly programmed with rigid
rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a
subfield of machine learning, which itself is a subfield of artificial
intelligence (AI).** It uses artificial **neural networks** with many
layers (\"deep\" networks) to model and understand complex patterns in
data.\n\nHere are the key concepts in that definition:\n\n1.
**Artificial Intelligence (AI):** The broad science of making machines
smart and capable of performing tasks that typically require human
intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives
computers the ability to learn from data *without* being explicitly
programmed for every single rule.\n3. **Deep Learning (DL):** A
specific, powerful

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
Ronald
77305df398 implement batch invariant with ascendc (#6590)
### What this PR does / why we need it?
there are batch invariant ops implemented by triton and ascendc, this pr
aims to choose which kind of ops to be used to enable batch invariant.
#5487

### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-02-10 14:15:26 +08:00
Nengjun Ma
66b60c9440 [Refact]Refact MLA/SFA weight prefetch to consist with moe weight prefetch (#6629)
### What this PR does / why we need it?
1. [Refact] Refact MLA/SFA weight prefetch to consist with moe weight
prefetch
2. Remove duplicated o_proj weight prefetch in forward for MLA/SFA

### Does this PR introduce _any_ user-facing change?
NA

### How was this patch tested?

1) Performance result:
Perf test data:
*) MLA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 11.9669 token/s | 12.0287 token/s |
11.9978 |
| o_proj no duplicate prefetch | 12.5594 token/s | 12.6216 token/s |
12.5905 | 4.94%| |

single layer performace improve: 5%~8%

*) SFA:

| | 1st test | 2nd test | Output Token Throughput(Avg) | Performance
improvement percentage |
| --- | --- | --- | --- | --- |
| o_proj duplicate prefetch | 13.0523 token/s | 13.1084 token/s |
13.08035 | |
| o_proj no duplicate prefetch | 13.9844 token/s | 14.1678 token/s |
14.0761 | 7.6% |

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-10 14:14:37 +08:00
wangxiyuan
2a826b5fad [Misc] upgrade to vllm main (#6646)
### What this PR does / why we need it?
This PR upgrades the core vLLM dependency to a newer version from the
main branch (`13397841ab469cecf1ed425c3f52a9ffc38139b5`). This is
necessary to keep our project up-to-date with the latest features and
fixes from upstream vLLM.

1.
ac32e66cf9
pass file is moved.

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: wxsIcey <1790571317@qq.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Co-authored-by: wxsIcey <1790571317@qq.com>
2026-02-10 14:08:59 +08:00
yupeng
8d44ddacb0 [Test][LoRA] Add e2e test for base model inference (#6624)
### What this PR does / why we need it?

This PR adds an end-to-end test case to verify the correctness of base
model inference when LoRA is enabled. This is to ensure that after a
LoRA base model request issue was fixed, the functionality remains
correct and does not regress. The new test case calls `do_sample` with
`lora_id=0` to target the base model and asserts the output against
expected SQL queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI passed with the new test case. The test can be run with:
```bash
pytest -sv tests/e2e/singlecard/test_llama32_lora.py

Signed-off-by: paulyu12 <507435917@qq.com>
2026-02-09 21:06:49 +08:00
Qiu
cb7c419bc0 [Feat](sfa,dcp) support dcp for sfa (#6563)
### What this PR does / why we need it?
This PR adds DCP support to the SFA backend.

Please note that due to operator constraints, the current implementation
has to all-gather the entire KV cache and modify the block table to
satisfy the operator input requirements. This results in significantly
increased communication overhead and peak memory usage. Therefore, this
is only a temporary workaround and will be refactored once the operator
provides proper support.

Additionally, because of the above limitations,
`cp_kv_cache_interleave_size` is currently required to be equal to
`block_size`. This restriction will also be removed after the refactor.

#### Test
accuracy test using DeepSeek-V3.2-Exp-W8A8 with dp2tp8dcp8

| dataset | version | metric | mode | vllm-api-general-stream |
|----- | ----- | ----- | ----- | -----|
| gsm8kdataset | - | accuracy | gen | 96.35 |

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-02-09 18:52:25 +08:00
lhp-deep
d060c797ed [fix bug] fix tensor mismatch bug in sigmoid operate test case (#6619)
### What this PR does / why we need it?
This PR fixes a bug in the `test_triton_fusion_ops` test case. The test
compares a fused kernel (`fused_sigmoid_gating_delta_rule_update`) with
a split implementation. Both paths use a recurrent state tensor.

The bug was that the state tensor was being modified in-place by the
fused kernel call, and this modified tensor was then reused for the
split implementation path. This led to an incorrect comparison and test
failure.

This fix ensures that each path starts with an identical, clean initial
state by creating separate tensors. It also changes the state
initialization from `torch.randn` to `torch.ones` to make the test
deterministic.

### Does this PR introduce _any_ user-facing change?
No, this change only affects a test case and has no user-facing impact.
### How was this patch tested?
The fix is applied directly to the test case. The CI passing for
`test_fused_sigmoid_gating_delta_rule.py` will confirm that the fix is
working as expected.
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
2026-02-09 16:43:27 +08:00
Canlin Guo
b7aa511daa [Patch] Remove the patch of MiniCPM (#5975)
### What this PR does / why we need it?

Part of #5304.

After https://github.com/vllm-project/vllm/pull/32523 merge, we could
remove the patch of `MiniCPMAttention`.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Test it locally.

- vLLM version: v0.13.0
- vLLM main:
2c24bc6996

---------

Signed-off-by: gcanlin <canlinguosdu@gmail.com>
2026-02-09 14:07:44 +08:00
pu-zhe
1cc225711d [Refactor]310p_e2e test case update (#6539)
### What this PR does / why we need it?
This pull request significantly enhances the test suite by adding new
end-to-end test cases for Qwen3 models on the 310P hardware platform.
The primary goal is to ensure the stability and correctness of these
models under diverse operational conditions, including various
parallelism strategies, data types, and quantization methods.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-07 09:28:37 +08:00
pu-zhe
4f33e25046 [Refactor]refactor 310p attention impl and add ut (#6579)
### What this PR does / why we need it?
This pull request significantly refactors the attention mechanism for
the Ascend 310P hardware, enhancing its architecture by separating mask
generation concerns from the core attention implementation. It
introduces a dedicated mask builder class capable of handling various
mask types, including causal, splitfuse, and sliding window attention
masks, all optimized for the NPU's fractal data format. This change not
only cleans up the codebase but also lays the groundwork for more robust
and feature-rich attention operations on Ascend devices, backed by new,
extensive unit tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
E2E test with qwen3 and qwen3-moe
- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-07 09:26:26 +08:00
pu-zhe
23524f2ca4 [Refactor]refactor 310p ops and add ut (#6591)
### What this PR does / why we need it?
This pull request focuses on a significant refactoring effort within the
vllm-ascend project, specifically targeting operations optimized for the
Ascend 310P hardware. The changes aim to streamline the implementation
of core components like quantization and multi-head attention, making
the codebase more maintainable and robust. Concurrently, new unit tests
have been introduced to ensure the correctness and reliability of these
refactored modules.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
E2E test with qwen3-32b w8a8

- vLLM version: v0.15.0
- vLLM main:
d7e17aaacd

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-02-07 09:25:17 +08:00
wangxiyuan
6c49f95da2 [Ops][Refactor] Remove custom rotary_embedding operator (#6523)
### What this PR does / why we need it?
This PR removes the custom `rotary_embedding` operator and its
associated C++ kernel implementation, PyTorch bindings, and tests.

The codebase now falls back to using the native
`torch_npu._npu_rotary_embedding` implementation. This change simplifies
the codebase by removing custom, platform-specific kernel code and
relying on the standard NPU library implementation, which is presumably
more optimized and easier to maintain.

### Does this PR introduce _any_ user-facing change?
No. This is an internal refactoring and does not introduce any
user-facing changes.

### How was this patch tested?
The tests for the custom `rotary_embedding` operator have been removed
along with the operator itself. The correctness of the fallback to the
native `torch_npu` implementation is verified by existing CI tests for
attention layers and models that use rotary embeddings.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-07 09:24:05 +08:00
wangyu
c63b7a1188 [Test] Add initial multi modal cases of Qwen2.5-VL-7B-Instruct for disaggregated encoder (#5301)
### What this PR does / why we need it?
This PR adds disaggregated encoder  tests for Qwen2.5-VL-7B-Instruct 
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
by running the test
by running ci

- vLLM version: release/v0.12.0

---------

Signed-off-by: wangyu31577 <wangyu31577@hundsun.com>
Signed-off-by: wangyu <53896905+yenuo26@users.noreply.github.com>
Co-authored-by: wangyu31577 <wangyu31577@hundsun.com>
2026-02-06 17:30:17 +08:00
Li Wang
d018aeb5fa [Image] Bump mooncake version to v0.3.8.post1 (#6428)
### What this PR does / why we need it?
This patch bump the mooncake version to the latest
[release](https://github.com/kvcache-ai/Mooncake/releases/tag/v0.3.8.post1)
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
test is locally
>>> from mooncake.engine import TransferEngine
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-02-06 10:54:03 +08:00
Nengjun Ma
11339eb48a [CI] Update UT CANN version to 8.5.0 for main branch (#6564)
### What this PR does / why we need it?
Update UT CANN version to 8.5.0

### Does this PR introduce _any_ user-facing change?
NA


- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-06 10:28:42 +08:00
zhangxinyuehfad
81f3c09d6d [CI] Change A2 runner (#6557)
### What this PR does / why we need it?

This PR updates the CI runner from `linux-aarch64-a2-*` to
`linux-aarch64-a2b3-*` in various test configuration files. This change
is necessary to adapt to updates in the CI infrastructure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The changes are configuration updates for CI tests. The correctness will
be verified by the CI pipeline.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 23:43:57 +08:00
meihanc
922e5c163b [main2main] upgrade vllm main 0202 (#6560)
### What this PR does / why we need it?
1. Fix `TypeError: FusedMoEParallelConfig.__init__() missing 1 required
positional argument: 'is_sequence_parallel'` due to
https://github.com/vllm-project/vllm/pull/32567
2. Fix ` TypeError: '>' not supported between instances of 'MagicMock'
and 'int'` due to https://github.com/vllm-project/vllm/pull/33035
3. Fix `TypeError: Can't instantiate abstract class AscendMLAImpl with
abstract methods forward_mha, forward_mqa` and AttributeError: 'bool'
object has no attribute 'process_weights_after_loading' due to
https://github.com/vllm-project/vllm/pull/33284
4. Fix `'AscendSharedFusedMoE' object has no attribute
'_routed_input_transform'`due to
https://github.com/vllm-project/vllm/pull/32790
5. Fix `NPUModelRunner._dummy_run() got an unexpected keyword argument
'num_active_loras'` due to
https://github.com/vllm-project/vllm/pull/32005
6. Fix the problem caused by` 'tuple' object has no attribute 'job_id'`
due to https://github.com/vllm-project/vllm/pull/27492
7. Fix the problem that all_moe_layers is not equal to vllm.moe_forward,
vllm.moe_forward_shared due to
https://github.com/vllm-project/vllm/pull/33184
8. Add patch to fix the problem "got multiple values for keyword
argument 'add_special_tokens'" due to
https://github.com/vllm-project/vllm/pull/32863
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: hfadzxy <starmoon_zhang@163.com>
2026-02-05 19:31:17 +08:00
ChenCangtao
2c1608265b [CI][npugraph_ex]Fix npugraph ex e2e test (#6553)
### What this PR does / why we need it?
When running the Qwen3-0.6B model using the npugraph_ex backend, the
last few characters of the generated results changed. We have modified
the relevant test cases to ensure the CI runs smoothly.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-02-05 14:03:10 +08:00
Wang Kunpeng
13c4a9c78b [bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (#6491)
### What this PR does / why we need it?

This PR fixes an accuracy issue that occurs when using Prefill/Decode
Context Parallelism (PCP/DCP) in conjunction with speculative decoding
(MTP). The issue is caused by an irregular attention mask shape when
both features are enabled.

The fix involves flattening the `block_table` for speculative decoding
requests under PCP/DCP to ensure a regular attention mask. This PR also
introduces a `use_cp` property for cleaner code and updates dummy runs
to handle this scenario correctly.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix that improves accuracy and should not have
user-facing API changes.

### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-02-05 10:06:14 +08:00
Yizhou
2ee4f23f28 [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6475)
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to
satisfy FIA/TND constraint #6459 (commit
5b0a6bcfe9)" and fixes a check in
`model_runner_v1`.

**A key change is that we remove the strict assertion in the latest
commit, as it turns out MLA + PIECEWISE will slice during computing,
leaving our assertion uncalled for and will only cause false alarm.**

This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, which prevents kernel
mismatches or failures and ensure correct shapes for FIA/TND execution
in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-02-04 21:11:08 +08:00
starmountain1997
bfcc372f75 [CI] Add long and short prompt tests for DeepSeek-V3.2 (#6499)
### What this PR does / why we need it?

This PR enhances the test_deepseek3_2_w8a8_pruning_mtp_tp2_ep E2E test
by adding both short and long prompt test cases:
- Short test: Validates basic functionality with minimal input ("Hello
")
- Long test: Validates the model can handle prompts near its maximum
context length (~163K tokens, approaching the max_position_embeddings
limit of 163,840)
Additionally, explicitly sets max_model_len=163840 to ensure the test
properly exercises the model's full context window capability.
### Does this PR introduce _any_ user-facing change?

No. This change only affects internal E2E testing infrastructure.  

### How was this patch tested?

The modified test case will be executed as part of the E2E test suite
and has been validated
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21620195055/job/62308026205?pr=6499).



- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-04 09:10:50 +08:00
Nengjun Ma
78fad4e348 [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (#6442)
### What this PR does / why we need it?
Refactor MLP weight prefetch to consistency with MoE Model's prefetching
in terms of code and usage.
Environments VLLM_ASCEND_ENABLE_PREFETCH_MLP,
VLLM_ASCEND_MLP_DOWN_PREFETCH_SIZE and
VLLM_ASCEND_MLP_GATE_UP_PREFETCH_SIZE is removed, usage as following:

--additional-config '{"weight_prefetch_config": { "enabled": true,
"prefetch_ratio": {"mlp": { "gate_up": 1.0, "down": 1.0} }}}'

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: leo-pony <nengjunma@outlook.com>
2026-02-04 09:08:18 +08:00
whx
4d6444d5fd [Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (#6505)
### What this PR does / why we need it?
Remove kv_cache nz test case for test_mla_preprocess_nq.py. This case is
added by https://github.com/vllm-project/vllm-ascend/pull/3072 but has
not been tested on bf16 scenario. Results show that this is not
currently supported.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed with existing test.


- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: whx-sjtu <2952154980@qq.com>
2026-02-03 18:26:51 +08:00
Feng Liu
03a18ad6fd [E2E] add E2E for Prefix Caching cp & Chunked Prefill cp (#5149)
### What this PR does / why we need it?
Add E2E for Prefix Caching cp & Chunked Prefill cp 
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?

- vLLM version: v0.15.0
- vLLM main:
ad32e3e19c

---------

Signed-off-by: F.Liu <liufeng248@huawei.com>
Signed-off-by: Feng Liu <46866849+ader47@users.noreply.github.com>
Co-authored-by: F.Liu <liufeng248@huawei.com>
2026-02-03 15:04:14 +08:00
LeeWenquan
b1de6cbb31 [Bugfix][CI]Add qwen3Next MTP+Full Decode (#6047)
### What this PR does / why we need it?
Fix a bug in the repo and add a test case for MTP + Full Decode Only +
Qwen3Next.
The _build_dummy_attn_metadata function in NPUModelRunner seems losed a
query_star_loc.copy_to_gpu operation, which will lead to difference
between query_start_loc and query_start_loc_cpu, and they are required
to be same in MTP + Full Decode Only + Qwen3Next case.

Before this pr:
`self.query_start_loc = [0, 0, 0, 0, ... , 0]
self.query_start_loc_cpu = [0, 2, 4, 6, ... ,128]`

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: SunnyLee219 <3294305115@qq.com>
2026-02-03 14:26:21 +08:00
Shaoxu Cheng
39e77fb9e4 [Feat.]: support 310p w8a8 (#6454)
### What this PR does / why we need it?
Introduced 310P W8A8 Quantization Support: New modules and methods have
been added to enable W8A8 static quantization specifically for the
Ascend 310P platform.
Platform-Specific Quantization Configuration Loading: The system now
dynamically loads the appropriate quantization configurations
(AscendCompressedTensorsConfig, AscendModelSlimConfig) based on whether
the current hardware is an Ascend 310P device.
Implemented AscendW8A8LinearMethod310P: A dedicated linear quantization
method for 310P is provided, handling the specifics of weight and
activation quantization, including input parameter broadcasting and
weight data manipulation.
Extended AscendModelSlimConfig for 310P: A specialized configuration
class for 310P integrates the new W8A8 linear method for both standard
linear layers and vocabulary parallel embeddings, ensuring proper
quantization application.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Signed-off-by: Shaoxu Cheng <2906339855@qq.com>
2026-02-03 14:13:06 +08:00
lidenghui1110
79803932e2 [Kernel] Add AscendC fused op transpose_kv_cache_by_block to speed up GQA transfer (#6366)
### What this PR does / why we need it?
As #2947 describe, we need to transpose kv cache layout after GQA kv
transfer when prefill and decode tensor parallel size are heterogeneous,
in the previous implementation, we use `npu_paged_cache_load ` +
`tranpose` + `_npu_reshape_and_cache` to do this work.

But obviously, it is not an efficient plan, the ops above need to be
called for each layer, which introduces 3 * layer_num kernel launch, and
6 * layer_num data movement between L1 Cache and HBM for one request on
decode node. Usually, decode node uses graph mode, so these op kernels
will be called between decode forward launched by an async thread in
mooncacke connector, this kernels maybe last for several decode forward
and TTFT will increase by 3~4 decode forward time.

In this PR, we implement an AscendC fused op
`transpose_kv_cache_by_block` to do this with only once kernel launch
and move data between L1 Cache and HBM only once.

After using this fused op, the time cost in transpose kv cacke layout
can be decreased to 0.24ms from 7ms in UT on 910C, and in PD
disaggregation scenario, TTFT can decrease about 90 ~ 110 ms in
qwen3-235B.

| request_num | original | fused_op|
|:----------------------:|:---------------:|:-------------------:|
|           1            |      643 ms      |        578 ms        |
|          128           |     1480 ms      |       1368 ms        |

### Does this PR introduce _any_ user-facing change?
Use fused op by default, incase the op has bug in any scenario, provide
fallback choice using env to disable it.

**DISABLE fused op by add following env**
`export VLLM_ASCEND_FUSION_OP_TRANSPOSE_KV_CACHE_BY_BLOCK=0`

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: lidenghui <lidenghui1110@gmail.com>
2026-02-03 14:10:01 +08:00
guanguan0308
dffac6db73 [Refactor] Add expert processed token count output for DispatchFFNCombine/DispatchFFNCombineBF16 (#6402)
### What this PR does / why we need it?
Add New Output for Expert Token Count
An additional output tensor expert_token_nums is added to both operators
to meet the requirement of tracking token distribution among experts:

Tensor Name: expert_token_nums
Dimension: 1D tensor
Shape: (local_expert_num,)
Data Type: int32
Semantics: Represents the number of tokens actually received by each
expert on the current card.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: guanguan0308 <1546542263@qq.com>
Signed-off-by: guanguan0308 <162653673+guanguan0308@users.noreply.github.com>
2026-02-03 10:41:06 +08:00
starmountain1997
b6256e8bc9 Revert "[CI] fix DS3.2 single node cudagraph_sizes config (#6241)" (#6497)
# What this PR does / why we need it?
This PR reverts commit 8134146ab6, which
modified the DeepSeek V3.2 (W8A8) single-node nightly test
configuration. as there is no limit between tp_size and MTP.
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
N/A for a revert PR. The changes restore the previously known working
configuration.
- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-03 08:42:58 +08:00
meihanc
c08364f761 [Bugfix] Fix intermittent kv_port conflict with AscendDirectTransport (#6455)
### What this PR does / why we need it?

When using Mooncake on Ascend NPU, AscendDirectTransport randomly
allocates ports within range `[20000, 20000 + npu_per_node × 1000)`.
Reference:
[ascend_direct_transport.cpp#L554](https://github.com/kvcache-ai/Mooncake/blob/v0.3.7.post2/mooncake-transfer-engine/src/transport/ascend_transport/ascend_direct_transport/ascend_direct_transport.cpp#L475)

If `kv_port` overlaps with this range, users may encounter intermittent
startup failures:
```bash
zmq.error.ZMQError: Address already in use (addr='tcp://x.x.x.x:30012')
RuntimeError: KV Cache sending/receiving thread failed to start.
```
This pr fix intermittent kv_port conflict with AscendDirectTransport in
`Qwen3-235B-W8A8-EPLB.yaml`, and add Added `kv_port Configuration Guide`
section in `pd_disaggregation_mooncake_multi_node.md`.

test
Results(tests/e2e/nightly/multi_node/config/Qwen3-235B-W8A8-EPLB.yaml):
https://github.com/vllm-project/vllm-ascend/actions/runs/21540138907/job/62073265259

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-02-02 17:31:21 +08:00
LHXuuu
45a573cff1 [Quantization][Feature] Support compressed tensors moe w4a8 dynamic weight (#5889)
### What this PR does / why we need it?

While using the LLM Compressor quantization tool from the VLLM community
to generate quantized weights, the VLLM Ascend engine needs to be
adapted to support the compressed tensors quantization format.

1. Support Moe model W4A8 dynamic weight.

- vLLM version: v0.13.0
- vLLM main:
bde38c11df

---------

Signed-off-by: LHXuuu <scut_xlh@163.com>
Signed-off-by: menogrey <1299267905@qq.com>
Co-authored-by: menogrey <1299267905@qq.com>
2026-02-02 16:39:32 +08:00
wangxiyuan
eeedf7c503 [Main2Main][Deps][Misc] Upgrade vLLM to v0.15.0 (#6470)
### What this PR does / why we need it?
This PR upgrades the vLLM dependency from `v0.14.1` to `v0.15.0`. This
involves:
- Updating the `VLLM_TAG` in all `Dockerfile`.
- Updating the vLLM version in `docs/source/conf.py`.
- Removing conditional code paths specific to `v0.14.1` across the
codebase, which simplifies maintenance.
- Fix `TypeError: MMEncoderAttention.__init__() got an unexpected
keyword argument 'multimodal_config'` due to
https://github.com/vllm-project/vllm/pull/31972.
- Fix `_shared_experts: 'NoneType' object is not callable` due to
https://github.com/vllm-project/vllm/pull/32082 by
https://github.com/vllm-project/vllm-ascend/pull/6335.
- Fix `ReshapeAndCacheOperation setup failed!` due to
https://github.com/vllm-project/vllm/pull/25954 by overriding attention
metadata slots.

This upgrade is necessary to keep the project aligned with the latest
features, bug fixes, and API changes in the vLLM project.

### Does this PR introduce _any_ user-facing change?
No, this is an internal dependency update and does not introduce any
user-facing changes.

### How was this patch tested?
CI is expected to pass with these changes, ensuring that all existing
tests are successful with the new vLLM version.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8


co-authored-by: shen-shanshan <467638484@qq.com>

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-02 15:57:55 +08:00
starmountain1997
8134146ab6 [CI] fix DS3.2 single node cudagraph_sizes config (#6241)
# What this PR does / why we need it?
This PR fixes the single-node nightly test for DeepSeek V3.2 (W8A8)
model to ensure CI stability. The changes include:
1. Simplified nightly test matrix (nightly_test_a3.yaml):
- Temporarily reduced to only run deepseek3_2-w8a8 test case for
debugging
- Changed trigger from schedule/workflow_dispatch to support
push/pull_request for faster iteration
2. Updated DeepSeek V3.2 test configuration
(test_deepseek_v3_2_w8a8.py):
- Adjusted cudagraph_capture_sizes from [3, 6, 9, 12] to [8, 16, 24, 32]
for better performance
- Increased max-num-seqs from 4 to 8
- Increased gpu-memory-utilization from 0.92 to 0.98
- Increased num_speculative_tokens from 2 to 3
3. Added PR checkout step (_e2e_nightly_single_node.yaml):
- Added ability to checkout a specific PR (#6241) for testing
# Does this PR introduce any user-facing change?
No. This PR only affects CI/CD test configurations and does not
introduce any user-facing changes.
# How was this patch tested?
Mock nightly test has passed, see
[here](https://github.com/vllm-project/vllm-ascend/actions/runs/21574655952/job/62159656622?pr=6241).

<img width="1053" height="714" alt="a2f2ee359febb13e1f6330b1bd3c116b"
src="https://github.com/user-attachments/assets/3262ad0f-adec-4c71-871f-d9cf2db06fbc"
/>


- vLLM version: v0.14.1
- vLLM main:
d68209402d

---------

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
2026-02-02 11:47:32 +08:00
wangxiyuan
b4aafd4293 [Core][Misc] Clean up ProfileExecuteDuration (#6461)
### What this PR does / why we need it?
This PR removes the custom `ProfileExecuteDuration` utility and its
usages across the codebase. This utility was used for profiling
execution duration of different stages in the inference process. It is
replaced by the standard `vllm.v1.utils.record_function_or_nullcontext`,
which integrates with PyTorch's profiler.

This change simplifies the code by removing a custom implementation in
favor of an upstream utility, improving maintainability. Associated
documentation and tests for `ProfileExecuteDuration` are also removed.

### Does this PR introduce _any_ user-facing change?
`VLLM_ASCEND_MODEL_EXECUTE_TIME_OBSERVE` env is removed now.

### How was this patch tested?
CI passed. The changes are a cleanup and replacement with a standard
utility. Existing tests cover the functionality. The removed feature had
its own tests which are also removed.

Related RFC: #5304

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-02-01 20:06:01 +08:00
Li Wang
5b0a6bcfe9 [ModelRunner] Revert "[Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6459)
This reverts commit 56f5d3bd49.

### What this PR does / why we need it?
The patch https://github.com/vllm-project/vllm-ascend/pull/6357 which
break the functionality availability in the spec_decode scenario, let's
revert and make CI happy first
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-01-31 16:33:34 +08:00
Qiu
638cae824d [bugfix](CP) Fix and unify the PD request discrimination logic. (#5939)
### What this PR does / why we need it?
Since the PR (https://github.com/vllm-project/vllm/pull/32118) has
modified the criteria for judging Prefill and Decode requests in vLLM,
PCPManager needs to synchronize with this standard. As PCPManager
involves multiple calculations of PD request counts, this PR attempts to
consolidate the related logic and update the PD request count once per
batch.

### How was this patch tested?
```bash
pytest tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
```

- vLLM version: v0.13.0
- vLLM main:
11b6af5280

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-31 10:26:02 +08:00
Yizhou
56f5d3bd49 [Fix] Pads query_start_loc to satisfy FIA/TND constraint (#6357)
### What this PR does / why we need it?
This handles both uniform and mixed batches (by inserting a dummy
request for mixed batches), consolidates ad-hoc padding into a single
helper, copies the updated buffer to the device, and asserts the layout
constraint before building the attention metadata. Together, these
changes prevent kernel mismatches or failures and ensure correct shapes
for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design
was to include it in `_prepare_inputs`, but that doesn’t work because it
must run after padding. While I’d prefer to minimize the impact and
reuse as much of the base class as possible in the future, it doesn’t
seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2026-01-30 16:41:44 +08:00
ChenCangtao
f2990f7741 [e2e Test][npugraph_ex]add static kernel e2e test case (#6320)
### What this PR does / why we need it?
Added an E2E test case for the scenario of enabling a static kernel for
npugraph_ex, monitoring its compilation and unloading process.
Also fixed the previously existing spelling errors

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: chencangtao <chencangtao@huawei.com>
Co-authored-by: chencangtao <chencangtao@huawei.com>
2026-01-30 16:24:48 +08:00
liziyu
d252e4f5ec [P/D] Using the cache load operator to replace the index select operator. (#6295)
### What this PR does / why we need it?
Using the cache load operator to replace the index select operator.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: liziyu <liziyu16@huawei.com>
2026-01-30 14:27:53 +08:00
CodeCat
b2857de43f [ST]Add e2e test for Npugraphex_pass (#6388)
### What this PR does / why we need it?
We found the custom passes of NPUGraphEX have implemented fusion
operator features, which still require E2E test case validation and
guard. This PR implements E2E test cases for the AddRMSNormQuant and
SplitQKVNormRope operator fusions under NPUGraphEX that are already in
the codebase.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: cjian <2318164299@qq.com>
2026-01-30 09:14:07 +08:00
wjunLu
4970de4242 [CI] Enable the skipped cases when HDK is upgraded to 25.5.0 (#6195)
### What this PR does / why we need it?
Enable the tests that were skipped due to an outdated driver version:
- tests/e2e/multicard/4-cards/long_sequence/test_accuracy.py
- tests/e2e/multicard/4-cards/long_sequence/test_basic.py
- tests/e2e/multicard/4-cards/long_sequence/test_chunked_prefill.py

and some cases in
- tests/e2e/multicard/2-cards/spec_decode/test_spec_decode.py
- tests/e2e/multicard/2-cards/test_external_launcher.py
- tests/e2e/multicard/2-cards/test_offline_weight_load.py
- tests/e2e/multicard/2-cards/test_quantization.py
- tests/e2e/multicard/4-cards/test_data_parallel_tp2.py

TODO:
- tests/e2e/multicard/4-cards/spec_decode/test_mtp_qwen3_next.py
- tests/e2e/multicard/4-cards/long_sequence/test_mtp.py
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.0
- vLLM main:
d68209402d

Signed-off-by: wjunLu <wjunlu217@gmail.com>
2026-01-29 22:41:41 +08:00
zxr2333
14bd55f30c [P/D][BugFix] Fix layerwise P/D request_id error (#6360)
### What this PR does / why we need it?
Fix layerwise Connector P/D request_id error, due to vllm pr:
https://github.com/vllm-project/vllm/pull/27987, which will add a random
suffix to request_id in EngineCore.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-01-29 20:19:05 +08:00
Qiu
50e0e87646 [bugfix](CP,MLA) fix wrong slot_mapping of decode for mixed p/d batch (#6344)
### What this PR does / why we need it?
PR #5672 attempted to remove the -1 padding for duplicate tokens in the
decode slot_mapping when adapting PCP for MLAPO, and adopted a simpler
slicing approach. However, in the single-ops logic and mixed PD batches,
the decode slot_mapping did not eliminate the -1 and also shared the
slicing method, resulting in incorrect slot_mapping. This PR resolves
this issue, and the logic will be further consolidated in subsequent
refactoring PRs.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
2026-01-29 16:48:37 +08:00
InSec
86b6ecac4c [CI][BugFix] Import error fix. (#6293)
### What this PR does / why we need it?
Fix the **import error** of qwen3-next nightly test.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: InSec <1790766300@qq.com>
2026-01-28 22:07:47 +08:00
linfeng-yuan
e25ee65729 [Misc][Test] add e2e test for apply_top_k_top_p_custom kernel (#6348)
### What this PR does / why we need it?
Add e2e test case for apply_top_k_top_p_custom kernel and eliminate
chinese comments.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
pytest passed.

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-01-28 17:25:57 +08:00
dsxsteven
325cb16e3f [BugFix][CI]Fix DeepSeek-R1-W8A8-longseq nightly CI (#6297)
### What this PR does / why we need it?
The precision issue arose because the kv cache of the p-node had not
been fetched for an extended period(>6min) and was forcibly freed. To
avoid this problem, the batch size was reduced and the timeout period
has also been extended.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: dsxsteven <dsxsteven@sina.com>
2026-01-28 16:36:24 +08:00
wangxiyuan
f8e76a49fa [CI] Upgrade trasnformers version (#6307)
Upgrade transformers to >=4.56.4

- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2026-01-28 14:06:39 +08:00
Wang Kunpeng
c498cea22d [refactor] refactor excute_model and _dymmy_run method (#6043)
### What this PR does / why we need it?
The structure of the `excute_model` and `_dymmy_run` methods in
NPUModelRunner differs greatly from that in GPUModelRunner.
Achieve alignment with GPUModelRunner:
Split the `_prepare_inputs` method into `_prepare_inputs`,
`_determine_batch_execution_and_padding`, `_build_attention_metadata`,
and `_preprocess`.
Modify `_generate_process_reqs_hidden_states` to `_model_forward`.
Align the implementation of the `postprocess` phase

**Related-RFC**: https://github.com/vllm-project/vllm-ascend/issues/5449

**Co-authored-by**: @zhenwenqi2024 
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main:
d68209402d

---------

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-27 22:27:01 +08:00
pu-zhe
21b6779a33 [UT]: refactoring 310p ops ut (#6296)
### What this PR does / why we need it?
Refactor swiglu and rms_norm unittest case for 310P and 910B.
Apply attention_v1 get_kv_cache_shape and build metadata on all of
platforms

### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
CI UT test
- vLLM version: v0.14.1
- vLLM main:
dc917cceb8

---------

Signed-off-by: pu-zhe <zpuaa@outlook.com>
2026-01-27 16:31:51 +08:00