Commit Graph

2773 Commits

Author SHA1 Message Date
pz1116
14411e911e [Doc][0.18.0][KV Pool]add mooncake rdma timeout (#7784)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
- Add `default_kv_lease_ttl` to the `mooncake.json` example in the KV
Pool guide.
- Document `default_kv_lease_ttl` semantics and clarify that it should
be larger than `ASCEND_CONNECT_TIMEOUT` and `ASCEND_TRANSFER_TIMEOUT`.
- Add `HCCL_RDMA_TIMEOUT` explanation for Mooncake RDMA retransmission
timeout, including the recommended constraint note.
- Add `HCCL_RDMA_TIMEOUT=16` to relevant KV Pool environment setup
examples for consistency.

<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
2026-03-31 20:17:03 +08:00
wangyibo1005
a63dd5868d [0.18.0][cherry-pick][BugFix]Fix compilation errors for operators dispatch_gmm_combine_decode/moe_combine_normal/moe_dispatch_normal (#7844)
**What this PR does / why we need it?**
pick from https://github.com/vllm-project/vllm-ascend/pull/7114

Fix compilation errors encountered when building versions later than
b020 for the following operators:
dispatch_gmm_combine_decode, moe_combine_normal, moe_dispatch_normal
**Root Cause**
After the b020 version update, the original moe_distribute_base.h file
was updated and its definitions changed, which caused compilation
failures for the above three operators that depend on this file.
**Solution**
We have added a dedicated copy of moe_distribute_base.h into the
implementation of these three operators, ensuring stable compilation
independent of framework version updates.

**Does this PR introduce any user-facing change?**
No. There are no user-facing changes; this fix only resolves compilation
issues without affecting functionality or user behavior.

**How was this patch tested?**
vLLM version: releases/v0.18.0

Signed-off-by: Wangyibo1005 <2633333316@qq.com>
2026-03-31 19:58:46 +08:00
linfeng-yuan
ed4ef1f4e7 [releases/v0.18.0][Triton][Sampler] Add penalty-related Triton kernel for better performance of penalties (#7794)
### What this PR does / why we need it?
Implement get_token_bin_counts_and_mask and apply_penalties with
Triton-Ascend kernels. This significantly reduces latency of the
sampling process when repetition/frequency/presence penalties are
enabled.

Cherry-pick from main PR #7569 
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
2026-03-31 19:01:51 +08:00
wangxiaoteng888
82e26b5a6e [BugFix][v0.18.0]Adjust request map pop time (#7857)
### What this PR does / why we need it?
Adjust request map pop time.This pull request optimizes the KV cache
transfer mechanism by streamlining how requests are tracked and cleaned
up. By removing unnecessary mapping structures and adjusting the timing
of request removal, the system achieves more efficient state management
during the transfer process.
pick-from:https://github.com/vllm-project/vllm-ascend/pull/7855


### How was this patch tested?
By ci
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
2026-03-31 18:55:36 +08:00
ZT-AIA
66db070423 [cherry-pick][Test]repair for test_compute_slot_mapping (#7836)
### What this PR does / why we need it?
repair for test_compute_slot_mapping

Signed-off-by: ZT-AIA <1028681969@qq.com>
2026-03-31 16:52:58 +08:00
zhangxinyuehfad
af4278be35 [v0.18.0][CI] Close build image by pr (#7776)
### What this PR does / why we need it?
Close build image by pr

This PR is related to
https://github.com/vllm-project/vllm-ascend/pull/7775, please merge them
together

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-31 16:38:43 +08:00
jack
7314bbe2df fix(platform): reimplement MiniMax usage accounting patch (#7835)
## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting

## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`

This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.

For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.

## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-31 16:27:00 +08:00
Wangbei25
4f259d4fd8 [Performance]Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder (#7737)
### What this PR does / why we need it?
Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder and add doc
for DeepSeekOCR2.md

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
- vllm 0.18.0
- vllm-ascend main

1. _create_custom_4d_mask during 141ms49us620ns -->
_create_npu_optimized_mask during 1ms227us780ns
2. convd2d : 27ms --> matmul <1ms
3. relposattention:sdpa->prompt_flash_attention

---------

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-31 14:49:29 +08:00
liuchenbing2026
2a0a588311 [0.18.0][BugFix] Disable block verify to avoid incorrect verification on NPU … (#7839)
…(#7603)

### What this PR does / why we need it?
Block verify uses cumprod(target_probs / draft_probs) for joint
acceptance. Suffix/ngram methods have
draft_probs=None, the fallback draft_token_probs=1.0 with cumprod is not
equivalent to per-token
verification, causing incorrect accept/reject results. Fix:
using_block_verify = max_spec_len >= 3 and draft_probs is not None.
MTP/Eagle3 unaffected.

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
2026-03-31 09:36:48 +08:00
zxr2333
ab928ed586 [v0.18.0][P/D][Feature]Layerwise connector supports Mamba prefill prefix caching (#7796)
### What this PR does / why we need it?
Mooncake layerwise connector supports Mamba prefix caching on prefiller
nodes.

### Does this PR introduce _any_ user-facing change?
Yes. Use `--enable-prefix-caching` and `--mamba-cache-mode align` to
enable mamba align mode prefix caching on P/D prefill nodes. This
function does not supports on decode nodes now.

### How was this patch tested?
By P/D E2E test.

---------

Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
2026-03-31 09:25:22 +08:00
linfeng-yuan
cab5d73633 [releases/v0.18.0][BugFix] Fix server init error when set max_num_seqs not a multiple of tp while FLASHCOMM is on (#7832)
### What this PR does / why we need it?
Current version will run into init error when user set max_num_seqs to
number not a multiple of tp size. The reason is that we will first find
out the valid size of sequence parallelism, and then remove numbers that
are not the multiple of tp size. This may cause an error when we set a
max_num_seqs above a multiple of 8 before a multiple of tp size, say
when the tp size is 16 and the max_num_seqs is 90. The system will just
drop the calculated max graph capture size 88 from the valid size list
but not reset the max_cudagraph_capture_size to the next valid number.
Thus, we will need to add the line to match them up.

Cherry-pick from main PR #7801

### Does this PR introduce _any_ user-facing change?
No. 

### How was this patch tested?
Full CI passed with this PR.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-30 20:24:52 +08:00
linfeng-yuan
deceefd305 [releases/v0.18.0][bugfix][eplb] remove unnecessary weight_scale wrap behaviour (#7732)
### What this PR does / why we need it?
This PR simplifies the apply method in w8a8_dynamic.py by removing the
conditional logic that used fused_w1_scale and fused_w2_scale based on
the fused_scale_flag. This redundant wrap behavior leads to EPLB break
in int8 quantization scenarios.

Cherry-picked from #7188. Note that only bugfix lines in that PR are
picked.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-30 16:16:03 +08:00
Mengqing Cao
fdd0726ae4 [v0.18.0][Triton] Fix triton-ascend version in Dockerfile (#7766)
### What this PR does / why we need it?
Triton-ascend occasionally encounters compilation errors, which is a
known issue in triton-ascend 3.2.0. However, we want to use the official
version rather than the development version, so we only changed the
triton-ascend version in the Dockerfile and added a FAQ to explain this
issue.

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
2026-03-30 14:43:16 +08:00
Yang Yuxi
e776d5c0f1 [Bugfix]v0.18.0 support FlashComm1 & DCP for Qwen (#7726)
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.

--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
2026-03-29 15:59:19 +08:00
wangbj127
9cc41c9457 [v0.18.0][Bugfix][EAGLE] Fix FIA pad bug under max concurrency (#7754)
cherry picked from https://github.com/vllm-project/vllm-ascend/pull/7740
Fixes padding problems of FIA op under max concurrency.

- vLLM version: v0.18.0
- vLLM main:
35141a7eed

Signed-off-by: Wangbingjie <wangbj1207@126.com>
2026-03-29 12:23:44 +08:00
Wang Kunpeng
5df2ddd8db [v0.18.0][Bugfix]Fix Error "AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute 'enabling_fa_quant'" (#7748)
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736

**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`

**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.

**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-28 17:03:56 +08:00
zhangxinyuehfad
c1cefd26de [v0.18.0][CI] Add nightly- prefix to branch/PR image tags (#7765)
### What this PR does / why we need it?
Add nightly- prefix to branch/PR image tags 

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-28 11:31:16 +08:00
jack
f83cb0e6dc [Bugfix][Platform] Fix GLM47 tool-call finish backfill (#7710)
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.

It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling

### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.

### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-28 09:15:04 +08:00
SparrowMu
6fbd0049df [v0.18.0] Apply Eagle3 to MiniMax-M2.5 (#7619) (#7714)
### What this PR does / why we need it?
Apply Eagle3 to MiniMax-M2.5 to increase model performance This will be
discard after Eagle3 weight for MiniMax-M2.5 releases and code change
accepted by official repo
https://github.com/vllm-project/vllm/pull/37512/changes
backport: #7619

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: limuyuan <limuyuan3@huawei.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
2026-03-27 18:33:29 +08:00
Feng-xiaosuo
60e88d9541 [v0.18.0][Refactor] Use forward mapping instead of reverse mapping in AscendMo… (#7716)
…delSlimConfig (#7596)

### What this PR does / why we need it?

This PR refactors the `AscendModelSlimConfig` class to use **forward
mapping** instead of reverse mapping for quantization config key
transformation.

**Changes:**
1. Modified `apply_vllm_mapper()` to directly apply
`hf_to_vllm_mapper.apply_dict()` to transform `quant_description` keys
from HF format to vLLM format
2. Simplified `quant_prefix_mapper()` to return the prefix directly (no
longer needs mapping since keys are already in vLLM format)
3. Removed `QUANT_MODEL_PREFIX_MAPPINGS` dictionary (~50 lines) - no
longer needed
4. Removed `get_prefix_mapping()` function - no longer needed
5. Removed `vllm_to_hf_mapper` attribute - no longer needed

**Why this change is needed:**

The previous implementation used reverse mapping (vLLM → HF) which had
several issues:
- Some keys might not be used in the forward direction but would be
incorrectly used in reverse
- Empty values in the mapping would cause issues when reversed
- Required maintaining a separate `QUANT_MODEL_PREFIX_MAPPINGS` dict
that duplicated information already available in vLLM's model-specific
`WeightsMapper`

The new approach:
- Uses the forward mapping (HF → vLLM) directly from vLLM's
`WeightsMapper`
- Eliminates the need for duplicate mapping definitions
- Avoids issues with reverse mapping (unused keys, empty values)
- Aligns with how `compressed_tensors_config.py` handles the same
scenario

- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
2026-03-27 18:25:42 +08:00
Angazenn
7cca7e6990 [v0.18.0][Misc] Recompute scheduler upgrade to vLLM 0.18.0 (#7720)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
cherry-pick from #7675 .
The current RecomputeScheduler is aligned to Scheduler in vLLM v0.16.0.
Since upstream vLLM has upgraded to v0.18.0, we also need to upgrade
RecomputeScheduler to pick up missing updates.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: Angazenn <supperccell@163.com>
2026-03-27 18:24:53 +08:00
linfeng-yuan
ab619e1c53 [0.18.0][profiler] profile AICore and MTE time with torch profiler (#7730)
### What this PR does / why we need it?
This pull request updates the NPU profiler configuration in the Ascend
worker by changing the `aic_metrics` parameter from `AiCoreNone` to
`PipeUtilization`. This change enables the collection of pipe
utilization metrics during profiling sessions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-27 16:37:54 +08:00
zhangxinyuehfad
2c175f5ed8 [v0.18.0][Bugfix] Fix pr triggers on branches for nightly test workflows (#7695)
### What this PR does / why we need it?
1. Allow PR triggers on `*-dev` and `releases/v*` branches for nightly
test workflows.
2.  fix image-tag in doc

---------

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-27 15:17:06 +08:00
weiguihua2
bc8e87f3db [v0.18.0][Bugfix] fix ds3.2 dcp mtp (#7681)
### What this PR does / why we need it?
Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2
scenario.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617

Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
2026-03-27 14:24:53 +08:00
bowenli
048c8d1afe [v0.18.0][Bugfix] Fix the bug of MTP1 crashing in multiple concurrent scenarios. (#7699)
### What this PR does / why we need it?
The triton operator does not perform boundary checks on the global
position within the loop, leading to the memory overflow in scenarios
with multiple concurrency + 1-step MTP launch.

Solution: Add a check that global_pos < vec_len, and strictly limit the
boundaries of all memory accesses to avoid out-of-bounds writes.
backport:#7459

Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
2026-03-27 14:13:12 +08:00
Debonet
6ce1dc162a [v0.18.0] fix(attention): reuse weight address in graph + RL scenario (#7715)
### What this PR does / why we need it?

In graph + RL scenario, we only capture the graph once, and the weight
address is expected to be the same across iterations. However, when
calling .contiguous() on weight tensors, a new memory address may be
allocated, causing the graph to capture incorrect weight addresses.
This PR modifies the weight update logic in AscendMLAImpl and
AscendSFAImpl to use copy_() instead of reassignment, ensuring the
weight addresses remain consistent across iterations.

detailed in #7473

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Signed-off-by: Debonex <719893090@qq.com>
2026-03-27 14:11:20 +08:00
Mengqing Cao
29308ac3a9 [v0.18.0][Bugfix] Fixed wrong class attribute assignment (#7586) (#7655)
### What this PR does / why we need it?
Fixed incorrect class attribute assignment and corrected it to instance
attribute assignment. Ensured reorder_batch_threshold only applies to
the current instance to avoid global pollution and multi-instance
conflicts.

Backport of #7586

Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: LookAround0301 <lixushi@huawei.com>
2026-03-27 11:20:59 +08:00
Li Wang
2c2d8bb015 [cherry-pick][CI] Enforce torchaudio and torchvison compatible with pta (#7688)
### What this PR does / why we need it?
This patch cherry-pick from #7648 

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-27 11:06:13 +08:00
jack
53cc225cac [v0.18.0][Bugfix][Platform] Fix MiniMax M2 reasoning token usage accounting (#7700)
### What this PR does / why we need it?
This backports the MiniMax M2 reasoning-token usage accounting fix onto
`releases/v0.18.0` for vllm-ascend.

The release branch does not include the other local GLM patch commit, so
this PR keeps the MiniMax change self-contained by:
- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat
usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited
outputs without depending on the GLM suffix patch

### Does this PR introduce _any_ user-facing change?
Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses
now reports corrected reasoning token counts on the release branch.

### How was this patch tested?
- `python -m compileall
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for
`vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of
`releases/v0.18.0`

No targeted automated regression test exists for this release-branch
backport yet, so I validated syntax and module import compatibility on
the release branch.

---------

Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
2026-03-27 10:45:28 +08:00
zzzzwwjj
a40eee2ba1 [feat] support dispatch_v2/combine_v2 hierarchy communication (#7698)
### What this PR does / why we need it?

This PR adds support for hierarchical communication for `dispatch_v2`
and `combine_v2` MoE operations. This is achieved by introducing a new
configuration `enable_mc2_hierarchy_comm`. When enabled, the
communication algorithm is set to "hierarchy", which support mc2 op comm
between two super pod.

The changes include:
- Adding `enable_mc2_hierarchy_comm` to `AscendConfig`.
- Modifying `TokenDispatcherWithMC2` to pass `comm_alg: "hierarchy"` to
the underlying `torch_npu` ops when the new config is enabled.
- Adding validation to ensure that this feature is only used with
compatible PTA/CANN versions and is not used with the conflicting
`fused_mc2` op.
- Updating `is_hierarchical_communication_enabled` to respect the new
configuration flag.

### Does this PR introduce _any_ user-facing change?

Yes, this PR introduces a new user-facing configuration option
`enable_mc2_hierarchy_comm` in `additional_config` to enable
hierarchical communication for MoE.

### How was this patch tested?

- vLLM version: v0.18.0

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2026-03-27 09:20:16 +08:00
Wang Kunpeng
0bab629f90 [v0.18.0][bugfix]fixed block_size incorrect setting issue in dsv3.2 (#7630) (#7652)
### What this PR does / why we need it?
https://github.com/vllm-project/vllm/pull/35122 This PR in the vllm
community refactors the update mode of block_size. As a result, when the
user does not specify `--block-size`, dsv3.2 obtains an incorrect
block_size.

**The root cause of the problem is analyzed from the block_size update
process as follows:**
1. In NPUPlatform, `check_and_update_config` calls `refresh_block_size`
to set block_size to 128.
2. During Modelrunner initialization, the `self.block_size` parameter is
generated. At this time, block_size is still 128. This parameter will be
used for operations such as kvcache initialization.
3. `update_block_size_for_backend` updates block_size to the size set in
attn_backend. The reason why the DSV3.2 is faulty is that it has an
additional attn_backend `DeepseekV32IndexerBackend`, and this backend is
not rewritten. The block_size obtained from attn_backend is 64. In this
case, only `vllm_config.cache_config.block_size` is updated, and other
parts are not modified. As a result, the block_size on the entire
network is inconsistent.

**Modification solution:**
Skip `update_block_size_for_backend` and modify block_size only in the
`check_and_update_config` method.

In the future, the block_size update logic can be migrated to the
`update_block_size_for_backend` method. Ensure that all block_size
values on the entire network are updated.
### Does this PR introduce _any_ user-facing change? 
no
### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-26 22:38:28 +08:00
HarpsealCC
d6661c09b6 [v0.18.0][kernel] Recompilation optimization triggered by triton function parameter optimization (#7647)
### What this PR does / why we need it?
Some parameters of Triton operators are unnecessarily modified with the
"constexpr" modifier. When these parameters change, recompilation is
triggered, which significantly affects the model performance. Therefore,
these parameters need to be rectified.

- vLLM version: v0.17.0
- vLLM main:
8b6325758c

Signed-off-by: HarpSealCC [844291270@qq.com](mailto:844291270@qq.com)
Signed-off-by: l30072083 <liuchengzhuo1@h-partners.com>
Co-authored-by: l30072083 <liuchengzhuo1@h-partners.com>
2026-03-26 19:10:45 +08:00
zhangxinyuehfad
d781902ce9 [v0.18.0][CI] Fix releases/v0.18.0 ci test only support vllm v0.18.0 (#7686)
### What this PR does / why we need it?
Fix releases/v0.18.0 ci test only support vllm v0.18.0 

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 18:36:04 +08:00
zhangxinyuehfad
124bb00158 [CI][v0.18.0] Build nightly image for releases/v0.18.0 per pr (#7662)
### What this PR does / why we need it?
This patch add per pr image build for branch `releases/v0.18.0`, Due to
the limitations of the quay naming convention, we should not name the
image tag the same as branch name, we name the image
tag`releases-v0.18.0` for daily build.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
2026-03-26 16:48:51 +08:00
cvSoldier
2db33868a4 [kernel] Recompilation optimization triggered by triton function parameter optimization (#7645)
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
- Please clarify why the changes are needed. For instance, the use case
and bug description.
Some parameters of Triton operators are unnecessarily modified with the
"constexpr" modifier. When these parameters change, recompilation is
triggered, which significantly affects the model performance. Therefore,
these parameters need to be rectified.
main branch:https://github.com/vllm-project/vllm-ascend/pull/7483

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->

---------

Signed-off-by: cvSoldier <610496306@qq.com>
2026-03-26 16:31:34 +08:00
Mr.WXS
dba34d4915 [v0.18.0][Triton][Qwen3.5] delete expr for kernels args (#7646)
### What this PR does / why we need it?
Some parameters of Triton operators are unnecessarily modified with the
"constexpr" modifier. When these parameters change, recompilation is
triggered, which significantly affects the model performance. Therefore,
these parameters need to be rectified.
backport: https://github.com/vllm-project/vllm-ascend/pull/7482


Signed-off-by: w30012745 <wangxiaoshuai2@h-partners.com>
Co-authored-by: w30012745 <wangxiaoshuai2@h-partners.com>
2026-03-25 23:31:27 +08:00
Wangbei25
dd55736ee4 fix uncompatible between fc1 and non-sp-padding (#7643)
cherry pick https://github.com/vllm-project/vllm-ascend/pull/7614
### What this PR does / why we need it?
fix uncompatible between fc1 and non-sp-padding
After PR
[non-sp-padding](https://github.com/vllm-project/vllm-ascend/pull/7297),
kimi2.5 open flashcomm1 will raise an error : The expanded size of the
tensor do not match the existing size at non-singleton dimension 0.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM-Ascend main: 9976e685b7

Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
2026-03-25 23:23:37 +08:00
wangbj127
2ad0ca52a6 Qwen3.5 MoE supports flashcomm v1 (#7644)
cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486
<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->
Multimodal models like Qwen3.5 MoE does embedding in model_runner, so
when flash comm is enabled, the first AllGather operation should be
skipped.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
No.

### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
2026-03-25 23:09:33 +08:00
Wang Kunpeng
ff1860bd81 [CI]fix lint (#7641)
### What this PR does / why we need it?
This pull request addresses a linting issue by reordering a specific
configuration assignment within the `apply_config_platform_defaults`
method in `vllm_ascend/platform.py`. This change ensures compliance with
code style guidelines without altering the functional behavior of the
system.
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: Wang Kunpeng <1289706727@qq.com>
2026-03-25 18:48:10 +08:00
linfeng-yuan
05a561129e [Graph][Bugfix] Set default cudagraph max capture size via platform defaults (#7572)
### What this PR does / why we need it?

This PR lets NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.

Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.

### Does this PR introduce _any_ user-facing change?

Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.

If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, we now uses `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.

### How was this patch tested?

Add unit tests to cover:

- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-25 17:57:19 +08:00
linfeng-yuan
d452d04656 [A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573)
### What this PR does / why we need it?
This PR fixes A5 MXFP8 MoE scale handling in the fused MoE path.

- It normalizes MXFP8 activation scales to the packed 3D layout expected
by A5 kernels, including both precomputed dynamic_scale inputs and gmm1
output scales before they are consumed by downstream grouped matmul ops.
- It also refines the MXFP8 force load-balancing path in profiling runs.
- This PR also enables npu_gating_top_k from torch_npu instead of custom
op when running ascend950 chip.
### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI and E2E serving tests on Ascend950DT passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
2026-03-25 17:20:28 +08:00
Shaoxu Cheng
e0e585a109 [310P]: add torch chunk gated delta rule and 910b parity ut (#7594)
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the  chunk gated delta rule on 310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT

---------

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
2026-03-25 16:46:43 +08:00
Marck
17da96658f [ModelLoader][Feature] Add rfork support for fast model loading (#7392)
### What this PR does / why we need it?
Support an new load format: RFORK

For implementation details of this feature, please refer to #7441


### Does this PR introduce _any_ user-facing change?

add an new options for load-format: rfork

e.g.
```bash
vllm serve /workspace/models/Qwen3-8B --load-format rfork
```

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

Signed-off-by: Marck <1412354149@qq.com>
2026-03-25 16:40:30 +08:00
pichangping
6ddfc41312 [bugfix] Fixed the error issue when overlaying MTP and full decode on DSV3.1 C8. (#7571)
…DSV3.1 C8.

### What this PR does / why we need it?
DeepSeek v3.1 C8 had a hanging issue when overlaying MTP and full graph
modes; this pull request resolves that issue.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: pichangping <1337510399@qq.com>
2026-03-25 14:36:26 +08:00
lilinsiman
95d33f05c2 [eagle3][pcp] fix acceptance rate for eagle3 and pcp enabled (#7549)
### What this PR does / why we need it?
fix the position 3 acceptance rate for eagle3 and pcp enabled

detail:
In the merged graph of eagle_proposer, the code logic was changed from
updating the code once before the forward pass of the draft model to
updating all three positions of common_attn_metadata in the merged graph
before performing the forward pass of the model. As a result, the update
of position 2 and position 3 affected the update of position 1.

For example, in the following field:
common_attn_metadata.block_table_tensor[:batch_size] =
common_attn_metadata.block_table_tensor[block_indices]

When updating the block_table_tensor at position 2, the modification of
this field occurred at the original address of common_attn_metadata. As
a result, the parameter at position 1 was also modified, but the forward
pass at position 1 had not been performed. Therefore, a copy of the
address of block_table_tensor needs to be made, and the modification
needs to be performed on the new address to ensure complete isolation
between positions.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
tests and ut

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: lilinsiman <lilinsiman@gmail.com>
2026-03-25 11:52:04 +08:00
meihanc
114ec75a06 [bugfix][CI] fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' (#7620)
### What this PR does / why we need it?
fix '_OpNamespace' 'vllm' object has no attribute 'qkv_rmsnorm_rope' by
uinstall triton

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

Signed-off-by: Meihan-chen <jcccx.cmh@gmail.com>
2026-03-25 11:05:34 +08:00
Li Wang
8e3f8bab57 [Nightly] Nightly pre-build image (#7388)
### What this PR does / why we need it?
This pull request refactor nightly image build and simplify the logic of
multi workflows.
1. Nightly image build become the prerequisite when the test are
triggered by `schedule` or `workflow_dispatch`
2. Simplify the pull request select case logic
3. Next step: Implement replaceable nightly tests. Specifically, if
nightly tests are manually triggered, they can accept any optional
docker image to meet the needs of different commits(Which means the
image is customizable).
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.17.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2026-03-25 09:24:01 +08:00
Yaphets24
8977be1df3 [Bugfix]Fix deepseek 3.2 C8 precision by rotary tensor (#7537)
### What this PR does / why we need it?
During the attention quantization process of DeepSeek V3.2, it is
necessary to retrieve the Hadamard matrix from the weights to facilitate
the computation.

### Does this PR introduce _any_ user-facing change?
No. But there will be two new tensor in quant weight.

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
8b6325758c

---------

Signed-off-by: mayumeng <m30059191@china.huawei.com>
Co-authored-by: mayumeng <m30059191@china.huawei.com>
2026-03-25 09:18:00 +08:00
Ronald
d96440924a adapt to main2main for model runner v2 (#7578)
### What this PR does / why we need it?
This PR aims to adapt to newest commit of vllm main branch for model
runner v2. please refer to
https://github.com/vllm-project/vllm-ascend/issues/5208
### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?

- vLLM version: v0.18.0
- vLLM main:
ed359c497a

---------

Signed-off-by: Ronald1995 <ronaldautomobile@163.com>
2026-03-25 09:08:44 +08:00
Zhu Yi Lin
fc3ec100bc [Patch] Fix balance scheduling (#7611)
### What this PR does / why we need it?
This PR introduces a "balance scheduling" feature, enabled by the
`VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. This feature
adjusts the scheduling logic to better balance the load across
data-parallel workers, preventing a single worker from blocking
scheduling for others. This can improve overall throughput.

Additionally, this PR includes a number of other updates and fixes to
the scheduler, syncing it with a more recent version of the upstream
vLLM scheduler. These changes include:
- Handling for paused scheduler state.
- Support for Mamba block-aligned splits.
- Handling for streaming requests.
- Refinements in preemption logic and resource management (KV cache,
encoder cache).
- General code refactoring for clarity and correctness.

Fixes #

### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces a new feature controlled by the
`VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. When enabled, the
scheduling behavior changes, which could affect performance and request
throughput.

### How was this patch tested?
CI passed. Further testing should be done to validate the performance
and correctness of the new scheduling logic under various workloads,
with and without the feature flag enabled.

Signed-off-by: GDzhu01 <809721801@qq.com>
2026-03-25 08:57:06 +08:00