Backport of #7882 to releases/v0.18.0. Adds aime2025 benchmark test for
DeepSeek-V3.2-W8A8 EP with disaggregated prefill on A3 (4-node, 16 NPUs
per node, accuracy benchmark baseline 66.67%).
Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
Update CI for GLM-5 configuration on vllm-ascend/releases/v0.18.0 branch
在0.18.0版本上对glm5-w4a8做测试
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: yangjiuhua <y00845194@china.huawei.com>
Co-authored-by: yangjiuhua <y00845194@china.huawei.com>
### What this PR does / why we need it?
The env `VLLM_ASCEND_ENABLE_FUSED_MC2` should only enabled in the
decoder node during Prefill-Decode Disaggregation scenario
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
What this PR does / why we need it?
switch Ascend conv3d forward_oot to use forward_native and add ut
Does this PR introduce any user-facing change?
No
How was this patch tested?
by CI
---------
Signed-off-by: zouyizhou <zouyizhou@huawei.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
- Enforce recompute scheduler only in PD-disaggregated mode.
- Enforce balance scheduling only in PD-mixed mode.
- Enforce fused MC2 only on PD-disaggregated D-side (kv_consumer).
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
No
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
By ci
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
This PR backports the DSA-CP PD role gating fix to `releases/v0.18.0`.
The existing helper logic on the release branch does not handle the PD
mixed-role case correctly when deciding whether layer sharding or TP
`o_proj` handling should be enabled. Layer sharding should only run on
the P-side instance, while TP `o_proj` handling should stay enabled for
normal non-PD deployments and for the PD mixed-role (`kv_both`)
instance. This patch makes those conditions explicit and adds unit
coverage for the allowed and disallowed combinations, including the
DSA-CP-disabled path.
Such wrong condition lead to **vllm serve failures** in case: **FC1 +
PD-colocated KV pooling + no layer_sharding**, specifically causing:
1. insufficient Available KV cache memory
2. o_proj shape error in sfa_v1 attention module
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both) + no layer_sharding
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
### What this PR does / why we need it?
This PR improves the readability of the documentation by fixing typos,
correcting command extensions, and fixing broken links in the Chinese
README.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation changes only.
---------
Signed-off-by: sunshine202600 <sunshine202600@163.com>
### What this PR does / why we need it?
This PR implements the `AscendW8A8DynamicLinearMethod310` quantization
scheme specifically for 310P hardware. It includes the logic for weight
retrieval, per-channel parameter generation, and the application of
dynamic quantization using NPU-specific kernels. Additionally, it
updates `ShardedStateLoader310` to handle quantization configurations
more robustly when generating parameter type maps.
Feedback from the review identified two critical issues in the
implementation:
1. The tensor squeezing logic in the `apply` method incorrectly handles
2D inputs, which may lead to shape mismatches in subsequent layers.
2. The weight tensor in `process_weights_after_loading` is transposed
after being converted to the private NZ format; the transpose operation
should be performed on the ND tensor before conversion to ensure correct
physical layout.
cherry-pick from : #7546#7725
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit tests were added in
`tests/ut/_310p/quantization/test_w8a8_dynamic_310.py` to verify the
quantization method, and
`tests/ut/_310p/test_sharded_state_loader_310p.py` was updated to test
the state loader changes.
---------
Signed-off-by: csoulnd <daidaicurry@foxmail.com>
### What this PR does / why we need it?
This PR backports the CPU binding locale normalization fix from #7274 to
`releases/v0.18.0`, including the follow-up review fixes already applied
on `main`.
The change forces `LC_ALL`, `LANG`, and `LC_MESSAGES` to `C` before
spawning subprocesses in `vllm_ascend.cpu_binding.execute_command()`, so
parser-dependent command output stays stable on localized systems. It
also handles `subprocess.TimeoutExpired` by killing the child process
before collecting output, and updates the existing unit tests to keep
command-argument coverage while adding timeout-path coverage.
Fixes#6992
### Does this PR introduce _any_ user-facing change?
Yes.
Users running CPU binding on non-English OS environments should now get
consistent English subprocess output for parser-dependent commands,
avoiding failures caused by inherited locale settings.
### How was this patch tested?
- Updated the existing unit tests in
`tests/ut/device_allocator/test_cpu_binding.py` to assert the locale
environment, retain command argument coverage, and cover the timeout
cleanup path.
- Attempted to run targeted pytest cases locally, but the pytest
invocation did not complete normally in this environment, so I could not
record a clean passing run here.
Attribution:
- Co-authored-by: stdjhs <1601599324@qq.com>
- Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: stdjhs <1601599324@qq.com>
### What this PR does / why we need it?
Fix a bug in the GLM tool call parser where the `function.name` field
was incorrectly included in the final (non-first) chunks of streaming
tool calls.
Per OpenAI streaming semantics, `id`, `type`, and `function.name` must
only appear in the **first** chunk for a given tool call index. When
`_create_remaining_args_delta` was called for continuing/finishing
chunks, it was incorrectly reading the function name from
`delta_message.tool_calls` and re-emitting it, causing clients to see a
duplicate/extra function name in the final chunk.
**Root cause**: The original code always looked up the tool call in
`delta_message.tool_calls` to get the name, id, and type — even when
this was not the first chunk being streamed. This caused the function
name to appear again in the final argument-completion chunk.
**Fix**:
- Track whether arguments have already been streamed
(`already_streamed_args`) for each tool call index.
- Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and
`fallback_tool_call_name` when `already_streamed_args` is empty (i.e.,
this is genuinely the first chunk).
- Refactored `_create_remaining_args_delta` to omit header fields
entirely when all fallback values are `None`, which is the correct
behavior for continuing/finishing chunks.
### Does this PR introduce _any_ user-facing change?
Yes. Clients consuming the streaming tool call response will no longer
receive a duplicate `function.name` in the final chunk. This fixes
incorrect behavior visible in the OpenAI-compatible streaming API output
for GLM models using tool calls.
### How was this patch tested?
- Code review and logic analysis of the streaming tool call path in
`patch_glm_tool_call_parser.py`.
- Existing unit tests in
`tests/ut/platform/test_patch_glm_tool_call_parser.py`.
---------
Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com>
Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
### What this PR does / why we need it?
Adds a `check_rank0_process_count` validation step to the
DeepSeek-R1-W8A8-HBM nightly single-node test.
The check verifies that after the server starts, there is **exactly 1**
`vllm serve` process running on rank0. This guards against the
regression fixed in #8041 (extra NPU context leaking on device 0),
ensuring it does not silently reappear in future releases.
#### Changes
-
**`tests/e2e/nightly/single_node/models/scripts/test_single_node.py`**:
Add `run_check_rank0_process_count` async handler. It calls `npu-smi
info` for diagnostics, then uses `psutil` to assert exactly one `vllm
serve` process exists on rank0.
-
**`tests/e2e/nightly/single_node/models/configs/DeepSeek-R1-W8A8-HBM.yaml`**:
Register `check_rank0_process_count` in the `test_content` list for the
HBM test case.
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
### What this PR does / why we need it?
Introduce a check to not using asynchronous communication under
`enable_dsa_cp_with_layer_shard` branch on capturing mode. This change
prevents potential stream and event issues when operating in
graph/capturing mode, ensuring safer communication practices.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both)
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
### What this PR does / why we need it?
Fix the nightly pip binary install doc test fail.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
Nightly doc test
Signed-off-by: leo-pony <nengjunma@outlook.com>
### What this PR does / why we need it?
This PR fixes and simplifies the CI configuration for Qwen3 32B.
The main changes are:
- Remove the redundant `Qwen3-32B-Int8-A3-Feature-Stack3.yaml` config
and consolidate the CI setup into `Qwen3-32B-Int8.yaml`.
- Improve runtime stability by adding
`PYTORCH_NPU_ALLOC_CONF=expandable_segments:True` and setting
`--max-num-seqs 80`.
- Update the accuracy benchmark from `aime2024` to `gsm8k-lite`, and
adjust the related dataset config, output length, baseline, and
threshold accordingly.
These changes make the Qwen3 32B CI easier to maintain and more stable
in nightly validation.
---------
Signed-off-by: ZYang6263 <zy626375@gmail.com>
This pull request reverts previous changes to switch to FIA and instead
implements npu_ring_mla for MLA prefill operations(#5704 ). The change
streamlines the attention mechanism by removing unnecessary metadata
tracking and updating the underlying NPU operations to use the
ring-based MLA kernel. This adjustment ensures better compatibility and
performance for MLA prefill tasks within the vLLM Ascend backend.
Highlights
- Migration to npu_ring_mla: Replaced the usage of
npu_fused_infer_attention_score (FIA) with npu_ring_mla for MLA prefill
operations across the codebase to improve performance and alignment with
the intended architecture.
- Cleanup of redundant metadata: Removed
chunk_actual_seq_lengths_kv_list and actual_seq_lengths_q from various
metadata structures as they are no longer required for the updated
attention implementation.
- Test suite updates: Updated unit tests in test_mla_cp.py and
test_mla_v1.py to mock npu_ring_mla instead of the deprecated FIA
functions and adjusted test assertions to reflect the new implementation
details.
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
Cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7468
- Fix TTFT ratio threshold from 0.8 to 0.4 for prefix cache benchmarks
- Fix max_out_len values for warm_up and benchmark configs
- Applied to both DeepSeek-R1-0528-W8A8 and Qwen3-32B-Int8 configs
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: underfituu <hzhucong@163.com>
backport of #7474
This PR adds C8 (INT8) KV cache quantization support for standard GQA
attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel
quantization scales to store KV cache in INT8, reducing KV cache memory
by ~50% compared to BF16, enabling higher batch concurrency and longer
context lengths on the same hardware.
**Key changes:**
1. **`attention_v1.py`** — New `AscendC8AttentionBackendImpl` subclass
of `AscendAttentionBackendImpl`:
- `_prepare_c8_scales`: Shards per-channel scales/offsets to the current
TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time
per layer).
- `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before
`reshape_and_cache`, using pre-cached inverse scales.
- `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV
and `perchannel` antiquant mode.
- `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8)
and prefill (FIA V1 TND float) into two kernel calls.
- `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and
`PrefillCacheHit` states.
2. **`quantization/methods/kv_c8.py`** — New
`AscendC8KVCacheAttentionMethod` scheme:
- Creates `k/v_cache_scale/offset` parameters via
`_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and
lazy resizing.
- Sets `layer.kv_cache_torch_dtype = torch.int8` so
`get_kv_cache_spec()` returns INT8 dtype automatically.
- Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class
surgery.
3. **`quantization/modelslim_config.py`** — C8 branch in
`get_quant_method()` activates when `kv_cache_type == "C8"` in
`quant_model_description.json`.
4. **`patch/worker/patch_qwen3_c8.py`** — Intercepts per-channel C8
scale/offset weights before `AutoWeightsLoader` discards them, routing
them to the parameters created by `AscendC8KVCacheAttentionMethod`.
5. **`tests/ut/quantization/test_kv_c8.py`** — Unit tests covering
`_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and
`AscendC8AttentionBackendImpl` scale helpers.
Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV
cache on Ascend NPU. The model checkpoint must contain a
`quant_model_description.json` with `"kv_cache_type": "C8"` and
per-channel scale/offset tensors in safetensors.
No changes to the serving CLI — the feature activates automatically when
the quantization config is detected.
Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`,
`max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench`
(input_len=10240, output_len=2048, 960 prompts, max_concurrency=192):
```
============ Serving Benchmark Result ============
Successful requests: 960
Failed requests: 0
Maximum request concurrency: 192
Benchmark duration (s): 1359.81
Total input tokens: 9830400
Total generated tokens: 1966080
Request throughput (req/s): 0.71
Output token throughput (tok/s): 1445.85
Peak output token throughput (tok/s): 2304.00
Total token throughput (tok/s): 8675.12
---------------Time to First Token----------------
Mean TTFT (ms): 24598.51
Median TTFT (ms): 23167.02
P50 TTFT (ms): 23167.02
P90 TTFT (ms): 47717.08
P99 TTFT (ms): 84402.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 120.76
Median TPOT (ms): 121.50
P50 TPOT (ms): 121.50
P90 TPOT (ms): 127.05
P99 TPOT (ms): 130.13
---------------Inter-token Latency----------------
Mean ITL (ms): 120.70
Median ITL (ms): 90.34
P50 ITL (ms): 90.34
P90 ITL (ms): 93.79
P99 ITL (ms): 101.80
==================================================
```
All attention states verified: `PrefillNoCache`, `PrefillCacheHit`,
`ChunkedPrefill`, `DecodeOnly`.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>
### What this PR does / why we need it?
This PR Qwen3.5-27B ;MiniMax-M2.5-w8a8 ;Qwen3.5-397B-w8a8-mtp acc/perf 3
cases on A3, we need test them daily.
- vLLM version: v0.18.0
- vLLM main:
35141a7eed
Signed-off-by: guxin108 <1252896542@qq.com>
### What this PR does / why we need it?
Fix qwen3Next Nightly CI config in 0.18.0.
backport: #7679
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
Add nightly ci test for qwen3vl
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
### What this PR does / why we need it?
Fixed the bug of incorrect reshape usage.
For example:
ori_tensor: [[1, 2, 3], [4, 5, 6]]
after reshape:
[[1, 2], [3, 4], [5, 6]]
after permute:
[[1, 4], [2, 5], [3, 6]]
Now, we will directly use squeeze for a more intuitive understanding.
pr for main:
#7887
### Does this PR introduce _any_ user-facing change?
The actual peak-to-average ratio has successfully decreased.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
## What this PR does / why we need it?
pick-from:https://github.com/vllm-project/vllm-ascend/pull/7452
### Problem
Embedding models produce inconsistent outputs when prefix caching is
enabled vs disabled.
### Root Cause
The attention router condition was too broad:
- All `model_runner_type == "pooling"` → `_forward_encoder_attention()`
→ uses `npu_fusion_attention`
- **But `npu_fusion_attention` does NOT support prefix caching**
- Result: Numerical mismatch when KV cache is managed by prefix caching
### Solution
Refine the router condition to check causality:
**Before**:
```
if attn_metadata.model_runner_type == "pooling":
→ npu_fusion_attention (no prefix caching support)
```
**After**:
```
if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
→ npu_fusion_attention (for true encoders)
else:
→ npu_fused_infer_attention_score (prefix caching support)
```
### Changes Made
1. **Fixed router condition** (`vllm_ascend/attention/attention_v1.py`
L968)
- Added `and not attn_metadata.causal` check
- Effect: Non-causal embeddings now use correct operator
2. **Simplified encoder attention**
(`vllm_ascend/attention/attention_v1.py` L864-877)
- Removed redundant causal branch (encoders never use causal mask)
- Reduced from 34 lines to 14 lines
3. **Added test** (`tests/e2e/singlecard/pooling/test_embedding.py`)
- Validates embedding outputs with/without prefix caching are consistent
## Does this PR introduce _any_ user-facing change?
### Functional Changes
✅ **Yes** - Bug fix: Embedding models now produce consistent outputs
with prefix caching
### API Changes
❌ **No** - All public APIs unchanged
### Configuration Changes
❌ **No** - No new configuration required
### Backward Compatibility
✅ **Fully compatible** - Only fixes incorrect behavior
## How was this patch tested?
### New Test
Added `test_embed_models_using_prefix_caching_correctness()`:
- Tests: `Qwen3-Embedding-0.6B`
- Validates numerical consistency between runs with/without prefix
caching
- Uses long sequences to activate prefix caching
- Tolerance: 1e-2
- vLLM version: v0.18.0
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need it?
Fix the OOM (Out-of-Memory) error in the single-node-deepseek-v3-2-w8a8
nightly test of vllm-ascend:
- Reduced the value of HCCL_BUFFSIZE
- Lowered the gpu-memory-utilization
Optimize service-side performance:
Updated service-oriented configuration parameters (e.g., max-num-seqs,
cudagraph_capture_sizes, batch_size) to improve the inference
performance,so that the performance is closer to the optimal performance
of the current mainline.
Align performance baseline with main branch:
Updated the performance baseline according to the latest performance
data
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The test has passed.
https://github.com/vllm-project/vllm-ascend/actions/runs/23734079080/job/69134387320?pr=7793
---------
Signed-off-by: wyh145 <1987244901@qq.com>
### What this PR does / why we need it?
Due to the current dcp solution of allgathering the KV cache, the
performance deteriorates significantly, and the CI may get stuck. This
PR temporarily removes the performance and accuracy benchmarks for
DeepSeek-V3.2-W8A8-cp to prevent CI hangs until optimization is
complete.
pcik-from:https://github.com/vllm-project/vllm-ascend/pull/7842
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Verified that the configuration file remains valid and that the CI no
longer attempts to run the problematic benchmarks.
pick-from: https://github.com/vllm-project/vllm-ascend/pull/7842
---------
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
Before when we do put for KV Pool, we find the first non-existing key
and put all the blocks starting from that index; however, if the prefix
cache blocks is from another request, and some of the blocks are evicted
due to LRU, we will be putting blocks that still exist in the pool, and
causing MooncakeStore printing unnecessary logs in master service.
What this PR does:
Now we lookup all the keys and only put the ones that are missing.
Fix lookup_scheduler in pool_worker so it handles GQA correctly.
Fixes a few existing typos
Add UT, written by codex
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Implement get_token_bin_counts_and_mask and apply_penalties with
Triton-Ascend kernels. This significantly reduces latency of the
sampling process when repetition/frequency/presence penalties are
enabled.
Cherry-pick from main PR #7569
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting
## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`
This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.
For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.
## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.
--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.
It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling
### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.
### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`
---------
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
### What this PR does / why we need it?
Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2
scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
This backports the MiniMax M2 reasoning-token usage accounting fix onto
`releases/v0.18.0` for vllm-ascend.
The release branch does not include the other local GLM patch commit, so
this PR keeps the MiniMax change self-contained by:
- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat
usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited
outputs without depending on the GLM suffix patch
### Does this PR introduce _any_ user-facing change?
Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses
now reports corrected reasoning token counts on the release branch.
### How was this patch tested?
- `python -m compileall
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for
`vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of
`releases/v0.18.0`
No targeted automated regression test exists for this release-branch
backport yet, so I validated syntax and module import compatibility on
the release branch.
---------
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
cherry pick from https://github.com/vllm-project/vllm-ascend/pull/7486
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
Multimodal models like Qwen3.5 MoE does embedding in model_runner, so
when flash comm is enabled, the first AllGather operation should be
skipped.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
No.
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
- vLLM version: v0.18.0
- vLLM main:
8b6325758c
---------
Signed-off-by: Wangbingjie <wangbj1207@126.com>
Signed-off-by: wangbj127 <256472688+wangbj127@users.noreply.github.com>
### What this PR does / why we need it?
This PR lets NPU platform provide its own default
`max_cudagraph_capture_size` via
`NPUPlatform.apply_config_platform_defaults()`.
Previously, when cudagraph sizing was left unset, Ascend inherited
vLLM's upstream default heuristic in `_set_cudagraph_sizes()`, which
uses `max_num_seqs * decode_query_len * 2`. This PR changes Ascend's
default to `min(max_num_seqs * decode_query_len, 512)` while keeping the
rest of vLLM's cudagraph sizing logic unchanged.
### Does this PR introduce _any_ user-facing change?
Yes, but only for Ascend when users do not explicitly configure
cudagraph sizing.
If `max_cudagraph_capture_size` and `cudagraph_capture_sizes` are both
unset, we now uses `max_num_seqs * decode_query_len` (capped at `512`)
instead of the upstream `* 2` default. Explicit user settings are
unchanged.
### How was this patch tested?
Add unit tests to cover:
- default max injection via `apply_config_platform_defaults()`
- explicit `max_cudagraph_capture_size` is preserved
- explicit `cudagraph_capture_sizes` are preserved
- Ascend default max no longer uses the upstream `* 2`
- late `_set_cudagraph_sizes()` recomputation reuses the current max
input
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the chunk gated delta rule on 310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
RFC https://github.com/vllm-project/vllm-ascend/issues/7394
Add a PyTorch implementation of the fused recurrent gated delta ruler on
310P.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
UT
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
### What this PR does / why we need it?
This PR introduces several upstream `vllm`-aligned lint hooks into
`vllm-ascend` and makes them part of the actual `pre-commit` flow.
Main changes in this PR:
- add `check-boolean-context-manager` to catch boolean expressions in
`with` statements
- add `check-forbidden-imports` to forbid direct `re` imports and
disallowed direct `triton` imports
- enable shell script linting through `tools/shellcheck.sh`
- add root `.clang-format` aligned with upstream `vllm`, enable
`clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`**
from `clang-format` to avoid bringing a large native code reformat into
this PR
This PR focuses on landing the smaller and immediately useful lint
alignment first, without mixing in the larger requirements-management
migration.
### Does this PR introduce _any_ user-facing change?
No.
This PR only updates repository lint configuration, static checks, and
internal import/style enforcement. It does not change runtime behavior
or public interfaces.
### How was this patch tested?
Tested locally in the project virtual environment.
Commands used:
```bash
bash format.sh
```
Verified checks passed:
``` bash
ruff check...............................................................Passed
ruff format..............................................................Passed
codespell................................................................Passed
typos....................................................................Passed
clang-format.............................................................Passed
Lint GitHub Actions workflow files.......................................Passed
Lint shell scripts.......................................................Passed
Lint PNG exports from excalidraw.........................................Passed
Check for spaces in all filenames........................................Passed
Enforce __init__.py in Python packages...................................Passed
Check for forbidden imports..............................................Passed
Check for boolean ops in with-statements.................................Passed
Suggestion...............................................................Passed
- hook id: suggestion
- duration: 0s
To bypass pre-commit hooks, add --no-verify to git commit.
```
**note:**
clang-format is enabled but currently excludes all csrc/**
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
---------
Signed-off-by: MrZ20 <2609716663@qq.com>
### What this PR does / why we need it?
This PR optimizes the `_compute_slot_mappings_kernel` for Ascend NPUs to
improve performance. The key changes include:
- A new Triton kernel implementation (`_compute_slot_mappings_kernel`)
with NPU-specific optimizations, such as using `tl.gather` to handle
non-contiguous memory access and replacing modulo operations.
- A new method `compute_slot_mappings` in `AscendBlockTables` to use
this new kernel.
- An end-to-end test to verify the correctness of the new kernel against
the reference GPU implementation.
The optimization is needed to avoid performance degradation from scalar
computation on Ascend devices.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
Signed-off-by: lhp-deep <liuhaopeng1@huawei.com>
### What this PR does / why we need it?
2nd PR for https://github.com/vllm-project/vllm-ascend/issues/5712,
extend SP to VL MoE models.
### Does this PR introduce _any_ user-facing change?
remove `sp_threshold` in additional config and reuse `sp_min_token_num`
from vLLM.
### How was this patch tested?
- Model: Qwen3-VL-30B-A3B,
- TP4 DP2
- 100 reqs
- max concurrency 1
| Seq length | Mean TTFT (ms) main | Mean TTFT (ms) this PR |
|------------|---------------------|------------------------|
| 4k | 429.40 | 323.3 |
| 16k | 1297.01 | 911.74 |
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
This pr modifies qwen3Next nightly CI config.
(1) Add a nightly CI .
(2) Set a more precise accuracy standard
- vLLM version: v0.18.0
- vLLM main:
6a9cceb219
Signed-off-by: Your Name <you@example.com>
Co-authored-by: Your Name <you@example.com>
### What this PR does / why we need it?
This PR adds missing arguments in `AscendRotaryEmbedding`,
`AscendYarnRotaryEmbedding` to conform with vLLM. Besides, corresponding
ut is introduced.
- vLLM version: v0.17.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
RFC #7394
310P cannot use the fused `rmsnormgated` operator and must fall back to
the native implementation.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
ut
- vLLM version: v0.17.0
- vLLM main:
4497431df6
---------
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
### What this PR does / why we need it?
During the prefill phase of Qwen3-Next and Qwen3.5, the
`torch.ops._C_ascend.causal_conv1d_fn` operator exhibits significant
performance bottlenecks. To address this, we have re-implemented the
optimization using `torch.ops._C_ascend.npu_causal_conv1d_custom`.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
1 accuracy test
```
[2026-03-20 16:44:22,961] [ais_bench] [INFO] Start launch task state board ...
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
| Task Name | Process | Progress | Time Cost | Status | Log Path | Extend Parameters |
+=============================+===========+============+=============+==========+===========================================+=====================+
| vllm-api-general-chat/gsm8k | 2918978 | NA | 0:00:01 | finish | logs/eval/vllm-api-general-chat/gsm8k.out | None |
+-----------------------------+-----------+------------+-------------+----------+-------------------------------------------+---------------------+
[2026-03-20 16:44:34,284] [ais_bench] [INFO] Evaluation tasks completed.
[2026-03-20 16:44:34,287] [ais_bench] [INFO] Summarizing evaluation results...
dataset version metric mode vllm-api-general-chat
--------- --------- -------- ------ -----------------------
gsm8k 271d0b accuracy gen 96.21
```
2 ut modify test
`pytest -sv
/home/c30006096/vllm-ascend/tests/e2e/nightly/single_node/ops/singlecard_ops/triton/test_causal_conv1d.py::test_ascend_causal_conv1d`
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: wenba0 <3054239545@qq.com>
Signed-off-by: jiaojiao <56385650+wenba0@users.noreply.github.com>