### What this PR does / why we need it?
This PR implements the `AscendW8A8DynamicLinearMethod310` quantization
scheme specifically for 310P hardware. It includes the logic for weight
retrieval, per-channel parameter generation, and the application of
dynamic quantization using NPU-specific kernels. Additionally, it
updates `ShardedStateLoader310` to handle quantization configurations
more robustly when generating parameter type maps.
Feedback from the review identified two critical issues in the
implementation:
1. The tensor squeezing logic in the `apply` method incorrectly handles
2D inputs, which may lead to shape mismatches in subsequent layers.
2. The weight tensor in `process_weights_after_loading` is transposed
after being converted to the private NZ format; the transpose operation
should be performed on the ND tensor before conversion to ensure correct
physical layout.
cherry-pick from : #7546#7725
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit tests were added in
`tests/ut/_310p/quantization/test_w8a8_dynamic_310.py` to verify the
quantization method, and
`tests/ut/_310p/test_sharded_state_loader_310p.py` was updated to test
the state loader changes.
---------
Signed-off-by: csoulnd <daidaicurry@foxmail.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
For the ENPU scenario, it is required that device events follow the
principle of "record first, wait later", otherwise the inference process
may become stuck. However, in the current model_forward function,
event.wait precedes event.record. Therefore, for the ENPU scenario,
graph parameter updates should be performed before model execution.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
N/A
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: 1zzk <785396250@qq.com>
Signed-off-by: 1kzk <785396250@qq.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
This PR backports the CPU binding locale normalization fix from #7274 to
`releases/v0.18.0`, including the follow-up review fixes already applied
on `main`.
The change forces `LC_ALL`, `LANG`, and `LC_MESSAGES` to `C` before
spawning subprocesses in `vllm_ascend.cpu_binding.execute_command()`, so
parser-dependent command output stays stable on localized systems. It
also handles `subprocess.TimeoutExpired` by killing the child process
before collecting output, and updates the existing unit tests to keep
command-argument coverage while adding timeout-path coverage.
Fixes#6992
### Does this PR introduce _any_ user-facing change?
Yes.
Users running CPU binding on non-English OS environments should now get
consistent English subprocess output for parser-dependent commands,
avoiding failures caused by inherited locale settings.
### How was this patch tested?
- Updated the existing unit tests in
`tests/ut/device_allocator/test_cpu_binding.py` to assert the locale
environment, retain command argument coverage, and cover the timeout
cleanup path.
- Attempted to run targeted pytest cases locally, but the pytest
invocation did not complete normally in this environment, so I could not
record a clean passing run here.
Attribution:
- Co-authored-by: stdjhs <1601599324@qq.com>
- Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: stdjhs <1601599324@qq.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
ref:https://github.com/vllm-project/vllm-ascend/issues/8184
following https://github.com/vllm-project/vllm/pull/31057, add
`requires_piecewise_for_cudagraph` for `AscendStoreConnector`
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
### What this PR does / why we need it?
Fix a bug in the GLM tool call parser where the `function.name` field
was incorrectly included in the final (non-first) chunks of streaming
tool calls.
Per OpenAI streaming semantics, `id`, `type`, and `function.name` must
only appear in the **first** chunk for a given tool call index. When
`_create_remaining_args_delta` was called for continuing/finishing
chunks, it was incorrectly reading the function name from
`delta_message.tool_calls` and re-emitting it, causing clients to see a
duplicate/extra function name in the final chunk.
**Root cause**: The original code always looked up the tool call in
`delta_message.tool_calls` to get the name, id, and type — even when
this was not the first chunk being streamed. This caused the function
name to appear again in the final argument-completion chunk.
**Fix**:
- Track whether arguments have already been streamed
(`already_streamed_args`) for each tool call index.
- Only populate `fallback_tool_call_id`, `fallback_tool_call_type`, and
`fallback_tool_call_name` when `already_streamed_args` is empty (i.e.,
this is genuinely the first chunk).
- Refactored `_create_remaining_args_delta` to omit header fields
entirely when all fallback values are `None`, which is the correct
behavior for continuing/finishing chunks.
### Does this PR introduce _any_ user-facing change?
Yes. Clients consuming the streaming tool call response will no longer
receive a duplicate `function.name` in the final chunk. This fixes
incorrect behavior visible in the OpenAI-compatible streaming API output
for GLM models using tool calls.
### How was this patch tested?
- Code review and logic analysis of the streaming tool call path in
`patch_glm_tool_call_parser.py`.
- Existing unit tests in
`tests/ut/platform/test_patch_glm_tool_call_parser.py`.
---------
Signed-off-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
Signed-off-by: chenweiqiang11 <chenweiqiang11@noreply.github.com>
Co-authored-by: chen-weipeng12 <chen-weipeng12@noreply.gitcode.com>
### What this PR does / why we need it?
This PR is cherry-pick from #8263.
This PR aims to fix short prompt problem. The root cause can be found in
#8029. Since the previous pr may miss mixed long and short prompt batch,
after discussion, we decide to add PrefillNoCache state in mla
_forward_decode now instead.
Signed-off-by: Zetong Li <slippersss@126.com>
### What this PR does / why we need it?
This PR improves the validation of `max_cudagraph_capture_size` by
comparing it against the potential maximum tokens required for decoding,
derived from the scheduler configuration. It introduces a warning to
alert users when the capture size might be insufficient for the
workload, which could lead to suboptimal performance.
ref: #8227
### Does this PR introduce _any_ user-facing change?
Yes, a warning log is added when the `max_cudagraph_capture_size` is
smaller than the potential decode workload.
---------
Signed-off-by: QiuChunshuo <qiuchunshuo@huawei.com>
### What this PR does / why we need it?
This PR fixes a service startup failure for DeepSeek-V3.1 models by
removing a strict type assertion for `MLAAttentionSpec` in
`NPUModelRunner.get_kv_cache_spec`. The assertion was failing due to
class identity mismatches caused by the runtime patching of
`MLAAttentionSpec` with `AscendMLAAttentionSpec`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Verified that the service starts correctly for DSV3.1 models.
Signed-off-by: mayumeng <m30059191@china.huawei.com>
Co-authored-by: mayumeng <m30059191@china.huawei.com>
### What this PR does / why we need it?
Fix Qwen3.5 MoE MTP layer shared expert shape error when flash comm v1
is enabled.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.18.0
- vLLM main:
35141a7eed
Signed-off-by: Wangbingjie <wangbj1207@126.com>
### What this PR does / why we need it?
bugfix short squence has no respone. This pull request refactors the
event handling for KV cache reshaping in mla_v1.py by centralizing the
reshape_cache_event creation and recording within the _mla_preprocess
function, ensuring it covers both decode and prefill operations.
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
### What this PR does / why we need it?
Introduce a check to not using asynchronous communication under
`enable_dsa_cp_with_layer_shard` branch on capturing mode. This change
prevents potential stream and event issues when operating in
graph/capturing mode, ensuring safer communication practices.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
E2E test with dsv32 + FC1 + FULL_DECODE_ONLY +
kv_transfer_config(kv_both)
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
…and PCP across machines
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: DreamLeader <2270923832@qq.com>
### What this PR does / why we need it?
This PR is cherry-pick from #8029.
This PR aims to fix attention state of short prompt for correct
forwarding. Since a batch of short prompts (prefill tokens less than or
equal to num_spec_tokens + 1) will be treated as decode requests (by
split_decodes_and_prefills), its original PrefillNoCache attention state
contradicts. Thus these short prompts will be passed into a mismatched
branch and incur errors.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
by ci
Signed-off-by: Zetong Li <slippersss@126.com>
This pull request reverts previous changes to switch to FIA and instead
implements npu_ring_mla for MLA prefill operations(#5704 ). The change
streamlines the attention mechanism by removing unnecessary metadata
tracking and updating the underlying NPU operations to use the
ring-based MLA kernel. This adjustment ensures better compatibility and
performance for MLA prefill tasks within the vLLM Ascend backend.
Highlights
- Migration to npu_ring_mla: Replaced the usage of
npu_fused_infer_attention_score (FIA) with npu_ring_mla for MLA prefill
operations across the codebase to improve performance and alignment with
the intended architecture.
- Cleanup of redundant metadata: Removed
chunk_actual_seq_lengths_kv_list and actual_seq_lengths_q from various
metadata structures as they are no longer required for the updated
attention implementation.
- Test suite updates: Updated unit tests in test_mla_cp.py and
test_mla_v1.py to mock npu_ring_mla instead of the deprecated FIA
functions and adjusted test assertions to reflect the new implementation
details.
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>
Co-authored-by: weijinqian_v1 <weijinqian@huawei.com>
### What this PR does / why we need it?
Cherry-picked from #8039
Restore the setting of MC2 `global_bs` and `mc2_mask` handling when
`all_reduce` across DP group cannot be skipped. Ascend MC2 ops require
`global_bs=0` + `mc2_mask` while enabling inter-node roce hierarchical
communication. PR #4983 always passed non-zero `global_bs` without
`mc2_mask`, which is incompatible with hierarchy comm raised in PR #7583
**Changes:**
- Add `should_skip_allreduce_across_dp_group()` to `utils.py` with
hierarchy constraint
- Set `global_bs=0` when allreduce is not skipped; pass `mc2_mask`
accordingly
- Add `mc2_mask` field to `MoEMC2CombineMetadata` for dispatch→combine
propagation
### Does this PR introduce _any_ user-facing change?
No. But this PR fixes cross-super-node communication function on A3 with
`enable_mc2_hierarchy_comm=True` in `additional_config` and `export
HCCL_INTRA_ROCE_ENABLE=1`.
### How was this patch tested?
E2E serving succeeded and CI pssed.
- vLLM version: v0.18.0
- vLLM main:
14acf429ac
---------
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Enabling temperature in post-processing on 310P devices can cause the
service to stall and eventually hang. We first traced the issue to a
timeout where the temperature-related `div` operator was waiting for
results from a sub-stream. After investigating the preceding operators,
we finally identified the root cause as the `q.exponential_()` operator,
which is not well supported on 310P and triggers an internal issue in
the `add` kernel.
### Does this PR introduce _any_ user-facing change?
NA
### How was this patch tested?
This patch was thoroughly tested locally(accuracy-dataset test and
stress test). It is not easy to design a proper unit test for this case,
and I appreciate your understanding.
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
When we launch a PD-disaggregated process and send requests, an
additional processes appear on NPU 0, becasue when a thread has a
primary cuda context, the child thread it creates automatically doesn't
inherit the cuda context. See
https://forums.developer.nvidia.com/t/when-a-thread-has-a-primary-cuda-context-does-the-child-thread-it-creates-automatically-inherit-the-cuda-context/362810.
vLLM has fixed this issue in [pr-37449
](https://github.com/vllm-project/vllm/pull/37449), but version 0.18.0
does not include the fix. Therefore, we need to patch it.
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
no
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: zouyida <zouyida@huawei.com>
Co-authored-by: zouyida <zouyida@huawei.com>
backport of #7474
This PR adds C8 (INT8) KV cache quantization support for standard GQA
attention models (e.g., Qwen3-32B W8A8C8). C8 uses static per-channel
quantization scales to store KV cache in INT8, reducing KV cache memory
by ~50% compared to BF16, enabling higher batch concurrency and longer
context lengths on the same hardware.
**Key changes:**
1. **`attention_v1.py`** — New `AscendC8AttentionBackendImpl` subclass
of `AscendAttentionBackendImpl`:
- `_prepare_c8_scales`: Shards per-channel scales/offsets to the current
TP rank and pre-computes BF16 BNSD-shaped antiquant tensors (one-time
per layer).
- `_quantize_kv_to_int8`: Quantizes BF16 K/V to INT8 before
`reshape_and_cache`, using pre-cached inverse scales.
- `_forward_c8_decode`: FIA V1 BNSD paged attention with native INT8 KV
and `perchannel` antiquant mode.
- `_forward_c8_chunked_prefill`: Splits decode (FIA V1 BNSD paged INT8)
and prefill (FIA V1 TND float) into two kernel calls.
- `_forward_c8_fused_infer_attention`: Handles `PrefillNoCache` and
`PrefillCacheHit` states.
2. **`quantization/methods/kv_c8.py`** — New
`AscendC8KVCacheAttentionMethod` scheme:
- Creates `k/v_cache_scale/offset` parameters via
`_c8_kv_scale_weight_loader`, which handles per-channel scale shapes and
lazy resizing.
- Sets `layer.kv_cache_torch_dtype = torch.int8` so
`get_kv_cache_spec()` returns INT8 dtype automatically.
- Upgrades `layer.impl` to `AscendC8AttentionBackendImpl` via class
surgery.
3. **`quantization/modelslim_config.py`** — C8 branch in
`get_quant_method()` activates when `kv_cache_type == "C8"` in
`quant_model_description.json`.
4. **`patch/worker/patch_qwen3_c8.py`** — Intercepts per-channel C8
scale/offset weights before `AutoWeightsLoader` discards them, routing
them to the parameters created by `AscendC8KVCacheAttentionMethod`.
5. **`tests/ut/quantization/test_kv_c8.py`** — Unit tests covering
`_c8_kv_scale_weight_loader`, `AscendC8KVCacheAttentionMethod`, and
`AscendC8AttentionBackendImpl` scale helpers.
Yes. Users can now serve Qwen3-32B W8A8C8 quantized models with INT8 KV
cache on Ascend NPU. The model checkpoint must contain a
`quant_model_description.json` with `"kv_cache_type": "C8"` and
per-channel scale/offset tensors in safetensors.
No changes to the serving CLI — the feature activates automatically when
the quantization config is detected.
Benchmarked with `vllm serve` (TP=8, `max_num_seqs=256`,
`max_model_len=131072`, `enable_chunked_prefill=true`) + `random_bench`
(input_len=10240, output_len=2048, 960 prompts, max_concurrency=192):
```
============ Serving Benchmark Result ============
Successful requests: 960
Failed requests: 0
Maximum request concurrency: 192
Benchmark duration (s): 1359.81
Total input tokens: 9830400
Total generated tokens: 1966080
Request throughput (req/s): 0.71
Output token throughput (tok/s): 1445.85
Peak output token throughput (tok/s): 2304.00
Total token throughput (tok/s): 8675.12
---------------Time to First Token----------------
Mean TTFT (ms): 24598.51
Median TTFT (ms): 23167.02
P50 TTFT (ms): 23167.02
P90 TTFT (ms): 47717.08
P99 TTFT (ms): 84402.61
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 120.76
Median TPOT (ms): 121.50
P50 TPOT (ms): 121.50
P90 TPOT (ms): 127.05
P99 TPOT (ms): 130.13
---------------Inter-token Latency----------------
Mean ITL (ms): 120.70
Median ITL (ms): 90.34
P50 ITL (ms): 90.34
P90 ITL (ms): 93.79
P99 ITL (ms): 101.80
==================================================
```
All attention states verified: `PrefillNoCache`, `PrefillCacheHit`,
`ChunkedPrefill`, `DecodeOnly`.
- vLLM version: v0.17.0
- vLLM main:
8b6325758c
Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: LICO67373 <110013619+LICO1314@users.noreply.github.com>
### What this PR does / why we need it?
Enable Flash Comm V1 (sequence parallelism) for Qwen3-VL models (both
dense and MoE variants).
Root cause: Qwen3-VL's deepstack embeddings remain full-size [N, H]
while hidden states become [N/tp_size, H] after reduce-scatter, causing
shape mismatch on add.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- [x] Run Qwen3-VL dense model with FC1 enabled (TP > 1), verify correct
output
- [x] Run Qwen3-VL MoE model with FC1 enabled (TP > 1), verify correct
output
---------
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
### What this PR does / why we need it?
Qwen3vl full attention supports enabling the split_qkv_rmsnorm_mrope
fusion operator.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- [x] Run Qwen3-VL dense model with the fusion operator, verify correct
output
- [x] Run Qwen3-VL MoE model with the fusion operator, verify correct
output
---------
Signed-off-by: jiangmengyu18 <451528648@qq.com>
Signed-off-by: jiangmengyu18 <56633611+jiangmengyu18@users.noreply.github.com>
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
### What this PR does / why we need it?
This PR fixes a weight loading error in the Qwen3-VL model.
The bug was introduced by vLLM. In vLLM's `qwen3-vl.py`, the prefix of
the `lm_head` layer is hardcoded as `"lm_head"`. However,
`hf_to_vllm_mapper` remaps the weight name of `lm_head` from `lm_head`
to `language_model.lm_head`.
This causes a mismatch between the keys in the weight file and the
prefix of the lm_head layer, resulting in an error.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- [x] Run Qwen3-VL dense model with the fusion operator, verify correct
output
Signed-off-by: betta18 <jiangmengyu1@huawei.com>
Co-authored-by: betta18 <jiangmengyu1@huawei.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
vLLM v0.18 defers KV connector finalization during target-modelforward
when speculative decoding is enable, leading to KV Pool not doing Put
Operation. This change is forgotten when we bumpped up the version for
vllm-ascend. Fix by adding finalize_kv_connector for spec decode.
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Fixed the bug of incorrect reshape usage.
For example:
ori_tensor: [[1, 2, 3], [4, 5, 6]]
after reshape:
[[1, 2], [3, 4], [5, 6]]
after permute:
[[1, 4], [2, 5], [3, 6]]
Now, we will directly use squeeze for a more intuitive understanding.
pr for main:
#7887
### Does this PR introduce _any_ user-facing change?
The actual peak-to-average ratio has successfully decreased.
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
## What this PR does / why we need it?
pick-from:https://github.com/vllm-project/vllm-ascend/pull/7452
### Problem
Embedding models produce inconsistent outputs when prefix caching is
enabled vs disabled.
### Root Cause
The attention router condition was too broad:
- All `model_runner_type == "pooling"` → `_forward_encoder_attention()`
→ uses `npu_fusion_attention`
- **But `npu_fusion_attention` does NOT support prefix caching**
- Result: Numerical mismatch when KV cache is managed by prefix caching
### Solution
Refine the router condition to check causality:
**Before**:
```
if attn_metadata.model_runner_type == "pooling":
→ npu_fusion_attention (no prefix caching support)
```
**After**:
```
if attn_metadata.model_runner_type == "pooling" and not attn_metadata.causal:
→ npu_fusion_attention (for true encoders)
else:
→ npu_fused_infer_attention_score (prefix caching support)
```
### Changes Made
1. **Fixed router condition** (`vllm_ascend/attention/attention_v1.py`
L968)
- Added `and not attn_metadata.causal` check
- Effect: Non-causal embeddings now use correct operator
2. **Simplified encoder attention**
(`vllm_ascend/attention/attention_v1.py` L864-877)
- Removed redundant causal branch (encoders never use causal mask)
- Reduced from 34 lines to 14 lines
3. **Added test** (`tests/e2e/singlecard/pooling/test_embedding.py`)
- Validates embedding outputs with/without prefix caching are consistent
## Does this PR introduce _any_ user-facing change?
### Functional Changes
✅ **Yes** - Bug fix: Embedding models now produce consistent outputs
with prefix caching
### API Changes
❌ **No** - All public APIs unchanged
### Configuration Changes
❌ **No** - No new configuration required
### Backward Compatibility
✅ **Fully compatible** - Only fixes incorrect behavior
## How was this patch tested?
### New Test
Added `test_embed_models_using_prefix_caching_correctness()`:
- Tests: `Qwen3-Embedding-0.6B`
- Validates numerical consistency between runs with/without prefix
caching
- Uses long sequences to activate prefix caching
- Tolerance: 1e-2
- vLLM version: v0.18.0
Signed-off-by: underfituu <hzhucong@163.com>
### What this PR does / why we need
Fix layerwise connector out of memory during large buffer transfer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By nightly.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
### What this PR does / why we need it?
This PR fixs accuracy bug in some cases with additional communication
methods.
This PR is a specific fix for version 0.18.0
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.18.0
- vLLM main:
35141a7eed
---------
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: yydyzr <liuyuncong1@huawei.com>
Co-authored-by: rjg-lyh <1318825571@qq.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
Before when we do put for KV Pool, we find the first non-existing key
and put all the blocks starting from that index; however, if the prefix
cache blocks is from another request, and some of the blocks are evicted
due to LRU, we will be putting blocks that still exist in the pool, and
causing MooncakeStore printing unnecessary logs in master service.
What this PR does:
Now we lookup all the keys and only put the ones that are missing.
Fix lookup_scheduler in pool_worker so it handles GQA correctly.
Fixes a few existing typos
Add UT, written by codex
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: Pz1116 <zpbzpb123123@gmail.com>
Co-authored-by: DreamerLeader <2270923832@qq.com>
Co-authored-by: fems14 <1804143737@qq.com>
### What this PR does / why we need it?
Implement get_token_bin_counts_and_mask and apply_penalties with
Triton-Ascend kernels. This significantly reduces latency of the
sampling process when repetition/frequency/presence penalties are
enabled.
Cherry-pick from main PR #7569
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: realliujiaxu <realliujiaxu@163.com>
### What this PR does / why we need it?
Adjust request map pop time.This pull request optimizes the KV cache
transfer mechanism by streamlining how requests are tracked and cleaned
up. By removing unnecessary mapping structures and adjusting the timing
of request removal, the system achieves more efficient state management
during the transfer process.
pick-from:https://github.com/vllm-project/vllm-ascend/pull/7855
### How was this patch tested?
By ci
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
## Summary
- replace the MiniMax usage accounting monkey patch with a runtime
wrapper implementation instead of source-text rewriting
- preserve MiniMax reasoning-token semantics when `</think>` is missing
by counting the emitted output as reasoning tokens
- add unit coverage for usage tracking helpers and MiniMax
reasoning-token counting
## Why
The previous implementation rewrote `OpenAIServingChat` by matching
exact source blocks. That was brittle against `vllm` source drift and
could crash during early plugin initialization with:
`RuntimeError: Failed to locate expected block while patching
OpenAIServingChat usage accounting.`
This change keeps the usage-accounting backport, but applies it by
wrapping the original stream/full generators and tracking output token
ids at runtime.
For MiniMax reasoning counting, a missing `</think>` should not be
treated as zero reasoning tokens. It can mean the whole output is still
in thinking mode, or that generation stopped before the closing token
was produced. In that case, the emitted output should still be counted
as reasoning.
## Validation
- `pytest -q
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `vllm serve --help`
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
### What this PR does / why we need it?
Optimize DeepSeekOCR2 RelPosAttention and CustomQwen2Decoder and add doc
for DeepSeekOCR2.md
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vllm 0.18.0
- vllm-ascend main
1. _create_custom_4d_mask during 141ms49us620ns -->
_create_npu_optimized_mask during 1ms227us780ns
2. convd2d : 27ms --> matmul <1ms
3. relposattention:sdpa->prompt_flash_attention
---------
Signed-off-by: Wangbei25 <wangbei41@huawie.com>
Signed-off-by: Wangbei25 <wangbei41@huawei.com>
Co-authored-by: Wangbei25 <wangbei41@huawie.com>
…(#7603)
### What this PR does / why we need it?
Block verify uses cumprod(target_probs / draft_probs) for joint
acceptance. Suffix/ngram methods have
draft_probs=None, the fallback draft_token_probs=1.0 with cumprod is not
equivalent to per-token
verification, causing incorrect accept/reject results. Fix:
using_block_verify = max_spec_len >= 3 and draft_probs is not None.
MTP/Eagle3 unaffected.
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: liuchenbing <chenliumail@163.com>
Co-authored-by: liuchenbing <chenliumail@163.com>
### What this PR does / why we need it?
Mooncake layerwise connector supports Mamba prefix caching on prefiller
nodes.
### Does this PR introduce _any_ user-facing change?
Yes. Use `--enable-prefix-caching` and `--mamba-cache-mode align` to
enable mamba align mode prefix caching on P/D prefill nodes. This
function does not supports on decode nodes now.
### How was this patch tested?
By P/D E2E test.
---------
Signed-off-by: nwpu-zxr <zhouxuerong2@huawei.com>
### What this PR does / why we need it?
Current version will run into init error when user set max_num_seqs to
number not a multiple of tp size. The reason is that we will first find
out the valid size of sequence parallelism, and then remove numbers that
are not the multiple of tp size. This may cause an error when we set a
max_num_seqs above a multiple of 8 before a multiple of tp size, say
when the tp size is 16 and the max_num_seqs is 90. The system will just
drop the calculated max graph capture size 88 from the valid size list
but not reset the max_cudagraph_capture_size to the next valid number.
Thus, we will need to add the line to match them up.
Cherry-pick from main PR #7801
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Full CI passed with this PR.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: limuyuan <limuyuan3@huawei.com>
### What this PR does / why we need it?
This PR simplifies the apply method in w8a8_dynamic.py by removing the
conditional logic that used fused_w1_scale and fused_w2_scale based on
the fused_scale_flag. This redundant wrap behavior leads to EPLB break
in int8 quantization scenarios.
Cherry-picked from #7188. Note that only bugfix lines in that PR are
picked.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
This PR backports the changes from #7673 ([Bugfix] support FlashComm1 &
DCP for Qwen) to the releases/v0.18.0 branch.
--------
Signed-off-by: Yang Yuxi <907276627@qq.com>
### What this PR does / why we need it?
cherry-pick from https://github.com/vllm-project/vllm-ascend/pull/7736
**Error information**
When the quantized weights in CompressedTensors format of the kimi-k2
model are used, the following error is reported:
`AttributeError: 'AscendCompressedTensorsConfig' obiect has no attribute
'enabling_fa_quant'`
**Error Cause**
Currently, FA3 quantization supports only the weights of modelslim
quantization. The added methods are not defined in
AscendCompressedTensorsConfig.
**Solution**
Before invoking related methods, check whether the FA3 feature is
enabled.
Additionally, the unused `get_scaled_act_names` method and its
corresponding unit test have been removed.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests were updated by removing a deprecated test case, and
the refactored logic was reviewed for correctness.
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
### What this PR does / why we need it?
This rebases the GLM47 tool-call parser fix onto `releases/v0.18.0`
after the MiniMax usage-accounting patch merged upstream on March 27,
2026.
It fixes OpenAI chat tool-call streaming for GLM47 by:
- draining terminal parser chunks that contain both the final argument
text and the closing `</tool_call>` suffix
- computing finish backfill from the tool argument bytes actually
emitted to the client, instead of trusting parser-internal buffered
state
- adding focused regression tests for finish backfill and terminal chunk
handling
### Does this PR introduce _any_ user-facing change?
Yes. GLM47 OpenAI-compatible streaming tool-call responses now emit
correct final chunks and argument payloads on `releases/v0.18.0`.
### How was this patch tested?
- `pytest -q tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_minimax_usage_accounting.py`
- `python -m pre_commit run --files
vllm_ascend/patch/platform/patch_glm_tool_call_parser.py
tests/ut/patch/platform/test_patch_glm_tool_call_parser.py
vllm_ascend/patch/platform/__init__.py vllm_ascend/patch/__init__.py`
---------
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
…delSlimConfig (#7596)
### What this PR does / why we need it?
This PR refactors the `AscendModelSlimConfig` class to use **forward
mapping** instead of reverse mapping for quantization config key
transformation.
**Changes:**
1. Modified `apply_vllm_mapper()` to directly apply
`hf_to_vllm_mapper.apply_dict()` to transform `quant_description` keys
from HF format to vLLM format
2. Simplified `quant_prefix_mapper()` to return the prefix directly (no
longer needs mapping since keys are already in vLLM format)
3. Removed `QUANT_MODEL_PREFIX_MAPPINGS` dictionary (~50 lines) - no
longer needed
4. Removed `get_prefix_mapping()` function - no longer needed
5. Removed `vllm_to_hf_mapper` attribute - no longer needed
**Why this change is needed:**
The previous implementation used reverse mapping (vLLM → HF) which had
several issues:
- Some keys might not be used in the forward direction but would be
incorrectly used in reverse
- Empty values in the mapping would cause issues when reversed
- Required maintaining a separate `QUANT_MODEL_PREFIX_MAPPINGS` dict
that duplicated information already available in vLLM's model-specific
`WeightsMapper`
The new approach:
- Uses the forward mapping (HF → vLLM) directly from vLLM's
`WeightsMapper`
- Eliminates the need for duplicate mapping definitions
- Avoids issues with reverse mapping (unused keys, empty values)
- Aligns with how `compressed_tensors_config.py` handles the same
scenario
- vLLM version: v0.18.0
- vLLM main:
ed359c497a
---------
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.
- Please clarify why the changes are needed. For instance, the use case
and bug description.
- Fixes #
-->
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
Signed-off-by: Matrix_K <zhangke144@huawei.com>
Signed-off-by: Feng-xiaosuo <tengchang1@huawei.com>
Co-authored-by: Matrix_K <zhangke144@huawei.com>
Co-authored-by: Wang Kunpeng <1289706727@qq.com>
<!-- Thanks for sending a pull request!
BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html
-->
### What this PR does / why we need it?
cherry-pick from #7675 .
The current RecomputeScheduler is aligned to Scheduler in vLLM v0.16.0.
Since upstream vLLM has upgraded to v0.18.0, we also need to upgrade
RecomputeScheduler to pick up missing updates.
### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
---------
Signed-off-by: Angazenn <supperccell@163.com>
### What this PR does / why we need it?
This pull request updates the NPU profiler configuration in the Ascend
worker by changing the `aic_metrics` parameter from `AiCoreNone` to
`PipeUtilization`. This change enables the collection of pipe
utilization metrics during profiling sessions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed.
Signed-off-by: linfeng-yuan <1102311262@qq.com>
### What this PR does / why we need it?
Fixed the issue where the DCP overlaps the MTP scenario in the ds3.2
scenario.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
cherry-pick from: https://github.com/vllm-project/vllm-ascend/pull/7617
Signed-off-by: weiguihua2 <weiguihua2@huawei.com>
### What this PR does / why we need it?
The triton operator does not perform boundary checks on the global
position within the loop, leading to the memory overflow in scenarios
with multiple concurrency + 1-step MTP launch.
Solution: Add a check that global_pos < vec_len, and strictly limit the
boundaries of all memory accesses to avoid out-of-bounds writes.
backport:#7459
Signed-off-by: Bowen-Leee <caoshankuangren@gmail.com>
### What this PR does / why we need it?
In graph + RL scenario, we only capture the graph once, and the weight
address is expected to be the same across iterations. However, when
calling .contiguous() on weight tensors, a new memory address may be
allocated, causing the graph to capture incorrect weight addresses.
This PR modifies the weight update logic in AscendMLAImpl and
AscendSFAImpl to use copy_() instead of reassignment, ensuring the
weight addresses remain consistent across iterations.
detailed in #7473
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Signed-off-by: Debonex <719893090@qq.com>
### What this PR does / why we need it?
Fixed incorrect class attribute assignment and corrected it to instance
attribute assignment. Ensured reorder_batch_threshold only applies to
the current instance to avoid global pollution and multi-instance
conflicts.
Backport of #7586
Signed-off-by: LookAround <lixushi@huawei.com>
Signed-off-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: LookAround0301 <lixushi@huawei.com>
### What this PR does / why we need it?
This backports the MiniMax M2 reasoning-token usage accounting fix onto
`releases/v0.18.0` for vllm-ascend.
The release branch does not include the other local GLM patch commit, so
this PR keeps the MiniMax change self-contained by:
- registering `patch_minimax_usage_accounting` on the release branch
- backporting `completion_tokens_details.reasoning_tokens` into chat
usage generation
- fixing MiniMax reasoning token counting for `</think>`-delimited
outputs without depending on the GLM suffix patch
### Does this PR introduce _any_ user-facing change?
Yes. OpenAI-compatible chat usage accounting for MiniMax M2 responses
now reports corrected reasoning token counts on the release branch.
### How was this patch tested?
- `python -m compileall
vllm_ascend/patch/platform/patch_minimax_usage_accounting.py`
- `python - <<'PY'` import check for
`vllm_ascend.patch.platform.patch_minimax_usage_accounting` on top of
`releases/v0.18.0`
No targeted automated regression test exists for this release-branch
backport yet, so I validated syntax and module import compatibility on
the release branch.
---------
Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>
Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com>