From 51415aaa2f6fc148aa07dda653bca9526de93feb Mon Sep 17 00:00:00 2001
From: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Date: Wed, 14 Jan 2026 22:57:38 +0800
Subject: [PATCH] [bugfix] support dsv3.2 enable both mtp and full_decode_only (#5849)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What this PR does / why we need it?

Support dsv3.2 with both MTP and full_decode_only enabled.

PR #5626 modified the branch logic to align with the community. Previously, dsv3.2 never reached the inside of that branch; it now goes through an additional unpadded step that transforms `positions` and `num_input_tokens`, which changes the cos and sin dimensions in sfa_v1.py and causes an illegal-shape error when the tensors are passed to the operator.

1. The unpadded function was introduced to align with the community; in the community version, the function does not take the `num_input_tokens` and `positions` parameters.
2. `positions` was sliced and `num_input_tokens=num_actual_tokens` was used so as to match the name "unpad", so that padded `positions` and `num_input_tokens` would not be produced. In fact, attention_v1 does not use these two parameters; the cropping was done defensively, out of concern that someone might rely on these fields later and hit shape-mismatch issues without being aware of the padding. However, `positions` is not padded at its source, so there is actually no need to unpad it here.

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
---
 vllm_ascend/attention/utils.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/vllm_ascend/attention/utils.py b/vllm_ascend/attention/utils.py
index 826c91a5..619d2278 100644
--- a/vllm_ascend/attention/utils.py
+++ b/vllm_ascend/attention/utils.py
@@ -140,10 +140,10 @@ class AscendCommonAttentionMetadata(CommonAttentionMetadata):
             slot_mapping=self.slot_mapping,
             causal=self.causal,
             actual_seq_lengths_q=self.actual_seq_lengths_q[:num_actual_tokens],
-            positions=self.positions[:num_actual_tokens],
+            positions=self.positions,
             attn_state=self.attn_state,
             graph_pad_size=-1,  # It should be -1 when not run in fullgraph mode.
-            num_input_tokens=num_actual_tokens,
+            num_input_tokens=self.num_input_tokens,
             prefill_context_parallel_metadata=self.
             prefill_context_parallel_metadata,
             max_seq_len=self.max_seq_len)
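
Illustrative note (not part of the patch): a minimal sketch of the intent behind the change, assuming a simplified stand-in class `SketchAttentionMetadata` rather than the real `AscendCommonAttentionMetadata`. The idea is that the unpadded copy trims padded per-token fields such as `actual_seq_lengths_q` to `num_actual_tokens`, while `positions` and `num_input_tokens` are carried through unchanged, because `positions` is never padded at its source and slicing it would shrink the cos/sin tensors built from it downstream.

```python
# Hypothetical sketch only: a simplified stand-in, not the real
# AscendCommonAttentionMetadata from vllm_ascend/attention/utils.py.
from dataclasses import dataclass, replace
from typing import List


@dataclass
class SketchAttentionMetadata:
    num_actual_tokens: int           # tokens actually scheduled this step
    num_input_tokens: int            # may be padded (e.g. for full-graph capture)
    actual_seq_lengths_q: List[int]  # padded field, trimmed by unpadded()
    positions: List[int]             # NOT padded at its source

    def unpadded(self) -> "SketchAttentionMetadata":
        """Return a copy with padded per-token fields trimmed.

        positions and num_input_tokens are carried through unchanged:
        positions is already unpadded at its source, and slicing it would
        change the shape of the cos/sin tensors derived from it (the
        illegal-shape error in sfa_v1.py that this PR addresses).
        """
        return replace(
            self,
            actual_seq_lengths_q=self.actual_seq_lengths_q[:self.num_actual_tokens],
        )


if __name__ == "__main__":
    meta = SketchAttentionMetadata(
        num_actual_tokens=3,
        num_input_tokens=8,                       # padded token budget
        actual_seq_lengths_q=[1, 2, 3, 0, 0, 0, 0, 0],
        positions=[10, 11, 12],                   # already sized to real tokens
    )
    u = meta.unpadded()
    assert u.positions == meta.positions                  # untouched
    assert u.num_input_tokens == meta.num_input_tokens    # untouched
    assert len(u.actual_seq_lengths_q) == meta.num_actual_tokens
```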