From 51415aaa2f6fc148aa07dda653bca9526de93feb Mon Sep 17 00:00:00 2001
From: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
Date: Wed, 14 Jan 2026 22:57:38 +0800
Subject: [PATCH] [bugfix] support dsv3.2 enable both mtp and full_decode_only (#5849)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

### What this PR does / why we need it?

Support dsv3.2 with both MTP and full_decode_only enabled.

PR #5626 modified the branch logic to align with the community. Previously, dsv3.2 never reached the inside of that branch; it now goes through an additional unpadded step that transforms `positions` and `num_input_tokens`, which changes the cos and sin dimensions in sfa_v1.py and causes an illegal-shape error when the tensors are passed to the operator.

1. The unpadded function was introduced to align with the community; in the community version, the function does not take the `num_input_tokens` and `positions` parameters.
2. `positions` was sliced and `num_input_tokens=num_actual_tokens` was used so as to match the name "unpad", so that padded `positions` and `num_input_tokens` would not be produced. In fact, attention_v1 does not use these two parameters; the cropping was done defensively, out of concern that someone might rely on these fields later and hit shape-mismatch issues without being aware of the padding. However, `positions` is not padded at its source, so there is actually no need to unpad it here.

- vLLM version: v0.13.0
- vLLM main: https://github.com/vllm-project/vllm/commit/2f4e6548efec402b913ffddc8726230d9311948d

Signed-off-by: cookieyyds <126683903+cookieyyds@users.noreply.github.com>
---
 vllm_ascend/attention/utils.py | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/vllm_ascend/attention/utils.py b/vllm_ascend/attention/utils.py
index 826c91a5..619d2278 100644
--- a/vllm_ascend/attention/utils.py
+++ b/vllm_ascend/attention/utils.py
@@ -140,10 +140,10 @@ class AscendCommonAttentionMetadata(CommonAttentionMetadata):
             slot_mapping=self.slot_mapping,
             causal=self.causal,
             actual_seq_lengths_q=self.actual_seq_lengths_q[:num_actual_tokens],
-            positions=self.positions[:num_actual_tokens],
+            positions=self.positions,
             attn_state=self.attn_state,
             graph_pad_size=-1,  # It should be -1 when not run in fullgraph mode.
-            num_input_tokens=num_actual_tokens,
+            num_input_tokens=self.num_input_tokens,
             prefill_context_parallel_metadata=self.
             prefill_context_parallel_metadata,
             max_seq_len=self.max_seq_len)
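
Illustrative note (not part of the patch): a minimal sketch of the intent behind the change, assuming a simplified stand-in class `SketchAttentionMetadata` rather than the real `AscendCommonAttentionMetadata`. The idea is that the unpadded copy trims padded per-token fields such as `actual_seq_lengths_q` to `num_actual_tokens`, while `positions` and `num_input_tokens` are carried through unchanged, because `positions` is never padded at its source and slicing it would shrink the cos/sin tensors built from it downstream.

```python
# Hypothetical sketch only: a simplified stand-in, not the real
# AscendCommonAttentionMetadata from vllm_ascend/attention/utils.py.
from dataclasses import dataclass, replace
from typing import List


@dataclass
class SketchAttentionMetadata:
    num_actual_tokens: int           # tokens actually scheduled this step
    num_input_tokens: int            # may be padded (e.g. for full-graph capture)
    actual_seq_lengths_q: List[int]  # padded field, trimmed by unpadded()
    positions: List[int]             # NOT padded at its source

    def unpadded(self) -> "SketchAttentionMetadata":
        """Return a copy with padded per-token fields trimmed.

        positions and num_input_tokens are carried through unchanged:
        positions is already unpadded at its source, and slicing it would
        change the shape of the cos/sin tensors derived from it (the
        illegal-shape error in sfa_v1.py that this PR addresses).
        """
        return replace(
            self,
            actual_seq_lengths_q=self.actual_seq_lengths_q[:self.num_actual_tokens],
        )


if __name__ == "__main__":
    meta = SketchAttentionMetadata(
        num_actual_tokens=3,
        num_input_tokens=8,                       # padded token budget
        actual_seq_lengths_q=[1, 2, 3, 0, 0, 0, 0, 0],
        positions=[10, 11, 12],                   # already sized to real tokens
    )
    u = meta.unpadded()
    assert u.positions == meta.positions                  # untouched
    assert u.num_input_tokens == meta.num_input_tokens    # untouched
    assert len(u.actual_seq_lengths_q) == meta.num_actual_tokens
```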