[v0.18.0][BugFix] Fix dimension mismatch error when SP padding causes num_tokens_padded != num_tokens_unpadded (#8133)

Cherry-picked from https://github.com/vllm-project/vllm-ascend/pull/7858

### What this PR does / why we need it?
This PR fixes a `RuntimeError` (dimension mismatch) that occurs when
Sequence Parallelism (SP) is enabled and the padding added for SP causes
`num_tokens_padded` to differ from `num_tokens_unpadded`. In such cases,
`_pad_query_start_loc_for_fia` adds a dummy request, increasing
`num_reqs_padded`. The resulting mismatch between the actual and padded
request counts then breaks downstream token-count computations (e.g.,
`compute_num_computed_tokens`).
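
For illustration, here is a minimal sketch of how SP alignment produces the mismatch; the `pad_for_sp` helper and the SP world size of 4 are hypothetical stand-ins for the runner's actual padding logic:

```python
# Hypothetical sketch: SP shards the flattened token dimension across ranks,
# so the runner rounds the token count up to a multiple of the SP world size.
def pad_for_sp(num_tokens: int, sp_size: int) -> int:
    """Round num_tokens up to the next multiple of sp_size (ceiling division)."""
    return -(-num_tokens // sp_size) * sp_size

num_tokens_unpadded = 29290  # real token count in the batch
num_tokens_padded = pad_for_sp(num_tokens_unpadded, sp_size=4)

# 29292 != 29290: the dummy-request path in _pad_query_start_loc_for_fia
# is taken, and num_reqs_padded grows past the actual request count.
print(num_tokens_padded, num_tokens_unpadded)
```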

The fix relaxes the restrictive condition `num_tokens_padded ==
num_tokens_unpadded` that gated reverting the dummy-request padding under
SP: the revert now keys off whether `num_reqs_padded` actually increased.
SP padding is stripped after communication, so it should not surface as an
additional request in the attention metadata.
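
A minimal NumPy sketch of what the revert does to the `query_start_loc` prefix sums (the sizes and values below are made up; the real buffers live on the NPU, and a mirrored `gpu` copy is updated the same way):

```python
import numpy as np

# Two real requests with 3 and 2 tokens; prefix sums with spare slots.
num_reqs = 2
query_start_loc = np.array([0, 3, 5, 0, 0])

num_tokens_unpadded = 5
num_tokens_padded = 8  # rounded up for SP alignment

# _pad_query_start_loc_for_fia would absorb the 3 padding tokens into a
# dummy request; the fix reverts that and instead stretches the last real
# boundary, so the request count stays at num_reqs.
num_reqs_padded = num_reqs
query_start_loc[num_reqs_padded + 1] = 0
query_start_loc[num_reqs_padded] = num_tokens_padded

print(query_start_loc)  # [0 3 8 0 0]: still 2 requests, padding covered
```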

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
vLLM version: v0.18.0
vLLM-Ascend version: releases/v0.18.0

Signed-off-by: Wangbj127 <wangbj1207@126.com>
Commit: f2956ce944 (parent: 0954fd0912)
Author: wangbj127
Date: 2026-04-17 22:50:22 +08:00
Committed by: GitHub


@@ -1257,10 +1257,22 @@ class NPUModelRunner(GPUModelRunner):
         num_reqs_padded = self._pad_query_start_loc_for_fia(
             num_tokens_padded, num_reqs_padded, num_reqs, cudagraph_mode, batch_desc.num_reqs
         )
-        if enable_sp() and num_tokens_padded == num_tokens_unpadded:
-            if num_reqs_padded > old_num_reqs_padded:
+        # FIA may add a virtual request in Mixed Batch scenarios.
+        # here we revert the request added by _pad_query_start_loc_for_fia if SP is enabled.
+        # RELAXED CONDITION: Check if num_reqs_padded was actually increased, rather than
+        # strictly checking token equality. This handles cases where num_tokens_padded
+        # != num_tokens_unpadded due to SP alignment (e.g., 29292 vs 29290).
+        if enable_sp() and num_reqs_padded > old_num_reqs_padded:
+            if num_tokens_padded == num_tokens_unpadded:
                 num_reqs_padded = old_num_reqs_padded
                 self.query_start_loc.np[num_reqs_padded + 1] = 0
+            if num_tokens_padded != num_tokens_unpadded and not self.speculative_config:
+                num_reqs_padded = old_num_reqs_padded
+                self.query_start_loc.np[num_reqs_padded + 1] = 0
+                self.query_start_loc.np[num_reqs_padded] = num_tokens_padded
+                self.query_start_loc.gpu[num_reqs_padded] = num_tokens_padded
         (attn_metadata, spec_decode_common_attn_metadata) = self._build_attention_metadata(
             num_tokens=num_tokens_unpadded