[Fix] Fixes issues in MTP with async scheduling and ACL graph (#4963)

### What this PR does / why we need it?
Corrects attention metadata size for MTP when both asynchronous
scheduling and full ACL graph mode are enabled. This prevents potential
size mismatches during execution.
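A minimal sketch of the sizing rule being fixed, with illustrative names (not vllm-ascend's actual API): under full graph capture, attention metadata buffers must be sized to the padded graph batch that will actually execute, not to the raw request count.

```python
# Hypothetical helper illustrating graph-padded metadata sizing.
# `graph_batch_sizes` stands in for the sorted list of captured batch
# sizes (e.g. self.cudagraph_batch_sizes); names are assumptions.
import bisect

def padded_metadata_size(num_reqs: int, graph_batch_sizes: list[int]) -> int:
    """Smallest captured graph batch size that fits num_reqs.

    Metadata tensors must be allocated at this padded size so they match
    the replayed graph; sizing them to num_reqs causes a mismatch.
    """
    idx = bisect.bisect_left(graph_batch_sizes, num_reqs)
    if idx == len(graph_batch_sizes):
        # Larger than any captured graph: eager fallback, no padding.
        return num_reqs
    return graph_batch_sizes[idx]
```

For example, with captured sizes `[1, 2, 4, 8]`, 3 requests run under the size-4 graph, so metadata is sized to 4.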

Additionally, improves the robustness of calculating token sample
indices by explicitly aligning tensor shapes.
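The shape-alignment idea can be sketched as follows; the helper and its parameters are illustrative, not the PR's actual code. Rather than relying on implicit broadcasting, the sample-index tensor is explicitly padded or truncated to the expected length before use.

```python
# Hypothetical sketch of explicit shape alignment for sample indices.
def align_to(indices: list[int], target_len: int, pad_value: int = 0) -> list[int]:
    """Pad or truncate so downstream gathers always see a fixed shape."""
    if len(indices) >= target_len:
        return indices[:target_len]
    return indices + [pad_value] * (target_len - len(indices))
```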

Finally, prevents padding when the number of input tokens exceeds the
maximum ACL graph batch size to avoid out-of-bounds errors.
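The guard can be sketched like this (illustrative names; the real check lives in `NPUModelRunner`): padding is only applied when the token count still fits the largest captured graph, since padding past that point would index beyond preallocated buffers.

```python
# Hypothetical sketch of the padding guard added by this PR.
def compute_pad_size(num_input_tokens: int, num_reqs: int,
                     uniform_decode_query_len: int,
                     max_graph_tokens: int) -> int:
    """Pad size for uniform decode, or 0 when the batch exceeds the
    largest captured graph (eager fallback, so no graph padding)."""
    if num_input_tokens > max_graph_tokens:
        return 0
    num_reqs_padded = num_input_tokens // uniform_decode_query_len
    return max(num_reqs_padded - num_reqs, 0)
```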

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Corresponding test cases need to be added ASAP.
- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: Jade Zheng <zheng.shoujian@outlook.com>
Author: Yizhou, 2025-12-14 00:10:11 +08:00, committed by GitHub
Commit: 0686b32d82 (parent 42ceaf08a1)
2 changed files with 16 additions and 3 deletions


@@ -1019,7 +1019,7 @@ class NPUModelRunner(GPUModelRunner):
         # TODO: We should make this official ASAP. Also note that if we pad here,
         # the builders won't need to add any extra padding.
         if self.compilation_config.cudagraph_mode.decode_mode() == CUDAGraphMode.FULL and \
-                uniform_decode:
+                uniform_decode and num_input_tokens <= self.cudagraph_batch_sizes[-1]:
             num_reqs_padded = num_input_tokens // self.uniform_decode_query_len
             pad_size = num_reqs_padded - num_reqs
             if pad_size > 0: