Fix graph mode bug (#7460)
### What this PR does / why we need it?
In FULL_DECODE_ONLY mode, `num_reqs_padded` was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. We therefore added a check on
`compilation_config.cudagraph_mode` to the conditional logic, ensuring
that padding is applied only when the cudagraph mode is FULL.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 8a680463fa
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
```diff
@@ -567,7 +567,8 @@ class NPUModelRunner(GPUModelRunner):
         """
         # TODO: need refactor later, related to vllm PR #34043 this pr delete func
         # relax_for_mixed_batch_cudagraphs, num_reqs no longer equals the actual number of requests.
-        if cudagraph_runtime_mode == CUDAGraphMode.FULL:
+        if cudagraph_runtime_mode == CUDAGraphMode.FULL and \
+                self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL:
             num_reqs_padded = num_reqs
         else:
             num_reqs_padded = batch_desc_num_reqs if batch_desc_num_reqs is not None else num_reqs
```
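The fixed conditional can be sketched as a small standalone helper. This is illustrative only: `pick_num_reqs_padded` is a hypothetical function (not the actual `NPUModelRunner` code), and the `CUDAGraphMode` enum here is a simplified stand-in for vLLM's enum of the same name.

```python
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    # Simplified stand-in for vLLM's CUDAGraphMode (illustrative subset).
    NONE = 0
    FULL_DECODE_ONLY = 1
    FULL = 2


def pick_num_reqs_padded(
    cudagraph_runtime_mode: CUDAGraphMode,
    configured_mode: CUDAGraphMode,
    num_reqs: int,
    batch_desc_num_reqs: Optional[int],
) -> int:
    """Hypothetical helper mirroring the fixed branch: keep num_reqs as the
    padded value only when both the runtime mode and the configured
    compilation_config.cudagraph_mode are FULL; otherwise fall back to the
    batch-descriptor count when one is available."""
    if (cudagraph_runtime_mode == CUDAGraphMode.FULL
            and configured_mode == CUDAGraphMode.FULL):
        return num_reqs
    return batch_desc_num_reqs if batch_desc_num_reqs is not None else num_reqs
```

Before the fix, the FULL_DECODE_ONLY case (runtime mode FULL but configured mode FULL_DECODE_ONLY) took the first branch and ignored the batch-descriptor count; the added check routes it to the fallback instead.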