Fix graph mode bug (#7460)
### What this PR does / why we need it?
In FULL_DECODE_ONLY mode, `num_reqs_padded` was set to an incorrect value,
causing accuracy degradation in Qwen3-Next. We therefore added a check on
`compilation_config.cudagraph_mode` to the conditional logic, ensuring
that padding is applied only when the cudagraph mode is FULL.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.17.0
- vLLM main: 8a680463fa
Co-authored-by: wangxiaoxin-sherie <wangxiaoxin7@huawei.com>
```diff
@@ -567,7 +567,8 @@ class NPUModelRunner(GPUModelRunner):
         """
         # TODO: need refactor later, related to vllm PR #34043 this pr delete func
         # relax_for_mixed_batch_cudagraphs, num_reqs no longer equals the actual number of requests.
-        if cudagraph_runtime_mode == CUDAGraphMode.FULL:
+        if cudagraph_runtime_mode == CUDAGraphMode.FULL and \
+                self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL:
             num_reqs_padded = num_reqs
         else:
             num_reqs_padded = batch_desc_num_reqs if batch_desc_num_reqs is not None else num_reqs
```
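The fixed conditional can be sketched as a small standalone helper. This is illustrative only: `pick_num_reqs_padded` is a hypothetical function (not the actual `NPUModelRunner` code), and the `CUDAGraphMode` enum here is a simplified stand-in for vLLM's enum of the same name.

```python
from enum import Enum
from typing import Optional


class CUDAGraphMode(Enum):
    # Simplified stand-in for vLLM's CUDAGraphMode (illustrative subset).
    NONE = 0
    FULL_DECODE_ONLY = 1
    FULL = 2


def pick_num_reqs_padded(
    cudagraph_runtime_mode: CUDAGraphMode,
    configured_mode: CUDAGraphMode,
    num_reqs: int,
    batch_desc_num_reqs: Optional[int],
) -> int:
    """Hypothetical helper mirroring the fixed branch: keep num_reqs as the
    padded value only when both the runtime mode and the configured
    compilation_config.cudagraph_mode are FULL; otherwise fall back to the
    batch-descriptor count when one is available."""
    if (cudagraph_runtime_mode == CUDAGraphMode.FULL
            and configured_mode == CUDAGraphMode.FULL):
        return num_reqs
    return batch_desc_num_reqs if batch_desc_num_reqs is not None else num_reqs
```

Before the fix, the FULL_DECODE_ONLY case (runtime mode FULL but configured mode FULL_DECODE_ONLY) took the first branch and ignored the batch-descriptor count; the added check routes it to the fallback instead.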