[Fix] Refactor dummy attention metadata creation (#3497)
### What this PR does / why we need it?

The `force_attention` parameter was designed for FlashInfer kernel warmup. It is not actually needed on Ascend devices (at least for now), and it tends to make things more complicated.

This PR therefore replaces the `force_attention` parameter with `aclgraph_runtime_mode` in the attention metadata creation logic. The control flow becomes more explicit: the graph runtime mode directly determines how attention metadata is built, instead of an intermediate boolean flag. This removes redundant logic and clarifies the conditions under which attention metadata is built for full decode graph mode.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

DP + `FULL_DECODE_ONLY` + online serving.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@@ -26,7 +26,7 @@ import torch_npu
 import vllm.envs as envs_vllm
 from torch_npu.op_plugin.atb._atb_ops import _register_atb_extensions
 from torch_npu.profiler import dynamic_profile as dp
-from vllm.config import CUDAGraphMode, VllmConfig
+from vllm.config import VllmConfig
 from vllm.distributed import (ensure_model_parallel_initialized,
                               init_distributed_environment)
 from vllm.distributed.kv_transfer import ensure_kv_transfer_initialized
@@ -360,11 +360,9 @@ class NPUWorker(WorkerBase):
         return self.model_runner.pin_lora(lora_id)
 
     def execute_dummy_batch(self) -> None:
-        force_attention = self.compilation_config.cudagraph_mode == CUDAGraphMode.FULL_DECODE_ONLY
         self.model_runner._dummy_run(
             num_tokens=self.model_runner.decode_token_per_req,
-            uniform_decode=True,
-            force_attention=force_attention)
+            uniform_decode=True)
 
     def _init_worker_distributed_environment(self) -> None:
         """Initialize the distributed environment."""
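The description says the graph runtime mode, rather than a boolean flag, now decides whether dummy attention metadata is built. A minimal sketch of that idea follows. It is not the code from this PR: `GraphRuntimeMode` is a hypothetical stand-in for vLLM's `CUDAGraphMode`, `build_dummy_attn_metadata` and the returned dictionary keys are illustrative names, and the rule "build metadata only in full-graph mode" is an assumption drawn from the description.

```python
from enum import Enum
from typing import Optional


class GraphRuntimeMode(Enum):
    """Hypothetical stand-in for the runtime values of vLLM's CUDAGraphMode."""
    NONE = 0
    PIECEWISE = 1
    FULL = 2


def build_dummy_attn_metadata(runtime_mode: GraphRuntimeMode,
                              num_reqs: int,
                              num_tokens: int) -> Optional[dict]:
    """Return placeholder attention metadata only for full-graph runs.

    Assumption: other runtime modes do not need attention metadata
    during a dummy run, so None is returned for them.
    """
    if runtime_mode is not GraphRuntimeMode.FULL:
        return None
    # Illustrative payload; real metadata would hold device tensors.
    return {"num_reqs": num_reqs, "num_actual_tokens": num_tokens}
```

With this shape, a caller passes the mode it is already running in and needs no separate flag, which would be why a call site like `execute_dummy_batch` can drop the `force_attention` argument.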