[Feature] model_runner refactor (#4764)

### What this PR does / why we need it?
Refactor npu_modelrunner so that it stays close to gpu_modelrunner.

### Does this PR introduce _any_ user-facing change?
No.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Signed-off-by: zhenwenqi2024 <155598497+zhenwenqi2024@users.noreply.github.com>
2025-12-12 17:27:09 +08:00
parent 5b12c068f9
commit f708d919f8
10 changed files with 676 additions and 1815 deletions
--- a/vllm_ascend/attention/utils.py
+++ b/vllm_ascend/attention/utils.py
@@ -100,10 +100,6 @@ class AscendCommonAttentionMetadata:
    # padding tokens. It is used to handle some padding operations.
    num_input_tokens: int = 0

-    # NOTE: This is a temporary solution for rotary embedding in MLA
-    cos: torch.Tensor = None
-    sin: torch.Tensor = None
-
    prefill_context_parallel_metadata: Optional[
        AscendPrefillContextParallelMetadata] = None