[refactor] refactor excute_model and _dymmy_run method (#6043)

### What this PR does / why we need it? The structure of the `excute_model` and `_dymmy_run` methods in NPUModelRunner differs greatly from that in GPUModelRunner. Achieve alignment with GPUModelRunner: Split the `_prepare_inputs` method into `_prepare_inputs`, `_determine_batch_execution_and_padding`, `_build_attention_metadata`, and `_preprocess`. Modify `_generate_process_reqs_hidden_states` to `_model_forward`. Align the implementation of the `postprocess` phase **Related-RFC**: https://github.com/vllm-project/vllm-ascend/issues/5449 **Co-authored-by**: @zhenwenqi2024 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? - vLLM version: v0.13.0 - vLLM main: d68209402d --------- Signed-off-by: Wang Kunpeng <1289706727@qq.com> Signed-off-by: gcanlin <canlinguosdu@gmail.com> Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com> Co-authored-by: gcanlin <canlinguosdu@gmail.com> Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
2026-01-27 22:27:01 +08:00
parent 41eb71d665
commit c498cea22d
4 changed files with 825 additions and 650 deletions
--- a/vllm_ascend/attention/sfa_v1.py
+++ b/vllm_ascend/attention/sfa_v1.py
@@ -170,7 +170,7 @@ class AscendSFAMetadataBuilder(MLACommonMetadataBuilder[AscendSFAMetadata]):
                npu_fused_infer_attention_score TND layout's limit of 16, \
                got {self.decode_threshold}"
            )
-
+        self.reorder_batch_threshold = self.decode_threshold
        self.attn_mask_builder = AttentionMaskBuilder(self.device)
        self.rope_dim = self.model_config.hf_text_config.qk_rope_head_dim
        self.enable_dsa_cp = enable_dsa_cp()