[refactor] refactor execute_model and _dummy_run methods (#6043)
### What this PR does / why we need it?
The structure of the `execute_model` and `_dummy_run` methods in
NPUModelRunner differs greatly from that in GPUModelRunner.
This PR aligns them with GPUModelRunner (see the sketch after this list):
- Split the `_prepare_inputs` method into `_prepare_inputs`,
`_determine_batch_execution_and_padding`, `_build_attention_metadata`,
and `_preprocess`.
- Rename `_generate_process_reqs_hidden_states` to `_model_forward`.
- Align the implementation of the `postprocess` phase with GPUModelRunner.
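
To make the new structure concrete, here is a minimal sketch of how `execute_model` composes the split-out helpers. Only the method names come from this PR; the `BatchDescriptor` dataclass, all signatures, and all bodies are illustrative assumptions, not the actual NPUModelRunner code.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class BatchDescriptor:
    """Hypothetical summary of how a batch will be executed."""
    num_tokens: int   # token count after padding
    use_eager: bool   # eager mode vs. a captured graph


class NPUModelRunnerSketch:
    """Illustrative skeleton; only the method names match the PR."""

    def _determine_batch_execution_and_padding(
            self, scheduler_output: Any) -> BatchDescriptor:
        # Decide graph vs. eager execution and pad the batch accordingly.
        num_tokens = scheduler_output.total_num_scheduled_tokens
        return BatchDescriptor(num_tokens=num_tokens, use_eager=True)

    def _build_attention_metadata(self, batch: BatchDescriptor) -> dict:
        # Assemble attention metadata for the padded batch.
        return {"num_tokens": batch.num_tokens}

    def _preprocess(self, scheduler_output: Any,
                    batch: BatchDescriptor) -> dict:
        # Gather input ids, positions, etc. for the forward pass.
        return {"input_ids": [], "positions": []}

    def _prepare_inputs(self, scheduler_output: Any):
        # The former monolithic method now just orchestrates the helpers.
        batch = self._determine_batch_execution_and_padding(scheduler_output)
        attn_metadata = self._build_attention_metadata(batch)
        model_inputs = self._preprocess(scheduler_output, batch)
        return batch, attn_metadata, model_inputs

    def _model_forward(self, model_inputs: dict, attn_metadata: dict):
        # Renamed from _generate_process_reqs_hidden_states: runs the
        # model and returns hidden states (stubbed here).
        return None

    def execute_model(self, scheduler_output: Any):
        batch, attn_metadata, model_inputs = self._prepare_inputs(
            scheduler_output)
        hidden_states = self._model_forward(model_inputs, attn_metadata)
        # Postprocessing (sampling, logprobs) now mirrors GPUModelRunner.
        return hidden_states
```

Matching this call sequence to GPUModelRunner keeps the two runners easy to diff, so fixes in one can be ported to the other mechanically.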
**Related-RFC**: https://github.com/vllm-project/vllm-ascend/issues/5449
**Co-authored-by**: @zhenwenqi2024
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- vLLM version: v0.13.0
- vLLM main: d68209402d
---------
Signed-off-by: Wang Kunpeng <1289706727@qq.com>
Signed-off-by: gcanlin <canlinguosdu@gmail.com>
Signed-off-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
Co-authored-by: gcanlin <canlinguosdu@gmail.com>
Co-authored-by: zhenwenqi2024 <zhenwenqi_2022@qq.com>
```diff
@@ -123,6 +123,7 @@ def test_update_tokens_for_pcp_basic(tokens, num_reqs, num_computed_tokens,
     vllm_config = MagicMock()
     vllm_config.model_config = MagicMock()
     vllm_config.speculative_config.num_speculative_tokens = 0
+    vllm_config.scheduler_config.max_num_seqs = 1000

     pcp_manager = PCPManager(pcp_world_size=pcp_size,
                              pcp_rank=0,
```