[main][refactor] Refactoring forward_context and model_runner_v1 (#1979)

### What this PR does / why we need it?

Refactors `forward_context` and `model_runner_v1`: adds the context required during model inference to `forward_context`, and reworks the `dummy_run` logic so that it is more reasonable.

Details of this PR:

- Add `ascend_forward_context`.
- Update the mc2_v2 op and support the `active_mask` param.
- Update scripts in the examples dir.
- Refactor the `dummy_run` logic.
- Add soc_version for A2 and A3.

### Does this PR introduce _any_ user-facing change?

No user-facing change.

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main: 57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
2025-07-28 14:06:20 +08:00
parent e3a2443c3a
commit ba3dfbd59e
22 changed files with 629 additions and 347 deletions
--- a/examples/offline_dualbatch_overlap_npu.py
+++ b/examples/offline_dualbatch_overlap_npu.py
@@ -21,6 +21,7 @@ def main():
              tensor_parallel_size=2,
              max_model_len=4096,
              trust_remote_code=True,
+              enable_expert_parallel=True,
              additional_config={
                  "torchair_graph_config": {
                      "enabled": False
@@ -28,7 +29,6 @@ def main():
                  "ascend_scheduler_config": {
                      "enabled": True
                  },
-                  "expert_tensor_parallel_size": 1
              })

    # Generate texts from the prompts. The output is a list of RequestOutput