[main][refactor] Refactoring forward_context and model_runner_v1 (#1979)

### What this PR does / why we need it?

Refactors `forward_context` and `model_runner_v1`: adds context required for model inference to `forward_context`, and reworks the `dummy_run` logic to make it more reasonable.

Details of this PR:

- Add `ascend_forward_context`;
- Update the mc2_v2 op and support the `active_mask` param;
- Update scripts in the examples dir;
- Refactor the `dummy_run` logic;
- Add soc_version for A2 and A3.

### Does this PR introduce _any_ user-facing change?

No user-facing change.

### How was this patch tested?

- vLLM version: v0.10.0
- vLLM main: 57c22e57f9

Signed-off-by: zzzzwwjj <1183291235@qq.com>
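
A minimal sketch of the idea behind moving per-step values into the forward context, so that model code can read `max_tokens_across_dp` directly instead of branching on whether `attn_metadata` is `None`. This is not the vllm-ascend implementation: `SketchForwardContext`, `set_sketch_forward_context`, `get_sketch_forward_context`, and the `tokens_per_dp_rank` parameter are hypothetical names chosen for illustration, and the fallback to the local token count is an assumption modeled on the profile-run comment in the removed code.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional, Sequence


@dataclass
class SketchForwardContext:
    """Hypothetical stand-in for a forward context; not the real class."""
    attn_metadata: Optional[Any]
    max_tokens_across_dp: int


# Module-level slot holding the context of the forward pass in progress.
_current: Optional[SketchForwardContext] = None


@contextmanager
def set_sketch_forward_context(
    attn_metadata: Optional[Any],
    num_tokens: int,
    tokens_per_dp_rank: Optional[Sequence[int]] = None,
) -> Iterator[SketchForwardContext]:
    """Compute max_tokens_across_dp once, before the model's forward runs."""
    global _current
    previous = _current
    if tokens_per_dp_rank:
        # Token counts from every data-parallel rank are known: take the max.
        max_tokens = max(tokens_per_dp_rank)
    else:
        # Assumption: with no per-rank counts (e.g. a profile or dummy run),
        # all ranks hold the same number of tokens, so the local count is used.
        max_tokens = num_tokens
    _current = SketchForwardContext(attn_metadata, max_tokens)
    try:
        yield _current
    finally:
        # Restore whatever context was active before this forward pass.
        _current = previous


def get_sketch_forward_context() -> SketchForwardContext:
    """Return the active context, or fail loudly outside a forward pass."""
    if _current is None:
        raise RuntimeError("no forward context is set")
    return _current


def model_forward_sketch() -> int:
    # Model code no longer needs to check whether attn_metadata is None.
    return get_sketch_forward_context().max_tokens_across_dp
```

With this shape, the `None` check lives in one place (where the context is built) rather than in every model that needs the value, which is the simplification the `pangu_moe.py` hunk shows.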
2025-07-28 14:06:20 +08:00
parent e3a2443c3a
commit ba3dfbd59e
22 changed files with 629 additions and 347 deletions
--- a/vllm_ascend/models/pangu_moe.py
+++ b/vllm_ascend/models/pangu_moe.py
@@ -837,12 +837,8 @@ class PanguProMoEModel(nn.Module):
            # if attn_meatadata is not passed, we try to get it from forward_context.
            if attn_metadata is None:
                attn_metadata = get_forward_context().attn_metadata
-            if attn_metadata is None:
-                # when attn_meatadata is None, it is in profile_run. num_tokens on all dp ranks
-                # are same.
-                max_tokens_across_dp = hidden_states.shape[0]
-            else:
-                max_tokens_across_dp = attn_metadata.max_num_tokens_across_dp
+
+            max_tokens_across_dp = get_forward_context().max_tokens_across_dp

            tp_size = get_tp_group().world_size
            # reduce scatter will split the input tensor into equal sizes and then scatter them on all ranks.