[v0.11.0-dev][bugfix] Add branch for stream up-lifting in update_attn_params (#4437)

### What this PR does / why we need it?
#3985 moved stream context initialization before the for-loops to improve performance. However, we found that this can cause an accuracy drop when used with PD disaggregation. This PR therefore partly reverts that change when PD disaggregation is in use; the underlying bug will be fixed in the future.

### Does this PR introduce _any_ user-facing change?
No.

---------

Signed-off-by: Angazenn <supperccell@163.com>
2025-12-08 08:54:46 +08:00
parent 2598124e67
commit 6391f0625f
2 changed files with 80 additions and 23 deletions
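
The branch described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the actual body of `update_attn_params`, which is not shown in this excerpt. `FakeStream` and `update_params_sketch` are hypothetical names, and treating a non-`None` `kv_transfer_config` as the signal for PD disaggregation is an assumption based only on the new argument passed at the call sites below.

```python
from contextlib import contextmanager


class FakeStream:
    """Hypothetical stand-in for a device stream; counts context entries."""

    def __init__(self):
        self.enter_count = 0

    @contextmanager
    def context(self):
        # Each entry models the cost of (re)initializing the stream context.
        self.enter_count += 1
        yield self


def update_params_sketch(stream, layer_keys, kv_transfer_config=None):
    """Illustrative only: choose where the stream context is entered."""
    updated = []
    if kv_transfer_config is None:
        # Assumed non-PD path: keep the optimization from #3985 and
        # enter the stream context once, before the loop.
        with stream.context():
            for key in layer_keys:
                updated.append(key)
    else:
        # Assumed PD-disaggregation path: partial revert, entering the
        # stream context inside the loop, once per iteration.
        for key in layer_keys:
            with stream.context():
                updated.append(key)
    return updated
```

With three layers, the first path enters the context once and the second enters it three times, which is the performance cost the partial revert accepts in exchange for avoiding the accuracy drop.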
--- a/vllm_ascend/worker/model_runner_v1.py
+++ b/vllm_ascend/worker/model_runner_v1.py
@@ -1598,7 +1598,8 @@ class NPUModelRunner(LoRAModelRunnerMixin):
                                       self.speculative_config)
            else:
                update_attn_params(self.update_stream, forward_context,
-                                   maybe_padded_num_tokens)
+                                   maybe_padded_num_tokens,
+                                   self.vllm_config.kv_transfer_config)

        if get_forward_context().sp_enabled:
            hidden_states = tensor_model_parallel_all_gather(hidden_states, 0)
@@ -2359,7 +2360,8 @@ class NPUModelRunner(LoRAModelRunnerMixin):
                                       num_tokens, self.speculative_config)
            else:
                update_attn_params(self.update_stream, forward_context,
-                                   num_tokens)
+                                   num_tokens,
+                                   self.vllm_config.kv_transfer_config)

        if self.drafter and self.drafter.name == SpecDcodeType.EAGLE3:
            hidden_states, _ = hidden_states