[Refactor] Unify full-graph parameter update logic (#6041)
### What this PR does / why we need it?

**Refactor: Unify full-graph parameter update logic**

This PR consolidates the scattered full-graph parameter update logic into a unified approach, improving the code architecture and eliminating duplication.

**Key improvements:**

1. **Unified interface**
   - Create `update_full_graph_params` as the single entry point for all full-graph updates
   - Replace multiple scattered update calls with one unified function
   - Remove ~50 lines of duplicated if-else logic across `model_runner_v1.py` and `eagle_proposer.py`
2. **Better architecture**
   - Move update logic into the respective Backend classes (`AscendAttentionBackend`, `AscendMLABackend`)
   - Each Backend manages its own parameter update logic internally
   - Callers simply dispatch to the appropriate Backend
3. **Cleaner parameter handling**
   - Remove unnecessary `pcp_size` and `dcp_size` parameter passing
   - Obtain the parallel configuration directly from the distributed groups
   - Consistent with how other parts of the codebase obtain these values

**Why we need it:**

- **Maintainability**: future changes only need to be made in one place per Backend
- **Code quality**: follows the DRY and Single Responsibility principles
- **Readability**: cleaner, more intuitive code structure

### Does this PR introduce _any_ user-facing change?

**No.** This is a pure refactoring with no functional changes: same behavior, cleaner code.

### How was this patch tested?

- All existing unit tests pass with updated mocks
- No new tests are needed (pure refactoring, no behavior changes)
- CI validates correctness

---

- vLLM version: v0.13.0

Signed-off-by: lico67373 <918688502@qq.com>
Co-authored-by: drslark <slarksblood@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
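The "single entry point + per-Backend update" pattern described above can be sketched roughly as follows. This is an illustrative mock, not the PR's actual code: `update_full_graph_params`, `AscendAttentionBackend`, and `AscendMLABackend` are names taken from the PR description, but the signatures and method bodies here are placeholders.

```python
# Illustrative sketch of the unified dispatch described in the PR.
# The method bodies are placeholders; the real backends update the
# attention arguments captured inside an NPU graph.

class AscendAttentionBackend:
    @staticmethod
    def update_graph_params(update_stream, forward_context, num_tokens):
        # Real code: update the fused-attention args for each layer.
        return f"attention backend updated {num_tokens} tokens"


class AscendMLABackend:
    @staticmethod
    def update_graph_params(update_stream, forward_context, num_tokens):
        # Real code: update the MLA fused-infer-attention args.
        return f"mla backend updated {num_tokens} tokens"


def update_full_graph_params(backend, update_stream, forward_context, num_tokens):
    # Single entry point: callers no longer carry per-backend if-else
    # logic; each Backend class owns its own update routine.
    return backend.update_graph_params(update_stream, forward_context, num_tokens)
```

The caller's duplicated if-else branches collapse into one dispatch call, which is the ~50-line reduction the PR description refers to.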
@@ -284,6 +284,85 @@ class AscendMlaCPImpl(AscendMLAImpl):
```python
        self.dcp_rank = get_decode_context_model_parallel_rank() if self.dcp_size > 1 else 0
        self.dcp_group = get_dcp_group().device_group if self.dcp_size > 1 else None

    @staticmethod
    def update_graph_params(
        update_stream,
        forward_context,
        num_tokens,
        vllm_config=None,
        speculative_config=None,
        num_dcp_pcp_tokens=None,
    ):
        if forward_context.is_draft_model:
            graph_params = get_draft_graph_params()
        else:
            graph_params = get_graph_params()
        # FIXME: Behold! We are using a temporary hack here to update the args
        # for each layer's attention op in the graph.
        with torch.npu.stream(update_stream):
            for key, param, handle, event in zip(
                forward_context.attn_metadata,
                graph_params.attn_params[num_tokens],
                graph_params.handles[num_tokens],
                graph_params.events[num_tokens],
            ):
                (
                    q_nope,
                    k_nope,
                    q_pe,
                    k_pe,
                    num_heads,
                    num_kv_heads,
                    input_layout,
                    spec_attn_mask,
                    sparse_mode,
                    scale,
                    block_table,
                    block_size,
                    actual_seq_lengths,
                    actual_seq_lengths_kv,
                    attn_output,
                    softmax_lse,
                ) = param

                decode_meta = forward_context.attn_metadata[key].decode
                seq_len = decode_meta.cp_seq_len
                if isinstance(seq_len, torch.Tensor):
                    seq_len = seq_len.tolist()
                actual_seq_lengths_kv = seq_len

                pad_length = num_tokens - len(actual_seq_lengths_kv)
                if pad_length > 0:
                    actual_seq_lengths_kv = actual_seq_lengths_kv + [0] * pad_length

                torch.npu.graph_task_update_begin(update_stream, handle)
                torch_npu.npu_fused_infer_attention_score.out(
                    q_nope,
                    k_nope,
                    k_nope,
                    query_rope=q_pe,
                    key_rope=k_pe,
                    num_heads=num_heads,
                    num_key_value_heads=num_kv_heads,
                    input_layout=input_layout,
                    atten_mask=spec_attn_mask,
                    sparse_mode=sparse_mode,
                    scale=scale,
                    antiquant_mode=0,
                    antiquant_scale=None,
                    softmax_lse_flag=True,
                    block_table=block_table,
                    block_size=block_size,
                    actual_seq_lengths_kv=actual_seq_lengths_kv,
                    actual_seq_lengths=actual_seq_lengths,
                    workspace=graph_params.workspaces.get(num_tokens),
                    out=[attn_output, softmax_lse],
                )
                torch.npu.graph_task_update_end(update_stream)

                event.record(update_stream)

    def get_num_actual_tokens(self, attn_metadata: M):
        if self.pcp_size > 1:
            return attn_metadata.num_actual_tokens_pcp_padded // self.pcp_size
```
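The sequence-length handling inside the loop above (tensor-to-list conversion followed by zero-padding up to `num_tokens`) can be isolated as a small helper for illustration. The helper name is hypothetical; the real code does this inline on `torch.Tensor` values, while this sketch uses `hasattr` so it also runs on plain lists:

```python
def pad_actual_seq_lengths_kv(seq_len, num_tokens):
    # Mirror of the inline logic above: convert a tensor to a Python
    # list, then right-pad with zeros so it has num_tokens entries.
    if hasattr(seq_len, "tolist"):  # torch.Tensor in the real code
        seq_len = seq_len.tolist()
    pad_length = num_tokens - len(seq_len)
    if pad_length > 0:
        seq_len = seq_len + [0] * pad_length
    return seq_len
```

Padding to exactly `num_tokens` entries matters because the captured graph replays a fixed-size attention op, so the argument list must match the captured shape.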