[Bugfix] Fix precision issues that may arise from inter-layer workspace reuse in certain scenarios (#5522)

### What this PR does / why we need it?

When updating attention parameters, the FIA operator currently shares a
single workspace across all layers in the same computation graph, and
the workspace is held through the weak_ref_tensor mechanism to enable
memory reuse. In certain scenarios this sharing can cause precision
anomalies. To fix this, each layer in the same computation graph is now
assigned its own independent workspace.
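
As a rough illustration of why a workspace shared across layers can be
risky, here is a toy sketch in plain Python. It is not the vllm-ascend
code: `run_layers` is a hypothetical helper, Python lists stand in for
device buffers, and the real cause of the anomaly inside the FIA
operator is not described in this PR. It only shows the general aliasing
hazard that independent per-layer workspaces avoid.

```python
# Toy model only: plain Python lists stand in for device workspaces.
# `run_layers` is a hypothetical helper, not part of vllm-ascend.

def run_layers(num_layers, shared):
    """Let each layer fill a scratch workspace with its own index and
    return the workspaces recorded for a later graph replay."""
    recorded = []
    common = [0] * 4  # a single scratch buffer reused by every layer
    for layer in range(num_layers):
        # Shared mode aliases one buffer; independent mode allocates anew.
        workspace = common if shared else [0] * 4
        for i in range(len(workspace)):
            workspace[i] = layer
        recorded.append(workspace)
    return recorded


shared_result = run_layers(3, shared=True)
independent_result = run_layers(3, shared=False)

# With one shared buffer, earlier layers' contents are overwritten.
print(shared_result)       # [[2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]]
# With independent buffers, each layer keeps what it wrote.
print(independent_result)  # [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]
```

The trade-off is memory: independent workspaces cost one buffer per
layer instead of one per graph, in exchange for removing the aliasing.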

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

- vLLM version: v0.13.0
- vLLM main: 45c1ca1ca1

Signed-off-by: WithHades <244036962@qq.com>
无脸男
2025-12-31 16:54:04 +08:00
committed by GitHub
parent 46a1614387
commit 03679cf1d3
2 changed files with 19 additions and 3 deletions


@@ -440,8 +440,7 @@ class AscendAttentionBackendImpl(AttentionImpl):
 block_table=attn_metadata.block_tables,
 context_lens=attn_metadata.seq_lens,
 out=output)
-update_graph_params_workspaces(num_tokens,
-                               weak_ref_tensors(workspace))
+update_graph_params_workspaces(num_tokens, workspace)
 # Handle graph capturing mode
 stream = torch_npu.npu.current_stream()