[Refactor] Refactor Ascend attention implementation forward (#3714)

### What this PR does / why we need it?

This PR refactors the Ascend attention implementation to align with vLLM's core interfaces, simplifying the code and improving maintainability.

### Key Changes:

* **Align with vLLM's Attention Interface**: The `forward` method signature in `AscendAttentionBackendImpl` now matches the base `AttentionImpl` in vLLM, removing the custom `trace_flag`.
* **Enable Opaque Attention Operator**: By adding `opaque_attention_op` to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its standard `vllm.unified_attention_with_output` operator. This avoids the need for a custom call path.
* **Remove Obsolete Code**:
  * The custom op `vllm.unified_ascend_attention_with_output` has been deleted as it is now redundant.
  * The `trace_flag` and its associated logic were removed, reducing code complexity.
  * An outdated quantization branch within the attention implementation was cleaned up.
* **Improve Readability**: Renamed output variables (`output` vs. `intermediate_output`) and added comments to clarify the in-place nature of the attention output.

### Does this PR introduce _any_ user-facing change?

None.

### How was this patch tested?

No extra tests needed.

- vLLM version: v0.11.0rc3
- vLLM main: 17c540a993

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
2025-10-25 08:58:35 +08:00
parent 0b1da24742
commit 3158742a97
9 changed files with 191 additions and 349 deletions
--- a/vllm_ascend/torchair/models/qwen3_moe.py
+++ b/vllm_ascend/torchair/models/qwen3_moe.py
@@ -257,7 +257,6 @@ class CustomQwen3MoeAttention(Qwen3MoeAttention):
                                                 v,
                                                 kv_cache=kv_cache,
                                                 attn_metadata=attn_metadata,
-                                                 trace_flag=False,
                                                 **forward_kwargs)
            output, _ = self.o_proj(attn_output)
            return output
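
For reviewers, here is a rough sketch of what the call-site change in this hunk implies. It is not the vLLM or vllm-ascend code: `AttentionImplSketch`, `AscendLikeImpl`, and `call_site` are hypothetical stand-ins, plain lists replace tensors, and the arithmetic is a placeholder rather than real attention. It only illustrates two points from the description: once `forward` follows the shared signature, a leftover `trace_flag=` keyword is rejected, and the result is written in place into a caller-provided `output` buffer.

```python
from typing import Any, List, Optional


class AttentionImplSketch:
    """Hypothetical stand-in for a shared attention interface (not vLLM's class)."""

    def forward(self, query, key, value, kv_cache, attn_metadata, output=None):
        raise NotImplementedError


class AscendLikeImpl(AttentionImplSketch):
    """Hypothetical backend whose forward matches the base signature exactly."""

    def forward(
        self,
        query: List[float],
        key: List[float],
        value: List[float],
        kv_cache: Any,
        attn_metadata: Any,
        output: Optional[List[float]] = None,
    ) -> List[float]:
        if output is None:
            raise ValueError("output buffer must be preallocated by the caller")
        # Placeholder arithmetic standing in for the attention kernel.
        # The result is written in place into the caller's buffer.
        for i, (q, k, v) in enumerate(zip(query, key, value)):
            output[i] = q * k * v
        return output


def call_site(impl, q, k, v, kv_cache=None, attn_metadata=None, **forward_kwargs):
    """Mirrors the shape of the call in the hunk, minus trace_flag."""
    out = [0.0] * len(q)
    result = impl.forward(
        q, k, v,
        kv_cache=kv_cache,
        attn_metadata=attn_metadata,
        output=out,
        **forward_kwargs,
    )
    return out, result
```

In this sketch, `call_site(AscendLikeImpl(), q, k, v)` returns the same list object twice, since the buffer is filled in place, while `call_site(AscendLikeImpl(), q, k, v, trace_flag=False)` raises `TypeError`. That is why callers such as the one in this hunk have to drop the argument.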