[Refactor] Refactor Ascend attention implementation forward (#3714)

### What this PR does / why we need it?
This PR refactors the Ascend attention implementation to align with
vLLM's core interfaces, simplifying the code and improving
maintainability.

### Key Changes:

* **Align with vLLM's Attention Interface**: The `forward` method
signature in `AscendAttentionBackendImpl` now matches the base
`AttentionImpl` in vLLM, removing the custom `trace_flag`.

* **Enable Opaque Attention Operator**: By adding `opaque_attention_op`
to `AscendPlatform`, we allow vLLM to wrap our attention kernel in its
standard `vllm.unified_attention_with_output` operator. This avoids the
need for a custom call path.

* **Remove Obsolete Code**:
    * The custom op `vllm.unified_ascend_attention_with_output` has been
    deleted as it is now redundant.
    * The `trace_flag` and its associated logic were removed, reducing code
    complexity.
    * An outdated quantization branch within the attention implementation
    was cleaned up.

* **Improve Readability**: Renamed output variables (`output` vs.
`intermediate_output`) and added comments to clarify the in-place nature
of the attention output.
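
The interface alignment and in-place output convention described above can be illustrated with a minimal sketch. The class names `ToyAttentionImpl` and `ToyBackendImpl` are hypothetical stand-ins, not vLLM or vllm-ascend classes, and the arithmetic is a placeholder rather than real attention. The sketch only shows a backend whose `forward` matches its base signature (no extra flag such as `trace_flag`) and writes into a caller-provided `output` buffer:

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class ToyAttentionImpl(ABC):
    """Hypothetical stand-in for a base attention interface."""

    @abstractmethod
    def forward(
        self,
        query: List[float],
        key: List[float],
        value: List[float],
        output: Optional[List[float]] = None,
    ) -> List[float]:
        ...


class ToyBackendImpl(ToyAttentionImpl):
    """Hypothetical backend: same signature as the base, no extra flags."""

    def forward(self, query, key, value, output=None):
        if output is None:
            output = [0.0] * len(query)
        # Placeholder arithmetic, not real attention. The point is that
        # results are written into the caller-provided buffer in place.
        for i in range(len(query)):
            output[i] = query[i] * key[i] + value[i]
        return output


buf = [0.0, 0.0]
result = ToyBackendImpl().forward([1.0, 2.0], [3.0, 4.0], [0.5, 0.5], output=buf)
```

Because the returned object is the same buffer the caller passed in, a generic wrapper that pre-allocates `output` can call any backend the same way.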

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
No extra tests needed.

- vLLM version: v0.11.0rc3
- vLLM main: 17c540a993

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

Yizhou, 2025-10-25 08:58:35 +08:00, committed by GitHub

parent 0b1da24742
commit 3158742a97

9 changed files with 191 additions and 349 deletions

									
vllm_ascend/worker/model_runner_v1.py

@@ -3740,7 +3740,6 @@ class NPUModelRunner(LoRAModelRunnerMixin):
        splitting_ops_contain_attention = (
            self.compilation_config.splitting_ops is not None
            and all(op in self.compilation_config.splitting_ops for op in [
                "vllm.unified_ascend_attention_with_output",
                "vllm.mla_forward",
            ]))
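
The membership check in this hunk can be sketched as a standalone function. The helper name `splitting_ops_contain_attention` as a function, its parameters, and the `REQUIRED_OPS` constant are assumptions made for illustration; the op names are copied from the hunk as displayed, which does not show which line the commit removed:

```python
from typing import List, Optional

# Op names as they appear in the displayed hunk (illustrative input only).
REQUIRED_OPS = [
    "vllm.unified_ascend_attention_with_output",
    "vllm.mla_forward",
]


def splitting_ops_contain_attention(
    splitting_ops: Optional[List[str]],
    required_ops: List[str],
) -> bool:
    """Return True only if splitting_ops is set and lists every required op."""
    return splitting_ops is not None and all(
        op in splitting_ops for op in required_ops
    )


unset = splitting_ops_contain_attention(None, REQUIRED_OPS)
partial = splitting_ops_contain_attention(["vllm.mla_forward"], REQUIRED_OPS)
complete = splitting_ops_contain_attention(list(REQUIRED_OPS), REQUIRED_OPS)
```

Since the check uses `all(...)`, a single missing op name makes the result false, so the list of required names has to track any operator that is renamed or deleted.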
