[Feat][Graph] Support `FULL_DECODE_ONLY` mode for GQA/MHA models (#2128)

Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the
`torch_npu` release from September 30.

### What this PR does / why we need it?
This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA
models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution
graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as
one static graph also mitigates dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up
streams, allowing more graphs to be captured.
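
The dispatch-latency point can be illustrated with a toy count of graph replays per decode step. This is only a sketch: the function name is hypothetical, and it assumes that piecewise capture splits the model at every attention op (so an N-layer model yields N + 1 captured pieces), which may not match the exact splitting on every model.

```python
def replays_per_step(num_layers: int, mode: str) -> int:
    """Count graph replays for one decode step in a toy model.

    Assumption: piecewise capture splits at each attention op, giving
    num_layers + 1 captured pieces; full capture yields a single graph.
    """
    if mode == "PIECEWISE":
        return num_layers + 1
    if mode == "FULL_DECODE_ONLY":
        return 1
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical 32-layer model: many small replays versus one full replay.
piecewise = replays_per_step(32, "PIECEWISE")
full = replays_per_step(32, "FULL_DECODE_ONLY")
print(f"piecewise replays: {piecewise}, full-graph replays: {full}")
```

Each replay carries its own host-side launch cost, so collapsing them into one replay is where the saving would come from under this assumption.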

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in
`torch_npu`, which can deadlock when synchronizing during graph replay.
We’re working on a fix.

There may be other corner cases. This PR is the first in a planned
series; we’ll continue to iterate and address remaining issues in
follow-ups.

This is essentially a port of #1503 and #1677, but includes two major
changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding
it in the backend, which decouples Full Graph and Piecewise Graph and
could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in
`update_graph_params`; multi-attention models may or may not be fully
supported yet.
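
As a rough illustration of the first change, a dispatcher can pick the runtime mode per batch while the backend simply follows that decision. The sketch below is not the PR's code: `RuntimeMode`, `Batch`, and `dispatch` are hypothetical names, and it assumes a decode-only batch is one where every request contributes exactly one token.

```python
from dataclasses import dataclass
from enum import Enum


class RuntimeMode(Enum):
    NONE = 0  # run eagerly, no graph replay
    FULL = 1  # replay the full captured graph


@dataclass(frozen=True)
class Batch:
    num_reqs: int
    num_tokens: int


def dispatch(batch: Batch, captured_sizes: set) -> RuntimeMode:
    """Choose the runtime mode for a batch (toy FULL_DECODE_ONLY policy)."""
    # Assumption: decode-only means one new token per request.
    decode_only = batch.num_tokens == batch.num_reqs
    if decode_only and batch.num_tokens in captured_sizes:
        return RuntimeMode.FULL
    # Prefill or mixed batches, and uncaptured sizes, fall back to eager.
    return RuntimeMode.NONE


captured = {1, 2, 4, 8}
print(dispatch(Batch(num_reqs=4, num_tokens=4), captured))
print(dispatch(Batch(num_reqs=2, num_tokens=130), captured))
```

Keeping the decision in one place means the attention backend does not need to know which graph mode is active, which is the decoupling the item above describes.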

### Does this PR introduce _any_ user-facing change?
```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```
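
A minimal sketch of how this setting might be passed to the offline `LLM` entry point follows. The helper names are hypothetical and the model name is a placeholder supplied by the caller; only the `compilation_config` value comes from this PR.

```python
def build_engine_kwargs(model: str) -> dict:
    """Assemble engine keyword arguments that enable FULL_DECODE_ONLY."""
    return {
        "model": model,
        "compilation_config": {"cudagraph_mode": "FULL_DECODE_ONLY"},
    }


def make_llm(model: str):
    """Create the engine; requires vLLM and vllm-ascend to be installed."""
    # Imported lazily so build_engine_kwargs can be used without vLLM present.
    from vllm import LLM

    return LLM(**build_engine_kwargs(model))


# Inspect the arguments without constructing an engine.
print(build_engine_kwargs("<your-gqa-or-mha-model>"))
```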

### How was this patch tested?
Tests included.


- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>

Author: Yizhou
Date: 2025-09-22 17:14:28 +08:00
Committed by: GitHub
Parent: f39bd309b6
Commit: 338231acaf

14 changed files with 390 additions and 91 deletions

									
vllm_ascend/compilation/acl_graph.py
									
@@ -147,6 +147,7 @@ class ACLGraphWrapper:
                        patch("torch.npu.empty_cache", lambda: None))
                # mind-exploding: carefully manage the reference and memory.
                forward_context.capturing = True
                with torch.npu.graph(aclgraph, pool=self.graph_pool):
                    # `output` is managed by pytorch's aclgraph pool
                    output = self.runnable(*args, **kwargs)
