[Feat][Graph] Support FULL_DECODE_ONLY mode for GQA/MHA models (#2128)

Note: This depends on [vLLM #25161](https://github.com/vllm-project/vllm/pull/25161) and the torch\_npu release from September 30.

### What this PR does / why we need it?

This pull request adds `FULL_DECODE_ONLY` mode for GQA/MHA models (MLA models like DeepSeek V3/R1 are not included). Key improvements include:

* **Reduced dispatch latency:** By replaying the entire model execution graph at once, we cut overhead compared with multiple smaller replays.
* **Stabilized multi-device performance:** Capturing the whole model as one static graph also mitigates the dispatch fluctuations across devices.
* **Stream/resource savings:** Consolidating graph captures frees up streams, allowing more graphs to be captured.

**Known issues:**

1. `_npu_paged_attention` currently manages its own workspace in `torch_npu`, which can deadlock when synchronizing during graph replay. We’re working on a fix.

There may be other corner cases. This PR is the first in a planned series; we’ll continue to iterate and address remaining issues in follow-ups.

This is essentially a port of #1503 and #1677, but includes two major changes:

1. Let `graph_dispatcher` decide the graph mode instead of hard-coding it in the backend, which decouples Full Graph and Piecewise Graph and could make it possible to remove dynamo.
2. Adapt to the new `attn_group` logic, but leave a small hack in `update_graph_params`; multi-attention models may or may not be fully supported yet.

### Does this PR introduce _any_ user-facing change?

```python
compilation_config={
    "cudagraph_mode": "FULL_DECODE_ONLY",
},
```

### How was this patch tested?

Tests included.

- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
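
The `compilation_config` fragment in the description is shown without the call around it. As a minimal sketch, assuming the setting is passed to vLLM's `LLM` constructor, it could be wired up roughly as below. The helper `build_engine_kwargs` and the model path are hypothetical names used only for illustration; only the `cudagraph_mode` value comes from this PR.

```python
from typing import Any


def build_engine_kwargs(model: str) -> dict[str, Any]:
    """Hypothetical helper: keyword arguments that request FULL_DECODE_ONLY mode."""
    return {
        "model": model,
        # The only setting taken from the PR description.
        "compilation_config": {"cudagraph_mode": "FULL_DECODE_ONLY"},
    }


if __name__ == "__main__":
    # Placeholder path; substitute a GQA/MHA model (MLA models are not covered).
    kwargs = build_engine_kwargs("path/to/gqa-model")
    try:
        from vllm import LLM  # needs vLLM, plus vllm-ascend on NPU hardware
    except ImportError:
        print("vLLM is not installed; the engine would be created with:", kwargs)
    else:
        llm = LLM(**kwargs)
```

Keeping the arguments in a plain dictionary makes the setting easy to inspect or log before the engine is constructed.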
2025-09-22 17:14:28 +08:00
parent f39bd309b6
commit 338231acaf
14 changed files with 390 additions and 91 deletions
--- a/tests/ut/attention/test_attention_v1.py
+++ b/tests/ut/attention/test_attention_v1.py
@@ -101,7 +101,7 @@ class TestAscendAttentionMetadataBuilder(TestBase):
            max_query_len=5,
            decode_token_per_req=torch.tensor([1, 1]),
            block_table_tensor=torch.zeros((10, 10)),
-            slot_mapping_cpu=torch.tensor(range(20)),
+            slot_mapping=torch.tensor(range(20)),
            actual_seq_lengths_q=torch.tensor([0, 1]),
            positions=torch.tensor([10, 10]),
            attn_mask=torch.ones((10, 10)),
@@ -134,7 +134,7 @@ class TestAscendAttentionMetadataBuilder(TestBase):
            max_query_len=6,
            decode_token_per_req=torch.tensor([1, 1, 1]),
            block_table_tensor=torch.zeros((10, 10)),
-            slot_mapping_cpu=torch.tensor(range(20)),
+            slot_mapping=torch.tensor(range(20)),
            actual_seq_lengths_q=torch.tensor([0, 1, 2]),
            positions=torch.tensor([10, 10]),
            attn_mask=torch.ones((15, 15)),
@@ -165,7 +165,7 @@ class TestAscendAttentionMetadataBuilder(TestBase):
            max_query_len=6,
            decode_token_per_req=torch.tensor([1, 1, 1]),
            block_table_tensor=torch.zeros((10, 10)),
-            slot_mapping_cpu=torch.tensor(range(20)),
+            slot_mapping=torch.tensor(range(20)),
            actual_seq_lengths_q=torch.tensor([0, 1, 2]),
            positions=torch.tensor([10, 10]),
            attn_mask=torch.ones((15, 15)),
@@ -378,10 +378,12 @@ class TestAscendAttentionBackendImpl(TestBase):
        mock_flash_attention_qlens.assert_called_once()
        assert output.shape == (10, 8 * 64)

+    @patch('vllm_ascend.attention.attention_v1.get_forward_context')
    @patch('torch_npu._npu_reshape_and_cache')
    @patch('torch_npu._npu_paged_attention')
    def test_forward_decode_only(self, mock_paged_attention,
-                                 mock_npu_reshape_and_cache):
+                                 mock_npu_reshape_and_cache,
+                                 mock_get_forward_context):
        """Test forward pass in DecodeOnly state"""
        query = torch.randn(10, 8 * 64)
        key = torch.randn(10, 8 * 64)
@@ -395,6 +397,8 @@ class TestAscendAttentionBackendImpl(TestBase):
        metadata.slot_mapping = torch.zeros(10, dtype=torch.long)
        layer = self.layer_no_quant

+        mock_get_forward_context.return_value = MagicMock(capturing=False)
+
        output = self.impl.forward(layer,
                                   query,
                                   key,
@@ -435,12 +439,13 @@ class TestAscendAttentionBackendImpl(TestBase):
        mock_fused_infer_attention_score.assert_called_once()
        assert output.shape == (10, 8 * 64)

+    @patch('vllm_ascend.attention.attention_v1.get_forward_context')
    @patch('torch_npu._npu_reshape_and_cache')
    @patch('torch_npu._npu_paged_attention')
    @patch('torch_npu.npu_fused_infer_attention_score')
    def test_forward_decode_only_swa_seq_len_mismatch(
            self, mock_fused_infer_attention_score, mock_paged_attention,
-            mock_npu_reshape_and_cache):
+            mock_npu_reshape_and_cache, mock_get_forward_context):
        """Test forward pass in DecodeOnly state when seq)len_mismatch"""
        query = torch.randn(10, 8 * 64)
        key = torch.randn(10, 8 * 64)
@@ -457,6 +462,8 @@ class TestAscendAttentionBackendImpl(TestBase):
        mock_fused_infer_attention_score.return_value = (torch.ones(10, 8,
                                                                    64), 1)

+        mock_get_forward_context.return_value = MagicMock(capturing=False)
+
        output = self.impl_swa.forward(self.layer_no_quant,
                                       query,
                                       key,