[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477)

### What this PR does / why we need it? Supported to use full-graph with Qwen3-Next-MTP. In detail, we adatpted `AscendAttentionState.ChunkedPrefill` in main model, and also adapted `AscendAttentionState.ChunkedPrefill` in mtp model. ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? We changed the test of Qwen3-Next-MTP in `tests/e2e/multicard/test_qwen3_next.py` to make it a test of `FULL_DECODE_ONLY`. Then run `pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`. And this test passed. ```text . ================================================================================================================================= warnings summary ================================================================================================================================= <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute <frozen importlib._bootstrap>:241 <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute -- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html ==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) ===================================================================================================================== ``` - vLLM version: v0.13.0 - vLLM main: 5326c89803 Signed-off-by: drslark <slarksblood@qq.com>
2026-01-04 12:03:21 +08:00
parent fd4b4fd06f
commit 363ac1b80f
4 changed files with 42 additions and 32 deletions
--- a/tests/e2e/multicard/spec_decode/test_mtp_qwen3_next.py
+++ b/tests/e2e/multicard/spec_decode/test_mtp_qwen3_next.py
@@ -34,7 +34,6 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
 MODELS = ["Qwen/Qwen3-Next-80B-A3B-Instruct"]


-# TODO: add full decode only (when ready)
@pytest.mark.parametrize("model_name", MODELS)
 def test_qwen3_next_mtp_acceptance_tp4(model_name):
    golden = [0.85, 0.46, 0.19]
@@ -55,6 +54,7 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
                    distributed_executor_backend="mp",
                    disable_log_stats=False,
                    speculative_config={
+                        "cudagraph_mode": "FULL_DECODE_ONLY",
                        "method": "qwen3_next_mtp",
                        "num_speculative_tokens": 3,
                    },
@@ -88,6 +88,8 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
    cleanup_dist_env_and_memory()


+# FIXME: When applying `FULL_DECODE_ONLY` in this e2e, ci will fail.
+# The failure can not be reproduced locally.
@pytest.mark.parametrize("model_name", MODELS)
@pytest.mark.parametrize("num_speculative_tokens", [1])
@pytest.mark.parametrize("disable_padded_drafter_batch", [True, False])