[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477)

### What this PR does / why we need it?

This PR adds support for running Qwen3-Next-MTP in full-graph mode.

In detail, we adapted `AscendAttentionState.ChunkedPrefill` in both the
main model and the MTP model.

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

We changed the Qwen3-Next-MTP test in
`tests/e2e/multicard/test_qwen3_next.py` so that it runs with
`FULL_DECODE_ONLY`, then ran
`pytest -s tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`.

The test passed:

```text
.

================================================================================================================================= warnings summary =================================================================================================================================
<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:241
  <frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) =====================================================================================================================
```
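
For reference, the settings exercised by the updated test can be sketched as a plain dictionary. This is only an illustration under stated assumptions: `build_mtp_runner_kwargs` is a hypothetical helper that does not exist in the repository, and the keys and values simply mirror the lines visible in the diff of `tests/e2e/multicard/test_qwen3_next.py`, including where `cudagraph_mode` appears there.

```python
from typing import Any


def build_mtp_runner_kwargs(
    num_speculative_tokens: int = 3,
    cudagraph_mode: str = "FULL_DECODE_ONLY",
) -> dict[str, Any]:
    """Hypothetical helper: collect the keyword arguments shown in the diff.

    It only builds a dictionary; it does not import or call vLLM.
    """
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    return {
        "distributed_executor_backend": "mp",
        "disable_log_stats": False,
        "speculative_config": {
            # Placement mirrors the diff as displayed; not verified here.
            "cudagraph_mode": cudagraph_mode,
            "method": "qwen3_next_mtp",
            "num_speculative_tokens": num_speculative_tokens,
        },
    }


kwargs = build_mtp_runner_kwargs()
print(kwargs["speculative_config"])
```

In the actual test these values are passed to the test runner together with the model name; the sketch leaves that call out because its exact signature is not shown in this commit.
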
- vLLM version: v0.13.0
- vLLM main: 5326c89803

Signed-off-by: drslark <slarksblood@qq.com>
drslark
2026-01-04 12:03:21 +08:00
committed by GitHub
parent fd4b4fd06f
commit 363ac1b80f
4 changed files with 42 additions and 32 deletions

@@ -34,7 +34,6 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
MODELS = ["Qwen/Qwen3-Next-80B-A3B-Instruct"]
# TODO: add full decode only (when ready)
@pytest.mark.parametrize("model_name", MODELS)
def test_qwen3_next_mtp_acceptance_tp4(model_name):
golden = [0.85, 0.46, 0.19]
@@ -55,6 +54,7 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
distributed_executor_backend="mp",
disable_log_stats=False,
speculative_config={
"cudagraph_mode": "FULL_DECODE_ONLY",
"method": "qwen3_next_mtp",
"num_speculative_tokens": 3,
},
@@ -88,6 +88,8 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
cleanup_dist_env_and_memory()
# FIXME: When applying `FULL_DECODE_ONLY` in this e2e, ci will fail.
# The failure can not be reproduced locally.
@pytest.mark.parametrize("model_name", MODELS)
@pytest.mark.parametrize("num_speculative_tokens", [1])
@pytest.mark.parametrize("disable_padded_drafter_batch", [True, False])