[Feat][main] Supported to use full-graph with Qwen3-Next-MTP (#5477)
### What this PR does / why we need it?
Supported to use full-graph with Qwen3-Next-MTP.
In detail, we adatpted `AscendAttentionState.ChunkedPrefill` in main
model, and also adapted `AscendAttentionState.ChunkedPrefill` in mtp
model.
### Does this PR introduce _any_ user-facing change?
N/A
### How was this patch tested?
We changed the test of Qwen3-Next-MTP in
`tests/e2e/multicard/test_qwen3_next.py` to make it a test of
`FULL_DECODE_ONLY`. Then run `pytest -s
tests/e2e/multicard/test_qwen3_next.py::test_qwen3_next_distributed_mp_eager_mtp_similarity_tp4`.
And this test passed.
```text
.
================================================================================================================================= warnings summary =================================================================================================================================
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute
<frozen importlib._bootstrap>:241
<frozen importlib._bootstrap>:241: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
==================================================================================================================== 1 passed, 2 warnings in 271.89s (0:04:31) =====================================================================================================================
```
- vLLM version: v0.13.0
- vLLM main:
5326c89803
Signed-off-by: drslark <slarksblood@qq.com>
This commit is contained in:
@@ -34,7 +34,6 @@ os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
|
||||
MODELS = ["Qwen/Qwen3-Next-80B-A3B-Instruct"]
|
||||
|
||||
|
||||
# TODO: add full decode only (when ready)
|
||||
@pytest.mark.parametrize("model_name", MODELS)
|
||||
def test_qwen3_next_mtp_acceptance_tp4(model_name):
|
||||
golden = [0.85, 0.46, 0.19]
|
||||
@@ -55,6 +54,7 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
|
||||
distributed_executor_backend="mp",
|
||||
disable_log_stats=False,
|
||||
speculative_config={
|
||||
"cudagraph_mode": "FULL_DECODE_ONLY",
|
||||
"method": "qwen3_next_mtp",
|
||||
"num_speculative_tokens": 3,
|
||||
},
|
||||
@@ -88,6 +88,8 @@ def test_qwen3_next_mtp_acceptance_tp4(model_name):
|
||||
cleanup_dist_env_and_memory()
|
||||
|
||||
|
||||
# FIXME: When applying `FULL_DECODE_ONLY` in this e2e, ci will fail.
|
||||
# The failure can not be reproduced locally.
|
||||
@pytest.mark.parametrize("model_name", MODELS)
|
||||
@pytest.mark.parametrize("num_speculative_tokens", [1])
|
||||
@pytest.mark.parametrize("disable_padded_drafter_batch", [True, False])
|
||||
|
||||
Reference in New Issue
Block a user