[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)

### What this PR does / why we need it?

In this version, there is no divisibility constraint between tp and mtp+1.
However, each entry in cudagraph_capture_sizes must be a common multiple of
tp and mtp+1, with a maximum of tp * (mtp+1). We therefore updated
cudagraph_capture_sizes accordingly.
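
As a rough illustration of the constraint described above, the sketch below enumerates capture sizes that are common multiples of tp and mtp+1 up to a chosen cap. The helper name and the cap parameter are illustrative, not part of vllm-ascend:

```python
from math import lcm

def capture_sizes(tp: int, num_speculative_tokens: int, cap: int) -> list[int]:
    # Every size must be a common multiple of tp and (mtp + 1),
    # i.e. a multiple of lcm(tp, mtp + 1). `cap` is the largest
    # size to capture (hypothetical parameter, chosen by the test).
    step = lcm(tp, num_speculative_tokens + 1)
    return list(range(step, cap + 1, step))

# tp=2, mtp=1: common multiples of 2 and 2 up to 12
# -> [2, 4, 6, 8, 10, 12], the list used in the updated test below.
print(capture_sizes(2, 1, 12))
```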

We added a long-sequence test (64k-token input, 3k-token output) for the
two-node mixed deployment scenario. Because a full performance benchmark
would take too long, this test only verifies functionality; see the sketch
below. The single-node scenario is skipped because VRAM limitations prevent
launching the model with a max-model-len of 68,000.
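
For illustration, a functionality-only long-sequence check could look like this minimal sketch. It assumes the VllmRunner helper shown in the diff below and a generate_greedy-style method on it; the exact token counts and arguments are illustrative assumptions, not the PR's actual test code:

```python
def test_long_sequence_functional():
    # ~64k-token input built from a repeated filler word; we only
    # assert that generation completes, since full performance
    # benchmarking is too slow for CI.
    prompt = "Hello " * 64000
    with VllmRunner("vllm-ascend/DeepSeek-V3.2-W8A8-Pruning",
                    tensor_parallel_size=2,
                    max_model_len=68000) as runner:
        # 3k output tokens; assumes a generate_greedy-style helper
        # exists on VllmRunner (an assumption for this sketch).
        outputs = runner.generate_greedy([prompt], max_tokens=3000)
        assert outputs and outputs[0] is not None
```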

We also added an AIME2025 accuracy test to the dual-node DeepSeek-V3.2 nightly run.

### How was this patch tested?

Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
starmountain1997 committed on 2026-02-26 10:58:50 +08:00 (committed by GitHub)
parent 169e434f78 · commit bc1622338c
3 changed files with 64 additions and 16 deletions


@@ -255,18 +255,18 @@ def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
     long_example_prompts = [
         "Hello " * (163839 - 500) + "Hello"
     ]
     max_tokens = 500
     with VllmRunner("vllm-ascend/DeepSeek-V3.2-W8A8-Pruning",
                     tensor_parallel_size=2,
                     quantization="ascend",
                     enable_expert_parallel=True,
                     max_model_len=163840,
                     compilation_config={
-                        "cudagraph_capture_sizes": [3, 6, 9, 12],
+                        "cudagraph_capture_sizes": [2, 4, 6, 8, 10, 12],
                         "cudagraph_mode": "FULL_DECODE_ONLY"
                     },
                     speculative_config={
-                        "num_speculative_tokens": 2,
+                        "num_speculative_tokens": 1,
                         "method": "deepseek_mtp"
                     },
                     additional_config={