[CI] Add long and short prompt tests for DeepSeek-V3.2 (#6536)
### What this PR does / why we need it?
This version has no divisibility constraint between `tp` and `mtp + 1`. However, `cudagraph_capture_sizes` must contain common multiples of `tp` and `mtp + 1`, with a maximum of `tp * (mtp + 1)`. We therefore fixed `cudagraph_capture_sizes` accordingly.

We added a long-sequence test (64k input, 3k output) for the two-node mixed-deployment scenario. Because a full performance benchmark would take too long, we only verify functionality. The single-node scenario is skipped because VRAM limitations prevent launching the model with a max-model-len of 68,000. We also added an AIME2025 test to the dual-node DeepSeek-V3.2 nightly test.

### How was this patch tested?
Tested in the nightly environment.

- vLLM version: v0.15.0
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0

Signed-off-by: guozr <guozr1997@hotmail.com>
Co-authored-by: guozr <guozr1997@hotmail.com>
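The capture-size rule described above (sizes must be common multiples of `tp` and `mtp + 1`) can be sketched as follows. This is a minimal illustration only; `capture_sizes` is a hypothetical helper, not a vLLM API, and the upper bound is taken as a plain parameter rather than modeling the `tp * (mtp + 1)` limit.

```python
from math import lcm


def capture_sizes(tp: int, mtp: int, max_size: int) -> list[int]:
    """Hypothetical helper: valid CUDA-graph capture sizes.

    Returns all multiples of lcm(tp, mtp + 1) up to max_size, so every
    size is a common multiple of the tensor-parallel degree and the
    number of MTP draft tokens plus one.
    """
    step = lcm(tp, mtp + 1)
    return list(range(step, max_size + 1, step))


# With tp=2 and num_speculative_tokens=1 (as in the updated test),
# this yields the sizes used in the new config.
print(capture_sizes(2, 1, 12))  # [2, 4, 6, 8, 10, 12]
```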
```diff
@@ -255,18 +255,18 @@ def test_deepseek3_2_w8a8_pruning_mtp_tp2_ep():
     long_example_prompts = [
         "Hello " * (163839 - 500) + "Hello"
     ]
     max_tokens = 500
     with VllmRunner("vllm-ascend/DeepSeek-V3.2-W8A8-Pruning",
                     tensor_parallel_size=2,
                     quantization="ascend",
                     enable_expert_parallel=True,
                     max_model_len=163840,
                     compilation_config={
-                        "cudagraph_capture_sizes": [3, 6, 9, 12],
+                        "cudagraph_capture_sizes": [2, 4, 6, 8, 10, 12],
                         "cudagraph_mode": "FULL_DECODE_ONLY"
                     },
                     speculative_config={
-                        "num_speculative_tokens": 2,
+                        "num_speculative_tokens": 1,
                         "method": "deepseek_mtp"
                     },
                     additional_config={
```