[main][misc]change default capture size for Qwen3-MoE when using full dp (#4199)

### What this PR does / why we need it?

Currently, the default `cudagraph_capture_size` in vLLM is `[1, 2, 4, 8, 16, 24, ..., max_capture_size]`. However, this is not the best choice in every situation. This PR changes the default when running Qwen3-MoE in a full DP setting (`dp_size > 1` && `tp_size == 1`), which is typically used in Large-Scale EP.

- old: `[1, 2, 4, 8, 16, 24, ..., max_capture_size]`
- new: `[1, 2, 5, 10, 15, 16, 24, ..., max_capture_size]`

This is mainly because the performance of the `_npu_paged_attention` op degrades dramatically with the old settings. The goal is to provide better performance when users do not set a specific `cudagraph_capture_size`.

### Does this PR introduce _any_ user-facing change?

The default `cudagraph_capture_size` is modified in the cases above. However, if `cudagraph_capture_size` has already been set by the user, this PR has no effect.

### How was this patch tested?

- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: Angazenn <supperccell@163.com>
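
The change in defaults described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code added by this PR: the function names `default_capture_sizes`, `full_dp_capture_sizes`, and `pick_capture_sizes` are hypothetical, the step of 8 above 16 is inferred from the `16, 24, ...` pattern in the description, and `max_capture_size` is assumed to be a multiple of 8. The Qwen3-MoE model check is omitted.

```python
from typing import List, Optional, Sequence


def default_capture_sizes(max_capture_size: int) -> List[int]:
    """Old default per the description: 1, 2, 4, then multiples of 8 (assumed step)."""
    sizes = [1, 2, 4] + list(range(8, max_capture_size + 1, 8))
    return [s for s in sizes if s <= max_capture_size]


def full_dp_capture_sizes(max_capture_size: int) -> List[int]:
    """New default per the description: 4 and 8 give way to 5, 10, 15."""
    sizes = [1, 2, 5, 10, 15] + list(range(16, max_capture_size + 1, 8))
    return [s for s in sizes if s <= max_capture_size]


def pick_capture_sizes(
    dp_size: int,
    tp_size: int,
    max_capture_size: int,
    user_sizes: Optional[Sequence[int]] = None,
) -> List[int]:
    """Hypothetical selection logic mirroring the rules stated in the description."""
    # Sizes set explicitly by the user are never overridden.
    if user_sizes is not None:
        return list(user_sizes)
    # "Full DP": data parallelism without tensor parallelism.
    if dp_size > 1 and tp_size == 1:
        return full_dp_capture_sizes(max_capture_size)
    return default_capture_sizes(max_capture_size)


if __name__ == "__main__":
    print(pick_capture_sizes(dp_size=2, tp_size=1, max_capture_size=32))
    print(pick_capture_sizes(dp_size=1, tp_size=2, max_capture_size=32))
```

Checking the user-supplied sizes first reflects the stated guarantee that an explicit setting is left untouched; the real helper, `update_default_aclgraph_sizes`, may apply additional conditions not shown here.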
2025-11-18 08:41:45 +08:00
parent da1cd9c7ca
commit 10a046ddce
3 changed files with 81 additions and 3 deletions
--- a/vllm_ascend/platform.py
+++ b/vllm_ascend/platform.py
@@ -33,7 +33,8 @@ from vllm_ascend.torchair.utils import (check_torchair_cache_exist,
 from vllm_ascend.utils import (ASCEND_QUANTIZATION_METHOD, enable_sp, is_310p,
                               prefill_context_parallel_enable,
                               update_aclgraph_sizes,
-                               update_cudagraph_capture_sizes, vllm_version_is)
+                               update_cudagraph_capture_sizes,
+                               update_default_aclgraph_sizes, vllm_version_is)

 if TYPE_CHECKING:
    from vllm.config import ModelConfig, VllmConfig
@@ -193,6 +194,10 @@ class NPUPlatform(Platform):

        # set cudaprah sizes before extending `compilation_config.splitting_ops`
        vllm_config._set_cudagraph_sizes()
+        # There are cases where default cudagraph_capture_sizes are not friendly
+        # to ascend ops && hardwares. We update these sizes here to improve
+        # default performance.
+        update_default_aclgraph_sizes(vllm_config)
        # TODO delete graph size update here when compilation_config.pass_config.enable_sequence_parallelism
        # is supported by vllm-ascend.
        if vllm_config.parallel_config.tensor_parallel_size > 1 and not vllm_config.model_config.enforce_eager and \