performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731)

### What this PR does / why we need it?

1. Improve inference speed and usability for DeepSeek models with NPU graph mode.
2. Modify code to adapt to CANN 8.1.RC1.beta1.
3. Add a switch for NPU graph mode and its cache.

### Does this PR introduce _any_ user-facing change?

This PR provides an experimental configuration to enable NPU graph mode for DeepSeek models. Users can set `additional_config={'enable_graph_mode': True}` to try this feature. Note that this feature currently supports only the V0 engine.

### How was this patch tested?

This patch was tested with the newest torch_npu 2.5.1 (https://pypi.org/project/torch-npu/#files) and the CANN 8.1.RC1.beta1 toolkit, nnal, and kernels (https://www.hiascend.com/developer/download/community/result?module=cann), released on 25/30 April.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
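
To illustrate how the `enable_graph_mode` switch interacts with the V0-only restriction, here is a minimal standalone sketch. It is not the code from this PR: `resolve_graph_mode` and its `use_v1` parameter are hypothetical names invented for this example, and the function only mimics the idea of turning the flag off when the V1 engine is in use.

```python
import logging

logger = logging.getLogger("graph_mode_sketch")


def resolve_graph_mode(additional_config, use_v1):
    """Return whether NPU graph mode stays enabled.

    Hypothetical helper: `additional_config` stands in for the user-supplied
    dict, and `use_v1` stands in for the V1 engine switch.
    """
    config = additional_config if additional_config is not None else {}
    enabled = bool(config.get("enable_graph_mode", False))

    if enabled and use_v1:
        # Graph mode is experimental and V0-only, so fall back and warn.
        logger.warning(
            "NPU graph mode is experimental and not supported for V1; "
            "disabling it.")
        config["enable_graph_mode"] = False
        enabled = False

    return enabled


# The user opts in explicitly; the flag is off by default.
user_config = {"enable_graph_mode": True}

v0_result = resolve_graph_mode(dict(user_config), use_v1=False)
v1_result = resolve_graph_mode(dict(user_config), use_v1=True)
default_result = resolve_graph_mode({}, use_v1=False)

print(v0_result, v1_result, default_result)
```

In this sketch the flag survives only when it is set explicitly and the V0 engine is used; in every other case the call reports graph mode as disabled.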
2025-05-01 13:51:42 +08:00
parent 399b03830d
commit 84e2ed898b
6 changed files with 163 additions and 51 deletions
--- a/vllm_ascend/platform.py
+++ b/vllm_ascend/platform.py
@@ -124,7 +124,10 @@ class NPUPlatform(Platform):
        enforce_eager = True
        logger.warning(
            "NPU compilation support pending. Will be available in future CANN and "
-            "torch_npu releases. Using default: enforce_eager=True")
+            "torch_npu releases. NPU graph mode is currently experimental and disabled "
+            "by default. You can just adopt additional_config={'enable_graph_mode': True} "
+            "to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. "
+        )

        if enforce_eager or compilation_config.level == CompilationLevel.NO_COMPILATION:
            logger.info("Compilation disabled, using eager mode by default")
@@ -150,6 +153,11 @@ class NPUPlatform(Platform):
                    "enable_graph_mode is not supported because the version of torch is too low, forcing close enable_graph_mode"
                )
                vllm_config.additional_config["enable_graph_mode"] = False
+            if enable_graph_mode and envs.VLLM_USE_V1:
+                logger.warning(
+                    "NPU graph mode is still experimental and not supported for V1 currently, "
+                    "it has been disabled automatically.")
+                vllm_config.additional_config["enable_graph_mode"] = False

        parallel_config = vllm_config.parallel_config
        if parallel_config and parallel_config.worker_cls == "auto":