performance optimization, usability optimization and API compatibility adjustments for deepseek with npu graph mode (#731)

-->
### What this PR does / why we need it?
<!--
- Please clarify what changes you are proposing. The purpose of this
section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster
reviews in your PR.

- Please clarify why the changes are needed. For instance, the use case
and bug description.

- Fixes #
-->
1. Improve inference speed and usability for deepsek models with NPU
graph mode.
2. Modify some codes to adapt to CANN 8.1.RC1.beta1.
3. Add a switch for NPU graph mode and its cache.

### Does this PR introduce _any_ user-facing change?
<!--
Note that it means *any* user-facing change including all aspects such
as API, interface or other behavior changes.
Documentation-only updates are not considered user-facing changes.
-->
This PR provides an experimental configuration to enable NPU graph mode
for Deepseek models. User can set
additional_config={'enable_graph_mode': True} to try this feature. Note
that this feature currently only supports for V0 engine.


### How was this patch tested?
<!--
CI passed with new added/existing test.
If it was tested in a way different from regular unit tests, please
clarify how you tested step by step, ideally copy and paste-able, so
that other reviewers can test and check, and descendants can verify in
the future.
If tests were not added, please describe why they were not added and/or
why it was difficult to add.
-->
This patch was tested with the newest torch_npu 2.5.1
(https://pypi.org/project/torch-npu/#files) and CANN 8.1.RC1.beta1
toolkit&nnal&kernels
(https://www.hiascend.com/developer/download/community/result?module=cann)
released in 25/30 April.

Signed-off-by: linfeng-yuan <1102311262@qq.com>
This commit is contained in:
linfeng-yuan
2025-05-01 13:51:42 +08:00
committed by GitHub
parent 399b03830d
commit 84e2ed898b
6 changed files with 163 additions and 51 deletions

View File

@@ -124,7 +124,10 @@ class NPUPlatform(Platform):
enforce_eager = True
logger.warning(
"NPU compilation support pending. Will be available in future CANN and "
"torch_npu releases. Using default: enforce_eager=True")
"torch_npu releases. NPU graph mode is currently experimental and disabled "
"by default. You can just adopt additional_config={'enable_graph_mode': True} "
"to serve deepseek models with NPU graph mode on vllm-ascend with V0 engine. "
)
if enforce_eager or compilation_config.level == CompilationLevel.NO_COMPILATION:
logger.info("Compilation disabled, using eager mode by default")
@@ -150,6 +153,11 @@ class NPUPlatform(Platform):
"enable_graph_mode is not supported because the version of torch is too low, forcing close enable_graph_mode"
)
vllm_config.additional_config["enable_graph_mode"] = False
if enable_graph_mode and envs.VLLM_USE_V1:
logger.warning(
"NPU graph mode is still experimental and not supported for V1 currently, "
"it has been disabled automatically.")
vllm_config.additional_config["enable_graph_mode"] = False
parallel_config = vllm_config.parallel_config
if parallel_config and parallel_config.worker_cls == "auto":