Add sleep mode feature for Ascend NPU (#513)

### What this PR does / why we need it?
This PR adds a sleep mode feature to vllm-ascend. When sleeping, we mainly
do two things:

- offload model weights
- discard kv cache
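
As a rough mental model of the two steps above, here is a toy sketch in plain Python. It is not the vllm-ascend implementation: `ToySleepableWorker` and all of its attributes are hypothetical names, and ordinary Python lists stand in for device buffers.

```python
class ToySleepableWorker:
    """Toy stand-in for a worker; Python lists play the role of NPU buffers."""

    def __init__(self, weights, kv_cache_blocks):
        self.device_weights = list(weights)   # pretend these live on the device
        self.host_weights = None              # host-side backup, unused while awake
        self._kv_cache_blocks = kv_cache_blocks
        self.kv_cache = [None] * kv_cache_blocks

    def sleep(self):
        # Step 1: offload model weights (keep a host copy, drop the device copy).
        self.host_weights = self.device_weights
        self.device_weights = None
        # Step 2: discard the KV cache; its contents are not preserved.
        self.kv_cache = None

    def wake_up(self):
        # Bring the weights back and allocate a fresh, empty KV cache.
        self.device_weights = self.host_weights
        self.host_weights = None
        self.kv_cache = [None] * self._kv_cache_blocks
```

The asymmetry is the point of the sketch: weights are needed again after waking up, so they are kept somewhere cheaper, while the KV cache can simply be rebuilt.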

RLHF tools (such as https://github.com/volcengine/verl and
https://github.com/OpenRLHF/OpenRLHF) have a strong need for sleep mode
to accelerate the training process.

This PR may solve #375 and #320.

### Does this PR introduce _any_ user-facing change?
No existing user interfaces are changed.
Users get two new methods (`sleep()` and `wake_up()`).

### How was this patch tested?
This PR was tested with Qwen/Qwen2.5-0.5B-Instruct.

Before the model is loaded, the free NPU memory is M1.

After `llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)`
is executed, the free NPU memory is M2, with M2 < M1.

After calling `llm.sleep(level=1)`, the free NPU memory is M3, with
M3 > M2 and M3 very close to M1.

In addition, the output tokens before sleep and after wake-up are identical,
given the same input tokens and
`SamplingParams(temperature=0, max_tokens=10)`.
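
The check described above could be scripted roughly as follows. This is a sketch, not the test used for this PR: `check_sleep_cycle` and the `free_bytes` callable are hypothetical helpers. The function only assumes that `llm` exposes `generate()`, `sleep(level=...)` and `wake_up()`, and that generated tokens are available as `outputs[0].token_ids` on each result.

```python
from typing import Any, Callable, Dict


def check_sleep_cycle(
    llm: Any,
    free_bytes: Callable[[], int],
    prompt: str,
    sampling_params: Any,
) -> Dict[str, Any]:
    """Run generate -> sleep -> wake_up -> generate and report what changed."""
    m2 = free_bytes()  # free memory with the model loaded (M2)
    before = list(llm.generate(prompt, sampling_params)[0].outputs[0].token_ids)

    llm.sleep(level=1)
    m3 = free_bytes()  # free memory while sleeping (M3)

    llm.wake_up()
    after = list(llm.generate(prompt, sampling_params)[0].outputs[0].token_ids)

    return {
        "m2": m2,
        "m3": m3,
        "memory_released": m3 > m2,      # expect M3 > M2
        "same_output": before == after,  # expect identical tokens at temperature=0
    }


# Hypothetical usage on an NPU machine (assumes torch_npu provides
# torch.npu.mem_get_info(), mirroring the CUDA call of the same name):
#
#   from vllm import LLM, SamplingParams
#   import torch, torch_npu
#   llm = LLM("Qwen/Qwen2.5-0.5B-Instruct", enable_sleep_mode=True)
#   params = SamplingParams(temperature=0, max_tokens=10)
#   print(check_sleep_cycle(llm, lambda: torch.npu.mem_get_info()[0], "Hello", params))
```

Passing the memory probe in as a callable keeps the helper independent of the device API, so the same logic can be dry-run against a stub object without an NPU.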


This PR uses the CMake procedure from #371, thanks a lot.

Signed-off-by: Shuqiao Li <celestialli@outlook.com>

Shuqiao Li, 2025-04-18 13:11:39 +08:00, committed by GitHub

parent 42c7fbb10e

commit 84563fc65d

13 changed files with 1020 additions and 9 deletions

									
vllm_ascend/__init__.py (3 lines changed)

@@ -18,9 +18,6 @@
def register():
    """Register the NPU platform."""
    # Adapt the global patch here.
    from vllm_ascend.utils import adapt_patch
    adapt_patch(is_global_patch=True)
    return "vllm_ascend.platform.NPUPlatform"
