vLLM-Ascend Multi-LLM Serving

Overview

This repository is a modified version of vLLM-Ascend (v0.11.0) designed to enable multiple large language models (LLMs) to share one Ascend NPU.

The key feature of this project is efficient memory coordination, enabling multiple vLLM instances to share and dynamically hold the Ascend NPU's physical memory. When an instance is idle, its model parameters are offloaded to host memory. Upon a new inference request, the parameters are quickly restored to NPU memory (if not already resident), without reinitializing the engine or reloading the model from scratch. (For Qwen3-8B, restoring from the offloaded state adds about 0.8 s of TTFT latency.)

Features

  • Shared NPU Usage: Multiple vLLM instances can share the same Ascend NPU, allowing several different LLMs to be served on a single device.
  • Fast Memory Restore: We decouple virtual and physical memory allocation. Physical NPU memory is allocated, exported, and shared with other LLM engines, so an engine can restore quickly without reinitialization or fresh memory allocation.

Installation

Build from Dockerfile

Clone this repository, then build the image from the included Dockerfile:

docker build -t vllm-ascend-multi-llm:latest -f ./Dockerfile .

Usage

Note

Some platforms may not allow multiple containers to share the same Ascend NPU. You may try using a privileged container to bypass this restriction: mount all NPUs and set the environment variable ASCEND_RT_VISIBLE_DEVICES to select the target device.
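A hedged sketch of such a privileged launch follows; the image tag matches the build command above, while the mount paths are assumptions based on typical Ascend driver installations and may differ on your platform:

```shell
# Privileged container with all NPUs visible; driver/tool mount paths
# are typical for Ascend hosts and may need adjusting (assumption).
docker run -it --privileged --ipc=host \
  -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
  -v /usr/local/dcmi:/usr/local/dcmi \
  -v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
  -e ASCEND_RT_VISIBLE_DEVICES=0 \
  vllm-ascend-multi-llm:latest
```

ASCEND_RT_VISIBLE_DEVICES=0 restricts the container to the first NPU even though all devices are mounted.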

  1. To share an NPU, processes coordinate via shared memory (shm), so all containers must be started with --ipc=host.
  2. Start a daemon process in a standalone container by running vllm_vnpu_daemon, which is installed inside the image.
  3. Start LLM services with this image, following the official usage instructions.
  4. Due to the limited stream resources of the Ascend NPU, you may need to restrict graph capture sizes or disable ACL graph capture by setting --enforce-eager, especially when launching multiple LLMs. Refer to the link.
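The steps above can be sketched as follows; only vllm_vnpu_daemon, --ipc=host, and --enforce-eager come from this README, while the container names, ports, and model names are placeholders (device mounts from the note above are omitted for brevity but still required):

```shell
# 1) Daemon in a standalone container; --ipc=host lets all containers
#    coordinate over shared memory.
docker run -d --name vnpu-daemon --ipc=host \
  vllm-ascend-multi-llm:latest vllm_vnpu_daemon

# 2) Two LLM services sharing the same NPU (model names are examples);
#    --enforce-eager avoids exhausting NPU stream resources.
docker run -d --name llm-a --ipc=host -p 8000:8000 \
  vllm-ascend-multi-llm:latest \
  vllm serve Qwen/Qwen3-8B --enforce-eager

docker run -d --name llm-b --ipc=host -p 8001:8000 \
  vllm-ascend-multi-llm:latest \
  vllm serve Qwen/Qwen3-4B --enforce-eager
```

With both services up, whichever instance is idle offloads its parameters to host memory and restores them on the next request.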

Limitations

  • This project currently supports sharing only a single NPU. This is also limited by the fact that HCCL resources cannot be shared; we have not yet figured out how to bypass HCCL. Help wanted.
  • The prefix cache is reset when an LLM is restored, since the KV cache is simply discarded when the LLM is offloaded.

Roadmap

  • Space-sharing.
  • ...

License

Apache License 2.0, as found in the LICENSE file.
