xc-llm-ascend/README.md

# XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC

## Overview

The project is optimized based on the popular LLM inference project vLLM. The current version supports Ascend NPU (910B3 & 910B4). 

One of the key features of this project is efficient memory coordination, enabling multiple vLLM instances share and dynamically hold Ascend NPU's physical memory. When an instance is idle, model parameters are offloaded to host memory. Upon a new inference request, the model parameters are quickly restored to the NPU’s memory (if not exist), without the need to initialize the engine and load the model from scratch. As a result, from the application’s perspective, multiple LLM inference engines can run on the NPU even when their total memory requirements exceed the physical memory limit. This technique is referred to as `InfiniVRAM`.


## Installation

### Build from Dockerfile

Clone this repository:

```bash
docker build -t vllm-ascend-multi-llm:latest -f ./Dockerfile .
```

## Usage

> [!NOTE]
> Some platforms may not allow multiple containers to share the same Ascend NPU. You may try to use privilegd container to bypass this restriction and mount all NPUs, and set the env ASCEND_RT_VISIBLE_DEVICES to specify the target device to use.

0. To share NPU, processes coordinate via shm, so you need to set all containers with `ipc=host`.
1. Start a daemon process in a standalone container, by running `vllm_vnpu_daemon` installed inside the image.
2. Start LLM services with this image, following the official usage instructions.
3. Due to the limited stream resource of Ascend NPU, you may need to restrict graph capture sizes or disable ACLgraph by setting `--enforce-eager`, especially when launching multiple LLMs. Refer to the [link](https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-troubleshoot-and-resolve-size-capture-failures-resulting-from-stream-resource-exhaustion-and-what-are-the-underlying-causes).

### Environment Variables
- `VNPU_RESERVED_VRAM_SIZE_GB`: The amonut of reserved GPU memory for other miscellaneous memory. Only needs to be set for `vllm_vnpu_daemon`. Try increasing the variable if you launch multiple LLM services and encounter OOM. Default: `8`.
- `VLLM_VNPU_SHM_NAME`: The name of the shm file. Needs to be set for all containers of the shared vNPU group. Default: `/vllm_acl_vnpu_offload_shm`.


## Limitations

- Restricted by the fact that HCCL cannot be shared, deploying more than one model with multi-GPU (e.g., TP) is not feasible currently.
- The prefix cache will be reset when the LLM is restored, since we just simply discard the KV cache when the LLM is offloaded.
-												更新 README.md

											
										
										
											2026-01-05 20:33:31 +08:00
+								# XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
-												[Core] Init vllm-ascend (#3)

### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also include changes to make CI work and use cache speed up
e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
   2. Make mypy work by ignore base_communicator and clear unused deps
   3. Several improvements for vllm_ascend_test:
     - use cache (pip, ms, hf) speed up e2e test (25mins --> 5mins)
- switch `git clone` command to `action/checkout` to speedup checkout
and
     - Enable sv for pytest for better info dump
- Remove network host to resole `docker: conflicting ontions: cannot
attach both user-defined and non-user-definednetwork-modes`, which is a
problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
https://github.com/vllm-project/vllm/commit/cabaf4eff3c7df30d785769d5a0a1fa1a1c48a8a

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
											
										
										
											2025-02-05 10:53:12 +08:00
+								## Overview
-												更新 README.md

											
										
										
											2026-01-05 20:33:31 +08:00
+								The project is optimized based on the popular LLM inference project vLLM. The current version supports Ascend NPU (910B3 & 910B4).
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
-												更新 README.md

											
										
										
											2026-01-05 20:33:31 +08:00
+								One of the key features of this project is efficient memory coordination, enabling multiple vLLM instances share and dynamically hold Ascend NPU's physical memory. When an instance is idle, model parameters are offloaded to host memory. Upon a new inference request, the model parameters are quickly restored to the NPU’s memory (if not exist), without the need to initialize the engine and load the model from scratch. As a result, from the application’s perspective, multiple LLM inference engines can run on the NPU even when their total memory requirements exceed the physical memory limit. This technique is referred to as `InfiniVRAM`.
-												[Core] Init vllm-ascend (#3)

### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also include changes to make CI work and use cache speed up
e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
   2. Make mypy work by ignore base_communicator and clear unused deps
   3. Several improvements for vllm_ascend_test:
     - use cache (pip, ms, hf) speed up e2e test (25mins --> 5mins)
- switch `git clone` command to `action/checkout` to speedup checkout
and
     - Enable sv for pytest for better info dump
- Remove network host to resole `docker: conflicting ontions: cannot
attach both user-defined and non-user-definednetwork-modes`, which is a
problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
https://github.com/vllm-project/vllm/commit/cabaf4eff3c7df30d785769d5a0a1fa1a1c48a8a

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
											
										
										
											2025-02-05 10:53:12 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								## Installation
-												[Core] Init vllm-ascend (#3)

### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also include changes to make CI work and use cache speed up
e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
   2. Make mypy work by ignore base_communicator and clear unused deps
   3. Several improvements for vllm_ascend_test:
     - use cache (pip, ms, hf) speed up e2e test (25mins --> 5mins)
- switch `git clone` command to `action/checkout` to speedup checkout
and
     - Enable sv for pytest for better info dump
- Remove network host to resole `docker: conflicting ontions: cannot
attach both user-defined and non-user-definednetwork-modes`, which is a
problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
https://github.com/vllm-project/vllm/commit/cabaf4eff3c7df30d785769d5a0a1fa1a1c48a8a

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
											
										
										
											2025-02-05 10:53:12 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								### Build from Dockerfile
-												Add recommend version and refresh readme / contribution.md (#1757)

### What this PR does / why we need it?
Add recommend version and contribution.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI passed






- vLLM version: v0.9.2
- vLLM main:
https://github.com/vllm-project/vllm/commit/890323dc1b1141c816637af960208678a30588b6

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-07-12 12:35:40 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								Clone this repository:
-												[Doc] Update Readme (#11)

<!--  Thanks for sending a pull request!

BEFORE SUBMITTING, PLEASE READ
https://docs.vllm.ai/en/latest/contributing/overview.html

-->
### What this PR does / why we need it?
Add feature and model support matrix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
CI test is enough

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-02-06 14:08:44 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								```bash
 								docker build -t vllm-ascend-multi-llm:latest -f ./Dockerfile .
 								```
-												[Docs] Add official doc index (#29)

Add official doc index. Move the release content to the right place.

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
											
										
										
											2025-02-11 12:00:27 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								## Usage
-												[Core] Init vllm-ascend (#3)

### What this PR does / why we need it?
vLLM Ascend plugin (vllm-ascend) is a backend plugin for running vLLM on
the Ascend NPU.

This plugin is the recommended approach for supporting the Ascend
backend within the vLLM community. It adheres to the principles outlined
in the [RFC]: Hardware pluggable, providing a hardware-pluggable
interface that decouples the integration of the Ascend NPU with vLLM.

This patch also include changes to make CI work and use cache speed up
e2e test, including:
1. Change push (post merge ci) and pull_request (pr ci) trigger branch
to main
   2. Make mypy work by ignore base_communicator and clear unused deps
   3. Several improvements for vllm_ascend_test:
     - use cache (pip, ms, hf) speed up e2e test (25mins --> 5mins)
- switch `git clone` command to `action/checkout` to speedup checkout
and
     - Enable sv for pytest for better info dump
- Remove network host to resole `docker: conflicting ontions: cannot
attach both user-defined and non-user-definednetwork-modes`, which is a
problem on docker 1.45 but not on 1.39.
4. Adapt MLA decode optimizations:
https://github.com/vllm-project/vllm/commit/cabaf4eff3c7df30d785769d5a0a1fa1a1c48a8a

### Does this PR introduce _any_ user-facing change?
Yes, init the PR.

### How was this patch tested?
- This is the first PR to make ascend NPU work on vLLM. All code is
tested on ascend with vLLM V0 Engine.
- CI passed

---------

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangshuai09 <391746016@qq.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: wangli <wangli858794774@gmail.com>
											
										
										
											2025-02-05 10:53:12 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								> [!NOTE]
 								> Some platforms may not allow multiple containers to share the same Ascend NPU. You may try to use privilegd container to bypass this restriction and mount all NPUs, and set the env ASCEND_RT_VISIBLE_DEVICES to specify the target device to use.
-												[Doc] Add versioning_policy doc (#62)

### What this PR does / why we need it?

This patch add the versioning policy doc for vllm-ascend

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-02-17 14:13:28 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+. To share NPU, processes coordinate via shm, so you need to set all containers with `ipc=host`.
 . Start a daemon process in a standalone container, by running `vllm_vnpu_daemon` installed inside the image.
 . Start LLM services with this image, following the official usage instructions.
 . Due to the limited stream resource of Ascend NPU, you may need to restrict graph capture sizes or disable ACLgraph by setting `--enforce-eager`, especially when launching multiple LLMs. Refer to the [link](https://docs.vllm.ai/projects/ascend/en/latest/faqs.html#how-to-troubleshoot-and-resolve-size-capture-failures-resulting-from-stream-resource-exhaustion-and-what-are-the-underlying-causes).
-												[Doc] Add versioning_policy doc (#62)

### What this PR does / why we need it?

This patch add the versioning policy doc for vllm-ascend

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-02-17 14:13:28 +08:00
-												add env vars & misc

											
										
										
											2026-02-11 06:27:58 +00:00
+								### Environment Variables
 								- `VNPU_RESERVED_VRAM_SIZE_GB`: The amonut of reserved GPU memory for other miscellaneous memory. Only needs to be set for `vllm_vnpu_daemon`. Try increasing the variable if you launch multiple LLM services and encounter OOM. Default: `8`.
 								- `VLLM_VNPU_SHM_NAME`: The name of the shm file. Needs to be set for all containers of the shared vNPU group. Default: `/vllm_acl_vnpu_offload_shm`.
-												[Doc] Add versioning_policy doc (#62)

### What this PR does / why we need it?

This patch add the versioning policy doc for vllm-ascend

Reference:
- https://spark.apache.org/versioning-policy.html
- https://docs.openstack.org/project-team-guide/stable-branches.html
- https://github.com/pytorch/pytorch/blob/main/RELEASE.md

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
preview: https://vllm-ascend--62.org.readthedocs.build/en/62/

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-02-17 14:13:28 +08:00
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								## Limitations
-												Mark v0.7.1 as unmaintained and v0.7.3 as maintained (#139)

### What this PR does / why we need it?
Mark v0.7.1 as unmaintained and v0.7.3 as maintained:
vLLM released the v0.7.3 version:
https://github.com/vllm-project/vllm/releases/tag/v0.7.3 which include
serval commits:
- https://github.com/vllm-project/vllm/pull/12874
- https://github.com/vllm-project/vllm/pull/12432
- https://github.com/vllm-project/vllm/pull/13208

We'd better to bump the versions to v0.7.3.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Preview

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
											
										
										
											2025-02-21 22:41:44 +08:00
-												support multi npu partially

											
										
										
											2026-01-08 06:54:33 +00:00
+								- Restricted by the fact that HCCL cannot be shared, deploying more than one model with multi-GPU (e.g., TP) is not feasible currently.
-												add Dockerfile and readme

											
										
										
											2026-01-05 09:10:56 +00:00
+								- The prefix cache will be reset when the LLM is restored, since we just simply discard the KV cache when the LLM is offloaded.