EngineX/xc-llm-ascend

XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC

Overview

This project builds on vLLM, the popular LLM inference engine, and adds optimizations on top of it. The current version supports Ascend NPUs (910B3 and 910B4).

A key feature of this project is efficient memory coordination, which lets multiple vLLM instances share an Ascend NPU's physical memory and hold it dynamically. When an instance is idle, its model parameters are offloaded to host memory. When a new inference request arrives, the model parameters are quickly restored to the NPU's memory (if they are not already resident), without initializing the engine or loading the model from scratch. As a result, from the application's perspective, multiple LLM inference engines can run on the same NPU even when their total memory requirements exceed the physical memory limit. This technique is referred to as InfiniVRAM.
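The offload and restore cycle described above can be pictured with a small simulation. This is only a conceptual sketch, not the project's implementation: the `Instance` and `DevicePool` names, the abstract memory units, and the "evict other instances until the request fits" policy are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """One simulated LLM engine; sizes are in arbitrary memory units."""
    name: str
    weight_size: int
    on_device: bool = False  # False means the weights live in host memory


@dataclass
class DevicePool:
    """Toy stand-in for one NPU's physical memory shared by several engines."""
    capacity: int
    instances: dict = field(default_factory=dict)

    def used(self) -> int:
        return sum(i.weight_size for i in self.instances.values() if i.on_device)

    def register(self, inst: Instance) -> None:
        self.instances[inst.name] = inst

    def offload(self, name: str) -> None:
        # Weights move to host memory; the engine itself stays initialized.
        self.instances[name].on_device = False

    def handle_request(self, name: str) -> str:
        inst = self.instances[name]
        if inst.on_device:
            return "hit"
        # Hypothetical policy: offload other instances until the weights fit.
        for other in self.instances.values():
            if self.used() + inst.weight_size <= self.capacity:
                break
            if other.on_device and other.name != name:
                self.offload(other.name)
        if self.used() + inst.weight_size > self.capacity:
            raise MemoryError(f"{name} does not fit on the device")
        inst.on_device = True  # restore weights without re-initializing
        return "restored"


pool = DevicePool(capacity=10)
pool.register(Instance("model_a", weight_size=6))
pool.register(Instance("model_b", weight_size=7))
print(pool.handle_request("model_a"))  # first request restores model_a
print(pool.handle_request("model_b"))  # model_a is offloaded to make room
```

In this toy model the two instances need 13 units in total while the device holds only 10, yet both can serve requests because only one set of weights is resident at a time.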

Installation

Build from Dockerfile

Clone this repository, then build the image from the repository root:

docker build -t vllm-ascend-multi-llm:latest -f ./Dockerfile .

Usage

Note

Some platforms may not allow multiple containers to share the same Ascend NPU. To work around this restriction, you can try running privileged containers with all NPUs mounted, and set the environment variable ASCEND_RT_VISIBLE_DEVICES to select the target device.

To share an NPU, processes coordinate via shared memory (shm), so every container must be started with ipc=host (for example, docker run --ipc=host).
Start a daemon process in a standalone container by running vllm_vnpu_daemon, which is installed inside the image.
Start the LLM services with this image, following the official usage instructions.
Because stream resources on the Ascend NPU are limited, you may need to restrict the graph capture sizes or disable ACLgraph by setting --enforce-eager, especially when launching multiple LLMs. Refer to the link.
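The steps above could be scripted roughly as follows. This is a sketch under several assumptions: the container flags beyond `--ipc=host`, the port mapping, the model name, and the use of `vllm serve` as the container command are illustrative choices, not settings confirmed by this README. Only the image tag, `vllm_vnpu_daemon`, `ASCEND_RT_VISIBLE_DEVICES`, and `--enforce-eager` come from the text.

```python
import shlex

IMAGE = "vllm-ascend-multi-llm:latest"  # tag used in the build step


def daemon_command(image: str = IMAGE) -> list[str]:
    """Command for the standalone daemon container."""
    return ["docker", "run", "-d", "--ipc=host", "--privileged",
            image, "vllm_vnpu_daemon"]


def service_command(model: str, device: str = "0", port: int = 8000,
                    enforce_eager: bool = True,
                    image: str = IMAGE) -> list[str]:
    """Command for one LLM service container sharing the NPU."""
    cmd = ["docker", "run", "-d", "--ipc=host", "--privileged",
           "-e", f"ASCEND_RT_VISIBLE_DEVICES={device}",
           "-p", f"{port}:{port}",
           image,
           # Assumption: the service is launched with `vllm serve`.
           "vllm", "serve", model, "--port", str(port)]
    if enforce_eager:
        # Disables graph capture, which saves stream resources.
        cmd.append("--enforce-eager")
    return cmd


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed.
    print(shlex.join(daemon_command()))
    # "org/model-a" and "org/model-b" are placeholder model names.
    print(shlex.join(service_command("org/model-a", port=8000)))
    print(shlex.join(service_command("org/model-b", port=8001)))
```

Printing the commands rather than executing them keeps the sketch safe to run on a machine without Docker or an NPU; adapt the flags to your platform before use.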

Limitations

Because HCCL cannot be shared, deploying more than one model across multiple NPUs (e.g., with tensor parallelism, TP) is currently not feasible.
The prefix cache is reset when an LLM is restored, because the KV cache is simply discarded when the LLM is offloaded.
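The second limitation can be illustrated with a toy model. The `ToyEngine` class below is hypothetical and only mimics the behavior described: the cache maps prompts to placeholder strings rather than real KV blocks.

```python
class ToyEngine:
    """Hypothetical engine whose prefix cache does not survive an offload."""

    def __init__(self) -> None:
        self.prefix_cache: dict[str, str] = {}  # prompt -> placeholder KV data
        self.resident = True

    def offload(self) -> None:
        # The KV cache is discarded rather than copied to host memory.
        self.prefix_cache.clear()
        self.resident = False

    def restore(self) -> None:
        # Weights come back, but the prefix cache starts empty.
        self.resident = True

    def generate(self, prompt: str) -> bool:
        """Serve a request and report whether the prefix cache was hit."""
        if not self.resident:
            self.restore()
        hit = prompt in self.prefix_cache
        self.prefix_cache[prompt] = f"kv({prompt})"
        return hit


engine = ToyEngine()
engine.generate("system prompt")         # cold: cache miss
print(engine.generate("system prompt"))  # warm: cache hit
engine.offload()
print(engine.generate("system prompt"))  # miss again after restore
```

The practical consequence is that the first requests after a restore pay the full prefill cost, even for prefixes that were cached before the offload.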
