
XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC

Overview

This project builds on the popular LLM inference engine vLLM with optimizations for ModelHub XC. The repository supports the Kunlun3 P800.

One of the key features of this project is efficient memory coordination, which lets multiple vLLM instances share and dynamically hold the GPU's (XPU's) physical memory. When an instance is idle, its model parameters are offloaded to host memory. Upon a new inference request, the parameters are quickly restored to GPU memory (if not already resident), without re-initializing the engine or reloading the model from scratch. As a result, from the application's perspective, multiple LLM inference engines can run on the GPU even when their total memory requirements exceed the physical memory limit. This technique is referred to as InfiniVRAM.
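The offload/restore lifecycle described above can be sketched as a small admission pool. This is a conceptual illustration only, not the engine's real API: the class, method names, and sizes are all hypothetical, and the eviction order is arbitrary where the real system presumably tracks idleness.

```python
# Hypothetical sketch of the InfiniVRAM idea: admit more model instances
# than device memory can hold by parking idle instances' parameters in
# host memory and restoring them on demand. All names are illustrative.

class VRAMPool:
    def __init__(self, capacity_gb):
        self.capacity_gb = capacity_gb
        self.resident = {}   # model name -> GiB currently in device memory
        self.offloaded = {}  # model name -> GiB parked in host memory

    def _free_gb(self):
        return self.capacity_gb - sum(self.resident.values())

    def register(self, name, size_gb):
        """A new engine instance; its weights start offloaded on the host."""
        self.offloaded[name] = size_gb

    def activate(self, name):
        """Restore a model's weights to device memory before serving,
        evicting idle residents to host memory until the model fits."""
        if name in self.resident:
            return  # already resident, nothing to copy
        need = self.offloaded[name]
        while self._free_gb() < need and self.resident:
            idle, size = self.resident.popitem()  # evict an idle model
            self.offloaded[idle] = size
        if self._free_gb() < need:
            raise MemoryError(f"{name} ({need} GiB) exceeds pool capacity")
        self.resident[name] = self.offloaded.pop(name)


pool = VRAMPool(capacity_gb=64)
pool.register("model-a", size_gb=40)  # combined sizes exceed 64 GiB
pool.register("model-b", size_gb=40)  # on purpose
pool.activate("model-a")              # restored to device memory
pool.activate("model-b")              # model-a is evicted to host memory
```

The point of the sketch is the accounting invariant: resident sizes never exceed capacity, yet both models remain registered and servable.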

Installation

Build from Dockerfile

Clone this repository, then build the image:

docker build -t $build_image -f ./Dockerfile .
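The build command assumes $build_image is already set. A complete workflow might look like the following, where the repository URL and the image tag are placeholders:

```shell
# Clone the repository (URL is a placeholder).
git clone <repository-url> xc-llm
cd xc-llm

# Pick any local tag for the resulting image.
build_image=xc-llm:latest
docker build -t "$build_image" -f ./Dockerfile .
```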

Usage

  1. To share the GPU, processes coordinate via shared memory (shm), so all containers must be started with ipc=host.
  2. Start a daemon process in a standalone container by running vllm_vxpu_daemon, which is installed inside the image.
  3. Start LLM services with this image, following the official vLLM usage instructions.
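The three steps above might look like the following. The image tag, container names, model path, and the service launch command are assumptions (the standard vLLM OpenAI-compatible entrypoint is shown); only vllm_vxpu_daemon and the --ipc=host requirement come from this README.

```shell
# 1. Every container sharing the GPU must use the host IPC namespace.
# 2. Run the memory-coordination daemon in its own container.
docker run -d --name vxpu-daemon --ipc=host \
    -e VXPU_RESERVED_VRAM_SIZE_GB=8 \
    xc-llm:latest vllm_vxpu_daemon

# 3. Launch LLM services from the same image; the usual vLLM
#    OpenAI-compatible server entrypoint is assumed here.
docker run -d --name llm-service-1 --ipc=host \
    -e VLLM_VXPU_SHM_NAME=/vllm_kunlun_vxpu_offload_shm \
    xc-llm:latest \
    python -m vllm.entrypoints.openai.api_server --model <model-path>
```

Note that VLLM_VXPU_SHM_NAME must match across every container in the shared vxpu group, including the daemon's.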

Environment Variables

  • VXPU_RESERVED_VRAM_SIZE_GB: The amount of GPU memory (in GiB) reserved for miscellaneous allocations. Only needs to be set for vllm_vxpu_daemon. Try increasing this value if you launch multiple LLM services and encounter OOM. Default: 8.
  • VLLM_VXPU_SHM_NAME: The name of the shm file. Needs to be set for all containers of the shared vxpu group. Default: /vllm_kunlun_vxpu_offload_shm.
License

Apache-2.0