
# XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC

## Overview

XC-LLM is built on the popular LLM inference engine vLLM, with additional optimizations. This repository supports the Kunlun3 P800.

A key feature of this project is efficient memory coordination, which lets multiple vLLM instances share and dynamically hold the physical memory of a GPU (XPU). When an instance is idle, its model parameters are offloaded to host memory. When a new inference request arrives, the parameters are quickly restored to GPU memory (if they are no longer resident), without re-initializing the engine or loading the model from scratch. As a result, from the application’s perspective, multiple LLM inference engines can run on the same GPU even when their total memory requirements exceed the physical memory limit. This technique is referred to as `InfiniVRAM`.
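
To make the oversubscription idea concrete, the following sketch works through the arithmetic for two instances on one device. All sizes are hypothetical and chosen only for illustration; they are not measurements of any real model or of the P800.

```bash
# All sizes are hypothetical and only illustrate the idea.
device_gb=96      # physical device memory (assumed)
model_a_gb=70     # footprint of instance A (assumed)
model_b_gb=60     # footprint of instance B (assumed)

total_gb=$(( model_a_gb + model_b_gb ))
echo "combined footprint: ${total_gb} GB on a ${device_gb} GB device"

# Both instances cannot be resident at the same time...
if (( total_gb > device_gb )); then
  echo "oversubscribed by $(( total_gb - device_gb )) GB"
fi

# ...but each one fits on its own while the other is offloaded to host memory.
for size_gb in "$model_a_gb" "$model_b_gb"; do
  if (( size_gb <= device_gb )); then
    echo "${size_gb} GB instance fits while the other is idle"
  fi
done
```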


## Installation

### Build from Dockerfile

1. Get or build the base image (which includes the customized xpytorch, ops, etc.). Ref: [installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html).

2. Clone this repository and build the image:
    ```bash
    docker build -t $build_image -f ./Dockerfile .
    ```
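
The build command above expects `$build_image` to be set already. A minimal sketch of one way to do that follows; the tag `xc-llm-kunlun:dev` and the `DRY_RUN` switch are assumptions made for this example, not names defined by the project.

```bash
# Hypothetical tag; any valid image name works.
build_image="${build_image:-xc-llm-kunlun:dev}"

# Assemble the command as an array so the tag is passed as one argument.
build_cmd=(docker build -t "$build_image" -f ./Dockerfile .)

# Print the command by default; set DRY_RUN=0 to actually run the build.
echo "${build_cmd[*]}"
if [ "${DRY_RUN:-1}" = "0" ]; then
  "${build_cmd[@]}"
fi
```

Run it from the repository root so that `./Dockerfile` and the build context `.` resolve correctly.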

## Usage

0. Processes coordinate GPU sharing through shared memory (shm), so every container must run with `ipc=host` (for example, `--ipc=host` with `docker run`).
1. Start the daemon process in a standalone container by running `vllm_vxpu_daemon`, which is installed inside the image.
2. Start the LLM services with this image, following the official usage instructions.
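
The steps above might look roughly like the following dry run, which only prints the commands. The image tag, container names, and model path are hypothetical, and `vllm serve` stands in for whatever launch command the official instructions specify. Device passthrough flags and volume mounts are omitted because they depend on your host setup.

```bash
image="${image:-xc-llm-kunlun:dev}"     # hypothetical tag from the build step
model="${model:-/models/your-model}"    # hypothetical model path

# Step 1: the daemon runs in its own container, sharing the host IPC namespace.
daemon_cmd=(docker run -d --name vxpu-daemon --ipc=host "$image" vllm_vxpu_daemon)

# Step 2: each LLM service is a separate container started from the same image.
# `vllm serve` is an assumption; use the command from the official instructions.
service_cmd=(docker run -d --name llm-service-1 --ipc=host "$image" vllm serve "$model")

# Dry run: print the commands instead of executing them.
echo "${daemon_cmd[*]}"
echo "${service_cmd[*]}"
```

Every container in the group carries `--ipc=host`, so the daemon and the services see the same shared-memory namespace.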

### Environment Variables
- `VXPU_RESERVED_VRAM_SIZE_GB`: The amount of GPU memory, in GB, reserved for miscellaneous allocations. Only needs to be set for `vllm_vxpu_daemon`. Try increasing it if you launch multiple LLM services and encounter OOM errors. Default: `8`.
- `VLLM_VXPU_SHM_NAME`: The name of the shm file. Needs to be set for all containers of the shared vxpu group. Default: `/vllm_kunlun_vxpu_offload_shm`.
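
As a sketch of how these variables could be passed to containers, the snippet below builds `-e` flags for `docker run`. The array names are hypothetical, and it assumes every container in a group should see the same shm name as the daemon.

```bash
# Documented defaults; export different values beforehand to override them.
VXPU_RESERVED_VRAM_SIZE_GB="${VXPU_RESERVED_VRAM_SIZE_GB:-8}"
VLLM_VXPU_SHM_NAME="${VLLM_VXPU_SHM_NAME:-/vllm_kunlun_vxpu_offload_shm}"

# The daemon container takes both variables.
daemon_env=(-e "VXPU_RESERVED_VRAM_SIZE_GB=${VXPU_RESERVED_VRAM_SIZE_GB}"
            -e "VLLM_VXPU_SHM_NAME=${VLLM_VXPU_SHM_NAME}")

# Service containers only need the shm name (assumed to match the daemon's).
service_env=(-e "VLLM_VXPU_SHM_NAME=${VLLM_VXPU_SHM_NAME}")

echo "daemon flags:  ${daemon_env[*]}"
echo "service flags: ${service_env[*]}"
```

The arrays can be spliced into a launch command, for example `docker run "${daemon_env[@]}" ...`.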