add readme

2026-02-12 11:13:26 +08:00
parent 4d8575115a
commit a15754c3ba
2 changed files with 228 additions and 200 deletions
--- a/README.md
+++ b/README.md
@@ -1,214 +1,28 @@
-![vLLM Kunlun Logo](vllm_kunlun/patches/vLLM_Kunlun.jpg)
-
-<p align="center">
-  <a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b>📖 Documentation</b></a> |
-  <a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b>🚀 Quick Start</b></a> |
-  <a href="https://vllm-kunlun.readthedocs.io/en/latest/installation.html"><b>📦 Installation</b></a> |
-  <a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b>💬 Slack</b></a>
-</p>
-
-<p align="center">
-  <img alt="GitHub License" src="https://img.shields.io/github/license/baidu/vLLM-Kunlun">
-  <img alt="GitHub Stars" src="https://img.shields.io/github/stars/baidu/vLLM-Kunlun">
-  <img alt="GitHub Forks" src="https://img.shields.io/github/forks/baidu/vLLM-Kunlun">
-  <img alt="GitHub Issues" src="https://img.shields.io/github/issues/baidu/vLLM-Kunlun">
-  <img alt="Python Version" src="https://img.shields.io/badge/python-%3E%3D3.10-blue">
-</p>
-
---
-
-## Latest News 🔥
-
- [2026/02] 🧠 **GLM model family support** — Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and GLM-47 tool parser with thinking/non-thinking mode toggle
- [2026/02] ⚡ **Performance optimizations** — Fused MoE with small batches, optimized attention metadata building, Multi-LoRA inference achieves 80%+ of non-LoRA performance
- [2026/02] 🔧 **DeepSeek-V3.2 MTP support** — Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding stage kernel optimizations
- [2026/01] 🔢 **New quantization methods** — Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
- [2026/01] 🛠️ **CI/CD overhaul** — Added E2E tests, unit test CI, ruff format checks, and modular CI workflow refactoring
- [2025/12] 🎉 **v0.11.0rc1 released** — Added Qwen3-Omni, Qwen3-Next, Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
- [2025/12] 📦 **v0.10.1.1 released** — 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, Flash-Infer Top-K/Top-P sampling with 10-100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
- [2025/12] 🌟 Initial release of vLLM Kunlun — Open sourced on Dec 8, 2025
-
---
+# XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC

 ## Overview

-**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162).
+This project builds on the popular LLM inference engine vLLM and adds optimizations on top of it. The repository supports the Kunlun3 P800.

-This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM. By utilizing vLLM Kunlun, popular open-source models — including Transformer-like, Mixture-of-Expert (MoE), Embedding, and Multi-modal LLMs — can run effortlessly on the Kunlun XPU.
+A key feature of this project is efficient memory coordination, which lets multiple vLLM instances share and dynamically hold the physical memory of a GPU (XPU). When an instance is idle, its model parameters are offloaded to host memory. When a new inference request arrives, the parameters are quickly restored to GPU memory if they are not already resident, without re-initializing the engine or reloading the model from scratch. As a result, from the application’s perspective, multiple LLM inference engines can run on the same GPU even when their total memory requirements exceed its physical memory. This technique is referred to as `InfiniVRAM`.

-### ✨ Key Features

- **Seamless Plugin Integration** — Works as a standard vLLM platform plugin via Python entry points, no need to modify vLLM source code
- **Broad Model Support** — Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
- **Quantization Support** — INT8 and other quantization methods for MoE and dense models
- **LoRA Fine-Tuning** — LoRA adapter support for Qwen series models
- **Piecewise Kunlun Graph** — Hardware-accelerated graph optimization for high-performance inference
- **FlashMLA Attention** — Optimized multi-head latent attention for DeepSeek MLA architectures
- **Tensor Parallelism** — Multi-device parallel inference with distributed execution support
- **OpenAI-Compatible API** — Serve models with the standard OpenAI API interface
+## Installation

---
+### Build from Dockerfile

-## Prerequisites
-
- **Hardware**: Kunlun3 P800
- **OS**: Ubuntu 22.04
- **Software**:
-  - Python >= 3.10
-  - PyTorch >= 2.5.1
-  - vLLM (same version as vllm-kunlun)
-  - transformers >= 4.57.0
-
---
-
-## Supported Models
-
-### Generative Models
-
-| Model | Support | Quantization | LoRA |  Kunlun Graph |
-|:------|:-------:|:------------:|:----:|:----------------------:|
-| Qwen2 | ✅ | ✅| ✅ | ✅ |
-| Qwen2.5 | ✅ |✅ | ✅ | ✅ |
-| Qwen3 | ✅ |✅ | ✅ | ✅ |
-| Qwen3-Moe | ✅ | ✅ |  | ✅ |
-| Qwen3-Next | ✅ | ✅ |  | ✅ |
-| MiMo-V2-Flash | ✅ | ✅| | ✅ |
-| Llama2 | ✅ | ✅|  ✅| ✅ |
-| Llama3 | ✅ |✅ | ✅ | ✅ |
-| Llama3.1 | ✅ |✅ | | ✅ |
-| gpt-oss | ✅ | ✅| | |
-| GLM4.5 | ✅ | ✅| | ✅ |
-| GLM4.5Air | ✅ |✅ | | ✅ |
-| GLM4.7 | ✅ | ✅| | ✅ |
-| GLM5 | ✅ | ✅| | ✅ |
-| Kimi-K2 | ✅ | ✅ | | ✅ |
-| DeepSeek-R1 | ✅ | ✅ | | ✅ |
-| DeepSeek-V3 | ✅ | ✅ | | ✅ |
-| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |
-
-### Multimodal Language Models
-
-| Model | Support | Quantization | LoRA |  Kunlun Graph |
-|:------|:-------:|:------------:|:----:|:----------------------:|
-| Qwen2-VL | ✅ | ✅| | ✅ |
-| Qwen2.5-VL | ✅ | ✅| | ✅ |
-| Qwen3-VL | ✅ | ✅| | ✅ |
-| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
-| Qwen3-Omni-MoE | ✅ | | | ✅ |
-| InternVL-2.5 | ✅ | | | ✅ |
-| InternVL-3.5 | ✅ | | | ✅ |
-| InternS1 | ✅ | | | ✅ |
-
---
-
-## Performance Visualization 🚀
-
-### High-performance computing at work: How different models perform on the Kunlun3 P800.
-
-Current environment: 16-way concurrency, input/output size 2048.
-
-![Models and tgs](./vllm_kunlun/patches/performance.png)
-
---
-
-### Quick Start
-
-#### Start an OpenAI-Compatible API Server
+Clone this repository, then build the image from the repository root (set `$build_image` to the image tag you want):

 ```bash
-python -m vllm.entrypoints.openai.api_server \
-    --host 0.0.0.0 \
-    --port 8356 \
-    --model <your-model-path> \
-    --gpu-memory-utilization 0.9 \
-    --trust-remote-code \
-    --max-model-len 32768 \
-    --tensor-parallel-size 1 \
-    --dtype float16 \
-    --max_num_seqs 128 \
-    --max_num_batched_tokens 32768 \
-    --block-size 128 \
-    --distributed-executor-backend mp \
-    --served-model-name <your-model-name>
+docker build -t $build_image -f ./Dockerfile .
 ```

-#### Send a Request
+## Usage

-```bash
-curl http://localhost:8356/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "<your-model-name>",
-    "messages": [{"role": "user", "content": "Hello!"}],
-    "max_tokens": 512
-  }'
-```
+0. Processes coordinate GPU sharing through shared memory (shm), so every container must be started with `ipc=host`.
+1. Start a daemon process in a standalone container by running `vllm_vxpu_daemon`, which is installed in the image.
+2. Start LLM services with this image, following the official usage instructions.

-### Version Matrix
-
-| Version | Release Type | Documentation |
-|---------|:------------:|:-------------:|
-| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |
-
---
-
-## Architecture
-
-```
-vllm-kunlun/
-├── vllm_kunlun/              # Core plugin package
-│   ├── platforms/             # Kunlun XPU platform implementation
-│   ├── models/                # Model implementations (DeepSeek, Qwen, Llama, etc.)
-│   ├── ops/                   # Custom operators (attention, linear, sampling, etc.)
-│   │   ├── attention/         # FlashMLA, paged attention, merge attention states
-│   │   ├── fla/               # Flash linear attention operations
-│   │   └── sample/            # Sampling operators
-│   ├── v1/                    # vLLM V1 engine adaptations
-│   ├── compilation/           # Torch compile wrapper for Kunlun Graph
-│   ├── csrc/                  # C++ extensions (custom CUDA-compatible kernels)
-│   └── config/                # Model configuration overrides
-├── tests/                     # Test suite
-├── docs/                      # Documentation (Sphinx-based, ReadTheDocs hosted)
-├── ci/                        # CI pipeline configurations
-├── setup.py                   # Legacy build script (with C++ extensions)
-└── pyproject.toml             # Modern Python build configuration (hatchling)
-```
-
---
-
-## Contributing
-
-We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.
-
-### PR Classification
-
-Use the following prefixes for PR titles:
-
- `[Attention]` — Attention mechanism features/optimizations
- `[Core]` — Core vllm-kunlun logic (platform, attention, communicators, model runner)
- `[Kernel]` — Compute kernels and ops
- `[Bugfix]` — Bug fixes
- `[Doc]` — Documentation improvements
- `[Test]` — Tests
- `[CI]` — CI/CD improvements
- `[Misc]` — Other changes
-
---
-
-## Star History 🔥
-
-We opened the project at Dec 8, 2025. We love open source and collaboration ❤️
-
-[![Star History Chart](https://api.star-history.com/svg?repos=baidu/vLLM-Kunlun&type=date&legend=bottom-right)](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)
-
---
-
-## Sponsors 👋
-
-We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.
-
---
-
-## License
-
-Apache License 2.0, as found in the [LICENSE](./LICENSE) file.
+### Environment Variables
+- `VXPU_RESERVED_VRAM_SIZE_GB`: The amount of GPU memory, in GB, reserved for miscellaneous allocations. Only needs to be set for `vllm_vxpu_daemon`. Try increasing it if you launch multiple LLM services and encounter out-of-memory (OOM) errors. Default: `8`.
+- `VLLM_VXPU_SHM_NAME`: The name of the shm file. Needs to be set for all containers of the shared vxpu group. Default: `/vllm_kunlun_vxpu_offload_shm`.
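
Note on the new Usage section: the three steps and the two environment variables could be combined roughly as in the dry-run sketch below. It only prints the `docker run` commands and does not execute them. Several things here are assumptions and are not stated in the README: the image tag `xc-llm:dev`, the model paths and ports, passing the variables with `-e`, using Docker's `--ipc=host` flag as the spelling of `ipc=host`, and using the vLLM OpenAI-compatible entrypoint as the "official usage". Device passthrough and port-mapping flags for the Kunlun XPU are omitted because the README does not specify them.

```bash
#!/usr/bin/env bash
# Dry-run sketch: prints the docker commands instead of running them.
set -euo pipefail

build_image="${build_image:-xc-llm:dev}"                         # hypothetical image tag
shm_name="${VLLM_VXPU_SHM_NAME:-/vllm_kunlun_vxpu_offload_shm}"  # default from the README
reserved_gb="${VXPU_RESERVED_VRAM_SIZE_GB:-8}"                   # default from the README

# Step 1: the daemon runs in its own container.
# Only the daemon needs VXPU_RESERVED_VRAM_SIZE_GB.
daemon_cmd() {
  echo docker run -d --ipc=host \
    -e "VLLM_VXPU_SHM_NAME=${shm_name}" \
    -e "VXPU_RESERVED_VRAM_SIZE_GB=${reserved_gb}" \
    "${build_image}" vllm_vxpu_daemon
}

# Step 2: each LLM service container gets the same shm name as the daemon.
# The model path and port arguments are placeholders.
service_cmd() {
  local model_path="$1" port="$2"
  echo docker run -d --ipc=host \
    -e "VLLM_VXPU_SHM_NAME=${shm_name}" \
    "${build_image}" \
    python -m vllm.entrypoints.openai.api_server \
    --model "${model_path}" --port "${port}"
}

daemon_cmd
service_cmd /models/model-a 8000
service_cmd /models/model-b 8001
```

Printing the commands first makes it easy to check that every container in the shared vxpu group carries the same `VLLM_VXPU_SHM_NAME` before anything is started.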
--- a/README.md.bak
+++ b/README.md.bak
@@ -0,0 +1,214 @@
+![vLLM Kunlun Logo](vllm_kunlun/patches/vLLM_Kunlun.jpg)
+
+<p align="center">
+  <a href="https://vllm-kunlun.readthedocs.io/en/latest/"><b>📖 Documentation</b></a> |
+  <a href="https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html"><b>🚀 Quick Start</b></a> |
+  <a href="https://vllm-kunlun.readthedocs.io/en/latest/installation.html"><b>📦 Installation</b></a> |
+  <a href="https://join.slack.com/t/vllm-kunlun/shared_invite/zt-3iinb8u5z-FcqZKbNNdMJ_32fHmipzvw"><b>💬 Slack</b></a>
+</p>
+
+<p align="center">
+  <img alt="GitHub License" src="https://img.shields.io/github/license/baidu/vLLM-Kunlun">
+  <img alt="GitHub Stars" src="https://img.shields.io/github/stars/baidu/vLLM-Kunlun">
+  <img alt="GitHub Forks" src="https://img.shields.io/github/forks/baidu/vLLM-Kunlun">
+  <img alt="GitHub Issues" src="https://img.shields.io/github/issues/baidu/vLLM-Kunlun">
+  <img alt="Python Version" src="https://img.shields.io/badge/python-%3E%3D3.10-blue">
+</p>
+
+---
+
+## Latest News 🔥
+
+- [2026/02] 🧠 **GLM model family support** — Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and GLM-47 tool parser with thinking/non-thinking mode toggle
+- [2026/02] ⚡ **Performance optimizations** — Fused MoE with small batches, optimized attention metadata building, Multi-LoRA inference achieves 80%+ of non-LoRA performance
+- [2026/02] 🔧 **DeepSeek-V3.2 MTP support** — Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding stage kernel optimizations
+- [2026/01] 🔢 **New quantization methods** — Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
+- [2026/01] 🛠️ **CI/CD overhaul** — Added E2E tests, unit test CI, ruff format checks, and modular CI workflow refactoring
+- [2025/12] 🎉 **v0.11.0rc1 released** — Added Qwen3-Omni, Qwen3-Next, Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
+- [2025/12] 📦 **v0.10.1.1 released** — 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, Flash-Infer Top-K/Top-P sampling with 10-100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
+- [2025/12] 🌟 Initial release of vLLM Kunlun — Open sourced on Dec 8, 2025
+
+---
+
+## Overview
+
+**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162).
+
+This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU with vLLM. By utilizing vLLM Kunlun, popular open-source models — including Transformer-like, Mixture-of-Expert (MoE), Embedding, and Multi-modal LLMs — can run effortlessly on the Kunlun XPU.
+
+### ✨ Key Features
+
+- **Seamless Plugin Integration** — Works as a standard vLLM platform plugin via Python entry points, no need to modify vLLM source code
+- **Broad Model Support** — Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
+- **Quantization Support** — INT8 and other quantization methods for MoE and dense models
+- **LoRA Fine-Tuning** — LoRA adapter support for Qwen series models
+- **Piecewise Kunlun Graph** — Hardware-accelerated graph optimization for high-performance inference
+- **FlashMLA Attention** — Optimized multi-head latent attention for DeepSeek MLA architectures
+- **Tensor Parallelism** — Multi-device parallel inference with distributed execution support
+- **OpenAI-Compatible API** — Serve models with the standard OpenAI API interface
+
+---
+
+## Prerequisites
+
+- **Hardware**: Kunlun3 P800
+- **OS**: Ubuntu 22.04
+- **Software**:
+  - Python >= 3.10
+  - PyTorch >= 2.5.1
+  - vLLM (same version as vllm-kunlun)
+  - transformers >= 4.57.0
+
+---
+
+## Supported Models
+
+### Generative Models
+
+| Model | Support | Quantization | LoRA |  Kunlun Graph |
+|:------|:-------:|:------------:|:----:|:----------------------:|
+| Qwen2 | ✅ | ✅| ✅ | ✅ |
+| Qwen2.5 | ✅ |✅ | ✅ | ✅ |
+| Qwen3 | ✅ |✅ | ✅ | ✅ |
+| Qwen3-Moe | ✅ | ✅ |  | ✅ |
+| Qwen3-Next | ✅ | ✅ |  | ✅ |
+| MiMo-V2-Flash | ✅ | ✅| | ✅ |
+| Llama2 | ✅ | ✅|  ✅| ✅ |
+| Llama3 | ✅ |✅ | ✅ | ✅ |
+| Llama3.1 | ✅ |✅ | | ✅ |
+| gpt-oss | ✅ | ✅| | |
+| GLM4.5 | ✅ | ✅| | ✅ |
+| GLM4.5Air | ✅ |✅ | | ✅ |
+| GLM4.7 | ✅ | ✅| | ✅ |
+| GLM5 | ✅ | ✅| | ✅ |
+| Kimi-K2 | ✅ | ✅ | | ✅ |
+| DeepSeek-R1 | ✅ | ✅ | | ✅ |
+| DeepSeek-V3 | ✅ | ✅ | | ✅ |
+| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |
+
+### Multimodal Language Models
+
+| Model | Support | Quantization | LoRA |  Kunlun Graph |
+|:------|:-------:|:------------:|:----:|:----------------------:|
+| Qwen2-VL | ✅ | ✅| | ✅ |
+| Qwen2.5-VL | ✅ | ✅| | ✅ |
+| Qwen3-VL | ✅ | ✅| | ✅ |
+| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
+| Qwen3-Omni-MoE | ✅ | | | ✅ |
+| InternVL-2.5 | ✅ | | | ✅ |
+| InternVL-3.5 | ✅ | | | ✅ |
+| InternS1 | ✅ | | | ✅ |
+
+---
+
+## Performance Visualization 🚀
+
+### High-performance computing at work: How different models perform on the Kunlun3 P800.
+
+Current environment: 16-way concurrency, input/output size 2048.
+
+![Models and tgs](./vllm_kunlun/patches/performance.png)
+
+---
+
+### Quick Start
+
+#### Start an OpenAI-Compatible API Server
+
+```bash
+python -m vllm.entrypoints.openai.api_server \
+    --host 0.0.0.0 \
+    --port 8356 \
+    --model <your-model-path> \
+    --gpu-memory-utilization 0.9 \
+    --trust-remote-code \
+    --max-model-len 32768 \
+    --tensor-parallel-size 1 \
+    --dtype float16 \
+    --max_num_seqs 128 \
+    --max_num_batched_tokens 32768 \
+    --block-size 128 \
+    --distributed-executor-backend mp \
+    --served-model-name <your-model-name>
+```
+
+#### Send a Request
+
+```bash
+curl http://localhost:8356/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "<your-model-name>",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 512
+  }'
+```
+
+### Version Matrix
+
+| Version | Release Type | Documentation |
+|---------|:------------:|:-------------:|
+| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |
+
+---
+
+## Architecture
+
+```
+vllm-kunlun/
+├── vllm_kunlun/              # Core plugin package
+│   ├── platforms/             # Kunlun XPU platform implementation
+│   ├── models/                # Model implementations (DeepSeek, Qwen, Llama, etc.)
+│   ├── ops/                   # Custom operators (attention, linear, sampling, etc.)
+│   │   ├── attention/         # FlashMLA, paged attention, merge attention states
+│   │   ├── fla/               # Flash linear attention operations
+│   │   └── sample/            # Sampling operators
+│   ├── v1/                    # vLLM V1 engine adaptations
+│   ├── compilation/           # Torch compile wrapper for Kunlun Graph
+│   ├── csrc/                  # C++ extensions (custom CUDA-compatible kernels)
+│   └── config/                # Model configuration overrides
+├── tests/                     # Test suite
+├── docs/                      # Documentation (Sphinx-based, ReadTheDocs hosted)
+├── ci/                        # CI pipeline configurations
+├── setup.py                   # Legacy build script (with C++ extensions)
+└── pyproject.toml             # Modern Python build configuration (hatchling)
+```
+
+---
+
+## Contributing
+
+We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.
+
+### PR Classification
+
+Use the following prefixes for PR titles:
+
+- `[Attention]` — Attention mechanism features/optimizations
+- `[Core]` — Core vllm-kunlun logic (platform, attention, communicators, model runner)
+- `[Kernel]` — Compute kernels and ops
+- `[Bugfix]` — Bug fixes
+- `[Doc]` — Documentation improvements
+- `[Test]` — Tests
+- `[CI]` — CI/CD improvements
+- `[Misc]` — Other changes
+
+---
+
+## Star History 🔥
+
+We opened the project at Dec 8, 2025. We love open source and collaboration ❤️
+
+[![Star History Chart](https://api.star-history.com/svg?repos=baidu/vLLM-Kunlun&type=date&legend=bottom-right)](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)
+
+---
+
+## Sponsors 👋
+
+We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.
+
+---
+
+## License
+
+Apache License 2.0, as found in the [LICENSE](./LICENSE) file.