
📖 Documentation |
🚀 Quick Start |
📦 Installation |
💬 Slack
---
## Latest News 🔥
- [2026/02] 🧠 **GLM model family support** – Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and a GLM-4.7 tool parser with a thinking/non-thinking mode toggle
- [2026/02] ⚡ **Performance optimizations** – Fused MoE for small batches, optimized attention-metadata building, and Multi-LoRA inference reaching 80%+ of non-LoRA performance
- [2026/02] 🧠 **DeepSeek-V3.2 MTP support** – Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decode-stage kernel optimizations
- [2026/01] 🔢 **New quantization methods** – Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
- [2026/01] 🛠️ **CI/CD overhaul** – Added E2E tests, unit-test CI, ruff format checks, and a modular CI workflow refactoring
- [2025/12] 🚀 **v0.11.0rc1 released** – Added Qwen3-Omni, Qwen3-Next, and Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
- [2025/12] 📦 **v0.10.1.1 released** – 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, the vLLM V1 engine, and Flash-Infer Top-K/Top-P sampling with a 10–100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
- [2025/12] 🎉 **Initial release of vLLM Kunlun** – Open-sourced on Dec 8, 2025
---
## Overview
**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162).
This plugin provides a hardware-pluggable interface that decouples the Kunlun XPU integration from vLLM itself. With vLLM Kunlun, popular open-source models – including Transformer-style, Mixture-of-Experts (MoE), embedding, and multimodal LLMs – run effortlessly on the Kunlun XPU.
### ✨ Key Features
- **Seamless Plugin Integration** – Works as a standard vLLM platform plugin via Python entry points; no vLLM source changes required
- **Broad Model Support** – 15+ mainstream LLMs, including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
- **Quantization Support** – INT8 and other quantization methods for MoE and dense models
- **LoRA Fine-Tuning** – LoRA adapter support for Qwen-series models
- **Piecewise Kunlun Graph** – Hardware-accelerated graph optimization for high-performance inference
- **FlashMLA Attention** – Optimized multi-head latent attention for DeepSeek's MLA architecture
- **Tensor Parallelism** – Multi-device parallel inference with distributed execution support
- **OpenAI-Compatible API** – Serve models through the standard OpenAI API interface
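As an illustration of the plugin mechanism, registration typically happens through a packaging entry point rather than through changes to vLLM. A minimal sketch, assuming the `vllm.platform_plugins` entry-point group described in the hardware-pluggable RFC; the module path `vllm_kunlun:register` is hypothetical:

```toml
# pyproject.toml (sketch) – exposes the plugin to vLLM's plugin loader.
# vLLM scans this entry-point group at startup and calls the registered
# function to locate the platform implementation; no vLLM source edits
# are needed. The right-hand side is an illustrative module:attr path.
[project.entry-points."vllm.platform_plugins"]
kunlun = "vllm_kunlun:register"
```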
---
## Prerequisites
- **Hardware**: Kunlun3 P800
- **OS**: Ubuntu 22.04
- **Software**:
  - Python >= 3.10
  - PyTorch >= 2.5.1
  - vLLM (same version as vllm-kunlun)
  - transformers >= 4.57.0
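The version constraints above can be sanity-checked before installation. A minimal sketch using only the Python standard library; the package names are the usual PyPI distributions, so adjust them if your environment differs:

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def version_tuple(v: str) -> tuple:
    """Parse a dotted version string into a comparable tuple of ints."""
    return tuple(int(part) for part in v.split(".") if part.isdigit())


def check(name: str, minimum: str) -> bool:
    """Return True if the installed package meets the minimum version."""
    try:
        return version_tuple(version(name)) >= version_tuple(minimum)
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("python       ok:", sys.version_info >= (3, 10))
    print("torch        ok:", check("torch", "2.5.1"))
    print("transformers ok:", check("transformers", "4.57.0"))
```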
---
## Supported Models
### Generative Models
| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:----------------------:|
| Qwen2 | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 | ✅ | ✅ | ✅ | ✅ |
| Qwen3 | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | | ✅ |
| Qwen3-Next | ✅ | ✅ | | ✅ |
| MiMo-V2-Flash | ✅ | ✅ | | ✅ |
| Llama2 | ✅ | ✅ | ✅ | ✅ |
| Llama3 | ✅ | ✅ | ✅ | ✅ |
| Llama3.1 | ✅ | ✅ | | ✅ |
| gpt-oss | ✅ | ✅ | | |
| GLM4.5 | ✅ | ✅ | | ✅ |
| GLM4.5Air | ✅ | ✅ | | ✅ |
| GLM4.7 | ✅ | ✅ | | ✅ |
| GLM5 | ✅ | ✅ | | ✅ |
| Kimi-K2 | ✅ | ✅ | | ✅ |
| DeepSeek-R1 | ✅ | ✅ | | ✅ |
| DeepSeek-V3 | ✅ | ✅ | | ✅ |
| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |
### Multimodal Language Models
| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:----------------------:|
| Qwen2-VL | ✅ | ✅ | | ✅ |
| Qwen2.5-VL | ✅ | ✅ | | ✅ |
| Qwen3-VL | ✅ | ✅ | | ✅ |
| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
| Qwen3-Omni-MoE | ✅ | | | ✅ |
| InternVL-2.5 | ✅ | | | ✅ |
| InternVL-3.5 | ✅ | | | ✅ |
| InternS1 | ✅ | | | ✅ |
---
## Performance Visualization 📊
### High-performance computing at work: how different models perform on the Kunlun3 P800
Test environment: 16 concurrent requests, input/output length 2048 tokens.

---
## Quick Start
#### Start an OpenAI-Compatible API Server
Replace `<model_path>` and `<served_model_name>` with your local model path and the name clients should use to address it.
```bash
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model <model_path> \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --distributed-executor-backend mp \
    --served-model-name <served_model_name>
```
#### Send a Request
```bash
curl http://localhost:8356/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<served_model_name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 512
    }'
```
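The same request can be sent from Python without extra dependencies. A minimal sketch using only the standard library; the port (8356) and the model name must match whatever the server was started with:

```python
import json
import urllib.request


def build_chat_request(model: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def send(payload: dict,
         url: str = "http://localhost:8356/v1/chat/completions") -> dict:
    """POST the payload as JSON and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    reply = send(build_chat_request("<served_model_name>", "Hello!"))
    print(reply["choices"][0]["message"]["content"])
```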
### Version Matrix
| Version | Release Type | Documentation |
|---------|:------------:|:-------------:|
| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |
---
## Architecture
```
vllm-kunlun/
├── vllm_kunlun/            # Core plugin package
│   ├── platforms/          # Kunlun XPU platform implementation
│   ├── models/             # Model implementations (DeepSeek, Qwen, Llama, etc.)
│   ├── ops/                # Custom operators (attention, linear, sampling, etc.)
│   │   ├── attention/      # FlashMLA, paged attention, merge attention states
│   │   ├── fla/            # Flash linear attention operations
│   │   └── sample/         # Sampling operators
│   ├── v1/                 # vLLM V1 engine adaptations
│   ├── compilation/        # Torch compile wrapper for Kunlun Graph
│   ├── csrc/               # C++ extensions (custom CUDA-compatible kernels)
│   └── config/             # Model configuration overrides
├── tests/                  # Test suite
├── docs/                   # Documentation (Sphinx-based, ReadTheDocs hosted)
├── ci/                     # CI pipeline configurations
├── setup.py                # Legacy build script (with C++ extensions)
└── pyproject.toml          # Modern Python build configuration (hatchling)
```
---
## Contributing
We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.
### PR Classification
Use the following prefixes for PR titles:
- `[Attention]` โ Attention mechanism features/optimizations
- `[Core]` โ Core vllm-kunlun logic (platform, attention, communicators, model runner)
- `[Kernel]` โ Compute kernels and ops
- `[Bugfix]` โ Bug fixes
- `[Doc]` โ Documentation improvements
- `[Test]` โ Tests
- `[CI]` โ CI/CD improvements
- `[Misc]` โ Other changes
---
## Star History 🔥
We open-sourced the project on Dec 8, 2025. We love open source and collaboration ❤️
[Star History Chart](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)
---
## Sponsors 🙏
We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation debugging, comprehensive end-to-end testing, and broader model compatibility.
---
## License
Apache License 2.0, as found in the [LICENSE](./LICENSE) file.