![vLLM Kunlun Logo](vllm_kunlun/patches/vLLM_Kunlun.jpg)

📖 Documentation | 🚀 Quick Start | 📦 Installation | 💬 Slack


---

## Latest News 🔥

- [2026/02] 🧠 **GLM model family support**: Added GLM5, GLM-4.7 MTP (Multi-Token Prediction), and the GLM-47 tool parser with a thinking/non-thinking mode toggle
- [2026/02] ⚡ **Performance optimizations**: Fused MoE with small batches and optimized attention metadata building; Multi-LoRA inference achieves 80%+ of non-LoRA performance
- [2026/02] 🔧 **DeepSeek-V3.2 MTP support**: Added MTP (Multi-Token Prediction) for DeepSeek-V3.2, with RoPE and decoding-stage kernel optimizations
- [2026/01] 🔢 **New quantization methods**: Support for compressed-tensors W4A16, AWQ MoE W4A16, and DeepSeek-V3.2 W8A8 quantization
- [2026/01] 🛠️ **CI/CD overhaul**: Added E2E tests, unit test CI, ruff format checks, and a modular CI workflow refactoring
- [2025/12] 🎉 **v0.11.0rc1 released**: Added Qwen3-Omni, Qwen3-Next, and Seed-OSS support ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.11.0rc1))
- [2025/12] 📦 **v0.10.1.1 released**: 5+ multimodal models, AWQ/GPTQ quantization for dense models, Piecewise CUDA Graph, vLLM V1 engine, and FlashInfer Top-K/Top-P sampling with a 10-100× speedup ([Release Notes](https://github.com/baidu/vLLM-Kunlun/releases/tag/v0.10.1.1))
- [2025/12] 🌟 Initial release of vLLM Kunlun: open-sourced on Dec 8, 2025

---

## Overview

**vLLM Kunlun** (`vllm-kunlun`) is a community-maintained hardware plugin designed to seamlessly run [vLLM](https://github.com/vllm-project/vllm) on the **Kunlun XPU**. It is the recommended approach for integrating the Kunlun backend within the vLLM community, adhering to the principles outlined in the [RFC Hardware Pluggable](https://github.com/vllm-project/vllm/issues/11162). This plugin provides a hardware-pluggable interface that decouples the integration of the Kunlun XPU from vLLM.

With vLLM Kunlun, popular open-source models, including Transformer-like, Mixture-of-Experts (MoE), Embedding, and Multi-modal LLMs, can run effortlessly on the Kunlun XPU.

### ✨ Key Features

- **Seamless Plugin Integration**: Works as a standard vLLM platform plugin via Python entry points, with no need to modify vLLM source code
- **Broad Model Support**: Supports 15+ mainstream LLMs including Qwen, Llama, DeepSeek, Kimi-K2, and multimodal models
- **Quantization Support**: FP8 and other quantization methods for MoE and dense models
- **LoRA Fine-Tuning**: LoRA adapter support for Qwen series models
- **Piecewise Kunlun Graph**: Hardware-accelerated graph optimization for high-performance inference
- **FlashMLA Attention**: Optimized multi-head latent attention for DeepSeek MLA architectures
- **Tensor Parallelism**: Multi-device parallel inference with distributed execution support
- **OpenAI-Compatible API**: Serve models with the standard OpenAI API interface

---

## Prerequisites

- **Hardware**: Kunlun3 P800
- **OS**: Ubuntu 22.04
- **Software**:
  - Python >= 3.10
  - PyTorch >= 2.5.1
  - vLLM (same version as vllm-kunlun)
  - transformers >= 4.57.0

---

## Supported Models

### Generative Models

| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:------------:|
| Qwen2 | ✅ | ✅ | ✅ | ✅ |
| Qwen2.5 | ✅ | ✅ | ✅ | ✅ |
| Qwen3 | ✅ | ✅ | ✅ | ✅ |
| Qwen3-MoE | ✅ | ✅ | | ✅ |
| Qwen3-Next | ✅ | ✅ | | ✅ |
| MiMo-V2-Flash | ✅ | ✅ | | ✅ |
| Llama2 | ✅ | ✅ | ✅ | ✅ |
| Llama3 | ✅ | ✅ | ✅ | ✅ |
| Llama3.1 | ✅ | ✅ | | ✅ |
| gpt-oss | ✅ | ✅ | | |
| GLM4.5 | ✅ | ✅ | | ✅ |
| GLM4.5Air | ✅ | ✅ | | ✅ |
| GLM4.7 | ✅ | ✅ | | ✅ |
| GLM5 | ✅ | ✅ | | ✅ |
| Kimi-K2 | ✅ | ✅ | | ✅ |
| DeepSeek-R1 | ✅ | ✅ | | ✅ |
| DeepSeek-V3 | ✅ | ✅ | | ✅ |
| DeepSeek-V3.2 | ✅ | ✅ | | ✅ |

### Multimodal Language Models

| Model | Support | Quantization | LoRA | Kunlun Graph |
|:------|:-------:|:------------:|:----:|:------------:|
| Qwen2-VL | ✅ | ✅ | | ✅ |
| Qwen2.5-VL | ✅ | ✅ | | ✅ |
| Qwen3-VL | ✅ | ✅ | | ✅ |
| Qwen3-VL-MoE | ✅ | ✅ | | ✅ |
| Qwen3-Omni-MoE | ✅ | | | ✅ |
| InternVL-2.5 | ✅ | | | ✅ |
| InternVL-3.5 | ✅ | | | ✅ |
| InternS1 | ✅ | | | ✅ |

---

## Performance Visualization 🚀

### High-performance computing at work: How different models perform on the Kunlun3 P800.

Current environment: 16-way concurrency, input/output size 2048.

![Models and tgs](./vllm_kunlun/patches/performance.png)

---

### Quick Start

#### Start an OpenAI-Compatible API Server

```bash
# <your-model-path> and <your-model-name> are placeholders; substitute your own values.
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8356 \
    --model <your-model-path> \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --max-num-seqs 128 \
    --max-num-batched-tokens 32768 \
    --block-size 128 \
    --distributed-executor-backend mp \
    --served-model-name <your-model-name>
```

#### Send a Request

```bash
# "model" must match the value passed to --served-model-name.
curl http://localhost:8356/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "<your-model-name>",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 512
    }'
```

### Version Matrix

| Version | Release Type | Documentation |
|---------|:------------:|:-------------:|
| v0.11.0 | Latest stable version | [Quick Start](https://vllm-kunlun.readthedocs.io/en/latest/quick_start.html) · [Installation](https://vllm-kunlun.readthedocs.io/en/latest/installation.html) |

---

## Architecture

```
vllm-kunlun/
├── vllm_kunlun/          # Core plugin package
│   ├── platforms/        # Kunlun XPU platform implementation
│   ├── models/           # Model implementations (DeepSeek, Qwen, Llama, etc.)
│   ├── ops/              # Custom operators (attention, linear, sampling, etc.)
│   │   ├── attention/    # FlashMLA, paged attention, merge attention states
│   │   ├── fla/          # Flash linear attention operations
│   │   └── sample/       # Sampling operators
│   ├── v1/               # vLLM V1 engine adaptations
│   ├── compilation/      # Torch compile wrapper for Kunlun Graph
│   ├── csrc/             # C++ extensions (custom CUDA-compatible kernels)
│   └── config/           # Model configuration overrides
├── tests/                # Test suite
├── docs/                 # Documentation (Sphinx-based, ReadTheDocs hosted)
├── ci/                   # CI pipeline configurations
├── setup.py              # Legacy build script (with C++ extensions)
└── pyproject.toml        # Modern Python build configuration (hatchling)
```

---

## Contributing

We welcome contributions from the community! Please read our [Contributing Guide](CONTRIBUTING.md) before submitting a PR.

### PR Classification

Use the following prefixes for PR titles:

- `[Attention]`: Attention mechanism features/optimizations
- `[Core]`: Core vllm-kunlun logic (platform, attention, communicators, model runner)
- `[Kernel]`: Compute kernels and ops
- `[Bugfix]`: Bug fixes
- `[Doc]`: Documentation improvements
- `[Test]`: Tests
- `[CI]`: CI/CD improvements
- `[Misc]`: Other changes

---

## Star History 🔥

We open-sourced the project on Dec 8, 2025. We love open source and collaboration ❤️

[![Star History Chart](https://api.star-history.com/svg?repos=baidu/vLLM-Kunlun&type=date&legend=bottom-right)](https://www.star-history.com/#baidu/vLLM-Kunlun&type=date&legend=bottom-right)

---

## Sponsors 👋

We sincerely appreciate the [**KunLunXin**](https://www.kunlunxin.com/) team for their support in providing XPU resources, which enabled efficient model adaptation and debugging, comprehensive end-to-end testing, and broader model compatibility.

---

## License

Apache License 2.0, as found in the [LICENSE](./LICENSE) file.