Initial commit for vLLM-Kunlun Plugin

dongxinyu03
2025-12-10 12:05:39 +08:00
commit c728e52505
131 changed files with 28816 additions and 0 deletions

@@ -0,0 +1,17 @@
# Environment Variables
vllm-kunlun uses the following environment variables to configure the system:
| **Environment Variable** | **Recommended value** | **Function description** |
| ---------------------------------------- | ----------------- | ------------------------------------------------------------ |
| `unset XPU_DUMMY_EVENT` | | **Unsets** the `XPU_DUMMY_EVENT` variable, usually to ensure that real XPU events are used for synchronization and performance measurement. |
| `export XPU_VISIBLE_DEVICES` | `0,1,2,3,4,5,6,7` | **Specifies the visible XPU devices.** Here, 8 devices (0 to 7) are made available for inference tasks. This is required for multi-card or distributed inference. |
| `export XPU_USE_MOE_SORTED_THRES` | `1` | Enables the **sort optimization** for MoE models. Setting it to `1` usually enables this performance optimization. |
| `export XFT_USE_FAST_SWIGLU` | `1` | Enables the **Fast SwiGLU Ops**. SwiGLU is a common activation function, and enabling this accelerates model inference. |
| `export XPU_USE_FAST_SWIGLU` | `1` | Enables the **Fast SwiGLU Ops**. Similar to `XFT_USE_FAST_SWIGLU`, this enables the fast SwiGLU calculation in the fused MoE ops. |
| `export XMLIR_CUDNN_ENABLED` | `1` | Enables XMLIR (an intermediate representation/compiler) to use the **cuDNN-compatible/optimized path** (which may map to the corresponding XPU-optimized libraries in the KunlunCore environment). |
| `export XPU_USE_DEFAULT_CTX` | `1` | Sets the XPU to use the default context. Typically used to simplify environment configuration and ensure runtime consistency. |
| `export XMLIR_FORCE_USE_XPU_GRAPH` | `1` | **Forces XPU Graph mode on.** Graph mode captures and optimizes the model execution graph, which can significantly improve inference performance. |
| `export VLLM_HOST_IP` | `$(hostname -i)` | **Sets the host IP address for the vLLM service.** The shell command resolves the current host's internal IP dynamically. The address is used for inter-node communication in a distributed environment. |
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disables the mock Torch compile function.** Set to `false` to ensure the actual compilation and optimization flow is used rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Controls whether the fused QK-Norm and RoPE implementation is used.** The default is `0` (use the original/standard RoPE). Setting it to `1` may be used to enable the fused implementation for Qwen3 models. |
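
When the service is launched from a Python script rather than a shell, the same settings can be applied to the process environment before the engine starts. The following is a minimal sketch, not part of vllm-kunlun: `RECOMMENDED_ENV` and `apply_kunlun_env` are hypothetical names, the values are copied from the table above, and the host IP is passed in by the caller instead of being resolved with `hostname -i`.

```python
import os

# Recommended values copied from the table above.
RECOMMENDED_ENV = {
    "XPU_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
    "XPU_USE_MOE_SORTED_THRES": "1",
    "XFT_USE_FAST_SWIGLU": "1",
    "XPU_USE_FAST_SWIGLU": "1",
    "XMLIR_CUDNN_ENABLED": "1",
    "XPU_USE_DEFAULT_CTX": "1",
    "XMLIR_FORCE_USE_XPU_GRAPH": "1",
    "XMLIR_ENABLE_MOCK_TORCH_COMPILE": "false",
    "FUSED_QK_ROPE_OP": "0",
}


def apply_kunlun_env(host_ip, env=None):
    """Apply the recommended settings to `env` (os.environ by default).

    Hypothetical helper: it mirrors the `unset` / `export` lines of the
    table and returns the mapping it modified.
    """
    target = os.environ if env is None else env
    # Equivalent of `unset XPU_DUMMY_EVENT`.
    target.pop("XPU_DUMMY_EVENT", None)
    target.update(RECOMMENDED_ENV)
    # Equivalent of `export VLLM_HOST_IP=$(hostname -i)`, with the
    # address supplied by the caller.
    target["VLLM_HOST_IP"] = host_ip
    return target
```

Calling `apply_kunlun_env("10.0.0.1")` (the address is a placeholder) before constructing the engine changes the current process environment; passing a plain `dict` as `env` leaves `os.environ` untouched, which is convenient for inspecting the result.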

@@ -0,0 +1,9 @@
# Configuration Guide
This section provides a detailed configuration guide for vLLM Kunlun.
:::{toctree}
:caption: Configuration Guide
:maxdepth: 1
env_vars
:::

@@ -0,0 +1,82 @@
# Graph Mode Guide
This guide provides instructions for using Kunlun Graph Mode with vLLM Kunlun. Graph mode is available on both the V1 and V0 engines, and all supported models are highly compatible with Kunlun Graph.
## Getting Started
Starting with vLLM-KunLun-0.10.1.1 on the V1 Engine, vLLM Kunlun runs models in graph mode by default, matching vLLM's behavior. If you hit any issues, please open an issue on GitHub and temporarily fall back to eager mode by setting `enforce_eager=True` when initializing the model.
vLLM Kunlun supports one graph mode:
- **KunlunGraph**: This is the default graph mode supported by vLLM Kunlun. In vLLM-KunLun-0.10.1.1, Qwen, GLM and InternVL series models are well tested.
## Using KunlunGraph
KunlunGraph is enabled by default. Taking Qwen series models as an example, using the V1 Engine (the default) is enough.
Offline example:
```python
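from vllm import LLM

# KunlunGraph is used by default on the V1 Engine; no extra flags are needed.
model = LLM(model="models/Qwen3-8B-Instruct")
outputs = model.generate("Hello, how are you?")
print(outputs[0].outputs[0].text)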
```
Online example:
```shell
vllm serve Qwen3-8B-Instruct
```
## Using KunlunGraph with Splitting Ops
Enabling Kunlun Graph on the Kunlun platform requires specifying splitting ops through `--compilation-config`.
Online example:
```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention", "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2"]}'
```
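
Because the `--compilation-config` value is JSON embedded in a shell command, quoting mistakes are easy to make. The sketch below builds the argument programmatically; `SPLITTING_OPS` and `compilation_config_arg` are hypothetical names, and the op names are copied from the command above.

```python
import json
import shlex

# Op names copied from the command above.
SPLITTING_OPS = [
    "vllm.unified_attention_with_output_kunlun",
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.mamba_mixer2",
]


def compilation_config_arg(splitting_ops):
    """Return the shell-quoted JSON value for --compilation-config."""
    payload = json.dumps({"splitting_ops": list(splitting_ops)})
    # shlex.quote wraps the JSON so the shell passes it as one argument.
    return shlex.quote(payload)


print("--compilation-config", compilation_config_arg(SPLITTING_OPS))
```

The printed line can be pasted at the end of the server command. Generating the JSON with `json.dumps` ensures that the brackets and quotes are balanced.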
## Fallback to the Eager Mode
If `KunlunGraph` fails to run, you should fall back to eager mode.
Online example:
```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --enforce-eager
```
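
Offline example (a minimal sketch): this guide names `enforce_eager=True` as the way to fall back when initializing the model. The helper names `eager_llm_kwargs` and `generate_eagerly` are hypothetical, and the model path is the one used in the offline example earlier in this guide.

```python
def eager_llm_kwargs(model_path):
    """Keyword arguments that disable graph mode for an offline LLM."""
    return {"model": model_path, "enforce_eager": True}


def generate_eagerly(model_path, prompt):
    """Run one prompt in eager mode and return the generated text."""
    # Imported lazily so the helper above can be used without vLLM installed.
    from vllm import LLM

    llm = LLM(**eager_llm_kwargs(model_path))
    outputs = llm.generate(prompt)
    return outputs[0].outputs[0].text
```

For example, `generate_eagerly("models/Qwen3-8B-Instruct", "Hello, how are you?")` runs the same prompt as the default offline example, but without graph capture.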

@@ -0,0 +1,11 @@
# Feature Guide
This section provides a detailed usage guide of vLLM Kunlun features.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
lora
:::

@@ -0,0 +1,27 @@
# LoRA Adapters Guide
## Overview
Like vLLM, vllm_kunlun supports LoRA. Usage and further details can be found in the [official vLLM documentation](https://docs.vllm.ai/en/latest/features/lora.html).
You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
Currently, only vLLM V0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.
## Example
We provide a simple LoRA example here:
```bash
export ENABLE_KUNLUN_LARGE_OPS=0
USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
--enable-lora \
--max-lora-rank 64 \
--lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```
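
Once the server is running, an adapter is selected per request by passing its name (`lora1` or `lora2` above) as the `model` field of an OpenAI-compatible request, as described in the vLLM LoRA documentation. The client below is a minimal sketch using only the standard library; `build_lora_request` and `query_adapter` are hypothetical helpers, and the base URL is an assumption (for example `http://localhost:8000` if the server uses the default port).

```python
import json
import urllib.request


def build_lora_request(adapter_name, prompt, max_tokens=64):
    """Build a completions payload that targets one LoRA adapter."""
    return {"model": adapter_name, "prompt": prompt, "max_tokens": max_tokens}


def query_adapter(base_url, adapter_name, prompt):
    """POST the payload to the server and return the decoded JSON reply."""
    body = json.dumps(build_lora_request(adapter_name, prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `query_adapter("http://localhost:8000", "lora1", "Hello")` sends the prompt through the first adapter; replacing `lora1` with `qwen3-8b` would address the base model instead.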
## Custom LoRA Operators
We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
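
To illustrate what these operators compute, here is a pure-Python reference of the conventional meaning of BGMV (batched gather matrix-vector multiplication). It is a sketch under stated assumptions, not the code in `lora_ops.py`: the function names `bgmv_shrink_ref` and `bgmv_expand_ref`, the argument order, and the nested-list tensor layouts are all hypothetical. Each token gathers the weights of its own adapter; "shrink" projects the hidden vector down to the LoRA rank, and "expand" projects it back up and accumulates into the output.

```python
def bgmv_shrink_ref(x, lora_a, indices, scale):
    """Project each token down to the LoRA rank with its adapter's A matrix.

    Assumed layouts: x is [tokens][hidden], lora_a is
    [adapters][rank][hidden], indices holds one adapter id per token.
    """
    out = []
    for token, idx in zip(x, indices):
        a = lora_a[idx]  # gather the A matrix of this token's adapter
        out.append([scale * sum(w * v for w, v in zip(row, token)) for row in a])
    return out


def bgmv_expand_ref(h, lora_b, indices, y):
    """Project each rank-sized vector back up and accumulate into y.

    Assumed layouts: h is [tokens][rank], lora_b is
    [adapters][hidden_out][rank], y is [tokens][hidden_out].
    """
    for t, (vec, idx) in enumerate(zip(h, indices)):
        b = lora_b[idx]  # gather the B matrix of this token's adapter
        for o, row in enumerate(b):
            y[t][o] += sum(w * v for w, v in zip(row, vec))
    return y
```

The `sgmv_*` variants are conventionally the segmented counterparts, in which consecutive tokens that share an adapter are processed as one matrix-matrix product; that reading is also an assumption here.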

@@ -0,0 +1,45 @@
# Quantization Guide
> Note: This feature is currently experimental. Future versions may change its behavior with respect to configuration, coverage, and performance.
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
## Usage
### Compressed-tensors
To run a `compressed-tensors` model with vLLM-kunlun, you should first add the below configuration to the model's `config.json`:
```json
"quantization_config": {
    "quant_method": "compressed-tensors"
}
```
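
The edit can also be scripted. The following is a minimal sketch using only the standard library; `add_compressed_tensors_config` is a hypothetical helper, and it assumes `config.json` is a regular JSON file in a local copy of the model directory.

```python
import json
from pathlib import Path


def add_compressed_tensors_config(config_path):
    """Add the compressed-tensors entry to a model's config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Keep any existing quantization settings and only set the method.
    quant = config.setdefault("quantization_config", {})
    quant["quant_method"] = "compressed-tensors"
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

All other keys in the file are preserved; only `quantization_config.quant_method` is added or overwritten.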
Then run `Qwen/Qwen3-30B-A3B` using dynamic W8A8 quantization with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B \
--quantization compressed-tensors
```
### AWQ
To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
```
### GPTQ
To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
```
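
The three commands above differ only in the model and the `--quantization` value, so a launcher script can derive them from one table. The sketch below is illustrative: `QUANTIZED_EXAMPLES` and `server_command` are hypothetical names, and the model/method pairs are copied from this guide.

```python
import sys

# Model/method pairs copied from the commands in this guide.
QUANTIZED_EXAMPLES = {
    "compressed-tensors": "Qwen/Qwen3-30B-A3B",
    "awq": "Qwen/Qwen3-32B-AWQ",
    "gptq": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
}


def server_command(method):
    """Return the argument list that launches the server for `method`."""
    model = QUANTIZED_EXAMPLES[method]
    return [
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--quantization", method,
    ]
```

The returned list can be passed to `subprocess.run(server_command("awq"), check=True)`; an unknown method raises `KeyError` before anything is launched.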

@@ -0,0 +1,3 @@
# Release Notes
Coming soon...

@@ -0,0 +1,10 @@
# Features and Models
This section provides a detailed support matrix for vLLM-Kunlun.
:::{toctree}
:caption: Support Matrix
:maxdepth: 1
supported_models
supported_features
:::

@@ -0,0 +1,14 @@
# Supported Features
The feature support principle of vLLM-KunLun is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.
You can check the [support status of vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM-KunLun:
## Features Supported
|Feature|Status|Note|
|-|-|-|
|Tensor Parallel|🟢 Functional||
|Expert Parallel|🟢 Functional||
|Graph Mode|🟢 Functional||
|Quantization|🟢 Functional||
|LoRA|⚠️ Needs testing|LLM models only|

@@ -0,0 +1,33 @@
# Supported Models
## Generative Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :------------ | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qwen2 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen2.5 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-MoE | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Coder | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| QwQ-32B | ✅ | | | ✅ | | ✅ | ✅ |
| Llama2 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3.1 | ✅ | | | ✅ | | ✅ | ✅ |
| GLM-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GLM-4.5-Air | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-next | 🔜 Coming soon | | | | | | |
| gpt-oss | 🔜 Coming soon | | | | | | |
| DeepSeek-V3 | 🔜 Coming soon | | | | | | |
| DeepSeek-V3.2 | 🔜 Coming soon | | | | | | |
## Multimodal Language Models
| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :----------- | :------------ | :--- | :--- | :-------------- | :-------------- | :------------ | :--------------------- |
| Qianfan-VL | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5VL | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL2.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternS1 | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-Omni | 🔜 Coming soon | | | | | | |
| Qwen3-VL | 🔜 Coming soon | | | | | | |