Initial commit for vLLM-Kunlun Plugin

This commit is contained in:
dongxinyu03
2025-12-10 12:05:39 +08:00
commit c728e52505
131 changed files with 28816 additions and 0 deletions


@@ -0,0 +1,82 @@
# Graph Mode Guide
This guide explains how to use Kunlun Graph Mode with vLLM Kunlun. Graph mode is available on both the V1 and V0 engines. All supported models are compatible with Kunlun Graph.
## Getting Started
Starting with vLLM-KunLun-0.10.1.1 on the V1 Engine, vLLM Kunlun runs models in graph mode by default, consistent with vLLM. If you hit any issues, please open an issue on GitHub and temporarily fall back to eager mode by setting `enforce_eager=True` when initializing the model.
vLLM Kunlun currently supports one graph mode:
- **KunlunGraph**: This is the default graph mode supported by vLLM Kunlun. In vLLM-KunLun-0.10.1.1, Qwen, GLM and InternVL series models are well tested.
## Using KunlunGraph
KunlunGraph is enabled by default. Taking the Qwen series as an example, using the V1 Engine (the default) is all that is required.
Offline example:
```python
from vllm import LLM

# Graph mode is used by default, so no extra arguments are needed.
model = LLM(model="models/Qwen3-8B-Instruct")
outputs = model.generate("Hello, how are you?")
```
Online example:
```shell
vllm serve Qwen3-8B-Instruct
```
## Using KunlunGraph with Splitting Ops
Enabling Kunlun Graph on the Kunlun platform requires splitting ops, which are passed through `--compilation-config`.
Online example:
```shell
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model /models/Qwen3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun",
"vllm.unified_attention", "vllm.unified_attention_with_output",
"vllm.mamba_mixer2"]}'
```
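The JSON passed to `--compilation-config` is easy to mis-quote in a shell. As a minimal sketch, the same argument can be built in Python and pasted into the command. The helper name `compilation_config_arg` is hypothetical, and the op list is simply copied from the command above.

```python
import json
import shlex

# Ops copied from the command above.
SPLITTING_OPS = [
    "vllm.unified_attention_with_output_kunlun",
    "vllm.unified_attention",
    "vllm.unified_attention_with_output",
    "vllm.mamba_mixer2",
]


def compilation_config_arg(ops):
    """Return a shell-quoted JSON value for --compilation-config (hypothetical helper)."""
    payload = json.dumps({"splitting_ops": list(ops)})
    # shlex.quote wraps the JSON so the shell passes it through as one argument.
    return shlex.quote(payload)


print("--compilation-config", compilation_config_arg(SPLITTING_OPS))
```

Generating the value this way keeps the JSON valid even when the op list changes.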
## Fallback to the Eager Mode
If `KunlunGraph` fails to run, fall back to eager mode.
Online example:
```shell
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model /models/Qwen3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
--enforce-eager
```
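For offline inference, the equivalent fallback is the `enforce_eager=True` argument mentioned in Getting Started. The following is a minimal sketch, not the project's own code: `build_llm_kwargs` and `load_model` are hypothetical helper names, and the model path is the placeholder used in the offline example.

```python
def build_llm_kwargs(model_path, eager=False):
    """Collect keyword arguments for vllm.LLM (hypothetical helper)."""
    kwargs = {"model": model_path}
    if eager:
        # Disable graph mode and run the model eagerly.
        kwargs["enforce_eager"] = True
    return kwargs


def load_model(model_path, eager=False):
    """Create the engine; vLLM is imported lazily so the helper above has no dependencies."""
    from vllm import LLM

    return LLM(**build_llm_kwargs(model_path, eager=eager))


# Usage (requires vLLM Kunlun and local model weights):
# llm = load_model("models/Qwen3-8B-Instruct", eager=True)
# outputs = llm.generate("Hello, how are you?")
```

Keeping the switch in one place makes it easy to retry a failing model in eager mode and to return to graph mode afterwards.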


@@ -0,0 +1,11 @@
# Feature Guide
This section provides a detailed usage guide for vLLM Kunlun features.
:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
lora
:::


@@ -0,0 +1,27 @@
# LoRA Adapters Guide
## Overview
Like vLLM, vllm_kunlun supports LoRA. Usage and further details can be found in the [official vLLM documentation](https://docs.vllm.ai/en/latest/features/lora.html).
Refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.
Currently, multi-LoRA inference in vllm_kunlun is supported only on the vLLM V0 engine (in both eager and CUDA Graph modes).
## Example
We provide a simple LoRA example here:
```bash
export ENABLE_KUNLUN_LARGE_OPS=0
USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
--enable-lora \
--max-lora-rank 64 \
--lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```
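Once the server is running, a request selects an adapter by passing its registered name (`lora1` or `lora2` above) in the `model` field, as described in the vLLM LoRA documentation. The following is a minimal sketch using only the standard library. The base URL assumes the default `vllm serve` port 8000 on localhost, and `build_completion_request` and `complete` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: default host and port


def build_completion_request(adapter_name, prompt, max_tokens=32):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    body = json.dumps(
        {"model": adapter_name, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(adapter_name, prompt):
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(adapter_name, prompt)) as resp:
        return json.load(resp)["choices"][0]["text"]


# Usage: print(complete("lora1", "Hello, how are you?"))
```

Passing the base model name (`qwen3-8b` in the command above) instead of an adapter name should route the request to the model without LoRA.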
## Custom LoRA Operators
We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
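To illustrate what the batched operators compute, here is a pure-Python reference sketch of the `bgmv_shrink` and `bgmv_expand` semantics as they are commonly defined for multi-LoRA serving: each token picks its own adapter by index, is projected down to the LoRA rank, and is then projected back up and accumulated into the output. This is an assumption-based illustration with simplified signatures, not the Kunlun implementation in `lora_ops.py`.

```python
def bgmv_shrink(x, lora_a, indices, scale=1.0):
    """Project each token down to the LoRA rank.

    x: [batch][hidden], lora_a: [num_loras][rank][hidden] -> [batch][rank]
    """
    out = []
    for row, idx in zip(x, indices):
        a = lora_a[idx]  # adapter selected for this token
        out.append([scale * sum(w * v for w, v in zip(a_row, row)) for a_row in a])
    return out


def bgmv_expand(h, lora_b, indices, y):
    """Project back up and accumulate into y in place.

    h: [batch][rank], lora_b: [num_loras][hidden_out][rank], y: [batch][hidden_out]
    """
    for i, (row, idx) in enumerate(zip(h, indices)):
        for o, b_row in enumerate(lora_b[idx]):
            y[i][o] += sum(w * v for w, v in zip(b_row, row))
    return y


# One token, one adapter, rank 1.
h = bgmv_shrink([[1.0, 2.0]], [[[1.0, 1.0]]], [0])
y = bgmv_expand(h, [[[2.0], [0.5]]], [0], [[1.0, 1.0]])
print(h, y)
```

A hardware kernel performs the same arithmetic for the whole batch at once; the loops here only make the per-token adapter lookup explicit.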


@@ -0,0 +1,45 @@
# Quantization Guide
>Note: This feature is currently experimental. Future versions may change its behavior, including configuration, model coverage, and performance.
Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.
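As a rough illustration of the memory savings, weight storage scales with the weight bit width (the `W` number in W8A8, W4A16, and W8A16). The sketch below is back-of-the-envelope arithmetic for a hypothetical 32-billion-parameter model. It ignores quantization scales, zero points, layers left unquantized, and the KV cache, so real figures will differ.

```python
def weight_memory_gib(num_params, weight_bits):
    """Approximate weight storage in GiB for a given weight bit width."""
    return num_params * weight_bits / 8 / 2**30


NUM_PARAMS = 32e9  # assumption: a 32B-parameter model

for label, bits in [("FP16", 16), ("W8A8 / W8A16", 8), ("W4A16", 4)]:
    print(f"{label:>13}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Halving the weight bit width roughly halves the weight footprint. The activation width (the `A` number) affects compute precision, not stored weight size.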
## Usage
### Compressed-tensors
To run a `compressed-tensors` model with vLLM-kunlun, first add the following configuration to the model's `config.json`:
```json
"quantization_config": {
"quant_method": "compressed-tensors"
}
```
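Editing `config.json` by hand is error-prone, so the change can also be scripted. The following is a minimal sketch that assumes the file is a regular JSON document; `add_quantization_config` is a hypothetical helper name, and the path in the usage comment is a placeholder.

```python
import json
from pathlib import Path


def add_quantization_config(config_path, quant_method="compressed-tensors"):
    """Set quantization_config.quant_method in a model's config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    # Keep any existing quantization settings and only set the method.
    quant = config.setdefault("quantization_config", {})
    quant["quant_method"] = quant_method
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Usage: add_quantization_config("/path/to/model/config.json")
```

Back up the original file first if the model directory is shared with other deployments.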
You can then run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization using the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B \
--quantization compressed-tensors
```
### AWQ
To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
```
### GPTQ
To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:
```Bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
```