Initial commit for vLLM-Kunlun Plugin
82
docs/source/user_guide/feature_guide/graph_mode.md
Normal file
@@ -0,0 +1,82 @@
# Graph Mode Guide

This guide provides instructions for using Kunlun Graph Mode with vLLM Kunlun. Note that graph mode is available on both the V1 and V0 engines. All supported models are highly compatible with Kunlun Graph.

## Getting Started

Starting from vLLM-KunLun-0.10.1.1 with the V1 Engine, vLLM Kunlun runs models in graph mode by default to keep the same behavior as vLLM. If you hit any issues, please feel free to open an issue on GitHub and temporarily fall back to eager mode by setting `enforce_eager=True` when initializing the model.

One graph mode is supported by vLLM Kunlun:

- **KunlunGraph**: This is the default graph mode supported by vLLM Kunlun. In vLLM-KunLun-0.10.1.1, the Qwen, GLM, and InternVL series models are well tested.
## Using KunlunGraph

KunlunGraph is enabled by default. Taking the Qwen series models as an example, simply using the V1 Engine (the default) is enough.

Offline example:

```python
from vllm import LLM

model = LLM(model="models/Qwen3-8B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen3-8B-Instruct
```
## Using KunlunGraph with Splitting Ops

Enabling Kunlun Graph on the Kunlun platform requires the use of splitting ops.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model /models/Qwen3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
--compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}'
```
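The value passed to `--compilation-config` must be valid JSON. A quick way to sanity-check the string before launching the server (plain Python, no vLLM required; the op list is copied from the command above):

```python
import json

# The exact string passed to --compilation-config above.
config = ('{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", '
          '"vllm.unified_attention", "vllm.unified_attention_with_output", '
          '"vllm.mamba_mixer2"]}')

parsed = json.loads(config)  # raises json.JSONDecodeError if malformed
print(len(parsed["splitting_ops"]))  # → 4
```

A shell quoting mistake (e.g. unescaped inner quotes) typically shows up here as a `JSONDecodeError` rather than a confusing server-side failure.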
## Fallback to the Eager Mode

If `KunlunGraph` fails to run, you can fall back to eager mode.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model /models/Qwen3-8B-Instruct \
--gpu-memory-utilization 0.9 \
--trust-remote-code \
--max-model-len 32768 \
--tensor-parallel-size 1 \
--dtype float16 \
--no-enable-prefix-caching \
--no-enable-chunked-prefill \
--distributed-executor-backend mp \
--served-model-name Qwen3-8B-Instruct \
--enforce-eager
```
11
docs/source/user_guide/feature_guide/index.md
Normal file
@@ -0,0 +1,11 @@
# Feature Guide

This section provides detailed usage guides for vLLM Kunlun features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1

graph_mode
quantization
lora
:::
27
docs/source/user_guide/feature_guide/lora.md
Normal file
@@ -0,0 +1,27 @@
# LoRA Adapters Guide

## Overview

Like vLLM, vllm_kunlun supports LoRA as well. Usage and further details can be found in the [vLLM official documentation](https://docs.vllm.ai/en/latest/features/lora.html).

You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.

Currently, only vLLM v0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.

## Example

We provide a simple LoRA example here:

```bash
export ENABLE_KUNLUN_LARGE_OPS=0

USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
--enable-lora \
--max-lora-rank 64 \
--lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```
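Each `--lora-modules` entry has the form `name=path`. A minimal sketch of how such entries map to a name-to-path table (an illustrative helper, not vLLM's actual parser):

```python
def parse_lora_modules(entries):
    """Split each "name=path" entry into a name -> path mapping."""
    modules = {}
    for entry in entries:
        name, _, path = entry.partition("=")
        modules[name] = path
    return modules

print(parse_lora_modules(["lora1=/path/to/lora1", "lora2=/path/to/lora2"]))
# → {'lora1': '/path/to/lora1', 'lora2': '/path/to/lora2'}
```

At request time, a specific adapter is selected by passing its module name (e.g. `lora1`) as the `model` field of the OpenAI-compatible request.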
## Custom LoRA Operators

We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
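Conceptually, `bgmv_shrink` is a batched gather-matrix-vector product: each token's hidden vector is projected down to the LoRA rank by the A matrix of the adapter assigned to that token. A pure-Python sketch of this semantic (illustrative only; the real operators are optimized Kunlun kernels, and details such as the scaling factor are omitted):

```python
def bgmv_shrink_reference(inputs, lora_a, indices):
    """inputs:  one hidden vector per token
    lora_a:  one A matrix per adapter, each shaped [rank][hidden]
    indices: adapter index chosen for each token"""
    outputs = []
    for x, idx in zip(inputs, indices):
        a = lora_a[idx]  # gather the A matrix for this token's adapter
        outputs.append([sum(w * v for w, v in zip(row, x)) for row in a])
    return outputs

# Two tokens, two rank-1 adapters, hidden size 2.
print(bgmv_shrink_reference([[1.0, 2.0], [3.0, 4.0]],
                            [[[1.0, 0.0]], [[0.0, 1.0]]],
                            [0, 1]))
# → [[1.0], [4.0]]
```

`bgmv_expand` is the mirror step, projecting rank-sized vectors back up to the hidden size through each adapter's B matrix.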
45
docs/source/user_guide/feature_guide/quantization.md
Normal file
@@ -0,0 +1,45 @@
# Quantization Guide

> Note: This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.

Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.

## Usages

### Compressed-tensors

To run a `compressed-tensors` model with vLLM-kunlun, you should first add the following configuration to the model's `config.json`:

```json
"quantization_config": {
    "quant_method": "compressed-tensors"
}
```
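Editing `config.json` by hand works, but the change can also be scripted. A minimal sketch (the `config_path` argument is illustrative; point it at the model's actual `config.json`):

```python
import json

def add_quant_config(config_path):
    """Insert the compressed-tensors quantization block into config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["quantization_config"] = {"quant_method": "compressed-tensors"}
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```

All other keys in the file are left untouched; only the `quantization_config` entry is added or overwritten.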
Then you can run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization using the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-30B-A3B \
--quantization compressed-tensors
```
### AWQ

To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-32B-AWQ \
--quantization awq
```
### GPTQ

To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
--quantization gptq
```