Initial commit for vLLM-Kunlun Plugin
docs/source/user_guide/configuration/env_vars.md (new file, +17 lines)
# Environment Variables

vllm-kunlun uses the following environment variables to configure the system:

| Environment variable | Recommended value | Description |
| --- | --- | --- |
| `unset XPU_DUMMY_EVENT` | | **Unsets** the `XPU_DUMMY_EVENT` variable, usually to ensure that real XPU events are used for synchronization and performance measurement. |
| `export XPU_VISIBLE_DEVICES` | `0,1,2,3,4,5,6,7` | **Specifies the visible XPU devices.** Here, eight devices (0 to 7) are made available for inference. Required for multi-card or distributed inference. |
| `export XPU_USE_MOE_SORTED_THRES` | `1` | Enables the MoE model **sort optimization**. Setting it to `1` enables this performance optimization. |
| `export XFT_USE_FAST_SWIGLU` | `1` | Enables the **fast SwiGLU ops**. SwiGLU is a common activation function; enabling this accelerates model inference. |
| `export XPU_USE_FAST_SWIGLU` | `1` | Enables the **fast SwiGLU ops**. Similar to `XFT_USE_FAST_SWIGLU`, but for the fast SwiGLU calculation in the fused MoE ops. |
| `export XMLIR_CUDNN_ENABLED` | `1` | Enables XMLIR (an intermediate representation/compiler) to use the **cuDNN-compatible/optimized path**, which may map to the corresponding XPU-optimized libraries in the KunlunCore environment. |
| `export XPU_USE_DEFAULT_CTX` | `1` | Makes the XPU use the default context. Typically used to simplify environment configuration and ensure runtime consistency. |
| `export XMLIR_FORCE_USE_XPU_GRAPH` | `1` | **Forces XPU Graph mode on.** This captures and optimizes the model execution graph, significantly boosting inference performance. |
| `export VLLM_HOST_IP` | `$(hostname -i)` | **Sets the host IP address for the vLLM service.** The shell command dynamically resolves the current host's internal IP, which is used for inter-node communication in a distributed environment. |
| `export XMLIR_ENABLE_MOCK_TORCH_COMPILE` | `false` | **Disables the mock torch-compile path.** Set to `false` to ensure the real compilation and optimization flow is used rather than mock mode. |
| `FUSED_QK_ROPE_OP` | `0` | **Controls whether the fused QK-Norm and RoPE implementation is used.** The default is `0` (use the original/standard RoPE); setting it to `1` may be used for Qwen3. |
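These variables are normally exported in the launch shell. If you launch from a Python script instead, the same settings can be applied with `os.environ` before importing vllm. A minimal sketch based on the table above (`VLLM_HOST_IP` is omitted because it is host-specific):

```python
import os

# Apply the recommended values from the table above. These must be set
# before vllm is imported so the runtime picks them up.
os.environ.pop("XPU_DUMMY_EVENT", None)  # unset, rather than set to "0"
os.environ.update({
    "XPU_VISIBLE_DEVICES": "0,1,2,3,4,5,6,7",
    "XPU_USE_MOE_SORTED_THRES": "1",
    "XFT_USE_FAST_SWIGLU": "1",
    "XPU_USE_FAST_SWIGLU": "1",
    "XMLIR_CUDNN_ENABLED": "1",
    "XPU_USE_DEFAULT_CTX": "1",
    "XMLIR_FORCE_USE_XPU_GRAPH": "1",
    "XMLIR_ENABLE_MOCK_TORCH_COMPILE": "false",
})
```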
docs/source/user_guide/configuration/index.md (new file, +9 lines)
# Configuration Guide

This section provides a detailed configuration guide for vLLM Kunlun.

:::{toctree}
:caption: Configuration Guide
:maxdepth: 1
env_vars
:::
docs/source/user_guide/feature_guide/graph_mode.md (new file, +82 lines)
# Graph Mode Guide

This guide provides instructions for using Kunlun Graph Mode with vLLM Kunlun. Note that graph mode is available on both the V1 and V0 engines. All supported models are highly compatible with Kunlun Graph.

## Getting Started

From vLLM-KunLun-0.10.1.1 with the V1 engine, vLLM Kunlun runs models in graph mode by default, matching vLLM's behavior. If you hit any issues, please feel free to open an issue on GitHub and fall back to eager mode temporarily by setting `enforce_eager=True` when initializing the model.

vLLM Kunlun supports one graph mode:

- **KunlunGraph**: the default graph mode supported by vLLM Kunlun. In vLLM-KunLun-0.10.1.1, the Qwen, GLM, and InternVL series models are well tested.
## Using KunlunGraph

KunlunGraph is enabled by default. Taking the Qwen series models as an example, simply using the V1 engine (the default) is enough.

Offline example:

```python
from vllm import LLM

model = LLM(model="models/Qwen3-8B-Instruct")
outputs = model.generate("Hello, how are you?")
```

Online example:

```shell
vllm serve Qwen3-8B-Instruct
```
## Using KunlunGraph with Splitting Ops

Enabling Kunlun Graph on the Kunlun platform requires the use of splitting ops.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --compilation-config '{"splitting_ops": ["vllm.unified_attention_with_output_kunlun", "vllm.unified_attention", "vllm.unified_attention_with_output", "vllm.mamba_mixer2"]}'
```
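The `--compilation-config` value is plain JSON, so one way to avoid shell-quoting mistakes is to generate it with `json.dumps` (a sketch; the op names are copied from the command above):

```python
import json

# Splitting ops from the example above, serialized for --compilation-config.
compilation_config = {
    "splitting_ops": [
        "vllm.unified_attention_with_output_kunlun",
        "vllm.unified_attention",
        "vllm.unified_attention_with_output",
        "vllm.mamba_mixer2",
    ]
}
config_arg = json.dumps(compilation_config)
print(f"--compilation-config '{config_arg}'")
```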
## Fallback to Eager Mode

If `KunlunGraph` fails to run, you can fall back to eager mode.

Online example:

```shell
python -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model /models/Qwen3-8B-Instruct \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --max-model-len 32768 \
    --tensor-parallel-size 1 \
    --dtype float16 \
    --no-enable-prefix-caching \
    --no-enable-chunked-prefill \
    --distributed-executor-backend mp \
    --served-model-name Qwen3-8B-Instruct \
    --enforce-eager
```
docs/source/user_guide/feature_guide/index.md (new file, +11 lines)
# Feature Guide

This section provides a detailed usage guide for vLLM Kunlun features.

:::{toctree}
:caption: Feature Guide
:maxdepth: 1
graph_mode
quantization
lora
:::
docs/source/user_guide/feature_guide/lora.md (new file, +27 lines)
# LoRA Adapters Guide

## Overview

Like vLLM, vllm_kunlun supports LoRA as well. Usage and further details can be found in the [vLLM official documentation](https://docs.vllm.ai/en/latest/features/lora.html).

You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.

Currently, only vLLM v0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.

## Example

We provide a simple LoRA example here:

```bash
export ENABLE_KUNLUN_LARGE_OPS=0

USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```

## Custom LoRA Operators

We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
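To illustrate what the shrink/expand pair computes, here is a NumPy sketch of the batched per-token LoRA matmul. It is purely illustrative and not the Kunlun kernel implementation; the shapes and argument names are assumptions:

```python
import numpy as np

def bgmv_shrink(x, lora_a, indices, scale=1.0):
    # Project each token into its LoRA's low-rank space.
    # x: [tokens, hidden], lora_a: [num_loras, rank, hidden],
    # indices: per-token LoRA id -> returns [tokens, rank].
    return scale * np.einsum("th,trh->tr", x, lora_a[indices])

def bgmv_expand(y, lora_b, indices, out):
    # Project back to hidden size and accumulate into the base output.
    # y: [tokens, rank], lora_b: [num_loras, hidden, rank].
    out += np.einsum("tr,thr->th", y, lora_b[indices])
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16))           # 4 tokens, hidden=16
lora_a = rng.standard_normal((2, 8, 16))   # 2 adapters, rank=8
lora_b = rng.standard_normal((2, 16, 8))
indices = np.array([0, 1, 0, 1])           # which adapter each token uses
out = np.zeros((4, 16))
bgmv_expand(bgmv_shrink(x, lora_a, indices), lora_b, indices, out)
```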
docs/source/user_guide/feature_guide/quantization.md (new file, +45 lines)
# Quantization Guide

> Note: This feature is currently experimental. In future versions, there may be behavioral changes around configuration, coverage, and performance.

Like vLLM, we now support quantization methods such as compressed-tensors, AWQ, and GPTQ, enabling various precision configurations including W8A8, W4A16, and W8A16. These can help reduce memory consumption and accelerate inference while preserving model accuracy.

## Usage

### Compressed-tensors

To run a `compressed-tensors` model with vLLM-kunlun, first add the configuration below to the model's `config.json`:

```json
"quantization_config": {
    "quant_method": "compressed-tensors"
}
```

Then run `Qwen/Qwen3-30B-A3B` with dynamic W8A8 quantization using the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-30B-A3B \
    --quantization compressed-tensors
```
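Since hand-editing `config.json` is error-prone, the snippet above can also be merged in programmatically. A sketch, using a temporary directory in place of a real model directory (an assumption for illustration):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a downloaded model directory (hypothetical contents).
model_dir = Path(tempfile.mkdtemp())
config_path = model_dir / "config.json"
config_path.write_text(json.dumps({"model_type": "qwen3_moe"}))

# Merge the quantization_config block from above into the existing config.
config = json.loads(config_path.read_text())
config["quantization_config"] = {"quant_method": "compressed-tensors"}
config_path.write_text(json.dumps(config, indent=2))
```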
### AWQ

To run an `AWQ` model with vLLM-kunlun, you can use `Qwen/Qwen3-32B-AWQ` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-32B-AWQ \
    --quantization awq
```

### GPTQ

To run a `GPTQ` model with vLLM-kunlun, you can use `Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4` with the following command:

```bash
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --quantization gptq
```
docs/source/user_guide/release_notes.md (new file, +3 lines)
# Release Notes

Coming soon...
docs/source/user_guide/support_matrix/index.md (new file, +10 lines)
# Features and Models

This section provides a detailed matrix of the features and models supported by vLLM-Kunlun.

:::{toctree}
:caption: Support Matrix
:maxdepth: 1
supported_models
supported_features
:::
docs/source/user_guide/support_matrix/supported_features.md (new file, +14 lines)
# Supported Features

The feature support principle of vLLM-KunLun is: **aligned with vLLM**. We are also actively collaborating with the community to accelerate support.

You can check the [support status of the vLLM V1 Engine][v1_user_guide]. Below is the feature support status of vLLM-KunLun:

## Features Supported

| Feature | Status | Note |
| - | - | - |
| Tensor Parallel | 🟢 Functional | |
| Experts Parallel | 🟢 Functional | |
| Graph Mode | 🟢 Functional | |
| Quantization | 🟢 Functional | |
| LoRA | ⚠️ Needs testing | LLM models only |
docs/source/user_guide/support_matrix/supported_models.md (new file, +33 lines)
# Supported Models

## Generative Models

| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Qwen2 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen2.5 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3 | ✅ | | ✅ | ✅ | | ✅ | ✅ |
| Qwen3-Moe | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-Coder | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| QwQ-32B | ✅ | | | ✅ | | ✅ | ✅ |
| Llama2 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3 | ✅ | | | ✅ | | ✅ | ✅ |
| Llama3.1 | ✅ | | | ✅ | | ✅ | ✅ |
| GLM-4.5 | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| GLM-4.5-Air | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Qwen3-next | 🔜 Coming soon | | | | | | |
| gpt-oss | 🔜 Coming soon | | | | | | |
| DeepSeek-V3 | 🔜 Coming soon | | | | | | |
| DeepSeek-V3.2 | 🔜 Coming soon | | | | | | |

## Multimodal Language Models

| Model | Support | W8A8 | LoRA | Tensor Parallel | Expert Parallel | Data Parallel | Piecewise Kunlun Graph |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Qianfan-VL | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-VL | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL2.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3 | ✅ | | | ✅ | | ✅ | ✅ |
| InternVL3.5 | ✅ | | | ✅ | | ✅ | ✅ |
| InternS1 | ✅ | | | ✅ | | ✅ | ✅ |
| Qwen2.5-Omni | 🔜 Coming soon | | | | | | |
| Qwen3-VL | 🔜 Coming soon | | | | | | |