Initial commit for vLLM-Kunlun Plugin
docs/source/user_guide/feature_guide/lora.md
# LoRA Adapters Guide

## Overview

Like vLLM, vllm_kunlun also supports LoRA. Usage and further details can be found in the [vLLM official documentation](https://docs.vllm.ai/en/latest/features/lora.html).

You can refer to [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) to find which models support LoRA in vLLM.

Currently, only vLLM v0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.

## Example

We provide a simple LoRA serving example here:

```bash
export ENABLE_KUNLUN_LARGE_OPS=0

USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
    --enable-lora \
    --max-lora-rank 64 \
    --lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
```
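Adapters registered with `--lora-modules` are exposed as additional model names on vLLM's OpenAI-compatible API, so a request selects an adapter simply by setting its `model` field. A minimal sketch of such a request (the host, port, and prompt are placeholders for illustration, not values from this guide):

```python
import json
import urllib.request

def build_lora_request(model_name, prompt, max_tokens=64):
    """Build an OpenAI-compatible /v1/completions payload.

    Selecting a LoRA adapter is just a matter of setting `model`
    to a name registered via --lora-modules (e.g. "lora1").
    """
    return {
        "model": model_name,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

payload = build_lora_request("lora1", "Hello, Kunlun")
# Hypothetical endpoint; adjust to your deployment.
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # uncomment with a running server
```

Requests that set `model` to the base model name (here `qwen3-8b`) are served without any adapter applied.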
## Custom LoRA Operators

We have implemented LoRA-related custom operators for Kunlun hardware, such as `bgmv_shrink`, `bgmv_expand`, `sgmv_shrink`, and `sgmv_expand`. The implementation can be found in `vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py`.
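To clarify what these kernels compute: in vLLM, `bgmv_shrink` is a batched, gathered matrix-vector product that projects each token's hidden state through the LoRA-A matrix selected by that token's adapter index, scaled by the LoRA scaling factor. A NumPy reference sketch of these semantics (shapes and the `scaling` parameter follow vLLM's reference kernels; this illustrates the math only, not the Kunlun implementation):

```python
import numpy as np

def bgmv_shrink_ref(inputs, lora_a, indices, scaling=1.0):
    """Reference semantics of bgmv_shrink.

    inputs:  (num_tokens, hidden_dim)       per-token hidden states
    lora_a:  (num_loras, rank, hidden_dim)  stacked LoRA-A weights
    indices: (num_tokens,)                  adapter index per token (-1 = none)
    returns: (num_tokens, rank)             shrunken activations
    """
    num_tokens = inputs.shape[0]
    rank = lora_a.shape[1]
    out = np.zeros((num_tokens, rank), dtype=inputs.dtype)
    for t in range(num_tokens):
        idx = indices[t]
        if idx >= 0:  # tokens without an adapter contribute zeros
            out[t] = scaling * (lora_a[idx] @ inputs[t])
    return out
```

`bgmv_expand` is the mirror operation through the LoRA-B matrices (rank back up to hidden_dim, accumulated into the base output); the `sgmv_*` variants process contiguous token segments per adapter rather than gathering per token.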