LoRA Adapters Guide
Overview
Like vLLM, vllm_kunlun supports serving LoRA adapters. Usage details can be found in the official vLLM documentation.
Refer to Supported Models to see which models support LoRA in vLLM.
Currently, only vLLM v0 mode (including eager and CUDA Graph modes) supports multi-LoRA inference in vllm_kunlun.
Example
We provide a simple LoRA example here:
export ENABLE_KUNLUN_LARGE_OPS=0
USE_ORI_ROPE=0 VLLM_USE_V1=0 vllm serve qwen3-8b \
--enable-lora \
--max-lora-rank 64 \
--lora-modules lora1=/path/to/lora1 lora2=/path/to/lora2
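Once the server is up, a specific adapter is selected per request by passing its registered name (lora1 or lora2 from the command above) in the model field of the OpenAI-compatible API. A minimal client sketch, assuming the server listens on the default port 8000 (the port and prompt are illustrative, not part of the example above):

```python
import json
from urllib import request

def build_payload(prompt: str, adapter: str = "lora1",
                  max_tokens: int = 64) -> dict:
    # The adapter's registered name goes in the "model" field;
    # vLLM routes the request to that LoRA module.
    return {
        "model": adapter,
        "prompt": prompt,
        "max_tokens": max_tokens,
    }

def query(prompt: str, adapter: str = "lora1",
          url: str = "http://localhost:8000/v1/completions") -> str:
    # POST to the OpenAI-compatible completions endpoint.
    data = json.dumps(build_payload(prompt, adapter)).encode()
    req = request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]
```

Requests that name the base model instead of an adapter are served without LoRA applied.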
Custom LoRA Operators
We have implemented LoRA-related custom operators for Kunlun hardware, such as bgmv_shrink, bgmv_expand, sgmv_shrink, and sgmv_expand. The implementation can be found in vllm_kunlun/lora/ops/kunlun_ops/lora_ops.py.
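The Kunlun kernel signatures are hardware-specific, but the math a BGMV (batched-gather matrix-vector) shrink operator computes can be sketched in NumPy. The shapes and the -1 "no adapter" convention below follow vLLM's punica-style LoRA kernels; this is an illustrative reference, not the Kunlun implementation:

```python
import numpy as np

def bgmv_shrink_ref(x, lora_a, indices, scale=1.0):
    """Reference semantics for bgmv_shrink: each token is projected
    down to the LoRA rank by the A matrix of the adapter assigned
    to it.

    x:        (num_tokens, hidden)      input activations
    lora_a:   (num_loras, rank, hidden) stacked LoRA A matrices
    indices:  (num_tokens,)             adapter id per token (-1 = none)
    returns:  (num_tokens, rank)
    """
    num_tokens = x.shape[0]
    rank = lora_a.shape[1]
    out = np.zeros((num_tokens, rank), dtype=x.dtype)
    for i, idx in enumerate(indices):
        if idx >= 0:
            # Gather this token's A matrix, then project down.
            out[i] = scale * (lora_a[idx] @ x[i])
        # Tokens with idx == -1 skip LoRA entirely.
    return out
```

The expand operators are the mirror image (rank back up to hidden size via the B matrices), and the sgmv variants operate on contiguous token segments per adapter rather than gathering per token.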