[2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (#10286)

This commit is contained in:
Lifu Huang
2025-09-15 16:04:03 -07:00
committed by GitHub
parent 2689f0bf02
commit 3f41b48c40
10 changed files with 1499 additions and 13 deletions

@@ -35,7 +35,7 @@
"\n",
"* `max_loaded_loras`: If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `max-loras-per-batch`.\n",
"\n",
"* `lora_backend`: The backend of running GEMM kernels for Lora modules. Currently we only support Triton LoRA backend. In the future, faster backend built upon Cutlass or Cuda kernels will be added.\n",
"* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. Currently we support the Triton backend (`triton`) and the Chunked SGMV backend (`csgmv`). Faster backends built upon Cutlass or CUDA kernels will be added in the future.\n",
"\n",
"* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.\n",
"\n",
@@ -79,7 +79,7 @@
"python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
" --max-loras-per-batch 1 --lora-backend triton \\\n",
" --max-loras-per-batch 1 \\\n",
" --log-level warning \\\n",
"\"\"\"\n",
")\n",
@@ -139,7 +139,7 @@
" --enable-lora \\\n",
" --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
" lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
" --max-loras-per-batch 2 --lora-backend triton \\\n",
" --max-loras-per-batch 2 \\\n",
" --log-level warning \\\n",
"\"\"\"\n",
")\n",
@@ -214,7 +214,7 @@
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --cuda-graph-max-bs 2 \\\n",
" --max-loras-per-batch 2 --lora-backend triton \\\n",
" --max-loras-per-batch 2 \\\n",
" --max-lora-rank 256 \\\n",
" --lora-target-modules all \\\n",
" --log-level warning\n",
@@ -413,7 +413,7 @@
" python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --cuda-graph-max-bs 8 \\\n",
" --max-loras-per-batch 3 --lora-backend triton \\\n",
" --max-loras-per-batch 3 \\\n",
" --max-lora-rank 256 \\\n",
" --lora-target-modules all \\\n",
" --lora-paths \\\n",
@@ -501,6 +501,48 @@
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Choosing LoRA Backend\n",
"\n",
"SGLang supports two LoRA backends that you can choose from using the `--lora-backend` argument:\n",
"\n",
"- `triton`: The default Triton-based backend.\n",
"- `csgmv`: A Chunked SGMV backend optimized for high-concurrency scenarios.\n",
"\n",
"The `csgmv` backend was recently introduced to improve performance, especially in high-concurrency scenarios. In our benchmarks, it achieves 20% to 80% latency improvements over the basic `triton` backend.\n",
"It is currently in preview; we expect to make it the default LoRA backend in a future release. Until then, you can opt in by setting the `--lora-backend` server argument manually."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"server_process, port = launch_server_cmd(\n",
" \"\"\"\n",
" python3 -m sglang.launch_server \\\n",
" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
" --enable-lora \\\n",
" --lora-backend csgmv \\\n",
" --max-loras-per-batch 16 \\\n",
" --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n",
" \"\"\"\n",
")"
]
},
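Once a server like the one above is running, each request picks its adapter at inference time. Below is a minimal sketch of building such a request body, assuming SGLang's `/generate` endpoint accepts a `lora_path` field that selects an adapter by the name registered via `--lora-paths`; the helper name `build_lora_request` is our own illustration, not part of SGLang.

```python
import json

# Hypothetical helper: builds the JSON body for SGLang's /generate endpoint.
# The "lora_path" field (assumed here) selects which adapter, by the name
# registered via --lora-paths, handles this request; omit it to use the
# base model without any adapter.
def build_lora_request(prompt, lora_name=None):
    body = {
        "text": prompt,
        "sampling_params": {"max_new_tokens": 32},
    }
    if lora_name is not None:
        body["lora_path"] = lora_name
    return body

req = build_lora_request("List three facts about the Moon.", "lora1")
print(json.dumps(req, indent=2))

# To send it (assuming the server launched above is listening on `port`):
#   import requests
#   requests.post(f"http://localhost:{port}/generate", json=req).json()
```

Requests that omit `lora_name` fall through to the base model, so mixed adapter/base traffic can share one server.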
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"terminate_process(server_process)"
]
},
{
"cell_type": "markdown",
"metadata": {},