[2/2] Introduce Chunked-SGMV kernels and corresponding LoRA backend for improved performance (#10286)
@@ -35,7 +35,7 @@
 "\n",
 "* `max_loaded_loras`: If specified, it limits the maximum number of LoRA adapters loaded in CPU memory at a time. The value must be greater than or equal to `max-loras-per-batch`.\n",
 "\n",
-"* `lora_backend`: The backend of running GEMM kernels for Lora modules. Currently we only support Triton LoRA backend. In the future, faster backend built upon Cutlass or Cuda kernels will be added.\n",
+"* `lora_backend`: The backend used to run GEMM kernels for LoRA modules. Currently we support the Triton LoRA backend (`triton`) and the Chunked SGMV backend (`csgmv`). In the future, faster backends built upon Cutlass or CUDA kernels will be added.\n",
 "\n",
 "* `max_lora_rank`: The maximum LoRA rank that should be supported. If not specified, it will be automatically inferred from the adapters provided in `--lora-paths`. This argument is needed when you expect to dynamically load adapters of larger LoRA rank after server startup.\n",
 "\n",
@@ -79,7 +79,7 @@
 "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
 " --enable-lora \\\n",
 " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
-" --max-loras-per-batch 1 --lora-backend triton \\\n",
+" --max-loras-per-batch 1 \\\n",
 " --log-level warning \\\n",
 "\"\"\"\n",
 ")\n",
@@ -139,7 +139,7 @@
 " --enable-lora \\\n",
 " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
 " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
-" --max-loras-per-batch 2 --lora-backend triton \\\n",
+" --max-loras-per-batch 2 \\\n",
 " --log-level warning \\\n",
 "\"\"\"\n",
 ")\n",
@@ -214,7 +214,7 @@
 " python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
 " --enable-lora \\\n",
 " --cuda-graph-max-bs 2 \\\n",
-" --max-loras-per-batch 2 --lora-backend triton \\\n",
+" --max-loras-per-batch 2 \\\n",
 " --max-lora-rank 256\n",
 " --lora-target-modules all\n",
 " --log-level warning\n",
@@ -413,7 +413,7 @@
 " python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
 " --enable-lora \\\n",
 " --cuda-graph-max-bs 8 \\\n",
-" --max-loras-per-batch 3 --lora-backend triton \\\n",
+" --max-loras-per-batch 3 \\\n",
 " --max-lora-rank 256 \\\n",
 " --lora-target-modules all \\\n",
 " --lora-paths \\\n",
@@ -501,6 +501,48 @@
 "terminate_process(server_process)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"## Choosing a LoRA Backend\n",
+"\n",
+"SGLang supports two LoRA backends that you can choose from via the `--lora-backend` argument:\n",
+"\n",
+"- `triton`: The default Triton-based backend.\n",
+"- `csgmv`: A Chunked SGMV backend optimized for high-concurrency scenarios.\n",
+"\n",
+"The `csgmv` backend was recently introduced to improve performance, especially in high-concurrency scenarios. Our benchmarks show that it achieves 20% to 80% latency improvements over the basic `triton` backend.\n",
+"It is currently in a preview phase; we expect to make it the default LoRA backend in a future release. Until then, you can opt in by setting the `--lora-backend` server argument manually."
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"server_process, port = launch_server_cmd(\n",
+" \"\"\"\n",
+" python3 -m sglang.launch_server \\\n",
+" --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
+" --enable-lora \\\n",
+" --lora-backend csgmv \\\n",
+" --max-loras-per-batch 16 \\\n",
+" --lora-paths lora1=path/to/lora1 lora2=path/to/lora2\n",
+" \"\"\"\n",
+")"
+]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"terminate_process(server_process)"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
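Once a server like the one in the new cell is running, individual requests pick an adapter by name. The snippet below is a sketch of that, assuming SGLang's `/generate` HTTP endpoint with a `lora_path` field selecting one of the adapters registered via `--lora-paths`; the prompt text and parameter values are illustrative only.

```python
import json

# Hypothetical request payload for SGLang's /generate endpoint.
# "lora_path" names which loaded adapter should serve this request;
# it must match a key passed via --lora-paths (e.g. "lora1").
payload = {
    "text": "List three facts about the Llama 3.1 model family.",
    "sampling_params": {"max_new_tokens": 64, "temperature": 0},
    "lora_path": "lora1",
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the server above to be running):
# import requests
# resp = requests.post(f"http://localhost:{port}/generate", json=payload)
# print(resp.json()["text"])
```

Requests that omit `lora_path` are served by the base model, so mixed batches of base-model and adapter traffic can share one server.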