Feat: support cuda graph for LoRA (#4115)
Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
@@ -77,7 +77,7 @@
 "python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \\\n",
 " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
 " --max-loras-per-batch 1 --lora-backend triton \\\n",
-" --disable-cuda-graph --disable-radix-cache\n",
+" --disable-radix-cache\n",
 "\"\"\"\n",
 ")\n",
 "\n",
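With CUDA graph support landed, the single-adapter launch above drops `--disable-cuda-graph` and keeps only `--disable-radix-cache`. A request can then pin an adapter by name through the `lora_path` field of the `/generate` endpoint, as the docs notebook does; a minimal sketch, assuming the server listens on a placeholder port 30000 and using a made-up prompt:

```python
import requests

# Port and prompt are placeholders for illustration; "lora0" is the
# adapter name registered above via --lora-paths.
url = "http://127.0.0.1:30000/generate"
response = requests.post(
    url,
    json={
        "text": "List three facts about Llama 3.1.",
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": "lora0",
    },
)
print(response.json())
```
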
@@ -136,7 +136,7 @@
 " --lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora \\\n",
 " lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 \\\n",
 " --max-loras-per-batch 2 --lora-backend triton \\\n",
-" --disable-cuda-graph --disable-radix-cache\n",
+" --disable-radix-cache\n",
 "\"\"\"\n",
 ")\n",
 "\n",
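With `--max-loras-per-batch 2`, both adapters can serve the same running batch; `lora_path` also accepts a list with one adapter name per prompt, as in the docs notebook. Another sketch under the same assumptions (port and prompts are placeholders):

```python
import requests

# One adapter name per prompt; both adapters run in a single batch.
url = "http://127.0.0.1:30000/generate"
response = requests.post(
    url,
    json={
        "text": [
            "List three facts about Llama 3.1.",
            "Summarize the plot of Hamlet in one sentence.",
        ],
        "sampling_params": {"max_new_tokens": 32},
        "lora_path": ["lora0", "lora1"],
    },
)
print(response.json())
```
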
@@ -182,7 +182,7 @@
 "source": [
 "## Future Works\n",
 "\n",
-"The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently CUDA graph and radix attention are incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, Cutlass backend, and dynamic loading/unloading, are still under development."
+"The development roadmap for LoRA-related features can be found in this [issue](https://github.com/sgl-project/sglang/issues/2929). Currently radix attention is incompatible with LoRA and must be manually disabled. Other features, including Unified Paging, Cutlass backend, and dynamic loading/unloading, are still under development."
 ]
 }
],
@@ -160,7 +160,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 
 | Arguments | Description | Defaults |
 |----------|-------------|---------|
-| `lora_paths` | List of adapters to apply to your model. Each batch element uses the proper LoRA adapter. `cuda_graph` and `radix_attention` are not supported with this, so they must be disabled manually. See related [issues](https://github.com/sgl-project/sglang/issues/2929). | None |
+| `lora_paths` | List of adapters to apply to your model. Each batch element uses the proper LoRA adapter. `radix_attention` is not supported with this, so it must be disabled manually. See related [issues](https://github.com/sgl-project/sglang/issues/2929). | None |
 | `max_loras_per_batch` | Maximum number of LoRAs allowed in a running batch, including the base model. | `8` |
 | `lora_backend` | Backend used to run GEMM kernels for LoRA modules. Can be `triton` or `flashinfer`. | `triton` |
 
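Putting the three documented arguments together, a launch matching the updated notebook might look like the sketch below. It assumes the `launch_server_cmd`/`wait_for_server` helpers from `sglang.utils` that the docs notebooks use; the adapter names and paths are the ones from this diff.

```python
from sglang.utils import launch_server_cmd, wait_for_server

# Assumed docs helpers; equivalent to running the shell command directly.
server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server "
    "--model-path meta-llama/Meta-Llama-3.1-8B-Instruct "
    "--lora-paths lora0=algoprog/fact-generation-llama-3.1-8b-instruct-lora "
    "lora1=Nutanix/Meta-Llama-3.1-8B-Instruct_lora_4_alpha_16 "
    "--max-loras-per-batch 2 --lora-backend triton "
    "--disable-radix-cache"
)
wait_for_server(f"http://127.0.0.1:{port}")
```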