Feat: support cuda graph for LoRA (#4115)

Co-authored-by: Beichen Ma <mabeichen12@gmail.com>
2025-04-29 02:30:44 -04:00
parent 2c3ea29476
commit 8c0cfca87d
13 changed files with 366 additions and 55 deletions
--- a/docs/backend/server_arguments.md
+++ b/docs/backend/server_arguments.md
@@ -160,7 +160,7 @@ Please consult the documentation below and [server_args.py](https://github.com/s

 | Arguments | Description | Defaults |
 |----------|-------------|---------|
-| `lora_paths` | List of adapters to apply to your model. Each batch element uses the proper LoRA adapter. `cuda_graph` and `radix_attention` are not supported with this, so they must be disabled manually. See related [issues](https://github.com/sgl-project/sglang/issues/2929). | None |
+| `lora_paths` | List of adapters to apply to your model. Each batch element uses the proper LoRA adapter. `radix_attention` is not supported with LoRA adapters, so it must be disabled manually. See related [issues](https://github.com/sgl-project/sglang/issues/2929). | None |
 | `max_loras_per_batch` | Maximum number of LoRAs allowed in a running batch, including the base model. | `8` |
 | `lora_backend` | Backend used to run GEMM kernels for LoRA modules. Can be `triton` or `flashinfer`. | `triton` |