[Feature] Define backends and add Triton backend for Lora (#3161)

Baizhou Zhang
2025-02-03 22:09:13 -08:00
committed by GitHub
Co-authored-by: Ying Sheng <sqy1415@gmail.com>
parent 7b5a374114
commit 70817a7eae
18 changed files with 1129 additions and 135 deletions


@@ -124,6 +124,7 @@ Please consult the documentation below to learn more about the parameters you ma
* `lora_paths`: You may provide a list of adapters to your model. Each batch element will get a model response with the corresponding LoRA adapter applied. Currently `cuda_graph` and `radix_attention` are not supported with this option, so you need to disable them manually. We are still working through these [issues](https://github.com/sgl-project/sglang/issues/2929).
* `max_loras_per_batch`: Maximum number of LoRAs in a running batch, including the base model.
* `lora_backend`: The backend for running GEMM kernels of LoRA modules; one of `triton` or `flashinfer`. Defaults to `triton`.
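Putting the parameters above together, a server launch might look like the following sketch. The model and adapter paths are placeholders, and the exact `--lora-paths` syntax may differ across versions:

```shell
# Launch an SGLang server with two hypothetical LoRA adapters.
# --disable-cuda-graph and --disable-radix-cache work around the
# cuda_graph / radix_attention limitations noted above.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-2-7b-hf \
  --lora-paths /path/to/adapter0 /path/to/adapter1 \
  --max-loras-per-batch 4 \
  --lora-backend triton \
  --disable-cuda-graph --disable-radix-cache
```

A request can then reference one of the loaded adapters per batch element, and its GEMM kernels will run on the selected backend.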
## Kernel backend