[Platform][Worker][ModelRunner] Add LoRA & Multi-LoRA support (#521)

### What this PR does / why we need it?

Following the RFC [[RFC]: Join the MultiLora and MultiLora Dynammic Serving feature develop #396](https://github.com/vllm-project/vllm-ascend/issues/396) and the [vLLM Ascend Roadmap Q2 2025 #448](https://github.com/vllm-project/vllm-ascend/issues/448), this PR adds the relevant code to support (1) Multi-LoRA and (2) Multi-LoRA Dynamic Serving.

LoRA reference: [LoRA reference](https://docs.vllm.ai/en/latest/features/lora.html)

### Does this PR introduce _any_ user-facing change?

The following OpenAI-compatible HTTP APIs will be supported:

    /v1/load_lora_adapter
    /v1/unload_lora_adapter

### How was this patch tested?

    git clone https://github.com/vllm-project/vllm.git
    cd vllm/examples/offline_inference/ && python3 multilora_inference.py

---------

Signed-off-by: paulyu <paulyu0307@gmail.com>
Co-authored-by: paulyu <paulyu0307@gmail.com>
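
For illustration, the two endpoints named above could be exercised from a small client. This is a minimal sketch, not part of this PR: the server address, adapter name, and adapter path are hypothetical, and the JSON fields `lora_name` and `lora_path` are assumed from the upstream vLLM dynamic LoRA serving documentation, so check them against the vLLM version in use. Upstream vLLM also expects runtime LoRA updating to be enabled on the server (for example via `VLLM_ALLOW_RUNTIME_LORA_UPDATING`), which is likewise an assumption here.

```python
import json
import urllib.request

# Hypothetical server address; adjust to wherever the vLLM server listens.
BASE_URL = "http://localhost:8000"


def build_lora_request(base_url: str, action: str, payload: dict) -> urllib.request.Request:
    """Build a POST request for /v1/load_lora_adapter or /v1/unload_lora_adapter."""
    if action not in ("load", "unload"):
        raise ValueError(f"unsupported action: {action!r}")
    url = f"{base_url.rstrip('/')}/v1/{action}_lora_adapter"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def load_adapter_request(base_url: str, lora_name: str, lora_path: str) -> urllib.request.Request:
    # Field names are assumed from the upstream vLLM docs, not from this PR.
    return build_lora_request(
        base_url, "load", {"lora_name": lora_name, "lora_path": lora_path}
    )


def unload_adapter_request(base_url: str, lora_name: str) -> urllib.request.Request:
    # Field name is assumed from the upstream vLLM docs, not from this PR.
    return build_lora_request(base_url, "unload", {"lora_name": lora_name})


def send(request: urllib.request.Request) -> tuple[int, str]:
    """Send the request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request) as resp:
        return resp.status, resp.read().decode("utf-8")


# Usage against a running server (adapter name and path are placeholders):
#   send(load_adapter_request(BASE_URL, "my_adapter", "/path/to/my_adapter"))
#   send(unload_adapter_request(BASE_URL, "my_adapter"))
```

Building the request separately from sending it keeps the payload easy to inspect without a live server.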
2025-04-17 16:48:46 +08:00
parent 9935d45728
commit 697908f5cd
4 changed files with 484 additions and 14 deletions
--- a/vllm_ascend/platform.py
+++ b/vllm_ascend/platform.py
@@ -141,6 +141,10 @@ class NPUPlatform(Platform):
            return "vllm_ascend.attention.attention.AscendMLAAttentionBackend"
        return "vllm_ascend.attention.attention.AscendAttentionBackend"

+    @classmethod
+    def get_punica_wrapper(cls) -> str:
+        return "vllm_ascend.lora.punica_wrapper.punica_npu.PunicaWrapperNPU"
+
    @classmethod
    def get_current_memory_usage(cls,
                                 device: Optional[torch.types.Device] = None