Add Custom Kernels For LoRA Performance (#1884)

### What this PR does / why we need it?
Add two custom kernels (bgmv_shrink and bgmv_expand) to improve LoRA performance.

### Does this PR introduce _any_ user-facing change?
No user-facing change.

### How was this patch tested?
We added unit tests for the custom AscendC kernels. See vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py and vllm-ascend/tests/e2e/singlecard/ops/test_bgmv_expand.py.

In testing with the Qwen2.5 7B model on vllm-ascend v0.9.2.rc1, TTFT, TPOT, and throughput each improved by about 70%.

- vLLM version: v0.9.2
- vLLM main: 40d86ee412

---------

Signed-off-by: taoxudonghaha <justsheldon@163.com>
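
Editor's note: for readers unfamiliar with the two operations named in the description, the following is a minimal pure-Python sketch of what a batched gather matrix-vector (BGMV) shrink and expand compute. It is not the AscendC kernel added by this PR and not the vLLM API; the names `bgmv_shrink_ref`, `bgmv_expand_ref`, and `matvec`, the argument order, and the list-based layout are all assumptions made for illustration. The assumed semantics are that each token selects one adapter by index, shrink projects the hidden vector down to the LoRA rank (scaled), and expand projects it back up and optionally accumulates into the output.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product: w is [rows][cols], x is [cols]."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in w]


def bgmv_shrink_ref(inputs: List[Vector], lora_a: List[Matrix],
                    lora_indices: List[int],
                    scaling: float = 1.0) -> List[Vector]:
    """Hypothetical reference: per token, scaling * (A[idx] @ x).

    inputs:       [num_tokens][hidden]
    lora_a:       [num_loras][rank][hidden]
    lora_indices: [num_tokens], adapter chosen for each token
    """
    return [[scaling * v for v in matvec(lora_a[idx], x)]
            for x, idx in zip(inputs, lora_indices)]


def bgmv_expand_ref(inputs: List[Vector], lora_b: List[Matrix],
                    outputs: List[Vector], lora_indices: List[int],
                    add_inputs: bool = True) -> List[Vector]:
    """Hypothetical reference: per token, B[idx] @ x written into outputs.

    inputs:  [num_tokens][rank]
    lora_b:  [num_loras][out_dim][rank]
    outputs: [num_tokens][out_dim], updated in place and returned
    """
    for i, (x, idx) in enumerate(zip(inputs, lora_indices)):
        delta = matvec(lora_b[idx], x)
        if add_inputs:
            # Accumulate the LoRA contribution onto the existing output row.
            outputs[i] = [o + d for o, d in zip(outputs[i], delta)]
        else:
            # Overwrite the output row with the LoRA contribution.
            outputs[i] = delta
    return outputs
```

A reference like this is mainly useful as an oracle when checking a custom kernel on small inputs: run both on the same tensors and compare the results within a tolerance.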
2025-07-29 19:27:50 +08:00
parent 2da281ec5a
commit 540336edc9
8 changed files with 946 additions and 3 deletions
--- a/vllm_ascend/lora/punica_wrapper/punica_npu.py
+++ b/vllm_ascend/lora/punica_wrapper/punica_npu.py
@@ -3,9 +3,18 @@
 from typing import Callable, Optional, Tuple, Union

 import torch
-from vllm.lora.ops.torch_ops import (bgmv_expand, bgmv_expand_slice,
-                                     bgmv_shrink, sgmv_expand,
-                                     sgmv_expand_slice, sgmv_shrink)
+
+from vllm_ascend.utils import is_310p
+
+if is_310p():
+    from vllm.lora.ops.torch_ops import (bgmv_expand, bgmv_expand_slice,
+                                         bgmv_shrink, sgmv_expand,
+                                         sgmv_expand_slice, sgmv_shrink)
+else:
+    from vllm_ascend.lora.punica_wrapper.lora_ops import (
+        bgmv_expand, bgmv_expand_slice, bgmv_shrink, sgmv_expand,
+        sgmv_expand_slice, sgmv_shrink)
+
 from vllm.lora.punica_wrapper.punica_base import PunicaWrapperBase