Add Custom Kernels For LoRA Performance (#2325)
### What this PR does / why we need it?
Add two custom operators (sgmv_shrink and sgmv_expand) to address LoRA
performance issues, and enable the LoRA operators to run in ACL graph
mode to further improve model inference performance.
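The shrink/expand pair follows the usual SGMV (segmented gather matrix-vector) formulation of batched LoRA: shrink projects each token's activations down to the LoRA rank through that request's A matrix, and expand projects back up through the B matrix and accumulates into a slice of the base output. A minimal NumPy sketch of the intended semantics (parameter names follow the declarations in csrc/ops.h; the per-request loop, tensor shapes, and slicing convention here are assumptions for illustration, not the Ascend kernel code):

```python
import numpy as np

def sgmv_shrink(x, weight_a, lora_indices, seq_lens, scale):
    """y[t] = scale * x[t] @ A[lora of t's request] for every token t."""
    # x: (num_tokens, input_hidden_dim); weight_a: (num_loras, input_hidden_dim, lora_rank)
    y = np.zeros((x.shape[0], weight_a.shape[2]), dtype=x.dtype)
    token = 0
    for req, seq_len in enumerate(seq_lens):
        a = weight_a[lora_indices[req]]  # A matrix for this request's adapter
        y[token:token + seq_len] = scale * (x[token:token + seq_len] @ a)
        token += seq_len
    return y

def sgmv_expand(y, weight_b, lora_indices, seq_lens, y_out, slice_offset):
    """y_out[t, off:off+out_dim] += y[t] @ B[lora of t's request]."""
    # y: (num_tokens, lora_rank); weight_b: (num_loras, lora_rank, output_hidden_dim)
    out_dim = weight_b.shape[2]
    token = 0
    for req, seq_len in enumerate(seq_lens):
        b = weight_b[lora_indices[req]]  # B matrix for this request's adapter
        y_out[token:token + seq_len, slice_offset:slice_offset + out_dim] += (
            y[token:token + seq_len] @ b)
        token += seq_len
    return y_out
```

The slice_offset/output_full_dim pair in the expand signature suggests the kernel writes into a sub-slice of a fused output (e.g. merged QKV projections), which is why the sketch accumulates into a window of y_out rather than the whole row.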
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2rc1 in ACL graph
mode: TTFT, TPOT, and throughput improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main: 1f83e7d849
Co-authored-by: liuchn <909698896@qq.com>
csrc/ops.h | 30 additions
```diff
@@ -88,4 +88,34 @@ namespace vllm_ascend {
     uint32_t output_hidden_dim,
     uint32_t slice_offset,
     uint32_t output_full_dim);
+
+extern void sgmv_shrink_impl(
+    AscendType type,
+    void *stream,
+    void *x,
+    void *weight,
+    void *loraIndices,
+    void *seqLen,
+    void *y,
+    uint32_t batch_size,
+    uint32_t num_tokens_per_core,
+    uint32_t input_hidden_dim,
+    uint32_t lora_rank,
+    float scale);
+
+extern void sgmv_expand_impl(
+    AscendType type,
+    void *stream,
+    void *x,
+    void *weight,
+    void *loraIndices,
+    void *seqLen,
+    void *y,
+    void *y_out,
+    uint32_t batch_size,
+    uint32_t num_tokens_per_core,
+    uint32_t lora_rank,
+    uint32_t output_hidden_dim,
+    uint32_t slice_offset,
+    uint32_t output_full_dim);
 }
```