Add Custom Kernels For LoRA Performance (#2325)
### What this PR does / why we need it?
Add two custom operators (sgmv_shrink and sgmv_expand) to address LoRA
performance issues, and enable the LoRA operators to run in ACL graph
mode to further improve model inference performance.
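The shrink/expand pair follows the usual SGMV (segmented gather matrix-vector) formulation of batched LoRA: shrink projects each token's activations down to the LoRA rank through that request's A matrix, and expand projects back up through the B matrix and accumulates into a slice of the base output. A minimal NumPy sketch of the intended semantics (parameter names follow the declarations in csrc/ops.h; the per-request loop, tensor shapes, and slicing convention here are assumptions for illustration, not the Ascend kernel code):

```python
import numpy as np

def sgmv_shrink(x, weight_a, lora_indices, seq_lens, scale):
    """y[t] = scale * x[t] @ A[lora of t's request] for every token t."""
    # x: (num_tokens, input_hidden_dim); weight_a: (num_loras, input_hidden_dim, lora_rank)
    y = np.zeros((x.shape[0], weight_a.shape[2]), dtype=x.dtype)
    token = 0
    for req, seq_len in enumerate(seq_lens):
        a = weight_a[lora_indices[req]]  # A matrix for this request's adapter
        y[token:token + seq_len] = scale * (x[token:token + seq_len] @ a)
        token += seq_len
    return y

def sgmv_expand(y, weight_b, lora_indices, seq_lens, y_out, slice_offset):
    """y_out[t, off:off+out_dim] += y[t] @ B[lora of t's request]."""
    # y: (num_tokens, lora_rank); weight_b: (num_loras, lora_rank, output_hidden_dim)
    out_dim = weight_b.shape[2]
    token = 0
    for req, seq_len in enumerate(seq_lens):
        b = weight_b[lora_indices[req]]  # B matrix for this request's adapter
        y_out[token:token + seq_len, slice_offset:slice_offset + out_dim] += (
            y[token:token + seq_len] @ b)
        token += seq_len
    return y_out
```

The slice_offset/output_full_dim pair in the expand signature suggests the kernel writes into a sub-slice of a fused output (e.g. merged QKV projections), which is why the sketch accumulates into a window of y_out rather than the whole row.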
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2rc1 in ACL graph
mode: TTFT, TPOT, and throughput improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main: 1f83e7d849
Co-authored-by: liuchn <909698896@qq.com>
csrc/ops.h | 30 additions
```diff
@@ -88,4 +88,34 @@ namespace vllm_ascend {
     uint32_t output_hidden_dim,
     uint32_t slice_offset,
     uint32_t output_full_dim);
+
+extern void sgmv_shrink_impl(
+    AscendType type,
+    void *stream,
+    void *x,
+    void *weight,
+    void *loraIndices,
+    void *seqLen,
+    void *y,
+    uint32_t batch_size,
+    uint32_t num_tokens_per_core,
+    uint32_t input_hidden_dim,
+    uint32_t lora_rank,
+    float scale);
+
+extern void sgmv_expand_impl(
+    AscendType type,
+    void *stream,
+    void *x,
+    void *weight,
+    void *loraIndices,
+    void *seqLen,
+    void *y,
+    void *y_out,
+    uint32_t batch_size,
+    uint32_t num_tokens_per_core,
+    uint32_t lora_rank,
+    uint32_t output_hidden_dim,
+    uint32_t slice_offset,
+    uint32_t output_full_dim);
 }
```