Add Custom Kernels For LoRA Performance (#2325)
### What this PR does / why we need it?
Add two custom operators, `sgmv_shrink` and `sgmv_expand`, to address
LoRA performance issues. Also enable the LoRA operators to run in ACL
graph mode, which improves model inference performance.
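The kernel bodies are not part of the text shown here. As a rough illustration only, the sketch below shows what a segmented shrink/expand pair conventionally computes: each segment of consecutive tokens is multiplied by the weights of one LoRA adapter. The `_ref` function names, the `scale` parameter, and the weight layouts are assumptions made for this example, not the interface of the Ascend kernels.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sgmv_shrink_ref(x, weight, lora_indices, seq_len, scale=1.0):
    """Project each token down to the LoRA rank with its segment's A matrix.

    Assumed layout: weight[adapter] is a (rank x hidden) matrix.
    """
    out, token = [], 0
    for lora_id, length in zip(lora_indices, seq_len):
        for _ in range(length):
            out.append([scale * v for v in matvec(weight[lora_id], x[token])])
            token += 1
    return out


def sgmv_expand_ref(x, weight, lora_indices, seq_len, y, slice_offset,
                    slice_size):
    """Project rank-sized rows back up and add them into a column slice of y.

    Assumed layout: weight[adapter] is a (slice_size x rank) matrix.
    """
    y_out = [row[:] for row in y]  # work on a copy and return it
    token = 0
    for lora_id, length in zip(lora_indices, seq_len):
        for _ in range(length):
            delta = matvec(weight[lora_id], x[token])
            for j in range(slice_size):
                y_out[token][slice_offset + j] += delta[j]
            token += 1
    return y_out


# Three tokens: the first two belong to adapter 0, the last to adapter 1.
x = [[1.0, 2.0], [0.0, 1.0], [3.0, 1.0]]
lora_a = [[[1.0, 0.0]], [[0.0, 1.0]]]      # per adapter: rank 1 x hidden 2
lora_b = [[[1.0], [2.0]], [[3.0], [4.0]]]  # per adapter: slice 2 x rank 1
lora_indices, seq_len = [0, 1], [2, 1]

low_rank = sgmv_shrink_ref(x, lora_a, lora_indices, seq_len)
y = sgmv_expand_ref(low_rank, lora_b, lora_indices, seq_len,
                    [[0.0, 0.0, 0.0] for _ in x],
                    slice_offset=1, slice_size=2)
```

Grouping tokens by segment lets one call serve a batch in which different requests use different adapters, which is the case a single dense matmul cannot cover.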
### Does this PR introduce _any_ user-facing change?
No user-facing change.
### How was this patch tested?
Tested with the Qwen2.5 7B model on vllm-ascend v0.9.2.rc1. In ACL
graph mode, TTFT, TPOT, and throughput each improved by about 100%.
Signed-off-by: liuchn <909698896@qq.com>
- vLLM version: v0.10.0
- vLLM main: 1f83e7d849
Co-authored-by: liuchn <909698896@qq.com>
@@ -80,7 +80,30 @@ def get_masked_input_and_mask_meta(input: torch.Tensor,
    return masked_input, mask

def bgmv_expand_meta(x: torch.Tensor,
                     weight: torch.Tensor,
                     indices: torch.Tensor,
                     y: torch.Tensor,
                     slice_offset: int,
                     slice_size: int):

    y_out = torch.empty_like(y)
    return y_out

def sgmv_expand_meta(x: torch.Tensor,
                     weight: torch.Tensor,
                     lora_indices: torch.Tensor,
                     seq_len: torch.Tensor,
                     y: torch.Tensor,
                     slice_offset: int,
                     slice_size: int):

    y_out = torch.empty_like(y)
    return y_out


register_meta_if_necessary("_C", "rotary_embedding", rotary_embedding_meta)
register_meta_if_necessary("_C", "get_masked_input_and_mask",
                           get_masked_input_and_mask_meta)
register_meta_if_necessary("_C", "bgmv_expand", bgmv_expand_meta)
register_meta_if_necessary("_C", "sgmv_expand", sgmv_expand_meta)
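The meta functions in this hunk only allocate an output with the same shape and dtype as `y`. They do no arithmetic, which is all that shape inference during graph capture needs. A minimal sketch of that behaviour follows, calling a copy of `sgmv_expand_meta` directly on tensors placed on PyTorch's `meta` device. The tensor shapes are illustrative assumptions, and the project's `register_meta_if_necessary` helper is deliberately not used here.

```python
import torch


def sgmv_expand_meta(x: torch.Tensor, weight: torch.Tensor,
                     lora_indices: torch.Tensor, seq_len: torch.Tensor,
                     y: torch.Tensor, slice_offset: int, slice_size: int):
    # Copy of the function from the diff: describe the output, compute nothing.
    y_out = torch.empty_like(y)
    return y_out


# Meta tensors carry shape and dtype but no storage, so no kernel runs.
# All shapes below are assumed for illustration only.
x = torch.empty(6, 8, device="meta", dtype=torch.float16)
weight = torch.empty(2, 16, 8, device="meta", dtype=torch.float16)
lora_indices = torch.empty(2, device="meta", dtype=torch.int64)
seq_len = torch.empty(2, device="meta", dtype=torch.int64)
y = torch.empty(6, 16, device="meta", dtype=torch.float16)

out = sgmv_expand_meta(x, weight, lora_indices, seq_len, y, 0, 16)
print(out.shape, out.dtype, out.device)
```

Because the output metadata depends only on `y`, the same one-line body serves both `bgmv_expand_meta` and `sgmv_expand_meta`.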