[v0.11.0][Perf] Eliminate the zero-like operator through a patch (#3632)
### What this PR does / why we need it?

There is a zero-like operator before the attention operation in each decoding stage. After analysis, this operator can be eliminated. This PR removes the operator to improve decode performance.

---------

Signed-off-by: ZYang6263 <zy626375@gmail.com>
@@ -160,3 +160,15 @@
 # Future Plan:
 # Remove this patch when adapted vllm version contains the above PR.
 #
+# ** File: worker/patch_attention_layer.py **
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+# 1. `vllm.attention.layer.Attention.forward`
+# Why:
+# There is a zero-like operator before the attention operation in each decoding stage.
+# How:
+# Replace this zero-like operator with `torch.empty`.
+# Related PR (if no, explain why):
+# - https://github.com/vllm-project/vllm/pull/26680
+# Future Plan:
+# Remove this patch once the optimization is supported in the adapted vLLM version.
+#
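For illustration, here is a minimal sketch of the idea behind the patch (the function name `allocate_attn_output` and its parameters are assumptions for this example, not the actual vllm-ascend code): the attention kernel fully overwrites its output buffer, so the buffer can be allocated with `torch.empty` instead of a zero-filling operator, saving one fill-kernel launch per decode step.

```python
import torch

def allocate_attn_output(query: torch.Tensor, hidden_size: int) -> torch.Tensor:
    # Before the patch: a zeros_like-style allocation launches an extra fill
    # kernel on every decode step just to zero memory that is immediately
    # overwritten by the attention kernel.
    # After the patch: torch.empty returns uninitialized memory, which is
    # safe here because the attention op writes every element before any read.
    return torch.empty(
        query.shape[0], hidden_size, dtype=query.dtype, device=query.device
    )

# Usage sketch: the attention kernel fills the buffer completely,
# so no zero-initialization is needed.
q = torch.randn(4, 128)
out = allocate_attn_output(q, hidden_size=128)
out.copy_(q)  # stand-in for the attention kernel writing into `out`
```

The usual caveat applies: `torch.empty` returns uninitialized memory, so this substitution is only correct when every element of the buffer is guaranteed to be written before it is read.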