[main] Optimize rope in Qwen Models (#2571)
### What this PR does / why we need it?
Optimize rope by caching sin and cos at the first layer in Qwen Models.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI passed with new added/existing test.
- vLLM version: v0.10.1.1
- vLLM main:
562663a044
---------
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
This commit is contained in:
@@ -119,6 +119,9 @@ def set_ascend_forward_context(
|
||||
|
||||
forward_context.flashcomm_v1_enabled = flashcomm_v1_enabled
|
||||
|
||||
# set this for rope forward_oot using
|
||||
forward_context.is_first_layer = True
|
||||
|
||||
if num_tokens is None and attn_metadata is not None:
|
||||
num_tokens = attn_metadata.num_actual_tokens
|
||||
|
||||
|
||||
Reference in New Issue
Block a user