[main] Optimize rope in Qwen Models (#2571)

### What this PR does / why we need it?
Optimize RoPE in Qwen models by computing sin and cos once at the first layer and caching them for the remaining layers.
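
The idea can be sketched roughly as follows. This is a minimal pure-Python illustration, not the code in this PR: `ForwardContext`, `build_cos_sin`, `rope_layer`, `run_layers`, and the `table_builds` counter are hypothetical names, and only the `is_first_layer` flag mirrors the attribute set in the diff. It assumes a neox-style half rotation and that every layer sees the same positions within one forward pass.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForwardContext:
    """Hypothetical stand-in for the per-step forward context."""
    is_first_layer: bool = True
    cos: Optional[List[List[float]]] = None
    sin: Optional[List[List[float]]] = None
    table_builds: int = 0  # counts how often cos/sin were recomputed


def build_cos_sin(positions, head_dim, base=10000.0):
    """Compute per-position cos/sin tables for half of the head dimension."""
    half = head_dim // 2
    inv_freq = [base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


def rope_layer(ctx, positions, x):
    """Rotate x (tokens x head_dim); build cos/sin only at the first layer."""
    head_dim = len(x[0])
    if ctx.is_first_layer:
        ctx.cos, ctx.sin = build_cos_sin(positions, head_dim)
        ctx.table_builds += 1
        ctx.is_first_layer = False  # later layers reuse the cached tables
    half = head_dim // 2
    out = []
    for row, cos, sin in zip(x, ctx.cos, ctx.sin):
        x1, x2 = row[:half], row[half:]
        out.append(
            [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
            + [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        )
    return out


def run_layers(num_layers, positions, x):
    """Apply the rotation once per layer under a single forward context."""
    ctx = ForwardContext()  # a fresh context resets is_first_layer to True
    outputs = [rope_layer(ctx, positions, x) for _ in range(num_layers)]
    return ctx, outputs
```

Because the flag is reset whenever a new forward context is created, the tables are rebuilt once per forward pass rather than once per layer.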

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.


- vLLM version: v0.10.1.1
- vLLM main: 562663a044

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
This commit is contained in:
rjg-lyh
2025-09-09 14:28:14 +08:00
committed by GitHub
parent 5bcb4c1528
commit 7a205dbaa8
4 changed files with 136 additions and 47 deletions


@@ -119,6 +119,9 @@ def set_ascend_forward_context(
        forward_context.flashcomm_v1_enabled = flashcomm_v1_enabled
        # set this for rope forward_oot using
        forward_context.is_first_layer = True
        if num_tokens is None and attn_metadata is not None:
            num_tokens = attn_metadata.num_actual_tokens