[main] Optimize rope in Qwen Models (#2571)

### What this PR does / why we need it?
Optimize rope in Qwen models by caching sin and cos at the first layer.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI passed with newly added and existing tests.

- vLLM version: v0.10.1.1
- vLLM main: 562663a044

---------

Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: ZYang6263 <zy626375@gmail.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Co-authored-by: Mengqing Cao <cmq0113@163.com>
Co-authored-by: ZYang6263 <51255902183@stu.ecnu.edu.cn>
Co-authored-by: ZYang6263 <zy626375@gmail.com>
2025-09-09 14:28:14 +08:00
parent 5bcb4c1528
commit 7a205dbaa8
4 changed files with 136 additions and 47 deletions
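
The caching idea described in the commit message can be illustrated with a minimal, pure-Python sketch. It is not the code from this PR: `ForwardContext`, `build_sin_cos`, `rope_forward`, and the `table_builds` counter are hypothetical names chosen for illustration, the rotation assumes the neox-style half-split layout, and the real implementation operates on NPU tensors inside the rope `forward_oot`. Only the `is_first_layer` flag is taken from the diff.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ForwardContext:
    """Hypothetical stand-in for the per-forward-pass context object."""
    is_first_layer: bool = True          # reset to True at the start of each forward pass
    cos: Optional[List[List[float]]] = None
    sin: Optional[List[List[float]]] = None
    table_builds: int = 0                # illustration only: counts sin/cos recomputations


def build_sin_cos(positions, rotary_dim, base=10000.0):
    """Compute per-position cos/sin tables of width rotary_dim // 2."""
    half = rotary_dim // 2
    inv_freq = [base ** (-2.0 * i / rotary_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


def rope_forward(ctx, positions, x, rotary_dim):
    """Apply a neox-style rotation to x (one row of rotary_dim values per token)."""
    if ctx.is_first_layer:
        # Positions are the same for every layer in one forward pass,
        # so the tables are built once and stored on the context.
        ctx.cos, ctx.sin = build_sin_cos(positions, rotary_dim)
        ctx.table_builds += 1
        ctx.is_first_layer = False
    half = rotary_dim // 2
    out = []
    for row, cos, sin in zip(x, ctx.cos, ctx.sin):
        x1, x2 = row[:half], row[half:]
        rot1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
        rot2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
        out.append(rot1 + rot2)
    return out


# Usage: four "layers" share one context, so the tables are built only once.
ctx = ForwardContext()
positions = [0, 1, 2]
x = [[1.0, 0.0, 0.0, 1.0] for _ in positions]
for _layer in range(4):
    x = rope_forward(ctx, positions, x, rotary_dim=4)
```

In this sketch the flag is flipped by the first rope call and every later layer reuses the cached tables; a new forward pass would start from a context with `is_first_layer` set back to `True`, which is what the hunk below does in `set_ascend_forward_context`.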
--- a/vllm_ascend/ascend_forward_context.py
+++ b/vllm_ascend/ascend_forward_context.py
@@ -119,6 +119,9 @@ def set_ascend_forward_context(

        forward_context.flashcomm_v1_enabled = flashcomm_v1_enabled

+        # Set this flag for use by the rope forward_oot
+        forward_context.is_first_layer = True
+
        if num_tokens is None and attn_metadata is not None:
            num_tokens = attn_metadata.num_actual_tokens