[Ops][Misc] Optimize split_qkv_rmsnorm_rope op (#6827)

### What this PR does / why we need it?

This PR optimizes the `split_qkv_rmsnorm_rope` operator by introducing a
new Triton kernel, `split_qkv_rmsnorm_rope_prefill_kernel`, for the
prefill stage (i.e., large batch sizes). The implementation now
dynamically selects between the existing decode kernel and the new
prefill kernel based on the batch size, which improves performance for
large batch scenarios.
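The dispatch logic described above can be sketched as follows. This is an illustrative sketch only: the helper and kernel names are assumptions, not the actual vllm-ascend symbols, and the threshold (batch size exceeding the number of available vector cores) is taken from the testing notes below.

```python
# Hypothetical sketch of batch-size-based kernel dispatch; names are
# illustrative, not the real vllm-ascend API.
def select_split_qkv_kernel(batch_size: int, num_vector_cores: int) -> str:
    # Per the PR description, the prefill path is taken when the batch
    # size is larger than the number of available vector cores; otherwise
    # the existing latency-oriented decode kernel is used.
    if batch_size > num_vector_cores:
        return "split_qkv_rmsnorm_rope_prefill_kernel"
    return "split_qkv_rmsnorm_rope_decode_kernel"
```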

Additionally, the RoPE implementation is updated to support partial
rotation dimensions (`rope_dim`), making the operator more flexible.
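Partial rotation means only the first `rope_dim` features of each head are rotated while the remainder passes through unchanged. A minimal NumPy sketch under the rotate-half convention (the helper name and tensor layout are assumptions, not the operator's actual signature):

```python
import numpy as np

def apply_partial_rope(x: np.ndarray, cos: np.ndarray, sin: np.ndarray,
                       rope_dim: int) -> np.ndarray:
    """Rotate only the first `rope_dim` features; pass the rest through.

    Hypothetical reference implementation, not the Triton kernel itself.
    x: (..., head_dim); cos/sin: broadcastable to (..., rope_dim).
    """
    x_rot, x_pass = x[..., :rope_dim], x[..., rope_dim:]
    half = rope_dim // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = np.concatenate([-x2, x1], axis=-1)  # rotate-half pairing
    return np.concatenate([x_rot * cos + rotated * sin, x_pass], axis=-1)
```

Setting `rope_dim == head_dim` recovers full rotation, so the partial form is a strict generalization of the previous behavior.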

### Does this PR introduce _any_ user-facing change?

No. This is a performance optimization and is not expected to introduce
any user-facing changes.

### How was this patch tested?

CI should pass with existing tests. The new prefill path is triggered
when the batch size is larger than the number of available vector cores.
The partial RoPE feature can be tested by passing the `rope_dim`
argument.
- vLLM version: v0.15.0
- vLLM main: 83b47f67b1

---------

Signed-off-by: guzhiyong <guzhiyong5@h-partners.com>
Signed-off-by: frank <2547457096@qq.com>
Co-authored-by: guzhiyong <guzhiyong5@h-partners.com>
Author: frank
Date: 2026-03-06 09:30:31 +08:00 (committed by GitHub)
Parent: a60e179c7f
Commit: 18b52afe2b
4 changed files with 290 additions and 177 deletions


@@ -1,5 +1,6 @@
 import copy
+import numpy as np
 import pytest
 import torch
 import torch.nn as nn
@@ -16,6 +17,8 @@ from vllm_ascend.compilation.passes.qknorm_rope_fusion_pass import (
 )
 from vllm_ascend.ops.triton.triton_utils import init_device_properties_triton

+MAX_POSITION_EMBEDDING = 262144
+
 def find_op(gm, op_default):
     return any(node.op == "call_function" and node.target == op_default for node in gm.graph.nodes)
@@ -207,8 +210,10 @@ def test_rmsnorm_quant_fusion(
     model = model.to("npu")
     seq_len = 5
     qkv = torch.randn(seq_len, qkv_size, device="npu", dtype=dtype)
-    cos = torch.randn(1, seq_len, 1, head_dim, device="npu", dtype=dtype)
-    sin = torch.randn(1, seq_len, 1, head_dim, device="npu", dtype=dtype)
+    cos_sin_cache = torch.from_numpy(np.random.uniform(0, 1, [MAX_POSITION_EMBEDDING, head_dim])).to(dtype).npu()
+    positions = torch.randint(
+        low=0, high=MAX_POSITION_EMBEDDING, size=(num_tokens,), dtype=torch.int64, device="npu"
+    )
     with torch.no_grad():
         original_optimize = torchair.npu_fx_compiler._optimize_fx
@@ -218,6 +223,6 @@ def test_rmsnorm_quant_fusion(
     compiled_model = torch.compile(model, backend="npugraph_ex", fullgraph=True, dynamic=True)
-    compiled_model(qkv, cos, sin)
+    compiled_model(qkv, cos_sin_cache, positions)
     torchair.npu_fx_compiler._optimize_fx = original_optimize


@@ -74,7 +74,7 @@ CASE_QWEN_FULL_DECODE_ONLY = LLMTestCase(
prompts=PROMPTS_LONG,
golden_answers=[
" \n\nTo solve this problem, we need to use the Law of Sines and Law of Cosines. Let me start by drawing triangle $ABC$ with the",
" \n\nTo solve this problem, we can use the fact that the expected value of the area of a triangle with vertices on a square can be calculated by integrating over",
" \n\nTo solve this problem, we can use the following approach: Let $P$ be the perimeter of the square. Then, the expected value of the area",
" \n\nTo solve this problem, we can use the following approach: Let $ \\alpha $ be the common real root of the two equations. Then, we can",
],
)
@@ -95,7 +95,7 @@ CASE_QWEN_EX = LLMTestCase(
prompts=PROMPTS_LONG,
golden_answers=[
" \n\nTo solve this problem, we need to use the Law of Sines and Law of Cosines. Let me start by drawing triangle $ABC$ with the",
" \n\nTo solve this problem, we can use the fact that the expected value of the area of a triangle with vertices on a square can be calculated by integrating over",
" \n\nTo solve this problem, we can use the following approach: Let $P$ be the perimeter of the square. Then, the expected value of the area",
" \n\nTo solve this problem, we can use the following approach: Let $ \\alpha $ be the common real root of the two equations. Then, we can",
],
)