[Graph][Fusion] Add QKVNormRope and QKVNormRopeWithBias (#5721)

### What this PR does / why we need it?
This PR builds on PR
https://github.com/vllm-project/vllm-ascend/pull/5011 and further
extends the npu_graph_ex_passes module. On top of that work, it adds
graph optimization support for the add_rms_quant fused operator when a
bias term is present, so that the fusion pattern is correctly registered
and matched in the computation graph.
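
To illustrate what "registering and matching a fusion pattern" means, here is a minimal toy sketch in plain Python. It is not the torchair or npu_graph_ex_passes API: the `FusionRegistry` class, the op names (`add_bias`, etc.) and the fused-op names are all hypothetical, and a real pass matches subgraphs of an FX graph rather than a flat op list.

```python
# Toy illustration only: NOT the torchair / npu_graph_ex_passes API.
# All class, op, and fused-op names below are hypothetical.
from dataclasses import dataclass, field


@dataclass
class FusionRegistry:
    """Maps a sequence of op names to the name of a single fused op."""
    patterns: dict = field(default_factory=dict)

    def register(self, pattern, fused_op):
        # Store the pattern as a tuple so it can be used as a dict key.
        self.patterns[tuple(pattern)] = fused_op

    def apply(self, ops):
        """Return a new op list with every registered pattern replaced."""
        result, i = [], 0
        # Try longer patterns first so the most specific match wins.
        ordered = sorted(self.patterns.items(), key=lambda kv: -len(kv[0]))
        while i < len(ops):
            for pattern, fused_op in ordered:
                if tuple(ops[i:i + len(pattern)]) == pattern:
                    result.append(fused_op)
                    i += len(pattern)
                    break
            else:
                # No pattern starts at position i: keep the op unchanged.
                result.append(ops[i])
                i += 1
        return result


registry = FusionRegistry()
registry.register(["add", "rms_norm", "quant"], "add_rms_norm_quant")
registry.register(["add", "rms_norm", "add_bias", "quant"],
                  "add_rms_norm_quant_bias")

graph = ["matmul", "add", "rms_norm", "add_bias", "quant", "matmul"]
fused = registry.apply(graph)
print(fused)
```

The point of the sketch is that the with-bias variant needs its own registered pattern: the bias op sits between the norm and the quant, so the bias-free pattern alone would never match that sequence.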

For validation, we used the Qwen3-235B-A22B-W8A8 model for
QKVNormRopeWithBias and the Qwen3-32B model for QKVNormRope. Benchmark
results show that, compared to the unfused baseline, enabling this
fusion pass significantly improves inference throughput for W8A8
quantized models.

For more details, refer to the RFC:
https://github.com/vllm-project/vllm-ascend/issues/4715

### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```python
from vllm import LLM

# Excerpt from the test script: model, prompts, sampling_params,
# GPUs_per_dp_rank, profile_dir, etc. are defined elsewhere in it.
llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,  # graph mode is required for the fusion pass
    enable_expert_parallel=enable_expert_parallel,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.98,
    max_num_batched_tokens=512,
    # load_format="dummy",
    max_model_len=2048,
    max_num_seqs=16,
    quantization="ascend",
    additional_config={
        "refresh": True,
        "enable_npugraph_ex": True,
    },
    compilation_config={
        "cudagraph_capture_sizes": [8, 16],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
if profile_dir:
    llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
    llm.stop_profile()
# Print only the first five results per DP rank.
for i, output in enumerate(outputs):
    if i >= 5:
        break
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
        f"Generated text: {generated_text!r}"
    )
```
- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

---------

Signed-off-by: cjian <2318164299@qq.com>

CodeCat, 2026-01-22 17:22:41 +08:00, committed by GitHub
parent f2c0ced06d
commit 1402cf6874

10 changed files with 707 additions and 474 deletions

									
4 vllm_ascend/compilation/compiler_interface.py

    @@ -76,10 +76,6 @@ def npugraph_ex_compile(
    ) -> tuple[Callable | None, Any | None]:
        import torchair
        # TODO: use a better way to lazy register replacement, instead of import one by one
        # As an example, we directly import here to register replacement.
        # import vllm_ascend.compilation.npugraph_ex_passes.add_rms_norm_quant  # noqa
        torch.npu.set_compile_mode(jit_compile=False)
        config = torchair.CompilerConfig()
        # use aclgraph mode, avoid the transformation from fx graph to Ascend IR.
