[Graph][Fusion] Add AddRMSNormSPPattern and AddRMSNormSPPatternWithBias (#5569)

### What this PR does / why we need it?
This PR builds on
https://github.com/vllm-project/vllm-ascend/pull/5011 and further
extends the npu_graph_ex_passes module. On top of that prior work, it
adds graph optimization support for the add_rms_quant fused operator in
scenarios where a bias term is present, ensuring the fusion pattern is
correctly registered and matched in the computation graph.
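
For context, the unfused sequence this kind of pattern is meant to match looks roughly like the sketch below. This is a minimal eager-mode illustration assuming conventional RMSNorm-with-bias and static int8 quantization semantics (round(x / scale) + offset); it is not the exact pattern code added by this PR, and the precise point where the bias enters follows the registered pattern.

```
import torch


def unfused_add_rms_norm_quant(x, residual, weight, bias, scale, offset, eps=1e-6):
    # Residual add that feeds the next block's normalization.
    added = x + residual
    # RMSNorm with a learnable bias term (the "WithBias" variant).
    variance = added.pow(2).mean(-1, keepdim=True)
    normed = added * torch.rsqrt(variance + eps) * weight + bias
    # Static int8 quantization with per-channel scale and offset.
    quantized = torch.clamp(torch.round(normed / scale + offset), -128, 127).to(torch.int8)
    return quantized, added
```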

For validation, we used the Qwen3-235B-A22B-W8A8 model for
SPPatternWithBias and the Qwen3-32B model for SPPattern. Benchmark
results show that, compared with the unfused baseline, enabling this
fusion pass significantly improves inference throughput for W8A8
quantized models. For more details, see the RFC:
https://github.com/vllm-project/vllm-ascend/issues/4715

### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
# Snippet from the offline test script; model, prompts, sampling_params,
# profile_dir, GPUs_per_dp_rank, global_dp_rank, enable_expert_parallel and
# trust_remote_code are defined elsewhere in that script.
from vllm import LLM

llm = LLM(
    model=model,
    tensor_parallel_size=GPUs_per_dp_rank,
    enforce_eager=False,
    enable_expert_parallel=enable_expert_parallel,
    trust_remote_code=trust_remote_code,
    gpu_memory_utilization=0.98,
    max_num_batched_tokens=512,
    # load_format="dummy",
    max_model_len=2048,
    max_num_seqs=16,
    quantization="ascend",
    additional_config={
        "refresh": True,
        "enable_npugraph_ex": True
    },
    compilation_config={
        "cudagraph_capture_sizes": [8, 16],
        "cudagraph_mode": "FULL_DECODE_ONLY",
    },
)
if profile_dir:
    llm.start_profile()
outputs = llm.generate(prompts, sampling_params)
if profile_dir:
    llm.stop_profile()
for i, output in enumerate(outputs):
    if i >= 5:
        break
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(
        f"DP rank {global_dp_rank}, Prompt: {prompt!r}, "
        f"Generated text: {generated_text!r}"
    )
```
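The snippet assumes `prompts` and `sampling_params` are defined earlier in the script. A minimal, hypothetical definition that would make it self-contained could look like this (the actual prompts and sampling settings used for the benchmark may differ):

```
from vllm import SamplingParams

# Placeholder inputs for illustration only.
prompts = ["Hello, my name is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)
```
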
- vLLM version: v0.13.0
- vLLM main: 7157596103

Signed-off-by: cjian <2318164299@qq.com>
Author: CodeCat
Date: 2026-01-07 09:03:45 +08:00
Committed by: GitHub
Parent: ad9b711f89
Commit: bdedf3c9f8
2 changed files with 207 additions and 74 deletions


@@ -16,6 +16,23 @@
import sys
from unittest import mock

import torch


def get_inputs():
    """
    Generate example inputs for the AddRMSNormQuantSPPatternWithBias fusion pattern.
    """
    rms_norm_input = torch.randn(2, 4)
    residual = torch.randn(2, 4)
    rms_norm_weight = torch.randn(4)
    rmsnorm_bias = torch.randn(4)
    scale = torch.ones(4)
    offset = torch.zeros(4)
    return [
        rms_norm_input, residual, rms_norm_weight, scale, offset, rmsnorm_bias
    ]


def _extra_stream_scope_check_for_test(match) -> bool:
    """
@@ -93,3 +110,39 @@ def test_replacement_function_without_torch_npu(caplog):
        assert result is None
    except (ImportError, AttributeError):
        pass


def test_get_inputs_sp_pattern_with_bias():
    """
    Test that get_inputs generates tensors with correct shapes and device.
    This test verifies the internal get_inputs function used in the pattern.
    """
    try:
        import torch
    except ImportError:
        return  # Skip if torch is not available

    inputs = get_inputs()
    (
        rms_norm_input,
        residual,
        rms_norm_weight,
        scale,
        offset,
        rmsnorm_bias,
    ) = inputs

    # Verify shapes
    assert rms_norm_input.shape == (2, 4)
    assert residual.shape == (2, 4)
    assert rms_norm_weight.shape == (4, )
    assert rmsnorm_bias.shape == (4, )
    assert scale.shape == (4, )
    assert offset.shape == (4, )

    # Verify number of inputs
    assert len(inputs) == 6

    # Verify specific values
    assert torch.all(scale == 1.0)
    assert torch.all(offset == 0.0)
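
For reference, the new `get_inputs()` helper and the `_extra_stream_scope_check_for_test` hook correspond to the usual ingredients of a torch inductor pattern registration: example inputs to trace the pattern with, and an extra-check callback evaluated on each match. The rough sketch below is illustrative only and is not the actual vllm-ascend pass code; `search_pattern` and `replace_pattern` are hypothetical names, and the real replacement dispatches the fused Ascend kernel rather than re-running the eager ops.

```
import torch
from torch._inductor.pattern_matcher import (PatternMatcherPass, fwd_only,
                                              register_replacement)


def search_pattern(x, residual, weight, scale, offset, bias):
    # Unfused add + RMSNorm(with bias) + quantize sequence to look for.
    added = x + residual
    normed = added * torch.rsqrt(added.pow(2).mean(-1, keepdim=True) + 1e-6) * weight + bias
    return torch.clamp(torch.round(normed / scale + offset), -128, 127), added


def replace_pattern(x, residual, weight, scale, offset, bias):
    # Placeholder: the real pass emits the fused NPU kernel here.
    return search_pattern(x, residual, weight, scale, offset, bias)


patterns = PatternMatcherPass()
register_replacement(
    search_pattern,
    replace_pattern,
    get_inputs(),                  # example inputs from the helper above
    fwd_only,                      # trace the pattern in inference mode
    patterns,
    extra_check=_extra_stream_scope_check_for_test,
)
```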