[Graph][Fusion] Add new pattern for AddRmsnormQuant with SP. (#5077)
### What this PR does / why we need it?
1. Following up on
[#4168](https://github.com/vllm-project/vllm-ascend/pull/4168) and
[#5011](https://github.com/vllm-project/vllm-ascend/pull/5011), this PR
adds two more patterns for AddRmsnormQuant with sequence parallelism
(SP) enabled. The key difference is that an additional
`maybe_all_gather_and_maybe_unpad` is inserted between `addrmsnorm` and
`quantize` (see the sketch after this list).
2. This PR also introduces another op, `torch.ops.vllm.quantize`, so
that `input_scale` and `input_scale_reciprocal` can be passed at the
same time. This is necessary because `npu_add_rms_norm_quant` and
`npu_quantize` require different `div_mode` settings; passing both
values to the quantize op avoids an extra reciprocal computation at
runtime (a sketch of such an op registration follows the diff below).
3. Removes the redundant `AscendQuantRmsnorm`.
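As a rough illustration of what the new SP patterns match, here is a
runnable pure-PyTorch sketch. The functions `add_rms_norm`,
`maybe_all_gather_and_maybe_unpad`, and `quantize` below are simplified
local stand-ins for the real ops (the all-gather degenerates to an
identity in a single process, and the actual fused replacement uses
`torch_npu.npu_add_rms_norm_quant`), so treat the shapes and arithmetic
as schematic rather than as the vllm-ascend implementation:

```python
import torch

# Simplified stand-ins for the real ops.
def add_rms_norm(x, residual, weight, eps=1e-6):
    s = x + residual
    normed = s * torch.rsqrt(s.pow(2).mean(-1, keepdim=True) + eps) * weight
    return normed, s  # (normalized output, updated residual)

def maybe_all_gather_and_maybe_unpad(t):
    # With SP enabled this all-gathers the sequence dim across TP ranks;
    # identity here for a single-process sketch.
    return t

def quantize(x, scale_reciprocal, offset):
    # Static per-tensor int8 quantization. Multiplying by the precomputed
    # reciprocal avoids computing a reciprocal (or dividing) at runtime.
    return torch.clamp(torch.round(x * scale_reciprocal) + offset,
                       -128, 127).to(torch.int8)

def unfused_sp_path(x, residual, weight, scale_rec, offset):
    # The shape this PR's new patterns match: the SP all-gather sits
    # between addrmsnorm and quantize.
    normed, new_residual = add_rms_norm(x, residual, weight)
    normed = maybe_all_gather_and_maybe_unpad(normed)
    return quantize(normed, scale_rec, offset), new_residual

def fused_sp_path(x, residual, weight, scale_rec, offset):
    # After fusion, norm + quant collapse into one kernel
    # (npu_add_rms_norm_quant on NPU); the all-gather presumably then
    # runs after it, on the smaller int8 tensor.
    normed, new_residual = add_rms_norm(x, residual, weight)
    q = quantize(normed, scale_rec, offset)
    return maybe_all_gather_and_maybe_unpad(q), new_residual

x, res, w = torch.randn(4, 64), torch.randn(4, 64), torch.ones(64)
a, _ = unfused_sp_path(x, res, w, torch.tensor(127.0), torch.tensor(0.0))
b, _ = fused_sp_path(x, res, w, torch.tensor(127.0), torch.tensor(0.0))
assert torch.equal(a, b)  # identical here since the gather is an identity
```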
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: Angazenn <supperccell@163.com>
```diff
@@ -545,8 +545,7 @@ class SequenceRowParallelOp(CustomRowParallelOp):
         from vllm.model_executor.layers.linear import UnquantizedLinearMethod
 
         from vllm_ascend.quantization.quant_config import AscendLinearMethod
-        from vllm_ascend.quantization.w8a8 import (AscendW8A8LinearMethod,
-                                                   quant_per_tensor)
+        from vllm_ascend.quantization.w8a8 import AscendW8A8LinearMethod
 
         # For unquant
         if mmrs_fusion and isinstance(self.layer.quant_method,
@@ -568,8 +567,9 @@ class SequenceRowParallelOp(CustomRowParallelOp):
                 and isinstance(self.layer.quant_method.quant_method,
                                AscendW8A8LinearMethod)):
             if x.dtype != torch.int8:
-                x_quant = quant_per_tensor(
-                    x, self.layer.aclnn_input_scale_reciprocal,
+                x_quant = torch.ops.vllm.quantize(
+                    x, self.layer.aclnn_input_scale,
+                    self.layer.aclnn_input_scale_reciprocal,
                     self.layer.aclnn_input_offset)
             else:
                 x_quant = x
```
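For context on the new call above: `torch.ops.vllm.quantize` carries
both the scale and its precomputed reciprocal, so pattern replacement
can hand `npu_add_rms_norm_quant` and `npu_quantize` whichever form
their `div_mode` expects. Below is a minimal sketch of registering an op
with that four-argument signature via `torch.library.custom_op`; the
`demo::` namespace, the reference arithmetic, and the fake impl are
illustrative assumptions, not the actual vllm-ascend registration:

```python
import torch
from torch.library import custom_op, register_fake

# Hypothetical registration under a demo namespace (not "vllm::") to
# avoid clashing with the real op when vLLM is installed.
@custom_op("demo::quantize", mutates_args=())
def quantize(x: torch.Tensor, scale: torch.Tensor,
             scale_reciprocal: torch.Tensor,
             offset: torch.Tensor) -> torch.Tensor:
    # Reference impl: multiply by the precomputed reciprocal instead of
    # dividing by `scale`, so no reciprocal is computed at runtime. The
    # unused `scale` stays in the signature for replacements whose
    # div_mode expects the non-reciprocal form.
    q = torch.round(x * scale_reciprocal) + offset
    return torch.clamp(q, -128, 127).to(torch.int8)

@register_fake("demo::quantize")
def _(x, scale, scale_reciprocal, offset):
    # Shape/dtype propagation so the op traces cleanly under torch.compile.
    return torch.empty_like(x, dtype=torch.int8)

x = torch.randn(8, 16)
scale = torch.tensor(0.05)
out = torch.ops.demo.quantize(x, scale, 1.0 / scale, torch.tensor(0.0))
assert out.dtype == torch.int8
```

The call site then mirrors the diff above, passing the scale, its
reciprocal, and the offset in one call.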