[Feat] Qwen3 MoE supports npu_add_rms_norm_quant op by default; update op with norm bias (#3205)

### What this PR does / why we need it?
1. In quantization scenarios, Qwen3 MoE now uses the fused add_rms_norm_quant op instead of a separate add_rms_norm op followed by a quant op.
2. The torch_npu.add_rms_norm_quant op has been fixed for accuracy when model weights are quantized with anti_method m4. m4 is an asymmetric outlier-suppression method that generates a non-zero norm bias, so the add_rms_norm_quant op was updated to take this bias as an additional parameter in its computation.

### Does this PR introduce _any_ user-facing change?
Please use a torch_npu version >= torch_npu-2.7.1.dev20250919.

### How was this patch tested?
1. No special parameters and no new environment variables need to be set.
2. Tested with Qwen3 MoE quantized models such as Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: huangdong2022 <huangdong51@huawei.com>
Signed-off-by: h30027576 <huangdong51@huawei.com>
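
For readers unfamiliar with the fusion, here is a minimal pure-Python sketch of what a fused add + RMS norm + quant step computes, and where a norm bias would enter. It is an illustration only: the function name `add_rms_norm_quant_reference`, its parameters, and the quantization convention (`round(y / scale) + offset`, clamped to int8) are assumptions made for this sketch, not the torch_npu signature or its exact numerics.

```python
import math
from typing import List, Optional, Tuple


def add_rms_norm_quant_reference(
    x: List[float],
    residual: List[float],
    gamma: List[float],
    scale: float,
    offset: int = 0,
    beta: Optional[List[float]] = None,
    eps: float = 1e-6,
) -> Tuple[List[int], List[float]]:
    """Hypothetical reference for one token vector; NOT the torch_npu API."""
    # Step 1: residual add. The sum is also returned so the next layer
    # can reuse it as its residual input.
    summed = [a + b for a, b in zip(x, residual)]

    # Step 2: RMS normalization scaled by the learned weight gamma.
    rms = math.sqrt(sum(v * v for v in summed) / len(summed) + eps)
    normed = [v / rms * g for v, g in zip(summed, gamma)]

    # Step 3: optional norm bias. Per the commit message, m4 outlier
    # suppression produces a non-zero bias, so skipping this step would
    # shift the values that get quantized.
    if beta is not None:
        normed = [v + b for v, b in zip(normed, beta)]

    # Step 4: quantize to int8 (assumed convention: divide by scale,
    # add the integer offset, clamp to [-128, 127]).
    quantized = [
        max(-128, min(127, round(v / scale) + offset)) for v in normed
    ]
    return quantized, summed


if __name__ == "__main__":
    q, new_residual = add_rms_norm_quant_reference(
        x=[3.0, 4.0], residual=[0.0, 0.0], gamma=[1.0, 1.0],
        scale=0.1, beta=[0.5, -0.5], eps=0.0,
    )
    print(q, new_residual)
```

Doing the four steps in one op avoids materializing the normalized floating-point tensor between the norm and the quantization, which is the usual motivation for this kind of fusion.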
2025-10-09 20:18:10 +08:00
parent 81aff9c555
commit 23db56a340
4 changed files with 57 additions and 40 deletions
--- a/vllm_ascend/ascend_forward_context.py
+++ b/vllm_ascend/ascend_forward_context.py
@@ -147,12 +147,14 @@ def set_ascend_forward_context(
        # Once the necessary conditions are met, support for MOE models will also be added.
        from vllm_ascend.quantization.quant_config import AscendQuantConfig
        addrmsnorm_quant_fusion_enabled = isinstance(vllm_config.quant_config, AscendQuantConfig) and \
-            vllm_config.model_config.hf_config.model_type in ["llama", "qwen2", "qwen3"] and \
+            vllm_config.model_config.hf_config.model_type in ["llama", "qwen2", "qwen3", "qwen3_moe"] and \
            forward_context.layer_idx is not None
        if addrmsnorm_quant_fusion_enabled:
            forward_context.model_instance = model_instance
            forward_context.num_hidden_layers = vllm_config.model_config.hf_config.num_hidden_layers
            forward_context.fusion_linear = "gate_up_dense" if forward_context.layer_idx == 0 else "qkv_dense"
+            if vllm_config.model_config.hf_config.model_type == "qwen3_moe":
+                forward_context.fusion_linear = "gate_moe" if forward_context.layer_idx == 0 else "qkv_moe"
        forward_context.addrmsnorm_quant_fusion_enabled = addrmsnorm_quant_fusion_enabled

        if num_tokens is None and attn_metadata is not None: