[Feat]Qwen3 Moe supports npu_add_rms_norm_quant op by default, update op with bias, resolve conflict with weight prefetch (#3465)

### What this PR does / why we need it?
1. Qwen3 MoE now uses the fused add_rms_norm_quant op instead of separate
add_rms_norm and quant ops in quantization scenarios.
2. Fixes the accuracy of the torch_npu.add_rms_norm_quant op when the model
weights are quantized with anti_method m4. m4 is an asymmetric outlier
suppression method that generates a non-zero norm bias, so the
add_rms_norm_quant op is updated to take this bias as an extra parameter.
3. Adds a torch_npu version check.
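
For illustration, the fused computation described above can be sketched in plain Python for a single token. This is a reference sketch, not the torch_npu kernel or its signature: the function name, the parameter names (`gamma`, `beta`, `scale`, `offset`), and the quantization convention `round(v / scale + offset)` are assumptions made for this example.

```python
import math
from typing import List, Optional, Tuple


def add_rms_norm_quant_reference(
    x: List[float],
    residual: List[float],
    gamma: List[float],
    scale: float,
    offset: float = 0.0,
    beta: Optional[List[float]] = None,
    eps: float = 1e-6,
) -> Tuple[List[int], List[float]]:
    """Hypothetical reference: residual add, RMSNorm, optional bias, int8 quantization."""
    # Step 1: residual add. The sum is also returned so the caller can reuse it
    # as the residual for the next layer.
    summed = [a + b for a, b in zip(x, residual)]

    # Step 2: RMSNorm over the hidden dimension, scaled by the norm weight.
    mean_sq = sum(v * v for v in summed) / len(summed)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms * g for v, g in zip(summed, gamma)]

    # Step 3: norm bias. With anti_method m4 this bias is non-zero, so skipping
    # it would change the values that get quantized.
    if beta is not None:
        normed = [v + b for v, b in zip(normed, beta)]

    # Step 4: per-tensor quantization (assumed convention), clamped to int8.
    quantized = [max(-128, min(127, round(v / scale + offset))) for v in normed]
    return quantized, summed
```

Running the sketch with and without `beta` on the same input shows why the bias matters: the quantized values differ, which is the kind of accuracy gap item 2 addresses.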

### Does this PR introduce _any_ user-facing change?
The new feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919.
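
A minimal sketch of such a version gate, assuming version strings of the form `X.Y.Z` or `X.Y.Z.devYYYYMMDD`. The helper names are hypothetical and this is not necessarily how the PR implements the check. For general version strings, `packaging.version.Version` is a more robust choice.

```python
import re

# Minimum version stated in the PR description.
MIN_TORCH_NPU_VERSION = "2.7.1.dev20250919"


def parse_version(version: str):
    """Return (major, minor, patch, dev_date) for simple X.Y.Z[.devDATE] strings."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)(?:\.dev(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor, patch, dev = match.groups()
    # A release without a .dev suffix sorts after every dev build of the same X.Y.Z.
    dev_key = int(dev) if dev is not None else float("inf")
    return (int(major), int(minor), int(patch), dev_key)


def supports_add_rms_norm_quant_bias(installed: str) -> bool:
    """Hypothetical gate: True if the installed version meets the minimum."""
    return parse_version(installed) >= parse_version(MIN_TORCH_NPU_VERSION)
```

In practice the installed string would come from the torch_npu package itself. Here it is passed in explicitly so the sketch runs without NPU hardware.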

### How was this patch tested?
1. No special parameters or new environment variables need to be set. The new
feature works if the torch_npu version is >= torch_npu-2.7.1.dev20250919.
2. Tested with Qwen3 MoE quantized models, such as
Qwen3-235B-A22B-W8A8, Qwen3-30B-A3B-W8A8, and
Qwen3-235B-A22B-Instruct-2507-m4 (anti_method m4).

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: h30027576 <huangdong51@huawei.com>

This commit is contained in:

huangdong2022

2025-10-17 09:30:51 +08:00

committed by

GitHub

parent 4c4a8458a5

commit 3a53bbc508

9 changed files with 121 additions and 38 deletions

									
vllm_ascend/ops/moe/moe_mlp.py
												
@@ -177,7 +177,6 @@ def quant_apply_mlp(hidden_states: torch.Tensor,
            group_type=0,
            group_list=group_list,
            output_dtype=_output_dtype)[0]
    return hidden_states
