[main] [bugfix] Fix misjudging quantized/unquantized scenarios (#2627)

### What this PR does / why we need it? In a mixed-precision scenario, quant_config is not None, but MoE needs to perform unquantized computation; however, quantized computation is currently being used. Therefore, we put the with_quant logic into forward, avoid misjudging in mix-precision scenarios. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? e2e & ut - vLLM version: v0.10.1.1 - vLLM main: 98ac0cb32d Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-08-29 16:20:22 +08:00
parent aadc75c247
commit 52aff9e229
7 changed files with 62 additions and 65 deletions
--- a/vllm_ascend/quantization/w8a8_dynamic.py
+++ b/vllm_ascend/quantization/w8a8_dynamic.py
@@ -406,7 +406,8 @@ class AscendW8A8DynamicFusedMoEMethod:
            shared_experts=shared_experts,
            shared_gate_up=shared_gate_up,
            shared_dequant_scale=shared_dequant_scale,
-            mc2_mask=kwargs.get("mc2_mask", None))
+            mc2_mask=kwargs.get("mc2_mask", None),
+            with_quant=True)

    def process_weights_after_loading(self, layer):
        if self.transpose_weight: