[main] [bugfix] Fix misjudging quantized/unquantized scenarios (#2627)

### What this PR does / why we need it?
In a mixed-precision scenario, `quant_config` is not None, yet the MoE
layer needs to perform unquantized computation; currently, quantized
computation is used instead. This PR moves the `with_quant` decision
into `forward` so it is made per call, avoiding the misjudgment in
mixed-precision scenarios.
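A minimal sketch of the fix pattern described above: rather than deriving the quantization mode once at construction time from `quant_config` (which misfires when `quant_config` is set but a given MoE layer stays in full precision), the `with_quant` flag is computed inside `forward`. The class and attribute names here are illustrative assumptions, not the actual vllm-ascend API.

```python
class FusedMoELayer:
    """Hypothetical MoE layer illustrating a per-forward quantization check."""

    def __init__(self, quant_config, layer_is_quantized):
        # quant_config may be non-None even when THIS layer is unquantized,
        # e.g. in a mixed-precision model. Storing it alone is not enough
        # to decide how to compute.
        self.quant_config = quant_config
        self.layer_is_quantized = layer_is_quantized

    def forward(self, hidden_states):
        # Decide per forward call, not at construction time: a non-None
        # quant_config does not imply this layer should run quantized.
        with_quant = self.quant_config is not None and self.layer_is_quantized
        if with_quant:
            return self._quantized_moe(hidden_states)
        return self._unquantized_moe(hidden_states)

    def _quantized_moe(self, x):
        # Placeholder for the W8A8 dynamic-quantized MoE path.
        return ("quantized", x)

    def _unquantized_moe(self, x):
        # Placeholder for the full-precision MoE path.
        return ("unquantized", x)
```

With this structure, a layer in a mixed-precision model takes the unquantized path even though `quant_config` is set, which is the misjudgment the PR fixes.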
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main: 98ac0cb32d

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
weichen
2025-08-29 16:20:22 +08:00
committed by GitHub
parent aadc75c247
commit 52aff9e229
7 changed files with 62 additions and 65 deletions


@@ -406,7 +406,8 @@ class AscendW8A8DynamicFusedMoEMethod:
                 shared_experts=shared_experts,
                 shared_gate_up=shared_gate_up,
                 shared_dequant_scale=shared_dequant_scale,
-                mc2_mask=kwargs.get("mc2_mask", None))
+                mc2_mask=kwargs.get("mc2_mask", None),
+                with_quant=True)

     def process_weights_after_loading(self, layer):
         if self.transpose_weight: