[main] [bugfix] Fix misjudging quantized/unquantized scenarios (#2627)

### What this PR does / why we need it?
In a mixed-precision scenario, `quant_config` is not None, but the MoE layers still need to perform unquantized computation; currently the quantized path is used instead. This PR therefore moves the `with_quant` logic into forward, which avoids misjudging the mode in mixed-precision scenarios.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
e2e & ut

- vLLM version: v0.10.1.1
- vLLM main: 98ac0cb32d

Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
2025-08-29 16:20:22 +08:00
parent aadc75c247
commit 52aff9e229
7 changed files with 62 additions and 65 deletions
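
The misjudgment described in the PR text can be illustrated with a small, self-contained sketch. It is not the code from this PR: `QuantConfig`, `skipped_layers`, `pick_method`, and the two method classes are hypothetical names invented for illustration. Only the global check `quant_config is not None` mirrors the lines removed from `set_ascend_forward_context`. The sketch assumes a mixed-precision config can leave some layers unquantized, and shows why a per-layer decision at forward time can differ from a model-wide flag.

```python
from dataclasses import dataclass


@dataclass
class QuantConfig:
    """Hypothetical model-wide quantization config (not a vLLM class)."""
    # Assumption: mixed precision is expressed as a set of layers left unquantized.
    skipped_layers: frozenset = frozenset()


class UnquantizedMethod:
    """Hypothetical marker: the layer keeps full-precision weights."""


class QuantizedMethod:
    """Hypothetical marker: the layer uses quantized weights."""


def pick_method(quant_config, layer_name):
    # A layer is unquantized if there is no config or the config skips it.
    if quant_config is None or layer_name in quant_config.skipped_layers:
        return UnquantizedMethod()
    return QuantizedMethod()


def with_quant_global(quant_config) -> bool:
    # Model-wide flag, as in the removed lines: True whenever a config exists.
    return quant_config is not None


def with_quant_per_layer(method) -> bool:
    # Per-layer flag, evaluated at forward time from the layer's own method.
    return not isinstance(method, UnquantizedMethod)


cfg = QuantConfig(skipped_layers=frozenset({"layers.0.moe"}))
moe_method = pick_method(cfg, "layers.0.moe")

global_flag = with_quant_global(cfg)           # True: the whole model looks quantized
layer_flag = with_quant_per_layer(moe_method)  # False: this MoE layer is not
```

With the model-wide flag, the MoE layer above would be routed to the quantized path even though its weights are unquantized, which is the misjudgment the PR describes.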
--- a/vllm_ascend/ascend_forward_context.py
+++ b/vllm_ascend/ascend_forward_context.py
@@ -99,8 +99,6 @@ def set_ascend_forward_context(
        forward_context.fused_moe_state = fused_moe_state
        forward_context.in_profile_run = in_profile_run

-        with_quant = vllm_config.quant_config is not None
-        forward_context.with_quant = with_quant
        from vllm_ascend.ops.moe_dispatcher.token_dispatcher import \
            get_token_dispatcher
        dispatcher_name = get_dispatcher_name(ep_size, with_prefill)