add mxfp8 moe quantization (#6670)
### What this PR does / why we need it?
Support mxfp8 quantization for Qwen MoE models.
Use an adapter to make the hardware-specific behavior clearer and more
maintainable.
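A minimal sketch of the adapter idea mentioned above, assuming one adapter per quantization scheme; the class and method names here are illustrative, not the PR's actual API:

```python
from abc import ABC, abstractmethod
import torch

class QuantAdapter(ABC):
    """Illustrative adapter: each quantization scheme hides its
    hardware-specific kernels behind one common entry point."""

    @abstractmethod
    def quantize_activations(self, hidden_states: torch.Tensor):
        """Return (possibly quantized) activations and an optional per-token scale."""

class W8A8Adapter(QuantAdapter):
    def quantize_activations(self, hidden_states):
        import torch_npu
        # NPU dynamic per-token quantization, as used in the diff below.
        return torch_npu.npu_dynamic_quant(hidden_states)

class MXFP8Adapter(QuantAdapter):
    def quantize_activations(self, hidden_states):
        # MXFP8 defers activation quantization to the MoE MLP path.
        return hidden_states, None
```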
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
@@ -76,7 +76,7 @@ class PrepareAndFinalize(ABC):
         router_logits (torch.Tensor): Router outputs, shape [num_tokens, num_experts]
         enable_shared_expert_dp (bool): Skip DP communication for shared experts
         replace_allreduce (bool): Bypass default all-reduce behavior
-        quant_type: none, w8a8 or w4a8
+        quant_type: none, w8a8, w4a8 or mxfp8
 
     Returns:
         Tuple of:
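For reference, the quant_type values listed in the docstring correspond to the QuantType enum referenced in the next hunk. A hedged sketch of that enum follows; the member names W8A8 and MXFP8 appear in the diff, while the string values are assumed for illustration:

```python
from enum import Enum

class QuantType(Enum):
    # W8A8 and MXFP8 appear in the diff below; the remaining members and
    # all string values are assumptions based on the docstring.
    NONE = "none"
    W8A8 = "w8a8"
    W4A8 = "w4a8"
    MXFP8 = "mxfp8"
```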
@@ -323,6 +323,10 @@ class PrepareAndFinalizeWithAllGather(PrepareAndFinalize):
         pertoken_scale = None
         if quant_type == QuantType.W8A8:
             hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+        elif quant_type == QuantType.MXFP8:
+            # TODO(linfeng): MXFP8 with AllGather+EP currently does not pre-quantize
+            # per-token activations in prepare. Keep quantization in the MoE MLP path.
+            pass
 
         if self.multistream_overlap_gate:
             assert PrepareAndFinalize.quant_stream is not None
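A standalone sketch of the prepare-path activation quantization shown in the hunk above, assuming the QuantType enum sketched earlier; the helper name _maybe_prequantize is hypothetical and not part of the PR:

```python
import torch
import torch_npu  # Ascend NPU extension providing npu_dynamic_quant

def _maybe_prequantize(hidden_states: torch.Tensor, quant_type: QuantType):
    """Optionally pre-quantize per-token activations in prepare()."""
    pertoken_scale = None
    if quant_type == QuantType.W8A8:
        # Dynamic per-token quantization; returns the quantized tensor and a
        # per-token scale, matching the call in the diff above.
        hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    elif quant_type == QuantType.MXFP8:
        # MXFP8 with AllGather+EP keeps quantization in the MoE MLP path,
        # so nothing is done here.
        pass
    return hidden_states, pertoken_scale
```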