add mxfp8 moe quantization (#6670)
### What this PR does / why we need it?
Support mxfp8 quantization for Qwen MoE models.
Use an adapter to make the hardware-specific behavior clearer and more
maintainable.
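A minimal sketch of the adapter idea mentioned above, assuming one adapter per quantization scheme; the class and method names here are illustrative, not the PR's actual API:

```python
from abc import ABC, abstractmethod
import torch

class QuantAdapter(ABC):
    """Illustrative adapter: each quantization scheme hides its
    hardware-specific kernels behind one common entry point."""

    @abstractmethod
    def quantize_activations(self, hidden_states: torch.Tensor):
        """Return (possibly quantized) activations and an optional per-token scale."""

class W8A8Adapter(QuantAdapter):
    def quantize_activations(self, hidden_states):
        import torch_npu
        # NPU dynamic per-token quantization, as used in the diff below.
        return torch_npu.npu_dynamic_quant(hidden_states)

class MXFP8Adapter(QuantAdapter):
    def quantize_activations(self, hidden_states):
        # MXFP8 defers activation quantization to the MoE MLP path.
        return hidden_states, None
```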
### How was this patch tested?
- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: fangrongcan <17343701736@163.com>
Signed-off-by: wangyao-i <iwangyao@outlook.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: Eric-dot <60131170+Eric-dot@users.noreply.github.com>
Co-authored-by: fangrongcan <f00876277@china.huawei.com>
Co-authored-by: wangyao-i <iwangyao@outlook.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
@@ -76,7 +76,7 @@ class PrepareAndFinalize(ABC):
         router_logits (torch.Tensor): Router outputs, shape [num_tokens, num_experts]
         enable_shared_expert_dp (bool): Skip DP communication for shared experts
         replace_allreduce (bool): Bypass default all-reduce behavior
-        quant_type: none, w8a8 or w4a8
+        quant_type: none, w8a8, w4a8 or mxfp8
 
     Returns:
         Tuple of:
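For reference, the quant_type values listed in the docstring correspond to the QuantType enum referenced in the next hunk. A hedged sketch of that enum follows; the member names W8A8 and MXFP8 appear in the diff, while the string values are assumed for illustration:

```python
from enum import Enum

class QuantType(Enum):
    # W8A8 and MXFP8 appear in the diff below; the remaining members and
    # all string values are assumptions based on the docstring.
    NONE = "none"
    W8A8 = "w8a8"
    W4A8 = "w4a8"
    MXFP8 = "mxfp8"
```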
@@ -323,6 +323,10 @@ class PrepareAndFinalizeWithAllGather(PrepareAndFinalize):
         pertoken_scale = None
         if quant_type == QuantType.W8A8:
             hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
+        elif quant_type == QuantType.MXFP8:
+            # TODO(linfeng): MXFP8 with AllGather+EP currently does not pre-quantize
+            # per-token activations in prepare. Keep quantization in the MoE MLP path.
+            pass
 
         if self.multistream_overlap_gate:
             assert PrepareAndFinalize.quant_stream is not None
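A standalone sketch of the prepare-path activation quantization shown in the hunk above, assuming the QuantType enum sketched earlier; the helper name _maybe_prequantize is hypothetical and not part of the PR:

```python
import torch
import torch_npu  # Ascend NPU extension providing npu_dynamic_quant

def _maybe_prequantize(hidden_states: torch.Tensor, quant_type: QuantType):
    """Optionally pre-quantize per-token activations in prepare()."""
    pertoken_scale = None
    if quant_type == QuantType.W8A8:
        # Dynamic per-token quantization; returns the quantized tensor and a
        # per-token scale, matching the call in the diff above.
        hidden_states, pertoken_scale = torch_npu.npu_dynamic_quant(hidden_states)
    elif quant_type == QuantType.MXFP8:
        # MXFP8 with AllGather+EP keeps quantization in the MoE MLP path,
        # so nothing is done here.
        pass
    return hidden_states, pertoken_scale
```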