[A5][bugfix] Fix fused MoE A5 MXFP8 scale normalization, load-balance routing and gating_topk ops (#7573)
### What this PR does / why we need it?
This PR fixes A5 MXFP8 MoE scale handling in the fused MoE path.
- It normalizes MXFP8 activation scales to the packed 3D layout expected by the A5 kernels, covering both precomputed `dynamic_scale` inputs and gmm1 output scales before they are consumed by downstream grouped matmul ops.
- It refines the MXFP8 force load-balancing path used in profiling runs.
- It also enables `npu_gating_top_k` from torch_npu instead of the custom op when running on the Ascend950 chip (a hedged reference sketch of the gating step follows below).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
CI and E2E serving tests on Ascend950DT passed.

---------

Signed-off-by: linfeng-yuan <1102311262@qq.com>
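For context only, here is a minimal sketch of what a softmax-based gating top-k step computes. This is not the PR's implementation: on Ascend950 the PR dispatches this work to torch_npu's `npu_gating_top_k` op, whose exact signature is not shown in this commit, so only a plain-PyTorch reference path is sketched and the op call is left out.

```python
import torch


def gating_top_k_reference(router_logits: torch.Tensor, top_k: int):
    """Reference softmax gating + top-k selection (hypothetical fallback).

    On Ascend950 this PR routes the equivalent computation to torch_npu's
    npu_gating_top_k instead of a custom op; that call is not reproduced
    here because its signature is not part of this diff.
    """
    # router_logits: [num_tokens, num_experts]
    probs = torch.softmax(router_logits, dim=-1, dtype=torch.float32)
    topk_weights, topk_ids = torch.topk(probs, k=top_k, dim=-1)
    # Renormalize the selected expert weights so they sum to 1 per token.
    topk_weights = topk_weights / topk_weights.sum(dim=-1, keepdim=True)
    return topk_weights, topk_ids
```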
@@ -128,7 +128,9 @@ def quant_apply_mlp(
         quantized_hidden_states = None
     else:
         unquantized_hidden_states = None
-        pertoken_scale = dynamic_scale
+        pertoken_scale = (
+            DeviceOperator.maybe_normalize_mxfp_scale_layout(dynamic_scale) if use_mxfp_quant else dynamic_scale
+        )
         quantized_hidden_states = hidden_states
 
     bias1, bias2 = None, None
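For illustration, a hedged sketch of what a scale-layout normalization helper such as `DeviceOperator.maybe_normalize_mxfp_scale_layout` might do: fold a flat per-token MXFP8 block-scale tensor into a packed 3D layout before the grouped matmul. The block size of 32, the `pack_factor`, and the assumed `[num_tokens, K // block_size]` input shape are assumptions made for this sketch; the actual packed layout expected by the A5 kernels is not spelled out in this hunk.

```python
from typing import Optional

import torch

MXFP8_BLOCK_SIZE = 32  # assumed MX block size (MX formats scale groups of 32 elements)


def maybe_normalize_mxfp_scale_layout(scale: Optional[torch.Tensor],
                                      pack_factor: int = 4) -> Optional[torch.Tensor]:
    """Hypothetical sketch: normalize MXFP8 activation scales to a packed 3D layout.

    Assumes `scale` arrives as [num_tokens, K // MXFP8_BLOCK_SIZE] and that the
    downstream grouped matmul expects the block-scale axis folded into groups of
    `pack_factor`, i.e. [num_tokens, num_blocks // pack_factor, pack_factor].
    The real layout required by the A5 kernels may differ.
    """
    if scale is None or scale.dim() == 3:
        # Nothing to do: no precomputed dynamic scale, or already in packed form.
        return scale
    num_tokens, num_blocks = scale.shape
    assert num_blocks % pack_factor == 0, "scale blocks must divide evenly into packs"
    return scale.view(num_tokens, num_blocks // pack_factor, pack_factor).contiguous()
```

In the hunk above, the same normalization is applied to `dynamic_scale` only when MXFP8 quantization is in use; otherwise the scale is passed through unchanged.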