[Feature]EPLB:Adapt DispatchGmmCombineDecode operator to eplb tensor list and expert token numbers (#5552)

#### What this PR does / why we need it?
This PR adapts the DispatchGmmCombineDecode operator to the EPLB tensor list and
expert token numbers.

The operator now accepts gmm1, gmm2, gmm1Scale, and gmm2Scale as lists
(one entry per expert).
The operator now counts how many tokens each local expert receives and
exposes the counts via expertTokensNum.
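
As a rough illustration of what a per-local-expert token count means, here is a minimal pure-Python sketch. It is not the operator's kernel: the function name, its parameters, and the routing data are hypothetical, and it assumes the local experts of a rank occupy a contiguous range of global expert ids.

```python
def count_tokens_per_local_expert(expert_ids, first_local_expert, num_local_experts):
    """Count routed tokens per local expert (illustrative only).

    expert_ids: global expert id chosen for each routed token (hypothetical input).
    first_local_expert: global id of this rank's first expert (assumed contiguous range).
    num_local_experts: number of experts hosted on this rank.
    """
    counts = [0] * num_local_experts
    for expert_id in expert_ids:
        local_id = expert_id - first_local_expert
        # Tokens routed to experts on other ranks are not counted here.
        if 0 <= local_id < num_local_experts:
            counts[local_id] += 1
    return counts


# Hypothetical routing: this rank hosts global experts 4..7.
routed = [4, 7, 4, 1, 5, 4]
expert_tokens_num = count_tokens_per_local_expert(routed, 4, 4)
print(expert_tokens_num)  # [3, 1, 0, 1]
```

The result has one entry per local expert, so an expert that receives no tokens still appears with a count of 0.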


- vLLM version: v0.13.0
- vLLM main: 7157596103

For more information about this operator, please refer to the RFC issue:
https://github.com/vllm-project/vllm-ascend/issues/5476
wangyibo1005
2026-01-07 11:23:42 +08:00
committed by GitHub
parent 086c093347
commit 25baf6df09
18 changed files with 425 additions and 195 deletions

@@ -254,7 +254,8 @@ class AscendW8A8DynamicFusedMoEMethod:
     w1 = layer.w13_weight_list
     w1_scale = layer.w13_weight_scale_fp32_list
     w2 = layer.w2_weight_list
-    w2_scale = layer.w2_weight_scale_list
+    w2_scale = layer.w2_weight_scale_fp32_list \
+        if w2_weight_scale_fp32_flag else layer.w2_weight_scale_list
 else:
     w1 = [layer.w13_weight]
     w1_scale = [layer.w13_weight_scale_fp32]
@@ -333,11 +334,16 @@ class AscendW8A8DynamicFusedMoEMethod:
     weight.clone()
     for weight in layer.w2_weight_scale.data.unbind(dim=0)
 ]
+layer.w2_weight_scale_fp32_list = [
+    weight.clone()
+    for weight in layer.w2_weight_scale_fp32.data.unbind(dim=0)
+]
 del layer.w13_weight
 del layer.w2_weight
 del layer.w13_weight_scale
 del layer.w13_weight_scale_fp32
 del layer.w2_weight_scale
+del layer.w2_weight_scale_fp32
 torch.npu.empty_cache()