【EPLB】Redundant Experts Bugfix (#4232)

### What this PR does / why we need it?
This PR fixes the calculation logic for redundant experts: the correct number
of redundant experts is now derived from the expert map itself, so the
redundant-expert parameter no longer needs to be set when an expert map is
supplied.
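To make the arithmetic concrete, here is a minimal sketch (all numbers are
illustrative; the variable names mirror the diff below): for a placement map
of shape [layers, ranks, experts_per_rank], the redundant-expert count is
fully determined by the map's shape and the logical expert count, so a
separate parameter adds nothing.

```python
# Illustrative sizes: 4 ranks with 5 physical expert slots per rank,
# i.e. 20 slots per layer hosting 16 logical experts.
num_experts = 16       # logical experts in the model
ranks_num = 4
experts_per_rank = 5   # len(expert_map_tensor[0][0]) in the code

# The map shape alone determines how many slots are redundant copies:
redundant = experts_per_rank * ranks_num - num_experts
assert redundant == 4  # no separate redundancy parameter needed
```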

### Does this PR introduce _any_ user-facing change?
After configuring the path for the expert map (expert_map_path), users no
longer need to configure init_redundancy_expert.

### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Author: LI SHENGYONG
Date: 2025-12-03 12:00:05 +08:00
Committed by: GitHub
Parent: b6d63bbd52
Commit: 593a96056c
9 changed files with 45 additions and 65 deletions


```diff
@@ -8,12 +8,14 @@ import torch.distributed as dist
 
 
 class ExpertLoadBalancer(object):
-    def __init__(self, expert_map_path, global_expert_num):
+    def __init__(self, expert_map_path, num_experts):
         self.expert_map_path = expert_map_path
-        self.global_expert_num = global_expert_num
+        self.num_experts = num_experts
         self.tensor_data = []
         self.expert_map_tensor, self.layers_num, self.ranks_num = (
             self._expert_file_to_tensor())
+        self.global_expert_num = num_experts + self.get_global_redundant_expert_num(
+        )
         self.expert_placement_map = self.generate_expert_placement_map()
 
     def _expert_file_to_tensor(self):
```
```diff
@@ -96,7 +98,7 @@ class ExpertLoadBalancer(object):
     def get_global_redundant_expert_num(self):
         global_redundant_expert_num = (
             len(self.expert_map_tensor[0][0]) * self.ranks_num -
-            self.global_expert_num)
+            self.num_experts)
         return global_redundant_expert_num
 
     def check_expert_map_tensor(self):
```
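Putting the two hunks together, here is a condensed, runnable sketch of the
fixed logic (the class is trimmed to the pieces shown above, and a random
tensor stands in for the map that _expert_file_to_tensor would load; all
sizes are illustrative):

```python
import torch


class ExpertLoadBalancer:
    """Minimal sketch of the fixed initialization order, not the full class."""

    def __init__(self, expert_map_tensor, num_experts):
        # Stand-in for _expert_file_to_tensor(): shape [layers, ranks, slots].
        self.expert_map_tensor = expert_map_tensor
        self.layers_num, self.ranks_num, _ = expert_map_tensor.shape
        self.num_experts = num_experts
        # The fix: load the map first, then derive the global count from it.
        self.global_expert_num = (num_experts +
                                  self.get_global_redundant_expert_num())

    def get_global_redundant_expert_num(self):
        # Physical slots across all ranks minus logical experts = redundancy.
        return (len(self.expert_map_tensor[0][0]) * self.ranks_num -
                self.num_experts)


# 2 layers, 4 ranks, 5 slots per rank -> 20 slots hosting 16 logical experts.
expert_map = torch.randint(0, 16, (2, 4, 5))
balancer = ExpertLoadBalancer(expert_map, num_experts=16)
print(balancer.get_global_redundant_expert_num())  # 4
print(balancer.global_expert_num)                  # 20
```

Before this fix, the subtraction in get_global_redundant_expert_num used
global_expert_num, so the result depended on the caller pre-adding the
redundancy; subtracting num_experts instead makes the map the single source
of truth.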