Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous
expert load balancing is better because it avoids the following
problems:
Host-bound latency:
EPLB involves many CPU operations, such as the EPLB algorithm,
creating p2p ops, and log2phy expert conversion, which can take a long
CPU time (~1s).
Communication latency: Transfer time is high in setups without NVLink.
The weights of one expert may be transferred to multiple new
positions, requiring N send/recv ops per expert and resulting in long
latency. In our tests, batch_isend_irecv took more than 100ms to
transfer the weights of 16 experts on an Ascend A2 server.
SwiftBalancer no longer stops the world: in our tests on NPU it costs
1~2ms per layer while improving decode latency by 5ms-8ms with ep_size =
64.
The following updates have been made:
1. Expert distribution recording with lower cost.
2. Async CPU computation for the EPLB algorithm and other Python
operators.
3. A new EPLB algorithm that rebalances fewer experts with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves:
<img width="801" height="302"
alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c"
src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed"
/>
1. Record expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, so the recorded tensor is much
smaller, reducing the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of
all-reduce since it has less traffic volume.
3. Wake up the EPLB worker process with the expert distribution when
num_iterations arrives, and run the EPLB algorithm in the worker.
4. Generate p2p send/recv ops and other operators such as log2phy in
the worker, since they would otherwise cost long CPU time.
5. Launch batch_isend_irecv on an async stream before forward.
6. After forward, wait for the batch_isend_irecv to finish, then
update the expert map and expert weights.
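The log2phy conversion and p2p plan generation in step 4 can be sketched in plain Python. `build_log2phy` and `plan_transfers` are hypothetical names for illustration; the real logic runs inside the EPLB worker on the actual expert maps:

```python
def build_log2phy(phy2log, num_logical_experts):
    """Invert a physical-slot -> logical-expert placement: log2phy[e] lists
    the physical slots currently serving logical expert e (a hot expert may
    be replicated into several slots)."""
    log2phy = [[] for _ in range(num_logical_experts)]
    for phy_slot, log_id in enumerate(phy2log):
        log2phy[log_id].append(phy_slot)
    return log2phy

def plan_transfers(old_phy2log, new_phy2log):
    """For every slot whose resident expert changes, emit one (src, dst)
    send/recv pair, sourcing from any slot that already holds the expert.
    An expert gaining N new slots therefore needs N send/recv ops, which
    is why this step is worth moving off the critical path."""
    old_log2phy = build_log2phy(old_phy2log, max(new_phy2log) + 1)
    return [(old_log2phy[log_id][0], dst)
            for dst, log_id in enumerate(new_phy2log)
            if old_phy2log[dst] != log_id]

# Rebalance: slot 2 drops expert 2 and takes a replica of hot expert 0.
print(plan_transfers([0, 1, 2, 3], [0, 1, 0, 3]))  # [(0, 2)]
```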
### Co-author
Co-authored-by: raindaywhu raindaywhu@163.com
Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn
Co-authored-by: qmkakaxi wjh1594260677@qq.com
Co-authored-by: Skywalker-EP 173723846@qq.com
- vLLM version: v0.10.2
- vLLM main:
567939953b
---------
Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
45
docs/source/user_guide/feature_guide/eplb_swift_balancer.md
Normal file
@@ -0,0 +1,45 @@
# Swift Balancer

## Overview

Expert rebalancing of MoE models is a mandatory option for LLM serving. Changing experts dynamically in a stop-the-world manner has a negative impact on TTFT and TPOT.

Asynchronous expert load balancing is a better choice.

We have launched SwiftBalancer to support dynamic expert load balancing with zero-overhead expert movement.

## Design

![alt text](./images/eplb_img.png)

The overall workflow involves:
1. Record expert distribution during forward. We use expert_token_num after dispatch instead of topk_ids, so the recorded tensor is much smaller, reducing the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of all-reduce since it has less traffic volume.
3. Wake up the EPLB worker process with the expert distribution when num_iterations arrives, and run the EPLB algorithm in the worker.
4. Generate p2p send/recv ops and other operators such as log2phy in the worker, since they would otherwise cost long CPU time.
5. Launch batch_isend_irecv on an async stream before forward.
6. After forward, wait for the batch_isend_irecv to finish, then update the expert map and expert weights.

Our profiling shows that expert weight transfer is hidden in the bubble between forward iterations. The CPU cost of the EPLB algorithm and of other Python operators such as log2phy is likewise hidden by the EPLB worker process.
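The hand-off that hides this CPU cost follows a generic background-worker pattern, sketched below with stdlib primitives (a thread and a toy "algorithm" for brevity; the real design uses a separate worker process and the actual EPLB algorithm):

```python
import threading, queue

def eplb_worker(inbox, outbox):
    # Runs the expensive rebalancing computation off the critical path.
    while True:
        load = inbox.get()
        if load is None:  # shutdown signal
            break
        # Placeholder "algorithm": rank experts hottest-first.
        plan = sorted(range(len(load)), key=lambda e: -load[e])
        outbox.put(plan)

inbox, outbox = queue.Queue(), queue.Queue()
t = threading.Thread(target=eplb_worker, args=(inbox, outbox), daemon=True)
t.start()

inbox.put([10, 50, 20, 5])  # step 3: wake the worker with the distribution
# ... forward iterations keep running here while the worker computes ...
plan = outbox.get()         # some iterations later: consume the finished plan
inbox.put(None)
t.join()
print(plan)  # experts ranked hottest-first: [1, 2, 0, 3]
```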
## Config Params

Currently Swift Balancer saves about 5ms of TPOT with ep_size 64, while costing less than 2ms per layer for expert movement.
We add new parameters for EPLB:

- "dynamic_eplb": true -- enable dynamic EPLB.
- "num_iterations_eplb_update": 400 -- number of forward iterations after which EPLB begins.
- "gate_eplb": true -- EPLB updates only once; false by default.
- "num_wait_worker_iterations": 30 -- number of forward iterations within which the EPLB worker is expected to finish its CPU task. In our tests the default value of 30 covers most cases.
- "expert_map_record_path" -- when dynamic EPLB completes, save the current expert load heatmap to the specified path.
- "init_redundancy_expert" -- specify redundant experts during initialization.
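These parameters arrive as a JSON string via --additional-config. A minimal sketch of parsing and defaulting them (`parse_eplb_config` and the validation logic are illustrative assumptions, not vllm-ascend's actual code; the keys and defaults are the ones listed above):

```python
import json

EPLB_DEFAULTS = {
    "dynamic_eplb": False,
    "num_iterations_eplb_update": 400,
    "gate_eplb": False,
    "num_wait_worker_iterations": 30,
    "expert_map_record_path": None,
    "init_redundancy_expert": 0,
}

def parse_eplb_config(additional_config: str) -> dict:
    """Merge user-supplied JSON over the defaults, rejecting unknown keys."""
    cfg = {**EPLB_DEFAULTS, **json.loads(additional_config)}
    unknown = set(cfg) - set(EPLB_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown EPLB keys: {unknown}")
    return cfg

cfg = parse_eplb_config('{"dynamic_eplb": true, "num_iterations_eplb_update": 400}')
print(cfg["dynamic_eplb"], cfg["num_wait_worker_iterations"])  # True 30
```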
## Examples

### Dynamic eplb

Enable dynamic EPLB and specify the trigger iterations:

--additional-config '{"dynamic_eplb": true, "num_iterations_eplb_update": 400, "gate_eplb": true, "num_wait_worker_iterations": 30}'

### Record expert map for static eplb

Specify the path where the expert map for static EPLB initialization will be recorded:

--additional-config '{"expert_map_record_path": "/xx/xx.json", "init_redundancy_expert": 16, "dynamic_eplb": true, "num_iterations_eplb_update": 400, "gate_eplb": true, "num_wait_worker_iterations": 30}'

### Static eplb

If an expert map has been recorded, enable static EPLB with the expert map path:

--additional-config '{"expert_map_path": "/xx/xx.json"}'
BIN
docs/source/user_guide/feature_guide/images/eplb_img.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 55 KiB
@@ -10,4 +10,5 @@ quantization
sleep_mode
structured_output
lora
eplb_swift_balancer
:::