Files
xc-llm-ascend/vllm_ascend/eplb/utils.py

53 lines
1.9 KiB
Python
Raw Permalink Normal View History

Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# This file is a part of the vllm-ascend project.
#
# Todo: Once https://github.com/vllm-project/vllm/pull/23553 is merged in vllm. Remove this model register.
import types
import torch
def get_expert_map(self, layer_id):
return self.model.layers[layer_id].mlp.experts.expert_map
Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
def get_log2phy_map(self, layer_id):
return self.model.layers[layer_id].mlp.experts.get_log2phy_map()
def get_all_moe_loads(self):
[EPLB] Avoiding eplb's dependency on a specified model (#6528) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\n**Deep learning is essentially this, but for computers.** It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI).** It uses artificial **neural networks** with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. **Artificial Intelligence (AI):** The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives computers the ability to learn from data *without* being explicitly programmed for every single rule.\n3. **Deep Learning (DL):** A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
num_dense_layers = getattr(self.model.config, "first_k_dense_replace", 0)
num_layers = self.model.config.num_hidden_layers
Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
all_moe_loads = torch.stack(
[EPLB] Avoiding eplb's dependency on a specified model (#6528) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\n**Deep learning is essentially this, but for computers.** It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI).** It uses artificial **neural networks** with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. **Artificial Intelligence (AI):** The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives computers the ability to learn from data *without* being explicitly programmed for every single rule.\n3. **Deep Learning (DL):** A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
[self.model.layers[layer_id].mlp.experts.moe_load for layer_id in range(num_dense_layers, num_layers)],
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
dim=0,
Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
)
return all_moe_loads
def clear_all_moe_loads(self):
[EPLB] Avoiding eplb's dependency on a specified model (#6528) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\n**Deep learning is essentially this, but for computers.** It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI).** It uses artificial **neural networks** with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. **Artificial Intelligence (AI):** The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives computers the ability to learn from data *without* being explicitly programmed for every single rule.\n3. **Deep Learning (DL):** A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
num_dense_layers = getattr(self.model.config, "first_k_dense_replace", 0)
num_layers = self.model.config.num_hidden_layers
for layer_id in range(num_dense_layers, num_layers):
self.model.layers[layer_id].mlp.experts.clear_moe_load()
Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
[EPLB] Avoiding eplb's dependency on a specified model (#6528) ### What this PR does / why we need it? 1. Currently, eplb registers different attributes for different models, but these attributes are not actually used. Now, these attributes are directly deleted. 2. Add some log about eplb. ### Does this PR introduce _any_ user-facing change? ### How was this patch tested? #### Deepseek v3.1 chat Of course! Here is a comprehensive explanation of deep learning, broken down for clarity.\n\n### The Simple Analogy: A Child Learning to Recognize a Cat\n\nImagine teaching a child what a cat is. You don't give them a rulebook with instructions like \"has pointy ears, whiskers, and a tail.\" Instead, you show them many pictures, saying \"this is a cat\" or \"this is not a cat.\" The child's brain gradually learns to identify the complex patterns—the combination of shapes, colors, and textures—that define \"cat-ness.\"\n\n**Deep learning is essentially this, but for computers.** It's a method for teaching computers to learn from examples and recognize patterns directly from data (like images, sound, or text) without being explicitly programmed with rigid rules.\n\n---\n\n### The Technical Definition\n\n**Deep Learning is a subfield of machine learning, which itself is a subfield of artificial intelligence (AI).** It uses artificial **neural networks** with many layers (\"deep\" networks) to model and understand complex patterns in data.\n\nHere are the key concepts in that definition:\n\n1. **Artificial Intelligence (AI):** The broad science of making machines smart and capable of performing tasks that typically require human intelligence.\n2. **Machine Learning (ML):** A subset of AI that gives computers the ability to learn from data *without* being explicitly programmed for every single rule.\n3. **Deep Learning (DL):** A specific, powerful - vLLM version: v0.15.0 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.15.0 Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-02-10 15:58:44 +08:00
def model_register(model):
Dynamic Expert Load Balance with Zero-like-overhead (#2956) ### Motivation Currently dynamically experts balancing would stop-the-world. Asynchronously expert load balancing would be better without flowing problems: Host-bound latency: There are many cpu operations during EPLB such as eplb-algorithm、creating p2p ops、and log2phy expert converting would spend long cpu time, as ~1s. Communication latency: The transfer time would cost much in the situation without nvlink. As the weight of an expert maybe transfer to multiple new positions, thus N times send/recv for one expert, with result long latency. We had tested that batch_isend_irecv cost more 100ms for 16 experts weight transmission in A2 server of ascend. SwiftBalancer would not stop-the-world anymore, in out test on NPU 1~2ms cost for each layer while benefit 5ms-8ms decode latency with ep_size = 64. The following updates have been made: 1、expert distribution recording with lower cost. 2、async cpu computing for eplb algo and other python operator. 3、new eplb algo with less expert rebalancing while almost the same effect. ### Proposed Change We will gradually migrate the EPLB logic to the VLLM community and implement a generalized design. Relevant RFC: https://github.com/vllm-project/vllm/issues/22246 The overall workflow involves: <img width="801" height="302" alt="474430541-23b06f58-23bc-44a3-a1be-00f268aeb15c" src="https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed" /> 1. Record experts distribution during forward. We using expert_token_num after disptach instead of topk_ids, thus we got much smaller tensor shape to reduce cost of hbm recording and add-operator. 2. Do all-gather for experts distribution. Using all-gather instead of all-reduce as less traffic volume. 3. Wake up eplb worker process with experts distribution when num_iterations comes. Run eplb algorithm in eplb worker. 4. Generate p2p send/recv ops and other operator such as log2phy would cost long cpu time. 5. Lanch ibatch_send_recv in async_stream before forward. 6. After forward, wait for the ibatch_send_recv finish, then do uapte expert map and expert weights. ### Co-author Co-authored-by: raindaywhu raindaywhu@raindaywhu@ 163.con Co-authored-by: njuyuan yuanjl19@smail.nju.edu.cn Co-authored-by: qmkakaxi wjh1594260677@qq.com Co-authored-by: Skywalker-EP 173723846@qq.com - vLLM version: v0.10.2 - vLLM main: https://github.com/vllm-project/vllm/commit/567939953b7a9cb0ded6bf0bb21a76917b8fed97 --------- Signed-off-by: offline0806 <z00858301@china.huawei.com> Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00
model.get_expert_map = types.MethodType(get_expert_map, model)
model.get_log2phy_map = types.MethodType(get_log2phy_map, model)
model.get_all_moe_loads = types.MethodType(get_all_moe_loads, model)
model.clear_all_moe_loads = types.MethodType(clear_all_moe_loads, model)