Commit Graph

21 Commits

Author SHA1 Message Date
SILONG ZENG
4e53c1d900 [Lint]Style: Convert vllm-ascend/ to ruff format(Batch #6) (#6001)
### What this PR does / why we need it?
Convert the following files under `vllm_ascend/` to ruff format:
| File Path |
| :--- |
| `vllm_ascend/eplb/adaptor/abstract_adaptor.py` |
| `vllm_ascend/eplb/adaptor/vllm_adaptor.py` |
| `vllm_ascend/eplb/core/eplb_device_transfer_loader.py` |
| `vllm_ascend/eplb/core/eplb_utils.py` |
| `vllm_ascend/eplb/core/eplb_worker.py` |
| `vllm_ascend/eplb/core/policy/policy_abstract.py` |
| `vllm_ascend/eplb/core/policy/policy_default_eplb.py` |
| `vllm_ascend/eplb/core/policy/policy_factory.py` |
| `vllm_ascend/eplb/core/policy/policy_flashlb.py` |
| `vllm_ascend/eplb/core/policy/policy_random.py` |
| `vllm_ascend/eplb/core/policy/policy_swift_balancer.py` |
| `vllm_ascend/eplb/eplb_updator.py` |
| `vllm_ascend/eplb/utils.py` |
| `vllm_ascend/model_loader/netloader/executor/elastic_load.py` |
| `vllm_ascend/model_loader/netloader/executor/netloader_pg.py` |
| `vllm_ascend/model_loader/netloader/interaction/elastic.py` |
| `vllm_ascend/model_loader/netloader/load.py` |
| `vllm_ascend/model_loader/netloader/netloader.py` |
| `vllm_ascend/model_loader/netloader/utils.py` |
| `vllm_ascend/patch/platform/__init__.py` |
| `vllm_ascend/patch/platform/patch_balance_schedule.py` |
| `vllm_ascend/patch/platform/patch_ec_connector.py` |
| `vllm_ascend/patch/platform/patch_mamba_config.py` |
| `vllm_ascend/patch/platform/patch_multiproc_executor.py` |
| `vllm_ascend/patch/platform/patch_sched_yield.py` |


- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

---------

Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
LI SHENGYONG
8210a62a44 [EPLB][Bugfix]Reduce unnecessary video memory usage (#6020)
### What this PR does / why we need it?
1. Incorporate the EPLB warm-up into the profile run.
2. Reuse the same gather buffer (a sketch follows below).
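A minimal sketch of the buffer-reuse idea, assuming a hypothetical `GatherBuffer` helper (not the actual vllm-ascend code): allocate once at profile-run size, then hand out views instead of allocating per gather.

```python
import math
import torch

class GatherBuffer:
    """Hypothetical helper: one backing buffer, allocated once (e.g.
    during the profile run) and reused for every layer's expert-weight
    gather instead of allocating a fresh tensor per call."""

    def __init__(self) -> None:
        self._buf: torch.Tensor | None = None

    def get(self, shape: tuple[int, ...], dtype: torch.dtype,
            device: str = "cpu") -> torch.Tensor:
        numel = math.prod(shape)
        if (self._buf is None or self._buf.numel() < numel
                or self._buf.dtype != dtype):
            self._buf = torch.empty(numel, dtype=dtype, device=device)
        return self._buf[:numel].view(shape)
```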

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen3-235B AIME baseline:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

With EPLB enabled, the OOM issue does not occur:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 2c24bc6996

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-23 14:21:13 +08:00
LI SHENGYONG
83de5385b4 [EPLB][Bugfix] policy_swift_balancer bugfix and renaming (#5897)
### What this PR does / why we need it?
1. Rename dynamic_ep to default_eplb.
2. Rename dynamic_ep_v2 to swift_balancer.
3. Discard the function compose_expert_update_info_bipartite.

- vLLM version: v0.13.0
- vLLM main: bde38c11df

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-19 05:47:40 +00:00
LI SHENGYONG
9fed2636cb [EPLB][Nightly][Bugfix] Get expert from moe layer only (#5908)
### What this PR does / why we need it?
1. If the model has dense layers, the current code attempts to obtain the
routing experts of the dense layers as well, which causes an error. Fix
this by skipping the dense layers when collecting the routing experts
(see the sketch below).
2. Directly outputting the global_expert_map from the function affects
the performance of DeepSeek V3.2.
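A minimal sketch of the skip, assuming hypothetical model attributes (decoder layers exposing `mlp.experts` only on MoE layers); the real vllm-ascend adaptor differs:

```python
def collect_routed_experts(model):
    """Collect routed-expert modules layer by layer, skipping dense
    layers that expose no `experts` attribute (hypothetical names)."""
    experts_per_layer = {}
    for idx, layer in enumerate(model.model.layers):
        mlp = getattr(layer, "mlp", None)
        if mlp is None or not hasattr(mlp, "experts"):
            continue  # dense layer: no routed experts to collect
        experts_per_layer[idx] = mlp.experts
    return experts_per_layer
```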
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

DeepSeek V3.1 conversation is normal.

#### aime precision test (dsv3.1)
Baseline without EPLB:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 66.67 |

With EPLB:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 70.00 |

- vLLM version: v0.13.0
- vLLM main: 11b6af5280

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-19 09:23:28 +08:00
LI SHENGYONG
da958ee386 [EPLB]Eplb Config Renaming (#5533)
### What this PR does / why we need it?
1. Rename num_iterations_eplb_update to expert_heat_collection_interval.
2. Rename num_wait_worker_iterations to algorithm_execution_interval.
3. Rename init_redundancy_expert to num_redundant_experts because the
variable with the same meaning in vLLM is named this way.
4. Delete gate_eplb because we don't need this feature.
5. Move eplb config into a dict in additional config.
6. Depends on PR 5817.

### Does this PR introduce _any_ user-facing change?

Before this PR:
`--additional-config '{"dynamic_eplb":true,
"num_iterations_eplb_update": 4000, "num_wait_worker_iterations": 150,
"init_redundancy_expert": 16, "expert_map_path": "xxx.json"}'`

After this PR:
`--additional-config
'{"eplb_config":{"dynamic_eplb":true,"expert_heat_collection_interval":4000,
"algorithm_execution_interval":150,"num_redundant_experts": 16,
"expert_map_path": "xxx.json"}}'`

### How was this patch tested?

#### Test: Qwen3-235B EPLB with num_redundant_experts=16

Without PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 83.33 |

With PR 5817:
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-15 10:26:44 +08:00
LI SHENGYONG
ecf2fa482e [EPLB][Bugfix] Get expert map from layers (#5817)
### What this PR does / why we need it?
The initialization method of expert_map used by the eplb module is
different from that used by the fused_moe module. This PR deletes the
expert_map initialization method used by the eplb module to make the
initialization methods consistent.

#### Before bugfix

```
self._expert_map=tensor([64, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, 1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 0, 1, 2,
3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22,
23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40,
41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58,
59, 60, 61, 62, 63], device='npu:1', dtype=torch.int32)

self.shared_dict["expert_maps"][0]=tensor([-1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35,
36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53,
54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64]], dtype=torch.int32)
```

### How was this patch tested?

#### Qwen3-235B W8A8 AIME
| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| aime2024 | 604a78 | accuracy | gen | 86.67 |

- vLLM version: v0.13.0
- vLLM main: 2f4e6548ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-14 09:16:51 +08:00
Mercykid-bash
29e2f9a43e Bugfix: Align expert map shapes with redundant experts in EPLB adjustment (#5285)
#### Overview
This PR fixes a shape mismatch bug between `expert_placement_map` and
`log2phy_expert_map` when **redundant experts** are enabled in the
vLLM-Ascend platform. The issue occurred during the initialization of
expert maps and their updates via EPLB (Expert Load Balancer)
adjustment, leading to potential tensor shape errors and incorrect
expert routing in distributed MoE deployments.

#### Key Changes
1. **Unify expert map shape calculation logic**
- Ensure the shape of `expert_placement_map` and `log2phy_expert_map`
strictly aligns with the total number of experts (including redundant
experts) during initialization.
- Update the shape adjustment logic in EPLB dynamic update process to
match the initial expert map dimensions.

2. **Add shape consistency checks**
- Add assertion statements to verify the shape consistency of the two
maps after initialization and EPLB adjustment, preventing silent shape
mismatches in subsequent operations (a sketch follows this list).
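A hedged sketch of the kind of check described, with hypothetical names; both maps are sized by the physical expert count (logical plus redundant):

```python
def check_expert_map_shapes(expert_placement_map, log2phy_expert_map,
                            num_logical_experts, num_redundant_experts):
    """Assert both maps cover every physical expert slot, so later EPLB
    updates cannot silently drift apart (hypothetical shape layout)."""
    num_physical = num_logical_experts + num_redundant_experts
    assert expert_placement_map.shape[-1] == num_physical, (
        f"placement map covers {expert_placement_map.shape[-1]} slots, "
        f"expected {num_physical}")
    assert log2phy_expert_map.shape[-1] == num_physical, (
        "log2phy map must stay aligned with the placement map")
```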

#### Impact
- Resolves tensor shape errors when using redundant experts with EPLB on
Ascend platform.
- Ensures correct expert routing and load balancing for MoE models with
redundant expert configurations.
- No breaking changes to existing functionality; compatible with
non-redundant expert deployments.

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: shenchuxiaofugui <1311027364@qq.com>
2026-01-06 17:22:36 +08:00
LI SHENGYONG
bdc721d35a [smoke][bugfix] moe_init_routing_v2 active_expert_range use int type (#5521)
### What this PR does / why we need it?
The float kernel of moe_init_routing_v2 in the dispatch allgather
operation does not support tensor format for active_expert_range; it
only supports int.
PR 5311 unified the variables local_num_experts and
self.local_num_experts by using self.local_num_experts consistently,
which caused the subsequent integer-typed parameter to be converted to a
tensor (see the illustration below).
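A hedged illustration of the fix (values hypothetical): coerce the count back to a plain Python int before building active_expert_range.

```python
import torch

# After PR 5311, the count could arrive as a 0-d tensor:
local_num_experts = torch.tensor(8)
ep_rank = 1

# The float kernel of moe_init_routing_v2 only accepts ints here:
n = int(local_num_experts)
active_expert_range = [ep_rank * n, (ep_rank + 1) * n]  # [8, 16]
```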

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
| dataset | metric | ground truth | measured |
| :--- | :--- | :--- | :--- |
| gsm8k | exact_match,strict-match | 0.89 | 0.8939 |
| gsm8k | exact_match,flexible-extract | 0.85 | 0.856 |
| ceval-valid | acc,none | 0.84 | 0.8373 |

Model parameters:
`{'pretrained': 'Qwen/Qwen3-30B-A3B', 'tensor_parallel_size': 2, 'dtype': 'auto', 'trust_remote_code': False, 'max_model_len': 4096, 'gpu_memory_utilization': 0.6, 'enable_expert_parallel': True}`

- vLLM version: v0.13.0
- vLLM main: 45c1ca1ca1

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-12-31 09:19:04 +08:00
LI SHENGYONG
f81cf694b2 [EPLB][refactor] Modification of the initialization logic for expert_map and log2phy(depend on pr5285) (#5311)
### What this PR does / why we need it?
Unify the loading logic for expert_map and log2phy.
1. The map generated when redundant experts are enabled is incorrect.
The community's map-generation function only accepts the number of
global experts. When we pass in the number of logical experts plus
redundant experts, the local expert ID of the last card indexes an
expert ID that does not exist. Now we ensure that every index points to
a real, existing expert ID and that each expert can be accessed.
Moreover, when redundant experts are not enabled, the output of our
function remains consistent with the community's function.
2. The map we generate is based on the length of the physical experts,
but in reality we only need the length of the logical experts. Since we
pad it later anyway, we can simply generate a map with the length of the
logical experts.
3. Unify the initialization logic across different scenarios and
simplify the code for fused_moe.

**Before refactoring**

-   map path is not None:

expert map: get_rank_placement_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

log2phy: get_rank_log2phy_map from _'expert_load_balancer.py'_,
maintains the map for all ranks and all layers.

-   map path is None:

expert map: determine_expert_map from _'vllm.layer'_, which does not
support the redundant experts of vllm-ascend.
log2phy: determine_default_log2phy_map from _'eplb_utils.py'_, which
does not support the redundant experts of vllm-ascend either.

**Refactoring**
- eplb_utils.py
  - init_eplb_config
    - generate placement
    - generate expert map
    - generate log2phy
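A toy sketch of the improved generation described in point 1 above (round-robin placement, with redundant slots wrapped back onto real logical IDs via modulo); hypothetical code, not the actual init_eplb_config:

```python
import numpy as np

def build_rank_expert_map(num_logical: int, num_redundant: int,
                          ep_size: int, rank: int) -> np.ndarray:
    """Map logical expert id -> local slot on `rank` (-1 if absent).
    Redundant slots wrap onto real ids, so every index points at an
    expert that actually exists and each expert stays reachable."""
    num_physical = num_logical + num_redundant
    per_rank = num_physical // ep_size
    expert_map = np.full(num_logical, -1, dtype=np.int32)
    start = rank * per_rank
    for local_id in range(per_rank):
        expert_map[(start + local_id) % num_logical] = local_id
    return expert_map

# ep size 16, 256 experts, 16 redundant: rank 15 also hosts a replica of
# expert 255; with 0 redundant experts it matches the community layout.
print(build_rank_expert_map(256, 16, 16, 15))
```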

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Expert Mapping Test Generation:

```
ep size: 16, num of experts: 256, num of redundant experts: 16
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1  0  1  2  3  4  5  6  7  8
  9 10 11 12 13 14 15 16]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[16 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]

ep size: 16, num of experts: 256, num of redundant experts: 0
+++++++++++++++++++++++++++++++++++++++++
Expert Mapping (Non-1 indicates the expert responsible for this rank)
for Rank 15:
vllm map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
+++++++++++++++++++++++++++++++++++++++++
Improved map:
[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
```

DeepSeek R1 baseline:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |

DeepSeek R1 with EPLB:

| dataset | version | metric | mode | vllm-api-general-chat |
|----- | ----- | ----- | ----- | -----|
| gsm8k-lite | 7cd45e | accuracy | gen | 100.00 |


- vLLM version: release/v0.13.0
- vLLM main: 5fbfa8d9ef

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
Co-authored-by: weijinqian0 <1184188277@qq.com>
2025-12-29 09:26:14 +08:00
wangxiyuan
492173cf89 [Misc] Cleanup useless print and logger (#5220)
1. Remove useless print statements.
2. Use the vLLM logger (see the sketch below).
3. Demote noisy INFO logs to DEBUG level.
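A small sketch of the pattern, assuming vLLM's `vllm.logger.init_logger` helper:

```python
from vllm.logger import init_logger

logger = init_logger(__name__)

def report_expert_map_update(expert_map) -> None:
    # Before: print(f"expert map updated: {expert_map}")
    # After: routed through the vLLM logger, demoted from INFO to DEBUG.
    logger.debug("expert map updated: %s", expert_map)
```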

- vLLM version: release/v0.13.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-22 11:28:26 +08:00
Mercykid-bash
84b9d38e28 BugFix: Resolve PolicyFlashlb warm up function attribute error (#4741)
## Description
Fix the AttributeError caused by incorrect invocation of the warm-up
function in the FlashLB algorithm:
1. **Root Cause**: The warm-up function for FlashLB is defined outside
the `PolicyFlashlb` class (not a class method), but the code incorrectly
attempted to call it via the `PolicyFlashlb` class instance.
2. **Key Fix**: Clarify the invocation rule for FlashLB: when selecting
the FlashLB algorithm, the warm-up function must be called in advance to
precompile and warm up the algorithm (invoked as a standalone function),
instead of calling it through the `PolicyFlashlb` class.
3. **Impact**: Resolve the runtime error when using FlashLB, ensure the
algorithm pre-compilation/warm-up process works as expected, and avoid
attribute missing exceptions.
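A minimal sketch of the invocation rule with hypothetical names (the real module layout may differ): the warm-up helper lives at module level, so it must be called standalone rather than through the class.

```python
# policy_flashlb.py -- hypothetical layout illustrating the rule

def flashlb_warm_up() -> None:
    """Module-level warm-up: precompile/warm the FlashLB kernels."""
    ...

class PolicyFlashlb:
    def rebalance_experts(self, expert_hotness):
        ...

# Correct: warm up via the standalone function, then use the policy.
flashlb_warm_up()
policy = PolicyFlashlb()
# policy.flashlb_warm_up()  # AttributeError: PolicyFlashlb has no such method
```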

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
2025-12-12 14:55:26 +08:00
wangxiyuan
b89763f1ed [CI] speed up ut (#4901)
Avoid model downloads to speed up the UT tests.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
2025-12-11 18:45:43 +08:00
LI SHENGYONG
019c7ded91 eplb redundant expert bugfix (#4291)
### What this PR does / why we need it?
Redundant experts bugfix
### Does this PR introduce _any_ user-facing change?
After configuring the expert_map path, users do not need to configure
init_redundancy_expert.
### How was this patch tested?
The accuracy of EPLB was tested with and without the use of redundant
experts.


- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: shenchuxiaofugui <1311027364@qq.com>
2025-11-21 14:24:35 +08:00
欧派果奶我还要
c848da0687 [Bugfix] fix nightly multi-node EPLB tests' "DYNAMIC_EPLB=true" environment not working (#4223)
### What this PR does / why we need it?
Fix the nightly multi-node EPLB tests by adjusting the dynamic_eplb gate
check in vllm_ascend/eplb/core/eplb_utils.py (see the sketch below).
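A hedged sketch of the kind of gate the fix adjusts (only the DYNAMIC_EPLB variable name comes from the test setup; the parsing is hypothetical):

```python
import os

def dynamic_eplb_enabled() -> bool:
    # Accept common truthy spellings so "DYNAMIC_EPLB=true" takes effect.
    return os.getenv("DYNAMIC_EPLB", "0").strip().lower() in ("1", "true", "yes")
```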
### Does this PR introduce _any_ user-facing change?
no

- vLLM version: v0.11.0
- vLLM main: 2918c1b49c

---------

Signed-off-by: 白永斌 <baiyongbin3@h-partners.com>
Signed-off-by: 欧派果奶我还要 <47294568+845473182@users.noreply.github.com>
Co-authored-by: 白永斌 <baiyongbin3@h-partners.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
2025-11-19 21:31:58 +08:00
offline893
14ca1e5cb2 [CI]Fix oom of deepseek-eplb nigtly test. (#3884)
### What this PR does / why we need it?
Fix the OOM in the deepseek-eplb nightly test.

- vLLM version: v0.11.0rc3
- vLLM main: 83f478bb19

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-30 10:18:07 +08:00
offline893
e916265b2b [CI]Add EPLB CI. (#3568)
### What this PR does / why we need it?
1. Add an EPLB CI job to catch regressions in the EPLB feature.
2. Add parameter checking for the EPLB params (a sketch follows below).
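A hedged sketch of the sort of parameter check added (config keys taken from the EPLB commits above; the concrete checks are hypothetical):

```python
def check_eplb_params(additional_config: dict) -> None:
    """Reject obviously invalid EPLB settings before engine start-up."""
    if not additional_config.get("dynamic_eplb", False):
        return
    if additional_config.get("num_iterations_eplb_update", 1) <= 0:
        raise ValueError("num_iterations_eplb_update must be positive")
    if additional_config.get("init_redundancy_expert", 0) < 0:
        raise ValueError("init_redundancy_expert cannot be negative")
```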
### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Qwen in A3.


- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-21 22:58:02 +08:00
yechao237
4750d45d86 [BugFix]Support redundant experts in EPLB (#3473)
This PR adds support for redundant experts in the EPLB. 

Key points: 
- Use global_num_experts = num_experts + num_redundant_experts
consistently.
- Backward compatible when num_redundant_experts=0. 

Tested on a 16-rank setup (W8A8) with static EPLB and expert_map_path,
verifying the router logits shape and successful requests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: yechao237 <yechao20180411@gmail.com>
2025-10-18 00:09:16 +08:00
Mercykid-bash
ecb1713dfc Bugfix: Expose the user policy type interface (#3336)
This PR primarily focuses on two key changes:
1. Adjusts internal interface calls to optimize the interaction logic
between related modules.
2. Exposes an interface that allows users to select the EPLB algorithm,
enabling more flexible configuration based on specific usage scenarios.

These changes aim to enhance the usability of the system while ensuring
the stability of internal operations. Relevant unit tests have been
updated to cover the modified logic.
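A hedged usage sketch (the `eplb_policy_type` key is hypothetical; `additional_config` is vLLM's engine argument for platform-specific options, as in the commits above):

```python
from vllm import LLM

# Pick the EPLB algorithm at start-up via the exposed policy interface.
llm = LLM(
    model="Qwen/Qwen3-30B-A3B",
    additional_config={
        "dynamic_eplb": True,
        "eplb_policy_type": 2,  # hypothetical key selecting a registered policy
    },
)
```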

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
2025-10-11 16:28:57 +08:00
offline893
82b6c846ca [BugFix]Fix eplb problems when using dynamic eplb. (#3364)
### What this PR does / why we need it?
When using dynamic EPLB, the weight transfer can block on NZ-format
tensors. We fix this by cloning the source and receive tensors (a sketch
follows below).
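A hedged sketch of the clone-before-transfer idea (function and call sites hypothetical): copy into fresh contiguous buffers so the P2P ops never touch NZ-format storage.

```python
import torch
import torch.distributed as dist

def exchange_expert_weight(weight: torch.Tensor, peer: int) -> torch.Tensor:
    """Swap one expert's weight with `peer` inside an initialized
    process group, cloning both sides to plain contiguous buffers."""
    send_buf = weight.clone().contiguous()   # detach from NZ-format storage
    recv_buf = torch.empty_like(send_buf)
    reqs = dist.batch_isend_irecv([
        dist.P2POp(dist.isend, send_buf, peer),
        dist.P2POp(dist.irecv, recv_buf, peer),
    ])
    for req in reqs:
        req.wait()
    return recv_buf
```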

### Does this PR introduce any user-facing change?

### How was this patch tested?
Qwen3_moe in A3.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

---------

Signed-off-by: offline0806 <3337230449@qq.com>
Co-authored-by: offline0806 <3337230449@qq.com>
2025-10-11 14:04:02 +08:00
Mercykid-bash
29c173ab48 FlashLB algorithm (#3042)
## Purpose
This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel balancing algorithm: FlashLB.

## Motivation
1. The default algorithm adopts a two-stage greedy strategy: 
a. Replica allotment: Determine the number of expert replicas by
minimizing the maximum load per replica (Min Max Replica, MMR).
b. Replica placement: Distribute replicas across devices by repeatedly
assigning the heaviest replica to the least loaded device (Longest
Processing Time First, LPT).

However, this sequential process lacks inter-stage collaborative
optimization, often leading to suboptimal load balancing. For example,
consider the following simple case (reproduced in the code sketch after
this list): given 8 logical experts with hotness values of 600, 560,
120, 120, 20, 10, 10, 10, and 2 replicas allocated per device across 8
devices, the EPLB algorithm yields a maximum per-device hotness of 232,
while our proposed FlashLB algorithm reduces this value to 205.

2. The default algorithm relies on the averaged expert hotness over a
fixed time window for optimization. While this provides a coarse
approximation of the hotness distribution, it fails to capture
oscillatory deviations and temporal correlations of expert hotness
observed across iterations in real-world scenarios, limiting
optimization quality.

3. The default algorithm periodically regenerates the expert placement
table. However, it generates the table for each individual layer, and
the new table does not account for correlations with the previous one;
these two factors collectively lead to nearly full-scale expert
reassignment.
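A toy reproduction of the default two-stage greedy on the numbers from point 1 (MMR then LPT, as described above; not the actual EPLB implementation). It reproduces the stated maximum per-device hotness of 232:

```python
import heapq

hotness = [600, 560, 120, 120, 20, 10, 10, 10]
num_devices, slots_per_device = 8, 2
total_replicas = num_devices * slots_per_device

# Stage a (MMR): grant extra replicas to the expert whose per-replica
# load is currently the highest, until all replica slots are used.
counts = [1] * len(hotness)
for _ in range(total_replicas - len(hotness)):
    i = max(range(len(hotness)), key=lambda e: hotness[e] / counts[e])
    counts[i] += 1

# Stage b (LPT): place replicas heaviest-first onto the least-loaded
# device that still has a free slot (2 slots per device here).
replicas = sorted((hotness[e] / counts[e]
                   for e in range(len(hotness))
                   for _ in range(counts[e])), reverse=True)
heap = [(0.0, d, slots_per_device) for d in range(num_devices)]
heapq.heapify(heap)
full = []
for load in replicas:
    while True:
        dev_load, d, free = heapq.heappop(heap)
        if free:
            break
        full.append((dev_load, d, free))
    heapq.heappush(heap, (dev_load + load, d, free - 1))

print(max(l for l, _, _ in heap + full))  # 232.0 -- vs. 205 for FlashLB
```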

## FlashLB Algorithm Principle
1. Joint Optimization
FlashLB achieves joint optimization of replica allotment and placement
through group-based decision-making. Each group gradually determines the
replica count and placement for a subset of experts, ensuring that the
expected inter-device load balance (considering both deployed and
pending expert replicas) is holistically optimized. To attain superior
load balancing, FlashLB employs tree search to expand the solution space
while integrating pruning and precompilation techniques for
acceleration, thereby delivering load balancing that is both
high-quality and practically efficient.

2. Multi-Shot Enhancement
FlashLB partitions each profiling interval (e.g., 1024 iterations) into
consecutive smaller sub-intervals (e.g., 16 iterations), each capturing
independent hotness measurements. It then performs multi-shot
optimization to co-optimize these sub-intervals simultaneously—enabling
adaptation to time-variant expert hotness while enhancing robustness.

3. Incremental Adjustment
To reduce the overhead of frequent expert re-deployment, FlashLB
introduces an incremental adjustment scheme operating at both
inter-layer and intra-layer levels:
a. Inter-Layer: Hotness variations are tracked at the layer level. Only
layers with fluctuations exceeding a predefined threshold trigger
re-computation of expert placement, avoiding unnecessary redeployment
for stable layers;
b. Intra-Layer (Optional): A lightweight incremental LPT algorithm
(LPT-Incremental) is applied. Instead of recomputing full placement for
all experts in a layer, it selectively adjusts only the hottest experts
or those with replica count changes, further reducing migration
overhead.

This incremental strategy significantly reduces adjustment costs while
maintaining balanced performance across layers and devices.

## Co-author:

Co-authored-by: Skywalker-EP <173723846@qq.com>

- vLLM version: v0.10.2
- vLLM main: 9607d5eb44

---------

Signed-off-by: sdmyzlp <lrwei2@petalmail.com>
Signed-off-by: Che Ruan <cr623@ic.ac.uk>
Signed-off-by: Shanshan Shen <87969357+shen-shanshan@users.noreply.github.com>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: 22dimensions <waitingwind@foxmail.com>
Signed-off-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Signed-off-by: hfadzxy <starmoon_zhang@163.com>
Signed-off-by: Lucas Kabela <lucaskabela@meta.com>
Signed-off-by: wangli <wangli858794774@gmail.com>
Signed-off-by: MengqingCao <cmq0113@163.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Icey <1790571317@qq.com>
Signed-off-by: linfeng-yuan <1102311262@qq.com>
Signed-off-by: dependabot[bot] <support@github.com>
Signed-off-by: tangtianyi <tangtianyi4@huawei.com>
Signed-off-by: Angazenn <supperccell@163.com>
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: rjg-lyh <1318825571@qq.com>
Signed-off-by: Pr0Wh1teGivee <calvin_zhu0210@outlook.com>
Signed-off-by: fems14 <1804143737@qq.com>
Co-authored-by: sdmyzlp <117554856+sdmyzlp@users.noreply.github.com>
Co-authored-by: Che Ruan <cr623@ic.ac.uk>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Yikun Jiang <yikunkero@gmail.com>
Co-authored-by: 22dimensions <waitingwind@foxmail.com>
Co-authored-by: zhanghw0354 <zhanghaiwencmss@139.com>
Co-authored-by: zhanghaiwen <zhanghaiwen@cmss.chinamobile.com>
Co-authored-by: zhangxinyuehfad <59153331+zhangxinyuehfad@users.noreply.github.com>
Co-authored-by: Lucas Kabela <lucasakabela@gmail.com>
Co-authored-by: Li Wang <wangli858794774@gmail.com>
Co-authored-by: MengqingCao <cmq0113@163.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Icey <1790571317@qq.com>
Co-authored-by: linfeng-yuan <1102311262@qq.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: tianyitang <tangtianyi4@huawei.com>
Co-authored-by: Angazenn <supperccell@163.com>
Co-authored-by: Yizhou <136800916+yiz-liu@users.noreply.github.com>
Co-authored-by: rjg-lyh <83491835+rjg-lyh@users.noreply.github.com>
Co-authored-by: weichen <132029610+Pr0Wh1teGivee@users.noreply.github.com>
Co-authored-by: weijinqian0 <12153182+weijinqian0@users.noreply.github.com>
Co-authored-by: fems14 <74094523+fems14@users.noreply.github.com>
2025-09-23 10:27:14 +08:00
offline893
76844eec78 Dynamic Expert Load Balance with Zero-like-overhead (#2956)
### Motivation
Currently, dynamic expert balancing stops the world. Asynchronous expert
load balancing is preferable and avoids the following problems:

Host-bound latency: EPLB involves many CPU operations, such as running
the EPLB algorithm, creating P2P ops, and converting log2phy experts,
which together can take a long time on the CPU (~1 s).
Communication latency: the transfer time is high without NVLink. Because
the weight of one expert may be transferred to multiple new positions, a
single expert needs N send/recv calls, resulting in long latency. We
measured batch_isend_irecv taking more than 100 ms to transmit 16
experts' weights on an Ascend A2 server.

SwiftBalancer no longer stops the world: in our tests on NPU it costs
1-2 ms per layer while saving 5-8 ms of decode latency at ep_size = 64.
The following updates have been made:
1. Expert-distribution recording with lower cost.
2. Asynchronous CPU computation for the EPLB algorithm and other Python
operators.
3. A new EPLB algorithm that rebalances fewer experts with almost the
same effect.
### Proposed Change
We will gradually migrate the EPLB logic to the vLLM community and
implement a generalized design. Relevant RFC:
https://github.com/vllm-project/vllm/issues/22246
The overall workflow involves (a code sketch follows the list):
![SwiftBalancer workflow](https://github.com/user-attachments/assets/1d73a459-1b23-4b0a-812a-bf0a75debfed)
1. Record the expert distribution during forward. We use expert_token_num
after dispatch instead of topk_ids, so the recorded tensor is much
smaller, reducing the cost of HBM recording and the add operator.
2. All-gather the expert distribution. All-gather is used instead of
all-reduce because it generates less traffic.
3. When num_iterations arrives, wake up the EPLB worker process with the
expert distribution and run the EPLB algorithm in the worker.
4. Generate the P2P send/recv ops in the worker, along with other
long-running CPU work such as log2phy conversion.
5. Launch batch_isend_irecv on an async stream before forward.
6. After forward, wait for batch_isend_irecv to finish, then update the
expert map and expert weights.
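A hedged sketch of steps 5 and 6 (names hypothetical; `torch.npu` stream APIs are assumed analogous to `torch.cuda`): the weight transfer is launched on a side stream before forward, overlaps with it, and the new placement is applied afterwards.

```python
import torch
import torch.distributed as dist

def forward_with_overlapped_eplb(model, batch, p2p_ops, side_stream,
                                 apply_new_placement):
    """Hypothetical flow: async expert-weight transfer overlapped with
    the forward pass, then the expert map/weights update (step 6)."""
    with torch.npu.stream(side_stream):           # torch.cuda.stream off Ascend
        reqs = dist.batch_isend_irecv(p2p_ops)    # step 5: async send/recv
    output = model(batch)                         # forward overlaps the transfer
    for req in reqs:
        req.wait()                                # step 6: wait for transfer
    apply_new_placement()                         # then swap map + weights
    return output
```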
### Co-author
Co-authored-by: raindaywhu <raindaywhu@163.com>
Co-authored-by: njuyuan <yuanjl19@smail.nju.edu.cn>
Co-authored-by: qmkakaxi <wjh1594260677@qq.com>
Co-authored-by: Skywalker-EP <173723846@qq.com>


- vLLM version: v0.10.2
- vLLM main: 567939953b

---------

Signed-off-by: offline0806 <z00858301@china.huawei.com>
Co-authored-by: offline0806 <z00858301@china.huawei.com>
2025-09-17 10:36:43 +08:00