Mercykid-bash 132f3c5d0a Support per-step heat collection and enhance FlashLB for multi-stage load balancing (#6477)
# Feature: FlashLB algorithm

## Purpose

This Pull Request enhances the EPLB (Expert Parallelism Load Balancing)
system by introducing a novel load balancing algorithm: FlashLB.
1. The default algorithm adopts two separate sub-procedures to optimize
expert replication and placement independently:

a. **Expert Replica Allotment Sub-procedure** : Determines the number of
replicas for all experts. At each step, it greedily adds one more
replica to the expert with the highest per-replica load, aiming to
minimize load skew at the expert replica granularity (Min Max Replica,
MMR).

b. **Expert Replica Placement Sub-procedure** : Distributes all replicas
across devices. First, it sorts the generated replicas in descending
order of hotness, then iteratively places the currently hottest replica
onto the device with the lowest cumulative load and available slots.
However, this simplistic combination of two separate procedures lacks
synergy and often leads to sub-optimal load balancing. For example, in
the simple scenario illustrated below: Given 8 logical experts with
hotness values [600, 560, 120, 120, 20, 10, 10, 10], and 2 replicas
allocated per device across 8 devices, the default EPLB algorithm
results in a maximum per-device hotness of 232 (peak-average load ratio
1.28), while our proposed FlashLB algorithm reduces this value to 205
(peak-average load ratio 1.13).

<figure><img
src="https://github.com/user-attachments/assets/b9b10fab-651e-4524-9942-adbca8d044a4"
width="90%"</figure>

2. The default algorithm simply aggregates hotness measurements across
the entire profiling window. While this provides a coarse approximation
of the hotness distribution, it fails to capture the time-phased
variations and temporal correlations in expert hotness (both within and
between experts) across iterations—phenomena that have been observed in
real-world scenarios. Such single-point hotness estimation degrades the
solution quality of the load balancing algorithm.

3. The default algorithm regularly recalculates updated expert placement
results for all layers without discrimination. Considering that
excessive expert updates can impact Service Level Objectives (SLOs),
such full-scale redeployment leads to excessively high adjustment
overhead, which negatively affects end-to-end performance.

## FlashLB Algorithm Principle

### 1. Joint Optimization of Replica Allotment and Placement

FlashLB achieves joint optimization of replica allotment and placement
through a novel tree search approach, combined with carefully designed e
Fl fficient pruning and lightweight look-ahead estimation. We partition
all experts into several subsets, and for each subset, hierarchically
determine the optimal replica count and placement. Leveraging efficient
pruning and lightweight look-ahead estimation, the process consistently
aims to optimize the globally expected inter-device load balance degree
(considering both deployed and unexplored experts) while ensuring
sufficient computational efficiency. Additionally, precompilation
techniques are employed for acceleration, delivering load balancing that
is both high-quality and practically efficient.
### 2. Multi-Episode Enhancement

Instead of performing full-duration averaging like the default
algorithm, FlashLB partitions each profiling interval (e.g., 1024
iterations) into multiple consecutive smaller episodes (e.g., 16
iterations). This preserves hotness fluctuation and correlation
information. It then constructs a multi-objective optimization problem
to co-optimize these episodes simultaneously, enabling adaptability to
interleaved hotness patterns and improving statistical robustness.

### 3. Layer-wise Cherry-Picking Redeployment

To reduce the overhead of frequent expert redeployment, FlashLB
introduces a cherry-picking redeployment scheme. During each algorithmic
decision cycle, it real-time tracks load balance degree of all layers
and triggers expert placement updates only for those layers whose
peak-average ratio exceeds a predefined threshold. This avoids
unnecessary redeployment for stable layers, significantly reducing
adjustment overhead and thereby improving end-to-end performance gains.

## Co-author:

Co-authored-by: Skywalker-EP 173723846@qq.com

This PR mainly introduces two key optimizations for load balancing
scheduling:
1. **Add per-step heat collection function**:
Support real-time collection of per-step heat information during model
inference. This enables more fine-grained load balancing decisions by
taking per-step heat as the optimization target, improving scheduling
accuracy for dynamic and fluctuating workloads.

2. **Update FlashLB algorithm**:
Upgrade the FlashLB scheduling logic to better adapt to multi-stage heat
distribution scenarios. The improved algorithm can comprehensively
perceive and utilize multi-stage heat characteristics, achieving more
stable and efficient load balancing under complex expert deployment and
dynamic traffic patterns.

---------

Signed-off-by: Mercykid-bash <ruanche0218@gmail.com>
Signed-off-by: xuzewei28 <xuzewei2@h-partners.com>
Co-authored-by: xuzewei28 <xuzewei2@h-partners.com>
2026-03-12 15:49:09 +08:00
2026-03-12 14:24:49 +08:00
2025-02-05 10:53:12 +08:00
2026-01-12 11:21:31 +08:00
2025-01-29 02:44:13 -08:00

vllm-ascend

vLLM Ascend Plugin

DeepWiki

| About Ascend | Documentation | #SIG-Ascend | Users Forum | Weekly Meeting |

English | 中文


Latest News 🔥

  • [2026/02] We released the new official version v0.13.0! Please follow the official guide to start using vLLM Ascend Plugin on Ascend.
  • [2025/12] We released the new official version v0.11.0! Please follow the official guide to start using vLLM Ascend Plugin on Ascend.
  • [2025/09] We released the new official version v0.9.1! Please follow the official guide to start deploying large-scale Expert Parallelism (EP) on Ascend.
  • [2025/08] We hosted the vLLM Beijing Meetup with vLLM and Tencent! Please find the meetup slides here.
  • [2025/06] User stories page is now live! It kicks off with LLaMA-Factory/verl/TRL/GPUStack to demonstrate how vLLM Ascend assists Ascend users in enhancing their experience across fine-tuning, evaluation, reinforcement learning (RL), and deployment scenarios.
  • [2025/06] Contributors page is now live! All contributions deserve to be recorded, thanks for all contributors.
  • [2025/05] We've released the first official version v0.7.3! We collaborated with the vLLM community to publish a blog post sharing our practice: Introducing vLLM Hardware Plugin, Best Practice from Ascend NPU.
  • [2025/03] We hosted the vLLM Beijing Meetup with vLLM team! Please find the meetup slides here.
  • [2025/02] vLLM community officially created vllm-project/vllm-ascend repo for running vLLM seamlessly on the Ascend NPU.
  • [2024/12] We are working with the vLLM community to support [RFC]: Hardware pluggable.

Overview

vLLM Ascend (vllm-ascend) is a community maintained hardware plugin for running vLLM seamlessly on the Ascend NPU.

It is the recommended approach for supporting the Ascend backend within the vLLM community. It adheres to the principles outlined in the [RFC]: Hardware pluggable, providing a hardware-pluggable interface that decouples the integration of the Ascend NPU with vLLM.

By using vLLM Ascend plugin, popular open-source models, including Transformer-like, Mixture-of-Experts (MoE), Embedding, Multi-modal LLMs can run seamlessly on the Ascend NPU.

Prerequisites

  • Hardware: Atlas 800I A2 Inference series, Atlas A2 Training series, Atlas 800I A3 Inference series, Atlas A3 Training series, Atlas 300I Duo (Experimental)
  • OS: Linux
  • Software:
    • Python >= 3.10, < 3.12
    • CANN == 8.5.0 (Ascend HDK version refers to here)
    • PyTorch == 2.9.0, torch-npu == 2.9.0
    • vLLM (the same version as vllm-ascend)

Getting Started

Please use the following recommended versions to get started quickly:

Version Release type Doc
v0.16.0rc1 Latest release candidate See QuickStart and Installation for more details
v0.13.0 Latest stable version See QuickStart and Installation for more details

Contributing

See CONTRIBUTING for more details, which is a step-by-step guide to help you set up the development environment, build and test.

We welcome and value any contributions and collaborations:

Branch

vllm-ascend has a main branch and a dev branch.

  • main: main branch, corresponds to the vLLM main branch, and is continuously monitored for quality through Ascend CI.
  • releases/vX.Y.Z: development branch, created alongside new releases of vLLM. For example, releases/v0.13.0 is the dev branch for vLLM v0.13.0 version.

Below are the maintained branches:

Branch Status Note
main Maintained CI commitment for vLLM main branch and vLLM v0.16.0 tag
v0.7.1-dev Unmaintained Only doc fixes are allowed
v0.7.3-dev Maintained CI commitment for vLLM 0.7.3 version, only bug fixes are allowed, and no new release tags anymore.
v0.9.1-dev Maintained CI commitment for vLLM 0.9.1 version
v0.11.0-dev Maintained CI commitment for vLLM 0.11.0 version
releases/v0.13.0 Maintained CI commitment for vLLM 0.13.0 version
rfc/feature-name Maintained Feature branches for collaboration

Please refer to Versioning policy for more details.

Weekly Meeting

License

Apache License 2.0, as found in the LICENSE file.

Description
XC-LLM: A Specially Optimized LLM Inference Engine for ModelHub XC
Readme Apache-2.0 31 MiB
Languages
C++ 51.8%
Python 45.8%
CMake 1.1%
Shell 0.7%
C 0.5%
Other 0.1%