[Main] [Patch] support balance scheduling patch (#5212)

### Motivation.

**Limitations of the current vLLM v1 scheduling strategy**
vLLM v1 scheduling currently enables chunked prefill by default, which
processes prefill and decode requests together in a single scheduling
step. This can reduce overall system throughput and performance in some
scenarios.

Balance scheduling addresses this issue by synchronizing the number of
requests in the running queues across all schedulers and delaying the
scheduling of new requests, thereby improving the overall system's
steady-state decoding time. This achieves:
- ✅ Adding `balance_gather` to the scheduler synchronizes the number of
requests in the running queues across DP ranks.
- ✅ Balance scheduling improves the decode steady-state time, thereby
increasing the overall output throughput of the inference system.


### Proposed Change.

**1. Feature Overview**

In the vLLM scheduler, running requests (i.e., requests whose prefill
computation has already started) have the highest priority, followed by
waiting requests (i.e., requests that have not yet been computed).


As shown in the diagram above, when the inference system leaves its
steady state, the scheduler schedules a batch of new requests for
prefill and then synchronizes across the data-parallel (DP) ranks. As a
result, DP ranks that are running decode only have to synchronize with
the number of prefill tokens on the other ranks. Frequent prefill
scheduling on some DP ranks can therefore degrade the overall output
throughput of the system.
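The effect can be pictured with a small sketch. It assumes, as a simplification, that in each step every data-parallel (DP) rank ends up processing as many tokens as the busiest rank. The function name and the token counts are illustrative only and are not taken from the patch.

```python
def synchronized_step_tokens(tokens_per_rank: list[int]) -> list[int]:
    """Assumption: all DP ranks are synchronized to the busiest rank's token count."""
    busiest = max(tokens_per_rank)
    return [busiest] * len(tokens_per_rank)


# Illustrative numbers: four ranks decoding 128 tokens each ...
decode_only = synchronized_step_tokens([128, 128, 128, 128])
# ... versus one rank picking up a 3500-token prefill in the same step.
with_prefill = synchronized_step_tokens([128, 128, 128, 3500])

print(decode_only)
print(with_prefill)
```

Under this assumption a single prefill on one rank sets the step cost for every rank, which is why scattered prefills hurt decode throughput.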

Balance scheduling synchronizes the number of requests in the running
queues across the DP ranks, and schedules new requests for prefill only
when every scheduler has fewer running requests than the maximum
number of running requests.
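The gating rule can be sketched as a pure function. This is a minimal illustration, not the patch's code: `may_schedule_new_requests` and its parameters are hypothetical names, and the per-rank counts are passed in as a plain list instead of being collected across ranks by `balance_gather`.

```python
def may_schedule_new_requests(running_counts: list[int], max_running: int) -> bool:
    """Allow new prefills only if every DP rank is below its running-request limit.

    running_counts: number of running requests on each DP rank (hypothetical input;
                    in the real scheduler this would come from a cross-rank gather).
    max_running:    maximum number of running requests per scheduler.
    """
    return all(count < max_running for count in running_counts)


# Every rank has free slots, so new requests may be scheduled.
print(may_schedule_new_requests([100, 120, 127], max_running=128))
# One rank is full, so all ranks hold back new prefills and keep decoding.
print(may_schedule_new_requests([100, 128, 127], max_running=128))
```

Holding back prefills while any rank is full keeps all ranks in decode for longer, at the cost of new requests waiting slightly longer before their first token.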

**2. Implementation Design**

**3. Experiment Results**
- Fixed-length input scenario: In the performance test scenario with
3.5K fixed-length input and 1.5K fixed-length output, the throughput
performance was improved by approximately **18%** after adding balance
scheduling.

| Method | Model | Input Len | Request Count | Output Len | BatchSize | Average TTFT | Average TPOT | e2e duration | Input Token Throughput | Output Token Throughput | Request Throughput |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 6600 | 86.85 | 591.9s | 3030.5 | 1297.3 | 0.86 |
| Balance scheduling | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 7012 | 70.63 | 501.7s | 3575.7 | 1530.7 | 1.02 |
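The reported improvement can be checked directly from the table. The sketch below computes the relative change of each metric against the baseline. The helper `percent_change` is a hypothetical name, and the numbers are copied from the two table rows.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# (baseline, balance scheduling) pairs taken from the table above.
metrics = {
    "Output Token Throughput": (1297.3, 1530.7),
    "Input Token Throughput": (3030.5, 3575.7),
    "Average TPOT": (86.85, 70.63),
    "e2e duration": (591.9, 501.7),
    "Average TTFT": (6600.0, 7012.0),
}

changes = {name: percent_change(base, new) for name, (base, new) in metrics.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.1f}%")
```

Both throughput figures rise by about 18%, while average TPOT and end-to-end duration fall. Average TTFT rises by roughly 6%, which is consistent with new requests being delayed until every scheduler has capacity.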

**4. Demo PR**

[#29721](https://github.com/vllm-project/vllm/pull/29721)

---------

Signed-off-by: GDzhu01 <809721801@qq.com>

Zhu Yi Lin, 2025-12-23 09:04:38 +08:00, committed by GitHub

parent f883a2edb9

commit 3d04ae8e7d

4 changed files with 703 additions and 0 deletions

									

vllm_ascend/envs.py
									
@@ -144,6 +144,9 @@ env_variables: Dict[str, Callable[[], Any]] = {
     # with W8A8, non-dynamic-eplb. And MTP layer must be W8A8.
     "VLLM_ASCEND_ENABLE_FUSED_MC2":
     lambda: int(os.getenv("VLLM_ASCEND_ENABLE_FUSED_MC2", '0')),
+    # Whether to enable balance scheduling
+    "VLLM_ASCEND_BALANCE_SCHEDULING":
+    lambda: bool(int(os.getenv("VLLM_ASCEND_BALANCE_SCHEDULING", '0'))),
 }
 # end-env-vars-definition
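Balance scheduling is switched on through the `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. The sketch below shows one way to parse a `0`/`1` flag of this kind. `env_flag` is a hypothetical helper and not part of `vllm_ascend`. The conversion through `int()` matters because Python treats any non-empty string, including `'0'`, as true.

```python
import os


def env_flag(name: str, default: str = "0") -> bool:
    """Read a 0/1 environment variable as a boolean (hypothetical helper)."""
    # int() first: bool("0") would be True, since "0" is a non-empty string.
    return bool(int(os.getenv(name, default)))


os.environ["VLLM_ASCEND_BALANCE_SCHEDULING"] = "1"
enabled = env_flag("VLLM_ASCEND_BALANCE_SCHEDULING")

os.environ["VLLM_ASCEND_BALANCE_SCHEDULING"] = "0"
disabled = env_flag("VLLM_ASCEND_BALANCE_SCHEDULING")

print(enabled, disabled)
```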
