[Main] [Patch] support balance scheduling patch (#5212)
### Motivation.

**Limitations of the current vLLM v1 scheduling strategy**

vLLM v1 scheduling currently enables chunked prefill by default, which processes prefill and decode requests together in a single scheduling step. In some scenarios this hurts overall system throughput and performance. Balance scheduling addresses this by synchronizing the running-queue sizes across all schedulers and delaying the scheduling of new requests, which improves the system's steady-state decoding time.

This achieves:

- ✅ Adding `balance_gather` to the scheduler synchronizes the number of requests in the running queues across DP ranks.
- ✅ Balance scheduling improves the decode steady-state time, thereby increasing the overall output throughput of the inference system.

### Proposed Change.

**1. Feature Overview**

In the vLLM scheduler, running requests (i.e., requests whose prefill computation has already started) have the highest priority, followed by waiting requests (i.e., requests that have not yet been computed).

As shown in the diagram above, when the inference system leaves its steady state, a scheduler schedules a batch of new requests for prefill and then synchronizes with the other data-parallel (DP) ranks. DP ranks that are doing nothing but decoding are then forced to synchronize on the prefill token count of the rank that is prefilling. When some DP ranks schedule prefill frequently, the output throughput of the whole system degrades.

Balance scheduling synchronizes the number of requests in the running queue across the DP ranks, and schedules new requests for prefill only when every scheduler has fewer running requests than the maximum number of requests allowed.

**2. Implementation Design**

**3. Experiment Results**

- Fixed-length input scenario: in the performance test with 3.5K fixed-length input and 1.5K fixed-length output, output token throughput improved by approximately **18%** after adding balance scheduling.

| Method | Model | Input Len | Request Count | Output Len | BatchSize | Average TTFT | Average TPOT | e2e duration | Input Token Throughput | Output Token Throughput | Request Throughput |
| ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- |
| Baseline | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 6600 | 86.85 | 591.9s | 3030.5 | 1297.3 | 0.86 |
| Balance scheduling | DeepSeekV3.1 | 3500 | 512 | 1500 | 128 | 7012 | 70.63 | 501.7s | 3575.7 | 1530.7 | 1.02 |

**4. Demo PR**

[#29721](https://github.com/vllm-project/vllm/pull/29721)

---------

Signed-off-by: GDzhu01 <809721801@qq.com>
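
The rule described in the Feature Overview can be illustrated with a small, single-process sketch. It is not the code from this patch: `RankState`, `can_schedule_prefill`, `schedule_step`, and the `max_num_requests` parameter are hypothetical names, and the list comprehension only stands in for the cross-rank gather that `balance_gather` is described as performing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RankState:
    """Hypothetical per-DP-rank scheduler state (illustration only)."""
    running: int  # requests currently in this rank's running queue
    waiting: int  # requests on this rank still waiting for prefill


def can_schedule_prefill(running_per_rank: List[int],
                         max_num_requests: int) -> bool:
    """Allow new prefills only when every rank is below the limit."""
    return all(n < max_num_requests for n in running_per_rank)


def schedule_step(ranks: List[RankState], max_num_requests: int) -> List[int]:
    """Return how many waiting requests each rank admits in this step."""
    # Stand-in for the cross-rank gather: every rank sees all running counts.
    gathered = [r.running for r in ranks]
    if not can_schedule_prefill(gathered, max_num_requests):
        # At least one rank is full: all ranks keep decoding, no new prefill.
        return [0] * len(ranks)
    admitted = []
    for r in ranks:
        n = min(r.waiting, max_num_requests - r.running)
        r.running += n
        r.waiting -= n
        admitted.append(n)
    return admitted
```

With a limit of 4, ranks holding 4 and 2 running requests admit nothing, because one rank is already full. Ranks holding 3 and 2 running requests both admit new work in the same step, so prefill happens on all ranks together rather than interrupting ranks that are only decoding.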
@@ -144,6 +144,9 @@ env_variables: Dict[str, Callable[[], Any]] = {
     # with W8A8, non-dynamic-eplb. And MTP layer must be W8A8.
     "VLLM_ASCEND_ENABLE_FUSED_MC2":
     lambda: int(os.getenv("VLLM_ASCEND_ENABLE_FUSED_MC2", '0')),
+    # Whether to enable balance scheduling
+    "VLLM_ASCEND_BALANCE_SCHEDULING":
+    lambda: bool(int(os.getenv("VLLM_ASCEND_BALANCE_SCHEDULING", '0'))),
 }
 
 # end-env-vars-definition
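
A note on parsing boolean flags from the environment: `os.getenv` returns a string, and every non-empty string is truthy in Python, so `bool('0')` evaluates to `True`. The sketch below shows one way to read a 0/1 flag safely. The helper name `read_flag` is hypothetical and not part of this patch.

```python
import os


def read_flag(name: str, default: str = "0") -> bool:
    """Read a 0/1 environment flag; hypothetical helper for illustration."""
    # Convert to int first: bool("0") is True because the string is non-empty.
    return bool(int(os.getenv(name, default)))


if __name__ == "__main__":
    os.environ["VLLM_ASCEND_BALANCE_SCHEDULING"] = "0"
    print(bool(os.getenv("VLLM_ASCEND_BALANCE_SCHEDULING", "0")))  # truthy string
    print(read_flag("VLLM_ASCEND_BALANCE_SCHEDULING"))             # parsed as int
```

Values that are not integers (for example `true`) raise `ValueError` in this sketch, which makes misconfiguration visible instead of silently enabling the feature.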