### What this PR does / why we need it?

This PR introduces a "balance scheduling" feature, enabled by the `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. When enabled, the scheduling logic is adjusted to better balance load across data-parallel workers, preventing a single worker from blocking scheduling for the others, which can improve overall throughput.

Additionally, this PR includes a number of other updates and fixes that sync the scheduler with a more recent version of the upstream vLLM scheduler:

- Handling for the paused scheduler state.
- Support for Mamba block-aligned splits.
- Handling for streaming requests.
- Refinements to preemption logic and resource management (KV cache, encoder cache).
- General code refactoring for clarity and correctness.

Fixes #

### Does this PR introduce _any_ user-facing change?

Yes. This PR introduces a new feature controlled by the `VLLM_ASCEND_BALANCE_SCHEDULING` environment variable. When enabled, the scheduling behavior changes, which can affect performance and request throughput.

### How was this patch tested?

CI passed. Further testing should be done to validate the performance and correctness of the new scheduling logic under various workloads, with and without the feature flag enabled.

Signed-off-by: GDzhu01 <809721801@qq.com>
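Feature flags like `VLLM_ASCEND_BALANCE_SCHEDULING` are typically read from the environment as loosely-typed booleans. The sketch below illustrates that common parsing pattern; the `flag_enabled` helper is hypothetical and not part of vllm-ascend's actual `envs` module:

```python
import os

# Hypothetical helper: treat "1"/"true" (case-insensitive) as enabled,
# mirroring how boolean environment flags are commonly parsed.
def flag_enabled(name: str, default: str = "0") -> bool:
    return os.getenv(name, default).lower() in ("1", "true")

# Example: opt in to the balance-scheduling behavior before starting vLLM.
os.environ["VLLM_ASCEND_BALANCE_SCHEDULING"] = "1"
print(flag_enabled("VLLM_ASCEND_BALANCE_SCHEDULING"))  # True
```

In practice the flag would be exported in the shell (e.g. `export VLLM_ASCEND_BALANCE_SCHEDULING=1`) before launching the server, so the patch module is imported at startup.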
```python
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
# This file is a part of the vllm-ascend project.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

import vllm_ascend.patch.platform.patch_distributed  # noqa
import vllm_ascend.patch.platform.patch_fusion_matcher_compat_ops  # noqa
import vllm_ascend.patch.platform.patch_kv_cache_interface  # noqa
from vllm_ascend import envs
from vllm_ascend.utils import is_310p

if not is_310p():
    import vllm_ascend.patch.platform.patch_mamba_config  # noqa
else:
    import vllm_ascend.patch.platform.patch_mamba_config_310  # noqa
import vllm_ascend.patch.platform.patch_minimax_m2_config  # noqa
import vllm_ascend.patch.platform.patch_sched_yield  # noqa
import vllm_ascend.patch.platform.patch_torch_accelerator  # noqa

if os.getenv("DYNAMIC_EPLB", "false").lower() in ("true", "1") or os.getenv("EXPERT_MAP_RECORD", "false") == "true":
    import vllm_ascend.patch.platform.patch_multiproc_executor  # noqa

if envs.VLLM_ASCEND_BALANCE_SCHEDULING:
    import vllm_ascend.patch.platform.patch_balance_schedule  # noqa
```
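The module above gates each monkey-patch import behind a runtime check (hardware variant or environment flag), so a patch is only applied when its feature is opted into. A minimal, self-contained sketch of that conditional-import pattern follows; the flag name and the registry are hypothetical, and stdlib modules stand in for real patch modules:

```python
import importlib
import os

# Hypothetical registry: feature flag -> module to import when the flag is set.
# Stdlib modules stand in for real patch modules in this sketch.
_PATCHES = {
    "ENABLE_FAST_PATH": "json",
}

def apply_patches() -> list:
    """Import each patch module whose flag is enabled; return loaded names."""
    loaded = []
    for flag, module_name in _PATCHES.items():
        if os.getenv(flag, "false").lower() in ("true", "1"):
            loaded.append(importlib.import_module(module_name).__name__)
    return loaded

os.environ["ENABLE_FAST_PATH"] = "true"
print(apply_patches())  # ['json']
```

Importing for side effects (with `# noqa` to silence unused-import warnings) is a common way to apply such patches exactly once at package import time, since Python caches modules in `sys.modules`.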