Files

64 lines
1.9 KiB
Python
Raw Permalink Normal View History

[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
#
# Copyright (c) 2025 Huawei Technologies Co., Ltd. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import os
import socket
[Lint]Add lint hooks for clang-format, shellcheck, forbidden imports, and boolean context manager checks (#7511) ### What this PR does / why we need it? This PR introduces several upstream `vllm`-aligned lint hooks into `vllm-ascend` and makes them part of the actual `pre-commit` flow. Main changes in this PR: - add `check-boolean-context-manager` to catch boolean expressions in `with` statements - add `check-forbidden-imports` to forbid direct `re` imports and disallowed direct `triton` imports - enable shell script linting through `tools/shellcheck.sh` - add root `.clang-format` aligned with upstream `vllm`, enable `clang-format` in `pre-commit`, temporarily **exclude all `csrc/**`** from `clang-format` to avoid bringing a large native code reformat into this PR This PR focuses on landing the smaller and immediately useful lint alignment first, without mixing in the larger requirements-management migration. ### Does this PR introduce _any_ user-facing change? No. This PR only updates repository lint configuration, static checks, and internal import/style enforcement. It does not change runtime behavior or public interfaces. ### How was this patch tested? Tested locally in the project virtual environment. Commands used: ```bash bash format.sh ``` Verified checks passed: ``` bash ruff check...............................................................Passed ruff format..............................................................Passed codespell................................................................Passed typos....................................................................Passed clang-format.............................................................Passed Lint GitHub Actions workflow files.......................................Passed Lint shell scripts.......................................................Passed Lint PNG exports from excalidraw.........................................Passed Check for spaces in all filenames........................................Passed Enforce __init__.py in Python packages...................................Passed Check for forbidden imports..............................................Passed Check for boolean ops in with-statements.................................Passed Suggestion...............................................................Passed - hook id: suggestion - duration: 0s To bypass pre-commit hooks, add --no-verify to git commit. ``` **note:** clang-format is enabled but currently excludes all csrc/** - vLLM version: v0.17.0 - vLLM main: https://github.com/vllm-project/vllm/commit/8b6325758cce5f9c36d38f2462edbd368b97a07c --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-03-24 20:03:01 +08:00
import regex as re
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
from vllm.logger import logger
def find_free_port():
"""
Finds a free port on the local machine.
Returns:
- A free port number.
"""
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
s.bind(("", 0))
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
return s.getsockname()[1]
def is_valid_path_prefix(path_prefix):
"""
Checks if the provided path prefix is valid.
Parameters:
- path_prefix: The path prefix to validate.
Returns:
- True if the path prefix is valid, otherwise False.
"""
if not path_prefix:
return False
if re.search(r'[<>:"|?*]', path_prefix):
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
logger.warning(f"The path prefix {path_prefix} contains illegal characters.")
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
return False
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
if path_prefix.startswith("/") or path_prefix.startswith("\\"):
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
if not os.path.exists(os.path.dirname(path_prefix)):
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
logger.warning(f"The directory for the path prefix {os.path.dirname(path_prefix)} does not exist.")
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
return False
else:
if not os.path.exists(os.path.dirname(os.path.abspath(path_prefix))):
logger.warning(
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
f"The directory for the path prefix {os.path.dirname(os.path.abspath(path_prefix))} does not exist."
[Misc] Add a model loader that utilizes HCCL for weight loading (#2888) ### What this PR does / why we need it? This PR introduces a new model loader called Netloader, which leverages high-bandwidth P2P direct transfer between NPU cards to achieve weight loading. Netloader is implemented as a plugin through the newly added 'register_model_loader' function in vLLM 0.10. It facilitates the process of weight loading by sending weights from a pre-loaded model (server) to an empty model of a newly started instance (client). The server operates concurrently with normal inference tasks through sub-threads and the 'stateless_init_torch_distributed_process_group' in vLLM. The client initiates a transfer request after verifying that the model and partitioning method are the same as the server's, and uses HCCL's collective communication (send/recv) to load the weights in the order they are stored in the model. Application Scenarios: 1. Significantly Reduces Inference Instance Startup Time By reusing the weights of already loaded instances and performing high-speed transfers directly between computing cards, this method reduces model loading latency compared to traditional remote/local pull methods. 2. Reduces Network and Storage Pressure Avoids the need to repeatedly download weight files from remote repositories, reducing the impact on centralized storage and network traffic, thereby enhancing overall system stability and service quality. 3. Improves Resource Utilization and Reduces Costs Accelerating the loading process reduces reliance on redundant computing pools, allowing computing resources to be elastically scaled and reclaimed as needed. 4. Enhances Business Continuity and High Availability In fault recovery scenarios, new instances can quickly take over existing services, avoiding prolonged business interruptions and improving the system's high availability and user experience. ### Does this PR introduce _any_ user-facing change? Netloader utilizes the existing --load-format=netloader and --model-loader-extra-config to be activated. The model-loader-extra-config needs to be input as a JSON string (as it is now) Afterwards, you can check whether the outputs for the same sentence are consistent when the temperature is set to 0. Signed-off-by: destinysky <kangrui10@126.com> - vLLM version: v0.11.0rc3 - vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0 --------- Signed-off-by: destinysky <kangrui10@126.com>
2025-10-23 15:56:07 +08:00
)
return False
[Lint]Style: Convert `vllm-ascend/` to ruff format(Batch #6) (#6001) ### What this PR does / why we need it? | File Path | | :--- | | ` vllm_ascend/eplb/adaptor/abstract_adaptor.py` | | ` vllm_ascend/eplb/adaptor/vllm_adaptor.py` | | ` vllm_ascend/eplb/core/eplb_device_transfer_loader.py` | | ` vllm_ascend/eplb/core/eplb_utils.py` | | ` vllm_ascend/eplb/core/eplb_worker.py` | | ` vllm_ascend/eplb/core/policy/policy_abstract.py` | | ` vllm_ascend/eplb/core/policy/policy_default_eplb.py` | | ` vllm_ascend/eplb/core/policy/policy_factory.py` | | ` vllm_ascend/eplb/core/policy/policy_flashlb.py` | | ` vllm_ascend/eplb/core/policy/policy_random.py` | | ` vllm_ascend/eplb/core/policy/policy_swift_balancer.py` | | ` vllm_ascend/eplb/eplb_updator.py` | | ` vllm_ascend/eplb/utils.py` | | ` vllm_ascend/model_loader/netloader/executor/elastic_load.py` | | ` vllm_ascend/model_loader/netloader/executor/netloader_pg.py` | | ` vllm_ascend/model_loader/netloader/interaction/elastic.py` | | ` vllm_ascend/model_loader/netloader/load.py` | | ` vllm_ascend/model_loader/netloader/netloader.py` | | ` vllm_ascend/model_loader/netloader/utils.py` | | ` vllm_ascend/patch/platform/__init__.py` | | ` vllm_ascend/patch/platform/patch_balance_schedule.py` | | ` vllm_ascend/patch/platform/patch_ec_connector.py` | | ` vllm_ascend/patch/platform/patch_mamba_config.py` | | ` vllm_ascend/patch/platform/patch_multiproc_executor.py` | | ` vllm_ascend/patch/platform/patch_sched_yield.py` | - vLLM version: v0.13.0 - vLLM main: https://github.com/vllm-project/vllm/commit/2c24bc6996cb165fce92f780b388a5e39b3f4060 --------- Signed-off-by: MrZ20 <2609716663@qq.com>
2026-01-24 22:08:33 +08:00
return True