[CI] Fix server start failure during long weight loading (#7098)
### What this PR does / why we need it?
When loading large models (e.g., 163 shards), weight loading can exceed
the default 600s timeout. Engine startup then times out with the error:
```shell
TimeoutError: Timed out waiting for engines to send initial message on input socket.
```
This PR raises `VLLM_ENGINE_READY_TIMEOUT_S` to 1800 in the CI workflow environment to avoid this.
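Outside CI, the same override can be applied when launching the server from a script. The following is a minimal sketch, not part of this PR: the helper name `env_with_ready_timeout` is hypothetical, and only the variable name `VLLM_ENGINE_READY_TIMEOUT_S` and the value 1800 come from this change.

```python
import os
from typing import Mapping, Optional


def env_with_ready_timeout(
    timeout_s: int = 1800,
    base: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of an environment with the engine-ready timeout set.

    Hypothetical helper for illustration only. It copies `base`
    (or the current process environment) and does not mutate either.
    """
    if timeout_s <= 0:
        raise ValueError("timeout_s must be a positive number of seconds")
    env = dict(os.environ if base is None else base)
    # Environment variable values must be strings.
    env["VLLM_ENGINE_READY_TIMEOUT_S"] = str(timeout_s)
    return env


# Example: build an environment for a child process.
env = env_with_ready_timeout(1800, base={"PATH": "/usr/bin"})
print(env["VLLM_ENGINE_READY_TIMEOUT_S"])
```

The returned mapping can be passed as the `env=` argument of `subprocess.Popen` or `subprocess.run` when starting the server process, which keeps the override scoped to that child process rather than the whole shell session.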
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
- vLLM version: v0.16.0
- vLLM main: 4034c3d32e
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -76,6 +76,7 @@ jobs:
UV_INDEX_STRATEGY: unsafe-best-match
UV_NO_CACHE: 1
UV_SYSTEM_PYTHON: 1
VLLM_ENGINE_READY_TIMEOUT_S: 1800
steps:
- name: Check npu and CANN info
run: |
@@ -204,6 +205,7 @@ jobs:
VLLM_CI_RUNNER: ${{ inputs.runner }}
working-directory: /vllm-workspace/vllm-ascend
run: |
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
echo "Running pytest with tests path: ${{ inputs.tests }}"
pytest -sv "${{ inputs.tests }}" \
--ignore=tests/e2e/nightly/single_node/ops/singlecard_ops/test_fused_moe.py
@@ -217,6 +219,7 @@ jobs:
CONFIG_YAML_PATH: ${{ inputs.config_file_path }}
working-directory: /vllm-workspace/vllm-ascend
run: |
export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
echo "export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib" >> ~/.bashrc
echo "Running YAML-driven test with config: ${{ inputs.config_file_path }}"
pytest -sv tests/e2e/nightly/single_node/models/scripts/test_single_node.py