[0.18.0][Bugfix] Restore VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT to original value for nightly test (#8794)
### What this PR does / why we need it?

PR #8618 renamed `VLLM_NIXL_ABORT_REQUEST_TIMEOUT` to `VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT` and simultaneously reduced the timeout value from 300000 to 480 seconds in the nightly test configs. The 480s value is far too short for heavy multi-node workloads (DeepSeek V3/R1 under W8A8 + EP), causing [spurious abort-request timeouts](https://github.com/vllm-project/vllm-ascend/actions/runs/25067539406/job/73441223206) in CI.

This PR restores the timeout value to the original 300000 to fix the nightly test failures introduced by #8618.

Signed-off-by: hfadzxy <starmoon_zhang@163.com>
@@ -13,7 +13,7 @@ env_common:
   HCCL_DETERMINISTIC: True
   TASK_QUEUE_ENABLE: 1
   HCCL_OP_RETRY_ENABLE: "L0:0, L1:0"
-  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 480
+  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 300000
 
 disaggregated_prefill:
   enabled: true

@@ -15,7 +15,7 @@ env_common:
   ASCEND_TRANSPORT_PRINT: 1
   ACL_OP_INIT_MODE: 1
   ASCEND_A3_ENABLE: 1
-  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 480
+  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 300000
   VLLM_ENGINE_READY_TIMEOUT_S: 1800
   HCCL_CONNECT_TIMEOUT: 1200
   HCCL_INTRA_PCIE_ENABLE: 1

@@ -15,7 +15,7 @@ env_common:
   ASCEND_TRANSPORT_PRINT: 1
   ACL_OP_INIT_MODE: 1
   ASCEND_A3_ENABLE: 1
-  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 480
+  VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT: 300000
   VLLM_ENGINE_READY_TIMEOUT_S: 1800
   HCCL_CONNECT_TIMEOUT: 1200
   HCCL_INTRA_PCIE_ENABLE: 1
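A regression like the one fixed here (the value silently dropping to 480 in the nightly configs) could be caught by a small check over the config text. The following is only a sketch under stated assumptions: the helper names `find_timeouts` and `check_config` are hypothetical and not part of vllm-ascend, the expected value is taken from this commit, and the key is assumed to appear as a plain `KEY: <integer>` line, so a regular expression is used instead of a full YAML parser.

```python
import re

# Assumption: the value restored by this commit is the one the nightly configs should carry.
EXPECTED_TIMEOUT = 300000
KEY = "VLLM_MOONCAKE_ABORT_REQUEST_TIMEOUT"

# Matches "KEY: <integer>" on a single line, allowing indentation,
# optional quotes around the number, and a trailing comment.
_PATTERN = re.compile(
    r"^[ \t]*" + re.escape(KEY) + r"[ \t]*:[ \t]*[\"']?(\d+)[\"']?[ \t]*(?:#.*)?$",
    re.MULTILINE,
)


def find_timeouts(config_text: str) -> list[int]:
    """Return every integer value assigned to KEY in the given config text."""
    return [int(m.group(1)) for m in _PATTERN.finditer(config_text)]


def check_config(config_text: str, expected: int = EXPECTED_TIMEOUT) -> list[str]:
    """Return human-readable problems; an empty list means the config passes."""
    values = find_timeouts(config_text)
    if not values:
        return [f"{KEY} is not set"]
    return [f"{KEY} is {v}, expected {expected}" for v in values if v != expected]
```

Calling `check_config` on the text of each nightly config and failing when the returned list is non-empty would flag both a missing key and an unexpected value. An exact-match comparison is used here for simplicity; a lower-bound check may fit better if different workloads are meant to use different timeouts.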