This pull request introduces a new capability to monitor the health of
NPU cards directly from the Worker class. This enhancement allows for
proactive detection of NPU issues by executing the npu-smi command,
improving system reliability and operational visibility within the
vllm_ascend worker environment.
- vLLM version: v0.15.0
- vLLM main:
13397841ab
---------
Signed-off-by: liziyu <liziyu16@huawei.com>
Signed-off-by: wangxiaoteng <wangxiaoteng@huawei.com>
Signed-off-by: luomin2005 <luomin2005@huawei.com>
Co-authored-by: liziyu <56102866+liziyu179@users.noreply.github.com>
Co-authored-by: wangxiaoteng <wangxiaoteng@huawei.com>