[CI] Add dispatch job to leverage dynamic devices (#251)
### What this PR does / why we need it?
Add a dispatch job that assigns CI jobs to dynamically selected devices. It works in two stages:

- **Stage 1: Acquire lock**
  Add a dispatch job that uses `lockfile` to acquire a lock and then picks a device number dynamically.
- **Stage 2.1: Launch container with dynamic device**
  Pass the device number via the job output and start the container job on that device.
- **Stage 2.2: Release lock**
  Once the job has started, release the lock.

The dispatch job spends roughly an extra `10s * parallel number + 30s` waiting for the other jobs to launch their containers and release the lock.

On the backend, we set up multiple self-hosted runners in separate paths to act as a load balancer:

```
$ pwd
/home/action
$ ll | grep actions
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-01
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-02
drwx------ 6 action action 4096 Mar 7 08:55 actions-runner-03
drwx------ 6 action action 4096 Mar 7 08:56 actions-runner-04
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-05
drwx------ 4 action action 4096 Jan 24 22:08 actions-runner-06
```

The runner user and its dependencies are set up with:

```
adduser -G docker action
su action
pip3 install docker prettytable
sudo yum install procmail
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
- CI passed
- E2E tested manually by triggering 3 jobs in parallel:
  - [1st job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711345757/job/38348309297) dispatched to /dev/davinci2
  - [2nd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711348739/job/38348316250) dispatched to /dev/davinci3
  - [3rd job](https://github.com/vllm-project/vllm-ascend/actions/runs/13711351493/job/38348324551) dispatched to /dev/davinci4

Signed-off-by: Yikun Jiang <yikunkero@gmail.com>
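
The dispatch flow described above (acquire a lock, pick a device, launch, release) can be sketched in Python. This is a minimal illustration under stated assumptions, not the PR's implementation: it swaps procmail's `lockfile` for `fcntl.flock`, keeps one lock file per device, and the names `acquire_device`, `dispatch_overhead_seconds`, and the lock directory layout are all hypothetical. It is Linux/Unix only because of `fcntl`.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def acquire_device(lock_dir, device_ids):
    """Yield the first device id whose lock file can be locked exclusively.

    Hypothetical helper: the PR uses procmail's `lockfile`; this sketch
    uses fcntl.flock on one lock file per device instead.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for dev in device_ids:
        path = os.path.join(lock_dir, f"davinci{dev}.lock")
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
        try:
            # Non-blocking: fail immediately if another dispatcher holds it.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            os.close(fd)
            continue  # device is being dispatched elsewhere, try the next one
        try:
            # Stage 2.1 happens here: the caller launches the container.
            yield dev
        finally:
            # Stage 2.2: release the lock once the job has started.
            fcntl.flock(fd, fcntl.LOCK_UN)
            os.close(fd)
        return
    raise RuntimeError("no free device")


def dispatch_overhead_seconds(parallel):
    """Rough extra wait quoted in the PR text: 10s * parallel number + 30s."""
    return 10 * parallel + 30


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()  # assumed location, for the demo only
    with acquire_device(lock_dir, [2, 3, 4]) as dev:
        print(f"dispatching to /dev/davinci{dev}")
    print(dispatch_overhead_seconds(3), "seconds of estimated extra wait")
```

A non-blocking lock lets a dispatcher skip a busy device rather than queue behind it, which mirrors how the three parallel test jobs landed on three different devices.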
@@ -32,6 +32,7 @@ MODELS = [
    "Qwen/Qwen2.5-0.5B-Instruct",
]
os.environ["VLLM_USE_MODELSCOPE"] = "True"
os.environ["PYTORCH_NPU_ALLOC_CONF"] = "max_split_size_mb:256"

TARGET_TEST_SUITE = os.environ.get("TARGET_TEST_SUITE", "L4")