[main] support cpu binding (#3546)

### What this PR does / why we need it?

Currently, in the piecewise mode of aclgraph, attention runs in eager mode, which causes abnormal allreduce latency for the O matrix. The cause is that CPU resources are preempted in eager mode. This PR therefore temporarily adds CPU binding to vllm-ascend.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI passed with new and existing tests.

- vLLM version: v0.11.0rc3
- vLLM main: https://github.com/vllm-project/vllm/commit/v0.11.0

Signed-off-by: GDzhu1 <809721801@qq.com>
2025-10-21 09:17:03 +08:00
parent 274b708e0c
commit 4a849df6fa
3 changed files with 345 additions and 1 deletions
--- a/vllm_ascend/ascend_config.py
+++ b/vllm_ascend/ascend_config.py
@@ -101,6 +101,8 @@ class AscendConfig:
                raise AssertionError(
                    "oproj_tensor_parallel_size is only supported in pd scenario and can only be used in D node."
                )
+        self.enable_cpu_binding = additional_config.get(
+            "enable_cpu_binding", False)
        self.pd_tp_ratio = 1
        self.pd_head_ratio = 1
        self.num_head_replica = 1
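
The hunk above reads `enable_cpu_binding` from `additional_config` and defaults to `False`, so binding stays off unless the key is set. As a rough illustration only, the sketch below mirrors that lookup and shows one way a process could be pinned to a set of cores on Linux. The function names `read_enable_cpu_binding` and `bind_current_process` are hypothetical and are not taken from this PR; the actual binding logic lives in the other changed files, which are not shown here.

```python
import os


def read_enable_cpu_binding(additional_config):
    """Mirror the lookup in the diff: a missing key means binding is disabled."""
    return bool(additional_config.get("enable_cpu_binding", False))


def bind_current_process(cpu_ids):
    """Hypothetical helper: pin the calling process to cpu_ids.

    Uses os.sched_setaffinity, which is only available on some platforms
    (notably Linux). Returns the resulting affinity set, or None when the
    platform does not expose the call.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, set(cpu_ids))  # pid 0 means the calling process
    return os.sched_getaffinity(0)


if __name__ == "__main__":
    config = {"enable_cpu_binding": True}
    if read_enable_cpu_binding(config) and hasattr(os, "sched_getaffinity"):
        # Re-apply the current affinity so the demo changes nothing.
        print(bind_current_process(os.sched_getaffinity(0)))
```

Pinning a worker to a fixed set of cores is one common way to keep other processes from preempting it, which is the problem the description attributes to eager-mode attention.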