[Doc][CPU binding] Add user/developer guide for CPU binding (#7045)
### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:
- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).
This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.
### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.
### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.
- vLLM version: v0.16.0
- vLLM main:
4034c3d32e
---------
Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
docs/source/user_guide/feature_guide/cpu_binding.md (new file, 132 lines)

# CPU Binding

## Overview

CPU Binding is a performance optimization feature for vLLM, specifically designed for servers equipped with **ARM architecture and Ascend NPUs**. It pins vLLM processes and threads to specific CPU cores to reduce CPU–NPU cross‑NUMA communication overhead and stabilize inference latency. This feature only adjusts host-side CPU affinity policies and **does not alter model execution logic or impact inference results**.
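At the OS level, pinning is just setting a process's CPU affinity. A minimal, Linux-only sketch using Python's `os.sched_setaffinity` (illustrative only, not the vllm-ascend implementation):

```python
import os

# The CPUs this process may currently run on (the same set the binder starts from).
allowed = sorted(os.sched_getaffinity(0))

# Pin this process to a single core from that set; pid 0 means "this process".
os.sched_setaffinity(0, {allowed[0]})

print(os.sched_getaffinity(0))  # now a one-core set
```

The binder applies this idea per worker, carving the allowed CPU set into per-NPU groups (at least 5 cores each, per the troubleshooting table below).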

## Usage

### Online serving example with CPU binding enabled (by default)

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --additional-config '{"enable_cpu_binding": true}'
```

### Online serving example with CPU binding disabled

```bash
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --additional-config '{"enable_cpu_binding": false}'
```

### Offline inference example with CPU binding enabled

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    additional_config={"enable_cpu_binding": True},
)
```

### Offline inference example with CPU binding disabled

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    additional_config={"enable_cpu_binding": False},
)
```

## Dependencies

CPU binding relies on host utilities such as `taskset` and `lscpu` (from util-linux), `numactl`, and `ps` (from procps). Install them with your distribution's package manager:

### Installation

#### Ubuntu/Debian

```bash
sudo apt-get update
sudo apt-get install -y util-linux numactl procps
```

#### RHEL/CentOS/Alma/Rocky

```bash
sudo yum install -y util-linux numactl procps-ng
```

#### openEuler

```bash
sudo dnf install -y util-linux numactl procps-ng
```
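After installing, a quick sanity check that the tools the binder relies on are available on `PATH` (the tool list is taken from this guide; `migratepages` is only needed for memory binding):

```shell
for tool in taskset lscpu numactl ps migratepages; do
    command -v "$tool" >/dev/null && echo "$tool: found" || echo "$tool: MISSING"
done
```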

### Additional considerations for IRQ binding

For best results when running inside a Docker container, where `systemctl` is typically unavailable, stop the `irqbalance` service on the host manually before starting vLLM, and make sure the container has the necessary permissions to write to `/proc/irq/*/smp_affinity` for IRQ binding:

- **Stop the `irqbalance` service**:

  For example, on Ubuntu, run the following command to stop `irqbalance`:

  ```bash
  sudo systemctl stop irqbalance
  ```

  After the vLLM process exits, you can restore `irqbalance` on the host:

  ```bash
  sudo systemctl start irqbalance
  ```

- **Permissions**:
  - Read access to `/proc/self/status` and `/proc/interrupts`
  - Write access to `/proc/irq/*/smp_affinity` for IRQ binding

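To see the eligible CPU set the binder reads from `Cpus_allowed_list`, you can parse it yourself; the helper below is a sketch for inspection, not the project's code:

```python
def parse_cpu_list(spec: str) -> list[int]:
    """Expand a kernel CPU list like '0-3,8,10-11' into individual CPU ids."""
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

with open("/proc/self/status") as f:
    for line in f:
        if line.startswith("Cpus_allowed_list"):
            print(parse_cpu_list(line.split(":", 1)[1].strip()))
```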
## Common Issues & Troubleshooting

|Error/Warning Message|Core Cause|Solution|
|---|---|---|
|Can not get running npu info.|The `npu-smi` process table is empty, or the `ASCEND_RT_VISIBLE_DEVICES` environment variable filters out all NPUs.|1. Ensure the process is running on visible NPUs; 2. Verify that the `ASCEND_RT_VISIBLE_DEVICES` value matches the actual logical NPU IDs.|
|Insufficient CPUs for binding...|The number of CPU cores allocated to each NPU is less than the minimum requirement of 5.|1. Expand the allowed CPU list; 2. Reduce the number of visible NPUs.|
|NPU topo affinity not found...|`npu-smi` is unable to retrieve NPU topology affinity information.|Verify the integrity of the `npu-smi` installation and ensure the user has sufficient execution permissions.|
|Bind cpus failed in rankX...|The CPU binding process failed (e.g., `taskset` is unavailable, or the user lacks write permissions for `/proc/irq`).|1. Confirm that required tools (`taskset`, `lscpu`, `npu-smi`) are installed and available; 2. Verify that the `Cpus_allowed_list` in `/proc/self/status` is valid.|
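The "Insufficient CPUs for binding" case in the table can be anticipated before launch. A rough pre-flight check (the 5-core minimum comes from this guide; the function name is hypothetical):

```python
import os

MIN_CORES_PER_NPU = 5  # minimum per-NPU core count stated in this guide

def enough_cpus_for_binding(num_visible_npus: int) -> bool:
    """Return True if the allowed CPU set gives every NPU at least 5 cores."""
    allowed = os.sched_getaffinity(0)  # mirrors Cpus_allowed_list
    return len(allowed) // max(num_visible_npus, 1) >= MIN_CORES_PER_NPU

print(enough_cpus_for_binding(8))
```

If this returns `False`, either expand the allowed CPU list or reduce the number of visible NPUs, as the table suggests.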
## Key Limitations

- ARM architecture only: binding is automatically skipped on x86_64 systems.

- Symmetric NUMA layout required for optimal performance: CPU numbering should be aligned with NUMA nodes. Non-symmetric layouts may result in cross-NUMA CPU pools, reducing locality.

- IRQ binding requires write permission for `/proc/irq`. Memory binding depends on the `migratepages` tool; if it is unavailable, memory migration is skipped.

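To check whether your NUMA layout is symmetric, you can read the per-node CPU lists from standard Linux sysfs (a read-only inspection, independent of vllm-ascend):

```python
import glob

# Each NUMA node exposes its CPU list, e.g. "0-31" or "0-31,64-95".
# On a symmetric layout, every node covers the same number of CPUs
# and the ranges do not interleave across nodes.
for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
    node = path.split("/")[-2]
    with open(path) as f:
        print(node, f.read().strip())
```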
## FAQ

**Q1: Does CPU binding work on x86_64?**

No. Binding is skipped on non-ARM CPUs.

**Q2: Why are only the current rank's IRQs bound?**

To avoid multiple processes overwriting IRQ affinity settings for the same device.

**Q3: What if my cpuset already limits CPUs?**

The binder uses `Cpus_allowed_list` from `/proc/self/status` as the only eligible CPU set. Ensure this list is large enough.

**Q4: Does CPU binding change model outputs?**

No. It only affects host-side affinity and should not change numerical results.

---

## Summary

1. **Core Objective**: Reduce cross-NUMA communication by pinning vLLM processes and threads to specific CPU cores, thereby stabilizing inference latency in Ascend NPU deployments (only applicable to ARM architectures).

2. **Usage**: Enable or disable with `enable_cpu_binding` via `additional_config` in both online and offline workflows.

3. **Key Limitations**: ARM-only; relies on symmetric NUMA layouts; binding fails if the CPU pool has fewer than 5 cores; binding errors trigger a warning log but do not terminate the process.

@@ -6,6 +6,7 @@ This section provides a detailed usage guide of vLLM Ascend features.

:caption: Feature Guide
:maxdepth: 1

graph_mode
cpu_binding
quantization
sleep_mode
structured_output