[Doc][CPU binding] Add user/developer guide for CPU binding (#7045)

### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:

- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).

This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.

### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.

### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.

- vLLM version: v0.16.0
- vLLM main:
4034c3d32e

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
This commit is contained in:
Frank Chen
2026-03-10 15:59:31 +08:00
committed by GitHub
parent 33234aa0c5
commit 14c71b19e1
5 changed files with 363 additions and 1 deletions

View File

@@ -0,0 +1,228 @@
# CPU Binding
## Overview
CPU binding pins vLLM Ascend worker processes and key threads to specific CPU cores to reduce CPUNPU crossNUMA traffic and stabilize latency under multiprocess workloads. It is designed for ARM servers running Ascend NPUs and is automatically executed during worker initialization when enabled.
## Background
On multisocket ARM systems, the OS scheduler may place vLLM threads on CPUs far from the local NPU, causing NUMA crosstraffic and jitter. CPU binding enforces a deterministic CPU placement strategy and optionally binds NPU IRQs to the same CPU pool. This is distinct from other performance features (e.g., graph mode or dynamic batch) because it is purely a hostside affinity policy and does not change model execution logic.
## Design & How it works
### Key concepts
- **Allowed CPU list**: The cpuset from /proc/self/status (Cpus_allowed_list). All allocations are constrained to this list.
- **Running NPU list**: Logical NPU IDs extracted from npusmi process listing, optionally filtered by ASCEND_RT_VISIBLE_DEVICES.
- **CPU pool per NPU**: The CPU list assigned to each logical NPU ID based on the binding mode.
- **Binding modes & Device behavior**:
| Device type | Default mode | Description |
| ----------- | ------------ | ------------ |
| A3 (No Affinity) | `global_slice` | Splits the allowed CPU list evenly based on the **total number of global logical NPUs**, ensuring each NPU is assigned a contiguous segment of CPU cores. This prevents CPU core overlap across multiple process groups. |
| A2 / 310P / Others | `topo_affinity` | Allocates CPUs based on NPU topology affinity (`npusmi info -t topo`). If multiple NPUs are assigned to a single NUMA node (which may cause bandwidth contention), the CPU allocation extends to adjacent NUMA nodes. |
- **Default**: enabled (enable_cpu_binding = true).
- **Fallback**: If NPU topo affinity is unavailable, global_slice is used.
- **Failure handling**: Any exception in binding is logged as a warning and **binding is skipped for that rank**.
### Execution flow (simplified)
1. **Feature entry**: worker initialization calls `bind_cpus(local_rank)` when `enable_cpu_binding` is true.
2. **CPU architecture gate**: If the CPU is not ARM, binding is skipped with a log.
3. **Collect device info**:
- Map logical NPU IDs from `npusmi info -m`.
- Detect running NPU IDs from npusmi info process table.
- Read cpuset from /proc/self/status.
- Read topo affinity from `npusmi info -t topo`.
4. **Build CPU pools**:
- Use **global_slice** for A3 devices; **topo_affinity** for A2 and 310P.
- If topo affinity is missing, fall back to global_slice.
- Ensure each NPU has at least 5 CPUs.
5. **Allocate perrole CPUs**:
- Reserve the first two CPUs for IRQ binding.
- `main`: pool[2:-2]
- `acl`: pool[-2]
- `release`: pool[-1]
6. **Bind threads**:
- Main process is pinned to `main` CPUs.
- ACL threads (named with acl_thread) are pinned to `acl` CPU.
- Release threads (named with release_thread) are pinned to `release` CPU.
7. **Bind NPU IRQs (optional)**:
- If /proc/irq is writable, bind SQ/CQ IRQs to the first two CPUs in the pool.
- irqbalance may be stopped to prevent overrides.
8. **Memory binding (optional)**:
- If migratepages is available, memory for ACL threads is migrated to the NPUs NUMA node.
## Allocation plan examples
The allocation plan is derived directly from the CPU pool per NPU and then split into roles:
- IRQ CPUs: pool[0], pool[1]
- `main`: pool[2:-2]
- `acl`: pool[-2]
- `release`: pool[-1]
Below are concrete examples that reflect the actual code paths.
### Example 1: A3 inference server with 640 CPUs and 16 NPUs
- allowed_cpus = [0..639] (640 CPUs)
- NUMA nodes = 0..7 (8 NUMA nodes, symmetric layout)
- total_npus = 16
- running_npu_list = [0..15]
- base = 640 // 16 = 40, extra = 0
- Each NPU gets a 40CPU pool.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-39|`IRQ`: 0-1, `Main`: 2-37, `ACL`: 38, `Release`: 39|
|1|40-79|`IRQ`: 40-41, `Main`: 42-77, `ACL`: 78, `Release`: 79|
|...|...|...|
|15|600-639|`IRQ`: 600-601, `Main`: 602-637, `ACL`: 638, `Release`: 639|
This layout remains deterministic even when multiple processes share the same cpuset, because slicing is based on the global logical NPU ID.
### Example 2: A3 global_slice, even split
**Inputs**:
- allowed_cpus = [0..23] (24 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..11, NUMA1 = 12..23)
- total_npus = 4 (from npu-smi info -m)
- running_npu_list = [0, 1, 2, 3]
**Global slice**:
- base = 24 // 4 = 6, extra = 0
- Each NPU gets a 6CPU pool.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-5|`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5|
|1|6-11|`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11|
|2|12-17|`IRQ`: 12-13, `Main`: 14-15, `ACL`: 16, `Release`: 17|
|3|18-23|`IRQ`: 18-19, `Main`: 20-21, `ACL`: 22, `Release`: 23|
### Example 3: A3 global_slice, remainder distribution
**Inputs**:
- allowed_cpus = [0..16] (17 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..7, NUMA1 = 8..16)
- total_npus = 3
- running_npu_list = [0, 1, 2]
**Global slice**:
- base = 17 // 3 = 5, extra = 2
- NPU0 pool size = 6 (base+1)
- NPU1 pool size = 6 (base+1)
- NPU2 pool size = 5 (base)
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-5|`IRQ`: 0-1, `Main`: 2-3, `ACL`: 4, `Release`: 5|
|1|6-11|`IRQ`: 6-7, `Main`: 8-9, `ACL`: 10, `Release`: 11|
|2|12-16|`IRQ`: 12-13, `Main`: 14, `ACL`: 15, `Release`: 16|
Note: When a pool size is exactly 5, `main` has a single CPU (pool[2]). If any pool is <5, binding raises an error.
**NUMA analysis**:
- With the symmetric NUMA layout above (NUMA0 = 0..7, NUMA1 = 8..16), NPU0 stays within NUMA0, NPU2 stays within NUMA1, but NPU1 spans both NUMA0 (6,7) and NUMA1 (8..11). This is a direct consequence of global slicing over the ordered cpuset; the remainder distribution does not enforce NUMA boundaries.
- If the cpuset numbering is interleaved across NUMA nodes (nonsymmetric layout), crossNUMA pools can happen even earlier. This is why symmetric NUMA layout is recommended for best locality.
### Known limitations and future improvements
With the current `global_slice` strategy, some CPU/NPU layouts cannot avoid crossNUMA pools. A future enhancement should incorporate NUMA node boundaries into the slicing logic so that pools remain within a single NUMA node whenever possible.
### Example 4: global_slice with visible subset of NPUs
**Inputs**:
- total_npus = 8 (from npu-smi info -m)
- running_npu_list = [2, 3] (filtered by ASCEND_RT_VISIBLE_DEVICES)
- allowed_cpus = [0..39] (40 CPUs)
- NUMA nodes = 0..3 (4 NUMA nodes, symmetric layout; 0..9, 10..19, 20..29, 30..39)
**Global slice**:
- base = 40 // 8 = 5, extra = 0
- Only the visible logical NPUs get pools, but slicing uses the global NPU ID so different processes do not overlap.
|NPU ID|Assigned CPU Cores (global_slice)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|2|10-14|`IRQ`: 10-11, `Main`: 12, `ACL`: 13, `Release`: 14|
|3|15-19|`IRQ`: 15-16, `Main`: 17, `ACL`: 18, `Release`: 19|
### Example 5: A2/310P topo_affinity with NUMA extension
**Inputs**:
- npu_affinity = {0: [0..7], 1: [0..7]} (from `npu-smi info -t topo`)
- allowed_cpus = [0..15] (16 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes; NUMA0 = 0..7, NUMA1 = 8..15)
**NUMA extension**:
- Both NPUs are on NUMA0, so each pool extends to the nearest NUMA node to reduce contention.
- NPU0 extends to NUMA1 -> [0..15]
- NPU1 extends to NUMA1 -> [0..15]
Because both pools are identical, the allocator applies average distribution across NPUs to avoid overlap. With a pool [0..15] and 2 NPUs, the final pools become:
|NPU ID|Assigned CPU Cores (topo_affinity)|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|0-7|`IRQ`: 0-1, `Main`: 2-5, `ACL`: 6, `Release`: 7|
|1|8-15|`IRQ`: 8-9, `Main`: 10-13, `ACL`: 14, `Release`: 15|
### Example 6: Minimum CPUs per NPU
**Inputs**:
- total_npus = 2
- allowed_cpus = [0..7] (8 CPUs)
- NUMA nodes = 0..1 (2 NUMA nodes, symmetric layout; NUMA0 = 0..3, NUMA1 = 4..7)
**Result**:
- base = 4, which is < 5, so binding fails with: "Insufficient CPUs for binding with IRQ/ACL/REL reservations..."
|NPU ID|Assigned CPU Cores|Role Division (IRQ/Main/ACL/Release)|
|---|---|---|
|0|N/A|Binding error (insufficient CPUs per NPU)|
|1|N/A|Binding error (insufficient CPUs per NPU)|
To resolve, either reduce total_npus or enlarge the cpuset so that each NPU has at least 5 CPUs.
### Logging and verification
- Logs show the selected binding mode and the allocation plan, for example:
- `[cpu_bind_mode] mode=global_slice rank=0 visible_npus=[...]`
- `The CPU allocation plan is as follows: ...`
- You can verify affinity via taskset or /proc/<pid>/status after startup.
## Limitations & Notes
- **ARMonly**: Binding is skipped on nonARM CPUs.
- **Minimum CPU requirement**: Each logical NPU requires at least 5 CPUs. If the cpuset is smaller, binding fails with an error.
- **NUMA symmetry assumption**: For best locality, the current strategies assume the cpuset is evenly distributed across NUMA nodes and CPU numbering aligns with NUMA layout; otherwise NUMA locality may be suboptimal.
- Example (symmetric layout): 2 NUMA nodes, 64 CPUs total. NUMA0 = CPUs 031, NUMA1 = CPUs 3263, and the cpuset is 063. With 4 logical NPUs, global slicing yields 16 CPUs per NPU (015, 1631, 3247, 4863), so each NPUs pool stays within a single NUMA node.
- **Runtime dependencies**:
- Requires npusmi and lscpu commands.
- IRQ binding requires write access to /proc/irq.
- Memory binding requires migratepages; otherwise it is skipped.
- **IRQ side effects**: irqbalance may be stopped to avoid overriding bindings.
- **Perprocess behavior**: Only the current ranks NPU is used for IRQ binding to avoid crossprocess overwrite.
### Debug logging
Use the standard vLLM logging configuration to enable debug logs. The binding process emits debug messages (e.g., `[cpu_global_slice] ...`) when debug level is enabled.
## References
- CPU binding implementation: vllm_ascend/cpu_binding.py (`DeviceInfo`, `CpuAlloc`, `bind_cpus`)
- Worker integration: vllm_ascend/worker/worker.py (`NPUWorker._init_device`)
- Additional config option: docs/source/user_guide/configuration/additional_config.md (`enable_cpu_binding`)
- Tests: tests/ut/device_allocator/test_cpu_binding.py

View File

@@ -6,6 +6,7 @@ This section provides an overview of the features implemented in vLLM Ascend. De
:caption: Feature Guide
:maxdepth: 1
patch
cpu_binding
ModelRunner_prepare_inputs
disaggregated_prefill
eplb_swift_balancer