[Doc][CPU binding] Add user/developer guide for CPU binding (#7045)

### What this PR does / why we need it?
This PR adds comprehensive documentation for the CPU binding feature on
Ascend NPUs. It includes:

- A detailed developer guide
(`docs/source/developer_guide/feature_guide/cpu_binding.md`) covering
the design, internal logic, allocation examples, and troubleshooting for
the CPU binding mechanism.
- A concise user guide
(`docs/source/user_guide/feature_guide/cpu_binding.md`) explaining the
core concepts, usage, and common issues for end-users.
- An update to `additional_config.md` to use consistent terminology for
binding strategies (`global-slicing` and `topo-affinity`).

This documentation is needed to help both developers and users
understand, use, and debug the CPU binding feature, which is critical
for performance on ARM+Ascend platforms.

### Does this PR introduce _any_ user-facing change?
No. This is a documentation-only update.

### How was this patch tested?
The documentation has been reviewed for clarity and technical accuracy.
The examples and descriptions align with the implementation in
`vllm_ascend/cpu_binding.py`.

- vLLM version: v0.16.0
- vLLM main: 4034c3d32e

---------

Signed-off-by: chenchuw886 <chenchuw@huawei.com>
Signed-off-by: c00818886 <chenchuwei@huawei.com>
Co-authored-by: chenchuw886 <chenchuw@huawei.com>
Frank Chen
2026-03-10 15:59:31 +08:00
committed by GitHub
parent 33234aa0c5
commit 14c71b19e1
5 changed files with 363 additions and 1 deletions

@@ -38,7 +38,7 @@ The following table lists additional configuration options available in vLLM Asc
| `multistream_overlap_shared_expert` | bool | `False` | Whether to enable multi-stream shared expert. This option only takes effect on MoE models with shared experts. |
| `multistream_overlap_gate` | bool | `False` | Whether to enable multi-stream overlap gate. This option only takes effect on MoE models with shared experts. |
| `recompute_scheduler_enable` | bool | `False` | Whether to enable recompute scheduler. |
-| `enable_cpu_binding` | bool | `True` | Whether to enable CPU binding. Only takes effect on ARM CPUs; when enabled, A3 uses NUMA-balanced binding strategy and other device types use NUMA-affinity's. |
+| `enable_cpu_binding` | bool | `True` | Whether to enable CPU binding. Only takes effect on ARM CPUs; A3 uses the global-slicing CPU allocation strategy and other device types use the topo-affinity CPU allocation strategy. |
| `SLO_limits_for_dynamic_batch` | int | `-1` | SLO limits for dynamic batch. This is new scheduler to support dynamic batch feature |
| `enable_npugraph_ex` | bool | `False` | Whether to enable npugraph_ex graph mode. |
| `pa_shape_list` | list | `[]` | The custom shape list of page attention ops. |
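
The updated `enable_cpu_binding` row describes a small decision rule: binding applies only on ARM CPUs, A3 gets `global-slicing`, and every other device type gets `topo-affinity`. The sketch below restates that rule as code for illustration only. The helper `select_cpu_binding_strategy`, its parameters, and the device-type strings are hypothetical and are not taken from `vllm_ascend/cpu_binding.py`; the actual implementation may detect the platform and device differently.

```python
import json
import platform
from typing import Optional


def select_cpu_binding_strategy(
    device_type: str,
    enable_cpu_binding: bool = True,
    machine: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper that mirrors the rule stated in the table row.

    Returns the strategy name, or None when CPU binding does not take effect.
    """
    # Assumption: ARM hosts report "aarch64" or "arm64" via platform.machine().
    machine = (machine or platform.machine()).lower()
    is_arm = machine.startswith(("aarch64", "arm"))

    # Binding is skipped when disabled or when the host CPU is not ARM.
    if not enable_cpu_binding or not is_arm:
        return None

    # A3 uses global-slicing; all other device types use topo-affinity.
    return "global-slicing" if device_type.upper() == "A3" else "topo-affinity"


# Example additional_config payload that turns the feature off.
# Assumption: it is supplied through vLLM Ascend's additional_config mechanism.
additional_config = {"enable_cpu_binding": False}
additional_config_json = json.dumps(additional_config)
```

Taking `machine` as a parameter keeps the rule checkable on any host; leaving it as `None` falls back to the local platform.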