[Doc] Add a perf tune section (#5127)

### What this PR does / why we need it?
This patch:
1. Adds an OS-level optimization section to the performance tuning doc
2. Sets some default environment variables in the image for performance

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
Li Wang
2025-12-19 14:52:52 +08:00
committed by GitHub
parent a6eaf816f1
commit 5ab6d124e5
7 changed files with 121 additions and 12 deletions


@@ -182,3 +182,87 @@ Plus, there are more features for performance optimization in specific scenarios
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
### 5. OS Optimization
This section describes operating system-level optimizations applied on the host machine (bare metal or Kubernetes node) to improve performance stability, latency, and throughput for inference workloads.
:::{note}
These settings must be applied on the host OS with root privileges, not inside containers.
:::
#### 5.1 Set CPU Frequency Governor to `performance`
```shell
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```
Purpose
- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)
Benefits
- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads
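
To confirm that the governor change took effect on every core, a quick read-only check can help. The sketch below is illustrative only: `count_governors` is a hypothetical helper name, and it assumes the host exposes the cpufreq sysfs interface (some VMs do not, in which case it prints nothing).

```shell
# Hypothetical helper: summarize which governors are active and on how many cores.
# $1 is an optional sysfs root, defaulting to the standard location.
count_governors() {
    root="${1:-/sys/devices/system/cpu}"
    for f in "$root"/cpu*/cpufreq/scaling_governor; do
        # Skip paths that do not exist (e.g. no cpufreq support on this host).
        if [ -r "$f" ]; then
            cat "$f"
        fi
    done | sort | uniq -c
}

count_governors
```

If the output shows a single line ending in `performance`, all cores that expose a governor are using it. Any other governor name in the output means some cores were not updated.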
#### 5.2 Minimize Swap Usage
```shell
sysctl -w vm.swappiness=0
```
Purpose
- Minimizes the kernel's tendency to swap memory pages to disk
Benefits
- Prevents severe latency spikes caused by swapping
- Improves stability for large in-memory models
Notes
- For inference workloads, swapping can introduce latency on the order of seconds
- Recommended values are `0` or `1`
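
Note that `vm.swappiness` only lowers the kernel's preference for swapping; it does not turn swap off, so pages swapped out earlier may still be on disk. A minimal read-only sketch for inspecting the current state follows. `swap_used_kb` is a hypothetical helper name, and the script assumes a Linux host with `/proc` mounted.

```shell
# Hypothetical helper: print swap currently in use, in kB, from a meminfo-style file.
# $1 is an optional path, defaulting to /proc/meminfo.
swap_used_kb() {
    meminfo="${1:-/proc/meminfo}"
    awk '/^SwapTotal:/ {total = $2} /^SwapFree:/ {free = $2} END {print total - free}' "$meminfo"
}

if [ -r /proc/sys/vm/swappiness ]; then
    echo "vm.swappiness: $(cat /proc/sys/vm/swappiness)"
fi
if [ -r /proc/meminfo ]; then
    echo "swap in use: $(swap_used_kb) kB"
fi
```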
#### 5.3 Disable Automatic NUMA Balancing
```shell
sysctl -w kernel.numa_balancing=0
```
Purpose
- Disables the kernel's automatic NUMA page migration mechanism
Benefits
- Prevents background memory page migrations
- Reduces unpredictable memory access latency
- Improves performance stability on NUMA systems
Recommended For
- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity
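
Before changing this setting, it can be useful to check whether the host has more than one NUMA node at all; on a single-node host the setting is unlikely to matter. The following is a rough sketch, assuming the standard sysfs node layout; `numa_node_count` is a hypothetical helper name.

```shell
# Hypothetical helper: count NUMA nodes exposed under sysfs.
# $1 is an optional sysfs root, defaulting to the standard location.
numa_node_count() {
    root="${1:-/sys/devices/system/node}"
    n=0
    for d in "$root"/node[0-9]*; do
        if [ -d "$d" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

echo "NUMA nodes: $(numa_node_count)"
# The file is absent on kernels built without automatic NUMA balancing.
if [ -r /proc/sys/kernel/numa_balancing ]; then
    echo "kernel.numa_balancing: $(cat /proc/sys/kernel/numa_balancing)"
fi
```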
#### 5.4 Increase Scheduler Migration Cost
```shell
sysctl -w kernel.sched_migration_cost_ns=50000
```
Purpose
- Increases the cost for the scheduler to migrate tasks between CPU cores
Benefits
- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads
Parameter Details
- Unit: nanoseconds (ns)
- Typical recommended range: 50000 to 100000
- Higher values encourage threads to stay on the same CPU core
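
Values set with `sysctl -w` do not survive a reboot. One common way to persist them is a drop-in file under `/etc/sysctl.d/`. The sketch below only renders the file content to stdout, using the values from this section; `render_sysctl_conf` and the file name in the usage note are hypothetical, not something this PR ships.

```shell
# Hypothetical helper: print a sysctl drop-in with the values used in this section.
render_sysctl_conf() {
    cat <<'EOF'
# Host-level tuning for inference workloads
vm.swappiness = 0
kernel.numa_balancing = 0
kernel.sched_migration_cost_ns = 50000
EOF
}

render_sysctl_conf
```

As root on the host, the output could be redirected to a file such as `/etc/sysctl.d/99-perf-tune.conf` (an example name) and loaded with `sysctl --system`. On some newer kernels `kernel.sched_migration_cost_ns` may no longer be available as a sysctl key, in which case that line is reported as an error and can be dropped. The CPU governor from 5.1 is not a sysctl setting and has to be persisted separately.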