[Doc] Add a perf tune section (#5127)
### What this PR does / why we need it?
This patch:
1. Adds an OS-level optimization section to the performance tuning doc.
2. Sets some default environment variables in the image for performance.
- vLLM version: v0.12.0
- vLLM main:
ad32e3e19c
---------
Signed-off-by: wangli <wangli858794774@gmail.com>
@@ -182,3 +182,87 @@ Plus, there are more features for performance optimization in specific scenarios
- `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
- `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
- `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).

### 5. OS Optimization

This section describes operating-system-level optimizations applied on the host machine (bare metal or Kubernetes node) to improve performance stability, latency, and throughput for inference workloads.

:::{note}
These settings must be applied on the host OS with root privileges, not inside containers.
:::

#### 5.1 Set CPU Frequency Governor to `performance`

```shell
echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
```

Purpose
- Forces all CPU cores to run under the `performance` governor
- Disables dynamic frequency scaling (e.g., `ondemand`, `powersave`)

Benefits
- Keeps CPU cores at maximum frequency
- Reduces latency jitter
- Improves predictability for inference workloads

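After changing the governor, it can help to confirm that every core picked it up. The following is a minimal sketch, not part of the original patch: `list_non_performance_cpus` is a hypothetical helper name, and its optional directory argument exists only so the function can be pointed at a test tree instead of the real sysfs path.

```shell
# List every CPU whose cpufreq governor is not "performance".
# $1 (optional): CPU sysfs root; defaults to the standard location.
list_non_performance_cpus() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    for f in "$cpu_root"/cpu*/cpufreq/scaling_governor; do
        # Skip entries without a readable governor file
        # (e.g. cpufreq not exposed, or the glob did not match).
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        if [ "$gov" != "performance" ]; then
            cpu_dir=${f%/cpufreq/scaling_governor}
            echo "${cpu_dir##*/}: $gov"
        fi
    done
}
```

Calling `list_non_performance_cpus` with no arguments checks the host. Empty output means every CPU that exposes a governor reports `performance`.
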
#### 5.2 Minimize Swap Usage

```shell
sysctl -w vm.swappiness=0
```

Purpose

- Minimizes the kernel's tendency to swap memory pages to disk

Benefits

- Prevents severe latency spikes caused by swapping
- Improves stability for large in-memory models

Notes
- For inference workloads, swapping can add latency on the order of seconds
- Recommended values are `0` or `1`

#### 5.3 Disable Automatic NUMA Balancing

```shell
sysctl -w kernel.numa_balancing=0
```

Purpose

- Disables the kernel's automatic NUMA page migration mechanism

Benefits

- Prevents background memory page migrations
- Reduces unpredictable memory access latency
- Improves performance stability on NUMA systems

Recommended For
- Multi-socket servers
- Ascend / NPU deployments with explicit NUMA binding
- Systems with manually managed CPU and memory affinity

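Before and after applying these `sysctl` settings, it can be useful to see what the host currently reports. The sketch below reads the values from the `/proc/sys` tree, where a dotted key such as `vm.swappiness` maps to the path `vm/swappiness`. The helper name `show_sysctl` and the `PROC_SYS_ROOT` override are hypothetical and exist only for illustration and testing.

```shell
# Print "key = value" for each sysctl key given as an argument.
# PROC_SYS_ROOT (optional, hypothetical override): tree to read from.
show_sysctl() {
    root="${PROC_SYS_ROOT:-/proc/sys}"
    for key in "$@"; do
        # Dotted sysctl keys map to slash-separated paths under /proc/sys.
        path="$root/$(echo "$key" | tr . /)"
        if [ -r "$path" ]; then
            echo "$key = $(cat "$path")"
        else
            # The running kernel does not expose this key.
            echo "$key = <not available>"
        fi
    done
}
```

For example, `show_sysctl vm.swappiness kernel.numa_balancing` prints one line per key, and a key the kernel does not expose is reported as `<not available>` instead of causing an error.
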
#### 5.4 Increase Scheduler Migration Cost

```shell
sysctl -w kernel.sched_migration_cost_ns=50000
```

Purpose
- Increases the cost for the scheduler to migrate tasks between CPU cores

Benefits
- Reduces frequent thread migration
- Improves CPU cache locality
- Lowers latency jitter for inference workloads

Parameter Details
- Unit: nanoseconds (ns)
- Typical recommended range: 50000–100000
- Higher values encourage threads to stay on the same CPU core

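Values set with `sysctl -w` do not survive a reboot. One common way to persist them is a drop-in file under `/etc/sysctl.d/`. The sketch below only prints such a fragment with the values used in sections 5.2 to 5.4. The function name `render_perf_sysctl_conf` and the file name in the usage note are hypothetical, not something this patch ships.

```shell
# Print a sysctl.d-style fragment with the values from sections 5.2-5.4.
# Nothing is written to disk here; redirect the output where needed.
render_perf_sysctl_conf() {
    cat <<'EOF'
# Host-level tuning for inference workloads (values from this section)
vm.swappiness = 0
kernel.numa_balancing = 0
kernel.sched_migration_cost_ns = 50000
EOF
}
```

As root on the host, the output could be saved with `render_perf_sysctl_conf > /etc/sysctl.d/99-perf-tune.conf` (an assumed file name) and loaded with `sysctl --system`. Some newer kernels no longer expose `kernel.sched_migration_cost_ns` as a sysctl, in which case that line will fail to apply, so it is worth checking the current value on the target host first.
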