[Doc] Add a perf tune section (#5127)

### What this PR does / why we need it?

This patch:

1. Adds an OS-level section to the performance tuning doc.
2. Sets some default environment variables in the image for performance.

- vLLM version: v0.12.0
- vLLM main: ad32e3e19c

---------

Signed-off-by: wangli <wangli858794774@gmail.com>
2025-12-19 14:52:52 +08:00
parent a6eaf816f1
commit 5ab6d124e5
7 changed files with 121 additions and 12 deletions
--- a/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md
+++ b/docs/source/developer_guide/performance_and_debug/optimization_and_tuning.md
@@ -182,3 +182,87 @@ Plus, there are more features for performance optimization in specific scenarios
 - `HCCL_RDMA_TC`: Use this var to configure traffic class of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0045.html).
 - `HCCL_RDMA_SL`: Use this var to configure service level of RDMA NIC. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0046.html).
 - `HCCL_BUFFSIZE`: Use this var to control the cache size for sharing data between two NPUs. Find more details [here](https://www.hiascend.com/document/detail/zh/Pytorch/600/ptmoddevg/trainingmigrguide/performance_tuning_0047.html).
+
+### 5. OS Optimization
+
+This section describes operating system–level optimizations applied on the host machine (bare metal or Kubernetes node) to improve performance stability, latency, and throughput for inference workloads.
+
+:::{note}
+These settings must be applied on the host OS with root privileges, not inside containers.
+:::
+
+#### 5.1 Set CPU Frequency Governor to `performance`
+
+```shell
+echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
+```
+
+Purpose
+- Forces all CPU cores to run under the `performance` governor
+- Replaces power-saving governors (e.g., `ondemand`, `powersave`) that can lower the clock frequency under light load
+
+Benefits
+- Keeps CPU cores at maximum frequency
+- Reduces latency jitter
+- Improves predictability for inference workloads
+
+#### 5.2 Minimize Swap Usage
+
+```shell
+sysctl -w vm.swappiness=0
+```
+
+Purpose
+
+- Minimizes the kernel’s tendency to swap memory pages to disk
+
+Benefits
+
+- Prevents severe latency spikes caused by swapping
+- Improves stability for large in-memory models
+
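+Notes
+- Swapping can add latency on the order of seconds to inference requests
+- Recommended values are `0` or `1`
+- `vm.swappiness` only lowers the kernel's preference for swapping; it does not turn swap off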
+
+#### 5.3 Disable Automatic NUMA Balancing
+
+```shell
+sysctl -w kernel.numa_balancing=0
+```
+
+Purpose
+
+- Disables the kernel’s automatic NUMA page migration mechanism
+
+Benefits
+
+- Prevents background memory page migrations
+- Reduces unpredictable memory access latency
+- Improves performance stability on NUMA systems
+
+Recommended For
+- Multi-socket servers
+- Ascend / NPU deployments with explicit NUMA binding
+- Systems with manually managed CPU and memory affinity
+
+#### 5.4 Increase Scheduler Migration Cost
+
+```shell
+sysctl -w kernel.sched_migration_cost_ns=50000
+```
+
+Purpose
+- Sets how long after its last run a task is treated as cache-hot; the scheduler avoids migrating cache-hot tasks between CPU cores
+
+Benefits
+- Reduces frequent thread migration
+- Improves CPU cache locality
+- Lowers latency jitter for inference workloads
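+
+Parameter Details
+- Unit: nanoseconds (ns)
+- Typical recommended range: 50000–100000
+- Higher values encourage threads to stay on the same CPU core
+- Check the value currently in effect with `sysctl kernel.sched_migration_cost_ns` before changing it; the setting only raises the migration cost if it is larger than that value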
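
The four settings in section 5 can be read back without modifying anything. The following is a minimal sketch, not part of the patch: it assumes the usual `/proc/sys` and `/sys` locations for these knobs, and `read_setting` is a hypothetical helper introduced here for illustration.

```shell
#!/bin/sh
# Read-only check of the host settings described in section 5.
# Nothing is modified, so root privileges are not required.

# Hypothetical helper: print the content of a /proc or /sys file,
# or "unavailable" if the file does not exist or cannot be read.
read_setting() {
    cat "$1" 2>/dev/null || echo "unavailable"
}

# Only cpu0 is inspected here; loop over cpu*/cpufreq/scaling_governor
# to check every core.
echo "cpu0 governor:           $(read_setting /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor)"
echo "vm.swappiness:           $(read_setting /proc/sys/vm/swappiness)"
echo "kernel.numa_balancing:   $(read_setting /proc/sys/kernel/numa_balancing)"
echo "sched_migration_cost_ns: $(read_setting /proc/sys/kernel/sched_migration_cost_ns)"
```

A value of `unavailable` means the running kernel or environment does not expose that knob at the assumed path. Values set with `sysctl -w` or by writing to `/sys` do not survive a reboot, so it is worth rerunning the check after the host restarts.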