[Doc] Sensitive word modification (#8303)
### What this PR does / why we need it?

This PR updates the documentation to replace specific hardware terms (e.g., HBM, 910B, 310P) with more generic or branded terms (e.g., on-chip memory, Atlas inference products) to comply with sensitive word requirements.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

---------

Signed-off-by: herizhen <1270637059@qq.com>
Signed-off-by: herizhen <59841270+herizhen@users.noreply.github.com>
@@ -136,9 +136,9 @@ The problem is usually caused by the installation of a development or editable v
OOM errors typically occur when the model exceeds the memory capacity of a single NPU. For general guidance, you can refer to [vLLM OOM troubleshooting documentation](https://docs.vllm.ai/en/latest/usage/troubleshooting/#out-of-memory).
- In scenarios where NPUs have limited high bandwidth memory (HBM) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
+ In scenarios where NPUs have limited high bandwidth memory (on-chip memory) capacity, dynamic memory allocation/deallocation during inference can exacerbate memory fragmentation, leading to OOM. To address this:
- - **Limit `--max-model-len`**: It can save the HBM usage for KV cache initialization step.
+ - **Limit `--max-model-len`**: It can save the on-chip memory usage for KV cache initialization step.
- **Adjust `--gpu-memory-utilization`**: If unspecified, the default value is `0.9`. You can decrease this value to reserve more memory to reduce fragmentation risks. See details in: [vLLM - Inference and Serving - Engine Arguments](https://docs.vllm.ai/en/latest/cli/serve/#-gpu-memory-utilization).
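As a rough illustration of how these two knobs are used together, here is a minimal offline-inference sketch with the vLLM Python API; the model name and the specific values are placeholders rather than recommendations, and the same settings map to the `--max-model-len` and `--gpu-memory-utilization` flags of `vllm serve`.

```python
# Minimal sketch: capping context length and memory utilization to reduce
# on-chip memory pressure. The model name and values are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
    max_model_len=4096,                # smaller context -> smaller KV cache allocation
    gpu_memory_utilization=0.85,       # reserve more headroom than the 0.9 default
)

outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```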