Improve perf tuning docs (#7071)
@@ -1,14 +1,16 @@
# Troubleshooting
This page lists some common errors and tips for fixing them.
This page lists common errors and tips for resolving them.
## CUDA out of memory
If you see out of memory (OOM) errors, you can try to tune the following parameters.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
## CUDA Out of Memory
If you encounter out-of-memory (OOM) errors, you can adjust the following parameters:
## CUDA error: an illegal memory access was encountered
This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix. Please file an issue on the GitHub.
- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the above section to avoid the OOM.
- If OOM occurs during prefill, try reducing `--chunked-prefill-size` to `4096` or `2048`. This saves memory but slows down the prefill speed for long prompts.
- If OOM occurs during decoding, try lowering `--max-running-requests`.
- You can also reduce `--mem-fraction-static` to a smaller value, such as 0.8 or 0.7. This decreases the memory usage of the KV cache memory pool and helps prevent OOM errors during both prefill and decoding. However, it limits maximum concurrency and reduces peak throughput.
- Another common cause of OOM is requesting input logprobs for a long prompt, which requires significant memory. To address this, set `logprob_start_len` in your sampling parameters so that logprobs are computed only for the part of the prompt you need. If you do need input logprobs for the whole prompt, try reducing `--mem-fraction-static`.
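The adjustments above can be collected into a small helper that builds the extra launch flags. This is a minimal sketch, not part of the project: the function name `oom_tuning_flags` and the default of `32` for `--max-running-requests` are hypothetical placeholders, while the flag names and the values `4096` and `0.8` come from the list above.

```python
def oom_tuning_flags(phase, *, chunked_prefill_size=4096,
                     max_running_requests=32, mem_fraction_static=None):
    """Return extra launch flags for an OOM seen in the given phase.

    phase is "prefill" or "decode". The default of 32 running requests is
    an illustrative assumption; choose a value below your current setting.
    """
    flags = []
    if phase == "prefill":
        # Smaller chunks save memory but slow down prefill for long prompts.
        flags += ["--chunked-prefill-size", str(chunked_prefill_size)]
    elif phase == "decode":
        # Fewer concurrent requests means less memory used during decoding.
        flags += ["--max-running-requests", str(max_running_requests)]
    else:
        raise ValueError(f"unknown phase: {phase!r}")

    if mem_fraction_static is not None:
        if not 0.0 < mem_fraction_static < 1.0:
            raise ValueError("mem_fraction_static must be between 0 and 1")
        # Shrinks the KV cache memory pool; helps both phases but limits
        # maximum concurrency and reduces peak throughput.
        flags += ["--mem-fraction-static", str(mem_fraction_static)]
    return flags


if __name__ == "__main__":
    # Append the result to your existing server launch command.
    print(" ".join(oom_tuning_flags("prefill", mem_fraction_static=0.8)))
```

Keeping `--mem-fraction-static` optional reflects the trade-off noted above: try the phase-specific flag first, and shrink the KV cache pool only if that is not enough.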
## CUDA Error: Illegal Memory Access Encountered
This error may result from kernel errors or out-of-memory issues:
- If it is a kernel error, resolving it may be challenging. Please file an issue on GitHub.
- If it is an out-of-memory issue, it may sometimes be reported as this error instead of "Out of Memory." Refer to the section above for guidance on avoiding OOM issues.
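Because an out-of-memory condition can surface under either message, a first triage step might look like the following. This is a rough sketch under stated assumptions: the helper `classify_cuda_error` and its return labels are hypothetical, and it only matches substrings of the two error messages named on this page.

```python
def classify_cuda_error(log_text):
    """Roughly classify an error log by the messages discussed on this page.

    Returns "oom", "illegal_memory_access", or "unknown". Plain substring
    matching is used, so treat the result as a hint, not a diagnosis.
    """
    # Normalize case and hyphens so "Out-of-memory" and "out of memory" match.
    text = log_text.lower().replace("-", " ")
    if "out of memory" in text:
        return "oom"
    if "illegal memory access" in text:
        return "illegal_memory_access"
    return "unknown"


# Suggested next steps, paraphrasing the guidance on this page.
NEXT_STEPS = {
    "oom": "Tune the parameters in the CUDA Out of Memory section.",
    "illegal_memory_access": (
        "May be a hidden OOM: try the OOM parameters first; "
        "if the error persists, file an issue on GitHub."
    ),
    "unknown": "Not covered by this page.",
}

if __name__ == "__main__":
    message = "CUDA error: an illegal memory access was encountered"
    print(NEXT_STEPS[classify_cuda_error(message)])
```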