[Doc] fix docs (#1949)

Lianmin Zheng
2024-11-07 18:20:41 -08:00
committed by GitHub
parent c77c1e05ba
commit 1ae270c5d0
4 changed files with 8 additions and 8 deletions

@@ -36,17 +36,17 @@ The core features include:
 :caption: Frontend Tutorial
 frontend/frontend.md
+frontend/choices_methods.md
 .. toctree::
 :maxdepth: 1
 :caption: References
+references/supported_models.md
 references/sampling_params.md
 references/hyperparameter_tuning.md
-references/supported_models.md
 references/benchmark_and_profiling.md
-references/choices_methods.md
 references/custom_chat_template.md
 references/contributor_guide.md
 references/troubleshooting.md

@@ -26,9 +26,9 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa
 ### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
 If you see out of memory (OOM) errors, you can try to tune the following parameters.
-If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
-If OOM happens during decoding, try to decrease `--max-running-requests`.
-You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
+- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
+- If OOM happens during decoding, try to decrease `--max-running-requests`.
+- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 ### Try Advanced Options
 - To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
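
Reviewer's illustration (not part of the commit): the OOM guidance in the hunk above maps each failure phase to one flag. The sketch below builds a launch command from that mapping. The helper names, the `MODEL_PATH` placeholder, and the default values for `--max-running-requests` and `--mem-fraction-static` are assumptions chosen for illustration; only the flag names and the `4096`/`2048` suggestion come from the documentation text.

```python
import shlex

# Base launch command; MODEL_PATH is a placeholder, not a real model.
BASE_CMD = ["python", "-m", "sglang.launch_server", "--model-path", "MODEL_PATH"]


def oom_tuning_args(phase, chunked_prefill_size=2048,
                    max_running_requests=32, mem_fraction_static=0.8):
    """Return the flag the docs suggest lowering for a given OOM phase.

    The default values for max_running_requests and mem_fraction_static
    are illustrative assumptions, not recommendations from the docs.
    """
    if phase == "prefill":
        # Docs suggest decreasing to 4096 or 2048.
        return ["--chunked-prefill-size", str(chunked_prefill_size)]
    if phase == "decode":
        return ["--max-running-requests", str(max_running_requests)]
    if phase == "both":
        # Shrinks the KV cache memory pool; helps prefill and decoding.
        return ["--mem-fraction-static", str(mem_fraction_static)]
    raise ValueError(f"unknown phase: {phase!r}")


def build_command(phase, **overrides):
    """Join the base command and the tuning flag into one shell string."""
    return shlex.join(BASE_CMD + oom_tuning_args(phase, **overrides))


if __name__ == "__main__":
    print(build_command("prefill"))
```

If one flag alone does not resolve the error, the flags can be combined, since `--mem-fraction-static` affects both phases.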

@@ -4,9 +4,9 @@ This page lists some common errors and tips for fixing them.
 ## CUDA out of memory
 If you see out of memory (OOM) errors, you can try to tune the following parameters.
-If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
-If OOM happens during decoding, try to decrease `--max-running-requests`.
-You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
+- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
+- If OOM happens during decoding, try to decrease `--max-running-requests`.
+- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.