From 1ae270c5d0873c0bcd02b9078e3a6bd0f12fbc1d Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Thu, 7 Nov 2024 18:20:41 -0800
Subject: [PATCH] [Doc] fix docs (#1949)

---
 docs/{references => frontend}/choices_methods.md | 0
 docs/index.rst                                   | 4 ++--
 docs/references/hyperparameter_tuning.md         | 6 +++---
 docs/references/troubleshooting.md               | 6 +++---
 4 files changed, 8 insertions(+), 8 deletions(-)
 rename docs/{references => frontend}/choices_methods.md (100%)

diff --git a/docs/references/choices_methods.md b/docs/frontend/choices_methods.md
similarity index 100%
rename from docs/references/choices_methods.md
rename to docs/frontend/choices_methods.md
diff --git a/docs/index.rst b/docs/index.rst
index 130b29811..e81cdd149 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -36,17 +36,17 @@ The core features include:
    :caption: Frontend Tutorial

    frontend/frontend.md
+   frontend/choices_methods.md


 .. toctree::
    :maxdepth: 1
    :caption: References

+   references/supported_models.md
    references/sampling_params.md
    references/hyperparameter_tuning.md
-   references/supported_models.md
    references/benchmark_and_profiling.md
-   references/choices_methods.md
    references/custom_chat_template.md
    references/contributor_guide.md
    references/troubleshooting.md
diff --git a/docs/references/hyperparameter_tuning.md b/docs/references/hyperparameter_tuning.md
index 89faa479b..499b81bc0 100644
--- a/docs/references/hyperparameter_tuning.md
+++ b/docs/references/hyperparameter_tuning.md
@@ -26,9 +26,9 @@ Data parallelism is better for throughput. When there is enough GPU memory, alwa

 ### Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
 If you see out of memory (OOM) errors, you can try to tune the following parameters.
-If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
-If OOM happens during decoding, try to decrease `--max-running-requests`.
-You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
+- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
+- If OOM happens during decoding, try to decrease `--max-running-requests`.
+- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.

 ### Try Advanced Options
 - To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
diff --git a/docs/references/troubleshooting.md b/docs/references/troubleshooting.md
index becb186df..8442bb205 100644
--- a/docs/references/troubleshooting.md
+++ b/docs/references/troubleshooting.md
@@ -4,9 +4,9 @@ This page lists some common errors and tips for fixing them.

 ## CUDA out of memory
 If you see out of memory (OOM) errors, you can try to tune the following parameters.
-If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
-If OOM happens during decoding, try to decrease `--max-running-requests`.
-You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
+- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
+- If OOM happens during decoding, try to decrease `--max-running-requests`.
+- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.

 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.