From 023d0a73df989a24535653f5290d63de369b8d75 Mon Sep 17 00:00:00 2001
From: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com>
Date: Sat, 16 Nov 2024 03:09:10 +0800
Subject: [PATCH] fix small typos in docs (#2047)

---
 docs/backend/backend.md                  | 4 ++--
 docs/frontend/frontend.md                | 2 +-
 docs/references/hyperparameter_tuning.md | 4 ++--
 docs/references/troubleshooting.md       | 2 +-
 4 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/docs/backend/backend.md b/docs/backend/backend.md
index 2e6b4287e..cce345e10 100644
--- a/docs/backend/backend.md
+++ b/docs/backend/backend.md
@@ -79,8 +79,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
 ```
-- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currenly.
+- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
diff --git a/docs/frontend/frontend.md b/docs/frontend/frontend.md
index 07c4c1e52..8b56fa487 100644
--- a/docs/frontend/frontend.md
+++ b/docs/frontend/frontend.md
@@ -1,5 +1,5 @@
 # Frontend: Structured Generation Language (SGLang)
-The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may found it easier to use for complex prompting workflow.
+The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflow.
 
 ## Quick Start
 The example below shows how to use SGLang to answer a multi-turn question.
diff --git a/docs/references/hyperparameter_tuning.md b/docs/references/hyperparameter_tuning.md
index cb5089951..2729b968a 100644
--- a/docs/references/hyperparameter_tuning.md
+++ b/docs/references/hyperparameter_tuning.md
@@ -31,8 +31,8 @@ If you see out of memory (OOM) errors, you can try to tune the following paramet
 - You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 
 ### Try Advanced Options
-- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currenly.
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currenly.
+- To enable the experimental overlapped scheduler, add `--enable-overlap-schedule`. It overlaps CPU scheduler with GPU computation and can accelerate almost all workloads. This does not work for constrained decoding currently.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
 
 ### Tune `--schedule-policy`
 If the workload has many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
diff --git a/docs/references/troubleshooting.md b/docs/references/troubleshooting.md
index 8442bb205..5a7646a11 100644
--- a/docs/references/troubleshooting.md
+++ b/docs/references/troubleshooting.md
@@ -11,4 +11,4 @@ If you see out of memory (OOM) errors, you can try to tune the following paramet
 ## CUDA error: an illegal memory access was encountered
 This error may be due to kernel errors or out-of-memory issues.
 - If it is a kernel error, it is not easy to fix. Please file an issue on the GitHub.
-- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the above seciton to avoid the OOM.
+- If it is out-of-memory, sometimes it will report this error instead of "Out-of-memory." Please refer to the above section to avoid the OOM.
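For context, the advanced options that this patch's wording fixes touch can be stacked on a single launch command. A sketch, assembled from the flag names and example values already shown in the patched `backend.md` (the model path and chunk size are the documented examples, not requirements):

```shell
# Sketch: combine the documented flags into one launch command.
# The flag names come from backend.md; whether they make sense together
# depends on your workload (e.g. the overlap scheduler does not work
# with constrained decoding, per the docs).
CMD="python -m sglang.launch_server"
CMD="$CMD --model-path meta-llama/Meta-Llama-3-8B-Instruct"
CMD="$CMD --chunked-prefill-size 4096"
CMD="$CMD --enable-overlap-schedule"
CMD="$CMD --enable-torch-compile"
echo "$CMD"
```

Running the echoed command requires a GPU host with SGLang installed; the snippet above only assembles and prints it.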