diff --git a/README.md b/README.md
index 81a08e085..8c5fa7ad5 100644
--- a/README.md
+++ b/README.md
@@ -56,10 +56,12 @@ You can install SGLang using any of the methods below.
 pip install --upgrade pip
 pip install "sglang[all]"
 
-# Install FlashInfer CUDA kernels
+# Install FlashInfer accelerated kernels
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```
 
+**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
+
 ### Method 2: From source
 ```
 # Use the last release branch
@@ -69,10 +71,12 @@ cd sglang
 pip install --upgrade pip
 pip install -e "python[all]"
 
-# Install FlashInfer CUDA kernels
+# Install FlashInfer accelerated kernels
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```
 
+**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
+
 ### Method 3: Using docker
 
 The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker). Replace `` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
@@ -226,7 +230,8 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
 ```
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
+- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. It does not currently work with constrained decoding.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. It does not currently work with FP8.
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@@ -247,7 +252,6 @@ We also provide an inference engine **without a HTTP server**. For example,
 
 ```python
 import sglang as sgl
-
 def main():
     prompts = [
         "Hello, my name is",
@@ -267,12 +271,8 @@ if __name__ == "__main__":
     main()
 ```
 
-This can be used for:
-
-1. **Offline Batch Inference**
-2. **Building Custom Servers**
-
-You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine)
+This can be used for offline batch inference and building custom servers.
+You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
 
 ### Supported Models
 
@@ -440,7 +440,6 @@ print(state["answer_1"])
 ```
 
 #### More Examples
-
 Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).
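The engine hunks above show only the opening lines of the README's offline-engine example. For reference, here is a minimal runnable sketch of the pattern the rewritten text describes; the model path, sampling parameters, and the `output['text']` result field follow the linked full example and are illustrative rather than part of this diff:

```python
import sglang as sgl

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    # Sampling parameters are passed as a plain dict.
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # The offline engine loads the model in-process; no HTTP server is started.
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()
```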
diff --git a/docs/en/backend.md b/docs/en/backend.md
index 516ab2af0..0b1103511 100644
--- a/docs/en/backend.md
+++ b/docs/en/backend.md
@@ -79,7 +79,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
 ```
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. It does not currently work with FP8.
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
@@ -100,7 +100,6 @@ We also provide an inference engine **without a HTTP server**. For example,
 
 ```python
 import sglang as sgl
-
 def main():
     prompts = [
         "Hello, my name is",
@@ -120,12 +119,8 @@ if __name__ == "__main__":
     main()
 ```
 
-This can be used for:
-
-1. **Offline Batch Inference**
-2. **Building Custom Servers**
-
-You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine)
+This can be used for offline batch inference and building custom servers.
+You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
 
 ### Supported Models
diff --git a/docs/en/frontend.md b/docs/en/frontend.md
index a90c29032..2efd21e3c 100644
--- a/docs/en/frontend.md
+++ b/docs/en/frontend.md
@@ -68,7 +68,6 @@ print(state["answer_1"])
 ```
 
 #### More Examples
-
 Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).
diff --git a/docs/en/hyperparameter_tuning.md b/docs/en/hyperparameter_tuning.md
index f2bf9d55f..b0aa15e8a 100644
--- a/docs/en/hyperparameter_tuning.md
+++ b/docs/en/hyperparameter_tuning.md
@@ -6,11 +6,11 @@ Achieving a large batch size is the most important thing for attaining high thro
 
 When the server is running at full load, look for the following in the log:
 
-```Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 417```
+```Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317```
 
 ### Tune Your Request Submission Speed
 `#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
-A healthy range for `#queue-req` is `50 - 1000`.
+A healthy range for `#queue-req` is `50 - 500`.
 On the other hand, do not make `#queue-req` too large because it will also increase the scheduling overhead on the server.
 
 ### Tune `--schedule-conservativeness`
@@ -31,6 +31,10 @@ If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096
 If OOM happens during decoding, try to decrease `--max-running-requests`.
 You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
 
+### Try advanced options
+- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. It does not currently work with constrained decoding.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. It does not currently work with FP8.
+
 ### (Minor) Tune `--schedule-policy`
 If you have many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
 When you have no shared prefixes at all or you always send the requests with the shared prefixes together,
diff --git a/docs/en/install.md b/docs/en/install.md
index 55eed71ae..b118a9289 100644
--- a/docs/en/install.md
+++ b/docs/en/install.md
@@ -7,23 +7,27 @@ You can install SGLang using any of the methods below.
 pip install --upgrade pip
 pip install "sglang[all]"
 
-# Install FlashInfer CUDA kernels
+# Install FlashInfer accelerated kernels
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```
 
+**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
+
 ### Method 2: From source
 ```
 # Use the last release branch
-git clone -b v0.3.0 https://github.com/sgl-project/sglang.git
+git clone -b v0.3.4.post1 https://github.com/sgl-project/sglang.git
 cd sglang
 
 pip install --upgrade pip
 pip install -e "python[all]"
 
-# Install FlashInfer CUDA kernels
+# Install FlashInfer accelerated kernels
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
 ```
 
+**Important: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.**
+
 ### Method 3: Using docker
 
 The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker). Replace `` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).
@@ -94,3 +98,4 @@ sky status --endpoint 30000 sglang
 ### Common Notes
 - [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
+- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend runs on a GPU-enabled machine. To install the frontend, run `pip install sglang`; for the backend, use `pip install "sglang[srt]"`. This allows you to build SGLang programs locally and execute them by connecting to the remote backend.
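The note added at the end of install.md describes a frontend-only workflow: a machine with just `pip install sglang` and no GPU can drive a remote backend that was set up with `pip install "sglang[srt]"` and started via `python -m sglang.launch_server`. A minimal sketch of that workflow follows; the host and port of the remote backend are assumptions (30000 is the launcher's default):

```python
import sglang as sgl

# Point the local, GPU-less frontend at the remote backend's HTTP endpoint.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    # Build the prompt locally; generation runs on the remote backend.
    s += "Q: " + question + "\n"
    s += "A: " + sgl.gen("answer", max_tokens=64)

state = qa.run(question="What is the capital of France?")
print(state["answer"])
```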