Refactor the docs (#9031)
docs/platforms/amd_gpu.md (new file)

# AMD GPUs

This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).

## System Configuration

When using AMD GPUs (such as the MI300X), certain system-level optimizations help ensure stable performance. Here we take the MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)

**NOTE:** We strongly recommend reading these docs and guides in full to get the most out of your system.

Below are a few key settings to confirm or enable for SGLang:

### Update GRUB Settings

In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:

```text
pci=realloc=off iommu=pt
```

Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot.
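
After rebooting, you can confirm the new flags are active. A minimal check, assuming a Linux system that exposes `/proc/cmdline`:

```shell
# Verify the running kernel was booted with the flag appended above
if grep -q 'iommu=pt' /proc/cmdline 2>/dev/null; then
  status="iommu=pt active"
else
  status="iommu=pt missing - GRUB update may not have taken effect"
fi
echo "$status"
```

The same pattern works for `pci=realloc=off`.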

### Disable NUMA Auto-Balancing

```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```

You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).
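
To read the setting back directly (the file may be absent on kernels built without NUMA balancing):

```shell
# Read back the current setting; 0 means auto-balancing is disabled
if [ -r /proc/sys/kernel/numa_balancing ]; then
  numa_balancing=$(cat /proc/sys/kernel/numa_balancing)
else
  numa_balancing="not exposed by this kernel"
fi
echo "numa_balancing: $numa_balancing"
```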

Again, please go through the entire documentation to confirm your system is using the recommended configuration.

## Install SGLang

You can install SGLang using one of the methods below.

### Install from Source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install the sglang python package
cd ..
pip install -e "python[all_hip]"
```
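
After the build finishes, a quick sanity check that the packages are visible to Python (the module names `sglang` and `sgl_kernel` are assumed from the install steps above):

```shell
# Report whether the sglang and sgl_kernel modules are importable
check_mod() {
  python3 -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$1') else 1)" 2>/dev/null \
    && echo "$1: found" || echo "$1: not installed"
}
result="$(check_mod sglang; check_mod sgl_kernel)"
echo "$result"
```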

### Install Using Docker (Recommended)

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).

The steps below show how to build and use an image.

1. Build the docker image.

   If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image name in the steps below.

   ```bash
   docker build -t sglang_image -f Dockerfile.rocm .
   ```

2. Create a convenient alias.

   ```bash
   alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
       --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
       --security-opt seccomp=unconfined \
       -v $HOME/dockerx:/dockerx \
       -v /data:/data'
   ```

   If you are using RDMA, please note that:

   - `--network=host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
   - You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

   **NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).

   ```bash
   drun -p 30000:30000 \
       -v ~/.cache/huggingface:/root/.cache/huggingface \
       --env "HF_TOKEN=<secret>" \
       sglang_image \
       python3 -m sglang.launch_server \
       --model-path NousResearch/Meta-Llama-3.1-8B \
       --host 0.0.0.0 \
       --port 30000
   ```

4. To verify the setup, you can run a benchmark in another terminal, or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.

   ```bash
   drun sglang_image \
       python3 -m sglang.bench_serving \
       --backend sglang \
       --dataset-name random \
       --num-prompts 4000 \
       --random-input 128 \
       --random-output 128
   ```

With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.

## Examples

### Running DeepSeek-V3

The only difference when running DeepSeek-V3 is in how you start the server; note the `--model-path` in this example command:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.

### Running Llama3.1

Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model specified when starting the server, again via `--model-path`:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

### Warmup Step

When the server prints `The server is fired up and ready to roll!`, the startup has succeeded.
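
Besides watching for that log line, you can probe the serving port from the host. A bash-only check (no `curl` required), assuming the default port 30000:

```shell
# Probe the serving port using bash's /dev/tcp; the subshell closes the fd automatically
if (exec 3<>/dev/tcp/localhost/30000) 2>/dev/null; then
  port_status="open"
else
  port_status="closed"
fi
echo "port 30000: $port_status"
```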

docs/platforms/ascend_npu.md (new file)

# Ascend NPUs

## Install

TODO

## Examples

TODO

docs/platforms/blackwell_gpu.md (new file)

# Blackwell GPUs

We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).

## B200 with x86 CPUs

TODO

## GB200/GB300 with ARM CPUs

TODO

docs/platforms/cpu_server.md (new file)

# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers. Specifically, SGLang is well optimized for CPUs equipped with Intel® AMX instructions, i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
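
You can check whether your CPU exposes AMX before proceeding; the flag names below are the ones the Linux kernel reports in `/proc/cpuinfo` on AMX-capable Xeons:

```shell
# AMX-capable CPUs report amx_tile (alongside amx_int8 and amx_bf16) in the flags line
if grep -qm1 'amx_tile' /proc/cpuinfo 2>/dev/null; then
  amx_status="AMX supported"
else
  amx_status="AMX not detected"
fi
echo "$amx_status"
```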

## Optimized Model List

A number of popular LLMs are optimized to run efficiently on CPU, including notable open-source models such as the Llama series, the Qwen series, and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment. A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation. Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Start a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```

### Install From Source

If you'd prefer to install SGLang in a bare-metal environment, use the commands below. Note that the environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.

```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu

# Optional: Set PyTorch CPU as primary pip install channel to avoid installing the CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple

# Check if some conda-related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Install SGLang dependent libs, and build the SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"

# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .

# Other required environment variables
# Recommended to set these in ~/.bashrc so you don't have to set them in every new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```

## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` specifies that tensor parallelism will be applied across 6 ranks (TP6). On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC). You can usually query the number of available SNCs from the operating system (e.g. with `lscpu` or `numactl --hardware`).

   The TP size must not exceed the total number of SNCs in the system. If the specified TP size `n` is smaller than the total SNC count, the system automatically uses the first `n` SNCs; exceeding the total results in an error.

   To control which cores are used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server, which has 43-43-42 cores on the 3 SNCs of each socket, set:

   ```bash
   export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
   ```

3. A warmup step is automatically triggered when the service starts. The server is ready when you see the log `The server is fired up and ready to roll!`.
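
The `SGLANG_CPU_OMP_THREADS_BIND` core ranges can be derived mechanically from the per-SNC core counts. A small sketch: the 43-43-42 layout below is the Xeon® 6980P example from this section, so substitute your own `lscpu`/`numactl --hardware` figures:

```shell
# Build SGLANG_CPU_OMP_THREADS_BIND from per-SNC core counts and a per-SNC core budget
snc_sizes="43 43 42 43 43 42"   # 3 SNCs per socket, 2 sockets (Xeon 6980P example)
cores_per_snc=40                # use the first 40 cores of each SNC
start=0
bind=""
for size in $snc_sizes; do
  end=$((start + cores_per_snc - 1))
  bind="${bind:+$bind|}${start}-${end}"   # append "start-end", '|'-separated
  start=$((start + size))
done
echo "SGLANG_CPU_OMP_THREADS_BIND=$bind"
```

For the example inputs this reproduces the binding string shown above.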

## Benchmarking with Requests

You can benchmark performance with the `bench_serving` script. Run this command in another terminal:

```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the [OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html) and sent from the command line (e.g. using `curl`) or from your own script.
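
For instance, a minimal completion request body; the prompt and parameter values here are illustrative placeholders, and the commented `curl` line assumes a server on the default port:

```shell
# Build an OpenAI-style completion request body
payload='{"model": "default", "prompt": "The capital of France is", "max_tokens": 16, "temperature": 0}'
echo "$payload"
# With the server running, send it like this:
# curl -s http://localhost:30000/v1/completions -H "Content-Type: application/json" -d "$payload"
```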

## Example: Running DeepSeek-R1

An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:

```bash
python -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Similarly, an example command to launch the service for FP8 DeepSeek-R1:

```bash
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Then you can test with the `bench_serving` command, or construct your own command or script following [the benchmarking example](#benchmarking-with-requests).

docs/platforms/nvidia_jetson.md (new file)

# NVIDIA Jetson Orin

## Prerequisites

Before starting, ensure the following:

- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:

  ```bash
  sudo nvpmodel -m 0
  ```
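
You can query the active power mode to confirm; `nvpmodel` ships with JetPack, so this snippet falls back gracefully on non-Jetson machines:

```shell
# Query the current power mode where nvpmodel is available
if command -v nvpmodel >/dev/null 2>&1; then
  mode="$(nvpmodel -q 2>/dev/null || echo 'query failed')"
else
  mode="nvpmodel not found (not a Jetson system?)"
fi
echo "$mode"
```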

* * * * *

## Installing and running SGLang with Jetson Containers

Clone the jetson-containers github repository:

```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```

Run the installation script:

```bash
bash jetson-containers/install.sh
```

Build the container:

```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```

Run the container:

```bash
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```

* * * * *

## Running Inference

Launch the server:

```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --dtype half \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192
```

The reduced precision and limited context length (`--dtype half --context-length 8192`) are due to the limited computational resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../backend/server_arguments.md).

After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test the usability.

* * * * *

## Running quantization with TorchAO

TorchAO is recommended on NVIDIA Jetson Orin.

```bash
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128
```

This enables TorchAO's int4 weight-only quantization with a group size of 128. The use of `--torchao-config int4wo-128` is likewise motivated by memory efficiency.

* * * * *

## Structured output with XGrammar

Please refer to [SGLang doc structured output](../backend/structured_outputs.ipynb).

* * * * *

Thanks to the support from [shahizat](https://github.com/shahizat).

## References

- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)

docs/platforms/tpu.md (new file)

# TPU

The support for TPU is under active development. Please stay tuned.