Refactor the docs (#9031)

This commit is contained in:
Lianmin Zheng
2025-08-10 19:49:45 -07:00
committed by GitHub
parent 0f229c07f1
commit 2449a0afe2
80 changed files with 619 additions and 750 deletions

docs/platforms/amd_gpu.md Normal file

@@ -0,0 +1,158 @@
# AMD GPUs
This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).
## System Configuration
When using AMD GPUs (such as the MI300X), certain system-level optimizations help ensure stable performance. We take the MI300X as an example here. AMD provides official documentation for MI300X optimization and system tuning:
- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)
**NOTE:** We strongly recommend reading these docs and guides in full to get the most out of your system.
Below are a few key settings to confirm or enable for SGLang:
### Update GRUB Settings
In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:
```text
pci=realloc=off iommu=pt
```
Afterward, run `sudo update-grub` (or your distro's equivalent) and reboot.
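After rebooting, you can confirm that the new parameters are active by inspecting the running kernel's command line:
```bash
# The appended parameters should appear in /proc/cmdline
cat /proc/cmdline | tr ' ' '\n' | grep -E 'pci=realloc|iommu'
```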
### Disable NUMA Auto-Balancing
```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```
You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).
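The echo above takes effect immediately but does not persist across reboots. To verify the current value and optionally make the setting permanent, here is a minimal sketch (the drop-in file name is arbitrary):
```bash
# 0 means NUMA auto-balancing is disabled
cat /proc/sys/kernel/numa_balancing

# Persist the setting across reboots via a sysctl drop-in
echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-disable-numa-balancing.conf
sudo sysctl --system
```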
Again, please go through the entire documentation to confirm your system is using the recommended configuration.
## Install SGLang
You can install SGLang using one of the methods below.
### Install from Source
```bash
# Use the latest release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang
# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```
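To confirm the installation succeeded and that ROCm can see your GPUs, a quick sanity check:
```bash
# Print the installed SGLang version
python -c "import sglang; print(sglang.__version__)"

# List the AMD GPUs visible to ROCm
rocm-smi
```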
### Install Using Docker (Recommended)
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).
The steps below show how to build and use an image.
1. Build the docker image.
If you use a pre-built image, you can skip this step and replace `sglang_image` with the pre-built image name in the steps below.
```bash
docker build -t sglang_image -f Dockerfile.rocm .
```
2. Create a convenient alias.
```bash
alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
--ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
--security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx \
-v /data:/data'
```
If you are using RDMA, please note that:
- `--network=host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
- You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.
3. Launch the server.
**NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path NousResearch/Meta-Llama-3.1-8B \
--host 0.0.0.0 \
--port 30000
```
4. To verify that the server works, you can run a benchmark in another terminal or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.
```bash
drun sglang_image \
python3 -m sglang.bench_serving \
--backend sglang \
--dataset-name random \
--num-prompts 4000 \
--random-input 128 \
--random-output 128
```
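For a quick sanity check before a full benchmark, you can also send a single request to the OpenAI-compatible endpoint (a sketch; adjust the model name to match the one you served):
```bash
curl -s http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "NousResearch/Meta-Llama-3.1-8B", "prompt": "The capital of France is", "max_tokens": 16}'
```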
With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang's machine learning capabilities.
## Examples
### Running DeepSeek-V3
The only difference when running DeepSeek-V3 is in how you start the server; note the `--model-path` flag in the example command below:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-V3 \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.
### Running Llama3.1
Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model specified when starting the server, again via the `--model-path` flag:
```bash
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--ipc=host \
--env "HF_TOKEN=<secret>" \
sglang_image \
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--tp 8 \
--trust-remote-code \
--host 0.0.0.0 \
--port 30000
```
### Warmup Step
A warmup step runs automatically when the server starts. When the server displays `The server is fired up and ready to roll!`, the startup has completed successfully.


@@ -0,0 +1,7 @@
# Ascend NPUs
## Install
TODO
## Examples
TODO


@@ -0,0 +1,9 @@
# Blackwell GPUs
We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell Docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).
## B200 with x86 CPUs
TODO
## GB200/GB300 with ARM CPUs
TODO


@@ -0,0 +1,197 @@
# CPU Servers
This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
SGLang is particularly well optimized for CPUs equipped with Intel® AMX instructions,
i.e., 4th-generation or newer Intel® Xeon® Scalable processors.
## Optimized Model List
A number of popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama and Qwen series
and the high-quality reasoning model DeepSeek-R1.
| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |
**Note:** The model identifiers listed in the table above
have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation
### Install Using Docker
It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker
# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .
# Initiate a docker container
docker run \
-it \
--privileged \
--ipc=host \
--network=host \
-v /dev/shm:/dev/shm \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-p 30000:30000 \
-e "HF_TOKEN=<secret>" \
sglang-cpu:main /bin/bash
```
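Once inside the container, you can quickly confirm that the host CPU exposes AMX (AMX-capable Xeons report `amx_tile`, `amx_int8`, and `amx_bf16` among their CPU flags):
```bash
# Should print the AMX feature flags on 4th Gen or newer Xeon CPUs
lscpu | grep -oE 'amx\w*' | sort -u
```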
### Install From Source
If you prefer to install SGLang in a bare-metal environment, use the commands below.
Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
is required to run SGLang with the CPU engine.
```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu
# Optional: Set PyTorch CPU as primary pip install channel to avoid installing CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple
# Check if some conda related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin
# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>
# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"
# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .
# Other required environment variables
# Recommended: set these in ~/.bashrc so you don't have to set them in every new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```
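As a quick sanity check after the build, verify that the package imports and the CPU engine flag is set:
```bash
python -c "import sglang; print(sglang.__version__)"
echo "SGLANG_USE_CPU_ENGINE=${SGLANG_USE_CPU_ENGINE}"
```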
## Launching the Serving Engine
An example command to launch the SGLang server:
```bash
python -m sglang.launch_server \
--model <MODEL_ID_OR_PATH> \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--tp 6
```
Notes:
1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.
2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6).
On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC).
The number of available SNCs can be obtained from the operating system
(see the sketch after these notes).
The TP size must not exceed the total number of available SNCs; exceeding it results in an error.
If you specify fewer ranks than the total SNC count, the first `n` SNCs are used automatically.
To specify the cores to be used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
For example, to run the SGLang service on the first 40 cores of each SNC of a Xeon® 6980P server,
which has 43-43-42 cores on the 3 SNCs of a socket, set:
```bash
export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
```
3. A warmup step is automatically triggered when the service is started.
The server is ready when you see the log `The server is fired up and ready to roll!`.
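As mentioned in note 2, the SNCs are exposed to the operating system as NUMA nodes, so standard tools can show how many there are and which core IDs belong to each (output format varies by platform):
```bash
# Each SNC appears as a NUMA node with its own CPU ID range
numactl --hardware | grep -E '^node [0-9]+ cpus'

# A shorter summary of the NUMA layout
lscpu | grep -i numa
```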
## Benchmarking with Requests
You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal.
```bash
python -m sglang.bench_serving \
--dataset-name random \
--random-input-len 1024 \
--random-output-len 1024 \
--num-prompts 1 \
--request-rate inf \
--random-range-ratio 1.0
```
Detailed explanations of the parameters can be obtained with the command:
```bash
python -m sglang.bench_serving -h
```
Additionally, requests can be formed following the
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent from the command line (e.g. with `curl`) or from your own script.
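For example, a minimal `curl` request against the completions endpoint (assuming the server launched above is listening on the default port 30000; replace the model with the one you launched):
```bash
curl -s http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "<MODEL_ID_OR_PATH>", "prompt": "Say hello", "max_tokens": 8}'
```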
## Example: Running DeepSeek-R1
An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:
```bash
python -m sglang.launch_server \
--model meituan/DeepSeek-R1-Channel-INT8 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--quantization w8a8_int8 \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```
Similarly, an example command to launch the service for FP8 DeepSeek-R1:
```bash
python -m sglang.launch_server \
--model deepseek-ai/DeepSeek-R1 \
--trust-remote-code \
--disable-overlap-schedule \
--device cpu \
--host 0.0.0.0 \
--mem-fraction-static 0.8 \
--max-total-tokens 65536 \
--tp 6
```
You can then test the service with the `bench_serving` command, or construct your own requests
following [the benchmarking example](#benchmarking-with-requests).


@@ -0,0 +1,76 @@
# NVIDIA Jetson Orin
## Prerequisites
Before starting, ensure the following:
- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:
```bash
sudo nvpmodel -m 0
```
* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers GitHub repository:
```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```bash
bash jetson-containers/install.sh
```
Build the container:
```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```
Run the container (replace `IMAGE_NAME` with the image built in the previous step):
```bash
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *
## Running Inference
Launch the server:
```bash
python -m sglang.launch_server \
--model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--device cuda \
--dtype half \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192
```
The reduced precision and limited context length (`--dtype half --context-length 8192`) are due to the limited compute and memory resources of the [NVIDIA Jetson Orin devkit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../backend/server_arguments.md).
After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test the usability.
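For instance, a minimal chat request (a sketch assuming the default port 30000):
```bash
curl -s http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 32}'
```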
* * * * *
## Running Quantization with TorchAO
TorchAO is recommended for NVIDIA Jetson Orin.
```bash
python -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--device cuda \
--dtype bfloat16 \
--attention-backend flashinfer \
--mem-fraction-static 0.8 \
--context-length 8192 \
--torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128, which further reduces memory usage.
* * * * *
## Structured Output with XGrammar
Please refer to [SGLang doc structured output](../backend/structured_outputs.ipynb).
* * * * *
Thanks to [shahizat](https://github.com/shahizat) for the support.
## References
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)

docs/platforms/tpu.md Normal file

@@ -0,0 +1,3 @@
# TPU
TPU support is under active development. Please stay tuned.