Refactor the docs (#9031)
docs/platforms/amd_gpu.md (new file)

# AMD GPUs

This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).

## System Configuration

When using AMD GPUs (such as the MI300X), certain system-level optimizations help ensure stable performance. Here we take the MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)

**NOTE:** We strongly recommend reading these docs and guides in full to get the most out of your system.

Below are a few key settings to confirm or enable for SGLang:

### Update GRUB Settings

In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:

```text
pci=realloc=off iommu=pt
```

Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot.
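
After rebooting, you can confirm the new flags are active. A minimal check, assuming a Linux system that exposes `/proc/cmdline`:

```shell
# Verify the running kernel was booted with the flag appended above
if grep -q 'iommu=pt' /proc/cmdline 2>/dev/null; then
  status="iommu=pt active"
else
  status="iommu=pt missing - GRUB update may not have taken effect"
fi
echo "$status"
```

The same pattern works for `pci=realloc=off`.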

### Disable NUMA Auto-Balancing

```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```

You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).
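
To read the setting back directly (the file may be absent on kernels built without NUMA balancing):

```shell
# Read back the current setting; 0 means auto-balancing is disabled
if [ -r /proc/sys/kernel/numa_balancing ]; then
  numa_balancing=$(cat /proc/sys/kernel/numa_balancing)
else
  numa_balancing="not exposed by this kernel"
fi
echo "numa_balancing: $numa_balancing"
```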

Again, please go through the entire documentation to confirm your system is using the recommended configuration.

## Install SGLang

You can install SGLang using one of the methods below.

### Install from Source

```bash
# Use the last release branch
git clone -b v0.5.0rc0 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install the sglang python package
cd ..
pip install -e "python[all_hip]"
```
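
After the build finishes, a quick sanity check that the packages are visible to Python (the module names `sglang` and `sgl_kernel` are assumed from the install steps above):

```shell
# Report whether the sglang and sgl_kernel modules are importable
check_mod() {
  python3 -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec('$1') else 1)" 2>/dev/null \
    && echo "$1: found" || echo "$1: not installed"
}
result="$(check_mod sglang; check_mod sgl_kernel)"
echo "$result"
```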

### Install Using Docker (Recommended)

The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).

The steps below show how to build and use an image.

1. Build the docker image.

   If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image name in the steps below.

   ```bash
   docker build -t sglang_image -f Dockerfile.rocm .
   ```

2. Create a convenient alias.

   ```bash
   alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
       --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
       --security-opt seccomp=unconfined \
       -v $HOME/dockerx:/dockerx \
       -v /data:/data'
   ```

   If you are using RDMA, please note that:

   - `--network=host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
   - You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

   **NOTE:** Replace `<secret>` below with your [huggingface hub token](https://huggingface.co/docs/hub/en/security-tokens).

   ```bash
   drun -p 30000:30000 \
       -v ~/.cache/huggingface:/root/.cache/huggingface \
       --env "HF_TOKEN=<secret>" \
       sglang_image \
       python3 -m sglang.launch_server \
       --model-path NousResearch/Meta-Llama-3.1-8B \
       --host 0.0.0.0 \
       --port 30000
   ```

4. To verify the setup, you can run a benchmark in another terminal, or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.

   ```bash
   drun sglang_image \
       python3 -m sglang.bench_serving \
       --backend sglang \
       --dataset-name random \
       --num-prompts 4000 \
       --random-input 128 \
       --random-output 128
   ```

With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.

## Examples

### Running DeepSeek-V3

The only difference when running DeepSeek-V3 is in how you start the server; note the `--model-path` in this example command:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) could also be a good reference.

### Running Llama3.1

Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the model specified when starting the server, again via `--model-path`:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

### Warmup Step

When the server prints `The server is fired up and ready to roll!`, the startup has succeeded.
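
Besides watching for that log line, you can probe the serving port from the host. A bash-only check (no `curl` required), assuming the default port 30000:

```shell
# Probe the serving port using bash's /dev/tcp; the subshell closes the fd automatically
if (exec 3<>/dev/tcp/localhost/30000) 2>/dev/null; then
  port_status="open"
else
  port_status="closed"
fi
echo "port 30000: $port_status"
```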

docs/platforms/ascend_npu.md (new file)

# Ascend NPUs

## Install

TODO

## Examples

TODO

docs/platforms/blackwell_gpu.md (new file)

# Blackwell GPUs

We will release pre-built wheels soon. Until then, please compile from source or use the Blackwell docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).

## B200 with x86 CPUs

TODO

## GB200/GB300 with ARM CPUs

TODO

docs/platforms/cpu_server.md (new file)

# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers. Specifically, SGLang is well optimized for CPUs equipped with Intel® AMX instructions, i.e., 4th generation or newer Intel® Xeon® Scalable Processors.
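
You can check whether your CPU exposes AMX before proceeding; the flag names below are the ones the Linux kernel reports in `/proc/cpuinfo` on AMX-capable Xeons:

```shell
# AMX-capable CPUs report amx_tile (alongside amx_int8 and amx_bf16) in the flags line
if grep -qm1 'amx_tile' /proc/cpuinfo 2>/dev/null; then
  amx_status="AMX supported"
else
  amx_status="AMX not detected"
fi
echo "$amx_status"
```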

## Optimized Model List

A number of popular LLMs are optimized to run efficiently on CPU, including notable open-source models such as the Llama series, the Qwen series, and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above have been verified on 6th Gen Intel® Xeon® P-core platforms.
## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment. A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation. Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Start a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```

### Install From Source

If you'd prefer to install SGLang in a bare-metal environment, use the commands below. Note that the environment variable `SGLANG_USE_CPU_ENGINE=1` is required to enable the SGLang service with the CPU engine.

```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu

# Optional: Set PyTorch CPU as primary pip install channel to avoid installing the CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple

# Check if some conda-related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
export CONDA_ROOT=${CONDA_EXE}/../..
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Install SGLang dependent libs, and build the SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"

# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .

# Other required environment variables
# Recommended to set these in ~/.bashrc so you don't have to set them in every new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```

## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` specifies that tensor parallelism will be applied across 6 ranks (TP6). On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC). You can usually query the number of available SNCs from the operating system (e.g. with `lscpu` or `numactl --hardware`).

   The TP size must not exceed the total number of SNCs in the system. If the specified TP size `n` is smaller than the total SNC count, the system automatically uses the first `n` SNCs; exceeding the total results in an error.

   To control which cores are used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`. For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server, which has 43-43-42 cores on the 3 SNCs of each socket, set:

   ```bash
   export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
   ```

3. A warmup step is automatically triggered when the service starts. The server is ready when you see the log `The server is fired up and ready to roll!`.
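
The `SGLANG_CPU_OMP_THREADS_BIND` core ranges can be derived mechanically from the per-SNC core counts. A small sketch: the 43-43-42 layout below is the Xeon® 6980P example from this section, so substitute your own `lscpu`/`numactl --hardware` figures:

```shell
# Build SGLANG_CPU_OMP_THREADS_BIND from per-SNC core counts and a per-SNC core budget
snc_sizes="43 43 42 43 43 42"   # 3 SNCs per socket, 2 sockets (Xeon 6980P example)
cores_per_snc=40                # use the first 40 cores of each SNC
start=0
bind=""
for size in $snc_sizes; do
  end=$((start + cores_per_snc - 1))
  bind="${bind:+$bind|}${start}-${end}"   # append "start-end", '|'-separated
  start=$((start + size))
done
echo "SGLANG_CPU_OMP_THREADS_BIND=$bind"
```

For the example inputs this reproduces the binding string shown above.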

## Benchmarking with Requests

You can benchmark performance with the `bench_serving` script. Run this command in another terminal:

```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

Detailed explanations of the parameters are available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the [OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html) and sent from the command line (e.g. using `curl`) or from your own script.
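
For instance, a minimal completion request body; the prompt and parameter values here are illustrative placeholders, and the commented `curl` line assumes a server on the default port:

```shell
# Build an OpenAI-style completion request body
payload='{"model": "default", "prompt": "The capital of France is", "max_tokens": 16, "temperature": 0}'
echo "$payload"
# With the server running, send it like this:
# curl -s http://localhost:30000/v1/completions -H "Content-Type: application/json" -d "$payload"
```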

## Example: Running DeepSeek-R1

An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:

```bash
python -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Similarly, an example command to launch the service for FP8 DeepSeek-R1:

```bash
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Then you can test with the `bench_serving` command, or construct your own command or script following [the benchmarking example](#benchmarking-with-requests).

docs/platforms/nvidia_jetson.md (new file)

# NVIDIA Jetson Orin

## Prerequisites

Before starting, ensure the following:

- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:

  ```bash
  sudo nvpmodel -m 0
  ```
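
You can query the active power mode to confirm; `nvpmodel` ships with JetPack, so this snippet falls back gracefully on non-Jetson machines:

```shell
# Query the current power mode where nvpmodel is available
if command -v nvpmodel >/dev/null 2>&1; then
  mode="$(nvpmodel -q 2>/dev/null || echo 'query failed')"
else
  mode="nvpmodel not found (not a Jetson system?)"
fi
echo "$mode"
```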

* * * * *

## Installing and running SGLang with Jetson Containers

Clone the jetson-containers github repository:

```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```

Run the installation script:

```bash
bash jetson-containers/install.sh
```

Build the container:

```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```

Run the container:

```bash
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```

* * * * *

## Running Inference

Launch the server:

```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --dtype half \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192
```

The reduced precision and limited context length (`--dtype half --context-length 8192`) are due to the limited computational resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../backend/server_arguments.md).

After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test the usability.

* * * * *

## Running quantization with TorchAO

TorchAO is recommended on NVIDIA Jetson Orin.

```bash
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128
```

This enables TorchAO's int4 weight-only quantization with a group size of 128. The use of `--torchao-config int4wo-128` is likewise motivated by memory efficiency.

* * * * *

## Structured output with XGrammar

Please refer to [SGLang doc structured output](../backend/structured_outputs.ipynb).

* * * * *

Thanks to the support from [shahizat](https://github.com/shahizat).

## References

- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)

docs/platforms/tpu.md (new file)

# TPU

The support for TPU is under active development. Please stay tuned.