adapt to sglang v0.5.2rc1 on dcu
158
docs/platforms/amd_gpu.md
Normal file
@@ -0,0 +1,158 @@
# AMD GPUs

This document describes how to run SGLang on AMD GPUs. If you encounter issues or have questions, please [open an issue](https://github.com/sgl-project/sglang/issues).

## System Configuration

When using AMD GPUs (such as MI300X), certain system-level optimizations help ensure stable performance. Here we take MI300X as an example. AMD provides official documentation for MI300X optimization and system tuning:

- [AMD MI300X Tuning Guides](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/index.html)
- [LLM inference performance validation on AMD Instinct MI300X](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference/vllm-benchmark.html)
- [AMD Instinct MI300X System Optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X Workload Optimization](https://rocm.docs.amd.com/en/latest/how-to/rocm-for-ai/inference-optimization/workload.html)
- [Supercharge DeepSeek-R1 Inference on AMD Instinct MI300X](https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR1-Part2/README.html)

**NOTE:** We strongly recommend reading these docs and guides entirely to fully utilize your system.

Below are a few key settings to confirm or enable for SGLang:

### Update GRUB Settings

In `/etc/default/grub`, append the following to `GRUB_CMDLINE_LINUX`:

```text
pci=realloc=off iommu=pt
```

Afterward, run `sudo update-grub` (or your distro’s equivalent) and reboot.

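After rebooting, it can be worth confirming that the kernel actually booted with both parameters. The sketch below assumes a Linux system that exposes the active command line at `/proc/cmdline`; the function name `check_kernel_params` is purely illustrative and is not part of SGLang or ROCm.

```shell
# Report whether each GRUB parameter recommended above is present in a
# kernel command line file. Defaults to /proc/cmdline (the running kernel).
check_kernel_params() {
    cmdline_file="${1:-/proc/cmdline}"
    status=0
    for param in pci=realloc=off iommu=pt; do
        # Split the command line into one token per line, then match exactly.
        if tr -s ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"; then
            echo "ok: $param"
        else
            echo "missing: $param"
            status=1
        fi
    done
    return "$status"
}
```

Calling `check_kernel_params` with no argument inspects the running kernel; a non-zero exit status means at least one parameter is missing.
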
### Disable NUMA Auto-Balancing

```bash
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```

You can automate or verify this change using [this helpful script](https://github.com/ROCm/triton/blob/rocm_env/scripts/amd/env_check.sh).

Again, please go through the entire documentation to confirm your system is using the recommended configuration.

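A value written under `/proc/sys` is reset at reboot, so the NUMA auto-balancing setting is worth re-checking before each benchmarking session. A minimal sketch of such a check follows; the helper name `numa_balancing_disabled` is hypothetical, and the optional path argument exists only so the check can be pointed at a different file.

```shell
# Succeed only if automatic NUMA balancing is disabled (value 0).
# Reads /proc/sys/kernel/numa_balancing unless another file is given.
numa_balancing_disabled() {
    value_file="${1:-/proc/sys/kernel/numa_balancing}"
    [ "$(cat "$value_file")" = "0" ]
}

# Example:
#   numa_balancing_disabled && echo "NUMA auto-balancing is off"
```
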
## Install SGLang

You can install SGLang using one of the methods below.

### Install from Source

```bash
# Use the last release branch
git clone -b v0.5.2rc1 https://github.com/sgl-project/sglang.git
cd sglang

# Compile sgl-kernel
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install

# Install sglang python package
cd ..
pip install -e "python[all_hip]"
```

### Install Using Docker (Recommended)

Docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile.rocm](https://github.com/sgl-project/sglang/tree/main/docker).

The steps below show how to build and use an image.

1. Build the docker image.
   If you use pre-built images, you can skip this step and replace `sglang_image` with the pre-built image names in the steps below.

   ```bash
   docker build -t sglang_image -f Dockerfile.rocm .
   ```

2. Create a convenient alias.

   ```bash
   alias drun='docker run -it --rm --network=host --privileged --device=/dev/kfd --device=/dev/dri \
       --ipc=host --shm-size 16G --group-add video --cap-add=SYS_PTRACE \
       --security-opt seccomp=unconfined \
       -v $HOME/dockerx:/dockerx \
       -v /data:/data'
   ```

   If you are using RDMA, please note that:
   - `--network=host` and `--privileged` are required by RDMA. If you don't need RDMA, you can remove them.
   - You may need to set `NCCL_IB_GID_INDEX` if you are using RoCE, for example: `export NCCL_IB_GID_INDEX=3`.

3. Launch the server.

   **NOTE:** Replace `<secret>` below with your [Hugging Face Hub token](https://huggingface.co/docs/hub/en/security-tokens).

   ```bash
   drun -p 30000:30000 \
       -v ~/.cache/huggingface:/root/.cache/huggingface \
       --env "HF_TOKEN=<secret>" \
       sglang_image \
       python3 -m sglang.launch_server \
       --model-path NousResearch/Meta-Llama-3.1-8B \
       --host 0.0.0.0 \
       --port 30000
   ```

4. To verify the setup, you can run a benchmark in another terminal or refer to [other docs](https://docs.sglang.ai/backend/openai_api_completions.html) to send requests to the engine.

   ```bash
   drun sglang_image \
       python3 -m sglang.bench_serving \
       --backend sglang \
       --dataset-name random \
       --num-prompts 4000 \
       --random-input-len 128 \
       --random-output-len 128
   ```

With your AMD system properly configured and SGLang installed, you can now fully leverage AMD hardware to power SGLang’s machine learning capabilities.

## Examples

### Running DeepSeek-V3

The only difference when running DeepSeek-V3 is in how you start the server: the `--model-path` changes, and the example adds `--tp 8` and `--trust-remote-code`. Here's an example command:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-V3 \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

[Running DeepSeek-R1 on a single NDv5 MI300X VM](https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/running-deepseek-r1-on-a-single-ndv5-mi300x-vm/4372726) is also a useful reference.

### Running Llama3.1

Running Llama3.1 is nearly identical to running DeepSeek-V3. The only difference is the `--model-path` value specified when starting the server, as in the following example command:

```bash
drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --ipc=host \
    --env "HF_TOKEN=<secret>" \
    sglang_image \
    python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --tp 8 \
    --trust-remote-code \
    --host 0.0.0.0 \
    --port 30000
```

### Warmup Step

The server has started successfully once it logs `The server is fired up and ready to roll!`.
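
When the server is launched from a script, watching the log by hand is not practical. The sketch below polls an HTTP endpoint instead. It assumes that `curl` is installed and that the server exposes a `/health` endpoint on the port used in the examples above; both the endpoint path and the helper name `wait_for_server` are assumptions to adapt to your deployment.

```shell
# Poll a URL until it answers successfully or roughly TIMEOUT seconds pass.
# Usage: wait_for_server [URL] [TIMEOUT_SECONDS]
wait_for_server() {
    url="${1:-http://127.0.0.1:30000/health}"
    timeout="${2:-600}"
    waited=0
    until curl -sf -o /dev/null --max-time 2 "$url"; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "server not ready after ${timeout}s: $url" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "server ready: $url"
}
```

A benchmark can then be chained as `wait_for_server && <benchmark command>`, so it starts only once the server responds.
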
206
docs/platforms/ascend_npu.md
Normal file
@@ -0,0 +1,206 @@
# Ascend NPUs

You can install SGLang using any of the methods below. Please go through the `System Settings` section to make sure your cluster runs at maximum performance. If you encounter any problems, feel free to [open an issue](https://github.com/sgl-project/sglang/issues).

## System Settings

### CPU performance power scheme

The default power scheme on Ascend hardware is `ondemand`, which can hurt performance. Changing it to `performance` is recommended.

```shell
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Make sure changes are applied successfully
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor # shows performance
```

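The verification command above reads only `cpu0`. A small sketch that checks every CPU is shown below; the function name `check_cpu_governors` is illustrative, and the optional argument (an alternative sysfs root) exists mainly to make the check easy to try against a test directory.

```shell
# Print every CPU whose governor is not "performance".
# Returns 0 if all governors match (or none are exposed), 1 otherwise.
check_cpu_governors() {
    cpu_root="${1:-/sys/devices/system/cpu}"
    status=0
    for gov_file in "$cpu_root"/cpu*/cpufreq/scaling_governor; do
        # Skip the unexpanded glob when no cpufreq files exist.
        [ -e "$gov_file" ] || continue
        governor=$(cat "$gov_file")
        if [ "$governor" != "performance" ]; then
            echo "$gov_file: $governor"
            status=1
        fi
    done
    return "$status"
}
```
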
### Disable NUMA balancing

```shell
sudo sysctl -w kernel.numa_balancing=0

# Check
cat /proc/sys/kernel/numa_balancing # shows 0
```

### Prevent swapping out system memory

```shell
sudo sysctl -w vm.swappiness=10

# Check
cat /proc/sys/vm/swappiness # shows 10
```

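`sysctl -w` changes only the running kernel, so both settings above are lost on reboot. One common way to persist them, assuming a distribution that reads drop-in files from `/etc/sysctl.d/`, is sketched below. The helper name `render_sysctl_conf` and the file name `99-sglang.conf` are arbitrary choices made for illustration.

```shell
# Print the two sysctl settings from this section in sysctl.d file format.
render_sysctl_conf() {
    printf '%s\n' \
        'kernel.numa_balancing = 0' \
        'vm.swappiness = 10'
}

# Example (needs root; the file name is an arbitrary choice):
#   render_sysctl_conf | sudo tee /etc/sysctl.d/99-sglang.conf
#   sudo sysctl --system
```
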
## Installing SGLang

### Method 1: Installing from source with prerequisites

#### Python Version

Only `python==3.11` is currently supported. To avoid breaking the system's pre-installed Python, consider installing into a [conda](https://github.com/conda/conda) environment.

```shell
conda create --name sglang_npu python=3.11
conda activate sglang_npu
```

#### MemFabric Adaptor

_TODO: MemFabric is still under development and will not be open-sourced until August/September 2025. For now, it is released as a prebuilt wheel package._

_Notice: The prebuilt wheel package targets `aarch64`. If you need an `amd64` build, please [open an issue](https://github.com/sgl-project/sglang/issues)._

MemFabric Adaptor is a drop-in replacement for the Mooncake Transfer Engine that enables KV cache transfer on Ascend NPU clusters.

```shell
MF_WHL_NAME="mf_adapter-1.0.0-cp311-cp311-linux_aarch64.whl"
MEMFABRIC_URL="https://sglang-ascend.obs.cn-east-3.myhuaweicloud.com/sglang/${MF_WHL_NAME}"
wget -O "${MF_WHL_NAME}" "${MEMFABRIC_URL}" && pip install "./${MF_WHL_NAME}"
```

#### PyTorch and PyTorch Framework Adaptor on Ascend

Only `torch==2.6.0` is currently supported due to limitations of NPUgraph and Triton-on-Ascend. A more general version is expected to be released by the end of September 2025.

```shell
PYTORCH_VERSION=2.6.0
TORCHVISION_VERSION=0.21.0
pip install torch==$PYTORCH_VERSION torchvision==$TORCHVISION_VERSION --index-url https://download.pytorch.org/whl/cpu

PTA_VERSION="v7.1.0.1-pytorch2.6.0"
PTA_NAME="torch_npu-2.6.0.post1-cp311-cp311-manylinux_2_28_aarch64.whl"
# Use the PTA_NAME variable defined above when building the download URL
PTA_URL="https://gitee.com/ascend/pytorch/releases/download/${PTA_VERSION}/${PTA_NAME}"
wget -O "${PTA_NAME}" "${PTA_URL}" && pip install "./${PTA_NAME}"
```

#### vLLM

vLLM is still a major prerequisite on Ascend NPU. Because of the `torch==2.6.0` limitation, only vLLM v0.8.5 is supported.

```shell
VLLM_TAG=v0.8.5
git clone --depth 1 https://github.com/vllm-project/vllm.git --branch $VLLM_TAG
(cd vllm && VLLM_TARGET_DEVICE="empty" pip install -v -e .)
```

#### Triton on Ascend

_Notice:_ We recommend installing triton-ascend from source because it is developing rapidly and the version on PyPI cannot keep up for now. This is expected to be resolved in September 2025, after which `pip install` will be the only installation method.

Please follow Triton-on-Ascend's [installation guide from source](https://gitee.com/ascend/triton-ascend#2%E6%BA%90%E4%BB%A3%E7%A0%81%E5%AE%89%E8%A3%85-triton-ascend) to install the latest `triton-ascend` package.

#### DeepEP-compatible Library

We also provide a DeepEP-compatible library as a drop-in replacement for deepseek-ai's DeepEP library. See the [installation guide](https://github.com/sgl-project/sgl-kernel-npu/blob/main/python/deep_ep/README.md).

#### Installing SGLang from source

```shell
# Use the last release branch
git clone -b v0.5.2rc1 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
# Quote the extras so shells such as zsh do not treat the brackets as a glob
pip install -e "python[srt_npu]"
```

### Method 2: Using docker

__Notice:__ `--privileged` and `--network=host` are required by RDMA, which is typically needed by Ascend NPU clusters.

__Notice:__ The following docker command is based on Atlas 800I A3 machines. If you are using Atlas 800I A2, make sure only `davinci[0-7]` are mapped into the container.

```shell
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-npu:main -f Dockerfile.npu .

alias drun='docker run -it --rm --privileged --network=host --ipc=host --shm-size=16g \
    --device=/dev/davinci0 --device=/dev/davinci1 --device=/dev/davinci2 --device=/dev/davinci3 \
    --device=/dev/davinci4 --device=/dev/davinci5 --device=/dev/davinci6 --device=/dev/davinci7 \
    --device=/dev/davinci8 --device=/dev/davinci9 --device=/dev/davinci10 --device=/dev/davinci11 \
    --device=/dev/davinci12 --device=/dev/davinci13 --device=/dev/davinci14 --device=/dev/davinci15 \
    --device=/dev/davinci_manager --device=/dev/hisi_hdc \
    --volume /usr/local/sbin:/usr/local/sbin --volume /usr/local/Ascend/driver:/usr/local/Ascend/driver \
    --volume /usr/local/Ascend/firmware:/usr/local/Ascend/firmware \
    --volume /etc/ascend_install.info:/etc/ascend_install.info \
    --volume /var/queue_schedule:/var/queue_schedule --volume ~/.cache/:/root/.cache/'

drun --env "HF_TOKEN=<secret>" \
    sglang-npu:main \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --attention-backend ascend --host 0.0.0.0 --port 30000
```

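Since the number of `davinci` devices differs between machine types (16 in the alias above, 8 for Atlas 800I A2 according to the notice), the `--device` flags can be generated rather than typed by hand. The following is a minimal sketch; the helper name `davinci_device_flags` is hypothetical and not part of SGLang or the Ascend tooling.

```shell
# Print one --device flag per NPU for the given device count,
# numbering the devices from /dev/davinci0 upward.
davinci_device_flags() {
    count="$1"
    flags=""
    i=0
    while [ "$i" -lt "$count" ]; do
        flags="${flags:+$flags }--device=/dev/davinci$i"
        i=$((i + 1))
    done
    echo "$flags"
}

# Example (left unquoted on purpose so the flags split into separate words):
#   docker run -it --rm $(davinci_device_flags 8) ...
```
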
## Examples

### Running DeepSeek-V3

Running DeepSeek with PD disaggregation on 2 x Atlas 800I A3.
Model weights can be found [here](https://modelers.cn/models/State_Cloud/Deepseek-R1-bf16-hfd-w8a8).

Prefill:

```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"

drun sglang-npu:main \
    python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
    --trust-remote-code \
    --attention-backend ascend \
    --mem-fraction-static 0.8 \
    --quantization w8a8_int8 \
    --tp-size 16 \
    --dp-size 1 \
    --nnodes 1 \
    --node-rank 0 \
    --disaggregation-mode prefill \
    --disaggregation-bootstrap-port 6657 \
    --disaggregation-transfer-backend ascend \
    --dist-init-addr <PREFILL_HOST_IP>:6688 \
    --host <PREFILL_HOST_IP> \
    --port 8000
```

Decode:

```shell
export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
export ASCEND_MF_STORE_URL="tcp://<PREFILL_HOST_IP>:<PORT>"
export HCCL_BUFFSIZE=200
export SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=24

drun sglang-npu:main \
    python3 -m sglang.launch_server --model-path State_Cloud/DeepSeek-R1-bf16-hfd-w8a8 \
    --trust-remote-code \
    --attention-backend ascend \
    --mem-fraction-static 0.8 \
    --quantization w8a8_int8 \
    --enable-deepep-moe \
    --deepep-mode low_latency \
    --tp-size 16 \
    --dp-size 1 \
    --ep-size 16 \
    --nnodes 1 \
    --node-rank 0 \
    --disaggregation-mode decode \
    --disaggregation-transfer-backend ascend \
    --dist-init-addr <DECODE_HOST_IP>:6688 \
    --host <DECODE_HOST_IP> \
    --port 8001
```

Mini_LB:

```shell
drun sglang-npu:main \
    python -m sglang.srt.disaggregation.launch_lb \
    --prefill http://<PREFILL_HOST_IP>:8000 \
    --decode http://<DECODE_HOST_IP>:8001 \
    --host 127.0.0.1 --port 5000
```
9
docs/platforms/blackwell_gpu.md
Normal file
@@ -0,0 +1,9 @@
# Blackwell GPUs

We will release the pre-built wheels soon. Until then, please compile from source or use the Blackwell Docker images from [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags).

## B200 with x86 CPUs
TODO

## GB200/GB300 with ARM CPUs
TODO
197
docs/platforms/cpu_server.md
Normal file
@@ -0,0 +1,197 @@
# CPU Servers

This document describes how to set up the [SGLang](https://github.com/sgl-project/sglang) environment and run LLM inference on CPU servers.
Specifically, SGLang is well optimized on CPUs equipped with Intel® AMX® instructions,
which are 4th generation or newer Intel® Xeon® Scalable Processors.

## Optimized Model List

Many popular LLMs are optimized to run efficiently on CPU,
including notable open-source models such as the Llama series, the Qwen series,
and the high-quality reasoning model DeepSeek-R1.

| Model Name | BF16 | w8a8_int8 | FP8 |
|:---:|:---:|:---:|:---:|
| DeepSeek-R1 | | [meituan/DeepSeek-R1-Channel-INT8](https://huggingface.co/meituan/DeepSeek-R1-Channel-INT8) | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
| Llama-3.2-3B | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [RedHatAI/Llama-3.2-3B-quantized.w8a8](https://huggingface.co/RedHatAI/Llama-3.2-3B-Instruct-quantized.w8a8) | |
| Llama-3.1-8B | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8](https://huggingface.co/RedHatAI/Meta-Llama-3.1-8B-quantized.w8a8) | |
| QwQ-32B | | [RedHatAI/QwQ-32B-quantized.w8a8](https://huggingface.co/RedHatAI/QwQ-32B-quantized.w8a8) | |
| DeepSeek-Distilled-Llama | | [RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8](https://huggingface.co/RedHatAI/DeepSeek-R1-Distill-Llama-70B-quantized.w8a8) | |
| Qwen3-235B | | | [Qwen/Qwen3-235B-A22B-FP8](https://huggingface.co/Qwen/Qwen3-235B-A22B-FP8) |

**Note:** The model identifiers listed in the table above
have been verified on 6th Gen Intel® Xeon® P-core platforms.

## Installation

### Install Using Docker

It is recommended to use Docker for setting up the SGLang environment.
A [Dockerfile](https://github.com/sgl-project/sglang/blob/main/docker/Dockerfile.xeon) is provided to facilitate the installation.
Replace `<secret>` below with your [HuggingFace access token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
# Clone the SGLang repository
git clone https://github.com/sgl-project/sglang.git
cd sglang/docker

# Build the docker image
docker build -t sglang-cpu:main -f Dockerfile.xeon .

# Initiate a docker container
docker run \
    -it \
    --privileged \
    --ipc=host \
    --network=host \
    -v /dev/shm:/dev/shm \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -p 30000:30000 \
    -e "HF_TOKEN=<secret>" \
    sglang-cpu:main /bin/bash
```

### Install From Source

If you prefer to install SGLang in a bare-metal environment,
use the commands below.
Note that the environment variable `SGLANG_USE_CPU_ENGINE=1`
is required to enable the SGLang service with the CPU engine.

```bash
# Create and activate a conda environment
conda create -n sgl-cpu python=3.12 -y
conda activate sgl-cpu

# Optional: Set PyTorch CPU as primary pip install channel to avoid installing CUDA version
pip config set global.index-url https://download.pytorch.org/whl/cpu
pip config set global.extra-index-url https://pypi.org/simple

# Check if some conda related environment variables have been set
env | grep -i conda
# The following environment variable settings are required
# if they have not been set properly
export CONDA_EXE=$(which conda)
# Two directory levels above the conda executable (e.g. <root>/bin/conda)
export CONDA_ROOT=$(dirname "$(dirname "${CONDA_EXE}")")
export CONDA_PREFIX=${CONDA_ROOT}/envs/sgl-cpu
export PATH=${PATH}:${CONDA_ROOT}/bin:${CONDA_ROOT}/condabin

# Clone the SGLang code
git clone https://github.com/sgl-project/sglang.git
cd sglang
git checkout <YOUR-DESIRED-VERSION>

# Install SGLang dependent libs, and build SGLang main package
pip install --upgrade pip setuptools
conda install -y libsqlite==3.48.0 gperftools tbb libnuma numactl
pip install intel-openmp
pip install -e "python[all_cpu]"

# Build the CPU backend kernels
cd sgl-kernel
cp pyproject_cpu.toml pyproject.toml
pip install -v .

# Other required environment variables
# Consider adding these to ~/.bashrc so they do not have to be set in every new terminal
export SGLANG_USE_CPU_ENGINE=1
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libiomp5.so:${CONDA_PREFIX}/lib/libtcmalloc.so:${CONDA_PREFIX}/lib/libtbbmalloc.so.2
```

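A wrong `CONDA_PREFIX` or a missing package leaves `LD_PRELOAD` pointing at files that do not exist, which the dynamic loader reports for every command started afterwards. The sketch below lists such entries before the server is launched; the function name `check_preload_libs` is illustrative only.

```shell
# List entries of a colon-separated LD_PRELOAD value that do not exist.
# Returns 1 if any library file is missing, 0 otherwise.
check_preload_libs() {
    preload="${1:-${LD_PRELOAD:-}}"
    status=0
    old_ifs=$IFS
    IFS=:
    for lib in $preload; do
        # Empty entries come from leading or doubled colons; skip them.
        [ -n "$lib" ] || continue
        if [ ! -e "$lib" ]; then
            echo "missing: $lib"
            status=1
        fi
    done
    IFS=$old_ifs
    return "$status"
}
```

Called without arguments, `check_preload_libs` inspects the current `LD_PRELOAD` value. Empty entries are skipped because the export above produces a leading colon when `LD_PRELOAD` was previously unset.
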
## Launch of the Serving Engine

Example command to launch SGLang serving:

```bash
python -m sglang.launch_server \
    --model <MODEL_ID_OR_PATH> \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --tp 6
```

Notes:

1. For running W8A8 quantized models, please add the flag `--quantization w8a8_int8`.

2. The flag `--tp 6` specifies that tensor parallelism will be applied using 6 ranks (TP6).
   On a CPU platform, a TP rank corresponds to a sub-NUMA cluster (SNC).
   The number of available SNCs can usually be obtained from the operating system.
   The TP value must not exceed the total number of SNCs available in the current system.

   If the specified TP rank count `n` is smaller than the total SNC count,
   the system automatically uses the first `n` SNCs.
   Specifying more ranks than there are SNCs results in an error.

   To specify the cores to be used, explicitly set the environment variable `SGLANG_CPU_OMP_THREADS_BIND`.
   For example, to run the SGLang service using the first 40 cores of each SNC on a Xeon® 6980P server,
   which has 43-43-42 cores on the 3 SNCs of a socket, set:

   ```bash
   export SGLANG_CPU_OMP_THREADS_BIND="0-39|43-82|86-125|128-167|171-210|214-253"
   ```

3. A warmup step is automatically triggered when the service is started.
   The server is ready when you see the log `The server is fired up and ready to roll!`.

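The SNC count and the core-binding string from note 2 can both be derived with a few lines of shell. The sketch below assumes that each SNC is exposed as a NUMA node under `/sys/devices/system/node` and that cores are numbered contiguously SNC by SNC, as in the 6980P example above. The function names are hypothetical and are not provided by SGLang.

```shell
# Count NUMA nodes (one per SNC when sub-NUMA clustering is enabled).
count_numa_nodes() {
    node_root="${1:-/sys/devices/system/node}"
    count=0
    for node_dir in "$node_root"/node[0-9]*; do
        [ -d "$node_dir" ] || continue
        count=$((count + 1))
    done
    echo "$count"
}

# Build a SGLANG_CPU_OMP_THREADS_BIND value that uses the first N cores of
# each SNC. Usage: build_omp_threads_bind N CORES_IN_SNC0 CORES_IN_SNC1 ...
build_omp_threads_bind() {
    use_cores="$1"
    shift
    start=0
    bind=""
    for snc_cores in "$@"; do
        if [ "$use_cores" -gt "$snc_cores" ]; then
            echo "error: requested $use_cores cores but an SNC has only $snc_cores" >&2
            return 1
        fi
        end=$((start + use_cores - 1))
        # Join the per-SNC ranges with "|" as in the example above.
        bind="${bind:+$bind|}${start}-${end}"
        start=$((start + snc_cores))
    done
    echo "$bind"
}
```

For the two-socket 6980P layout described above, `build_omp_threads_bind 40 43 43 42 43 43 42` reproduces the string shown in note 2.
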
## Benchmarking with Requests

You can benchmark the performance via the `bench_serving` script.
Run the command in another terminal.

```bash
python -m sglang.bench_serving \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 1 \
    --request-rate inf \
    --random-range-ratio 1.0
```

A detailed explanation of the parameters is available via:

```bash
python -m sglang.bench_serving -h
```

Additionally, requests can be formed with the
[OpenAI Completions API](https://docs.sglang.ai/backend/openai_api_completions.html)
and sent via the command line (e.g. using `curl`) or via your own script.

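As a minimal sketch of the `curl` route, the helper below builds a request body for the OpenAI-compatible completions endpoint. It assumes the server listens on port 30000 (no `--port` was passed in the launch command above) and that the model name and prompt contain no characters that need JSON escaping. The name `completion_payload` is illustrative.

```shell
# Print a minimal JSON body for the OpenAI-compatible completions endpoint.
# Assumes the model name and prompt contain no characters that need JSON
# escaping (no double quotes, backslashes, or newlines).
completion_payload() {
    model="$1"
    prompt="$2"
    max_tokens="${3:-32}"
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %s}\n' \
        "$model" "$prompt" "$max_tokens"
}

# Example (assumes the server is reachable on 127.0.0.1:30000):
#   curl -s http://127.0.0.1:30000/v1/completions \
#       -H "Content-Type: application/json" \
#       -d "$(completion_payload "<MODEL_ID_OR_PATH>" "The capital of France is" 16)"
```

For prompts with arbitrary content, a JSON-aware tool is a safer way to build the body than `printf`.
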
## Example: Running DeepSeek-R1

An example command to launch the service for W8A8 DeepSeek-R1 on a Xeon® 6980P server:

```bash
python -m sglang.launch_server \
    --model meituan/DeepSeek-R1-Channel-INT8 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --quantization w8a8_int8 \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Similarly, an example command to launch the service for FP8 DeepSeek-R1 would be:

```bash
python -m sglang.launch_server \
    --model deepseek-ai/DeepSeek-R1 \
    --trust-remote-code \
    --disable-overlap-schedule \
    --device cpu \
    --host 0.0.0.0 \
    --mem-fraction-static 0.8 \
    --max-total-tokens 65536 \
    --tp 6
```

Then you can test with the `bench_serving` command or construct your own command or script
following [the benchmarking example](#benchmarking-with-requests).
76
docs/platforms/nvidia_jetson.md
Normal file
@@ -0,0 +1,76 @@
# NVIDIA Jetson Orin

## Prerequisites

Before starting, ensure the following:

- [**NVIDIA Jetson AGX Orin Devkit**](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/) is set up with **JetPack 6.1** or later.
- **CUDA Toolkit** and **cuDNN** are installed.
- Verify that the Jetson AGX Orin is in **high-performance mode**:
  ```bash
  sudo nvpmodel -m 0
  ```

* * * * *
## Installing and running SGLang with Jetson Containers
Clone the jetson-containers GitHub repository:
```bash
git clone https://github.com/dusty-nv/jetson-containers.git
```
Run the installation script:
```bash
bash jetson-containers/install.sh
```
Build the container:
```bash
CUDA_VERSION=12.6 jetson-containers build sglang
```
Run the container:
```bash
docker run --runtime nvidia -it --rm --network=host IMAGE_NAME
```
* * * * *

Running Inference
-----------------------------------------

Launch the server:
```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --dtype half \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192
```
The half-precision dtype and limited context length (`--dtype half --context-length 8192`) are used because of the limited computational resources of the [NVIDIA Jetson kit](https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/). A detailed explanation can be found in [Server Arguments](../backend/server_arguments.md).

After launching the engine, refer to [Chat completions](https://docs.sglang.ai/backend/openai_api_completions.html#Usage) to test it.

* * * * *
Running quantization with TorchAO
-------------------------------------
TorchAO is recommended for quantization on NVIDIA Jetson Orin.
```bash
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128. Using `--torchao-config int4wo-128` also improves memory efficiency.

* * * * *
Structured output with XGrammar
-------------------------------
Please refer to the [SGLang structured outputs documentation](../advanced_features/structured_outputs.ipynb).

* * * * *

Thanks to the support from [shahizat](https://github.com/shahizat).

References
----------
- [NVIDIA Jetson AGX Orin Documentation](https://developer.nvidia.com/embedded/jetson-agx-orin)
3
docs/platforms/tpu.md
Normal file
@@ -0,0 +1,3 @@
# TPU

The support for TPU is under active development. Please stay tuned.