Install SGLang
You can install SGLang using any of the methods below.
Method 1: With pip
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]>=0.4.2.post4" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version built for torch 2.5. If you want to install flashinfer separately, please refer to the FlashInfer installation doc.
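Optionally, verify that the installed versions line up with the wheel index above (an illustrative check, not part of the official steps):
# Expect a torch 2.5.x build and sglang >= 0.4.2.post4
python3 -c "import torch, sglang; print(torch.__version__, sglang.__version__)"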
Method 2: From source
# Use the last release branch
git clone -b v0.4.2.post4 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.5/flashinfer/
Note: SGLang currently uses torch 2.5, so you need to install the flashinfer version built for torch 2.5. If you want to install flashinfer separately, please refer to the FlashInfer installation doc.
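Optionally, confirm that the editable install resolves to your checkout (an illustrative check, not part of the official steps):
# Should print a path inside your cloned sglang/python directory
python3 -c "import sglang; print(sglang.__file__)"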
If you want to do development work on SGLang, it is highly recommended that you use Docker. Please refer to the setup docker container guide. The image used is lmsysorg/sglang:dev.
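If you prefer to enter the dev image manually, a minimal sketch looks like this (the setup docker container guide is authoritative; the flags below simply mirror Method 3):
docker pull lmsysorg/sglang:dev
docker run --gpus all -it --shm-size 32g --ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
lmsysorg/sglang:dev /bin/bash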
Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:
# Use the last release branch
git clone -b v0.4.2.post4 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
cd sgl-kernel
python setup_rocm.py install
cd ..
pip install -e "python[all_hip]"
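To verify that a ROCm build of torch is active, an optional, illustrative check (torch.version.hip is a version string on ROCm builds and None on CUDA builds):
python3 -c "import torch; print(torch.version.hip)"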
Method 3: Using docker
The docker images are available on Docker Hub as lmsysorg/sglang, built from the Dockerfile.
Replace <secret> below with your huggingface hub token.
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
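Once the container is running, you can smoke-test the server. The /health and /generate routes below follow SGLang's native HTTP API; the request schema may differ slightly across versions:
curl http://localhost:30000/health
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16}}'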
Note: For AMD ROCm systems with Instinct/MI GPUs, it is recommended to use docker/Dockerfile.rocm to build images. Example build and usage commands are shown below:
docker build --build-arg SGL_BRANCH=v0.4.2.post4 -t v0.4.2.post4-rocm630 -f Dockerfile.rocm .
alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
--shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
-v $HOME/dockerx:/dockerx -v /data:/data'
drun -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
v0.4.2.post4-rocm630 \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# Until the flashinfer backend is available on ROCm, --attention-backend triton and --sampling-backend pytorch are set by default
drun v0.4.2.post4-rocm630 python3 -m sglang.bench_one_batch --batch-size 32 --input 1024 --output 128 --model amd/Meta-Llama-3.1-8B-Instruct-FP8-KV --tp 8 --quantization fp8
Method 4: Using docker compose
This method is recommended if you plan to serve SGLang as a service; for Kubernetes deployments, the k8s-sglang-service.yaml is a better approach.
- Copy the compose.yml to your local machine
- Execute the command docker compose up -d in your terminal.
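Once the stack is up, an illustrative way to verify it and later tear it down (assuming compose.yml maps port 30000 as in Method 3):
curl http://localhost:30000/health
docker compose down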
Method 5: Run on Kubernetes or Clouds with SkyPilot
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
- Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot's documentation.
- Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
- To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
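As an illustrative follow-up, the endpoint printed by sky status can be queried directly:
ENDPOINT=$(sky status --endpoint 30000 sglang)
curl http://$ENDPOINT/health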
Common Notes
- FlashInfer is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding --attention-backend triton --sampling-backend pytorch and open an issue on GitHub.
- If you only need to use OpenAI models with the frontend language, you can avoid installing other dependencies by using pip install "sglang[openai]".
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run pip install sglang, and for the backend, use pip install "sglang[srt]". This allows you to build SGLang programs locally and execute them by connecting to the remote backend.
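A minimal sketch of this split setup (the hostname below is a placeholder):
# On the GPU machine (backend):
pip install "sglang[srt]"
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
# On your local machine (frontend, no GPU needed):
pip install sglang
# SGLang programs then connect to the remote runtime via, e.g.,
# sgl.set_default_backend(sgl.RuntimeEndpoint("http://<backend-host>:30000"))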