SGLang Installation Guide
SGLang consists of a frontend language (Structured Generation Language, SGLang) and a backend runtime (SGLang Runtime, SRT). The frontend can be used separately from the backend, allowing for a detached frontend-backend setup.
Quick Installation Options
1. Frontend Installation (Client-side, any platform)
pip install --upgrade pip
pip install sglang
Note: You can check the examples for how to use the frontend and backend separately.
2. Backend Installation (Server-side, Linux only)
pip install --upgrade pip
pip install "sglang[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
Note: The backend (SRT) is only needed on the server side and is only available for Linux right now.
Important: Check the flashinfer installation guidance to install the version that matches your PyTorch and CUDA versions.
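To find out which flashinfer wheel index you need, first check the PyTorch and CUDA versions installed locally (the cu121/torch2.4 index above is only one combination). A small sketch that is safe to run even if PyTorch is absent:

```shell
# Print the local torch and CUDA versions, or a notice if torch is missing.
OUT=$(python3 - <<'EOF'
try:
    import torch
    print("torch", torch.__version__, "cuda", torch.version.cuda)
except ImportError:
    print("torch is not installed in this environment")
EOF
)
echo "$OUT"
```

Match the printed versions against the available wheel indexes when choosing the `-i` URL.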
3. From Source (Latest version, Linux only for full installation)
# Use the latest release branch
# As of this documentation, it's v0.2.15, but newer versions may be available
# Do not clone the main branch directly; always use a specific release version
# The main branch may contain unresolved bugs before a new release
git clone -b v0.2.15 https://github.com/sgl-project/sglang.git
cd sglang
pip install -e "python[all]"
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
4. OpenAI Backend Only (Client-side, any platform)
If you only need to use the OpenAI backend, you can avoid installing other dependencies by using:
pip install "sglang[openai]"
Advanced Installation Options
1. Using Docker (Server-side, Linux only)
The docker images are available on Docker Hub as lmsysorg/sglang, built from the Dockerfile. Replace <secret> below with your Hugging Face Hub token.
docker run --gpus all -p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" --ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
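Once the server is up, you can sanity-check it with a request to SRT's native /generate endpoint. A minimal sketch using only the Python standard library; the prompt and sampling parameters are illustrative:

```python
import json
import urllib.error
import urllib.request

# Request body for SRT's native /generate endpoint.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 16},
}

req = urllib.request.Request(
    "http://localhost:30000/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

try:
    with urllib.request.urlopen(req, timeout=5) as resp:
        print(json.load(resp)["text"])
except urllib.error.URLError as exc:
    # The server is not reachable (e.g. not started yet).
    print(f"server not reachable: {exc}")
```

The server also exposes an OpenAI-compatible API, so existing OpenAI clients can be pointed at the same host and port.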
2. Using Docker Compose
This method is recommended if you plan to serve SGLang as a service; an even better approach is to use the k8s-sglang-service.yaml.
- Copy the compose.yml to your local machine
- Execute docker compose up -d in your terminal.
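The repository's compose.yml carries the actual configuration; purely for orientation, a minimal file along the same lines might look like the sketch below (service name, mounted paths, and GPU reservation are assumptions here, not the repository's exact contents):

```yaml
# Illustrative compose.yml sketch; see the repository's file for the real one.
services:
  sglang:
    image: lmsysorg/sglang:latest
    ports:
      - "30000:30000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      HF_TOKEN: ${HF_TOKEN}
    ipc: host
    command: >
      python3 -m sglang.launch_server
      --model-path meta-llama/Meta-Llama-3.1-8B-Instruct
      --host 0.0.0.0 --port 30000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```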
3. Run on Kubernetes or Clouds with SkyPilot
To deploy on Kubernetes or 12+ clouds, you can use SkyPilot.
- Install SkyPilot and set up Kubernetes cluster or cloud access: see SkyPilot's documentation.
- Deploy on your own infra with a single command and get the HTTP API endpoint:
SkyPilot YAML: sglang.yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
- To further scale up your deployment with autoscaling and failure recovery, check out the SkyServe + SGLang guide.
Troubleshooting
- For FlashInfer issues on newer GPUs, use --disable-flashinfer --disable-flashinfer-sampling when launching the server.
- For out-of-memory errors, try --mem-fraction-static 0.7 when launching the server.
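These flags can be combined on one launch command. The sketch below only prints the command rather than executing it, so it is safe to run anywhere; the model path is illustrative:

```shell
# Launch command with FlashInfer disabled and a reduced static memory
# fraction (printed rather than executed; adjust the model path).
CMD="python3 -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --disable-flashinfer --disable-flashinfer-sampling \
  --mem-fraction-static 0.7"
echo "$CMD"
```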
For more details and advanced usage, visit the SGLang GitHub repository.