# Install SGLang
You can install SGLang using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md).
## Method 1: With pip or uv
It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.2"
```
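To confirm the installation, a quick import check (just a sanity check; it does not need a GPU):

```bash
# Should print the installed version, e.g. 0.5.2.
python3 -c "import sglang; print(sglang.__version__)"
```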
**Quick fixes to common problems**
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions:
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable (see the snippet below).
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
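For example, a minimal snippet for solution 1, assuming CUDA 12.4 is installed under `/usr/local/cuda-12.4` (adjust to your version):

```bash
# Point CUDA_HOME at the toolkit root; adding nvcc to PATH is optional but convenient.
export CUDA_HOME=/usr/local/cuda-12.4
export PATH="$CUDA_HOME/bin:$PATH"
nvcc --version  # verify the toolkit is found
```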
## Method 2: From source
```bash
# Use the last release branch
git clone -b v0.5.2 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python[all]"
```
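A quick way to verify the source install (printing the CLI help should not require a GPU):

```bash
# Confirm the editable install is picked up and the server entry point loads.
pip show sglang
python3 -m sglang.launch_server --help
```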
**Quick fixes to common problems**
- If you plan to develop SGLang itself, it is recommended to use Docker. See [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container); the docker image is `lmsysorg/sglang:dev`.
## Method 3: Using docker
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
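Once the container is up and the model has loaded, you can smoke-test the server; the request below assumes the OpenAI-compatible API that SGLang exposes:

```bash
# Basic health probe.
curl http://localhost:30000/health

# OpenAI-compatible chat completion (model name matches --model-path above).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```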
## Method 4: Using Kubernetes
Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
<details>
<summary>More</summary>
1. For single-node serving (typically when the model fits into the GPUs on one node):

   Run `kubectl apply -f docker/k8s-sglang-service.yaml` to create the Kubernetes deployment and service, using llama-31-8b as an example.

2. For multi-node serving (usually when a large model such as `DeepSeek-R1` requires more than one GPU node):

   Modify the model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and serving service.
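After applying either manifest, a quick way to check the deployment (resource names depend on the YAML; the placeholder below is an assumption):

```bash
# Watch the pods come up, then find the service name.
kubectl get pods
kubectl get svc

# Forward the service port locally for a quick test; replace <sglang-service> with the actual name.
kubectl port-forward svc/<sglang-service> 30000:30000
```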
</details>
## Method 5: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to run SGLang as a long-lived service.
> If you are deploying on Kubernetes, the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml) from Method 4 is a better fit.
2024-10-26 04:32:36 -07:00
1. Copy [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Run `docker compose up -d` in your terminal, as shown in the example below.
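For example (the raw-file URL follows GitHub's usual pattern for the linked repo):

```bash
# Download the upstream compose file and start the service in the background.
curl -fsSLO https://raw.githubusercontent.com/sgl-project/sglang/main/docker/compose.yaml
docker compose -f compose.yaml up -d

# Follow the server logs until the model finishes loading.
docker compose -f compose.yaml logs -f
```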
</details>
## Method 6: Run on Kubernetes or Clouds with SkyPilot
<details>
<summary>More</summary>
To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>
```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>
```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
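Once the cluster is up, the endpoint can be queried directly (assuming the OpenAI-compatible API that SGLang serves):

```bash
# Capture the endpoint printed by sky status and list the served models.
ENDPOINT=$(sky status --endpoint 30000 sglang)
curl http://$ENDPOINT/v1/models
```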
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- To reinstall FlashInfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps`, and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is short for SGLang Runtime.
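For example, to launch with the fallback kernels mentioned in the first note above (the model path is illustrative):

```bash
# Use the Triton attention backend and PyTorch sampling backend instead of FlashInfer.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton \
  --sampling-backend pytorch
```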