# Install SGLang
You can install SGLang using one of the methods below.
This page primarily applies to common NVIDIA GPU platforms.
For other or newer platforms, please refer to the dedicated pages for [NVIDIA Blackwell GPUs](../platforms/blackwell_gpu.md), [AMD GPUs](../platforms/amd_gpu.md), [Intel Xeon CPUs](../platforms/cpu_server.md), [NVIDIA Jetson](../platforms/nvidia_jetson.md), [Ascend NPUs](../platforms/ascend_npu.md).
## Method 1: With pip or uv
It is recommended to use uv for faster installation:
```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang[all]>=0.5.2"
```
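To confirm the installation, a quick import check (just a sanity check; it does not need a GPU):

```bash
# Should print the installed version, e.g. 0.5.2.
python3 -c "import sglang; print(sglang.__version__)"
```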
**Quick fixes to common problems**
- If you encounter `OSError: CUDA_HOME environment variable is not set`, set it to your CUDA install root with either of the following solutions:
1. Use `export CUDA_HOME=/usr/local/cuda-<your-cuda-version>` to set the `CUDA_HOME` environment variable (see the snippet below).
2. Install FlashInfer first following [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html), then install SGLang as described above.
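For example, a minimal snippet for solution 1, assuming CUDA 12.4 is installed under `/usr/local/cuda-12.4` (adjust to your version):

```bash
# Point CUDA_HOME at the toolkit root; adding nvcc to PATH is optional but convenient.
export CUDA_HOME=/usr/local/cuda-12.4
export PATH="$CUDA_HOME/bin:$PATH"
nvcc --version  # verify the toolkit is found
```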
## Method 2: From source
```bash
# Use the last release branch
git clone -b v0.5.2 https://github.com/sgl-project/sglang.git
cd sglang
# Install the python packages
pip install --upgrade pip
pip install -e "python[all]"
```
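A quick way to verify the source install (printing the CLI help should not require a GPU):

```bash
# Confirm the editable install is picked up and the server entry point loads.
pip show sglang
python3 -m sglang.launch_server --help
```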
**Quick fixes to common problems**
- If you plan to develop SGLang itself, it is recommended to use Docker. See [setup docker container](../developer_guide/development_guide_using_docker.md#setup-docker-container); the docker image is `lmsysorg/sglang:dev`.
## Method 3: Using docker
The docker images are available on Docker Hub at [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).
```bash
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
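Once the container is up and the model has loaded, you can smoke-test the server; the request below assumes the OpenAI-compatible API that SGLang exposes:

```bash
# Basic health probe.
curl http://localhost:30000/health

# OpenAI-compatible chat completion (model name matches --model-path above).
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "meta-llama/Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```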
## Method 4: Using Kubernetes
Please check out [OME](https://github.com/sgl-project/ome), a Kubernetes operator for enterprise-grade management and serving of large language models (LLMs).
<details>
<summary>More</summary>
1. For single-node serving (typically when the model fits into the GPUs on one node):

   Run `kubectl apply -f docker/k8s-sglang-service.yaml` to create the Kubernetes deployment and service, using llama-31-8b as an example.

2. For multi-node serving (usually when a large model such as `DeepSeek-R1` requires more than one GPU node):

   Modify the model path and arguments as necessary, then run `kubectl apply -f docker/k8s-sglang-distributed-sts.yaml` to create a two-node Kubernetes StatefulSet and serving service.
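After applying either manifest, a quick way to check the deployment (resource names depend on the YAML; the placeholder below is an assumption):

```bash
# Watch the pods come up, then find the service name.
kubectl get pods
kubectl get svc

# Forward the service port locally for a quick test; replace <sglang-service> with the actual name.
kubectl port-forward svc/<sglang-service> 30000:30000
```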
</details>
## Method 5: Using docker compose
<details>
<summary>More</summary>
> This method is recommended if you plan to run SGLang as a long-lived service.
> If you are deploying on Kubernetes, the [k8s-sglang-service.yaml](https://github.com/sgl-project/sglang/blob/main/docker/k8s-sglang-service.yaml) from Method 4 is a better fit.
2024-10-26 04:32:36 -07:00
1. Copy [compose.yaml](https://github.com/sgl-project/sglang/blob/main/docker/compose.yaml) to your local machine.
2. Run `docker compose up -d` in your terminal, as shown in the example below.
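For example (the raw-file URL follows GitHub's usual pattern for the linked repo):

```bash
# Download the upstream compose file and start the service in the background.
curl -fsSLO https://raw.githubusercontent.com/sgl-project/sglang/main/docker/compose.yaml
docker compose -f compose.yaml up -d

# Follow the server logs until the model finishes loading.
docker compose -f compose.yaml logs -f
```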
</details>
## Method 6: Run on Kubernetes or Clouds with SkyPilot
<details>
<summary>More</summary>
To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).
1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>
```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>
```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml
# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
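Once the cluster is up, the endpoint can be queried directly (assuming the OpenAI-compatible API that SGLang serves):

```bash
# Capture the endpoint printed by sky status and list the served models.
ENDPOINT=$(sky status --endpoint 30000 sglang)
curl http://$ENDPOINT/v1/models
```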
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>
## Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` (see the example after this list) and open an issue on GitHub.
- To reinstall FlashInfer locally, use the following command: `pip3 install --upgrade flashinfer-python --force-reinstall --no-deps`, and then delete the cache with `rm -rf ~/.cache/flashinfer`.
- If you only need to use OpenAI API models with the frontend language, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
- The language frontend operates independently of the backend runtime. You can install the frontend locally without needing a GPU, while the backend can be set up on a GPU-enabled machine. To install the frontend, run `pip install sglang`, and for the backend, use `pip install "sglang[srt]"`. `srt` is short for SGLang Runtime.
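For example, to launch with the fallback kernels mentioned in the first note above (the model path is illustrative):

```bash
# Use the Triton attention backend and PyTorch sampling backend instead of FlashInfer.
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --attention-backend triton \
  --sampling-backend pytorch
```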