# Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) for guidance on tuning these parameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
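  For example, to serve the same model with torch.compile enabled:
  ```
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile
  ```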
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other [quantization strategies (INT8/FP8)](https://github.com/sgl-project/sglang/blob/v0.3.6/python/sglang/srt/server_args.py#L671) as well.
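  For example, to serve the same model with INT4 weight-only quantization (group size 128):
  ```
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --torchao-config int4wo-128
  ```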
- To enable FP8 weight quantization, add `--quantization fp8` when serving an FP16 checkpoint, or directly load an FP8 checkpoint without specifying any extra arguments.
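  For example, to quantize the weights of an FP16 checkpoint to FP8 at load time:
  ```
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8
  ```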
- To enable FP8 KV cache quantization, add `--kv-cache-dtype fp8_e5m2`.
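  For example, to store the KV cache in the `fp8_e5m2` format and roughly halve its memory footprint:
  ```
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --kv-cache-dtype fp8_e5m2
  ```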
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
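  For example, a custom template can be passed via `--chat-template` (the JSON file path here is illustrative; see the linked page for the expected template format):
  ```
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chat-template ./my_chat_template.json
  ```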
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. For example, with two nodes of two GPUs each and TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` an available port; then use the following commands. If you hit a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```