## Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
  ```
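  If you hit the peer-access error above, the check flag rides along with the same launch command:
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2 --enable-p2p-check
  ```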
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
  ```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
  ```
- See the hyperparameter tuning guide for advice on tuning these parameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
  ```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
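  A minimal invocation, assuming the same model as in the examples above:
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --enable-torch-compile
  ```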
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports other quantization strategies (INT8/FP8) as well.
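  For instance, with the int4 weight-only setting named above:
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --torchao-config int4wo-128
  ```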
- To enable FP8 weight quantization, add `--quantization fp8` on an FP16 checkpoint, or directly load an FP8 checkpoint without specifying any arguments.
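  A sketch of the first route, quantizing an FP16 checkpoint at load time:
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8
  ```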
- To enable FP8 KV cache quantization, add `--kv-cache-dtype fp8_e5m2`.
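  For example, on the same model:
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --kv-cache-dtype fp8_e5m2
  ```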
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a custom chat template (see the sketch below).
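  A sketch, assuming the template lives in a local JSON file passed via `--chat-template` (the file name `./my_model_template.json` is illustrative):
  ```bash
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chat-template ./my_model_template.json
  ```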
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; then you can use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
  ```bash
  # Node 0
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

  # Node 1
  python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --dist-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
  ```
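  Once both ranks are up, a quick sanity check can be sent to node 0. This assumes the server's default port `30000` and the native `/generate` endpoint; adjust the host and port if you passed `--port`:
  ```bash
  curl http://sgl-dev-0:30000/generate \
    -H "Content-Type: application/json" \
    -d '{"text": "Once upon a time,", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
  ```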