# Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.
## Quick Start
Launch a server:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request:
```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```
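The same request can also be sent from Python. Below is a minimal sketch using only the standard library, assuming the server launched above is listening on port 30000:

```python
import json
from urllib import request

# The same payload as the curl example above.
payload = {
    "text": "Once upon a time,",
    "sampling_params": {
        "max_new_tokens": 16,
        "temperature": 0,
    },
}

def generate(base_url="http://localhost:30000"):
    # POST the payload to the /generate endpoint and return the decoded JSON.
    req = request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# With a running server, generate() returns a dict containing the generated "text".
```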
Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).
## OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```
It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
## Additional Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
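As a rough illustration of what this flag controls, consider a hypothetical GPU and model (the numbers below are made up, not measured):

```python
# Hypothetical example: an 80 GB GPU serving a model whose weights take 16 GB.
gpu_memory_gb = 80
weights_gb = 16

def kv_cache_budget_gb(mem_fraction_static):
    # The static allocation (weights + KV cache pool) is capped at this
    # fraction of GPU memory; whatever the weights do not use is left for
    # the KV cache. Lowering the fraction shrinks the KV cache pool and
    # leaves more headroom for activations and other runtime buffers.
    return gpu_memory_gb * mem_fraction_static - weights_gb

print(kv_cache_budget_gb(0.9))  # default: larger KV cache pool
print(kv_cache_budget_gb(0.7))  # smaller pool, more free memory
```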
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) for guidance on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
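Conceptually, chunked prefill splits a long prompt into fixed-size pieces that are prefilled one step at a time instead of all at once, bounding the peak memory of any single prefill step. A simplified sketch of the splitting step:

```python
def chunk_prompt(token_ids, chunked_prefill_size=4096):
    # Split a long list of prompt token ids into consecutive chunks of at
    # most chunked_prefill_size tokens, to be prefilled sequentially.
    return [
        token_ids[i : i + chunked_prefill_size]
        for i in range(0, len(token_ids), chunked_prefill_size)
    ]

chunks = chunk_prompt(list(range(10000)), chunked_prefill_size=4096)
print([len(c) for c in chunks])  # [4096, 4096, 1808]
```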
- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. This does not currently work for constrained decoding.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not currently work for FP8.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port; you can then use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 0
# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
## Engine Without HTTP Server
We also provide an inference engine **without an HTTP server**. For example,
```python
import sglang as sgl

def main():
    prompts = [
        "Hello, my name is",
        "The president of the United States is",
        "The capital of France is",
        "The future of AI is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")

    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print("===============================")
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()
```
This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
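For offline batch inference it is often convenient to write results to a JSONL file. The sketch below builds on the example above; the record layout is our own choice for illustration, not an SGLang convention:

```python
import json

def to_jsonl(prompts, outputs):
    # Pair each prompt with its generated text as one JSON object per line,
    # matching the shape of the dicts returned by llm.generate().
    lines = [
        json.dumps({"prompt": p, "generated_text": o["text"]})
        for p, o in zip(prompts, outputs)
    ]
    return "\n".join(lines)

# Example with stubbed outputs in the same shape as llm.generate() returns.
prompts = ["The capital of France is"]
outputs = [{"text": " Paris."}]
print(to_jsonl(prompts, outputs))
```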
## Use Models From ModelScope
<details>
<summary>More</summary>
To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch the [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) server:
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```
Or start it with Docker:
```bash
docker run --gpus all \
-p 30000:30000 \
-v ~/.cache/modelscope:/root/.cache/modelscope \
--env "SGLANG_USE_MODELSCOPE=true" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
</details>
## Example: Run Llama 3.1 405B
<details>
<summary>More</summary>
```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8
# Run 405B (fp16) on two nodes
## on the first node, replace `172.16.4.52:20000` with the first node's ip address and an available port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph
## on the second node, use the same `172.16.4.52:20000` (the first node's ip address and port)
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
```
</details>