Update Readme (#660)
Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
README.md (282 changed lines)
@@ -6,23 +6,29 @@

 | [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/) | [**Paper**](https://arxiv.org/abs/2312.07104) |

-SGLang is a structured generation language designed for large language models (LLMs).
+SGLang is a fast serving framework for large language models and vision language models.
-It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
+It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.

 The core features include:
+- **Fast Backend Runtime**: Efficient serving with RadixAttention for prefix caching, continuous batching, token attention (paged attention), tensor parallelism, flashinfer kernels, jump-forward constrained decoding, and quantization (AWQ/FP8/GPTQ/Marlin).
 - **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
-- **High-Performance Backend Runtime**: Features RadixAttention for accelerating complex LLM programs by reusing the KV cache across multiple calls. It can also serve as a standalone inference engine with all common techniques implemented (e.g., continuous batching and tensor parallelism).

 ## News
+- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-- [2024/01] 🔥 SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).

+<details>
+<summary>More</summary>
+
+- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
+
+</details>

 ## Contents
 - [Install](#install)
-- [Quick Start](#quick-start)
-- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
 - [Backend: SGLang Runtime (SRT)](#backend-sglang-runtime-srt)
+- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
 - [Benchmark And Performance](#benchmark-and-performance)
 - [Roadmap](#roadmap)
 - [Citation And Acknowledgment](#citation-and-acknowledgment)
@@ -70,13 +76,118 @@ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/

 - If you cannot install FlashInfer, check out its [installation](https://docs.flashinfer.ai/installation.html#) page. If you still cannot install it, you can use the slower Triton kernels by adding `--disable-flashinfer` when launching the server.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.

-## Quick Start
+## Backend: SGLang Runtime (SRT)
+The SGLang Runtime (SRT) is an efficient serving engine.
+
+### Launching a Server
+Launch a server
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
+```
+
+Send a request
+```
+curl http://localhost:30000/generate \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Once upon a time,",
+    "sampling_params": {
+      "max_new_tokens": 16,
+      "temperature": 0
+    }
+  }'
+```
+Learn more about the argument format [here](docs/sampling_params.md).
+
+### OpenAI Compatible API
+In addition, the server supports OpenAI-compatible APIs.
+
+```python
+import openai
+client = openai.Client(
+    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+# Text completion
+response = client.completions.create(
+    model="default",
+    prompt="The capital of France is",
+    temperature=0,
+    max_tokens=32,
+)
+print(response)
+
+# Chat completion
+response = client.chat.completions.create(
+    model="default",
+    messages=[
+        {"role": "system", "content": "You are a helpful AI assistant"},
+        {"role": "user", "content": "List 3 countries and their capitals."},
+    ],
+    temperature=0,
+    max_tokens=64,
+)
+print(response)
+```
+
+It supports streaming and most features of the Chat/Completions/Models endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
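The streamed responses follow the OpenAI convention: server-sent events, one `data: {...}` line per chunk, with a final `data: [DONE]` sentinel. A minimal client-side sketch for collecting such chunks (this helper is illustrative only and not part of SGLang):

```python
import json

def parse_sse_chunks(lines):
    """Collect the JSON payloads from 'data: {...}' server-sent-event lines,
    stopping at the 'data: [DONE]' sentinel used by OpenAI-compatible servers."""
    events = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        events.append(json.loads(payload))
    return events
```

In practice you would feed this the lines of the HTTP response body as they arrive.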

+### Additional Server Arguments
+- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add the `--enable-p2p-check` option.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --tp 2
+```
+- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --dp 2 --tp 2
+```
+- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
+```
+- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-1` be the hostname of the first node and `50000` be an available port.
+```
+# Node 0
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 0
+
+# Node 1
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 1
+```
+- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/custom_chat_template.md).
+
+### Supported Models
+
+- Llama / Llama 2 / Llama 3
+- Mistral / Mixtral
+- Gemma / Gemma 2
+- Qwen / Qwen 2 / Qwen 2 MoE
+- LLaVA 1.5 / 1.6
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000`
+- LLaVA-NeXT-Video
+  - see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh)
+- Yi-VL
+  - see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
+- StableLM
+- Command-R
+- DBRX
+- Grok
+- ChatGLM
+- InternLM 2
+
+Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
+
+## Frontend: Structured Generation Language (SGLang)
+The frontend language can be used with local models or API models.
+
+### Quick Start
 The example below shows how to use sglang to answer a multi-turn question.

-### Using Local Models
+#### Using Local Models
 First, launch a server with
 ```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
 ```

 Then, connect to the server and answer a multi-turn question.
@@ -105,7 +216,7 @@ for m in state.messages():
 print(state["answer_1"])
 ```

-### Using OpenAI Models
+#### Using OpenAI Models
 Set the OpenAI API Key
 ```
 export OPENAI_API_KEY=sk-******

@@ -136,13 +247,12 @@ for m in state.messages():
 print(state["answer_1"])
 ```

-### More Examples
+#### More Examples

 Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](examples/quick_start).

-## Frontend: Structured Generation Language (SGLang)
+### Language Feature

 To begin with, import sglang.
 ```python
 import sglang as sgl

@@ -155,7 +265,7 @@ The system will manage the state, chat template, parallelism and batching for yo

 The complete code for the examples below can be found at [readme_examples.py](examples/usage/readme_examples.py)

-### Control Flow
+#### Control Flow
 You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

 ```python

@@ -170,7 +280,7 @@ def tool_use(s, question):
 s += "The key word to search is" + sgl.gen("word")
 ```

-### Parallelism
+#### Parallelism
 Use `fork` to launch parallel prompts.
 Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.

@@ -192,7 +302,7 @@ def tip_suggestion(s):
 s += "In summary" + sgl.gen("summary")
 ```
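Conceptually, `fork` duplicates the prompt state and lets the pending generation calls run concurrently. A rough stand-in using plain Python threads (the `gen` callable below is a placeholder for a model call, not SGLang's API):

```python
from concurrent.futures import ThreadPoolExecutor

def fork_map(prompt, branches, gen):
    """Duplicate the prompt state `branches` times and run the generation
    calls concurrently, returning the results in branch order."""
    with ThreadPoolExecutor(max_workers=branches) as pool:
        futures = [pool.submit(gen, prompt, i) for i in range(branches)]
        return [f.result() for f in futures]

# Placeholder "model" that just tags the branch index.
tips = fork_map("Here are two tips.", 2, lambda p, i: f"{p} Tip {i + 1}.")
```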
-### Multi Modality
+#### Multi Modality
 Use `sgl.image` to pass an image as input.

 ```python

@@ -204,7 +314,7 @@ def image_qa(s, image_file, question):

 See also [srt_example_llava.py](examples/quick_start/srt_example_llava.py).

-### Constrained Decoding
+#### Constrained Decoding
 Use `regex` to specify a regular expression as a decoding constraint.
 This is only supported for local models.

@@ -219,7 +329,7 @@ def regular_expression_gen(s):
 )
 ```
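Under the hood, regex-constrained decoding masks out, at each step, any token that would make the output an impossible prefix of the pattern. A toy sketch of one such step (a naive prefix predicate stands in for the regex-derived state machine the runtime actually uses):

```python
def constrained_step(logits, vocab, prefix, is_valid_prefix):
    """Mask tokens that would break the constraint, then take the argmax."""
    masked = [
        lp if is_valid_prefix(prefix + tok) else float("-inf")
        for lp, tok in zip(logits, vocab)
    ]
    best = max(range(len(vocab)), key=lambda i: masked[i])
    return vocab[best]

# Toy constraint: the output must be at most three digits.
def valid_three_digits(s):
    return len(s) <= 3 and s.isdigit()

vocab = ["1", "a", "42", "cat"]
logits = [0.1, 2.0, 0.5, 3.0]  # "cat" scores highest but violates the constraint
```

Because only masking is applied, this works the same whether the unmasked logits are sampled greedily or with temperature.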
-### JSON Decoding
+#### JSON Decoding
 Use `regex` to specify a JSON schema with a regular expression.

 ```python

@@ -248,8 +358,7 @@ def character_gen(s, name):

 See also [json_decode.py](examples/usage/json_decode.py) for an additional example on specifying formats with Pydantic models.

-### Batching
+#### Batching
 Use `run_batch` to run a batch of requests with continuous batching.

 ```python

@@ -268,7 +377,7 @@ states = text_qa.run_batch(
 )
 ```

-### Streaming
+#### Streaming
 Add `stream=True` to enable streaming.

 ```python

@@ -287,139 +396,10 @@ for out in state.text_iter():
 print(out, end="", flush=True)
 ```

-### Tips and Implementation Details
+#### Tips and Implementation Details
 - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
 - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
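The length-normalized selection in the first bullet can be sketched in a few lines (the per-token logprobs below are made-up illustrative inputs, not a real SGLang API):

```python
def select_choice(choice_token_logprobs):
    """Pick the choice whose mean per-token log probability is highest."""
    def normalized(logprobs):
        return sum(logprobs) / len(logprobs)
    return max(choice_token_logprobs, key=lambda c: normalized(choice_token_logprobs[c]))

scores = {
    "Paris": [-0.2],                                   # 1 token, mean -0.2
    "The capital of France": [-0.1, -4.0, -0.3, -2.2], # 4 tokens, mean -1.65
}
```

Without the division by token count, longer choices would almost always lose, since every extra token contributes another negative logprob.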
-## Backend: SGLang Runtime (SRT)
-The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
-However, it can also be used as a standalone API server.
-In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.
-
-### Usage
-Launch a server
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
-```
-
-Send a request
-```
-curl http://localhost:30000/generate \
-  -H "Content-Type: application/json" \
-  -d '{
-    "text": "Once upon a time,",
-    "sampling_params": {
-      "max_new_tokens": 16,
-      "temperature": 0
-    }
-  }'
-```
-Learn more about the argument format [here](docs/sampling_params.md).
-
-### OpenAI Compatible API
-In addition, the server supports an experimental OpenAI-compatible API.
-
-```python
-import openai
-client = openai.Client(
-    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
-
-# Text completion
-response = client.completions.create(
-    model="default",
-    prompt="The capital of France is",
-    temperature=0,
-    max_tokens=32,
-)
-print(response)
-
-# Chat completion
-response = client.chat.completions.create(
-    model="default",
-    messages=[
-        {"role": "system", "content": "You are a helpful AI assistant"},
-        {"role": "user", "content": "List 3 countries and their capitals."},
-    ],
-    temperature=0,
-    max_tokens=64,
-)
-print(response)
-```
-
-By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
-
-If needed, you can also override the chat template when launching the server:
-
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
-```
-
-If the chat template you are looking for is missing, you are welcome to contribute it.
-Meanwhile, you can also temporarily register your chat template as follows:
-
-```json
-{
-  "name": "my_model",
-  "system": "<|im_start|>system",
-  "user": "<|im_start|>user",
-  "assistant": "<|im_start|>assistant",
-  "sep_style": "CHATML",
-  "sep": "<|im_end|>",
-  "stop_str": ["<|im_end|>", "<|im_start|>"]
-}
-```
-
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
-```
-
-### Additional Arguments
-- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add `--enable-p2p-check` option.
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
-```
-- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory.
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp 2 --tp 2
-```
-- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
-```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
-```
-- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
-- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-1` be the hostname of the first node and `50000` be an available port.
-```
-# Node 0
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 0
-
-# Node 1
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 1
-```
-
-### Supported Models
-- Llama
-- Mistral
-- Mixtral
-- Qwen / Qwen 2 / Qwen 2 MoE
-- Gemma / Gemma 2
-  - `python -m sglang.launch_server --model-path google/gemma-7b-it --port 30000 --attention-reduce-in-fp32`
-- LLaVA
-  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
-  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
-  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000`
-- LLaVA-NeXT-Video
-  - see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh)
-- Yi-VL
-  - see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
-- StableLM
-- Command-R
-- DBRX
-- Grok
-- ChatGLM
-- AWQ/GPTQ/Marlin quantization
-
-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
 ## Benchmark And Performance
 - Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1

@@ -1,4 +1,4 @@
-## Benchmark Results
+# Benchmark Results

 We tested our system on the following common LLM workloads and reported the achieved throughput:
 - **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark.
docs/custom_chat_template.md (new file, 28 lines)
@@ -0,0 +1,28 @@
+# Custom Chat Template in SGLang Runtime
+
+By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3.
+
+If needed, you can also override the chat template when launching the server:
+
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
+```
+
+If the chat template you are looking for is missing, you are welcome to contribute it.
+Meanwhile, you can also temporarily register your chat template as follows:
+
+```json
+{
+  "name": "my_model",
+  "system": "<|im_start|>system",
+  "user": "<|im_start|>user",
+  "assistant": "<|im_start|>assistant",
+  "sep_style": "CHATML",
+  "sep": "<|im_end|>",
+  "stop_str": ["<|im_end|>", "<|im_start|>"]
+}
+```
+
+```
+python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
+```
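To see roughly what those template fields produce, here is a sketch of ChatML-style rendering with the same separators (the exact whitespace SGLang emits may differ; this is an illustration, not its implementation):

```python
template = {
    "system": "<|im_start|>system",
    "user": "<|im_start|>user",
    "assistant": "<|im_start|>assistant",
    "sep": "<|im_end|>",
}

def render(messages, template):
    """Render a message list using the role prefixes and separator above."""
    parts = []
    for m in messages:
        parts.append(f"{template[m['role']]}\n{m['content']}{template['sep']}\n")
    return "".join(parts)
```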
@@ -1,4 +1,4 @@
-## How to Support a New Model
+# How to Support a New Model

 To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create new files for the new models. Most models are based on the transformer architecture, making them very similar.

@@ -1,4 +1,4 @@
-## Sampling Parameters of SGLang Runtime
+# Sampling Parameters in SGLang Runtime
 This doc describes the sampling parameters of the SGLang Runtime.

 The `/generate` endpoint accepts the following arguments in the JSON format.

@@ -6,11 +6,11 @@ The `/generate` endpoint accepts the following arguments in the JSON format.
 ```python
 @dataclass
 class GenerateReqInput:
-    # The input prompt
+    # The input prompt. It can be a single prompt or a batch of prompts.
     text: Union[List[str], str]
     # The token ids for text; one can either specify text or input_ids
     input_ids: Optional[Union[List[List[int]], List[int]]] = None
-    # The image input
+    # The image input. It can be a file name.
    image_data: Optional[Union[List[str], str]] = None
     # The sampling_params
     sampling_params: Union[List[Dict], Dict] = None
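Per the field types above, `text` and `sampling_params` also accept lists, so a batched request body can be built like this (an illustrative payload consistent with the dataclass, not an official example):

```python
import json

# One /generate request carrying a batch of two prompts, with per-prompt
# sampling parameters given as a parallel list.
payload = {
    "text": ["The capital of France is", "The capital of Japan is"],
    "sampling_params": [
        {"max_new_tokens": 8, "temperature": 0},
        {"max_new_tokens": 8, "temperature": 0},
    ],
}
body = json.dumps(payload)
```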
|
|||||||
@@ -1,4 +1,4 @@
|
|||||||
## SRT Unit Tests
|
# SRT Unit Tests
|
||||||
|
|
||||||
### Latency Alignment
|
### Latency Alignment
|
||||||
Make sure your changes do not slow down the following benchmarks
|
Make sure your changes do not slow down the following benchmarks
|
||||||
|
|||||||
@@ -1,8 +1,7 @@
|
|||||||
# Code Structure
|
# Code Structure
|
||||||
|
|
||||||
- `backend`: Various backends for the language interpreter.
|
|
||||||
- `lang`: The frontend language.
|
- `lang`: The frontend language.
|
||||||
- `srt`: The serving engine for running local models. (SRT = SGLang Runtime).
|
- `srt`: The backend engine for running local models. (SRT = SGLang Runtime).
|
||||||
- `test`: Test utilities.
|
- `test`: Test utilities.
|
||||||
- `api.py`: Public API.
|
- `api.py`: Public API.
|
||||||
- `bench_latency.py`: Benchmark utilities.
|
- `bench_latency.py`: Benchmark utilities.
|
||||||
|
|||||||
@@ -22,16 +22,16 @@ from sglang.api import (
|
|||||||
video,
|
video,
|
||||||
)
|
)
|
||||||
|
|
||||||
# SGL Backends
|
|
||||||
from sglang.backend.anthropic import Anthropic
|
|
||||||
from sglang.backend.litellm import LiteLLM
|
|
||||||
from sglang.backend.openai import OpenAI
|
|
||||||
from sglang.backend.runtime_endpoint import RuntimeEndpoint
|
|
||||||
from sglang.backend.vertexai import VertexAI
|
|
||||||
|
|
||||||
# Global Configurations
|
# Global Configurations
|
||||||
from sglang.global_config import global_config
|
from sglang.global_config import global_config
|
||||||
|
|
||||||
|
# SGL Backends
|
||||||
|
from sglang.lang.backend.anthropic import Anthropic
|
||||||
|
from sglang.lang.backend.litellm import LiteLLM
|
||||||
|
from sglang.lang.backend.openai import OpenAI
|
||||||
|
from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
|
||||||
|
from sglang.lang.backend.vertexai import VertexAI
|
||||||
|
|
||||||
# public APIs management
|
# public APIs management
|
||||||
__all__ = [
|
__all__ = [
|
||||||
"global_config",
|
"global_config",
|
||||||
|
|||||||
@@ -4,8 +4,8 @@ import os
|
|||||||
import re
|
import re
|
||||||
from typing import Callable, List, Optional, Union
|
from typing import Callable, List, Optional, Union
|
||||||
|
|
||||||
from sglang.backend.base_backend import BaseBackend
|
|
||||||
from sglang.global_config import global_config
|
from sglang.global_config import global_config
|
||||||
|
from sglang.lang.backend.base_backend import BaseBackend
|
||||||
from sglang.lang.ir import (
|
from sglang.lang.ir import (
|
||||||
SglExpr,
|
SglExpr,
|
||||||
SglExprList,
|
SglExprList,
|
||||||
|
|||||||
@@ -2,7 +2,7 @@ from typing import List, Optional, Union
|
|||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
from sglang.backend.base_backend import BaseBackend
|
from sglang.lang.backend.base_backend import BaseBackend
|
||||||
from sglang.lang.chat_template import get_chat_template
|
from sglang.lang.chat_template import get_chat_template
|
||||||
from sglang.lang.interpreter import StreamExecutor
|
from sglang.lang.interpreter import StreamExecutor
|
||||||
from sglang.lang.ir import SglSamplingParams
|
from sglang.lang.ir import SglSamplingParams
|
||||||
@@ -1,6 +1,6 @@
|
|||||||
from typing import Mapping, Optional
|
from typing import Mapping, Optional
|
||||||
|
|
||||||
from sglang.backend.base_backend import BaseBackend
|
from sglang.lang.backend.base_backend import BaseBackend
|
||||||
from sglang.lang.chat_template import get_chat_template_by_model_path
|
from sglang.lang.chat_template import get_chat_template_by_model_path
|
||||||
from sglang.lang.interpreter import StreamExecutor
|
from sglang.lang.interpreter import StreamExecutor
|
||||||
from sglang.lang.ir import SglSamplingParams
|
from sglang.lang.ir import SglSamplingParams
|
||||||
@@ -6,7 +6,7 @@ from typing import Callable, List, Optional, Union
|
|||||||
|
|
||||||
import numpy as np
|
import numpy as np
|
||||||
|
|
||||||
from sglang.backend.base_backend import BaseBackend
|
from sglang.lang.backend.base_backend import BaseBackend
|
||||||
from sglang.lang.chat_template import ChatTemplate, get_chat_template_by_model_path
|
from sglang.lang.chat_template import ChatTemplate, get_chat_template_by_model_path
|
||||||
from sglang.lang.interpreter import StreamExecutor
|
from sglang.lang.interpreter import StreamExecutor
|
||||||
from sglang.lang.ir import SglSamplingParams
|
from sglang.lang.ir import SglSamplingParams
|
||||||
@@ -3,8 +3,8 @@ from typing import List, Optional
 
 import numpy as np
 
-from sglang.backend.base_backend import BaseBackend
 from sglang.global_config import global_config
+from sglang.lang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template_by_model_path
 from sglang.lang.interpreter import StreamExecutor
 from sglang.lang.ir import SglSamplingParams
@@ -2,7 +2,7 @@ import os
 import warnings
 from typing import Optional
 
-from sglang.backend.base_backend import BaseBackend
+from sglang.lang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template
 from sglang.lang.interpreter import StreamExecutor
 from sglang.lang.ir import SglSamplingParams
@@ -3,8 +3,8 @@
 import uuid
 from typing import Any, Callable, Dict, List, Optional, Union
 
-from sglang.backend.base_backend import BaseBackend
 from sglang.global_config import global_config
+from sglang.lang.backend.base_backend import BaseBackend
 from sglang.lang.interpreter import ProgramState, ProgramStateGroup
 from sglang.lang.ir import (
     SglArgument,
@@ -26,7 +26,7 @@ import uvloop
 from fastapi import FastAPI, Request
 from fastapi.responses import JSONResponse, Response, StreamingResponse
 
-from sglang.backend.runtime_endpoint import RuntimeEndpoint
+from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
 from sglang.srt.constrained import disable_cache
 from sglang.srt.hf_transformers_utils import get_tokenizer
 from sglang.srt.managers.controller.manager_multi import (
@@ -166,6 +166,15 @@ class ServerArgs:
             "--quantization",
             type=str,
             default=ServerArgs.quantization,
+            choices=[
+                "awq",
+                "fp8",
+                "gptq",
+                "marlin",
+                "gptq_marlin",
+                "squeezellm",
+                "bitsandbytes",
+            ],
             help="The quantization method.",
         )
         parser.add_argument(
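The addition above constrains --quantization with argparse's `choices`, so an unsupported method is rejected at parse time instead of failing later during model loading. A standalone sketch of the same pattern (the parser here is a stand-in, not sglang's actual ServerArgs):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--quantization",
    type=str,
    default=None,
    # Mirrors the choices list added in the diff above.
    choices=["awq", "fp8", "gptq", "marlin", "gptq_marlin", "squeezellm", "bitsandbytes"],
    help="The quantization method.",
)

args = parser.parse_args(["--quantization", "awq"])
print(args.quantization)  # awq

# An unlisted value makes argparse exit with "invalid choice" before any
# model code runs:
try:
    parser.parse_args(["--quantization", "int4"])
except SystemExit:
    print("rejected")  # rejected
```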
@@ -243,13 +252,13 @@ class ServerArgs:
         parser.add_argument(
             "--show-time-cost",
             action="store_true",
-            help="Show time cost of custom marks",
+            help="Show time cost of custom marks.",
         )
         parser.add_argument(
             "--api-key",
             type=str,
             default=ServerArgs.api_key,
-            help="Set API key of the server",
+            help="Set API key of the server.",
         )
 
         # Data parallelism
@@ -285,17 +294,17 @@ class ServerArgs:
         parser.add_argument(
             "--disable-flashinfer",
             action="store_true",
-            help="Disable flashinfer inference kernels",
+            help="Disable flashinfer inference kernels.",
         )
         parser.add_argument(
             "--disable-radix-cache",
             action="store_true",
-            help="Disable RadixAttention",
+            help="Disable RadixAttention for prefix caching.",
         )
         parser.add_argument(
             "--disable-regex-jump-forward",
             action="store_true",
-            help="Disable regex jump-forward",
+            help="Disable regex jump-forward.",
         )
         parser.add_argument(
             "--disable-cuda-graph",
@@ -306,7 +306,7 @@ def test_image_qa():
     assert (
         "taxi" in state.messages()[-1]["content"]
         or "car" in state.messages()[-1]["content"]
-    )
+    ), f"{state.messages()[-1]['content']}"
 
 
 def test_stream():
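The change above attaches the actual model reply as the assertion message, so a failing run shows what the model said instead of a bare AssertionError. The pattern in isolation (`reply` is a made-up stand-in for `state.messages()[-1]["content"]`):

```python
# Stand-in for the content returned by state.messages()[-1]["content"].
reply = "A man rides on the back of a yellow taxi."

# Parenthesized multi-line condition plus an f-string message: on failure,
# both pytest and a plain `assert` surface the offending text.
assert (
    "taxi" in reply
    or "car" in reply
), f"{reply}"
print("ok")  # ok
```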
@@ -6,9 +6,9 @@ from functools import partial
 import numpy as np
 import requests
 
-from sglang.backend.openai import OpenAI
-from sglang.backend.runtime_endpoint import RuntimeEndpoint
 from sglang.global_config import global_config
+from sglang.lang.backend.openai import OpenAI
+from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
 from sglang.utils import get_exception_traceback
 
 
@@ -1,7 +1,7 @@
 import unittest
 
 import sglang as sgl
-from sglang.backend.runtime_endpoint import RuntimeEndpoint
+from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint
 
 
 class TestBind(unittest.TestCase):
@@ -1,7 +1,7 @@
 import unittest
 
 import sglang as sgl
-from sglang.backend.base_backend import BaseBackend
+from sglang.lang.backend.base_backend import BaseBackend
 from sglang.lang.chat_template import get_chat_template
 
 
@@ -10,7 +10,6 @@ The image features a man standing on the back of a yellow taxi cab, holding
 import argparse
 import asyncio
 import json
-import time
 
 import aiohttp
 import requests