<div align="center">
<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>

[![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
![PyPI - Downloads](https://img.shields.io/pypi/dm/sglang)
[![license](https://img.shields.io/github/license/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang/tree/main/LICENSE)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
[![open issues](https://img.shields.io/github/issues-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)

</div>

--------------------------------------------------------------------------------

| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Weekly Development Meeting**](https://t.co/4BFjCLnVHq) |

SGLang is a fast serving framework for large language models and vision language models.
It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
The core features include:

- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (INT4/FP8/AWQ/GPTQ).
- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
- **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.

## News
- [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).

<details>
<summary>More</summary>

- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).

</details>

## Contents
- [Install](#install)
- [Backend: SGLang Runtime (SRT)](#backend-sglang-runtime-srt)
- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
- [Benchmark And Performance](#benchmark-and-performance)
- [Roadmap](#roadmap)
- [Citation And Acknowledgment](#citation-and-acknowledgment)

## Install

You can install SGLang using any of the methods below.

### Method 1: With pip
```
pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

### Method 2: From source
```
# Use the last release branch
git clone -b v0.3.2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your huggingface hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```

### Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](docker/k8s-sglang-service.yaml).

1. Copy the [compose.yaml](docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.
</details>

### Method 5: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```
</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).
</details>


### Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
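
As a rough illustration of the sm75 requirement, the sketch below checks a GPU's compute capability and suggests launch flags. The helper names are hypothetical, and applying the Triton/PyTorch fallback flags to pre-sm75 GPUs is an assumption rather than documented behavior. With PyTorch installed, the `(major, minor)` tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
# Minimum compute capability stated in the note above: sm75.
MIN_FLASHINFER_CAPABILITY = (7, 5)


def supports_flashinfer(capability):
    """Return True if a (major, minor) compute capability is sm75 or newer."""
    return tuple(capability) >= MIN_FLASHINFER_CAPABILITY


def suggested_backend_flags(capability):
    """Hypothetical helper: extra launch flags to try for a given GPU."""
    if supports_flashinfer(capability):
        return []  # keep the default FlashInfer backend
    # Assumption: fall back to the alternative kernels named in the note above.
    return ["--attention-backend", "triton", "--sampling-backend", "pytorch"]


# Example capabilities: T4 is sm75, A100 is sm80, V100 is sm70.
for name, cap in [("T4", (7, 5)), ("A100", (8, 0)), ("V100", (7, 0))]:
    print(name, supports_flashinfer(cap), " ".join(suggested_backend_flags(cap)))
```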

## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.

### Quick Start
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```

Learn more about the argument specification, streaming, and multi-modal support [here](docs/en/sampling_params.md).
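
The same request can be sent from Python. The following is a minimal sketch using only the standard library; the helper names `build_generate_request` and `send` are hypothetical, and the payload mirrors the curl example above. Building the request needs no server, while `send` assumes one is running on port 30000.

```python
import json
import urllib.request


def build_generate_request(prompt, max_new_tokens=16, temperature=0,
                           base_url="http://localhost:30000"):
    """Build (but do not send) a POST request for the /generate endpoint."""
    payload = {
        "text": prompt,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and decode the JSON response body (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_generate_request("Once upon a time,")
print(request.full_url)
print(request.data.decode("utf-8"))
# With a server running: print(send(request))
```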

### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```

It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
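
For streaming, the `openai` client handles the wire format for you when you pass `stream=True`. If you read the HTTP response yourself, the sketch below shows how the text deltas could be extracted, assuming the server follows the OpenAI chat streaming format (server-sent `data:` lines ending with `data: [DONE]`). The function name and the sample lines are illustrative, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield text deltas from OpenAI-style server-sent-event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines for illustration only.
sample_lines = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Paris"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " is the capital."}}]}',
    "",
    "data: [DONE]",
]

print("".join(iter_stream_text(sample_lines)))
```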

### Additional Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. For example, if you have two nodes with two GPUs each and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, then use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
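
The parallelism arguments above combine in a simple way: `--dp` replicas of a `--tp`-way sharded model use `dp * tp` GPUs, and a multi-node launch repeats the same command with a different `--node-rank` on each node. A small sketch, with hypothetical helper names, that generates the per-node commands; it uses the `--nccl-init-addr` spelling that appears in the Llama 3.1 405B commands in this README:

```python
def total_gpus(tp=1, dp=1):
    """GPUs used when tensor parallelism is combined with data parallelism."""
    return tp * dp


def multi_node_commands(model_path, tp, nnodes, init_addr):
    """Hypothetical helper: one launch command per node for multi-node TP."""
    base = (
        f"python -m sglang.launch_server --model-path {model_path} "
        f"--tp {tp} --nccl-init-addr {init_addr} --nnodes {nnodes}"
    )
    return [f"{base} --node-rank {rank}" for rank in range(nnodes)]


print(total_gpus(tp=2, dp=2))  # 4, matching the `--dp 2 --tp 2` example
for command in multi_node_commands(
    "meta-llama/Meta-Llama-3-8B-Instruct", tp=4, nnodes=2, init_addr="sgl-dev-0:50000"
):
    print(command)
```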
 
### Supported Models

**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE
- DeepSeek / DeepSeek 2
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM

**Embedding Models**

- e5-mistral
- gte-Qwen2
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`

Instructions for supporting a new model are [here](docs/en/model_support.md).

#### Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) Server
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

Or start it by docker.
```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/modelscope:/root/.cache/modelscope \
    --env "SGLANG_USE_MODELSCOPE=true" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```
  
</details>

#### Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph

## on the second node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
```

</details>

### Benchmark Performance

- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
  ```
  python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
  ```
- Benchmark online serving. Launch a server first and run the following command.
  ```
  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
  ```

## Frontend: Structured Generation Language (SGLang)
The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.

### Quick Start
The example below shows how to use sglang to answer a multi-turn question.

#### Using Local Models
First, launch a server with
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Then, connect to the server and answer a multi-turn question.

```python
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
```

#### Using OpenAI Models
Set the OpenAI API Key
```
export OPENAI_API_KEY=sk-******
```

Then, answer a multi-turn question.
```python
from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
```

#### More Examples

Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).

### Language Feature
To begin with, import sglang.
```python
import sglang as sgl
```

`sglang` provides some simple primitives such as `gen`, `select`, `fork`, `image`.
You can implement your prompt flow in a function decorated by `sgl.function`.
You can then invoke the function with `run` or `run_batch`.
The system will manage the state, chat template, parallelism and batching for you.

The complete code for the examples below can be found at [readme_examples.py](examples/frontend_language/usage/readme_examples.py)

#### Control Flow
You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

```python
@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")
```

#### Parallelism
Use `fork` to launch parallel prompts.
Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.

```python
@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")
```

#### Multi-Modality
Use `sgl.image` to pass an image as input.

```python
@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
```

See also [local_example_llava_next.py](examples/frontend_language/quick_start/local_example_llava_next.py).

#### Constrained Decoding
Use `regex` to specify a regular expression as a decoding constraint.
This is only supported for local models.

```python
@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
```
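
To see which strings such a constraint admits, you can test the pattern offline with Python's `re` module. This is only a sketch of the constraint itself, not of how the runtime applies it; the dots are written as `\.` here so that they match a literal period, and `is_valid_answer` is a hypothetical helper.

```python
import re

# IPv4 pattern with the dots written as `\.` so they match a literal period.
IPV4_REGEX = r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"


def is_valid_answer(text):
    """Return True if the whole string satisfies the decoding constraint."""
    return re.fullmatch(IPV4_REGEX, text) is not None


for candidate in ["8.8.8.8", "8.8.4.4", "256.1.1.1", "8.8.8"]:
    print(candidate, is_valid_answer(candidate))
```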

#### JSON Decoding
Use `regex` to specify a JSON schema with a regular expression.

```python
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
```

See also [json_decode.py](examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
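
Because the regex spells out the exact layout, any text that matches it is also valid JSON. The sketch below checks this offline with a shortened, two-field version of the pattern; the sample output is hand-written and stands in for a model response.

```python
import json
import re

# A shortened, illustrative version of the pattern above (two fields only).
short_character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)"\n"""
    + r"""\}"""
)

# Hand-written sample text standing in for a model output.
sample_output = '{\n    "name": "Harry Potter",\n    "house": "Gryffindor"\n}'

# The text must match the constraint before we try to parse it.
matches = re.fullmatch(short_character_regex, sample_output) is not None
parsed = json.loads(sample_output) if matches else None
print(matches, parsed)
```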

#### Batching
Use `run_batch` to run a batch of requests with continuous batching.

```python
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True
)
```

#### Streaming
Add `stream=True` to enable streaming.

```python
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True
)

for out in state.text_iter():
    print(out, end="", flush=True)
```

#### Roles

Use `sgl.system`, `sgl.user` and `sgl.assistant` to set roles when using chat models. You can also define more complex role prompts using begin and end tokens.

```python
@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()
```

#### Tips and Implementation Details
- The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
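
The two ideas above can be sketched in plain Python. This is an illustration under simplifying assumptions, not the runtime's implementation: the log probabilities and logits are made-up numbers, and `select_choice` and `mask_logits` are hypothetical names.

```python
import math


def select_choice(choice_token_logprobs):
    """Pick the choice with the highest mean per-token log probability."""
    def normalized(logprobs):
        return sum(logprobs) / len(logprobs)

    return max(choice_token_logprobs, key=lambda c: normalized(choice_token_logprobs[c]))


def mask_logits(logits, allowed_token_ids):
    """Set the logits of tokens the constraint forbids to -inf."""
    return [
        value if token_id in allowed_token_ids else -math.inf
        for token_id, value in enumerate(logits)
    ]


# Made-up per-token log probabilities for two choices of different lengths.
scores = {
    "calculator": [-0.9, -0.3],           # sum -1.2, mean -0.6
    "search engine": [-0.5, -0.4, -0.6],  # sum -1.5, mean -0.5
}
print(select_choice(scores))

# Token 0 has the largest raw logit but is not allowed by the constraint.
masked = mask_logits([2.0, 1.0, 0.5], allowed_token_ids={1, 2})
print(masked.index(max(masked)))
```

Without length normalization, the shorter choice would win on its summed log probability; dividing by the token count removes that bias toward short choices.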

## Benchmark And Performance
![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)

Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).

## Roadmap
[Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)

## Citation And Acknowledgment
Please cite our paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
We also learned from the design and reused code from the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).
-												Add logo (#275)


											
										
										
											2024-03-10 18:51:47 -07:00
+								<div align="center">
-												fix: resolve the logo display issue on the PyPI page (#726)


											
										
										
											2024-07-25 20:47:46 +10:00
+								<img src="https://raw.githubusercontent.com/sgl-project/sglang/main/assets/logo.png" alt="logo" width="400"></img>
-												Add logo (#275)


											
										
										
											2024-03-10 18:51:47 -07:00
-												docs: update README (#788)


											
										
										
											2024-07-28 22:24:27 +10:00
+								[![PyPI](https://img.shields.io/pypi/v/sglang)](https://pypi.org/project/sglang)
 								![PyPI - Downloads](https://img.shields.io/pypi/dm/sglang)
 								[![license](https://img.shields.io/github/license/sgl-project/sglang.svg)](https://github.com/sgl-project/sglang/tree/main/LICENSE)
 								[![issue resolution](https://img.shields.io/github/issues-closed-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
 								[![open issues](https://img.shields.io/github/issues-raw/sgl-project/sglang)](https://github.com/sgl-project/sglang/issues)
-												docs: make badges center (#789)


											
										
										
											2024-07-28 22:27:52 +10:00
+								</div>
-												Add logo (#275)


											
										
										
											2024-03-10 18:51:47 -07:00
+								--------------------------------------------------------------------------------
-												[Event] Update meeting link (#1529)


											
										
										
											2024-09-27 13:30:04 -07:00
+								| [**Blog**](https://lmsys.org/blog/2024-07-25-sglang-llama3/) | [**Paper**](https://arxiv.org/abs/2312.07104) | [**Join Slack**](https://join.slack.com/t/sgl-fru7574/shared_invite/zt-2ngly9muu-t37XiH87qvD~6rVBTkTEHw) | [**Join Weekly Development Meeting**](https://t.co/4BFjCLnVHq) |
-												release initial code

Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: parasol-aser <3848358+parasol-aser@users.noreply.github.com>
Co-authored-by: LiviaSun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

											
										
										
											2024-01-08 04:37:50 +00:00
-												Update Readme (#660)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
											
										
										
											2024-07-19 09:54:01 -07:00
+								SGLang is a fast serving framework for large language models and vision language models.
 								It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
-												Improve logging & fix litellm dependency. (#512)

											
										
										
											2024-06-07 12:51:40 -07:00
+								The core features include:
-												[Docs] Improve documentations (#1368)


											
										
										
											2024-09-09 20:48:28 -07:00
 								- **Fast Backend Runtime**: Provides efficient serving with RadixAttention for prefix caching, jump-forward constrained decoding, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, chunked prefill, and quantization (INT4/FP8/AWQ/GPTQ).
 								- **Flexible Frontend Language**: Offers an intuitive interface for programming LLM applications, including chained generation calls, advanced prompting, control flow, multi-modal inputs, parallelism, and external interactions.
 								- **Extensive Model Support**: Supports a wide range of generative models (Llama 3, Gemma 2, Mistral, QWen, DeepSeek, LLaVA, etc.) and embedding models (e5-mistral), with easy extensibility for integrating new models.
-												Fix padding in the cuda graph (#1469)


											
										
										
											2024-09-19 01:52:15 -07:00
+								- **Active Community**: SGLang is open-source and backed by an active community with industry adoption.
-												release initial code

Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: parasol-aser <3848358+parasol-aser@users.noreply.github.com>
Co-authored-by: LiviaSun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

											
										
										
											2024-01-08 04:37:50 +00:00
-												Release 0.1.11 (#134)


											
										
										
											2024-02-03 02:50:13 -08:00
+								## News
-												[Doc] update news (#1327)


											
										
										
											2024-09-04 21:21:21 +10:00
+								- [2024/09] 🔥 SGLang v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
-												Update readme (#731)

											
										
										
											2024-07-25 09:13:37 -07:00
+								- [2024/07] 🔥 Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
 								- [2024/02] SGLang enables **3x faster JSON decoding** with compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-												Release 0.1.11 (#134)


											
										
										
											2024-02-03 02:50:13 -08:00
-												Update Readme (#660)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
											
										
										
											2024-07-19 09:54:01 -07:00
+								<details>
 								<summary>More</summary>
-												Cleanup readme, llava examples, usage examples and nccl init (#1194)


											
										
										
											2024-08-24 08:02:23 -07:00
+								- [2024/04] SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
-												Update readme (#731)

											
										
										
											2024-07-25 09:13:37 -07:00
+								- [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
-												Update Readme (#660)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
											
										
										
											2024-07-19 09:54:01 -07:00
+								- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
 								</details>
-												release initial code

Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: parasol-aser <3848358+parasol-aser@users.noreply.github.com>
Co-authored-by: LiviaSun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

											
										
										
											2024-01-08 04:37:50 +00:00
+								## Contents
 								- [Install](#install)
 								- [Backend: SGLang Runtime (SRT)](#backend-sglang-runtime-srt)
-												Update Readme (#660)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
											
										
										
											2024-07-19 09:54:01 -07:00
+								- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
-												release initial code

Co-authored-by: Ying Sheng <sqy1415@gmail.com>
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu>
Co-authored-by: parasol-aser <3848358+parasol-aser@users.noreply.github.com>
Co-authored-by: LiviaSun <33578456+ChuyueSun@users.noreply.github.com>
Co-authored-by: Cody Yu <hao.yu.cody@gmail.com>

											
										
										
											2024-01-08 04:37:50 +00:00
+								- [Benchmark And Performance](#benchmark-and-performance)
 								- [Roadmap](#roadmap)
 								- [Citation And Acknowledgment](#citation-and-acknowledgment)
## Install

You can install SGLang using any of the methods below.

### Method 1: With pip
```
pip install --upgrade pip
pip install "sglang[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

### Method 2: From source
```
# Use the last release branch
git clone -b v0.3.2 https://github.com/sgl-project/sglang.git
cd sglang

pip install --upgrade pip
pip install -e "python[all]"

# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```

### Method 3: Using docker
The docker images are available on Docker Hub as [lmsysorg/sglang](https://hub.docker.com/r/lmsysorg/sglang/tags), built from [Dockerfile](https://github.com/sgl-project/sglang/tree/main/docker).
Replace `<secret>` below with your Hugging Face Hub [token](https://huggingface.co/docs/hub/en/security-tokens).

```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
```
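
After the container starts, the server may need some time before port 30000 accepts connections. The following is a minimal sketch for waiting on the port using only the Python standard library. The helper name `wait_for_port` and the timeout values are illustrative assumptions, and an open port only shows that the server process is listening, not that every request will already succeed.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For the command above, `wait_for_port("127.0.0.1", 30000)` matches the `-p 30000:30000` port mapping.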

### Method 4: Using docker compose

<details>
<summary>More</summary>

> This method is recommended if you plan to serve it as a service.
> A better approach is to use the [k8s-sglang-service.yaml](docker/k8s-sglang-service.yaml).

1. Copy the [compose.yaml](docker/compose.yaml) to your local machine.
2. Execute the command `docker compose up -d` in your terminal.

</details>


### Method 5: Run on Kubernetes or Clouds with SkyPilot

<details>
<summary>More</summary>

To deploy on Kubernetes or 12+ clouds, you can use [SkyPilot](https://github.com/skypilot-org/skypilot).

1. Install SkyPilot and set up a Kubernetes cluster or cloud access: see [SkyPilot's documentation](https://skypilot.readthedocs.io/en/latest/getting-started/installation.html).
2. Deploy on your own infra with a single command and get the HTTP API endpoint:
<details>
<summary>SkyPilot YAML: <code>sglang.yaml</code></summary>

```yaml
# sglang.yaml
envs:
  HF_TOKEN: null

resources:
  image_id: docker:lmsysorg/sglang:latest
  accelerators: A100
  ports: 30000

run: |
  conda deactivate
  python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --host 0.0.0.0 \
    --port 30000
```

</details>

```bash
# Deploy on any cloud or Kubernetes cluster. Use --cloud <cloud> to select a specific cloud provider.
HF_TOKEN=<secret> sky launch -c sglang --env HF_TOKEN sglang.yaml

# Get the HTTP API endpoint
sky status --endpoint 30000 sglang
```
3. To further scale up your deployment with autoscaling and failure recovery, check out the [SkyServe + SGLang guide](https://github.com/skypilot-org/skypilot/tree/master/llm/sglang#serving-llama-2-with-sglang-for-more-traffic-using-skyserve).

</details>


### Common Notes
- [FlashInfer](https://github.com/flashinfer-ai/flashinfer) is the default attention kernel backend. It only supports sm75 and above. If you encounter any FlashInfer-related issues on sm75+ devices (e.g., T4, A10, A100, L4, L40S, H100), please switch to other kernels by adding `--attention-backend triton --sampling-backend pytorch` and open an issue on GitHub.
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
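
To check whether a GPU meets the sm75 requirement mentioned above, you can compare its CUDA compute capability against `(7, 5)`. The sketch below is illustrative: `meets_flashinfer_requirement` and `local_gpu_capability` are hypothetical helper names, and the threshold is taken only from the note above. Reading the capability relies on PyTorch's `torch.cuda.get_device_capability`, which assumes a CUDA-enabled PyTorch install.

```python
from typing import Tuple

# Minimum compute capability stated in the note above (sm75).
MIN_CAPABILITY: Tuple[int, int] = (7, 5)


def meets_flashinfer_requirement(capability: Tuple[int, int]) -> bool:
    """Return True if (major, minor) is sm75 or newer."""
    return tuple(capability) >= MIN_CAPABILITY


def local_gpu_capability(device_index: int = 0) -> Tuple[int, int]:
    """Query the compute capability of a local GPU through PyTorch."""
    import torch  # imported lazily so the pure check above works without PyTorch

    return torch.cuda.get_device_capability(device_index)
```

For example, `meets_flashinfer_requirement((7, 0))` is `False` (a V100 has compute capability 7.0), while `(7, 5)` for a T4 and `(8, 0)` for an A100 both pass.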

## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is an efficient serving engine.

### Quick Start
Launch a server
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request
```
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Once upon a time,",
    "sampling_params": {
      "max_new_tokens": 16,
      "temperature": 0
    }
  }'
```
Learn more about the argument specification, streaming, and multi-modal support [here](docs/en/sampling_params.md).
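
The same request can also be sent from Python. This is a minimal sketch using only the standard library. The helper names `build_generate_payload` and `generate` are illustrative, and it assumes the server from the launch command above is listening on `localhost:30000`. The response is returned as parsed JSON without assuming any particular field names.

```python
import json
import urllib.request


def build_generate_payload(text: str, max_new_tokens: int = 16, temperature: float = 0) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "text": text,
        "sampling_params": {
            "max_new_tokens": max_new_tokens,
            "temperature": temperature,
        },
    }


def generate(text: str, base_url: str = "http://localhost:30000", **sampling) -> dict:
    """POST a prompt to the /generate endpoint and return the parsed JSON response."""
    body = json.dumps(build_generate_payload(text, **sampling)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `generate("Once upon a time,", max_new_tokens=16)` mirrors the curl command.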

### OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.

```python
import openai
client = openai.Client(
    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Text completion
response = client.completions.create(
    model="default",
    prompt="The capital of France is",
    temperature=0,
    max_tokens=32,
)
print(response)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(response)

# Text embedding
response = client.embeddings.create(
    model="default",
    input="How are you today",
)
print(response)
```

It supports streaming, vision, and almost all features of the Chat/Completions/Models/Batch endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
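
Streaming goes through the standard `stream=True` option of the OpenAI Python client. The sketch below assumes the `openai` package (v1 client) is installed and that a server is running on port 30000. `collect_stream_text` and `stream_chat` are illustrative helper names, not part of SGLang.

```python
from typing import Iterable


def collect_stream_text(chunks: Iterable) -> str:
    """Concatenate the text deltas of streamed chat completion chunks."""
    parts = []
    for chunk in chunks:
        # Some chunks carry no choices or an empty delta, so skip those.
        if not chunk.choices:
            continue
        delta = chunk.choices[0].delta.content
        if delta:
            parts.append(delta)
    return "".join(parts)


def stream_chat(prompt: str, base_url: str = "http://127.0.0.1:30000/v1") -> str:
    """Send one chat request with stream=True and return the full reply text."""
    import openai  # imported lazily; requires `pip install openai`

    client = openai.Client(base_url=base_url, api_key="EMPTY")
    stream = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=64,
        stream=True,
    )
    return collect_stream_text(stream)
```

In an interactive application you would usually print each delta as it arrives instead of joining them at the end.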

### Additional Server Arguments
- To enable multi-GPU tensor parallelism, add `--tp 2`. If it reports the error "peer access is not supported between these two devices", add `--enable-p2p-check` to the server launch command.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
```
- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port. You can then use the following commands. If you encounter a deadlock, try adding `--disable-cuda-graph`.
```
# Node 0
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 0

# Node 1
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init-addr sgl-dev-0:50000 --nnodes 2 --node-rank 1
```
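
The parallelism flags above compose in a regular way. Consistent with the 4-GPU example, a launch with `--dp` and `--tp` uses `dp * tp` GPUs, and a multi-node launch repeats the same command on every node with only `--node-rank` changing. The following sketch makes this explicit. The helper names are hypothetical, and it uses the `--nccl-init-addr` spelling that appears in the Llama 3.1 405B commands in this README.

```python
import shlex
from typing import List


def total_gpus(tp: int, dp: int = 1) -> int:
    """GPUs used by one deployment: each of the dp replicas is sharded over tp GPUs."""
    return tp * dp


def multi_node_commands(model_path: str, tp: int, nnodes: int, init_addr: str) -> List[str]:
    """Build one launch command per node; only --node-rank differs between them."""
    commands = []
    for rank in range(nnodes):
        argv = [
            "python", "-m", "sglang.launch_server",
            "--model-path", model_path,
            "--tp", str(tp),
            "--nccl-init-addr", init_addr,
            "--nnodes", str(nnodes),
            "--node-rank", str(rank),
        ]
        # shlex.join quotes arguments so the string is safe to paste into a shell.
        commands.append(shlex.join(argv))
    return commands
```

Calling `multi_node_commands("meta-llama/Meta-Llama-3-8B-Instruct", tp=4, nnodes=2, init_addr="sgl-dev-0:50000")` produces one command for each of the two nodes.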

### Supported Models

**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE
- DeepSeek / DeepSeek 2
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM

**Embedding Models**
- e5-mistral
- gte-Qwen2
  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
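
Embedding vectors are commonly compared with cosine similarity. The following is a minimal pure-Python sketch. `cosine_similarity` is an illustrative helper, not part of SGLang, and it works on any two equal-length lists of floats.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors: 1 means same direction, -1 opposite."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)
```

With the OpenAI Python client, the vector for an input is available as `response.data[0].embedding` on the object returned by `client.embeddings.create(...)`.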

Instructions for supporting a new model are [here](docs/en/model_support.md).

#### Use Models From ModelScope
<details>
<summary>More</summary>

To use a model from [ModelScope](https://www.modelscope.cn), set the environment variable `SGLANG_USE_MODELSCOPE`.
```
export SGLANG_USE_MODELSCOPE=true
```
Launch a [Qwen2-7B-Instruct](https://www.modelscope.cn/models/qwen/qwen2-7b-instruct) server:
```
SGLANG_USE_MODELSCOPE=true python -m sglang.launch_server --model-path qwen/Qwen2-7B-Instruct --port 30000
```

Or start it with Docker:
```bash
docker run --gpus all \
    -p 30000:30000 \
    -v ~/.cache/modelscope:/root/.cache/modelscope \
    --env "SGLANG_USE_MODELSCOPE=true" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server --model-path Qwen/Qwen2.5-7B-Instruct --host 0.0.0.0 --port 30000
```

</details>


#### Run Llama 3.1 405B
<details>
<summary>More</summary>

```bash
# Run 405B (fp8) on a single node
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct-FP8 --tp 8

# Run 405B (fp16) on two nodes
## on the first node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 0 --disable-cuda-graph

## on the second node, replace the `172.16.4.52:20000` with your own first node ip address and port
GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-405B-Instruct --tp 16 --nccl-init-addr 172.16.4.52:20000 --nnodes 2 --node-rank 1 --disable-cuda-graph
```

</details>


### Benchmark Performance

- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
  ```
  python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
  ```
- Benchmark online serving. Launch a server first and run the following command.
  ```
  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
  ```
-												Update Readme (#660)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
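
To make the memory note concrete, the sketch below counts the tokens a static batch holds at once and shows how splitting the prefill into fixed-size pieces bounds the work done per step. It is a plain-Python illustration, not SGLang code: the helper names and `chunk_size=2048` are assumptions chosen for the example, and only the batch size and lengths come from the `bench_latency` command.

```python
from typing import List


def static_batch_tokens(batch_size: int, input_len: int, output_len: int) -> int:
    """Tokens held by the batch once every request has finished decoding."""
    return batch_size * (input_len + output_len)


def chunk_prefill(total_prefill_tokens: int, chunk_size: int) -> List[int]:
    """Split a prefill into pieces of at most `chunk_size` tokens."""
    full, rest = divmod(total_prefill_tokens, chunk_size)
    return [chunk_size] * full + ([rest] if rest else [])


# Values from the bench_latency command: --batch 32 --input-len 256 --output-len 32
batch_size, input_len, output_len = 32, 256, 32
print(static_batch_tokens(batch_size, input_len, output_len))  # 9216

# A static batch prefills all 32 * 256 = 8192 prompt tokens in one step.
# With a (hypothetical) chunk size of 2048, the same work is spread over four steps.
print(chunk_prefill(batch_size * input_len, chunk_size=2048))  # [2048, 2048, 2048, 2048]
```

Because the static benchmark processes the whole prefill in one step, its peak memory can exceed that of a server handling the same requests in smaller pieces.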
											
										
										

## Frontend: Structured Generation Language (SGLang)

The frontend language can be used with local models or API models. It is an alternative to the OpenAI API. You may find it easier to use for complex prompting workflows.

### Quick Start

The example below shows how to use sglang to answer a multi-turn question.

#### Using Local Models

First, launch a server with

```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Then, connect to the server and answer a multi-turn question.

```python
from sglang import function, system, user, assistant, gen, set_default_backend, RuntimeEndpoint

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(RuntimeEndpoint("http://localhost:30000"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
```

#### Using OpenAI Models

Set the OpenAI API Key

```
export OPENAI_API_KEY=sk-******
```

Then, answer a multi-turn question.

```python
from sglang import function, system, user, assistant, gen, set_default_backend, OpenAI

@function
def multi_turn_question(s, question_1, question_2):
    s += system("You are a helpful assistant.")
    s += user(question_1)
    s += assistant(gen("answer_1", max_tokens=256))
    s += user(question_2)
    s += assistant(gen("answer_2", max_tokens=256))

set_default_backend(OpenAI("gpt-3.5-turbo"))

state = multi_turn_question.run(
    question_1="What is the capital of the United States?",
    question_2="List two local attractions.",
)

for m in state.messages():
    print(m["role"], ":", m["content"])

print(state["answer_1"])
```

#### More Examples

Anthropic and VertexAI (Gemini) models are also supported.
You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).

### Language Feature

To begin with, import sglang.

```python
import sglang as sgl
```

`sglang` provides some simple primitives such as `gen`, `select`, `fork`, and `image`.
You can implement your prompt flow in a function decorated by `sgl.function`.
You can then invoke the function with `run` or `run_batch`.
The system will manage the state, chat template, parallelism and batching for you.

The complete code for the examples below can be found at [readme_examples.py](examples/frontend_language/usage/readme_examples.py).

#### Control Flow

You can use any Python code within the function body, including control flow, nested function calls, and external libraries.

```python
@sgl.function
def tool_use(s, question):
    s += "To answer this question: " + question + ". "
    s += "I need to use a " + sgl.gen("tool", choices=["calculator", "search engine"]) + ". "

    if s["tool"] == "calculator":
        s += "The math expression is" + sgl.gen("expression")
    elif s["tool"] == "search engine":
        s += "The key word to search is" + sgl.gen("word")
```

#### Parallelism

Use `fork` to launch parallel prompts.
Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.

```python
@sgl.function
def tip_suggestion(s):
    s += (
        "Here are two tips for staying healthy: "
        "1. Balanced Diet. 2. Regular Exercise.\n\n"
    )

    forks = s.fork(2)
    for i, f in enumerate(forks):
        f += f"Now, expand tip {i+1} into a paragraph:\n"
        f += sgl.gen("detailed_tip", max_tokens=256, stop="\n\n")

    s += "Tip 1:" + forks[0]["detailed_tip"] + "\n"
    s += "Tip 2:" + forks[1]["detailed_tip"] + "\n"
    s += "In summary" + sgl.gen("summary")
```
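
The non-blocking behavior can be pictured with ordinary Python futures. The sketch below is only an analogy built on the standard library, not how SGLang is implemented: `fake_generate` is a hypothetical stand-in for a model call. Submitting work returns immediately, and the program waits only when it reads a result, much as reading `forks[0]["detailed_tip"]` waits for that fork's generation.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for a model call; sleeps to simulate latency."""
    time.sleep(0.2)
    return f"<expanded: {prompt}>"


tips = ["Balanced Diet", "Regular Exercise"]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(tips)) as pool:
    # submit() returns at once, so both calls are in flight after this line.
    futures = [pool.submit(fake_generate, tip) for tip in tips]
    # result() blocks until that particular call has finished.
    details = [future.result() for future in futures]
elapsed = time.perf_counter() - start

print(details)
# The two calls overlap, so the total is roughly one call's latency, not two.
print(f"elapsed: {elapsed:.2f}s")
```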

											
										
										
#### Multi-Modality

Use `sgl.image` to pass an image as input.

```python
@sgl.function
def image_qa(s, image_file, question):
    s += sgl.user(sgl.image(image_file) + question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=256))
```

See also [local_example_llava_next.py](examples/frontend_language/quick_start/local_example_llava_next.py).

#### Constrained Decoding

Use `regex` to specify a regular expression as a decoding constraint.
This is only supported for local models.

```python
@sgl.function
def regular_expression_gen(s):
    s += "Q: What is the IP address of the Google DNS servers?\n"
    s += "A: " + sgl.gen(
        "answer",
        temperature=0,
        regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
    )
```
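
A pattern can be sanity-checked with Python's `re` module before it is passed to `sgl.gen`. The sketch below is a standalone check, not part of SGLang, and the `accepts` helper is introduced only for illustration. It compares an IPv4 pattern whose separator is written as `\.` with a variant that uses a bare `.`, which matches any character and therefore admits strings that are not IP addresses.

```python
import re

OCTET = r"(25[0-5]|2[0-4]\d|[01]?\d\d?)"
ip_regex = rf"({OCTET}\.){{3}}{OCTET}"    # separator escaped: a literal dot
loose_regex = rf"({OCTET}.){{3}}{OCTET}"  # bare dot: any character is accepted


def accepts(pattern: str, text: str) -> bool:
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


for text in ["8.8.8.8", "8a8b8c8", "256.1.1.1"]:
    print(f"{text!r}: strict={accepts(ip_regex, text)} loose={accepts(loose_regex, text)}")
```

Only `8.8.8.8` passes the strict pattern, while the loose pattern also accepts `8a8b8c8`.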
											
										
										
#### JSON Decoding

Use `regex` to specify a JSON schema with a regular expression.

```python
character_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "blood status": "(Pure-blood|Half-blood|Muggle-born)",\n"""
    + r"""    "occupation": "(student|teacher|auror|ministry of magic|death eater|order of the phoenix)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "core": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{0,2}\n"""
    + r"""    \},\n"""
    + r"""    "alive": "(Alive|Deceased)",\n"""
    + r"""    "patronus": "[\w\d\s]{1,16}",\n"""
    + r"""    "bogart": "[\w\d\s]{1,16}"\n"""
    + r"""\}"""
)

@sgl.function
def character_gen(s, name):
    s += name + " is a character in Harry Potter. Please fill in the following information about this character.\n"
    s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
```

See also [json_decode.py](examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
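
Since the constraint is an ordinary regular expression, it can be checked offline against a hand-written sample. The sketch below uses a shortened pattern in the same style as `character_regex`; the `wand_regex` name and the sample string are made up for illustration, and no model is involved. It requires at least one digit after the decimal point, because a bare `11.` would not parse as JSON.

```python
import json
import re

# Shortened pattern in the same style as character_regex (illustrative only).
wand_regex = (
    r"""\{\n"""
    + r"""    "name": "[\w\d\s]{1,16}",\n"""
    + r"""    "house": "(Gryffindor|Slytherin|Ravenclaw|Hufflepuff)",\n"""
    + r"""    "wand": \{\n"""
    + r"""        "wood": "[\w\d\s]{1,16}",\n"""
    + r"""        "length": [0-9]{1,2}\.[0-9]{1,2}\n"""
    + r"""    \}\n"""
    + r"""\}"""
)

# Hand-written stand-in for a constrained model output.
sample = (
    '{\n'
    '    "name": "Harry Potter",\n'
    '    "house": "Gryffindor",\n'
    '    "wand": {\n'
    '        "wood": "holly",\n'
    '        "length": 11.0\n'
    '    }\n'
    '}'
)

# The whole string must match the pattern, whitespace included.
matches = re.fullmatch(wand_regex, sample) is not None
character = json.loads(sample)
print(matches, character["house"], character["wand"]["length"])  # True Gryffindor 11.0
```

Because the pattern fixes the indentation and key order, any output that satisfies it has the same layout and can be passed straight to `json.loads`.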
											
										
										
#### Batching

Use `run_batch` to run a batch of requests with continuous batching.

```python
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

states = text_qa.run_batch(
    [
        {"question": "What is the capital of the United Kingdom?"},
        {"question": "What is the capital of France?"},
        {"question": "What is the capital of Japan?"},
    ],
    progress_bar=True,
)
```

#### Streaming

Add `stream=True` to enable streaming.

```python
@sgl.function
def text_qa(s, question):
    s += "Q: " + question + "\n"
    s += "A:" + sgl.gen("answer", stop="\n")

state = text_qa.run(
    question="What is the capital of France?",
    temperature=0.1,
    stream=True,
)

for out in state.text_iter():
    print(out, end="", flush=True)
```

#### Roles

Use `sgl.system`, `sgl.user` and `sgl.assistant` to set roles when using chat models. You can also define more complex role prompts using begin and end tokens.

```python
@sgl.function
def chat_example(s):
    s += sgl.system("You are a helpful assistant.")
    # Same as: s += s.system("You are a helpful assistant.")

    with s.user():
        s += "Question: What is the capital of France?"

    s += sgl.assistant_begin()
    s += "Answer: " + sgl.gen(max_tokens=100, stop="\n")
    s += sgl.assistant_end()
```
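
What a role wrapper contributes can be pictured as a prefix and a suffix placed around the text. The sketch below is a plain-Python illustration with a made-up ChatML-style template. The `TEMPLATE` dictionary and the `render` helper are hypothetical and are not part of SGLang, where the actual tokens depend on the chat template of the model being served.

```python
from typing import List, Optional, Tuple

# Hypothetical ChatML-style template: role -> (begin token, end token).
TEMPLATE = {
    "system": ("<|im_start|>system\n", "<|im_end|>\n"),
    "user": ("<|im_start|>user\n", "<|im_end|>\n"),
    "assistant": ("<|im_start|>assistant\n", "<|im_end|>\n"),
}


def render(messages: List[Tuple[str, str]], open_role: Optional[str] = None) -> str:
    """Wrap each (role, text) pair in that role's begin and end tokens.

    If `open_role` is given, only its begin token is appended, which marks the
    point where generation would start (the analogue of a role "begin" call).
    """
    parts = [TEMPLATE[role][0] + text + TEMPLATE[role][1] for role, text in messages]
    if open_role is not None:
        parts.append(TEMPLATE[open_role][0])
    return "".join(parts)


prompt = render(
    [
        ("system", "You are a helpful assistant."),
        ("user", "Question: What is the capital of France?"),
    ],
    open_role="assistant",
)
print(prompt)
```

Splitting a role into separate begin and end steps makes it possible to put fixed text such as `"Answer: "` between the begin token and the generated continuation.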
											
										
										

#### Tips and Implementation Details

- The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability.
- The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`.
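
The effect of token-length normalization can be seen in a small sketch. The per-token log probabilities below are made-up numbers, and `pick_choice` is a hypothetical helper rather than an SGLang function; in practice the values come from the model.

```python
from typing import Dict, List


def pick_choice(token_logprobs: Dict[str, List[float]]) -> str:
    """Return the choice with the highest mean per-token log probability."""

    def normalized(choice: str) -> float:
        logprobs = token_logprobs[choice]
        return sum(logprobs) / len(logprobs)

    return max(token_logprobs, key=normalized)


# Made-up per-token log probabilities for each candidate.
scores = {
    "calculator": [-0.9, -0.4, -0.2],                       # 3 tokens, sum -1.5, mean -0.5
    "search engine": [-0.7, -0.3, -0.2, -0.2, -0.1, -0.3],  # 6 tokens, sum -1.8, mean -0.3
}

unnormalized = max(scores, key=lambda choice: sum(scores[choice]))
print(unnormalized)         # calculator
print(pick_choice(scores))  # search engine
```

Without normalization the shorter choice wins because it accumulates fewer negative terms. Dividing by the token count removes that bias toward short choices.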

## Benchmark And Performance

![8b_throughput](https://lmsys.org/images/blog/sglang_llama3/8b_throughput.svg)
![70b_fp8_throughput](https://lmsys.org/images/blog/sglang_llama3/70b_fp8_throughput.svg)

Learn more at this [blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/).

## Roadmap

[Development Roadmap (2024 Q4)](https://github.com/sgl-project/sglang/issues/1487)

## Citation And Acknowledgment

Please cite our paper, [SGLang: Efficient Execution of Structured Language Model Programs](https://arxiv.org/abs/2312.07104), if you find the project useful.
We also learned from the design of, and reused code from, the following projects: [Guidance](https://github.com/guidance-ai/guidance), [vLLM](https://github.com/vllm-project/vllm), [LightLLM](https://github.com/ModelTC/lightllm), [FlashInfer](https://github.com/flashinfer-ai/flashinfer), [Outlines](https://github.com/outlines-dev/outlines), and [LMQL](https://github.com/eth-sri/lmql).