diff --git a/README.md b/README.md
index afafeb2f7..f9034587d 100644
--- a/README.md
+++ b/README.md
@@ -6,23 +6,29 @@
 | [**Blog**](https://lmsys.org/blog/2024-01-17-sglang/)
 | [**Paper**](https://arxiv.org/abs/2312.07104) |
 
-SGLang is a structured generation language designed for large language models (LLMs).
-It makes your interaction with LLMs faster and more controllable by co-designing the frontend language and the runtime system.
+SGLang is a fast serving framework for large language models and vision language models.
+It makes your interaction with models faster and more controllable by co-designing the backend runtime and frontend language.
 
 The core features include:
+- **Fast Backend Runtime**: Efficient serving with RadixAttention for prefix caching, continuous batching, token attention (paged attention), tensor parallelism, FlashInfer kernels, jump-forward constrained decoding, and quantization (AWQ/FP8/GPTQ/Marlin).
 - **Flexible Frontend Language**: Enables easy programming of LLM applications with chained generation calls, advanced prompting, control flow, multiple modalities, parallelism, and external interactions.
-- **High-Performance Backend Runtime**: Features RadixAttention for accelerating complex LLM programs by reusing the KV cache across multiple calls. It can also serve as a standalone inference engine with all common techniques implemented (e.g., continuous batching and tensor parallelism).
 
 ## News
+- [2024/04] 🔥 SGLang is used by the official **LLaVA-NeXT (video)** release ([blog](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/)).
 - [2024/02] 🔥 SGLang enables **3x faster JSON decoding** with a compressed finite state machine ([blog](https://lmsys.org/blog/2024-02-05-compressed-fsm/)).
-- [2024/01] 🔥 SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
 - [2024/01] SGLang provides up to **5x faster inference** with RadixAttention ([blog](https://lmsys.org/blog/2024-01-17-sglang/)).
+
+<details>
+<summary>More</summary>
+
+- [2024/01] SGLang powers the serving of the official **LLaVA v1.6** release demo ([usage](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#demo)).
+
+</details>
+
 ## Contents
 - [Install](#install)
-- [Quick Start](#quick-start)
-- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
 - [Backend: SGLang Runtime (SRT)](#backend-sglang-runtime-srt)
+- [Frontend: Structured Generation Language (SGLang)](#frontend-structured-generation-language-sglang)
 - [Benchmark And Performance](#benchmark-and-performance)
 - [Roadmap](#roadmap)
 - [Citation And Acknowledgment](#citation-and-acknowledgment)
@@ -70,13 +76,118 @@ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/
 - If you cannot install FlashInfer, check out its [installation](https://docs.flashinfer.ai/installation.html#) page. If you still cannot install it, you can use the slower Triton kernels by adding `--disable-flashinfer` when launching the server.
 - If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install "sglang[openai]"`.
 
-## Quick Start
+## Backend: SGLang Runtime (SRT)
+The SGLang Runtime (SRT) is an efficient serving engine.
+
+### Launching a Server
+Launch a server
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
+```
+
+Send a request
+```
+curl http://localhost:30000/generate \
+  -H "Content-Type: application/json" \
+  -d '{
+    "text": "Once upon a time,",
+    "sampling_params": {
+      "max_new_tokens": 16,
+      "temperature": 0
+    }
+  }'
+```
+Learn more about the argument format [here](docs/sampling_params.md).
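+
+You can also call `/generate` from Python. The following is a minimal client sketch that mirrors the `curl` example above; it assumes only the request shape shown there (a JSON body with `text` and `sampling_params`) and prints the raw JSON response:
+
+```python
+import requests
+
+# Same endpoint and JSON body as the curl example above.
+response = requests.post(
+    "http://localhost:30000/generate",
+    json={
+        "text": "Once upon a time,",
+        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
+    },
+)
+response.raise_for_status()
+print(response.json())  # Inspect the returned JSON for the generated text.
+```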
+
+### OpenAI Compatible API
+In addition, the server supports OpenAI-compatible APIs.
+
+```python
+import openai
+client = openai.Client(
+    base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+# Text completion
+response = client.completions.create(
+    model="default",
+    prompt="The capital of France is",
+    temperature=0,
+    max_tokens=32,
+)
+print(response)
+
+# Chat completion
+response = client.chat.completions.create(
+    model="default",
+    messages=[
+        {"role": "system", "content": "You are a helpful AI assistant"},
+        {"role": "user", "content": "List 3 countries and their capitals."},
+    ],
+    temperature=0,
+    max_tokens=64,
+)
+print(response)
+```
+
+It supports streaming and most features of the Chat/Completions/Models endpoints specified by the [OpenAI API Reference](https://platform.openai.com/docs/api-reference/).
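+
+For example, a streaming chat completion only requires `stream=True`. This is a minimal sketch using the standard `openai` client; it assumes the server launched above is still running on port 30000:
+
+```python
+import openai
+
+client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+stream = client.chat.completions.create(
+    model="default",
+    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
+    temperature=0,
+    max_tokens=64,
+    stream=True,
+)
+for chunk in stream:
+    # Each chunk carries an incremental piece of the assistant's reply.
+    delta = chunk.choices[0].delta.content
+    if delta is not None:
+        print(delta, end="", flush=True)
+```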
+
+### Additional Server Arguments
+- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add the `--enable-p2p-check` option.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --tp 2
+```
+- Add `--dp 2` to enable data parallelism. It can also be used together with tensor parallelism. Data parallelism is better for throughput if there is enough memory.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --dp 2 --tp 2
+```
+- If you see out-of-memory errors during serving, try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
+```
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --mem-fraction-static 0.7
+```
+- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-1` be the hostname of the first node and `50000` be an available port.
+```
+# Node 0
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 0
+
+# Node 1
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 1
+```
+- If the model does not have a template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/custom_chat_template.md).
+
+### Supported Models
+
+- Llama / Llama 2 / Llama 3
+- Mistral / Mixtral
+- Gemma / Gemma 2
+- Qwen / Qwen 2 / Qwen 2 MoE
+- LLaVA 1.5 / 1.6
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000`
+  - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 30000`
+- LLaVA-NeXT-Video
+  - See [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh).
+- Yi-VL
+  - See [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py).
+- StableLM
+- Command-R
+- DBRX
+- Grok
+- ChatGLM
+- InternLM 2
+
+Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md).
+
+## Frontend: Structured Generation Language (SGLang)
+The frontend language can be used with local models or API models.
+
+### Quick Start
 The example below shows how to use sglang to answer a multi-turn question.
 
-### Using Local Models
+#### Using Local Models
 First, launch a server with
 ```
-python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000
+python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
 ```
 
 Then, connect to the server and answer a multi-turn question.
@@ -105,7 +216,7 @@ for m in state.messages():
 print(state["answer_1"])
 ```
 
-### Using OpenAI Models
+#### Using OpenAI Models
 Set the OpenAI API Key
 ```
 export OPENAI_API_KEY=sk-******
 ```
@@ -136,13 +247,12 @@ for m in state.messages():
 print(state["answer_1"])
 ```
 
-### More Examples
+#### More Examples
 Anthropic and VertexAI (Gemini) models are also supported.
 You can find more examples at [examples/quick_start](examples/quick_start).
 
-## Frontend: Structured Generation Language (SGLang)
-
+### Language Features
 To begin with, import sglang.
 ```python
 import sglang as sgl
 ```
@@ -155,7 +265,7 @@ The system will manage the state, chat template, parallelism and batching for you.
 The complete code for the examples below can be found at [readme_examples.py](examples/usage/readme_examples.py)
 
-### Control Flow
+#### Control Flow
 You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
 
 ```python
 @sgl.function
 def tool_use(s, question):
@@ -170,7 +280,7 @@
     s += "The key word to search is" + sgl.gen("word")
 ```
 
-### Parallelism
+#### Parallelism
 Use `fork` to launch parallel prompts.
 Because `sgl.gen` is non-blocking, the for loop below issues two generation calls in parallel.
 
 ```python
 @sgl.function
 def tip_suggestion(s):
@@ -192,7 +302,7 @@
     s += "In summary" + sgl.gen("summary")
 ```
 
-### Multi Modality
+#### Multi Modality
 Use `sgl.image` to pass an image as input.
```python @@ -204,7 +314,7 @@ def image_qa(s, image_file, question): See also [srt_example_llava.py](examples/quick_start/srt_example_llava.py). -### Constrained Decoding +#### Constrained Decoding Use `regex` to specify a regular expression as a decoding constraint. This is only supported for local models. @@ -219,7 +329,7 @@ def regular_expression_gen(s): ) ``` -### JSON Decoding +#### JSON Decoding Use `regex` to specify a JSON schema with a regular expression. ```python @@ -248,8 +358,7 @@ def character_gen(s, name): See also [json_decode.py](examples/usage/json_decode.py) for an additional example on specifying formats with Pydantic models. - -### Batching +#### Batching Use `run_batch` to run a batch of requests with continuous batching. ```python @@ -268,7 +377,7 @@ states = text_qa.run_batch( ) ``` -### Streaming +#### Streaming Add `stream=True` to enable streaming. ```python @@ -287,139 +396,10 @@ for out in state.text_iter(): print(out, end="", flush=True) ``` -### Tips and Implementation Details +#### Tips and Implementation Details - The `choices` argument in `sgl.gen` is implemented by computing the [token-length normalized log probabilities](https://blog.eleuther.ai/multiple-choice-normalization/) of all choices and selecting the one with the highest probability. - The `regex` argument in `sgl.gen` is implemented through autoregressive decoding with logit bias masking, according to the constraints set by the regex. It is compatible with `temperature=0` and `temperature != 0`. -## Backend: SGLang Runtime (SRT) -The SGLang Runtime (SRT) is designed to work best with the SGLang frontend. -However, it can also be used as a standalone API server. -In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse. - -### Usage -Launch a server -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 -``` - -Send a request -``` -curl http://localhost:30000/generate \ - -H "Content-Type: application/json" \ - -d '{ - "text": "Once upon a time,", - "sampling_params": { - "max_new_tokens": 16, - "temperature": 0 - } - }' -``` -Learn more about the argument format [here](docs/sampling_params.md). - -### OpenAI Compatible API -In addition, the server supports an experimental OpenAI-compatible API. - -```python -import openai -client = openai.Client( - base_url="http://127.0.0.1:30000/v1", api_key="EMPTY") - -# Text completion -response = client.completions.create( - model="default", - prompt="The capital of France is", - temperature=0, - max_tokens=32, -) -print(response) - -# Chat completion -response = client.chat.completions.create( - model="default", - messages=[ - {"role": "system", "content": "You are a helpful AI assistant"}, - {"role": "user", "content": "List 3 countries and their capitals."}, - ], - temperature=0, - max_tokens=64, -) -print(response) -``` - -By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3. - -If needed, you can also override the chat template when launching the server: - -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2 -``` - -If the chat template you are looking for is missing, you are welcome to contribute it. 
-Meanwhile, you can also temporarily register your chat template as follows: - -```json -{ - "name": "my_model", - "system": "<|im_start|>system", - "user": "<|im_start|>user", - "assistant": "<|im_start|>assistant", - "sep_style": "CHATML", - "sep": "<|im_end|>", - "stop_str": ["<|im_end|>", "<|im_start|>"] -} -``` - -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json -``` - -### Additional Arguments -- Add `--tp 2` to enable tensor parallelism. If it indicates `peer access is not supported between these two devices`, add `--enable-p2p-check` option. -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2 -``` -- Add `--dp 2` to enable data parallelism. It can also be used together with tp. Data parallelism is better for throughput if there is enough memory. -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --dp 2 --tp 2 -``` -- If you see out-of-memory errors during serving, please try to reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9` -``` -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7 -``` -- See [hyperparameter_tuning.md](docs/hyperparameter_tuning.md) on tuning hyperparameters for better performance. -- Add `--nnodes 2` to run tensor parallelism on multiple nodes. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-1` be the hostname of the first node and `50000` be an available port. -``` -# Node 0 -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 0 - -# Node 1 -python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --tp 4 --nccl-init sgl-dev-1:50000 --nnodes 2 --node-rank 1 -``` - -### Supported Models -- Llama -- Mistral -- Mixtral -- Qwen / Qwen 2 / Qwen 2 MoE -- Gemma / Gemma 2 - - `python -m sglang.launch_server --model-path google/gemma-7b-it --port 30000 --attention-reduce-in-fp32` -- LLaVA - - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000` - - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-vicuna-7b --tokenizer-path llava-hf/llava-1.5-7b-hf --chat-template vicuna_v1.1 --port 30000` - - `python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.6-34b --tokenizer-path liuhaotian/llava-v1.6-34b-tokenizer --port 3000` -- LLaVA-NeXT-Video - - see [srt_example_llava_v.sh](examples/usage/llava_video/srt_example_llava_v.sh) -- Yi-VL - - see [srt_example_yi_vl.py](examples/quick_start/srt_example_yi_vl.py). -- StableLM -- Command-R -- DBRX -- Grok -- ChatGLM -- AWQ/GPTQ/Marlin quantization - -Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/model_support.md). 
- ## Benchmark And Performance - Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1 ![llama_7b](assets/llama_7b.jpg) diff --git a/docs/benchmark_results.md b/docs/benchmark_results.md index 519dfec3f..2688c0c16 100644 --- a/docs/benchmark_results.md +++ b/docs/benchmark_results.md @@ -1,4 +1,4 @@ -## Benchmark Results +# Benchmark Results We tested our system on the following common LLM workloads and reported the achieved throughput: - **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark. diff --git a/docs/custom_chat_template.md b/docs/custom_chat_template.md new file mode 100644 index 000000000..815c7e676 --- /dev/null +++ b/docs/custom_chat_template.md @@ -0,0 +1,28 @@ +# Custom Chat Template in SGLang Runtime + +By default, the server uses the chat template specified in the model tokenizer from Hugging Face. It should just work for most official models such as Llama-2/Llama-3. + +If needed, you can also override the chat template when launching the server: + +``` +python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2 +``` + +If the chat template you are looking for is missing, you are welcome to contribute it. +Meanwhile, you can also temporarily register your chat template as follows: + +```json +{ + "name": "my_model", + "system": "<|im_start|>system", + "user": "<|im_start|>user", + "assistant": "<|im_start|>assistant", + "sep_style": "CHATML", + "sep": "<|im_end|>", + "stop_str": ["<|im_end|>", "<|im_start|>"] +} +``` + +``` +python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json +``` \ No newline at end of file diff --git a/docs/model_support.md b/docs/model_support.md index b9b680741..7e1230d9b 100644 --- a/docs/model_support.md +++ b/docs/model_support.md @@ -1,4 +1,4 @@ -## How to Support a New Model +# How to Support a New Model To support a new model in SGLang, you only need to add a single file under [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models). You can learn from existing model implementations and create new files for the new models. Most models are based on the transformer architecture, making them very similar. diff --git a/docs/sampling_params.md b/docs/sampling_params.md index 065bbc2d5..745f20823 100644 --- a/docs/sampling_params.md +++ b/docs/sampling_params.md @@ -1,4 +1,4 @@ -## Sampling Parameters of SGLang Runtime +# Sampling Parameters in SGLang Runtime This doc describes the sampling parameters of the SGLang Runtime. The `/generate` endpoint accepts the following arguments in the JSON format. @@ -6,11 +6,11 @@ The `/generate` endpoint accepts the following arguments in the JSON format. ```python @dataclass class GenerateReqInput: - # The input prompt + # The input prompt. It can be a single prompt or a batch of prompts. text: Union[List[str], str] # The token ids for text; one can either specify text or input_ids input_ids: Optional[Union[List[List[int]], List[int]]] = None - # The image input + # The image input. It can be a file name. 
image_data: Optional[Union[List[str], str]] = None # The sampling_params sampling_params: Union[List[Dict], Dict] = None diff --git a/docs/test_process.md b/docs/test_process.md index 18f91c6d4..90958ec62 100644 --- a/docs/test_process.md +++ b/docs/test_process.md @@ -1,4 +1,4 @@ -## SRT Unit Tests +# SRT Unit Tests ### Latency Alignment Make sure your changes do not slow down the following benchmarks diff --git a/python/sglang/README.md b/python/sglang/README.md index 2f298c2c3..0b01ec1df 100644 --- a/python/sglang/README.md +++ b/python/sglang/README.md @@ -1,8 +1,7 @@ # Code Structure -- `backend`: Various backends for the language interpreter. - `lang`: The frontend language. -- `srt`: The serving engine for running local models. (SRT = SGLang Runtime). +- `srt`: The backend engine for running local models. (SRT = SGLang Runtime). - `test`: Test utilities. - `api.py`: Public API. - `bench_latency.py`: Benchmark utilities. diff --git a/python/sglang/__init__.py b/python/sglang/__init__.py index aea66b2e7..c71156448 100644 --- a/python/sglang/__init__.py +++ b/python/sglang/__init__.py @@ -22,16 +22,16 @@ from sglang.api import ( video, ) -# SGL Backends -from sglang.backend.anthropic import Anthropic -from sglang.backend.litellm import LiteLLM -from sglang.backend.openai import OpenAI -from sglang.backend.runtime_endpoint import RuntimeEndpoint -from sglang.backend.vertexai import VertexAI - # Global Configurations from sglang.global_config import global_config +# SGL Backends +from sglang.lang.backend.anthropic import Anthropic +from sglang.lang.backend.litellm import LiteLLM +from sglang.lang.backend.openai import OpenAI +from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint +from sglang.lang.backend.vertexai import VertexAI + # public APIs management __all__ = [ "global_config", diff --git a/python/sglang/api.py b/python/sglang/api.py index 043893568..c32943963 100644 --- a/python/sglang/api.py +++ b/python/sglang/api.py @@ -4,8 +4,8 @@ import os import re from typing import Callable, List, Optional, Union -from sglang.backend.base_backend import BaseBackend from sglang.global_config import global_config +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.ir import ( SglExpr, SglExprList, diff --git a/python/sglang/bench.py b/python/sglang/bench_serving.py similarity index 100% rename from python/sglang/bench.py rename to python/sglang/bench_serving.py diff --git a/python/sglang/backend/__init__.py b/python/sglang/lang/backend/__init__.py similarity index 100% rename from python/sglang/backend/__init__.py rename to python/sglang/lang/backend/__init__.py diff --git a/python/sglang/backend/anthropic.py b/python/sglang/lang/backend/anthropic.py similarity index 97% rename from python/sglang/backend/anthropic.py rename to python/sglang/lang/backend/anthropic.py index d96d0f04f..5a36bd9ac 100644 --- a/python/sglang/backend/anthropic.py +++ b/python/sglang/lang/backend/anthropic.py @@ -2,7 +2,7 @@ from typing import List, Optional, Union import numpy as np -from sglang.backend.base_backend import BaseBackend +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import get_chat_template from sglang.lang.interpreter import StreamExecutor from sglang.lang.ir import SglSamplingParams diff --git a/python/sglang/backend/base_backend.py b/python/sglang/lang/backend/base_backend.py similarity index 100% rename from python/sglang/backend/base_backend.py rename to python/sglang/lang/backend/base_backend.py diff --git 
a/python/sglang/backend/litellm.py b/python/sglang/lang/backend/litellm.py similarity index 97% rename from python/sglang/backend/litellm.py rename to python/sglang/lang/backend/litellm.py index d9b4023ca..f6dac6293 100644 --- a/python/sglang/backend/litellm.py +++ b/python/sglang/lang/backend/litellm.py @@ -1,6 +1,6 @@ from typing import Mapping, Optional -from sglang.backend.base_backend import BaseBackend +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import get_chat_template_by_model_path from sglang.lang.interpreter import StreamExecutor from sglang.lang.ir import SglSamplingParams diff --git a/python/sglang/backend/openai.py b/python/sglang/lang/backend/openai.py similarity index 99% rename from python/sglang/backend/openai.py rename to python/sglang/lang/backend/openai.py index 6f65f4eab..06701cb37 100644 --- a/python/sglang/backend/openai.py +++ b/python/sglang/lang/backend/openai.py @@ -6,7 +6,7 @@ from typing import Callable, List, Optional, Union import numpy as np -from sglang.backend.base_backend import BaseBackend +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import ChatTemplate, get_chat_template_by_model_path from sglang.lang.interpreter import StreamExecutor from sglang.lang.ir import SglSamplingParams diff --git a/python/sglang/backend/runtime_endpoint.py b/python/sglang/lang/backend/runtime_endpoint.py similarity index 99% rename from python/sglang/backend/runtime_endpoint.py rename to python/sglang/lang/backend/runtime_endpoint.py index d845e8116..772577336 100644 --- a/python/sglang/backend/runtime_endpoint.py +++ b/python/sglang/lang/backend/runtime_endpoint.py @@ -3,8 +3,8 @@ from typing import List, Optional import numpy as np -from sglang.backend.base_backend import BaseBackend from sglang.global_config import global_config +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import get_chat_template_by_model_path from sglang.lang.interpreter import StreamExecutor from sglang.lang.ir import SglSamplingParams diff --git a/python/sglang/backend/vertexai.py b/python/sglang/lang/backend/vertexai.py similarity index 98% rename from python/sglang/backend/vertexai.py rename to python/sglang/lang/backend/vertexai.py index 871cb4f88..c27733b3e 100644 --- a/python/sglang/backend/vertexai.py +++ b/python/sglang/lang/backend/vertexai.py @@ -2,7 +2,7 @@ import os import warnings from typing import Optional -from sglang.backend.base_backend import BaseBackend +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import get_chat_template from sglang.lang.interpreter import StreamExecutor from sglang.lang.ir import SglSamplingParams diff --git a/python/sglang/lang/tracer.py b/python/sglang/lang/tracer.py index 53f772163..cfe9198bc 100644 --- a/python/sglang/lang/tracer.py +++ b/python/sglang/lang/tracer.py @@ -3,8 +3,8 @@ import uuid from typing import Any, Callable, Dict, List, Optional, Union -from sglang.backend.base_backend import BaseBackend from sglang.global_config import global_config +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.interpreter import ProgramState, ProgramStateGroup from sglang.lang.ir import ( SglArgument, diff --git a/python/sglang/srt/server.py b/python/sglang/srt/server.py index f0d425e67..522874035 100644 --- a/python/sglang/srt/server.py +++ b/python/sglang/srt/server.py @@ -26,7 +26,7 @@ import uvloop from fastapi import FastAPI, Request from fastapi.responses import 
JSONResponse, Response, StreamingResponse -from sglang.backend.runtime_endpoint import RuntimeEndpoint +from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint from sglang.srt.constrained import disable_cache from sglang.srt.hf_transformers_utils import get_tokenizer from sglang.srt.managers.controller.manager_multi import ( diff --git a/python/sglang/srt/server_args.py b/python/sglang/srt/server_args.py index 50fe7cd17..264985fb5 100644 --- a/python/sglang/srt/server_args.py +++ b/python/sglang/srt/server_args.py @@ -166,6 +166,15 @@ class ServerArgs: "--quantization", type=str, default=ServerArgs.quantization, + choices=[ + "awq", + "fp8", + "gptq", + "marlin", + "gptq_marlin", + "squeezellm", + "bitsandbytes", + ], help="The quantization method.", ) parser.add_argument( @@ -243,13 +252,13 @@ class ServerArgs: parser.add_argument( "--show-time-cost", action="store_true", - help="Show time cost of custom marks", + help="Show time cost of custom marks.", ) parser.add_argument( "--api-key", type=str, default=ServerArgs.api_key, - help="Set API key of the server", + help="Set API key of the server.", ) # Data parallelism @@ -285,17 +294,17 @@ class ServerArgs: parser.add_argument( "--disable-flashinfer", action="store_true", - help="Disable flashinfer inference kernels", + help="Disable flashinfer inference kernels.", ) parser.add_argument( "--disable-radix-cache", action="store_true", - help="Disable RadixAttention", + help="Disable RadixAttention for prefix caching.", ) parser.add_argument( "--disable-regex-jump-forward", action="store_true", - help="Disable regex jump-forward", + help="Disable regex jump-forward.", ) parser.add_argument( "--disable-cuda-graph", diff --git a/python/sglang/test/test_programs.py b/python/sglang/test/test_programs.py index 6fa8f8214..c9e8139df 100644 --- a/python/sglang/test/test_programs.py +++ b/python/sglang/test/test_programs.py @@ -306,7 +306,7 @@ def test_image_qa(): assert ( "taxi" in state.messages()[-1]["content"] or "car" in state.messages()[-1]["content"] - ) + ), f"{state.messages()[-1]['content']}" def test_stream(): diff --git a/python/sglang/test/test_utils.py b/python/sglang/test/test_utils.py index 693bade6f..af7f3765e 100644 --- a/python/sglang/test/test_utils.py +++ b/python/sglang/test/test_utils.py @@ -6,9 +6,9 @@ from functools import partial import numpy as np import requests -from sglang.backend.openai import OpenAI -from sglang.backend.runtime_endpoint import RuntimeEndpoint from sglang.global_config import global_config +from sglang.lang.backend.openai import OpenAI +from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint from sglang.utils import get_exception_traceback diff --git a/test/lang/test_bind_cache.py b/test/lang/test_bind_cache.py index 9cba14ce4..b2c6bfbe8 100644 --- a/test/lang/test_bind_cache.py +++ b/test/lang/test_bind_cache.py @@ -1,7 +1,7 @@ import unittest import sglang as sgl -from sglang.backend.runtime_endpoint import RuntimeEndpoint +from sglang.lang.backend.runtime_endpoint import RuntimeEndpoint class TestBind(unittest.TestCase): diff --git a/test/lang/test_tracing.py b/test/lang/test_tracing.py index 266ce65fe..ae7a95cad 100644 --- a/test/lang/test_tracing.py +++ b/test/lang/test_tracing.py @@ -1,7 +1,7 @@ import unittest import sglang as sgl -from sglang.backend.base_backend import BaseBackend +from sglang.lang.backend.base_backend import BaseBackend from sglang.lang.chat_template import get_chat_template diff --git a/test/srt/test_httpserver_llava.py b/test/srt/test_httpserver_llava.py 
index e3cf1b799..a7912fcc2 100644 --- a/test/srt/test_httpserver_llava.py +++ b/test/srt/test_httpserver_llava.py @@ -10,7 +10,6 @@ The image features a man standing on the back of a yellow taxi cab, holding import argparse import asyncio import json -import time import aiohttp import requests