change file tree (#1859)
Co-authored-by: Chayenne <zhaochenyang@g.ucla.edu>
docs/references/benchmark_and_profiling.md (new file, 53 lines)
@@ -0,0 +1,53 @@
# Benchmark and Profiling

## Benchmark

- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`. Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this unit test does not. For accurate large-batch testing, consider using `sglang.bench_serving`.

```
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
```

- Benchmark online serving. Launch a server first, then run the following command.

```
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
```

## Profile with Nsight

0. Prerequisite

```bash
# install nsys
# https://docs.nvidia.com/nsight-systems/InstallationGuide/index.html
apt update
apt install -y --no-install-recommends gnupg
echo "deb http://developer.download.nvidia.com/devtools/repos/ubuntu$(source /etc/lsb-release; echo "$DISTRIB_RELEASE" | tr -d .)/$(dpkg --print-architecture) /" | tee /etc/apt/sources.list.d/nvidia-devtools.list
apt-key adv --fetch-keys http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
apt update
apt install nsight-systems-cli
```

1. To profile a single batch, use `nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3-8B --batch-size 64 --input-len 512`.

2. To profile a server, e.g.:

```bash
# server
# set the delay and duration times according to needs
nsys profile --trace-fork-before-exec=true --cuda-graph-trace=node -o sglang.out --delay 60 --duration 70 python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --disable-radix-cache

# client
python3 -m sglang.bench_serving --backend sglang --num-prompts 6000 --dataset-name random --random-input 4096 --random-output 2048
```

3. Use NVTX, e.g.:

```bash
# install nvtx
pip install nvtx
```

```python
import nvtx

with nvtx.annotate("description", color="blue"):
    pass  # some critical code
```

## Other tips

1. You can benchmark a model using dummy weights by only providing the `config.json` file. This allows quick testing of model variants without trained weights. To do so, add `--load-format dummy` to the above commands; then you only need a correct `config.json` under the checkpoint folder.
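For reference, an abridged `config.json` in the style of a Llama-3-8B model is shown below (values match that model family, but a real config contains additional fields):

```json
{
  "architectures": ["LlamaForCausalLM"],
  "model_type": "llama",
  "hidden_size": 4096,
  "intermediate_size": 14336,
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "vocab_size": 128256,
  "torch_dtype": "bfloat16"
}
```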
docs/references/choices_methods.md (new file, 77 lines)
@@ -0,0 +1,77 @@
# Choices Methods in SGLang

This doc describes the choices methods supported by SGLang.

The optional `choices_method` arg determines how options supplied to SGLang's `choices` primitive are selected. Only the `RuntimeEndpoint` backend supports the `choices_method` arg. Other backends, such as `OpenAI`, have bespoke selection implementations due to API limitations.

## Methods

### Token Length Normalized

Token length normalized is the default SGLang choices method. It selects the option with the highest average logprob across all of its tokens.

Usage example (alternatively, simply omit the `choices_method` arg):

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.token_length_normalized,
        )
    )
```

This can perform poorly if an option contains many tokens, where its later tokens are predicted with high confidence based on its earlier tokens. For instance, even strong models will fail the above example if the specified options are `["Paris", "Antidisestablishmentarianism"]`.

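To make the selection rule concrete, here is a toy sketch with made-up per-token logprobs (not real model output): average each option's token logprobs and take the argmax.

```python
def token_length_normalized(option_logprobs):
    """Select the option with the highest mean token logprob."""
    return max(
        option_logprobs,
        key=lambda opt: sum(option_logprobs[opt]) / len(option_logprobs[opt]),
    )

# Illustrative per-token logprobs for each option.
logprobs = {
    "London": [-4.1, -0.2],
    "Paris": [-0.3, -0.1],
    "Berlin": [-5.0, -0.4],
}
print(token_length_normalized(logprobs))  # Paris
```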
### Greedy Token Selection

Greedy token selection simply selects the option with the highest logprob for its initial token. For overlapping options where one option is a subset of a longer option, the logprobs of the shorter option are extended using its average logprob for comparison against the longer option.

Usage example:

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.greedy_token_selection,
        )
    )
```

This can perform poorly if an option misleads the model down a bad path based on an attractive initial token. For instance, greedy selection will result in an incorrect response for this example:

```python
@sgl.function
def us_president_example(s):
    s += sgl.user("Name a US president.")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["Donald Duck", "Millard Fillmore"],
            choices_method=sgl.greedy_token_selection,
        )
    )
```

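The failure mode can be seen with toy numbers (illustrative logprobs, not model output): an attractive first token wins even though the full option is implausible.

```python
def greedy_token_selection(option_logprobs):
    """Select the option whose first token has the highest logprob."""
    return max(option_logprobs, key=lambda opt: option_logprobs[opt][0])

# "Donald" is a very likely first token after "Name a US president.",
# even though "Donald Duck" as a whole is a bad answer.
logprobs = {
    "Donald Duck": [-0.2, -8.5],
    "Millard Fillmore": [-2.7, -0.1],
}
print(greedy_token_selection(logprobs))  # Donald Duck
```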
### Unconditional Likelihood Normalized

Unconditional likelihood normalized selects the option with the highest average token logprob once normalized by the unconditional token logprobs, as described in [this EleutherAI blogpost](https://blog.eleuther.ai/multiple-choice-normalization/). This method incurs an additional LLM call to obtain the unconditional likelihoods.

Usage example:

```python
@sgl.function
def example(s):
    s += sgl.user("What is the capital of France?")
    s += sgl.assistant(
        sgl.gen(
            "answer",
            choices=["London", "Paris", "Berlin"],
            choices_method=sgl.unconditional_likelihood_normalized,
        )
    )
```

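As a toy sketch of the idea, assume the normalization subtracts each token's unconditional logprob before averaging (see the linked post for the exact formulation used by SGLang; the numbers below are illustrative):

```python
def unconditional_likelihood_normalized(cond_logprobs, uncond_logprobs):
    """Score each option by its mean (conditional - unconditional) token logprob."""
    def score(opt):
        diffs = [c - u for c, u in zip(cond_logprobs[opt], uncond_logprobs[opt])]
        return sum(diffs) / len(diffs)
    return max(cond_logprobs, key=score)

# Conditional logprobs (given the question) and unconditional logprobs.
cond = {"Paris": [-0.3], "Antidisestablishmentarianism": [-2.0, -0.1, -0.1]}
uncond = {"Paris": [-6.0], "Antidisestablishmentarianism": [-9.0, -0.2, -0.2]}
print(unconditional_likelihood_normalized(cond, uncond))  # Paris
```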
docs/references/contributor_guide.md (new file, 14 lines)
@@ -0,0 +1,14 @@
# Contributor Guide

## Format Your Code

Use these commands to format your code and pass CI linting tests.

```
pip3 install pre-commit
cd sglang
pre-commit install
pre-commit run --all-files
```

## Add Unit Tests

Add unit tests under [sglang/test](https://github.com/sgl-project/sglang/tree/main/test). You can learn how to add and run tests from the README.md in that folder.
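A new test file under that folder typically follows the standard `unittest` pattern; a minimal skeleton (illustrative names, not from the actual suite) looks like:

```python
import unittest


class TestMyFeature(unittest.TestCase):
    def test_basic(self):
        # Replace with assertions about the feature under test.
        self.assertEqual(sorted([3, 1, 2]), [1, 2, 3])
```

Run it with `python3 -m unittest` from the containing folder.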
docs/references/custom_chat_template.md (new file, 31 lines)
@@ -0,0 +1,31 @@
# Custom Chat Template in SGLang Runtime

**NOTE**: There are two chat template systems in the SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined at [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined at [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).

By default, the server uses the chat template specified in the model tokenizer from Hugging Face.
It should just work for most official models such as Llama-2/Llama-3.

If needed, you can also override the chat template when launching the server:

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```

If the chat template you are looking for is missing, you are welcome to contribute it.
Meanwhile, you can also temporarily register your chat template as follows:

```json
{
  "name": "my_model",
  "system": "<|im_start|>system",
  "user": "<|im_start|>user",
  "assistant": "<|im_start|>assistant",
  "sep_style": "CHATML",
  "sep": "<|im_end|>",
  "stop_str": ["<|im_end|>", "<|im_start|>"]
}
```

Save it to a JSON file and pass the file path to the server:

```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
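To illustrate what the template fields control, here is a rough sketch of how a CHATML-style template expands into a prompt string (an approximation for illustration, not SGLang's actual `conversation.py` logic):

```python
# Fields taken from the ChatML-style template above.
template = {
    "system": "<|im_start|>system",
    "user": "<|im_start|>user",
    "assistant": "<|im_start|>assistant",
    "sep": "<|im_end|>",
}

def render(messages):
    """Expand (role, content) pairs into a ChatML-style prompt."""
    parts = [
        f"{template[role]}\n{content}{template['sep']}\n"
        for role, content in messages
    ]
    parts.append(template["assistant"] + "\n")  # cue the model to respond
    return "".join(parts)

print(render([("system", "You are a helpful assistant."), ("user", "Hello!")]))
```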
docs/references/hyperparameter_tuning.md (new file, 40 lines)
@@ -0,0 +1,40 @@
# Guide on Hyperparameter Tuning

## Achieving Peak Throughput

Achieving a large batch size is the most important thing for attaining high throughput.

When the server is running at full load, look for the following in the log:

```
Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317
```

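When watching a long run, it can help to script the check. A small helper (illustrative, not part of SGLang) that pulls the two key numbers out of such a log line:

```python
import re

def parse_decode_log(line):
    """Extract token usage and queue depth from a 'Decode batch.' log line."""
    usage = float(re.search(r"token usage: ([\d.]+)", line).group(1))
    queue = int(re.search(r"#queue-req: (\d+)", line).group(1))
    return usage, queue

line = ("Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, "
        "gen throughput (token/s): 4594.01, #queue-req: 317")
print(parse_decode_log(line))  # (0.82, 317)
```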
### Tune Your Request Submission Speed

`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
A healthy range for `#queue-req` is `50 - 500`.
On the other hand, do not make `#queue-req` too large, because it will also increase the scheduling overhead on the server, especially when using the default longest-prefix-match schedule policy (`--schedule-policy lpm`).

### Tune `--schedule-conservativeness`

`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
If you frequently see `token usage < 0.9` and `#queue-req > 0`, the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
The server being too conservative can happen when users send many requests with a large `max_new_tokens` but the requests stop very early due to EOS or stop strings.

On the other hand, if you see very high `token usage` and frequent warnings like
`decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `decode out of memory happened` occasionally but not frequently, it is okay.

### Tune `--dp-size` and `--tp-size`

Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput.

### Avoid out-of-memory by tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`

If you see out-of-memory (OOM) errors, you can decrease these parameters.
If OOM happens during prefill, try decreasing `--chunked-prefill-size` to `4096` or `2048`.
If OOM happens during decoding, try decreasing `--max-running-requests`.
You can also decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.

### Try Advanced Options

- To enable the experimental overlapped scheduler, add `--enable-overlap-scheduler`. It overlaps the CPU scheduler with GPU computation and can accelerate almost all workloads. It does not currently work for constrained decoding.
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. It does not currently work for FP8.

### Tune `--schedule-policy`

If the workload has many shared prefixes, use the default `--schedule-policy lpm` (`lpm` stands for longest prefix match).
When you have no shared prefixes at all, or you always send the requests with shared prefixes together,
you can try `--schedule-policy fcfs` (`fcfs` stands for first come, first served), which has lower scheduling overhead.
docs/references/learn_more.md (new file, 3 lines)
@@ -0,0 +1,3 @@
# Learn more

You can find more blogs, slides, and videos about SGLang at [https://github.com/sgl-project/sgl-learning-materials](https://github.com/sgl-project/sgl-learning-materials).
docs/references/model_support.md (new file, 35 lines)
@@ -0,0 +1,35 @@
# How to Support a New Model

To support a new model in SGLang, you only need to add a single file under the [SGLang Models Directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/models).
You can learn from existing model implementations and create a new file for your model.
For most models, you should be able to find a similar model to start with (e.g., starting from Llama).

## Test the correctness

### Interactive debugging

For interactive debugging, you can compare the outputs of huggingface/transformers and SGLang.
The following two commands should give the same text output and very similar prefill logits.

- Get the reference output with `python3 scripts/playground/reference_hf.py --model [new model]`
- Get the SGLang output with `python3 -m sglang.bench_latency --correct --model [new model]`

### Add the model to the test suite

To make sure the new model is well maintained in the future, it is better to add it to the test suite.
Add it to the `ALL_OTHER_MODELS` list in [test_generation_models.py](https://github.com/sgl-project/sglang/blob/main/test/srt/models/test_generation_models.py) and run the following command to test it.

For example, if the model is Qwen/Qwen2-1.5B:

```
ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
```

## Port a model from vLLM to SGLang

Another valuable resource is the [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models). vLLM has extensive coverage of models, and SGLang reuses vLLM's interface and some layers to implement the models. This similarity makes it easy to port many models from vLLM to SGLang.

To port a model from vLLM to SGLang, compare these two files: the [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and the [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of `Attention` with `RadixAttention`; the other parts are almost identical. Specifically,

- Replace vLLM's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way down to `RadixAttention`.
- Replace vLLM's `LogitsProcessor` with SGLang's `LogitsProcessor`.
- Replace other vLLM layers with SGLang layers (e.g., `RMSNorm`, `SiluAndMul`).
- Remove `Sample`.
- Change the `forward()` functions, and add `forward_batch`.
- Add `EntryClass` at the end.
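The checklist above can be pictured as a file skeleton. The sketch below uses local stub classes so the shape is visible; the real `RadixAttention` comes from SGLang's layers and its actual constructor signature differs:

```python
# Stub stand-in for SGLang's RadixAttention layer (illustrative only).
class RadixAttention:
    def __init__(self, num_heads, head_dim, layer_id):
        # layer_id must be threaded from the model down to each attention layer.
        self.layer_id = layer_id


class MyModelForCausalLM:
    def __init__(self, num_layers=2):
        self.layers = [
            RadixAttention(num_heads=32, head_dim=128, layer_id=i)
            for i in range(num_layers)
        ]

    def forward_batch(self, input_ids, positions):
        # Run the layers; sampling and logits handling live outside the model.
        pass


# Exported at the end of the file so SGLang can discover the model class.
EntryClass = MyModelForCausalLM
```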
docs/references/sampling_params.md (new file, 428 lines)
@@ -0,0 +1,428 @@
# Sampling Parameters in SGLang Runtime

This doc describes the sampling parameters of the SGLang Runtime.
It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://github.com/sgl-project/sglang?tab=readme-ov-file#openai-compatible-api).

The `/generate` endpoint accepts the following arguments in JSON format.

```python
@dataclass
class GenerateReqInput:
    # The input prompt. It can be a single prompt or a batch of prompts.
    text: Optional[Union[List[str], str]] = None
    # The token ids for text; one can specify either text or input_ids.
    input_ids: Optional[Union[List[List[int]], List[int]]] = None
    # The image input. It can be a file name, a URL, or a base64-encoded string.
    # See also python/sglang/srt/utils.py:load_image.
    image_data: Optional[Union[List[str], str]] = None
    # The sampling_params. See descriptions below.
    sampling_params: Union[List[Dict], Dict] = None
    # The request id.
    rid: Optional[Union[List[str], str]] = None
    # Whether to return logprobs.
    return_logprob: Optional[Union[List[bool], bool]] = None
    # The start location of the prompt for return_logprob.
    # By default, this value is "-1", which means it will only return logprobs for output tokens.
    logprob_start_len: Optional[Union[List[int], int]] = None
    # The number of top logprobs to return.
    top_logprobs_num: Optional[Union[List[int], int]] = None
    # Whether to detokenize tokens in text in the returned logprobs.
    return_text_in_logprobs: bool = False
    # Whether to stream output.
    stream: bool = False
```

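Putting the logprob-related fields together, a request that also asks for logprobs might look like the following (an illustrative payload built from the fields above; it still needs a running server to be sent):

```python
import json

# A /generate payload requesting logprobs for prompt and output tokens.
payload = {
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0, "max_new_tokens": 4},
    "return_logprob": True,
    "logprob_start_len": 0,  # 0 = include prompt tokens; default -1 = output tokens only
    "top_logprobs_num": 5,
    "return_text_in_logprobs": True,
}
print(json.dumps(payload, indent=2))
```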
The `sampling_params` follows this format:

```python
# The maximum number of output tokens
max_new_tokens: int = 128,
# Stop when hitting any of the strings in this list.
stop: Optional[Union[str, List[str]]] = None,
# Stop when hitting any of the token ids in this list. Can be useful when combined with
# `min_new_tokens`.
stop_token_ids: Optional[List[int]] = [],
# Sampling temperature
temperature: float = 1.0,
# Top-p sampling
top_p: float = 1.0,
# Top-k sampling
top_k: int = -1,
# Min-p sampling
min_p: float = 0.0,
# Whether to ignore the EOS token.
ignore_eos: bool = False,
# Whether to skip the special tokens during detokenization.
skip_special_tokens: bool = True,
# Whether to add spaces between special tokens during detokenization.
spaces_between_special_tokens: bool = True,
# Constrains the output to follow a given regular expression.
regex: Optional[str] = None,
# Do parallel sampling and return `n` outputs.
n: int = 1,
# Constrains the output to follow a given JSON schema.
# `regex` and `json_schema` cannot be set at the same time.
json_schema: Optional[str] = None,

## Penalties. See the [Performance Implications on Penalties] section below for more information.

# Float that penalizes new tokens based on their frequency in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to
# repeat tokens. Must be -2 <= value <= 2. Setting to 0 (default) disables this penalty.
frequency_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the generated text so far.
# Values > 0 encourage the model to use new tokens, while values < 0 encourage the model to repeat
# tokens. Must be -2 <= value <= 2. Setting to 0 (default) disables this penalty.
presence_penalty: float = 0.0,
# Float that penalizes new tokens based on whether they appear in the prompt and the generated text
# so far. Values > 1 encourage the model to use new tokens, while values < 1 encourage the model to
# repeat tokens. Must be 0 <= value <= 2. Setting to 1 (default) disables this penalty.
repetition_penalty: float = 1.0,
# Guides inference to generate at least this number of tokens by penalizing the logits of the
# tokenizer's EOS token and the `stop_token_ids` to -inf, until the output length reaches this value.
# Note that any of the `stop` strings can still be generated before reaching `min_new_tokens`, as it is
# difficult to infer the correct token ids from the given `stop` strings.
# Must be 0 <= value < max_new_tokens. Setting to 0 (default) disables this penalty.
min_new_tokens: int = 0,
```

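As an example of the constrained-decoding fields, the payload below restricts the output with `regex` (remember `regex` and `json_schema` are mutually exclusive). This only builds the JSON body; POST it to `/generate` as shown in the examples that follow. The prompt and pattern are illustrative:

```python
import json

# A /generate payload that constrains the output to a four-digit 18xx year.
payload = {
    "text": "The year the Eiffel Tower was completed is",
    "sampling_params": {
        "temperature": 0,
        "max_new_tokens": 8,
        "regex": r" 18\d{2}",
    },
}
print(json.dumps(payload, indent=2))
```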
## Examples

### Normal

Launch a server:

```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000
```

Send a request:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

### Streaming

Send a request and stream the output:

```python
import requests, json

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")
```

### Multi-modal

Launch a server:

```
python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --chat-template chatml-llava
```

Download an image:

```
curl -o example_image.png -L https://github.com/sgl-project/sglang/blob/main/test/lang/example_image.png?raw=true
```

Send a request:

```python
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
        "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
        "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())
```

The `image_data` can be a file name, a URL, or a base64-encoded string. See also `python/sglang/srt/utils.py:load_image`.
Streaming is supported in a similar manner as [above](#streaming).

## Performance Implications on Penalties

While you can apply penalties by supplying the relevant `sampling_params`, this comes with some drawbacks.

These drawbacks apply to every request in the same batch, because the penalizers operate on whole batches.

### Latency

Although the penalty algorithms are computed through CUDA, they are still additional computation on top of the basic sampling logic. For a detailed overhead analysis, we recommend running your own benchmarks, but the samples below give a glimpse.

### Memory

Since the penalty algorithms are computed through CUDA, the logic stores the relevant parameters on the GPU. This is usually on the order of `vocab_size` multiplied by the number of running requests.

Run your own benchmark with the desired parameters on your own hardware to make sure it does not run out of memory before deploying.

Tuning `--mem-fraction-static` and/or `--max-running-requests` will help.

### Benchmarks

All the benchmarks below were run on an NVIDIA H100 SXM5.

<details>

#### Baseline

Measured at [dc9d06d886151707f97d0b78095df9de262fd3c9](https://github.com/sgl-project/sglang/commit/dc9d06d886151707f97d0b78095df9de262fd3c9).

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  66.11
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775118
Request throughput (req/s):              45.38
Input token throughput (tok/s):          5727.04
Output token throughput (tok/s):         11732.16
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   40881.94
Median E2E Latency (ms):                 43967.10
---------------Time to First Token----------------
Mean TTFT (ms):                          19884.75
Median TTFT (ms):                        14226.56
P99 TTFT (ms):                           47738.97
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          91.96
Median TPOT (ms):                        90.11
P99 TPOT (ms):                           308.54
---------------Inter-token Latency----------------
Mean ITL (ms):                           174.54
Median ITL (ms):                         58.56
P99 ITL (ms):                            440.18
==================================================
```

#### All Together

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "frequency_penalty": 1.1,
    "presence_penalty": 1.1,
    "repetition_penalty": 0.1,
    "min_new_tokens": 5
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  78.35
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    774756
Request throughput (req/s):              38.29
Input token throughput (tok/s):          4832.86
Output token throughput (tok/s):         9900.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   49017.68
Median E2E Latency (ms):                 52825.70
---------------Time to First Token----------------
Mean TTFT (ms):                          23892.60
Median TTFT (ms):                        18895.47
P99 TTFT (ms):                           57426.01
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          114.54
Median TPOT (ms):                        107.27
P99 TPOT (ms):                           293.31
---------------Inter-token Latency----------------
Mean ITL (ms):                           205.68
Median ITL (ms):                         73.97
P99 ITL (ms):                            453.86
==================================================
```

#### Frequency Penalty

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "frequency_penalty": 1.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  72.72
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    774955
Request throughput (req/s):              41.26
Input token throughput (tok/s):          5206.84
Output token throughput (tok/s):         10666.51
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   45445.56
Median E2E Latency (ms):                 48960.39
---------------Time to First Token----------------
Mean TTFT (ms):                          22363.16
Median TTFT (ms):                        17125.02
P99 TTFT (ms):                           52920.95
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          104.71
Median TPOT (ms):                        98.30
P99 TPOT (ms):                           268.06
---------------Inter-token Latency----------------
Mean ITL (ms):                           191.60
Median ITL (ms):                         67.83
P99 ITL (ms):                            455.46
==================================================
```

#### Presence Penalty

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "presence_penalty": 1.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  72.04
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775210
Request throughput (req/s):              41.64
Input token throughput (tok/s):          5255.98
Output token throughput (tok/s):         10767.18
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   44926.61
Median E2E Latency (ms):                 48302.88
---------------Time to First Token----------------
Mean TTFT (ms):                          22095.39
Median TTFT (ms):                        16740.93
P99 TTFT (ms):                           52554.03
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          103.54
Median TPOT (ms):                        97.37
P99 TPOT (ms):                           271.86
---------------Inter-token Latency----------------
Mean ITL (ms):                           189.86
Median ITL (ms):                         68.45
P99 ITL (ms):                            447.11
==================================================
```

#### Repetition Penalty

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "repetition_penalty": 0.1
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  74.54
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    766008
Request throughput (req/s):              40.24
Input token throughput (tok/s):          5079.36
Output token throughput (tok/s):         10405.35
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   46530.38
Median E2E Latency (ms):                 50302.65
---------------Time to First Token----------------
Mean TTFT (ms):                          22603.47
Median TTFT (ms):                        17167.08
P99 TTFT (ms):                           54497.85
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          117.59
Median TPOT (ms):                        101.79
P99 TPOT (ms):                           320.04
---------------Inter-token Latency----------------
Mean ITL (ms):                           195.26
Median ITL (ms):                         69.51
P99 ITL (ms):                            433.86
==================================================
```

#### Min New Tokens

The min new tokens penalizer keeps running until the generation reaches the given `min_new_tokens`.

Unlike the other penalizers, setting this to a higher value has greater latency implications.

```
$ python3 -m sglang.bench_serving --backend sglang --port 8413 --dataset-name random --num-prompts 3000 --random-input 256 --random-output 512 --extra-request-body '{
    "min_new_tokens": 5
}'

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Successful requests:                     3000
Benchmark duration (s):                  66.94
Total input tokens:                      378633
Total generated tokens:                  775651
Total generated tokens (retokenized):    775220
Request throughput (req/s):              44.81
Input token throughput (tok/s):          5656.13
Output token throughput (tok/s):         11586.90
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   41888.55
Median E2E Latency (ms):                 45354.16
---------------Time to First Token----------------
Mean TTFT (ms):                          20866.91
Median TTFT (ms):                        16219.79
P99 TTFT (ms):                           49263.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          97.05
Median TPOT (ms):                        89.76
P99 TPOT (ms):                           233.50
---------------Inter-token Latency----------------
Mean ITL (ms):                           179.17
Median ITL (ms):                         55.08
P99 ITL (ms):                            409.12
==================================================
```

</details>
docs/references/troubleshooting.md (new file, 13 lines)
@@ -0,0 +1,13 @@
# Troubleshooting

This page lists some common errors and tips for fixing them.

## CUDA error: an illegal memory access was encountered

This error may be due to kernel errors or out-of-memory issues.
- If it is a kernel error, it is not easy to fix.
- If it is out-of-memory, sometimes this error is reported instead of "Out-of-memory". In this case, try setting a smaller value for `--mem-fraction-static` (the default value is around 0.8 - 0.9).

## The server hangs

If the server hangs, try disabling some optimizations when launching the server.
- Add `--disable-cuda-graph`.
- Add `--sampling-backend pytorch`.