Yineng Zhang 79794af52d docs: highlight ttft itl and throughput (#1337)

2024-09-06 00:00:06 +10:00

README.md

How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0

Installation

# install sglang v0.3.0
pip install --upgrade pip
pip install "sglang[all]"==0.3.0
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/

# install vllm v0.6.0
pip install vllm==0.6.0

Notes

We followed the reproduction method in https://github.com/vllm-project/vllm/issues/8176 and added the --num-scheduler-steps 10 parameter when starting the vLLM server. vLLM's gpu_memory_utilization defaults to 0.9 at both TP 1 and TP 4, whereas SGLang's mem_frac defaults to 0.88 at TP 1 and 0.85 at TP 4, so we manually set SGLang's mem_frac to 0.88 at TP 4.

Online benchmarks

# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096

# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096

# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 2400 --request-rate 8
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 2400 --request-rate 8
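
The four online runs above pair each request rate with 300 times as many prompts (1200 at 4 req/s, 2400 at 8 req/s), so each run submits requests for roughly 300 seconds. The following is a minimal sketch that regenerates those command lines. The helper name online_bench_commands and the 300-second window are assumptions inferred from the numbers above. Neither is stated in the original.

```python
import shlex

# Assumption: 300 s is inferred from 1200 / 4 and 2400 / 8; the source does not state it.
ARRIVAL_WINDOW_S = 300


def online_bench_commands(backends=("sglang", "vllm"), request_rates=(4, 8)):
    """Return one bench_serving command line per (backend, request rate) pair."""
    commands = []
    for backend in backends:
        for rate in request_rates:
            # Scale the prompt count with the rate so every run covers the same window.
            num_prompts = rate * ARRIVAL_WINDOW_S
            args = [
                "python3", "-m", "sglang.bench_serving",
                "--backend", backend,
                "--dataset-name", "sharegpt",
                "--num-prompts", str(num_prompts),
                "--request-rate", str(rate),
            ]
            commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in online_bench_commands():
        print(command)
```

Each printed command assumes the server for the matching backend is already running. The sketch only builds the strings and does not launch anything.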

Offline benchmarks

# Llama 3.1 8B Instruct on 1 x A100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile --disable-radix-cache
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-8B-Instruct --disable-log-requests --num-scheduler-steps 10 --max_model_len 4096

# Llama 3.1 70B Instruct on 4 x H100
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-70B-Instruct --disable-radix-cache --tp 4 --mem-frac 0.88
python -m vllm.entrypoints.openai.api_server --model meta-llama/Meta-Llama-3.1-70B-Instruct --disable-log-requests --num-scheduler-steps 10 --tensor 4 --max_model_len 4096

# bench serving
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000

Online benchmark results

Llama 3.1 8B Instruct 1 x A100 80G

RPS	Num Prompts	Engine	Median E2E Latency (ms)	Median TTFT (ms)	Median TPOT (ms)	Median ITL (ms)
4	1200	SGLang	1564.17	31.98	13.17	11.93
4	1200	vLLM	1691.97	100.48	14.14	129.32
8	2400	SGLang	2175.02	35.68	17.85	14.41
8	2400	vLLM	2137.16	120.39	17.09	158.63

Llama 3.1 70B Instruct 4 x H100 80G

RPS	Num Prompts	Engine	Median E2E Latency (ms)	Median TTFT (ms)	Median TPOT (ms)	Median ITL (ms)
4	1200	SGLang	3005.24	53.94	25.03	21.67
4	1200	vLLM	2915.60	179.15	23.58	231.23
8	2400	SGLang	4064.98	58.11	33.07	24.45
8	2400	vLLM	3752.38	207.12	29.15	275.32
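
To compare the two engines row by row, it can help to express each vLLM median as a multiple of the SGLang median at the same load. The sketch below does that with values copied from the two online tables above. The dictionary layout, the short model keys ("8B", "70B"), and the function name are illustrative assumptions, not part of the original benchmark scripts.

```python
# Median values copied from the online benchmark tables above.
# Key: (model, requests per second, engine) -> (E2E latency, TTFT, TPOT, ITL)
ONLINE_RESULTS = {
    ("8B", 4, "SGLang"): (1564.17, 31.98, 13.17, 11.93),
    ("8B", 4, "vLLM"): (1691.97, 100.48, 14.14, 129.32),
    ("8B", 8, "SGLang"): (2175.02, 35.68, 17.85, 14.41),
    ("8B", 8, "vLLM"): (2137.16, 120.39, 17.09, 158.63),
    ("70B", 4, "SGLang"): (3005.24, 53.94, 25.03, 21.67),
    ("70B", 4, "vLLM"): (2915.60, 179.15, 23.58, 231.23),
    ("70B", 8, "SGLang"): (4064.98, 58.11, 33.07, 24.45),
    ("70B", 8, "vLLM"): (3752.38, 207.12, 29.15, 275.32),
}
METRICS = ("e2e", "ttft", "tpot", "itl")


def vllm_to_sglang_ratios(model, rps):
    """Return vLLM median / SGLang median for each metric (above 1 means SGLang is lower)."""
    sglang = ONLINE_RESULTS[(model, rps, "SGLang")]
    vllm = ONLINE_RESULTS[(model, rps, "vLLM")]
    return {name: round(v / s, 2) for name, v, s in zip(METRICS, vllm, sglang)}


if __name__ == "__main__":
    for model in ("8B", "70B"):
        for rps in (4, 8):
            print(model, rps, vllm_to_sglang_ratios(model, rps))
```

On these numbers, vLLM's median TTFT is roughly 3.1 to 3.6 times SGLang's and its median ITL roughly 10.7 to 11.3 times, while median E2E latency and TPOT differ by about 12% at most, with vLLM slightly lower in three of the four configurations.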

Offline benchmark results

Llama 3.1 8B Instruct 1 x A100 80G

RPS	Num Prompts	Engine	Request throughput (req/s)	Output token throughput (tok/s)
inf	5000	SGLang	22.03	4281.51
inf	5000	vLLM	21.27	4132.37

Llama 3.1 70B Instruct 4 x H100 80G

RPS	Num Prompts	Engine	Request throughput (req/s)	Output token throughput (tok/s)
inf	5000	SGLang	19.84	3856.01
inf	5000	vLLM	19.04	3700.64
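
The offline numbers can be reduced to a relative throughput gain and checked for consistency: dividing output token throughput by request throughput gives the mean output tokens per request, which should be nearly identical across engines if both served the same workload. The sketch below uses values copied from the offline tables above. The function names and dictionary layout are illustrative assumptions.

```python
# Values copied from the offline benchmark tables above.
# Key: (model, engine) -> (request throughput, output token throughput)
OFFLINE_RESULTS = {
    ("8B", "SGLang"): (22.03, 4281.51),
    ("8B", "vLLM"): (21.27, 4132.37),
    ("70B", "SGLang"): (19.84, 3856.01),
    ("70B", "vLLM"): (19.04, 3700.64),
}


def sglang_gain_percent(model):
    """Return how much higher SGLang's throughput is than vLLM's, in percent.

    The result is a (request throughput gain, output token throughput gain) pair.
    """
    s_req, s_tok = OFFLINE_RESULTS[(model, "SGLang")]
    v_req, v_tok = OFFLINE_RESULTS[(model, "vLLM")]
    return 100 * (s_req / v_req - 1), 100 * (s_tok / v_tok - 1)


def tokens_per_request(model, engine):
    """Return mean output tokens per request implied by the two throughput columns."""
    req, tok = OFFLINE_RESULTS[(model, engine)]
    return tok / req


if __name__ == "__main__":
    for model in ("8B", "70B"):
        req_gain, tok_gain = sglang_gain_percent(model)
        print(f"{model}: +{req_gain:.2f}% requests, +{tok_gain:.2f}% output tokens")
        for engine in ("SGLang", "vLLM"):
            print(f"  {engine}: {tokens_per_request(model, engine):.1f} tokens per request")
```

On these numbers SGLang's throughput is about 3.6% higher for the 8B model and about 4.2% higher for the 70B model, and all four rows imply roughly 194 output tokens per request.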