From e71d4ab3f941e8ecec461480b582e50170a6842e Mon Sep 17 00:00:00 2001
From: Lianmin Zheng
Date: Tue, 16 Jan 2024 04:18:54 -0800
Subject: [PATCH] Update docs (#12)

---
 README.md                 |  8 ++++----
 docs/benchmark_results.md | 23 +++++++++++++++++++++++
 2 files changed, 27 insertions(+), 4 deletions(-)
 create mode 100644 docs/benchmark_results.md

diff --git a/README.md b/README.md
index 2c7306a8c..e6f518d5c 100644
--- a/README.md
+++ b/README.md
@@ -153,10 +153,10 @@ def image_qa(s, image_file, question):
 ### Constrained Decoding
 
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
@@ -197,7 +197,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases.
 
 ### Usage
 Launch a server
@@ -237,7 +237,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 ![mixtral_8x7b](assets/mixtral_8x7b.jpg)
 
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 
 ## Roadmap
 - [ ] Function call

diff --git a/docs/benchmark_results.md b/docs/benchmark_results.md
new file mode 100644
index 000000000..425982113
--- /dev/null
+++ b/docs/benchmark_results.md
@@ -0,0 +1,23 @@
+## Benchmark Results
+
+We tested our system on the following common LLM workloads and reported the achieved throughput:
+- **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark.
+- **[HellaSwag](https://arxiv.org/abs/1905.07830)**: A 20-shot, multi-choice sentence completion benchmark.
+- **[ReAct Agent](https://arxiv.org/abs/2210.03629)**: An agent task using prompt traces collected from the original ReAct paper.
+- **[Tree-of-Thought](https://arxiv.org/pdf/2305.10601.pdf)**: A custom tree search-based prompt for solving GSM-8K problems.
+- **JSON Decode**: Extracting information from a Wikipedia page and outputting it in JSON format.
+- **Chat (short)**: A synthetic chat benchmark where each conversation includes 4 turns with short LLM outputs.
+- **Chat (long)**: A synthetic chat benchmark where each conversation includes 4 turns with long LLM outputs.
+- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline from the DSPy tutorial.
+- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision-language model, on the LLaVA-in-the-wild benchmark.
+
+We tested Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, both in FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
+
+
+- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
+![llama_7b](../assets/llama_7b.jpg)
+
+- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
+![mixtral_8x7b](../assets/mixtral_8x7b.jpg)
+
+The benchmark code is available [here](https://github.com/sgl-project/sglang/tree/main/benchmark).
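As an aside, the constrained-decoding example in the README diff above works by forcing the model's output to match an IPv4 regular expression. The pattern's behavior can be sanity-checked outside SGLang with plain Python `re`; this is an illustrative sketch (assuming the octet-matching pattern with an escaped literal dot, and anchoring it so the whole string must match):

```python
import re

# IPv4 dotted-quad pattern mirroring the README's constrained-decoding
# example: each octet is 0-255, anchored to cover the entire string.
IP_PATTERN = re.compile(
    r"^((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)$"
)

def is_ip(text: str) -> bool:
    """Return True if `text` is a well-formed IPv4 address."""
    return IP_PATTERN.match(text) is not None

if __name__ == "__main__":
    print(is_ip("8.8.8.8"))    # True: a Google DNS server
    print(is_ip("256.1.1.1"))  # False: octet out of range
    print(is_ip("8.8.8"))      # False: only three octets
```

When such a pattern is passed as the `regex` argument to `sgl.gen`, the runtime only samples tokens whose text keeps the partial output inside the language of the regex, so the final answer is guaranteed to be a syntactically valid IPv4 address.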