Update docs (#12)

This commit is contained in:
Lianmin Zheng
2024-01-16 04:18:54 -08:00
parent fbf42263f1
commit e71d4ab3f9
2 changed files with 27 additions and 4 deletions


@@ -153,10 +153,10 @@ def image_qa(s, image_file, question):
 ### Constrained Decoding
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
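The IP-address regex used as the decoding constraint above can be sanity-checked on its own with Python's `re` module, independent of SGLang. This is an editor's standalone check, not part of the diff; note that the unescaped `.` in the pattern matches any separator character, which a stricter pattern would write as `\.`:

```python
import re

# The constrained-decoding regex from the snippet above.
IP_RE = r"((25[0-5]|2[0-4]\d|[01]?\d\d?).){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)"

# fullmatch anchors the pattern to the whole string, mirroring how a
# regex constraint restricts the entire generated answer.
assert re.fullmatch(IP_RE, "8.8.8.8")
assert re.fullmatch(IP_RE, "255.255.255.255")
assert re.fullmatch(IP_RE, "256.1.1.1") is None  # 256 is out of range
print("regex checks passed")
```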
@@ -197,7 +197,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases.
 ### Usage
 Launch a server
@@ -237,7 +237,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 ![mixtral_8x7b](assets/mixtral_8x7b.jpg)
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 ## Roadmap
 - [ ] Function call
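Using the runtime as a standalone API server, as described above, amounts to sending plain HTTP requests to the launched server. The sketch below is hedged: the `/generate` endpoint path, the port, and the `text`/`sampling_params` payload fields are assumptions about the server's API, not taken from this diff.

```python
import json
from urllib import request

# Assumed request shape for a standalone SRT server; the "/generate"
# endpoint and the payload fields are assumptions, not from this diff.
def build_generate_request(prompt, url="http://localhost:30000/generate"):
    payload = {
        "text": prompt,
        "sampling_params": {"temperature": 0, "max_new_tokens": 64},
    }
    data = json.dumps(payload).encode("utf-8")
    return request.Request(url, data=data,
                           headers={"Content-Type": "application/json"})

req = build_generate_request("Q: What is the capital of France?\nA:")
print(req.full_url)  # the request is built but not sent here
```

Sending it with `urllib.request.urlopen(req)` would return the server's JSON response once a server is actually running.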

docs/benchmark_results.md Normal file

@@ -0,0 +1,23 @@
## Benchmark Results
We tested our system on the following common LLM workloads and reported the achieved throughput:
- **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multiple-choice, multi-task benchmark.
- **[HellaSwag](https://arxiv.org/abs/1905.07830)**: A 20-shot, multiple-choice sentence completion benchmark.
- **[ReAct Agent](https://arxiv.org/abs/2210.03629)**: An agent task using prompt traces collected from the original ReAct paper.
- **[Tree-of-Thought](https://arxiv.org/pdf/2305.10601.pdf)**: A custom tree search-based prompt for solving GSM-8K problems.
- **JSON Decode**: Extracting information from a Wikipedia page and outputting it in JSON format.
- **Chat (short)**: A synthetic chat benchmark where each conversation includes 4 turns with short LLM outputs.
- **Chat (long)**: A synthetic chat benchmark where each conversation includes 4 turns with long LLM outputs.
- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline in the DSPy tutorial.
- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision language model, on the LLaVA-in-the-wild benchmark.
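The throughput reported for these workloads is output tokens generated per second of wall-clock time. As a minimal illustration of the metric (the numbers below are made up for the example, not measured results):

```python
# Hypothetical numbers for illustration only; not from the benchmark.
total_output_tokens = 12_000   # tokens generated across all requests
wall_clock_seconds = 60.0      # elapsed time for the whole run

throughput = total_output_tokens / wall_clock_seconds  # tokens per second
print(throughput)  # 200.0
```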
We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
![llama_7b](../assets/llama_7b.jpg)
- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
![mixtral_8x7b](../assets/mixtral_8x7b.jpg)
The benchmark code is available [here](https://github.com/sgl-project/sglang/tree/main/benchmark).