Update docs (#12)
@@ -153,10 +153,10 @@ def image_qa(s, image_file, question):
 
 ### Constrained Decoding
 ```python
-@function
+@sgl.function
 def regular_expression_gen(s):
     s += "Q: What is the IP address of the Google DNS servers?\n"
-    s += "A: " + gen(
+    s += "A: " + sgl.gen(
         "answer",
         temperature=0,
         regex=r"((25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(25[0-5]|2[0-4]\d|[01]?\d\d?)",
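
For context, a minimal sketch of invoking the program above, assuming a running SRT server on port 30000 and the sglang frontend calls `set_default_backend`, `RuntimeEndpoint`, and `.run()`; the printed value is illustrative:

```python
import sglang as sgl

# Assumes an SRT server is already running locally (see "Usage" below).
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

# Execute the regex-constrained program; decoding is restricted so that
# the "answer" variable must match the IPv4 pattern in the `regex` argument.
state = regular_expression_gen.run()
print(state["answer"])  # e.g. "8.8.8.8"
```

With `temperature=0` and the IPv4 regex, the model cannot emit anything that fails to match the pattern, which is the point of constrained decoding.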
@@ -197,7 +197,7 @@ for out in state.text_iter():
 ## Backend: SGLang Runtime (SRT)
 The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
 However, it can also be used as a standalone API server.
-In this case, the RadixAttention can still greatly accelerate many use cases.
+In this case, [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases.
 
 ### Usage
 Launch a server
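
As a sketch of the standalone-server usage described above, a direct HTTP request against a launched server (the `/generate` endpoint and payload shape are assumptions based on the SRT API; the prompt and parameters are illustrative):

```python
import requests

# Assumes a server launched with `python -m sglang.launch_server ...`
# and listening on port 30000.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json())
```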
@@ -237,7 +237,7 @@ python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port
 - Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
 
 
-Learn more [here]().
+Learn more [here](docs/benchmark_results.md).
 
 ## Roadmap
 - [ ] Function call

docs/benchmark_results.md (new file, 23 lines)
@@ -0,0 +1,23 @@
+## Benchmark Results
+
+We tested our system on the following common LLM workloads and reported the achieved throughput:
+- **[MMLU](https://arxiv.org/abs/2009.03300)**: A 5-shot, multi-choice, multi-task benchmark.
+- **[HellaSwag](https://arxiv.org/abs/1905.07830)**: A 20-shot, multi-choice sentence-completion benchmark.
+- **[ReAct Agent](https://arxiv.org/abs/2210.03629)**: An agent task using prompt traces collected from the original ReAct paper.
+- **[Tree-of-Thought](https://arxiv.org/pdf/2305.10601.pdf)**: A custom tree-search-based prompt for solving GSM-8K problems.
+- **JSON Decode**: Extracting information from a Wikipedia page and outputting it in JSON format.
+- **Chat (short)**: A synthetic chat benchmark where each conversation includes 4 turns with short LLM outputs.
+- **Chat (long)**: A synthetic chat benchmark where each conversation includes 4 turns with long LLM outputs.
+- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline from the DSPy tutorial.
+- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision-language model, on the LLaVA-in-the-wild benchmark.
+
+We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
+
+
+- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
+
+
+- Mixtral-8x7B on NVIDIA A10G, FP16, Tensor Parallelism=8
+
+
+The benchmark code is available [here](https://github.com/sgl-project/sglang/tree/main/benchmark).