Update version to v0.1.13 (#280)

Lianmin Zheng
2024-03-11 05:49:27 -07:00
committed by GitHub
parent 13662fd533
commit 4aa5dd2c5f
11 changed files with 35 additions and 21 deletions


@@ -11,8 +11,7 @@ We tested our system on the following common LLM workloads and reported the achi
- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline in the DSPy tutorial.
- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision language model on the LLaVA-in-the-wild benchmark.
- We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vllm v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
+ We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vllm v0.2.5, guidance v0.1.8, Hugging Face TGI v1.3.0, and SGLang v0.1.5.
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
![llama_7b](../assets/llama_7b.jpg)


@@ -5,14 +5,7 @@ It can be used in SGLang runtime to accelerate attention computation.
### Install flashinfer
For CUDA 12.1, you can install flashinfer via pip as follows.
```bash
pip install flashinfer -i https://flashinfer.ai/whl/cu121/
```
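The wheel index URL encodes the CUDA version (`cu121` for CUDA 12.1). As a hypothetical illustration (the helper function below is ours, not part of flashinfer), the index URL for a given CUDA version can be built like this:

```python
# Hypothetical helper (not part of flashinfer): build the wheel index URL
# for a given CUDA version string, e.g. "12.1" -> ".../whl/cu121/".
def flashinfer_index_url(cuda_version: str) -> str:
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://flashinfer.ai/whl/{tag}/"

print(flashinfer_index_url("12.1"))  # https://flashinfer.ai/whl/cu121/
```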
You can find wheels for other CUDA versions at https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation. If there is no prebuilt wheel for your environment,
please build it from source (the compilation takes a long time).
See https://docs.flashinfer.ai/installation.html.
### Run a Server With Flashinfer Mode


@@ -37,6 +37,23 @@ python3 bench_sglang.py --nsub 3
# Average accuracy: 0.413
```
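The reported number is an average over the selected MMLU subjects. A minimal sketch of one common convention, micro-averaging over all questions (the function and the subject-to-scores data layout here are assumptions for illustration, not the actual code in `bench_sglang.py`):

```python
# Micro-average accuracy across subjects: total correct / total questions.
# The layout (subject -> (num_correct, num_questions)) is an assumption
# for illustration only.
def average_accuracy(per_subject):
    correct = sum(c for c, n in per_subject.values())
    total = sum(n for c, n in per_subject.values())
    return correct / total

scores = {"anatomy": (6, 10), "astronomy": (7, 10), "algebra": (4, 10)}
print(round(average_accuracy(scores), 3))  # 0.567
```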
#### GSM-8K
```
cd benchmark/gsm8k
```
Follow README.md to download the data.
```
python3 bench_sglang.py --num-q 200
# Expected performance on A10G
# Latency: 32.103
# Accuracy: 0.250
```
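GSM-8K accuracy is typically computed by comparing the final number in the model's output against the reference answer, which in the dataset ends with `#### <number>`. A sketch of this common extraction pattern (not necessarily the exact logic used by `bench_sglang.py`):

```python
import re

def extract_answer(text: str) -> str:
    # GSM-8K reference answers end with "#### <number>";
    # strip commas so "1,000" compares equal to "1000".
    m = re.search(r"####\s*(-?[\d,]+)", text)
    return m.group(1).replace(",", "") if m else ""

print(extract_answer("She pays 5 * 2 = 10 dollars.\n#### 10"))  # 10
```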
#### More
Please also test `benchmark/hellaswag` and `benchmark/latency_throughput`.
### More Models
#### LLaVA
@@ -48,6 +65,9 @@ python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenize
```
cd benchmark/llava_bench
python3 bench_sglang.py
# Expected performance on A10G
# Latency: 50.031
```
## SGLang Unit Tests