Update version to v0.1.13 (#280)
@@ -11,8 +11,7 @@ We tested our system on the following common LLM workloads and reported the achi
- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline from the DSPy tutorial.
- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision-language model, on the LLaVA-in-the-wild benchmark.
We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vLLM v0.2.5, guidance v0.1.8, Hugging Face TGI v1.3.0, and SGLang v0.1.5.
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
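Throughput figures like the ones reported for these configurations are typically derived as generated tokens divided by wall-clock time. A minimal sketch of that calculation (generic, not the benchmark scripts' exact code):

```python
def throughput(total_tokens: int, elapsed_s: float) -> float:
    """Tokens generated per second over a benchmark run."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed_s

# e.g. 4096 tokens generated in 8 seconds -> 512.0 tokens/s
print(throughput(4096, 8.0))
```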

@@ -5,14 +5,7 @@ It can be used in SGLang runtime to accelerate attention computation.
### Install flashinfer
You can install flashinfer via pip as follows for CUDA 12.1.
```bash
pip install flashinfer -i https://flashinfer.ai/whl/cu121/
```
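The `cu121` component of the index URL encodes the CUDA version. A small helper (hypothetical, for illustration only) that builds the matching wheel index URL from a CUDA version string:

```python
def flashinfer_index_url(cuda_version: str) -> str:
    """Map a CUDA version like '12.1' to the flashinfer wheel index URL."""
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://flashinfer.ai/whl/{tag}/"

print(flashinfer_index_url("12.1"))  # https://flashinfer.ai/whl/cu121/
```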
You can find wheels for other CUDA versions at https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation. If there is no desired version for your environment, please build it from source (the compilation takes a long time). See https://docs.flashinfer.ai/installation.html.
### Run a Server With Flashinfer Mode
@@ -37,6 +37,23 @@ python3 bench_sglang.py --nsub 3
# Average accuracy: 0.413
```
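`--nsub 3` limits the run to 3 MMLU subjects, and the reported number is the mean of the per-subject accuracies. A sketch of that aggregation (assumed, not the script's exact code; the subject scores below are illustrative):

```python
def average_accuracy(per_subject: dict[str, float]) -> float:
    """Mean accuracy across MMLU subjects."""
    return sum(per_subject.values()) / len(per_subject)

# Illustrative per-subject scores, not real measurements
scores = {"abstract_algebra": 0.30, "anatomy": 0.45, "astronomy": 0.49}
print(round(average_accuracy(scores), 3))  # 0.413
```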
#### GSM-8K
```bash
cd benchmark/gsm8k
```
Follow README.md to download the data.
```bash
python3 bench_sglang.py --num-q 200
# Expected performance on A10G
# Latency: 32.103
# Accuracy: 0.250
```
#### More
Please also test `benchmark/hellaswag` and `benchmark/latency_throughput`.
### More Models
#### LLaVA
@@ -48,6 +65,9 @@ python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenize
```bash
cd benchmark/llava_bench
python3 bench_sglang.py
# Expected performance on A10G
# Latency: 50.031
```
## SGLang Unit Tests