Update version to v0.1.13 (#280)

Lianmin Zheng
2024-03-11 05:49:27 -07:00
committed by GitHub
parent 13662fd533
commit 4aa5dd2c5f
11 changed files with 35 additions and 21 deletions


@@ -11,8 +11,7 @@ We tested our system on the following common LLM workloads and reported the achi
- **[DSPy RAG](https://github.com/stanfordnlp/dspy)**: A retrieval-augmented generation pipeline in the DSPy tutorial.
- **[LLaVA Bench](https://github.com/haotian-liu/LLaVA)**: Running LLaVA v1.5, a vision language model on the LLaVA-in-the-wild benchmark.
- We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vllm v0.2.5, guidance v0.1.8, and Hugging Face TGI v1.3.0 as baseline systems.
+ We tested both Llama-7B on one NVIDIA A10G GPU (24GB) and Mixtral-8x7B on 8 NVIDIA A10G GPUs with tensor parallelism, using FP16 precision. We used vllm v0.2.5, guidance v0.1.8, Hugging Face TGI v1.3.0, and SGLang v0.1.5.
- Llama-7B on NVIDIA A10G, FP16, Tensor Parallelism=1
![llama_7b](../assets/llama_7b.jpg)


@@ -5,14 +5,7 @@ It can be used in SGLang runtime to accelerate attention computation.
### Install flashinfer
For CUDA 12.1, you can install flashinfer via pip as follows.
```bash
pip install flashinfer -i https://flashinfer.ai/whl/cu121/
```
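The wheel index URL encodes the CUDA version (`cu121` for CUDA 12.1). As a hypothetical illustration (the helper function below is ours, not part of flashinfer), the index URL for a given CUDA version can be built like this:

```python
# Hypothetical helper (not part of flashinfer): build the wheel index URL
# for a given CUDA version string, e.g. "12.1" -> ".../whl/cu121/".
def flashinfer_index_url(cuda_version: str) -> str:
    tag = "cu" + cuda_version.replace(".", "")
    return f"https://flashinfer.ai/whl/{tag}/"

print(flashinfer_index_url("12.1"))  # https://flashinfer.ai/whl/cu121/
```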
You can find wheels for other CUDA versions at https://github.com/flashinfer-ai/flashinfer?tab=readme-ov-file#installation. If there is no prebuilt wheel for your environment,
please build it from source (the compilation takes a long time).
See https://docs.flashinfer.ai/installation.html.
### Run a Server With Flashinfer Mode


@@ -37,6 +37,23 @@ python3 bench_sglang.py --nsub 3
# Average accuracy: 0.413
```
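The reported number is an average over the selected MMLU subjects. A minimal sketch of one common convention, micro-averaging over all questions (the function and the subject-to-scores data layout here are assumptions for illustration, not the actual code in `bench_sglang.py`):

```python
# Micro-average accuracy across subjects: total correct / total questions.
# The layout (subject -> (num_correct, num_questions)) is an assumption
# for illustration only.
def average_accuracy(per_subject):
    correct = sum(c for c, n in per_subject.values())
    total = sum(n for c, n in per_subject.values())
    return correct / total

scores = {"anatomy": (6, 10), "astronomy": (7, 10), "algebra": (4, 10)}
print(round(average_accuracy(scores), 3))  # 0.567
```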
#### GSM-8K
```
cd benchmark/gsm8k
```
Follow README.md to download the data.
```
python3 bench_sglang.py --num-q 200
# Expected performance on A10G
# Latency: 32.103
# Accuracy: 0.250
```
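GSM-8K accuracy is typically computed by comparing the final number in the model's output against the reference answer, which in the dataset ends with `#### <number>`. A sketch of this common extraction pattern (not necessarily the exact logic used by `bench_sglang.py`):

```python
import re

def extract_answer(text: str) -> str:
    # GSM-8K reference answers end with "#### <number>";
    # strip commas so "1,000" compares equal to "1000".
    m = re.search(r"####\s*(-?[\d,]+)", text)
    return m.group(1).replace(",", "") if m else ""

print(extract_answer("She pays 5 * 2 = 10 dollars.\n#### 10"))  # 10
```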
#### More
Please also test `benchmark/hellaswag` and `benchmark/latency_throughput`.
### More Models
#### LLaVA
@@ -48,6 +65,9 @@ python3 -m sglang.launch_server --model-path liuhaotian/llava-v1.5-7b --tokenize
```
cd benchmark/llava_bench
python3 bench_sglang.py
# Expected performance on A10G
# Latency: 50.031
```
## SGLang Unit Tests