docs: update doc (#716)

Ying Sheng
2024-07-24 13:38:06 -07:00
parent 321a963b01
commit 08a3bd19cc
5 changed files with 83 additions and 3 deletions


@@ -8,7 +8,7 @@
 git clone https://github.com/sgl-project/sglang.git
 cd sglang
-pip install --upgrade pip
+pip install --upgrade pip setuptools wheel
 pip install -e "python[all]"
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
@@ -91,7 +91,7 @@ cat sglang_online_benchmark.jsonl | cut -d':' -f9 | cut -d',' -f1
 ## Other
-We tried using vLLM 0.5.3.post1, but it often crashes under high loads, so we are using the older version, vLLM 0.5.2.
+We tried using vLLM 0.5.3.post1, but it often crashes under high load and, in our partial benchmarks, performs similarly to or worse than vLLM 0.5.2, so we are using the older version, vLLM 0.5.2.
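The hunk above sits next to a `cut` pipeline that slices a raw field out of `sglang_online_benchmark.jsonl` by delimiter position, which is fragile if the output format shifts. If the file is line-delimited JSON, a small Python sketch can pull the same value by key instead of by column; note that `median_latency` here is a placeholder key, not the benchmark script's actual schema.

```python
import json

def extract_metric(path, key):
    """Collect one metric per line from a .jsonl results file.

    `key` is a hypothetical field name; the real key depends on the
    benchmark script's output schema.
    """
    values = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            if key in record:
                values.append(record[key])
    return values
```

Unlike the `cut` pipeline, this keeps working if fields are reordered or values contain the delimiter characters.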
For TensorRT-LLM preparation, refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16.
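The settings above could be expressed roughly as follows. This is a hedged sketch assuming the stock `trtllm-build` CLI; the flag names and paths may differ from what sgl-project/tensorrt-demo actually uses.

```shell
# Sketch only: flag names follow the standard trtllm-build CLI and may
# not match the exact commands in sgl-project/tensorrt-demo.
trtllm-build --checkpoint_dir ./ckpt \
             --max_batch_size 512 \
             --max_input_len 8192 \
             --max_num_tokens 8192 \
             --output_dir ./engine
```

The Triton instance count would correspond to `instance_group [ { count: 16 } ]` in the `config.pbtxt` of each preprocessing and postprocessing model.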