docs: update doc (#716)

Ying Sheng
2024-07-24 13:38:06 -07:00
parent 321a963b01
commit 08a3bd19cc
5 changed files with 83 additions and 3 deletions


@@ -8,7 +8,7 @@
 git clone https://github.com/sgl-project/sglang.git
 cd sglang
-pip install --upgrade pip
+pip install --upgrade pip setuptools wheel
 pip install -e "python[all]"
 pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
@@ -91,7 +91,7 @@ cat sglang_online_benchmark.jsonl | cut -d':' -f9 | cut -d',' -f1
 ## Other
-We tried using vLLM 0.5.3.post1, but it often crashes under high loads, so we are using the older version, vLLM 0.5.2.
+We tried using vLLM 0.5.3.post1, but it often crashes under high load and, in our partial benchmarks, performs similarly to or worse than vLLM 0.5.2, so we are using the older version, vLLM 0.5.2.
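The hunk above sits next to a `cut` pipeline that slices a raw field out of `sglang_online_benchmark.jsonl` by delimiter position, which is fragile if the output format shifts. If the file is line-delimited JSON, a small Python sketch can pull the same value by key instead of by column; note that `median_latency` here is a placeholder key, not the benchmark script's actual schema.

```python
import json

def extract_metric(path, key):
    """Collect one metric per line from a .jsonl results file.

    `key` is a hypothetical field name; the real key depends on the
    benchmark script's output schema.
    """
    values = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:  # skip blank lines
                continue
            record = json.loads(line)
            if key in record:
                values.append(record[key])
    return values
```

Unlike the `cut` pipeline, this keeps working if fields are reordered or values contain the delimiter characters.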
For TensorRT-LLM preparation, refer to https://github.com/sgl-project/tensorrt-demo. Specifically, we used a batch size of 512, a max input length of 8192, and a max number of tokens of 8192. The instance count for preprocessing and postprocessing in Triton Server is 16.
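The settings above could be expressed roughly as follows. This is a hedged sketch assuming the stock `trtllm-build` CLI; the flag names and paths may differ from what sgl-project/tensorrt-demo actually uses.

```shell
# Sketch only: flag names follow the standard trtllm-build CLI and may
# not match the exact commands in sgl-project/tensorrt-demo.
trtllm-build --checkpoint_dir ./ckpt \
             --max_batch_size 512 \
             --max_input_len 8192 \
             --max_num_tokens 8192 \
             --output_dir ./engine
```

The Triton instance count would correspond to `instance_group [ { count: 16 } ]` in the `config.pbtxt` of each preprocessing and postprocessing model.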