Fix for T4 GPUs (#16)

Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com>
This commit is contained in:
Ying Sheng
2024-01-16 15:49:03 -08:00
committed by GitHub
parent 5b27a1dce4
commit ffe4aaee1d
6 changed files with 68 additions and 6 deletions


@@ -32,6 +32,10 @@ pip install --upgrade pip
pip install -e "python[all]"
```
### Notes
- If you are using older GPUs (NVIDIA T4, V100), please use `pip install "triton>=2.2.0"` to avoid some bugs in the triton compiler
- If you only need to use the OpenAI backend, you can avoid installing other dependencies by using `pip install sglang[openai]`
## Quick Start
The example below shows how to use sglang to answer a multi-turn question.
@@ -197,7 +201,7 @@ for out in state.text_iter():
## Backend: SGLang Runtime (SRT)
The SGLang Runtime (SRT) is designed to work best with the SGLang frontend.
However, it can also be used as a standalone API server.
- In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases.
+ In this case, the [RadixAttention](https://arxiv.org/abs/2312.07104) can still greatly accelerate many use cases with automatic KV cache reuse.
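The core idea behind this reuse is prefix matching: requests that share a common token prefix can share the KV cache computed for that prefix. The toy sketch below illustrates the mechanism with a simple radix-style trie; it is not SGLang's implementation, and all names in it (`PrefixCache`, `match_prefix`, etc.) are hypothetical.

```python
# Toy illustration of prefix-based KV cache reuse (the idea behind
# RadixAttention). Hypothetical names; NOT the SGLang implementation.

class _Node:
    def __init__(self):
        self.children = {}   # token -> _Node
        self.cached = False  # stands in for a stored KV entry


class PrefixCache:
    def __init__(self):
        self.root = _Node()

    def insert(self, tokens):
        """Record that KV entries for this token sequence are cached."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, _Node())
            node.cached = True

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached KV entries."""
        node, matched = self.root, 0
        for t in tokens:
            child = node.children.get(t)
            if child is None or not child.cached:
                break
            node, matched = child, matched + 1
        return matched


cache = PrefixCache()
cache.insert([1, 2, 3, 4])                 # first request fills the cache
reused = cache.match_prefix([1, 2, 3, 9])  # second request shares a prefix
print(reused)                              # 3 leading tokens can be reused
```

With a shared system prompt or few-shot prefix, only the non-matching suffix of each new request needs fresh prefill computation.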
### Usage
Launch a server
@@ -221,6 +225,10 @@ curl http://localhost:30000/v1/completions \
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --tp 2
```
- If you see out-of-memory errors during serving, reduce the memory usage of the KV cache pool by setting a smaller value of `--mem-fraction-static`. The default value is `0.9`.
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --mem-fraction-static 0.7
```
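As a rough mental model (an assumption about the flag's semantics, not SGLang's exact memory accounting), `--mem-fraction-static` caps the share of GPU memory reserved for static allocations such as model weights and the KV cache pool; lowering it leaves more headroom for transient buffers during serving:

```python
# Back-of-envelope sketch with hypothetical numbers: lowering
# --mem-fraction-static leaves more unreserved GPU memory.

GPU_MEM_GB = 16.0  # e.g. an NVIDIA T4 (assumption)

def headroom_gb(mem_fraction_static, total=GPU_MEM_GB):
    """GPU memory NOT reserved for weights + KV cache pool."""
    return round(total * (1.0 - mem_fraction_static), 2)

print(headroom_gb(0.9))  # 1.6 GB of headroom
print(headroom_gb(0.7))  # 4.8 GB of headroom -> less likely to OOM
```

The trade-off is a smaller KV cache pool, which can reduce the number of requests served concurrently.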
### Supported Models
- Llama