Update docs (#1768)

Co-authored-by: Chayenne Zhao <zhaochenyang20@gmail.com>
Co-authored-by: Chayenne <zhaochen20@outlook.com>
Lianmin Zheng
2024-10-23 11:28:48 -07:00
committed by GitHub
parent 80a905475d
commit b7d0559496
5 changed files with 27 additions and 25 deletions


@@ -79,7 +79,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
```
-- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
+- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes. This does not work for FP8 currently.
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on an fp16 checkpoint, or directly load an fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
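
Several of these flags can be combined in a single launch command. A minimal sketch using only the flags documented above; the model path and values are illustrative, not recommendations:
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --quantization fp8 --kv-cache-dtype fp8_e5m2 --chunked-prefill-size 4096
```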
@@ -100,7 +100,6 @@ We also provide an inference engine **without an HTTP server**. For example,
```python
import sglang as sgl
def main():
    prompts = [
        "Hello, my name is",
@@ -120,12 +119,8 @@ if __name__ == "__main__":
    main()
```
-This can be used for:
-1. **Offline Batch Inference**
-2. **Building Custom Servers**
-You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine)
+This can be used for offline batch inference and building custom servers.
+You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
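
The diff shows only fragments of this snippet, so the full program is worth sketching. A minimal sketch of offline batch inference with the engine, patterned on the linked examples directory; `sgl.Engine`, its `model_path` argument, the sampling-parameter dict, and the `output["text"]` field are assumptions drawn from that example rather than verified API:
```python
# Sketch: offline batch inference without an HTTP server.
# Model path and sampling parameters are illustrative.
import sglang as sgl

def main():
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95}

    # Create the engine directly in-process (assumed constructor, per the linked examples).
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")

    # Generate completions for the whole batch at once.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(f"Prompt: {prompt}\nGenerated text: {output['text']}")

if __name__ == "__main__":
    main()
```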
### Supported Models