[Doc] improve relative links and structure (#1924)

Lianmin Zheng
2024-11-05 01:12:10 -08:00
committed by GitHub
parent 02755768d3
commit f5113e50ae
6 changed files with 60 additions and 69 deletions


@@ -20,7 +20,7 @@ curl http://localhost:30000/generate \
}'
```
Learn more about the argument specification, streaming, and multi-modal support [here](https://sgl-project.github.io/references/sampling_params.html).
Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).
## OpenAI Compatible API
In addition, the server supports OpenAI-compatible APIs.
@@ -74,7 +74,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
```
- See [hyperparameter tuning](https://sgl-project.github.io/references/hyperparameter_tuning.html) on tuning hyperparameters for better performance.
- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
- If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -84,7 +84,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
- To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
- To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
- To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sgl-project.github.io/references/custom_chat_template.html).
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).
- To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
```
@@ -124,46 +124,7 @@ if __name__ == "__main__":
This can be used for offline batch inference and building custom servers.
You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).
## Supported Models
**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
- DeepSeek / DeepSeek 2
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
- `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
- `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
- `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
- `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
- Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
**Embedding Models**
- e5-mistral
- gte-Qwen2
- `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
Instructions for supporting a new model are [here](https://sgl-project.github.io/references/model_support.html).
### Use Models From ModelScope
## Use Models From ModelScope
<details>
<summary>More</summary>
@@ -189,7 +150,7 @@ docker run --gpus all \
</details>
### Run Llama 3.1 405B
## Example: Run Llama 3.1 405B
<details>
<summary>More</summary>
@@ -206,16 +167,3 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
```
</details>
## Benchmark Performance
- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
```
python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
- Benchmark online serving. Launch a server first and run the following command.
```
python3 -m sglang.bench_serving --backend sglang --num-prompt 10
```


@@ -65,7 +65,7 @@
"metadata": {},
"source": [
"## Generate (text generation model)\n",
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.html)."
"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.md)."
]
},
{
@@ -286,7 +286,7 @@
"\n",
"response = requests.post(url, json=data)\n",
"print_highlight(response.text)\n",
"assert response.json()[\"success\"] == True\n",
"assert response.json()[\"success\"] is True\n",
"assert response.json()[\"message\"] == \"Succeeded to update model weights.\"\n",
"assert response.json().keys() == {\"success\", \"message\"}"
]
@@ -312,7 +312,7 @@
"response = requests.post(url, json=data)\n",
"response_json = response.json()\n",
"print_highlight(response_json)\n",
"assert response_json[\"success\"] == False\n",
"assert response_json[\"success\"] is False\n",
"assert response_json[\"message\"] == (\n",
" \"Failed to update weights: The size of tensor a (2048) must match \"\n",
" \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n",

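Note on the `== True` / `== False` to `is True` / `is False` change in the two hunks above: a minimal sketch of why the identity check is the stricter assertion. The variable names and the payload here are illustrative assumptions, not taken from the notebook.

```python
# A truthy non-bool value compares equal to True but is not the True singleton.
truthy_but_not_bool = 1

equal_check = (truthy_but_not_bool == True)  # noqa: E712 (equality accepts 1)
identity_check = (truthy_but_not_bool is True)  # identity rejects 1

# Hypothetical decoded JSON payload (assumption for illustration only).
response_json = {"success": True, "message": "ok"}

# JSON `true` decodes to the bool True, so the stricter check still passes,
# while a server that returned 1 instead of true would now fail the assert.
strict_ok = response_json["success"] is True

print(equal_check, identity_check, strict_ok)
```

With `is`, the assertion only passes when the field is an actual boolean, which appears to be the intent of the change.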

@@ -27,7 +27,7 @@
"source": [
"## Offline Batch Inference\n",
"\n",
"SGLang offline engine supports batch inference with efficient scheduling to prevent OOM errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104)."
"SGLang offline engine supports batch inference with efficient scheduling."
]
},
{