[Doc] improve relative links and structure (#1924)
@@ -20,7 +20,7 @@ curl http://localhost:30000/generate \
   }'
 ```

-Learn more about the argument specification, streaming, and multi-modal support [here](https://sgl-project.github.io/references/sampling_params.html).
+Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).

 ## OpenAI Compatible API

 In addition, the server supports OpenAI-compatible APIs.
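For reference, calling the OpenAI-compatible API mentioned above looks roughly like the sketch below. It assumes a server already launched on `localhost:30000` serving a chat model such as `meta-llama/Meta-Llama-3-8B-Instruct`; the `api_key` value is only a placeholder.

```
import openai

# Point the standard OpenAI client at the local SGLang server (placeholder port/model).
client = openai.Client(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "List 3 countries and their capitals."}],
    temperature=0,
    max_tokens=64,
)
print(response.choices[0].message.content)
```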
@@ -74,7 +74,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
 ```
-- See [hyperparameter tuning](https://sgl-project.github.io/references/hyperparameter_tuning.html) on tuning hyperparameters for better performance.
+- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 - If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -84,7 +84,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sgl-project.github.io/references/custom_chat_template.html).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).

 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
 ```
@@ -124,46 +124,7 @@ if __name__ == "__main__":
 This can be used for offline batch inference and building custom servers.
 You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).

-## Supported Models
-
-**Generative Models**
-- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
-- Mistral / Mixtral / Mistral NeMo
-- Gemma / Gemma 2
-- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
-- DeepSeek / DeepSeek 2
-- OLMoE
-- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
-  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
-  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
-- LLaVA 1.5 / 1.6 / NeXT
-  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
-  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
-- Yi-VL
-- StableLM
-- Command-R
-- DBRX
-- Grok
-- ChatGLM
-- InternLM 2
-- Exaone 3
-- BaiChuan2
-- MiniCPM / MiniCPM 3
-- XVERSE / XVERSE MoE
-- SmolLM
-- GLM-4
-
-**Embedding Models**
-
-- e5-mistral
-- gte-Qwen2
-  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
-
-Instructions for supporting a new model are [here](https://sgl-project.github.io/references/model_support.html).
-
-### Use Models From ModelScope
+## Use Models From ModelScope
 <details>
 <summary>More</summary>

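The offline engine referenced in this hunk can be used roughly as in the sketch below, modeled on the linked runtime examples; the model path and sampling values are placeholders.

```
# Sketch of offline batch inference with the SGLang engine (placeholder model and sampling values).
import sglang as sgl

def main():
    llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3-8B-Instruct")
    prompts = [
        "Hello, my name is",
        "The capital of France is",
    ]
    sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 32}

    # Generate completions for the whole batch in one call.
    outputs = llm.generate(prompts, sampling_params)
    for prompt, output in zip(prompts, outputs):
        print(prompt, "->", output["text"])

    llm.shutdown()

if __name__ == "__main__":
    main()
```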
@@ -189,7 +150,7 @@ docker run --gpus all \

 </details>

-### Run Llama 3.1 405B
+## Example: Run Llama 3.1 405B
 <details>
 <summary>More</summary>

@@ -206,16 +167,3 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
 ```

 </details>
-
-## Benchmark Performance
-
-- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
-  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
-  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
-  ```
-  python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
-  ```
-- Benchmark online serving. Launch a server first and run the following command.
-  ```
-  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
-  ```
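Besides `sglang.bench_latency` and `sglang.bench_serving`, a quick hand-rolled smoke test of online serving can be approximated as below. This is only a sketch against an assumed local server on port 30000 and is not a substitute for `sglang.bench_serving`.

```
# Rough concurrency smoke test against a running server (assumed at localhost:30000); not a real benchmark.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://localhost:30000/generate"

def one_request(prompt):
    payload = {"text": prompt, "sampling_params": {"max_new_tokens": 32, "temperature": 0}}
    return requests.post(URL, json=payload).json()["text"]

prompts = [f"Question {i}: what is {i} + {i}?" for i in range(10)]
start = time.time()
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(one_request, prompts))
print(f"{len(results)} requests finished in {time.time() - start:.2f} s")
```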
@@ -65,7 +65,7 @@
 "metadata": {},
 "source": [
 "## Generate (text generation model)\n",
-"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.html)."
+"Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.md)."
 ]
 },
 {
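The `/generate` endpoint described in this cell can be exercised with a plain HTTP request, for example as follows (assuming a local server on port 30000; parameter values are placeholders):

```
# Minimal call to the native /generate endpoint (placeholder port and sampling values).
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Once upon a time,",
        "sampling_params": {"max_new_tokens": 16, "temperature": 0},
    },
)
print(response.json()["text"])
```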
@@ -286,7 +286,7 @@
 "\n",
 "response = requests.post(url, json=data)\n",
 "print_highlight(response.text)\n",
-"assert response.json()[\"success\"] == True\n",
+"assert response.json()[\"success\"] is True\n",
 "assert response.json()[\"message\"] == \"Succeeded to update model weights.\"\n",
 "assert response.json().keys() == {\"success\", \"message\"}"
 ]
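The assertions in this cell check the JSON returned by the weight-update endpoint. A sketch of the surrounding request might look like the following, where the endpoint path `/update_weights` and the `model_path` payload key are assumptions, since the actual `url` and `data` are defined outside this hunk:

```
# Hypothetical reconstruction of the request around these asserts; endpoint path and payload key are assumptions.
import requests

url = "http://localhost:30000/update_weights"
data = {"model_path": "meta-llama/Meta-Llama-3-8B-Instruct"}

response = requests.post(url, json=data)
print(response.text)
assert response.json()["success"] is True
assert response.json()["message"] == "Succeeded to update model weights."
assert response.json().keys() == {"success", "message"}
```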
@@ -312,7 +312,7 @@
 "response = requests.post(url, json=data)\n",
 "response_json = response.json()\n",
 "print_highlight(response_json)\n",
-"assert response_json[\"success\"] == False\n",
+"assert response_json[\"success\"] is False\n",
 "assert response_json[\"message\"] == (\n",
 " \"Failed to update weights: The size of tensor a (2048) must match \"\n",
 " \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n",
@@ -27,7 +27,7 @@
 "source": [
 "## Offline Batch Inference\n",
 "\n",
-"SGLang offline engine supports batch inference with efficient scheduling to prevent OOM errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104)."
+"SGLang offline engine supports batch inference with efficient scheduling."
 ]
 },
 {