[Doc] improve relative links and structure (#1924)

2024-11-05 01:12:10 -08:00
parent 02755768d3
commit f5113e50ae
6 changed files with 60 additions and 69 deletions
--- a/docs/backend/backend.md
+++ b/docs/backend/backend.md
@@ -20,7 +20,7 @@ curl http://localhost:30000/generate \
  }'
 ```

-Learn more about the argument specification, streaming, and multi-modal support [here](https://sgl-project.github.io/references/sampling_params.html).
+Learn more about the argument specification, streaming, and multi-modal support [here](../references/sampling_params.md).

 ## OpenAI Compatible API
 In addition, the server supports OpenAI-compatible APIs.
@@ -74,7 +74,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
 ```
- See [hyperparameter tuning](https://sgl-project.github.io/references/hyperparameter_tuning.html) on tuning hyperparameters for better performance.
+- See [hyperparameter tuning](../references/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
 - If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -84,7 +84,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torchao quantization, add `--torchao-config int4wo-128`. It supports various quantization strategies.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](https://sgl-project.github.io/references/custom_chat_template.html).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](../references/custom_chat_template.md).

 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port, you can use the following commands. If you meet deadlock, please try to add `--disable-cuda-graph`
 ```
@@ -124,46 +124,7 @@ if __name__ == "__main__":
 This can be used for offline batch inference and building custom servers.
 You can view the full example [here](https://github.com/sgl-project/sglang/tree/main/examples/runtime/engine).

-## Supported Models
-
-**Generative Models**
- Llama / Llama 2 / Llama 3 / Llama 3.1 / Llama 3.2
- Mistral / Mixtral / Mistral NeMo
- Gemma / Gemma 2
- Qwen / Qwen 2 / Qwen 2 MoE / Qwen 2 VL
- DeepSeek / DeepSeek 2
- OLMoE
- [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
-  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
-  - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- LLaVA 1.5 / 1.6 / NeXT
-  - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
-  - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
- Yi-VL
- StableLM
- Command-R
- DBRX
- Grok
- ChatGLM
- InternLM 2
- Exaone 3
- BaiChuan2
- MiniCPM / MiniCPM 3
- XVERSE / XVERSE MoE
- SmolLM
- GLM-4
-
-**Embedding Models**
-
- e5-mistral
- gte-Qwen2
-  - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
-
-Instructions for supporting a new model are [here](https://sgl-project.github.io/references/model_support.html).
-
-### Use Models From ModelScope
+## Use Models From ModelScope
 <details>
 <summary>More</summary>

@@ -189,7 +150,7 @@ docker run --gpus all \
  
 </details>

-### Run Llama 3.1 405B
+## Example: Run Llama 3.1 405B
 <details>
 <summary>More</summary>

@@ -206,16 +167,3 @@ GLOO_SOCKET_IFNAME=eth0 python3 -m sglang.launch_server --model-path meta-llama/
 ```

 </details>
-
-## Benchmark Performance
-
- Benchmark a single static batch by running the following command without launching a server. The arguments are the same as for `launch_server.py`.
-  Note that this is not a dynamic batching server, so it may run out of memory for a batch size that a real server can handle.
-  A real server truncates the prefill into several batches, while this unit test does not. For accurate large batch testing, please use `sglang.bench_serving` instead.
-  ```
-  python -m sglang.bench_latency --model-path meta-llama/Meta-Llama-3-8B-Instruct --batch 32 --input-len 256 --output-len 32
-  ```
- Benchmark online serving. Launch a server first and run the following command.
-  ```
-  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
-  ```
--- a/docs/backend/native_api.ipynb
+++ b/docs/backend/native_api.ipynb
@@ -65,7 +65,7 @@
   "metadata": {},
   "source": [
    "## Generate (text generation model)\n",
-    "Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.html)."
+    "Generate completions. This is similar to the `/v1/completions` in OpenAI API. Detailed parameters can be found in the [sampling parameters](../references/sampling_params.md)."
   ]
  },
  {
@@ -286,7 +286,7 @@
    "\n",
    "response = requests.post(url, json=data)\n",
    "print_highlight(response.text)\n",
-    "assert response.json()[\"success\"] == True\n",
+    "assert response.json()[\"success\"] is True\n",
    "assert response.json()[\"message\"] == \"Succeeded to update model weights.\"\n",
    "assert response.json().keys() == {\"success\", \"message\"}"
   ]
@@ -312,7 +312,7 @@
    "response = requests.post(url, json=data)\n",
    "response_json = response.json()\n",
    "print_highlight(response_json)\n",
-    "assert response_json[\"success\"] == False\n",
+    "assert response_json[\"success\"] is False\n",
    "assert response_json[\"message\"] == (\n",
    "    \"Failed to update weights: The size of tensor a (2048) must match \"\n",
    "    \"the size of tensor b (3072) at non-singleton dimension 1.\\n\"\n",
--- a/docs/backend/offline_engine_api.ipynb
+++ b/docs/backend/offline_engine_api.ipynb
@@ -27,7 +27,7 @@
   "source": [
    "## Offline Batch Inference\n",
    "\n",
-    "SGLang offline engine supports batch inference with efficient scheduling to prevent OOM errors for large batches. For details on this cache-aware scheduling algorithm, see our [paper](https://arxiv.org/pdf/2312.07104)."
+    "SGLang offline engine supports batch inference with efficient scheduling."
   ]
  },
  {