From ce636ac441f8085ab5a118b26c52005a78fcb2bf Mon Sep 17 00:00:00 2001
From: Ran Chen
Date: Sat, 21 Sep 2024 05:36:23 -0700
Subject: [PATCH] fix incorrect links in documentation (#1481)

Co-authored-by: Yineng Zhang
---
 docs/en/backend.md       | 12 ++++++------
 docs/en/frontend.md      |  8 ++++----
 docs/en/model_support.md |  4 ++--
 3 files changed, 12 insertions(+), 12 deletions(-)

diff --git a/docs/en/backend.md b/docs/en/backend.md
index 9e4fc7c26..e0974e04e 100644
--- a/docs/en/backend.md
+++ b/docs/en/backend.md
@@ -19,7 +19,7 @@ curl http://localhost:30000/generate \
   }
 }'
 ```
-Learn more about the argument format [here](docs/en/sampling_params.md).
+Learn more about the argument format `here <https://sglang.readthedocs.io/en/latest/sampling_params.html>`_.
 
 ### OpenAI Compatible API
 In addition, the server supports OpenAI-compatible APIs.
@@ -73,7 +73,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
 ```
-- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- See `hyperparameter tuning <https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html>`_ on tuning hyperparameters for better performance.
 - If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -81,7 +81,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a `custom chat template <https://sglang.readthedocs.io/en/latest/custom_chat_template.html>`_.
 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0
@@ -102,11 +102,11 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
   - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
   - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
 - LLaVA 1.5 / 1.6 / NeXT
   - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
   - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
 - Yi-VL
 - StableLM
 - Command-R
@@ -122,7 +122,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - gte-Qwen2
   - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`
 
-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
+Instructions for supporting a new model are `here <https://sglang.readthedocs.io/en/latest/model_support.html>`_.
 
 #### Use Models From ModelScope
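The backend.md hunks above point readers at the OpenAI Vision API for querying the multimodal servers. As a rough illustration of the message shape that API expects, here is a sketch of building one such request message; the helper name is ours (not part of SGLang), and a real client would send the message to the server's OpenAI-compatible chat completions endpoint:

```python
def build_vision_message(image_url: str, question: str) -> dict:
    """Build one user message in the OpenAI Vision API chat format.

    Illustrative helper, not part of SGLang: the field names follow
    the OpenAI Vision API (content is a list of typed parts).
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # The image is passed as a nested image_url object.
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }


msg = build_vision_message("https://example.com/cat.png", "What is in this image?")
print(msg["role"], len(msg["content"]))
```

The same payload shape is what the linked `test_vision_openai_server.py` tests exercise against a locally launched server.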
diff --git a/docs/en/frontend.md b/docs/en/frontend.md
index 4f18939b3..a90c29032 100644
--- a/docs/en/frontend.md
+++ b/docs/en/frontend.md
@@ -70,7 +70,7 @@ print(state["answer_1"])
 
 #### More Examples
 Anthropic and VertexAI (Gemini) models are also supported.
-You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).
+You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).
 
 ### Language Feature
 To begin with, import sglang.
@@ -83,7 +83,7 @@ You can implement your prompt flow in a function decorated by `sgl.function`.
 You can then invoke the function with `run` or `run_batch`.
 The system will manage the state, chat template, parallelism and batching for you.
 
-The complete code for the examples below can be found at [readme_examples.py](examples/frontend_language/usage/readme_examples.py)
+The complete code for the examples below can be found at [readme_examples.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/readme_examples.py)
 
 #### Control Flow
 You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
@@ -132,7 +132,7 @@ def image_qa(s, image_file, question):
     s += sgl.assistant(sgl.gen("answer", max_tokens=256))
 ```
 
-See also [srt_example_llava.py](examples/frontend_language/quick_start/local_example_llava_next.py).
+See also [local_example_llava_next.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/quick_start/local_example_llava_next.py).
 
 #### Constrained Decoding
 Use `regex` to specify a regular expression as a decoding constraint.
@@ -176,7 +176,7 @@ def character_gen(s, name):
     s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
 ```
 
-See also [json_decode.py](examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
+See also [json_decode.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
 
 #### Batching
 Use `run_batch` to run a batch of requests with continuous batching.
diff --git a/docs/en/model_support.md b/docs/en/model_support.md
index 1d720acf5..45a17b037 100644
--- a/docs/en/model_support.md
+++ b/docs/en/model_support.md
@@ -4,7 +4,7 @@ To support a new model in SGLang, you only need to add a single file under [SGLa
 Another valuable resource is the [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models). vLLM has extensive coverage of models, and SGLang has reused vLLM for most parts of the model implementations. This similarity makes it easy to port many models from vLLM to SGLang.
 
-To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
+To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
 
 - Replace vllm's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way to `RadixAttention`.
 - Replace vllm's `LogitsProcessor` with SGLang's `LogitsProcessor`.
 - Remove `Sample`.
@@ -13,4 +13,4 @@ To port a model from vLLM to SGLa
 - Test correctness by comparing the final logits and outputs of the two following commands:
   - `python3 scripts/playground/reference_hf.py --model [new model]`
   - `python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code`
-  - Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](../README.md).
+  - Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](https://github.com/sgl-project/sglang/blob/main/README.md).
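The model_support.md checklist quoted above hinges on threading `layer_id` from the model's constructor down into each layer's attention module. A minimal schematic of that plumbing, using stand-in stub classes (these are NOT SGLang's real classes, and `RadixAttention`'s actual constructor takes more arguments):

```python
# Stub classes sketching the layer_id plumbing described in the
# porting checklist above. Hypothetical names for illustration only.
class RadixAttentionStub:
    def __init__(self, num_heads: int, head_dim: int, layer_id: int):
        # layer_id identifies which layer's slice of the shared
        # KV cache this attention op reads and writes.
        self.num_heads = num_heads
        self.head_dim = head_dim
        self.layer_id = layer_id


class DecoderLayerStub:
    def __init__(self, layer_id: int):
        # Each decoder layer forwards its own index down to attention.
        self.self_attn = RadixAttentionStub(
            num_heads=32, head_dim=128, layer_id=layer_id
        )


class ModelStub:
    def __init__(self, num_layers: int):
        # The model assigns every layer its index at construction time,
        # so the id reaches the attention module without any globals.
        self.layers = [DecoderLayerStub(layer_id=i) for i in range(num_layers)]


model = ModelStub(num_layers=4)
print([layer.self_attn.layer_id for layer in model.layers])  # → [0, 1, 2, 3]
```

The point of the sketch is only the wiring: if a vLLM model being ported creates its attention module without a layer index, the port must add that parameter at every level between the model and the attention call.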