fix incorrect links in documentation (#1481)
Co-authored-by: Yineng Zhang <me@zhyncs.com>
@@ -19,7 +19,7 @@ curl http://localhost:30000/generate \
   }
 }'
 ```
-Learn more about the argument format [here](docs/en/sampling_params.md).
+Learn more about the argument format `here <https://sglang.readthedocs.io/en/latest/sampling_params.html>`_.

 ### OpenAI Compatible API
 In addition, the server supports OpenAI-compatible APIs.
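The `/generate` endpoint referenced in this hunk takes a JSON body like the one in the curl example above it. A minimal stdlib-only sketch of building that body (field values are illustrative; the actual POST is commented out because it needs a server running on `localhost:30000`):

```python
import json

# Request body for SGLang's /generate endpoint, mirroring the curl example.
payload = {
    "text": "Once upon a time,",
    "sampling_params": {
        "max_new_tokens": 16,
        "temperature": 0,
    },
}
body = json.dumps(payload)

# To actually send it (requires a running server):
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:30000/generate",
#     data=body.encode(),
#     headers={"Content-Type": "application/json"},
# )
# print(urllib.request.urlopen(req).read())
print(body)
```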
@@ -73,7 +73,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --mem-fraction-static 0.7
 ```
-- See [hyperparameter_tuning.md](docs/en/hyperparameter_tuning.md) on tuning hyperparameters for better performance.
+- See `hyperparameter tuning <https://sglang.readthedocs.io/en/latest/hyperparameter_tuning.html>`_ on tuning hyperparameters for better performance.
 - If you see out-of-memory errors during prefill for long prompts, try to set a smaller chunked prefill size.
 ```
 python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --chunked-prefill-size 4096
@@ -81,7 +81,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models on small batch sizes.
 - To enable fp8 weight quantization, add `--quantization fp8` on a fp16 checkpoint or directly load a fp8 checkpoint without specifying any arguments.
 - To enable fp8 kv cache quantization, add `--kv-cache-dtype fp8_e5m2`.
-- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a [custom chat template](docs/en/custom_chat_template.md).
+- If the model does not have a chat template in the Hugging Face tokenizer, you can specify a `custom chat template <https://sglang.readthedocs.io/en/latest/custom_chat_template.html>`_.
 - To run tensor parallelism on multiple nodes, add `--nnodes 2`. If you have two nodes with two GPUs on each node and want to run TP=4, let `sgl-dev-0` be the hostname of the first node and `50000` be an available port.
 ```
 # Node 0
@@ -102,11 +102,11 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - [LLaVA-OneVision](https://llava-vl.github.io/blog/2024-08-05-llava-onevision/)
   - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov --port=30000 --chat-template=chatml-llava`
   - `python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-72b-ov --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
 - LLaVA 1.5 / 1.6 / NeXT
   - `python -m sglang.launch_server --model-path lmms-lab/llama3-llava-next-8b --port=30000 --tp-size=1 --chat-template=llava_llama_3`
   - `python -m sglang.launch_server --model-path lmms-lab/llava-next-72b --port=30000 --tp-size=8 --chat-template=chatml-llava`
-  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](test/srt/test_vision_openai_server.py)
+  - Query the server with the [OpenAI Vision API](https://platform.openai.com/docs/guides/vision). See examples at [test/srt/test_vision_openai_server.py](https://github.com/sgl-project/sglang/blob/main/test/srt/test_vision_openai_server.py)
 - Yi-VL
 - StableLM
 - Command-R
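Both changed lines in this hunk point readers at the OpenAI Vision API. A sketch of the chat-message structure such a query uses (the image URL is a placeholder; actually sending it requires the `openai` client pointed at the local server, shown in comments):

```python
# OpenAI Vision API message format: "content" is a list mixing text parts
# and image parts rather than a plain string.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in this image?"},
            {
                "type": "image_url",
                "image_url": {"url": "https://example.com/cat.png"},  # placeholder URL
            },
        ],
    }
]

# With a local SGLang server (not run here):
# client = openai.OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
# resp = client.chat.completions.create(model="default", messages=messages)

part_types = [part["type"] for part in messages[0]["content"]]
print(part_types)
```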
@@ -122,7 +122,7 @@ python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct
 - gte-Qwen2
   - `python -m sglang.launch_server --model-path Alibaba-NLP/gte-Qwen2-7B-instruct --is-embedding`

-Instructions for supporting a new model are [here](https://github.com/sgl-project/sglang/blob/main/docs/en/model_support.md).
+Instructions for supporting a new model are `here <https://sglang.readthedocs.io/en/latest/model_support.html>`_.

 #### Use Models From ModelScope
 <details>
@@ -70,7 +70,7 @@ print(state["answer_1"])
 #### More Examples

 Anthropic and VertexAI (Gemini) models are also supported.
-You can find more examples at [examples/quick_start](examples/frontend_language/quick_start).
+You can find more examples at [examples/quick_start](https://github.com/sgl-project/sglang/tree/main/examples/frontend_language/quick_start).

 ### Language Feature
 To begin with, import sglang.
@@ -83,7 +83,7 @@ You can implement your prompt flow in a function decorated by `sgl.function`.
 You can then invoke the function with `run` or `run_batch`.
 The system will manage the state, chat template, parallelism and batching for you.

-The complete code for the examples below can be found at [readme_examples.py](examples/frontend_language/usage/readme_examples.py)
+The complete code for the examples below can be found at [readme_examples.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/readme_examples.py)

 #### Control Flow
 You can use any Python code within the function body, including control flow, nested function calls, and external libraries.
@@ -132,7 +132,7 @@ def image_qa(s, image_file, question):
     s += sgl.assistant(sgl.gen("answer", max_tokens=256))
 ```

-See also [srt_example_llava.py](examples/frontend_language/quick_start/local_example_llava_next.py).
+See also [local_example_llava_next.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/quick_start/local_example_llava_next.py).

 #### Constrained Decoding
 Use `regex` to specify a regular expression as a decoding constraint.
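The constrained-decoding section this hunk touches passes a `regex=` pattern to `sgl.gen` so that generated text must match it. A pure-`re` sketch of how such a constraint behaves (this simplified pattern is illustrative, not the README's actual `character_regex`):

```python
import re

# Simplified constraint: force output of the form {"name": "...", "age": <digits>}
character_regex = r'\{"name": "[A-Za-z ]+", "age": [0-9]+\}'

good = '{"name": "Harry Potter", "age": 17}'
bad = '{"name": "Harry Potter", "age": "unknown"}'

# A constrained decoder only emits strings the pattern fully matches.
assert re.fullmatch(character_regex, good) is not None
assert re.fullmatch(character_regex, bad) is None
print("pattern accepts the well-formed string only")
```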
@@ -176,7 +176,7 @@ def character_gen(s, name):
     s += sgl.gen("json_output", max_tokens=256, regex=character_regex)
 ```

-See also [json_decode.py](examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.
+See also [json_decode.py](https://github.com/sgl-project/sglang/blob/main/examples/frontend_language/usage/json_decode.py) for an additional example of specifying formats with Pydantic models.

 #### Batching
 Use `run_batch` to run a batch of requests with continuous batching.
@@ -4,7 +4,7 @@ To support a new model in SGLang, you only need to add a single file under [SGLa

 Another valuable resource is the [vLLM Models Directory](https://github.com/vllm-project/vllm/tree/main/vllm/model_executor/models). vLLM has extensive coverage of models, and SGLang has reused vLLM for most parts of the model implementations. This similarity makes it easy to port many models from vLLM to SGLang.

-To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama2.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
+To port a model from vLLM to SGLang, you can compare these two files [SGLang LLaMA Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and [vLLM LLaMA Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of PagedAttention with RadixAttention. The other parts are almost identical. Specifically,
 - Replace vllm's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way to `RadixAttention`.
 - Replace vllm's `LogitsProcessor` with SGLang's `LogitsProcessor`.
 - Remove `Sample`.
@@ -13,4 +13,4 @@ To port a model from vLLM to SGLang, you can compare these two files [SGLang LLa
 - Test correctness by comparing the final logits and outputs of the two following commands:
   - `python3 scripts/playground/reference_hf.py --model [new model]`
   - `python3 -m sglang.bench_latency --model [new model] --correct --output-len 16 --trust-remote-code`
-- Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](../README.md).
+- Update [Supported Models](https://github.com/sgl-project/sglang/tree/main?tab=readme-ov-file#supported-models) at [README](https://github.com/sgl-project/sglang/blob/main/README.md).
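The correctness step in the last hunk compares the final logits printed by `reference_hf.py` and `sglang.bench_latency`. The comparison itself amounts to something like the sketch below (the logit values and the tolerance are hypothetical; fp16 kernel differences mean the two outputs rarely match exactly):

```python
# Compare two logit vectors by maximum absolute difference.
def max_abs_diff(a, b):
    assert len(a) == len(b)
    return max(abs(x - y) for x, y in zip(a, b))

ref_logits = [2.31, -0.57, 0.88, 1.02]  # hypothetical reference_hf.py output
sgl_logits = [2.30, -0.58, 0.88, 1.03]  # hypothetical bench_latency output

diff = max_abs_diff(ref_logits, sgl_logits)
print(f"max abs diff: {diff:.4f}")
assert diff < 0.05  # loose, assumed tolerance
```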