Improve docs and developer guide (#9044)
8
.github/pull_request_template.md
vendored
@@ -18,7 +18,7 @@
## Checklist
- [ ] Format your code according to the [Code formatting with pre-commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit).
- [ ] Add unit tests according to the [Running and adding unit tests](https://docs.sglang.ai/references/contribution_guide.html#running-unit-tests-adding-to-ci).
- [ ] Update documentation according to [Writing documentations](https://docs.sglang.ai/references/contribution_guide.html#writing-documentation-running-docs-ci).
- [ ] Provide accuracy and speed benchmark results according to [Testing the accuracy](https://docs.sglang.ai/references/contribution_guide.html#testing-the-accuracy) and [Benchmark and profiling]()
- [ ] Format your code according to the [Format code with pre-commit](https://docs.sglang.ai/developer_guide/contribution_guide.html#format-code-with-pre-commit).
- [ ] Add unit tests according to the [Run and add unit tests](https://docs.sglang.ai/developer_guide/contribution_guide.html#run-and-add-unit-tests).
- [ ] Update documentation according to [Write documentations](https://docs.sglang.ai/developer_guide/contribution_guide.html#write-documentations).
- [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.ai/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.ai/developer_guide/contribution_guide.html#benchmark-the-speed).
@@ -14,7 +14,7 @@ You can find all arguments by `python3 -m sglang.launch_server --help`
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2
```
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../router/router.md) for data parallelism.
- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../advanced_features/router.md) for data parallelism.
```bash
python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2
@@ -1,11 +1,11 @@
# Sampling Parameters
This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions.ipynb).
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](openai_api_completions.ipynb).
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs.
| Argument | Type/Default | Description |
|----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------|
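As a minimal sketch of what a JSON request to `/generate` might look like, the snippet below builds (but does not send) a POST request using only the Python standard library. The field names `text`, `sampling_params`, `temperature`, and `max_new_tokens`, the helper name `build_generate_request`, and the base URL `http://localhost:30000` are assumptions for illustration; check them against `io_struct.py::GenerateReqInput` and your own server address before relying on them.

```python
import json
import urllib.request


def build_generate_request(prompt, temperature=0.0, max_new_tokens=32,
                           base_url="http://localhost:30000"):
    """Build (but do not send) a POST request for the /generate endpoint.

    Hypothetical helper: the payload field names and the default base_url
    are assumptions, not taken from the text above.
    """
    payload = {
        "text": prompt,
        "sampling_params": {
            "temperature": temperature,
            "max_new_tokens": max_new_tokens,
        },
    }
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("The capital of France is")
print(request.full_url)
print(request.data.decode("utf-8"))
```

To send the request against a running server, pass it to `urllib.request.urlopen(request)` and decode the JSON response body.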
@@ -135,7 +135,7 @@ for chunk in response.iter_lines(decode_unicode=False):
print("")
```
Detailed example in [openai compatible api](https://docs.sglang.ai/backend/openai_api_completions.html#id2).
Detailed example in [openai compatible api](openai_api_completions.ipynb).
### Multimodal
@@ -176,7 +176,7 @@ The `image_data` can be a file name, a URL, or a base64 encoded string. See also
Streaming is supported in a similar manner as [above](#streaming).
Detailed example in [openai api vision](./openai_api_vision.ipynb).
Detailed example in [OpenAI API Vision](openai_api_vision.ipynb).
### Structured Outputs (JSON, Regex, EBNF)
@@ -69,9 +69,10 @@ Another effective strategy is to review the file modification history and contac
If you modify files protected by code owners, their approval is required to merge the code.
## General Code Style
- Avoid code duplication. If the same code snippet (more than 5 lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, as much as possible. Use vectorized code instead.
- Keep files short. If a file exceeds 2,000 lines of code, please split it into multiple smaller files.
- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function.
- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code.
- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files.
- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize every minor overhead as much as possible.
## Tips for newcomers
@@ -87,7 +87,7 @@ The core features include:
references/faq.md
references/environment_variables.md
references/production_metrics.md
references/multi_node_deployment/multi_node_index.rst
references/custom_chat_template.md
references/frontend/frontend_index.rst
references/multi_node_deployment/multi_node_index.rst
references/learn_more.md
@@ -66,7 +66,7 @@ This enables TorchAO's int4 weight-only quantization with a 128-group size. The
* * * * *
Structured output with XGrammar
-------------------------------
Please refer to [SGLang doc structured output](../backend/structured_outputs.ipynb).
Please refer to [SGLang doc structured output](../advanced_features/structured_outputs.ipynb).
* * * * *
Thanks to the support from [shahizat](https://github.com/shahizat).
@@ -25,9 +25,9 @@ in the GitHub search bar.
| Model Family (Variants) | Example HuggingFace Identifier | Description |
|-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------|
| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides Deepseek v3/R1 model-specific optimizations](https://docs.sglang.ai/references/deepseek) and [Reasoning Parser](https://docs.sglang.ai/backend/separate_reasoning)|
| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](https://docs.sglang.ai/backend/separate_reasoning)|
| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](https://docs.sglang.ai/references/llama4) |
| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides Deepseek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)|
| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](../advanced_features/separate_reasoning.ipynb)|
| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) |
| **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. |
| **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. |
| **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model and Phi-3.5-MoE is a mixture-of-experts model. |
@@ -27,7 +27,7 @@ standard LLM support:
3. **Multimodal Data Processor**:
Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your
model’s dedicated processor.
See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py)
See [multimodal_processor.py](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors)
for more details.
4. **Handle Multimodal Tokens**:
@@ -18,7 +18,7 @@ python3 -m sglang.launch_server \
### Quantization
Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang.
The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang.
### Remote code