diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md index b5ae2cb4f..ab51d4bf5 100644 --- a/.github/pull_request_template.md +++ b/.github/pull_request_template.md @@ -18,7 +18,7 @@ ## Checklist -- [ ] Format your code according to the [Code formatting with pre-commit](https://docs.sglang.ai/references/contribution_guide.html#code-formatting-with-pre-commit). -- [ ] Add unit tests according to the [Running and adding unit tests](https://docs.sglang.ai/references/contribution_guide.html#running-unit-tests-adding-to-ci). -- [ ] Update documentation according to [Writing documentations](https://docs.sglang.ai/references/contribution_guide.html#writing-documentation-running-docs-ci). -- [ ] Provide accuracy and speed benchmark results according to [Testing the accuracy](https://docs.sglang.ai/references/contribution_guide.html#testing-the-accuracy) and [Benchmark and profiling]() +- [ ] Format your code according to the [Format code with pre-commit](https://docs.sglang.ai/developer_guide/contribution_guide.html#format-code-with-pre-commit). +- [ ] Add unit tests according to the [Run and add unit tests](https://docs.sglang.ai/developer_guide/contribution_guide.html#run-and-add-unit-tests). +- [ ] Update documentation according to [Write documentations](https://docs.sglang.ai/developer_guide/contribution_guide.html#write-documentations). +- [ ] Provide accuracy and speed benchmark results according to [Test the accuracy](https://docs.sglang.ai/developer_guide/contribution_guide.html#test-the-accuracy) and [Benchmark the speed](https://docs.sglang.ai/developer_guide/contribution_guide.html#benchmark-the-speed). 
diff --git a/docs/advanced_features/server_arguments.md b/docs/advanced_features/server_arguments.md index 3bb8a3233..3ce9ad469 100644 --- a/docs/advanced_features/server_arguments.md +++ b/docs/advanced_features/server_arguments.md @@ -14,7 +14,7 @@ You can find all arguments by `python3 -m sglang.launch_server --help` python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --tp 2 ``` -- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../router/router.md) for data parallelism. +- To enable multi-GPU data parallelism, add `--dp 2`. Data parallelism is better for throughput if there is enough memory. It can also be used together with tensor parallelism. The following command uses 4 GPUs in total. We recommend [SGLang Router](../advanced_features/router.md) for data parallelism. ```bash python -m sglang_router.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --dp 2 --tp 2 diff --git a/docs/basic_usage/sampling_params.md b/docs/basic_usage/sampling_params.md index 7ecb1f444..c1394a9fd 100644 --- a/docs/basic_usage/sampling_params.md +++ b/docs/basic_usage/sampling_params.md @@ -1,11 +1,11 @@ # Sampling Parameters This doc describes the sampling parameters of the SGLang Runtime. It is the low-level endpoint of the runtime. -If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](./openai_api_completions.ipynb). +If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](openai_api_completions.ipynb). ## `/generate` Endpoint -The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](./native_api.ipynb). 
The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs. +The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](native_api.ipynb). The object is defined at `io_struct.py::GenerateReqInput`. You can also read the source code to find more arguments and docs. | Argument | Type/Default | Description | |----------------------------|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------| @@ -135,7 +135,7 @@ for chunk in response.iter_lines(decode_unicode=False): print("") ``` -Detailed example in [openai compatible api](https://docs.sglang.ai/backend/openai_api_completions.html#id2). +Detailed example in [openai compatible api](openai_api_completions.ipynb). ### Multimodal @@ -176,7 +176,7 @@ The `image_data` can be a file name, a URL, or a base64 encoded string. See also Streaming is supported in a similar manner as [above](#streaming). -Detailed example in [openai api vision](./openai_api_vision.ipynb). +Detailed example in [OpenAI API Vision](openai_api_vision.ipynb). ### Structured Outputs (JSON, Regex, EBNF) diff --git a/docs/developer_guide/contribution_guide.md b/docs/developer_guide/contribution_guide.md index 6d98a88f8..55de73a0b 100644 --- a/docs/developer_guide/contribution_guide.md +++ b/docs/developer_guide/contribution_guide.md @@ -69,9 +69,10 @@ Another effective strategy is to review the file modification history and contac If you modify files protected by code owners, their approval is required to merge the code. ## General Code Style -- Avoid code duplication. If the same code snippet (more than 5 lines) appears multiple times, extract it into a shared function. -- Minimize device synchronization. 
Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, as much as possible. Use vectorized code instead. -- Keep files short. If a file exceeds 2,000 lines of code, please split it into multiple smaller files. +- Avoid code duplication. If the same code snippet (more than five lines) appears multiple times, extract it into a shared function. +- Minimize device synchronization. Reduce expensive CPU-GPU synchronization operations, such as `tensor.item()` or `tensor.cpu()`, whenever possible. Use vectorized code. +- Keep files concise. If a file exceeds 2,000 lines of code, split it into multiple smaller files. +- Prioritize extreme efficiency. SGLang is a runtime, and most of your code runs on the critical path for every request. Optimize every minor overhead as much as possible. ## Tips for newcomers diff --git a/docs/index.rst b/docs/index.rst index 3afd6b9d5..5eeca7892 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -87,7 +87,7 @@ The core features include: references/faq.md references/environment_variables.md references/production_metrics.md + references/multi_node_deployment/multi_node_index.rst references/custom_chat_template.md references/frontend/frontend_index.rst - references/multi_node_deployment/multi_node_index.rst references/learn_more.md diff --git a/docs/platforms/nvidia_jetson.md b/docs/platforms/nvidia_jetson.md index 60f3c1dc7..7a37e9426 100644 --- a/docs/platforms/nvidia_jetson.md +++ b/docs/platforms/nvidia_jetson.md @@ -66,7 +66,7 @@ This enables TorchAO's int4 weight-only quantization with a 128-group size. The * * * * * Structured output with XGrammar ------------------------------- -Please refer to [SGLang doc structured output](../backend/structured_outputs.ipynb). +Please refer to [SGLang doc structured output](../advanced_features/structured_outputs.ipynb). * * * * * Thanks to the support from [shahizat](https://github.com/shahizat). 
diff --git a/docs/supported_models/generative_models.md b/docs/supported_models/generative_models.md index 0ea2da7e8..b6c253aa2 100644 --- a/docs/supported_models/generative_models.md +++ b/docs/supported_models/generative_models.md @@ -25,9 +25,9 @@ in the GitHub search bar. | Model Family (Variants) | Example HuggingFace Identifier | Description | |-------------------------------------|--------------------------------------------------|----------------------------------------------------------------------------------------| -| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. [SGLang provides Deepseek v3/R1 model-specific optimizations](https://docs.sglang.ai/references/deepseek) and [Reasoning Parser](https://docs.sglang.ai/backend/separate_reasoning)| -| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; Support for MoE variants along with previous generation 2.5, 2, etc. [SGLang provides Qwen3 specific reasoning parser](https://docs.sglang.ai/backend/separate_reasoning)| -| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](https://docs.sglang.ai/references/llama4) | +| **DeepSeek** (v1, v2, v3/R1) | `deepseek-ai/DeepSeek-R1` | Series of advanced reasoning-optimized models (including a 671B MoE) trained with reinforcement learning; top performance on complex reasoning, math, and code tasks. 
[SGLang provides Deepseek v3/R1 model-specific optimizations](../basic_usage/deepseek.md) and [Reasoning Parser](../advanced_features/separate_reasoning.ipynb)| +| **Qwen** (3, 3MoE, 2.5, 2 series) | `Qwen/Qwen3-0.6B`, `Qwen/Qwen3-30B-A3B` | Alibaba’s latest Qwen3 series for complex reasoning, language understanding, and generation tasks; support for MoE variants along with previous generations (2.5, 2, etc.). [SGLang provides a Qwen3-specific reasoning parser](../advanced_features/separate_reasoning.ipynb)| +| **Llama** (2, 3.x, 4 series) | `meta-llama/Llama-4-Scout-17B-16E-Instruct` | Meta’s open LLM series, spanning 7B to 400B parameters (Llama 2, 3, and new Llama 4) with well-recognized performance. [SGLang provides Llama-4 model-specific optimizations](../basic_usage/llama4.md) | | **Mistral** (Mixtral, NeMo, Small3) | `mistralai/Mistral-7B-Instruct-v0.2` | Open 7B LLM by Mistral AI with strong performance; extended into MoE (“Mixtral”) and NeMo Megatron variants for larger scale. | | **Gemma** (v1, v2, v3) | `google/gemma-3-1b-it` | Google’s family of efficient multilingual models (1B–27B); Gemma 3 offers a 128K context window, and its larger (4B+) variants support vision input. | | **Phi** (Phi-1.5, Phi-2, Phi-3, Phi-4, Phi-MoE series) | `microsoft/Phi-4-multimodal-instruct`, `microsoft/Phi-3.5-MoE-instruct` | Microsoft’s Phi family of small models (1.3B–5.6B); Phi-4-multimodal (5.6B) processes text, images, and speech, Phi-4-mini is a high-accuracy text model, and Phi-3.5-MoE is a mixture-of-experts model. | diff --git a/docs/supported_models/support_new_models.md b/docs/supported_models/support_new_models.md index 2223254d9..05500a95b 100644 --- a/docs/supported_models/support_new_models.md +++ b/docs/supported_models/support_new_models.md @@ -27,7 +27,7 @@ standard LLM support: 3. **Multimodal Data Processor**: Define a new `Processor` class that inherits from `BaseMultimodalProcessor` and register this processor as your model’s dedicated processor.
- See [multimodal_processor.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/managers/multimodal_processor.py) + See the [multimodal processors directory](https://github.com/sgl-project/sglang/tree/main/python/sglang/srt/multimodal/processors) for more details. 4. **Handle Multimodal Tokens**: diff --git a/docs/supported_models/transformers_fallback.md b/docs/supported_models/transformers_fallback.md index 2deef1c9f..3c7dd961c 100644 --- a/docs/supported_models/transformers_fallback.md +++ b/docs/supported_models/transformers_fallback.md @@ -18,7 +18,7 @@ python3 -m sglang.launch_server \ ### Quantization -Transformers fall back has supported most of available quantization in SGLang (except GGUF). See [Quantization page](https://docs.sglang.ai/backend/quantization.html) for more information about supported quantization in SGLang. +The Transformers fallback supports most of the quantization methods available in SGLang (except GGUF). See the [Quantization page](../advanced_features/quantization.md) for more information about supported quantization in SGLang. ### Remote code