Docs: Fix layout with sub-section (#3710)

This commit is contained in:
Chayenne
2025-02-19 15:44:30 -08:00
committed by GitHub
parent bb121214c2
commit 3c7bfd7eab
18 changed files with 78 additions and 72 deletions

View File

@@ -0,0 +1,8 @@
Multi-Node Deployment
==========================
.. toctree::
   :maxdepth: 1

   deepseek.md
   multi_node.md
   k8s.md

View File

@@ -1,40 +0,0 @@
# Custom Chat Template in SGLang Runtime
**NOTE**: There are two chat template systems in the SGLang project. This document is about setting a custom chat template for the OpenAI-compatible API server (defined in [conversation.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/conversation.py)). It is NOT related to the chat template used in the SGLang language frontend (defined in [chat_template.py](https://github.com/sgl-project/sglang/blob/main/python/sglang/lang/chat_template.py)).
By default, the server uses the chat template specified in the model tokenizer from Hugging Face.
It should just work for most official models such as Llama-2/Llama-3.
If needed, you can also override the chat template when launching the server:
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
```
If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
## JSON Format
You can use the JSON format, which is defined in `conversation.py`.
```json
{
"name": "my_model",
"system": "<|im_start|>system",
"user": "<|im_start|>user",
"assistant": "<|im_start|>assistant",
"sep_style": "CHATML",
"sep": "<|im_end|>",
"stop_str": ["<|im_end|>", "<|im_start|>"]
}
```
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
```
## Jinja Format
You can also use the Jinja template format defined by Hugging Face Transformers: https://huggingface.co/docs/transformers/main/en/chat_templating
```
python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
```

View File

@@ -0,0 +1,13 @@
General Guidance
================
.. toctree::
   :maxdepth: 1

   supported_models.md
   contribution_guide.md
   troubleshooting.md
   faq.md
   learn_more.md
   modelscope.md

View File

@@ -0,0 +1,7 @@
Hardware Support
================
.. toctree::
   :maxdepth: 1

   amd.md
   nvidia_jetson.md

View File

@@ -1,39 +0,0 @@
# Guide on Hyperparameter Tuning
## Achieving Peak Throughput
Achieving a large batch size is the most important thing for attaining high throughput.
When the server is running at full load, look for the following in the log:
```
Decode batch. #running-req: 233, #token: 370959, token usage: 0.82, gen throughput (token/s): 4594.01, #queue-req: 317
```
### Tune Your Request Submission Speed
`#queue-req` indicates the number of requests in the queue. If you frequently see `#queue-req == 0`, it suggests you are bottlenecked by the request submission speed.
A healthy range for `#queue-req` is `50 - 500`.
On the other hand, do not make `#queue-req` too large because it will also increase the scheduling overhead on the server, especially when using the default longest-prefix-match schedule policy (`--schedule-policy lpm`).
### Tune `--schedule-conservativeness`
`token usage` indicates the KV cache memory utilization of the server. `token usage > 0.9` means good utilization.
If you frequently see `token usage < 0.9` and `#queue-req > 0`, it means the server is too conservative about taking in new requests. You can decrease `--schedule-conservativeness` to a value like 0.3.
The server can become too conservative when users send many requests with a large `max_new_tokens` but the requests actually stop very early due to EOS or stop strings.
On the other hand, if you see `token usage` very high and you frequently see warnings like
`decode out of memory happened, #retracted_reqs: 1, #new_token_ratio: 0.9998 -> 1.0000`, you can increase `--schedule-conservativeness` to a value like 1.3.
If you see `decode out of memory happened` occasionally but not frequently, it is okay.
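For example, a minimal launch sketch that makes the scheduler less conservative (the model path is a placeholder; adjust the value to your workload):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --schedule-conservativeness 0.3
```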
### Tune `--dp-size` and `--tp-size`
Data parallelism is better for throughput: when there is enough GPU memory, always favor data parallelism over tensor parallelism.
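For example, a sketch that trades tensor parallelism for two data-parallel replicas (the model path and sizes are placeholders for your setup):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --dp-size 2 --tp-size 1
```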
### Avoid Out-of-Memory Errors by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, and `--max-running-requests`
If you see out-of-memory (OOM) errors, you can try to tune the following parameters; a combined sketch follows the list.
- If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
- If OOM happens during decoding, try to decrease `--max-running-requests`.
- You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
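A combined launch sketch (the model path and the specific values are placeholders to adapt to your hardware and workload):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --chunked-prefill-size 4096 \
  --mem-fraction-static 0.8 \
  --max-running-requests 128
```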
### Try Advanced Options
- To enable torch.compile acceleration, add `--enable-torch-compile`. It accelerates small models at small batch sizes. This does not currently work for FP8. See the sketch below.
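A minimal sketch (the model path is a placeholder):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --enable-torch-compile
```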
### Tune `--schedule-policy`
If the workload has many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
If you have no shared prefixes at all, or you always send requests with shared prefixes together, you can try `--schedule-policy fcfs`. `fcfs` stands for first-come, first-served and has lower scheduling overhead.
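For example, a sketch for a workload without shared prefixes (the model path is a placeholder):
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --schedule-policy fcfs
```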

View File

@@ -1,4 +1,6 @@
# Deploying a RoCE Network-Based SGLANG Two-Node Inference Service on a Kubernetes (K8S) Cluster
# Kubernetes
This doc describes how to deploy a RoCE-network-based SGLang two-node inference service on a Kubernetes (K8s) cluster.
LeaderWorkerSet (LWS) is a Kubernetes API that aims to address common deployment patterns of AI/ML inference workloads. A major use case is for multi-host/multi-node distributed inference.

View File

@@ -1,4 +1,4 @@
# Run Multi-Node Inference
# Multi-Node Deployment
## Llama 3.1 405B

View File

@@ -0,0 +1,7 @@
Performance Tuning
====================
.. toctree::
   :maxdepth: 1

   benchmark_and_profiling.md
   accuracy_evaluation.md

View File

@@ -1,106 +0,0 @@
# Quantization
SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
Offline quantization loads pre-quantized model weights directly during inference. This is useful for methods that require pre-computed statistics, such as AWQ, which collects activation statistics from a calibration dataset.
Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime. Like NVIDIA FP8 training's [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism, online quantization calculates the appropriate scaling factors on-the-fly to convert high-precision weights into a lower-precision format.
**Note that, for better performance, usability, and convenience, offline quantization is recommended over online quantization.** If you use a pre-quantized model, do not also add `--quantization`, which would enable online quantization at the same time. For popular pre-quantized LLMs, see the [neuralmagic collection](https://huggingface.co/collections/neuralmagic) on Hugging Face.
## Offline Quantization
To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline, there is no need to add the `--quantization` argument when starting the engine; the quantization method will be parsed from the downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant parameters.**
```bash
python3 -m sglang.launch_server \
--model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
--port 30000 --host 0.0.0.0
```
To perform offline quantization on your own model, first install the [llm-compressor](https://github.com/vllm-project/llm-compressor/) library:
```bash
pip install llmcompressor
```
Here, we quantize `meta-llama/Meta-Llama-3-8B-Instruct` to `FP8` as an example of how to do offline quantization.
```python
from transformers import AutoTokenizer
from llmcompressor.transformers import SparseAutoModelForCausalLM
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
# Step 1: Load the original model.
MODEL_ID = "meta-llama/Meta-Llama-3-8B-Instruct"
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# Step 2: Perform offline quantization.
# Step 2.1: Configure the simple PTQ quantization.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
# Step 2.2: Apply the quantization algorithm.
oneshot(model=model, recipe=recipe)
# Step 3: Save the model.
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-Dynamic"
model.save_pretrained(SAVE_DIR)
tokenizer.save_pretrained(SAVE_DIR)
```
Then you can use the quantized model directly with SGLang by running the following command:
```bash
python3 -m sglang.launch_server \
--model-path $PWD/Meta-Llama-3-8B-Instruct-FP8-Dynamic \
--port 30000 --host 0.0.0.0
```
## Online Quantization
To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
```bash
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--quantization fp8 \
--port 30000 --host 0.0.0.0
```
Our team is working on supporting more online quantization methods. We will soon support methods including, but not limited to, `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to use this feature. For example, if you want to enable `int4wo-128` for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
```bash
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--torchao-config int4wo-128 \
--port 30000 --host 0.0.0.0
```
We support the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), the `"int8dq"` method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using `"int8dq"`. Namely, please use the following command:
```bash
python3 -m sglang.launch_server \
--model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--torchao-config int8dq \
--disable-cuda-graph \
--port 30000 --host 0.0.0.0
```
## Reference
- [vLLM quantization documentation](https://docs.vllm.ai/en/latest/quantization/fp8.html)
- [torchao](https://github.com/pytorch/ao)
- [llm-compressor](https://github.com/vllm-project/llm-compressor/)

View File

@@ -1,107 +0,0 @@
# Sampling Parameters in SGLang Runtime
This doc describes the sampling parameters of the SGLang Runtime, which are accepted by its low-level `/generate` endpoint.
If you want a high-level endpoint that can automatically handle chat templates, consider using the [OpenAI Compatible API](https://docs.sglang.ai/backend/openai_api_completions.html).
## `/generate` Endpoint
The `/generate` endpoint accepts the following parameters in JSON format. For detailed usage, see the [native API doc](https://docs.sglang.ai/backend/native_api.html). A minimal request example follows the parameter list below.
* `text`: The input prompt. Can be a single prompt or a batch of prompts.
* `input_ids`: Alternative to `text`. Specify the input as token IDs instead of text.
* `sampling_params`: The sampling parameters as described in the sections below.
* `return_logprob`: Whether to return log probabilities for tokens.
* `logprob_start_len`: If returning log probabilities, specifies the start position in the prompt. Default is `-1`, which returns logprobs only for output tokens.
* `top_logprobs_num`: If returning log probabilities, specifies the number of top logprobs to return at each position.
* `stream`: Whether to stream the output.
* `lora_path`: Path to LoRA weights.
* `custom_logit_processor`: Custom logit processor for advanced sampling control. For usage see below.
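A minimal request sketch, assuming a server is already running on port 30000 (the prompt and parameter values are illustrative):
```
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "The capital of France is",
    "sampling_params": {"temperature": 0.7, "top_p": 0.9, "max_new_tokens": 32},
    "return_logprob": false,
    "stream": false
  }'
```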
## Sampling params
### Core Parameters
* `max_new_tokens`: The maximum output length measured in tokens.
* `stop`: One or multiple [stop words](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#let_the_model_know_when_to_stop). Generation will stop if one of these words is sampled.
* `stop_token_ids`: Provide stop words in form of token ids. Generation will stop if one of these token ids is sampled.
* `temperature`: [Temperature](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) when sampling the next token. `temperature = 0` corresponds to greedy sampling, higher temperature leads to more diversity.
* `top_p`: [Top-p](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens.
* `top_k`: [Top-k](https://developer.nvidia.com/blog/how-to-get-better-outputs-from-your-large-language-model/#predictability_vs_creativity) randomly selects from the `k` highest-probability tokens.
* `min_p`: [Min-p](https://github.com/huggingface/transformers/issues/27670) samples from tokens with probability larger than `min_p * highest_token_probability`.
### Penalizers
To use the penalizers below, you need to launch the server with `--disable-overlap`. Please note that this might degrade performance. A launch-and-request sketch follows the list.
* `frequency_penalty`: Penalizes tokens based on their frequency in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalization grows linearly with each appearance of a token.
* `presence_penalty`: Penalizes tokens if they have appeared in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalization is constant once a token has appeared, regardless of how often.
* `repetition_penalty`: Penalizes tokens if they have appeared in the prompt or the generation so far. Must be between `0` and `2`, where values smaller than `1` encourage repetition of tokens and values larger than `1` encourage sampling of new tokens. The penalization scales multiplicatively.
* `min_new_tokens`: Forces the model to generate at least `min_new_tokens` tokens before a stop word or EOS token can take effect. Note that this might lead to unintended behavior, for example, if the distribution is highly skewed towards these tokens.
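A sketch combining the required launch flag with a penalized request (the model path, prompt, and parameter values are placeholders; run the two commands in separate terminals):
```
# Launch the server with the overlap scheduler disabled so penalizers can be used
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --port 30000 --disable-overlap

# Send a request that applies frequency and presence penalties
curl -s http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "List some fruits:",
    "sampling_params": {"max_new_tokens": 64, "frequency_penalty": 0.5, "presence_penalty": 0.5}
  }'
```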
### Constrained decoding
Please refer to our dedicated guide on [constrained decoding](https://docs.sglang.ai/backend/structured_outputs.html#Native-API-and-SGLang-Runtime-(SRT)) for the following parameters.
* `json_schema`
* `regex`
* `ebnf`
### Other options
* `n`: Specifies the number of output sequences to generate per request. Generating multiple outputs in one request (`n > 1`) is discouraged; separate requests offer better control and efficiency.
* `spaces_between_special_tokens`: Whether or not to add spaces between special tokens during detokenization.
* `no_stop_trim`: Don't trim stop words or EOS token from the generated text.
* `ignore_eos`: Don't stop generation when EOS token is sampled.
* `skip_special_tokens`: Remove special tokens during decoding.
* `custom_params`: Used when employing `CustomLogitProcessor`. For usage see below.
### Custom Logit Processor
Launch the server with the `--enable-custom-logit-processor` flag.
```
python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000 --enable-custom-logit-processor
```
Define a custom logit processor that will always sample a specific token id.
```python
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor
class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits to always
    sample the given token id.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logits matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
```
Send a request
```python
import requests
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())
```