diff --git a/docs/backend/custom_chat_template.md b/docs/backend/custom_chat_template.md
index 43befa951..557af5bf5 100644
--- a/docs/backend/custom_chat_template.md
+++ b/docs/backend/custom_chat_template.md
@@ -7,13 +7,14 @@ It should just work for most official models such as Llama-2/Llama-3.
 
 If needed, you can also override the chat template when launching the server:
 
-```
+```bash
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template llama-2
 ```
 
 If the chat template you are looking for is missing, you are welcome to contribute it or load it from a file.
 
 ## JSON Format
+
 You can load the JSON format, which is defined by `conversation.py`.
 
 ```json
@@ -28,13 +29,14 @@ You can load the JSON format, which is defined by `conversation.py`.
 }
 ```
 
-```
+```bash
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.json
 ```
 
 ## Jinja Format
 
-You can also use the Jinja template format, defined by Hugging Face transformers https://huggingface.co/docs/transformers/main/en/chat_templating
-```
+You can also use the [Jinja template format](https://huggingface.co/docs/transformers/main/en/chat_templating) as defined by Hugging Face Transformers.
+
+```bash
 python -m sglang.launch_server --model-path meta-llama/Llama-2-7b-chat-hf --port 30000 --chat-template ./my_model_template.jinja
 ```
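+
+Once the server is running, one quick way to sanity-check that the overridden template is picked up is to send a request to the OpenAI-compatible chat endpoint. The snippet below is a minimal sketch, assuming the server launched above is listening on port 30000 and serving `meta-llama/Llama-2-7b-chat-hf`:
+
+```bash
+# The server renders the prompt for this request with the active chat template.
+curl http://localhost:30000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "meta-llama/Llama-2-7b-chat-hf",
+    "messages": [{"role": "user", "content": "What is the capital of France?"}]
+  }'
+```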
diff --git a/docs/backend/hyperparameter_tuning.md b/docs/backend/hyperparameter_tuning.md
index 2863243fc..d3ed600c3 100644
--- a/docs/backend/hyperparameter_tuning.md
+++ b/docs/backend/hyperparameter_tuning.md
@@ -28,11 +28,12 @@ If you see `decode out of memory happened` occasionally but not frequently, it i
 
 ### Tune `--dp-size` and `--tp-size`
 
-Data parallelism is better for throughput. When there is enough GPU memory, always favor data parallelism for throughput. Refer to [sglang router](../backend/sglang_router.md) for a better data parallelism rather than using `dp_size` parameter.
+Data parallelism is better for throughput: when there is enough GPU memory, always favor it. Refer to the [sglang router](../router/router.md) for a better way to run data parallelism than the `dp_size` parameter.
 
 ## Avoid out-of-memory by Tuning `--chunked-prefill-size`, `--mem-fraction-static`, `--max-running-requests`
 
 If you see out of memory (OOM) errors, you can try to tune the following parameters.
+
 - If OOM happens during prefill, try to decrease `--chunked-prefill-size` to `4096` or `2048`.
 - If OOM happens during decoding, try to decrease `--max-running-requests`.
 - You can also try to decrease `--mem-fraction-static`, which reduces the memory usage of the KV cache memory pool and helps both prefill and decoding.
@@ -48,16 +49,15 @@ If you want to deploy a model on many different machines, you can ship the `torc
 
 1. Generate the cache by setting `TORCHINDUCTOR_CACHE_DIR` and running the model once.
 
-```bash
-TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
-```
+   ```bash
+   TORCHINDUCTOR_CACHE_DIR=/root/inductor_root_cache python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --enable-torch-compile
+   ```
 
 2. Copy the cache folder to other machines and launch the server with `TORCHINDUCTOR_CACHE_DIR`.
 
-
 ## Tune `--schedule-policy`
 
-If the workload has many shared prefixes, use the default `--schedule-policy lpm`. `lpm` stands for longest prefix match.
+If the workload has many shared prefixes, use the default `--schedule-policy lpm`, where `lpm` stands for longest prefix match.
 When you have no shared prefixes at all or you always send the requests with the shared prefixes together,
-you can try `--schedule-policy fcfs`. `fcfs` stands for first come first serve. `fcfs` has a lower scheduling overhead.
+you can try `--schedule-policy fcfs`, where `fcfs` stands for first come first serve. This policy has a lower scheduling overhead.
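+
+For illustration, a launch command that switches the scheduler to `fcfs` could look like the sketch below. The model path is only an example; the right policy still depends on your workload:
+
+```bash
+# Use first-come-first-serve scheduling when requests share little or no prefix.
+python3 -m sglang.launch_server \
+  --model meta-llama/Llama-3.1-8B-Instruct \
+  --schedule-policy fcfs
+```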
diff --git a/docs/backend/quantization.md b/docs/backend/quantization.md
index b70f34f4a..c057a3413 100644
--- a/docs/backend/quantization.md
+++ b/docs/backend/quantization.md
@@ -3,7 +3,7 @@
 SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
 
 Offline quantization loads pre-quantized model weights directly during inference. This is required for quantization methods
-such as GPTQ and AWQ that collects and pre-compute various stats from the original weights using the calibration dataset.
+such as GPTQ and AWQ, which collect and pre-compute various statistics from the original weights using the calibration dataset.
 
 Online quantization dynamically computes scaling parameters—such as the maximum/minimum values of model weights—during runtime.
 Like NVIDIA FP8 training's [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism, online quantization calculates the appropriate scaling factors
@@ -12,7 +12,8 @@ on-the-fly to convert high-precision weights into a lower-precision format.
 
 **Note: For better performance, usability and convenience, offline quantization is recommended over online quantization.** If you use a pre-quantized model, do not add `--quantization` to enable online quantization at the same time.
 
-For popular pre-quantized models, please visit [ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2) or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on HF for some
+For popular pre-quantized models, please visit [ModelCloud](https://huggingface.co/collections/ModelCloud/vortex-673743382af0a52b2a8b9fe2)
+or [NeuralMagic](https://huggingface.co/collections/neuralmagic) collections on HF for some
 popular quality validated quantized models. Quantized models must be validated via benchmarks post-quantization to guard against abnormal quantization loss regressions.
@@ -111,9 +112,9 @@ python3 -m sglang.launch_server \
   --port 30000 --host 0.0.0.0
 ```
 
-Our team is working on supporting more online quantization methods. We will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`
+Our team is working on supporting more online quantization methods. SGLang will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
 
-We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
+SGLang also supports quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` on the command line to enable this feature.
+For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
 
 ```bash
 python3 -m sglang.launch_server \
@@ -122,7 +123,7 @@ python3 -m sglang.launch_server \
   --torchao-config int4wo-128 \
   --port 30000 --host 0.0.0.0
 ```
 
-We support the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
+SGLang supports the following quantization methods based on torchao: `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
 
 Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `"int8dq"` method. Namely, please use the following command:
@@ -134,8 +135,6 @@ python3 -m sglang.launch_server \
   --port 30000 --host 0.0.0.0
 ```
 
-
-
 ## Reference
 
 - [GPTQModel](https://github.com/ModelCloud/GPTQModel)