diff --git a/docs/references/quantization.md b/docs/references/quantization.md
index b39bb415b..471d55f65 100644
--- a/docs/references/quantization.md
+++ b/docs/references/quantization.md
@@ -1,52 +1,17 @@
 # Quantization
 
-`SGLang` support various quantization methods, including online dynamic quantization and offline quantization.
+SGLang supports various quantization methods, including offline quantization and online dynamic quantization.
 
-Online quantization computes weight scaling stats(max/min) dynamically at runtime, as examplified by the [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) in NVIDIA FP8 training. For inference this quantizes the model once on loading.
+Offline quantization loads pre-quantized model weights directly during inference. This is useful for methods that require pre-computed statistics, such as AWQ, which collects activation statistics from a calibration set.
 
-Offline quantization saves pre-quantized model weights and loads during inference. This is useful for methods requiring pre-computed stats such as AWQ, which collects activation stats from the pre-training set.
-
-Please visit [here](https://huggingface.co/collections/neuralmagic) for some popular quantized LLMs on huggingface.
-
-## Online Quantization
-
-> Note: Although we support online quantization, we recommend users to use offline quantized (by community or officially) models.
-
-To enable online quantization, you can simply specify `--quantization` in the command line. For example, if you want to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
-
-```bash
-python3 -m sglang.launch_server \
-  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
-  --quantization fp8 \
-  --port 30000 --host 0.0.0.0
-```
-
-Our team is working on supporting more online quantization methods. We will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`
-
-We also support quantization methods based on [torchao](https://github.com/pytorch/ao). You can simply specify `--torchao-config` in the command line to support this feature. For example, if you want to enable `int4wo-128` for model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
-
-```bash
-python3 -m sglang.launch_server \
-  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
-  --torchao-config int4wo-128 \
-  --port 30000 --host 0.0.0.0
-```
-
-We support the following quantization methods based on torchao `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`
-
-Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), `"int8dq"` method currently has some bugs when using together with cuda graph capture. So we suggest to disable cuda graph capture when using `"int8dq"` method. Namely, please use the following command:
-
-```bash
-python3 -m sglang.launch_server \
-  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
-  --torchao-config int8dq \
-  --disable-cuda-graph \
-  --port 30000 --host 0.0.0.0
-```
+Online quantization computes scaling parameters, such as the maximum and minimum values of the model weights, at runtime. Like the [delayed scaling](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#Mixed-precision-training-with-FP8) mechanism in NVIDIA FP8 training, it calculates the appropriate scaling factors on the fly to convert high-precision weights into a lower-precision format.
+
+**Note that, for better performance, usability, and convenience, offline quantization is recommended over online quantization.** If you use a pre-quantized model, do not also pass `--quantization` to enable online quantization. For popular pre-quantized LLMs, see the [neuralmagic collection](https://huggingface.co/collections/neuralmagic) on Hugging Face.
 
 ## Offline Quantization
-To load already quantized models, simply load the model weights and config.
+
+To load already quantized models, simply load the model weights and config. **Again, if the model has been quantized offline, there is no need to add the `--quantization` argument when starting the engine. The quantization method will be parsed from the downloaded Hugging Face config. For example, DeepSeek V3/R1 models are already in FP8, so do not add redundant flags.**
+
 ```bash
 python3 -m sglang.launch_server \
   --model-path hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
@@ -96,9 +61,46 @@ python3 -m sglang.launch_server \
   --port 30000 --host 0.0.0.0
 ```
 
-If the model has been quantized offline, there's no need to add `--quantization` argument when starting the engine. The quantization method will be parsed from the downloaded huggingface config.
+## Online Quantization
+
+To enable online quantization, you can simply specify `--quantization` on the command line. For example, you can launch the server with the following command to enable `FP8` quantization for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
+
+```bash
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --quantization fp8 \
+  --port 30000 --host 0.0.0.0
+```
+
+Our team is working on supporting more online quantization methods. We will soon support methods including but not limited to `["awq", "gptq", "marlin", "gptq_marlin", "awq_marlin", "bitsandbytes", "gguf"]`.
+
+We also support quantization methods based on [torchao](https://github.com/pytorch/ao). To enable this feature, simply specify `--torchao-config` on the command line. For example, if you want to enable `int4wo-128` for the model `meta-llama/Meta-Llama-3.1-8B-Instruct`, you can launch the server with the following command:
+
+```bash
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --torchao-config int4wo-128 \
+  --port 30000 --host 0.0.0.0
+```
+
+The following torchao-based quantization methods are supported: `["int8dq", "int8wo", "fp8wo", "fp8dq-per_tensor", "fp8dq-per_row", "int4wo-32", "int4wo-64", "int4wo-128", "int4wo-256"]`.
+
+Note: According to [this issue](https://github.com/sgl-project/sglang/issues/2219#issuecomment-2561890230), the `"int8dq"` method currently has some bugs when used together with CUDA graph capture, so we suggest disabling CUDA graph capture when using it:
+
+```bash
+python3 -m sglang.launch_server \
+  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
+  --torchao-config int8dq \
+  --disable-cuda-graph \
+  --port 30000 --host 0.0.0.0
+```
+
 ## Reference
 
+- [vLLM quantization documentation](https://docs.vllm.ai/en/latest/quantization/fp8.html)
+
 - [torchao](https://github.com/pytorch/ao)
+
+- [llm-compressor](https://github.com/vllm-project/llm-compressor/)