Enable native ModelOpt quantization support (3/3) (#10154)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@@ -110,6 +110,157 @@ python3 -m sglang.launch_server \
    --port 30000 --host 0.0.0.0
```

#### Using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)

NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.

##### Installation

First, install ModelOpt. You can either install it directly or as an optional SGLang dependency:

```bash
# Option 1: Install ModelOpt directly
pip install nvidia-modelopt

# Option 2: Install SGLang with ModelOpt support (recommended)
pip install sglang[modelopt]
```
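To confirm the installation, a quick sanity check is to import the package and print the installed version. This is a minimal sketch, assuming the `nvidia-modelopt` wheel exposes the `modelopt` import name:

```bash
# Optional: verify that ModelOpt is importable and report its installed version
python -c "import modelopt, importlib.metadata; print(importlib.metadata.version('nvidia-modelopt'))"
```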
##### Quantization and Export Workflow

SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow:

```bash
# Quantize and export a model using ModelOpt FP8 quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# For FP4 quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp4 \
    --quantization-method modelopt_fp4
```

##### Available Quantization Methods

- `modelopt_fp8`: FP8 quantization with optimal performance on NVIDIA Hopper and Blackwell GPUs
- `modelopt_fp4`: FP4 quantization with optimal performance on NVIDIA Blackwell GPUs

##### Python API Usage

You can also use ModelOpt quantization programmatically:

```python
import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model with ModelOpt quantization and export
model_config = ModelConfig(
    model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization="modelopt_fp8",  # or "modelopt_fp4"
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_export_path="./exported_model",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # optional: save the fake-quantized checkpoint for reuse
)
device_config = DeviceConfig(device="cuda")

# Load and quantize the model (export happens automatically)
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)
```

##### Deploying Quantized Models

After quantization and export, you can deploy the model with SGLang:

```bash
# Deploy the exported quantized model
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt \
    --port 30000 --host 0.0.0.0
```

Or using the Python API:

```python
import sglang as sgl

# Deploy exported ModelOpt quantized model
llm = sgl.Engine(
    model_path="./quantized_tinyllama_fp8",
    quantization="modelopt"
)

# Run inference
prompts = ["Hello, how are you?", "What is the capital of France?"]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}
outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output['text']}")
```

##### Advanced Features

**Checkpoint Management**: Save and restore fake quantized checkpoints for reuse:

```bash
# Save the fake quantized checkpoint during quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --export-dir ./quantized_model \
    --quantization-method modelopt_fp8 \
    --checkpoint-save-path ./my_checkpoint.pth

# The checkpoint can be reused in later quantization runs to skip calibration
```

**Export-only Workflow**: If you have a pre-existing fake quantized ModelOpt checkpoint, you can export it directly:

```python
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

model_config = ModelConfig(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    quantization="modelopt_fp8",
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_checkpoint_restore_path="./my_checkpoint.pth",
    modelopt_export_path="./exported_model",
)

# Load and export the model
model_loader = get_model_loader(load_config, model_config)
model_loader.load_model(model_config=model_config, device_config=DeviceConfig())
```

##### Benefits of ModelOpt

- **Hardware Optimization**: Specifically optimized for NVIDIA GPU architectures
- **Advanced Quantization**: Supports cutting-edge FP8 and FP4 quantization techniques
- **Seamless Integration**: Automatic export to HuggingFace format for easy deployment
- **Calibration-based**: Uses calibration datasets for optimal quantization quality
- **Production Ready**: Enterprise-grade quantization with NVIDIA support

## Online Quantization

To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
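A minimal sketch of such a command, following the `launch_server` flag pattern used elsewhere on this page (the exact command in the full documentation may differ):

```bash
# Online FP8 quantization: weights are quantized at load time, no pre-quantized checkpoint required
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000 --host 0.0.0.0
```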
@@ -148,5 +299,6 @@ python3 -m sglang.launch_server \

- [GPTQModel](https://github.com/ModelCloud/GPTQModel)
- [LLM Compressor](https://github.com/vllm-project/llm-compressor/)
- [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao)
- [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/)