Enable native ModelOpt quantization support (3/3) (#10154)
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@@ -110,6 +110,157 @@ python3 -m sglang.launch_server \
    --port 30000 --host 0.0.0.0
```

#### Using [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer)

NVIDIA Model Optimizer (ModelOpt) provides advanced quantization techniques optimized for NVIDIA hardware. SGLang includes a streamlined workflow for quantizing models with ModelOpt and automatically exporting them for deployment.

##### Installation

First, install ModelOpt. You can either install it directly or as an optional SGLang dependency:

```bash
# Option 1: Install ModelOpt directly
pip install nvidia-modelopt

# Option 2: Install SGLang with ModelOpt support (recommended)
pip install sglang[modelopt]
```
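To confirm the installation, a quick sanity check is to import the package and print the installed version. This is a minimal sketch, assuming the `nvidia-modelopt` wheel exposes the `modelopt` import name:

```bash
# Optional: verify that ModelOpt is importable and report its installed version
python -c "import modelopt, importlib.metadata; print(importlib.metadata.version('nvidia-modelopt'))"
```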
##### Quantization and Export Workflow

SGLang provides an example script that demonstrates the complete ModelOpt quantization and export workflow:

```bash
# Quantize and export a model using ModelOpt FP8 quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp8 \
    --quantization-method modelopt_fp8

# For FP4 quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
    --export-dir ./quantized_tinyllama_fp4 \
    --quantization-method modelopt_fp4
```

##### Available Quantization Methods

- `modelopt_fp8`: FP8 quantization with optimal performance on NVIDIA Hopper and Blackwell GPUs
- `modelopt_fp4`: FP4 quantization with optimal performance on NVIDIA Blackwell GPUs

##### Python API Usage

You can also use ModelOpt quantization programmatically:

```python
import sglang as sgl
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

# Configure model with ModelOpt quantization and export
model_config = ModelConfig(
    model_path="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization="modelopt_fp8",  # or "modelopt_fp4"
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_export_path="./exported_model",
    modelopt_checkpoint_save_path="./checkpoint.pth",  # optional: save the fake-quantized checkpoint for reuse
)
device_config = DeviceConfig(device="cuda")

# Load and quantize the model (export happens automatically)
model_loader = get_model_loader(load_config, model_config)
quantized_model = model_loader.load_model(
    model_config=model_config,
    device_config=device_config,
)
```

##### Deploying Quantized Models

After quantization and export, you can deploy the model with SGLang:

```bash
# Deploy the exported quantized model
python -m sglang.launch_server \
    --model-path ./quantized_tinyllama_fp8 \
    --quantization modelopt \
    --port 30000 --host 0.0.0.0
```

Or using the Python API:

```python
import sglang as sgl

# Deploy exported ModelOpt quantized model
llm = sgl.Engine(
    model_path="./quantized_tinyllama_fp8",
    quantization="modelopt"
)

# Run inference
prompts = ["Hello, how are you?", "What is the capital of France?"]
sampling_params = {"temperature": 0.8, "top_p": 0.95, "max_new_tokens": 100}
outputs = llm.generate(prompts, sampling_params)

for i, output in enumerate(outputs):
    print(f"Prompt: {prompts[i]}")
    print(f"Output: {output['text']}")
```

##### Advanced Features

**Checkpoint Management**: Save and restore fake quantized checkpoints for reuse:

```bash
# Save the fake quantized checkpoint during quantization
python examples/usage/modelopt_quantize_and_export.py quantize \
    --model-path meta-llama/Llama-3.2-1B-Instruct \
    --export-dir ./quantized_model \
    --quantization-method modelopt_fp8 \
    --checkpoint-save-path ./my_checkpoint.pth

# The checkpoint can be reused in later quantization runs to skip calibration
```

**Export-only Workflow**: If you have a pre-existing fake quantized ModelOpt checkpoint, you can export it directly:

```python
from sglang.srt.configs.device_config import DeviceConfig
from sglang.srt.configs.load_config import LoadConfig
from sglang.srt.configs.model_config import ModelConfig
from sglang.srt.model_loader.loader import get_model_loader

model_config = ModelConfig(
    model_path="meta-llama/Llama-3.2-1B-Instruct",
    quantization="modelopt_fp8",
    trust_remote_code=True,
)

load_config = LoadConfig(
    modelopt_checkpoint_restore_path="./my_checkpoint.pth",
    modelopt_export_path="./exported_model",
)

# Load and export the model
model_loader = get_model_loader(load_config, model_config)
model_loader.load_model(model_config=model_config, device_config=DeviceConfig())
```

##### Benefits of ModelOpt

- **Hardware Optimization**: Specifically optimized for NVIDIA GPU architectures
- **Advanced Quantization**: Supports cutting-edge FP8 and FP4 quantization techniques
- **Seamless Integration**: Automatic export to HuggingFace format for easy deployment
- **Calibration-based**: Uses calibration datasets for optimal quantization quality
- **Production Ready**: Enterprise-grade quantization with NVIDIA support

## Online Quantization

To enable online quantization, you can simply specify `--quantization` in the command line. For example, you can launch the server with the following command to enable `FP8` quantization for model `meta-llama/Meta-Llama-3.1-8B-Instruct`:
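A minimal sketch of such a command, following the `launch_server` flag pattern used elsewhere on this page (the exact command in the full documentation may differ):

```bash
# Online FP8 quantization: weights are quantized at load time, no pre-quantized checkpoint required
python3 -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --quantization fp8 \
    --port 30000 --host 0.0.0.0
```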
@@ -148,5 +299,6 @@ python3 -m sglang.launch_server \

- [GPTQModel](https://github.com/ModelCloud/GPTQModel)
- [LLM Compressor](https://github.com/vllm-project/llm-compressor/)
- [NVIDIA Model Optimizer (ModelOpt)](https://github.com/NVIDIA/TensorRT-Model-Optimizer)
- [Torchao: PyTorch Architecture Optimization](https://github.com/pytorch/ao)
- [vLLM Quantization](https://docs.vllm.ai/en/latest/quantization/)