init
56 transformers/docs/source/en/quantization/aqlm.md Normal file
@@ -0,0 +1,56 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# AQLM

Additive Quantization of Language Models ([AQLM](https://huggingface.co/papers/2401.06118)) quantizes multiple weights together and takes advantage of interdependencies between them. AQLM represents groups of 8-16 weights as a sum of multiple vector codes.

AQLM also supports fine-tuning with [LoRA](https://huggingface.co/docs/peft/package_reference/lora) with the [PEFT](https://huggingface.co/docs/peft) library, and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
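For example, a LoRA adapter can be attached to an AQLM checkpoint through PEFT. The sketch below is illustrative only; the adapter rank, alpha, and target modules are assumptions you should tune for your own setup, not values prescribed by AQLM.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the AQLM-quantized base model (same checkpoint as the loading example below).
model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    dtype="auto",
    device_map="auto",
)

# Illustrative LoRA settings; pick target modules that match your architecture.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```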
Run the command below to install the AQLM library with kernel support for both GPU and CPU inference and training. AQLM only works with Python 3.10+.

```bash
pip install aqlm[gpu,cpu]
```

Load an AQLM-quantized model with [`~PreTrainedModel.from_pretrained`].

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
    dtype="auto",
    device_map="auto"
)
```

## Configurations

AQLM quantization setups vary mainly in the number of codebooks used, as well as codebook sizes in bits. The most popular setups and supported inference kernels are shown below.

| Kernel | Number of codebooks | Codebook size, bits | Notation | Accuracy | Speedup | Fast GPU inference | Fast CPU inference |
|---|---|---|---|---|---|---|---|
| Triton | K | N | KxN | - | Up to ~0.7x | ✅ | ❌ |
| CUDA | 1 | 16 | 1x16 | Best | Up to ~1.3x | ✅ | ❌ |
| CUDA | 2 | 8 | 2x8 | OK | Up to ~3.0x | ✅ | ❌ |
| Numba | K | 8 | Kx8 | Good | Up to ~4.0x | ❌ | ✅ |

## Resources

Run the AQLM demo [notebook](https://colab.research.google.com/drive/1-xZmBRXT5Fm3Ghn4Mwa2KRypORXb855X?usp=sharing) for more examples of how to quantize a model, push a quantized model to the Hub, and more.

For more example demo notebooks, visit the AQLM [repository](https://github.com/Vahe1994/AQLM).
286 transformers/docs/source/en/quantization/auto_round.md Normal file
@@ -0,0 +1,286 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# AutoRound

[AutoRound](https://github.com/intel/auto-round) is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision.
It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps. Designed for broad compatibility, it seamlessly supports a wide range of LLMs and is actively expanding to cover more VLMs as well.
It also supports quantization and inference across multiple hardware platforms, including CPU, XPU, and CUDA.

AutoRound also offers a variety of useful features, including mixed-bit tuning and inference, lm-head quantization, support for exporting to formats like GPTQ/AWQ/GGUF, and flexible tuning recipes.
For a comprehensive overview and the latest updates, check out the AutoRound [README](https://github.com/intel/auto-round).

AutoRound was originally developed as part of the [Intel Neural Compressor](https://github.com/intel/neural-compressor), serving as a general-purpose model compression library for deep learning.
It has since evolved into a standalone library focused specifically on low-precision optimization for large language models (LLMs).
AutoRound remains fully integrated with the Intel Neural Compressor, and you can explore the repository for more details.

## Installation

```bash
pip install auto-round
```

## Supported Quantization Configurations

AutoRound supports several quantization configurations. A sketch of how a mixed-bit setup is expressed follows this list.

- **Int8 Weight Only**
- **Int4 Weight Only**
- **Int3 Weight Only**
- **Int2 Weight Only**
- **Mixed-Bit Weight Only**
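Mixed-bit quantization is driven by a per-layer `layer_config` dictionary. The sketch below reuses the layer name from the commented example in the API section later in this guide; which layers to demote, and to how many bits, is a per-model choice and the values here are only illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Most layers are quantized to 4 bits; one projection (chosen here purely as an
# illustration) is pushed down to 2 bits with a smaller group size.
layer_config = {
    "model.decoder.layers.6.self_attn.out_proj": {"bits": 2, "group_size": 32},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    layer_config=layer_config,
)
autoround.quantize_and_save("./tmp_autoround", format="auto_round")
```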
## Hardware Compatibility

CPU, XPU, and CUDA are supported for both quantization and inference.

## Quantization and Serialization (offline)

Currently, only offline mode is supported to generate quantized models.

<hfoptions id="quantization">
<hfoption id="quantization cmd">

### Command Line Usage

```bash
auto-round \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --output_dir ./tmp_autoround
```

AutoRound also offers two additional recipes, `auto-round-best` and `auto-round-light`, designed for optimal accuracy and improved speed, respectively.
For 2 bits, we recommend using `auto-round-best` or `auto-round`.
</hfoption>

<hfoption id="quantization auto-round api">

### AutoRound API Usage
This setting offers a better trade-off between accuracy and tuning cost, and is recommended in all scenarios.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
bits, group_size, sym = 4, 128, True
# mixed bits config
# layer_config = {"model.decoder.layers.6.self_attn.out_proj": {"bits": 2, "group_size": 32}}
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
    # enable_torch_compile=True,
    # layer_config=layer_config,
)

output_dir = "./tmp_autoround"
# format= 'auto_round'(default), 'auto_gptq', 'auto_awq'
autoround.quantize_and_save(output_dir, format='auto_round')
```

</hfoption>

<hfoption id="quantization auto-round-best">

### AutoRoundBest recipe
This setting provides the best accuracy in most scenarios but is 4–5× slower than the standard AutoRound recipe. It is especially recommended for 2-bit quantization and is a good choice if sufficient resources are available.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
    nsamples=512,
    iters=1000,
    low_gpu_mem_usage=True
)

output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format='auto_round')
```

</hfoption>

<hfoption id="quantization auto-round-light">

### AutoRoundLight recipe
This setting offers the best speed (2-3x faster than AutoRound), but it may cause a significant accuracy drop for small models and 2-bit quantization. It is recommended for 4-bit settings and models larger than 3B.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
bits, group_size, sym = 4, 128, True
autoround = AutoRound(
    model,
    tokenizer,
    bits=bits,
    group_size=group_size,
    sym=sym,
    iters=50,
    lr=5e-3,
)

output_dir = "./tmp_autoround"
autoround.quantize_and_save(output_dir, format='auto_round')
```

</hfoption>

</hfoptions>
W4G128 average accuracy across 13 tasks (mmlu-pro, if_eval, gsm8k, etc.) and time cost results (testing was conducted on an Nvidia A100 80GB with PyTorch 2.6.0 and enable_torch_compile):

| Model | Qwen2.5-0.5B-Instruct | Falcon3-3B | Qwen2.5-7B-Instruct | Meta-Llama-3.1-8B-Instruct | Falcon3-10B | Qwen2.5-72B-Instruct |
|---------|--------------------|---------------|------------------|----------------------------|---------------|-------------------|
| 16bits | 0.4192 | 0.5203 | 0.6470 | 0.6212 | 0.6151 | 0.7229 |
| Best | **0.4137**(7m) | **0.5142**(23m) | 0.6426(58m) | **0.6116**(65m) | **0.6092**(81m) | 0.7242(575m) |
| Default | 0.4129(2m) | 0.5133(6m) | 0.6441(13m) | 0.6106(13m) | 0.6080(18m) | **0.7252**(118m) |
| Light | 0.4052(2m) | 0.5108(3m) | **0.6453**(5m) | 0.6104(6m) | 0.6063(6m) | 0.7243(37m) |
## Inference

AutoRound automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is found.

<hfoptions id="inference">
<hfoption id="inference cpu">

### CPU

Supports 2, 4, and 8 bits. We recommend using intel-extension-for-pytorch (IPEX) for 4-bit inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

</hfoption>

<hfoption id="inference xpu">

### XPU

Supports 4 bits only. We recommend using intel-extension-for-pytorch (IPEX) for inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="xpu", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

</hfoption>

<hfoption id="inference cuda">

### CUDA

Supports 2, 3, 4, and 8 bits. We recommend using GPTQModel for 4- and 8-bit inference.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cuda", dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

</hfoption>

<hfoption id="inference backend">

### Specify Inference Backend

AutoRound automatically selects the backend for each layer based on compatibility. In general, the priority order is Marlin > ExLLaMAV2 > Triton, but the final choice depends on factors such as group size, bit width, packing format, hardware device, and other implementation details. For more details, please refer to [backends](https://github.com/intel/auto-round?tab=readme-ov-file#specify-backend).

The automatically chosen backend may not always be the most suitable for certain devices.
You can specify your preferred backend, such as "ipex" for CPU, "ipex/triton" for XPU, or "marlin/exllamav2/triton" for CUDA, according to your needs or hardware compatibility. Please note that additional corresponding libraries may be required.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "OPEA/Qwen2.5-1.5B-Instruct-int4-sym-inc"
quantization_config = AutoRoundConfig(backend="ipex")
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", quantization_config=quantization_config, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

</hfoption>

<hfoption id="format convert">

### Convert GPTQ/AWQ to AutoRound

Most GPTQ/AWQ models can be converted to the AutoRound format for better compatibility and support on Intel devices. Please note that the quantization config will be changed if the model is serialized.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoRoundConfig

model_name = "ybelkada/opt-125m-gptq-4bit"
quantization_config = AutoRoundConfig()
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="cpu", quantization_config=quantization_config, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50, do_sample=False)[0]))
```

</hfoption>

</hfoptions>
## Issues

If you encounter any issues with the transformers integration, please open an issue on
the [transformers](https://github.com/huggingface/transformers/issues) repository.
If you encounter any issues with auto-round, please open an issue on
the [AutoRound](https://github.com/intel/auto-round/issues) repository.

## Acknowledgement

Special thanks to open-source low-precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.

## Contribution

Contributions to [AutoRound](https://github.com/intel/auto-round/pulls) are welcome and greatly appreciated!
Whether it's fixing bugs, improving documentation, adding new features, or suggesting improvements, your help is always valued.
254 transformers/docs/source/en/quantization/awq.md Normal file
@@ -0,0 +1,254 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# AWQ

[Activation-aware Weight Quantization (AWQ)](https://hf.co/papers/2306.00978) preserves a small fraction of the weights that are important for LLM performance to compress a model to 4-bits with minimal performance degradation.

There are several libraries for quantizing models with the AWQ algorithm, such as [llm-awq](https://github.com/mit-han-lab/llm-awq), [autoawq](https://github.com/casper-hansen/AutoAWQ) or [optimum-intel](https://huggingface.co/docs/optimum/main/en/intel/optimization_inc). Transformers supports loading models quantized with the llm-awq and autoawq libraries. This guide will show you how to load models quantized with autoawq, but the process is similar for llm-awq quantized models.

Run the command below to install autoawq.

```bash
pip install autoawq
```

> [!WARNING]
> AutoAWQ downgrades Transformers to version 4.47.1. If you want to do inference with AutoAWQ, you may need to reinstall your desired Transformers version after installing AutoAWQ.

Identify an AWQ-quantized model by checking the `quant_method` key in the model's [config.json](https://huggingface.co/TheBloke/zephyr-7B-alpha-AWQ/blob/main/config.json) file.

```json
{
  "_name_or_path": "/workspace/process/huggingfaceh4_zephyr-7b-alpha/source",
  "architectures": [
    "MistralForCausalLM"
  ],
  ...
  ...
  ...
  "quantization_config": {
    "quant_method": "awq",
    "zero_point": true,
    "group_size": 128,
    "bits": 4,
    "version": "gemm"
  }
}
```

Load the AWQ-quantized model with [`~PreTrainedModel.from_pretrained`]. This automatically sets the other weights to fp16 by default for performance reasons. Use the `dtype` parameter to load these other weights in a different format.

If the model is loaded on the CPU, use the `device_map` parameter to move it to an accelerator.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, infer_device
import torch

device = f"{infer_device()}:0"

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    dtype=torch.float32,
    device_map=device
)
```

Use `attn_implementation` to enable [FlashAttention2](../perf_infer_gpu_one#flashattention-2) to further accelerate inference.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/zephyr-7B-alpha-AWQ",
    attn_implementation="flash_attention_2",
    device_map="cuda:0"
)
```
## Fused modules

Fused modules offer improved accuracy and performance. They are supported out-of-the-box for AWQ modules for [Llama](https://huggingface.co/meta-llama) and [Mistral](https://huggingface.co/mistralai/Mistral-7B-v0.1) architectures, but you can also fuse AWQ modules for unsupported architectures.

> [!WARNING]
> Fused modules cannot be combined with other optimization techniques such as FlashAttention2.

<hfoptions id="fuse">
<hfoption id="supported architectures">

Create an [`AwqConfig`] and set the parameters `fuse_max_seq_len` and `do_fuse=True` to enable fused modules. The `fuse_max_seq_len` parameter is the total sequence length and it should include the context length and the expected generation length. Set it to a larger value to be safe.

The example below fuses the AWQ modules of the [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model.

```python
import torch
from transformers import AwqConfig, AutoModelForCausalLM

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    do_fuse=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-OpenOrca-AWQ",
    quantization_config=quantization_config
).to(0)
```

The [TheBloke/Mistral-7B-OpenOrca-AWQ](https://huggingface.co/TheBloke/Mistral-7B-OpenOrca-AWQ) model was benchmarked with `batch_size=1` with and without fused modules.

<figcaption class="text-center text-gray-500 text-lg">Unfused module</figcaption>

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 60.0984 | 38.4537 | 4.50 GB (5.68%) |
| 1 | 64 | 64 | 1333.67 | 31.6604 | 4.50 GB (5.68%) |
| 1 | 128 | 128 | 2434.06 | 31.6272 | 4.50 GB (5.68%) |
| 1 | 256 | 256 | 3072.26 | 38.1731 | 4.50 GB (5.68%) |
| 1 | 512 | 512 | 3184.74 | 31.6819 | 4.59 GB (5.80%) |
| 1 | 1024 | 1024 | 3148.18 | 36.8031 | 4.81 GB (6.07%) |
| 1 | 2048 | 2048 | 2927.33 | 35.2676 | 5.73 GB (7.23%) |

<figcaption class="text-center text-gray-500 text-lg">Fused module</figcaption>

| Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (VRAM) |
|-------------:|-----------------:|----------------:|-------------------:|------------------:|:----------------|
| 1 | 32 | 32 | 81.4899 | 80.2569 | 4.00 GB (5.05%) |
| 1 | 64 | 64 | 1756.1 | 106.26 | 4.00 GB (5.05%) |
| 1 | 128 | 128 | 2479.32 | 105.631 | 4.00 GB (5.06%) |
| 1 | 256 | 256 | 1813.6 | 85.7485 | 4.01 GB (5.06%) |
| 1 | 512 | 512 | 2848.9 | 97.701 | 4.11 GB (5.19%) |
| 1 | 1024 | 1024 | 3044.35 | 87.7323 | 4.41 GB (5.57%) |
| 1 | 2048 | 2048 | 2715.11 | 89.4709 | 5.57 GB (7.04%) |

The speed and throughput of fused and unfused modules were also tested with the [optimum-benchmark](https://github.com/huggingface/optimum-benchmark) library.

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_forward_memory_plot.png" alt="forward peak memory per batch size" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">forward peak memory/batch size</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/quantization/fused_generate_throughput_plot.png" alt="generate throughput per batch size" />
    <figcaption class="mt-2 text-center text-sm text-gray-500">generate throughput/batch size</figcaption>
  </div>
</div>

</hfoption>
<hfoption id="unsupported architectures">

For architectures that don't support fused modules, create an [`AwqConfig`] and define a custom fusing mapping in `modules_to_fuse` to determine which modules need to be fused.

The example below fuses the AWQ modules of the [TheBloke/Yi-34B-AWQ](https://huggingface.co/TheBloke/Yi-34B-AWQ) model.

```python
import torch
from transformers import AwqConfig, AutoModelForCausalLM

quantization_config = AwqConfig(
    bits=4,
    fuse_max_seq_len=512,
    modules_to_fuse={
        "attention": ["q_proj", "k_proj", "v_proj", "o_proj"],
        "layernorm": ["ln1", "ln2", "norm"],
        "mlp": ["gate_proj", "up_proj", "down_proj"],
        "use_alibi": False,
        "num_attention_heads": 56,
        "num_key_value_heads": 8,
        "hidden_size": 7168
    }
)

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Yi-34B-AWQ",
    quantization_config=quantization_config
).to(0)
```

The parameter `modules_to_fuse` should include the following keys.

- `"attention"`: The names of the attention layers to fuse in the following order: query, key, value and output projection layer. If you don't want to fuse these layers, pass an empty list.
- `"layernorm"`: The names of all the LayerNorm layers you want to replace with a custom fused LayerNorm. If you don't want to fuse these layers, pass an empty list.
- `"mlp"`: The names of the MLP layers you want to fuse into a single MLP layer, in the order: gate (the dense layer after attention), up, and down projections.
- `"use_alibi"`: Whether your model uses ALiBi positional embeddings.
- `"num_attention_heads"`: The number of attention heads.
- `"num_key_value_heads"`: The number of key value heads that should be used to implement Grouped Query Attention (GQA), as summarized in the table below.

| parameter value | attention |
|---|---|
| `num_key_value_heads=num_attention_heads` | Multi-Head Attention |
| `num_key_value_heads=1` | Multi-Query Attention |
| `num_key_value_heads=...` | Grouped Query Attention |

- `"hidden_size"`: The dimension of the hidden representations.

</hfoption>
</hfoptions>
## ExLlamaV2

[ExLlamaV2](https://github.com/turboderp/exllamav2) kernels support faster prefill and decoding. Run the command below to install the latest version of autoawq with ExLlamaV2 support.

```bash
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```

Set `version="exllama"` in [`AwqConfig`] to enable ExLlamaV2 kernels.

> [!TIP]
> ExLlamaV2 is supported on AMD GPUs.

```py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

quantization_config = AwqConfig(version="exllama")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Mistral-7B-Instruct-v0.1-AWQ",
    quantization_config=quantization_config,
    device_map="auto",
)
```

## CPU

[Intel Extension for PyTorch (IPEX)](https://intel.github.io/intel-extension-for-pytorch/cpu/latest/) is designed to enable performance optimizations on Intel hardware. Run the command below to install the latest version of autoawq with IPEX support.

```bash
pip install intel-extension-for-pytorch # for IPEX-GPU refer to https://intel.github.io/intel-extension-for-pytorch/xpu/2.5.10+xpu/
pip install git+https://github.com/casper-hansen/AutoAWQ.git
```

Set `version="ipex"` in [`AwqConfig`] to enable IPEX kernels.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

device = "cpu" # set to "xpu" for Intel GPU
quantization_config = AwqConfig(version="ipex")

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/TinyLlama-1.1B-Chat-v0.3-AWQ",
    quantization_config=quantization_config,
    device_map=device,
)
```

## Resources

Run the AWQ demo [notebook](https://colab.research.google.com/drive/1HzZH89yAXJaZgwJDhQj9LqSBux932BvY#scrollTo=Wwsg6nCwoThm) for more examples of how to quantize a model, push a quantized model to the Hub, and more.
48 transformers/docs/source/en/quantization/bitnet.md Normal file
@@ -0,0 +1,48 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# BitNet

[BitNet](https://huggingface.co/papers/2402.17764) replaces traditional linear layers in Multi-Head Attention and feed-forward networks with specialized BitLinear layers. The BitLinear layers quantize the weights using ternary precision (with values of -1, 0, and 1) and quantize the activations to 8-bit precision.

<figure style="text-align: center;">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/1.58llm_extreme_quantization/bitlinear.png" alt="The architecture of BitNet with BitLinear layers" />
  <figcaption>The architecture of BitNet with BitLinear layers.</figcaption>
</figure>

BitNet models can't be quantized on the fly. They need to be quantized during pretraining or fine-tuning because BitNet is a Quantization-Aware Training (QAT) technique. During training, the weights are quantized to ternary values with symmetric per-tensor quantization (a short code sketch of these steps follows the list).

1. Compute the average of the absolute values of the weight matrix and use it as the scale.
2. Divide the weights by the scale, round the values, constrain them between -1 and 1, and rescale them to continue in full precision.
3. Activations are quantized to a specified bit-width (8-bit) using [absmax](https://huggingface.co/papers/2208.07339) quantization (symmetric per-channel quantization). This involves scaling the activations into a range of [−128, 127].
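As a rough illustration of the math only, the sketch below applies the absmean weight scaling and absmax activation scaling described in the steps above. It is a simplified standalone example, not the BitLinear implementation used in Transformers, and the tensor shapes and the choice of the last dimension as the "channel" axis are assumptions.

```python
import torch

def absmean_ternary_quantize(weight: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Steps 1-2: absmean scaling followed by ternary rounding."""
    scale = weight.abs().mean().clamp(min=eps)      # step 1: per-tensor scale
    w_q = (weight / scale).round().clamp(-1, 1)     # step 2: ternary values in {-1, 0, 1}
    return w_q * scale                              # rescale to continue training in full precision

def absmax_int8_quantize(x: torch.Tensor, eps: float = 1e-5):
    """Step 3: symmetric absmax quantization of activations to the int8 range."""
    scale = 127.0 / x.abs().amax(dim=-1, keepdim=True).clamp(min=eps)
    return (x * scale).round().clamp(-128, 127), scale

w = torch.randn(16, 16)
x = torch.randn(2, 16)
w_dequant = absmean_ternary_quantize(w)
x_q, x_scale = absmax_int8_quantize(x)
```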
Refer to this [PR](https://github.com/huggingface/nanotron/pull/180) to pretrain or fine-tune a 1.58-bit model with [Nanotron](https://github.com/huggingface/nanotron). For fine-tuning, convert a model from the Hugging Face format to the Nanotron format. Find the conversion steps in this [PR](https://github.com/huggingface/nanotron/pull/174).

Load a BitNet quantized model with [`~PreTrainedModel.from_pretrained`].

```py
from transformers import AutoModelForCausalLM
path = "/path/to/model"
model = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
```

## Kernels

`@torch.compile` is used to unpack the weights and perform the forward pass. It's very straightforward to implement and delivers significant speed improvements. Additional optimized kernels will be integrated in future versions.

## Resources

Read [Fine-tuning LLMs to 1.58bit: extreme quantization made easy](https://huggingface.co/blog/1_58_llm_extreme_quantization) to learn more about how BitNet models are trained and fine-tuned.
340 transformers/docs/source/en/quantization/bitsandbytes.md Normal file
@@ -0,0 +1,340 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Bitsandbytes

The [bitsandbytes](https://github.com/bitsandbytes-foundation/bitsandbytes) library provides quantization tools for LLMs through a lightweight Python wrapper around CUDA functions. It enables working with large models using limited computational resources by reducing their memory footprint.

At its core, bitsandbytes provides:

- **Quantized Linear Layers**: `Linear8bitLt` and `Linear4bit` layers that replace standard PyTorch linear layers with memory-efficient quantized alternatives
- **Optimized Optimizers**: 8-bit versions of common optimizers through its `optim` module, enabling training of large models with reduced memory requirements
- **Matrix Multiplication**: Optimized matrix multiplication operations that leverage the quantized format

bitsandbytes offers two main quantization features:

1. **LLM.int8()** - An 8-bit quantization method that makes inference more accessible without significant performance degradation. Unlike naive quantization, [LLM.int8()](https://hf.co/papers/2208.07339) dynamically preserves higher precision for critical computations, preventing information loss in sensitive parts of the model.

2. **QLoRA** - A 4-bit quantization technique that compresses models even further while maintaining trainability by inserting a small set of trainable low-rank adaptation (LoRA) weights.

> **Note:** For a user-friendly quantization experience, you can use the `bitsandbytes` [community space](https://huggingface.co/spaces/bnb-community/bnb-my-repo).

Run the command below to install bitsandbytes.

```bash
pip install --upgrade transformers accelerate bitsandbytes
```

To compile from source, follow the instructions in the [bitsandbytes installation guide](https://huggingface.co/docs/bitsandbytes/main/en/installation).

## Hardware Compatibility

bitsandbytes is currently only supported on CUDA GPUs for CUDA versions 11.0 - 12.8. However, there's an ongoing multi-backend effort under development, which is currently in alpha. If you're interested in providing feedback or testing, check out the [bitsandbytes repository](https://github.com/bitsandbytes-foundation/bitsandbytes) for more information.

### CUDA

| Feature | Minimum Hardware Requirement |
|---------|-------------------------------|
| 8-bit optimizers | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |
| LLM.int8() | NVIDIA Turing (RTX 20 series, T4) or newer GPUs |
| NF4/FP4 quantization | NVIDIA Maxwell (GTX 900 series, TITAN X, M40) or newer GPUs * |

### Multi-backend

| Backend | Supported Versions | Python versions | Architecture Support | Status |
|---------|-------------------|----------------|---------------------|---------|
| AMD ROCm | 6.1+ | 3.10+ | minimum CDNA - gfx90a, RDNA - gfx1100 | Alpha |
| Apple Silicon (MPS) | WIP | 3.10+ | M1/M2 chips | Planned |
| Intel CPU | v2.4.0+ (ipex) | 3.10+ | Intel CPU | Alpha |
| Intel GPU | v2.4.0+ (ipex) | 3.10+ | Intel GPU | Experimental |
| Ascend NPU | 2.1.0+ (torch_npu) | 3.10+ | Ascend NPU | Experimental |

> **Note:** bitsandbytes is moving away from the multi-backend approach towards using [PyTorch Custom Operators](https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html) as the main mechanism for supporting new hardware and dispatching to the correct backend.
## Quantization Examples

Quantize a model by passing a [`BitsAndBytesConfig`] to [`~PreTrainedModel.from_pretrained`]. This works for any model in any modality, as long as it supports [Accelerate](https://huggingface.co/docs/accelerate/index) and contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.

<hfoptions id="bnb">
<hfoption id="8-bit">
<div class="bnb-container" style="border: 1px solid #ddd; border-radius: 8px; padding: 20px; margin: 20px 0">
Quantizing a model in 8-bit halves the memory usage, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=quantization_config
)
```

By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are set to the default torch dtype. You can change the data type of these modules with the `dtype` parameter. Setting `dtype="auto"` loads the model in the data type defined in a model's `config.json` file.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    quantization_config=quantization_config,
    dtype="auto"
)
model_8bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
```

Once a model is quantized to 8-bit, you can't push the quantized weights to the Hub unless you're using the latest version of Transformers and bitsandbytes. If you have the latest versions, then you can push the 8-bit model to the Hub with [`~PreTrainedModel.push_to_hub`]. The quantization config.json file is pushed first, followed by the quantized model weights.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    quantization_config=quantization_config
)

model.push_to_hub("bloom-560m-8bit")
```

</div>
</hfoption>
<hfoption id="4-bit">
<div class="bnb-container" style="border: 1px solid #ddd; border-radius: 8px; padding: 20px; margin: 20px 0">
Quantizing a model in 4-bit reduces your memory usage by 4x, and for large models, set `device_map="auto"` to efficiently distribute the weights across all available GPUs.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    device_map="auto",
    quantization_config=quantization_config
)
```

By default, all other modules such as [torch.nn.LayerNorm](https://pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html) are converted to `torch.float16`. You can change the data type of these modules with the `dtype` parameter. Setting `dtype="auto"` loads the model in the data type defined in a model's `config.json` file.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",
    device_map="auto",
    quantization_config=quantization_config,
    dtype="auto"
)
model_4bit.model.decoder.layers[-1].final_layer_norm.weight.dtype
```

Make sure you have the latest bitsandbytes version so you can serialize 4-bit models and push them to the Hub with [`~PreTrainedModel.push_to_hub`]. Use [`~PreTrainedModel.save_pretrained`] to save the 4-bit model locally.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    quantization_config=quantization_config
)

model.push_to_hub("bloom-560m-4bit")
```

</div>
</hfoption>
</hfoptions>

> [!WARNING]
> 8-bit and 4-bit training is only supported for training *extra* parameters.

Check your memory footprint with `get_memory_footprint`.

```py
print(model.get_memory_footprint())
```
Load quantized models with [`~PreTrainedModel.from_pretrained`] without a `quantization_config`.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{your_username}/bloom-560m-8bit", device_map="auto")
```

## LLM.int8

This section explores some of the specific features of 8-bit quantization, such as offloading, outlier thresholds, skipping module conversion, and finetuning.

### Offloading

8-bit models can offload weights between the CPU and GPU to fit very large models into memory. The weights dispatched to the CPU are stored in **float32** and aren't converted to 8-bit. For example, enable offloading for [bigscience/bloom-1b7](https://huggingface.co/bigscience/bloom-1b7) through [`BitsAndBytesConfig`].

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(llm_int8_enable_fp32_cpu_offload=True)
```

Design a custom device map to fit everything on your GPU except for the `lm_head`, which is dispatched to the CPU.

```py
device_map = {
    "transformer.word_embeddings": 0,
    "transformer.word_embeddings_layernorm": 0,
    "lm_head": "cpu",
    "transformer.h": 0,
    "transformer.ln_f": 0,
}
```

Now load your model with the custom `device_map` and `quantization_config`.

```py
model_8bit = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-1b7",
    dtype="auto",
    device_map=device_map,
    quantization_config=quantization_config,
)
```

### Outlier threshold

An "outlier" is a hidden state value greater than a certain threshold, and these values are computed in fp16. While the values are usually normally distributed ([-3.5, 3.5]), this distribution can be very different for large models ([-60, 6] or [6, 60]). 8-bit quantization works well for values up to about 5, but beyond that, there is a significant performance penalty. A good default threshold value is 6, but a lower threshold may be needed for more unstable models (small models or finetuning).

To find the best threshold for your model, experiment with the `llm_int8_threshold` parameter in [`BitsAndBytesConfig`]. For example, setting the threshold to `0.0` significantly speeds up inference at the potential cost of some accuracy loss.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_threshold=0.0,
    llm_int8_enable_fp32_cpu_offload=True
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map=device_map,
    quantization_config=quantization_config,
)
```

### Skip module conversion

For some models, like [Jukebox](model_doc/jukebox), you don't need to quantize every module to 8-bit because it can actually cause instability. With Jukebox, there are several `lm_head` modules that should be skipped using the `llm_int8_skip_modules` parameter in [`BitsAndBytesConfig`].

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "bigscience/bloom-1b7"

quantization_config = BitsAndBytesConfig(
    llm_int8_skip_modules=["lm_head"],
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)
```

### Finetuning

The [PEFT](https://github.com/huggingface/peft) library supports fine-tuning large models like [flan-t5-large](https://huggingface.co/google/flan-t5-large) and [facebook/opt-6.7b](https://huggingface.co/facebook/opt-6.7b) with 8-bit quantization. You don't need to pass the `device_map` parameter for training because it automatically loads your model on a GPU. However, you can still customize the device map with the `device_map` parameter (`device_map="auto"` should only be used for inference).
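A minimal sketch of this flow is shown below, assuming an 8-bit quantized OPT model and a LoRA adapter; the LoRA hyperparameters and target modules are illustrative choices, not fixed requirements.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Load the base model in 8-bit.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

# Prepare the quantized model for training (casts norms/embeddings, enables input grads).
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA settings; target modules depend on the architecture.
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable
```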
## QLoRA

This section explores some of the specific features of 4-bit quantization, such as changing the compute data type, the Normal Float 4 (NF4) data type, and nested quantization.

### Compute data type

Change the data type from float32 (the default value) to bf16 in [`BitsAndBytesConfig`] to speed up computation.

```py
import torch
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
```

### Normal Float 4 (NF4)

NF4 is a 4-bit data type from the [QLoRA](https://hf.co/papers/2305.14314) paper, adapted for weights initialized from a normal distribution. You should use NF4 for training 4-bit base models.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", quantization_config=nf4_config)
```

For inference, the `bnb_4bit_quant_type` does not have a huge impact on performance. However, to remain consistent with the model weights, you should use the same `bnb_4bit_compute_dtype` and `dtype` values.

### Nested quantization

Nested quantization can save additional memory at no additional performance cost. This feature performs a second quantization of the already quantized weights to save an additional 0.4 bits/parameter. For example, with nested quantization, you can finetune a [Llama-13b](https://huggingface.co/meta-llama/Llama-2-13b) model on a 16GB NVIDIA T4 GPU with a sequence length of 1024, a batch size of 1, and gradient accumulation over 4 steps.

```py
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model_double_quant = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-13b-chat-hf", dtype="auto", quantization_config=double_quant_config)
```

## Dequantizing bitsandbytes models

Once quantized, you can [`~PreTrainedModel.dequantize`] a model to the original precision, but this may result in some quality loss. Make sure you have enough GPU memory to fit the dequantized model.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=BitsAndBytesConfig(load_in_4bit=True))
model.dequantize()
```

## Resources

Learn more about the details of 8-bit quantization in [A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using Hugging Face Transformers, Accelerate and bitsandbytes](https://huggingface.co/blog/hf-bitsandbytes-integration).

Try 4-bit quantization in this [notebook](https://colab.research.google.com/drive/1ge2F1QSK8Q7h0hn3YKuBCOAS0bK8E0wf) and learn more about its details in [Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA](https://huggingface.co/blog/4bit-transformers-bitsandbytes).
190 transformers/docs/source/en/quantization/compressed_tensors.md Normal file
@@ -0,0 +1,190 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# compressed-tensors

[compressed-tensors](https://github.com/neuralmagic/compressed-tensors) extends [safetensors](https://github.com/huggingface/safetensors) files to compressed tensor data types to provide a unified checkpoint format for storing and loading various quantization and sparsity formats such as dense, int-quantized (int8), float-quantized (fp8), and pack-quantized (int4 or int8 weight-quantized packed into int32).

compressed-tensors supports fine-tuning with [PEFT](https://huggingface.co/docs/peft) and includes the following features as well.

- fp8, int4, int8 weight and activation precisions.
- Quantization scale and zero-point strategies for [tensor, channel, group, block, token](https://github.com/neuralmagic/compressed-tensors/blob/83b2e7a969d70606421a76b9a3d112646077c8de/src/compressed_tensors/quantization/quant_args.py#L43-L52).
- Dynamic per-token activation quantization (or any static strategy).
- Weight sparsity (unstructured or semi-structured like 2:4) can be composed with quantization for extreme compression.
- Quantization of arbitrary modules, not just [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules.
- Targeted support for specific modules by name or class.

Install compressed-tensors from [PyPI](https://pypi.org/project/compressed-tensors) to get the latest stable release (recommended) or install it from source to get the latest features.

<hfoptions id="install">
<hfoption id="PyPI">

```bash
pip install compressed-tensors
```

</hfoption>
<hfoption id="source code">

```bash
git clone https://github.com/neuralmagic/compressed-tensors
cd compressed-tensors
pip install -e .
```

</hfoption>
</hfoptions>

Search using the compressed-tensors [tag](https://huggingface.co/models?other=compressed-tensors) to find a compatible model on the Hugging Face Hub.

Only models that have already been quantized can be loaded at the moment, and once a model is loaded, it cannot be saved. To quantize a model into the compressed-tensors format, see [llm-compressor](https://github.com/vllm-project/llm-compressor). Alternatively, models can be created independently and serialized with a compressed-tensors config.
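As a rough sketch of what producing such a checkpoint with llm-compressor can look like (adapted from that project's README; the import paths, the `oneshot` entry point, and the `FP8_DYNAMIC` scheme are assumptions that may differ between llm-compressor versions, so check its documentation before relying on them):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize all Linear weights and activations to fp8 with dynamic activation scales,
# leaving lm_head unquantized (this mirrors the "ignore" entry shown below).
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save a checkpoint whose config.json carries a compressed-tensors quantization_config.
model.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
tokenizer.save_pretrained("Meta-Llama-3-8B-Instruct-FP8-Dynamic")
```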
```python
from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf", device_map="auto")

# measure memory usage
mem_params = sum([param.nelement()*param.element_size() for param in ct_model.parameters()])
print(f"{mem_params/2**30:.4f} GB")
# 8.4575 GB
```

## Model checkpoint

compressed-tensors models are defined through their quantization configuration entry. The following example is taken from the [nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf/blob/main/config.json) `config.json` file.

There are a lot of entries to allow for flexible expression both during and after compression, but the entries for loading and inference can be simplified to focus on just a few key entries.

```json
"quantization_config": {
  "config_groups": {
    "group_0": {
      "input_activations": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      },
      "targets": ["Linear"],
      "weights": {
        "num_bits": 8,
        "strategy": "tensor",
        "type": "float"
      }
    }
  },
  "format": "naive-quantized",
  "ignore": ["lm_head"],
  "quant_method": "compressed-tensors",
  "quantization_status": "frozen"
},
```

The config file specifies the quantization of a config group (`group_0`), which includes weight and activation quantization to fp8 with a static per-tensor strategy. The `lm_head` module is unquantized, as shown in the `ignore` key.

For a more detailed look at the model weights, use the [safetensors viewer](https://huggingface.co/nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf?show_file_info=model.safetensors.index.json) on the model card to see the quantized weights, input scale, and weight scale for all [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules.

| Tensors | Shape | Precision |
| ------- | ----- | --------- |
| model.layers.0.input_layernorm.weight | [4 096] | BF16 |
| model.layers.0.mlp.down_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.down_proj.weight | [4 096, 14 336] | F8_E4M3 |
| model.layers.0.mlp.down_proj.weight_scale | [1] | BF16 |
| model.layers.0.mlp.gate_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.gate_proj.weight | [14 336, 4 096] | F8_E4M3 |
| model.layers.0.mlp.gate_proj.weight_scale | [1] | BF16 |
| model.layers.0.mlp.up_proj.input_scale | [1] | BF16 |
| model.layers.0.mlp.up_proj.weight | [14 336, 4 096] | F8_E4M3 |
| model.layers.0.mlp.up_proj.weight_scale | [1] | BF16 |
| model.layers.0.post_attention_layernorm.weight | [4 096] | BF16 |
| model.layers.0.self_attn.k_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.k_proj.weight | [1 024, 4 096] | F8_E4M3 |
| model.layers.0.self_attn.k_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.o_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.o_proj.weight | [4 096, 4 096] | F8_E4M3 |
| model.layers.0.self_attn.o_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.q_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.q_proj.weight | [4 096, 4 096] | F8_E4M3 |
| model.layers.0.self_attn.q_proj.weight_scale | [1] | BF16 |
| model.layers.0.self_attn.v_proj.input_scale | [1] | BF16 |
| model.layers.0.self_attn.v_proj.weight | [1 024, 4 096] | F8_E4M3 |
| model.layers.0.self_attn.v_proj.weight_scale | [1] | BF16 |

When loading a compressed-tensors model with the [`~quantizers.HFQuantizer`] integration, all the [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) modules specified in the quantization config are replaced by [CompressedLinear](https://github.com/neuralmagic/compressed-tensors/blob/975cb223b19fcac2b98a4271d17668462d4d6e1d/src/compressed_tensors/linear/compressed_linear.py#L30) modules that manage the compressed weights and forward pass for inference. The `lm_head` module is still kept as an unquantized nn.Linear module.

```python
from transformers import AutoModelForCausalLM

ct_model = AutoModelForCausalLM.from_pretrained("nm-testing/Meta-Llama-3.1-8B-Instruct-FP8-hf")
print(ct_model)
"""
LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (k_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (v_proj): CompressedLinear(
            in_features=4096, out_features=1024, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (o_proj): CompressedLinear(
            in_features=4096, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (up_proj): CompressedLinear(
            in_features=4096, out_features=14336, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (down_proj): CompressedLinear(
            in_features=14336, out_features=4096, bias=False
            (input_observer): MovingAverageMinMaxObserver()
            (weight_observer): MovingAverageMinMaxObserver()
          )
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((4096,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
|
||||
)
|
||||
(lm_head): Linear(in_features=4096, out_features=128256, bias=False)
|
||||
)
|
||||
"""
|
||||
```
|
||||
174
transformers/docs/source/en/quantization/concept_guide.md
Normal file
174
transformers/docs/source/en/quantization/concept_guide.md
Normal file
@@ -0,0 +1,174 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Quantization concepts
|
||||
|
||||
Quantization reduces the memory footprint and computational cost of large machine learning models like those found in the Transformers library. It achieves this by representing the model's weights and/or activations with lower-precision data types (like 8-bit integers or int8) instead of the standard 32-bit floating-point (float32).
|
||||
|
||||
Reducing a model's precision offers several significant benefits:
|
||||
|
||||
- Smaller model size: Lower-precision data types require less storage space. An int8 model, for example, is roughly 4 times smaller than its float32 counterpart.
|
||||
- Faster inference: Operations on lower-precision data types, especially integers, can be significantly faster on compatible hardware (CPUs and GPUs often have specialized instructions for int8 operations). This leads to lower latency.
|
||||
- Reduced energy consumption: Faster computations and smaller memory transfers often translate to lower power usage.
|
||||
|
||||
The primary trade-off in quantization is *efficiency* vs. *accuracy*. Reducing precision saves resources but inevitably introduces small errors (quantization noise). The goal is to minimize this error using appropriate schemes (affine/symmetric), granularity (per-tensor/channel), and techniques (PTQ/QAT) so that the model's performance on its target task degrades as little as possible.
|
||||
|
||||
The sections below cover quantization schemes, granularity, and techniques.
|
||||
|
||||
## Quantization schemes
|
||||
|
||||
The core idea is to map the range of values found in the original float32 weights and activations to the much smaller range represented by int8 (typically \\([-128, 127]\\)).
|
||||
|
||||
This section covers how some quantization techniques work.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img width="606" alt="quant_visual" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/quant_visual.png" />
|
||||
</div>
|
||||
|
||||
### Affine quantization
|
||||
|
||||
The most common method is *affine quantization*. For a given float32 tensor (like a layer's weights), it finds the minimum \\(val_{min}\\) and maximum \\(val_{max}\\) values. This range \\([val_{min}, val_{max}]\\) is mapped to the int8 range \\([q_{min}, q_{max}]\\), which is typically \\([-128, 127]\\).
|
||||
|
||||
There are two main ways to perform this mapping, *symmetric* and *asymmetric*. The choice between symmetric and asymmetric quantization determines how the float32 range is mapped to the int8 range.
|
||||
|
||||
- Symmetric: This method assumes the original float32 range is symmetric around zero ( \\([ -a, a ]\\) ). This range is mapped symmetrically to the int8 range, for example, \\([-127, 127]\\). A key characteristic is that the float32 value \\(0.0\\) maps directly to the int8 value \\(0\\). This only requires one parameter, the **scale ( \\(S\\) )**, to define the mapping. It can simplify computations, but it might be less accurate if the original data distribution isn't naturally centered around zero.
|
||||
- Asymmetric (Affine): This method does not assume the data is centered around zero. It maps the exact range \\([val_{min}, val_{max}]\\) from float32 to the full int8 range, like \\([-128, 127]\\). This requires two parameters, a **scale ( \\(S\\) )** and a **zero-point ( \\(Z\\) )**.
|
||||
|
||||
scale ( \\(S\\) ): A positive float32 number representing the ratio between the float32 and the int8 range.
|
||||
|
||||
$$
|
||||
S = \frac{val_{max} - val_{min}}{q_{max} - q_{min}}
|
||||
$$
|
||||
|
||||
zero-point ( \\(Z\\) ): An int8 value that corresponds to the float32 value \\(0.0\\).
|
||||
|
||||
$$
|
||||
Z = q_{min} - round\left(\frac{val_{min}}{S}\right)
|
||||
$$
|
||||
|
||||
> [!TIP]
|
||||
> In symmetric quantization, Z would typically be fixed at 0.
|
||||
|
||||
With these parameters, a float32 value, \\(x\\), can be quantized to int8 ( \\(q\\) ) with the formula below.
|
||||
|
||||
$$
|
||||
q = round\left(\frac{x}{S} + Z\right)
|
||||
$$
|
||||
|
||||
The int8 value, \\(q\\), can be dequantized back to approximate float32 with the formula below.
|
||||
|
||||
$$
|
||||
x \approx S \cdot (q - Z)
|
||||
$$
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img width="606" alt="dequant" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/dequant.png" />
|
||||
</div>
|
||||
|
||||
During inference, computations like matrix multiplication are performed using the int8 values ( \\(q\\) ), and the result is dequantized back to float32 (often using a higher-precision accumulation type like int32 internally) before it is passed to the next layer.
|
||||
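To make the formulas concrete, the sketch below quantizes and dequantizes a small tensor with asymmetric (affine) quantization in plain PyTorch. It is a toy illustration of the math above, not the code any particular backend uses.

```python
import torch

x = torch.tensor([-1.5, -0.4, 0.0, 0.9, 2.3])  # float32 values

q_min, q_max = -128, 127                           # int8 range
scale = (x.max() - x.min()) / (q_max - q_min)      # S
zero_point = q_min - torch.round(x.min() / scale)  # Z

# quantize: q = round(x / S + Z)
q = torch.clamp(torch.round(x / scale + zero_point), q_min, q_max).to(torch.int8)

# dequantize: x ≈ S * (q - Z)
x_hat = scale * (q.float() - zero_point)

print(q)      # int8 codes
print(x_hat)  # approximately x, up to quantization noise
```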
|
||||
### int4 and weight packing
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img width="606" alt="weight packing" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/weight_packing.png" />
|
||||
</div>
|
||||
|
||||
int4 quantization further reduces the model size and memory usage (halving it compared to int8). The same affine or symmetric quantization principles apply, mapping the float32 range to the 16 possible values representable by int4 ( \\([-8, 7]\\) for signed int4).
|
||||
|
||||
A key aspect of int4 quantization is **weight packing**. Since most hardware can't natively handle 4-bit data types in memory, two int4 values are typically packed together into a single int8 byte for storage and transfer. For example, the first value might occupy the lower 4 bits and the second value the upper 4 bits of the byte (`packed_byte = (val1 & 0x0F) | (val2 << 4)`).
|
||||
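The packing expression above can be sketched as a pair of standalone helper functions (a hypothetical example; real kernels do this for entire tensors at once).

```python
def pack_int4_pair(val1: int, val2: int) -> int:
    """Pack two signed int4 values (each in [-8, 7]) into a single byte."""
    return (val1 & 0x0F) | ((val2 & 0x0F) << 4)

def unpack_int4_pair(packed: int) -> tuple[int, int]:
    """Recover the two signed int4 values from a packed byte."""
    def to_signed(nibble: int) -> int:
        return nibble - 16 if nibble >= 8 else nibble
    return to_signed(packed & 0x0F), to_signed(packed >> 4)

packed = pack_int4_pair(-3, 5)
print(packed)                    # 93, i.e. 0b01011101
print(unpack_int4_pair(packed))  # (-3, 5)
```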
|
||||
int4 is still beneficial even without native int4 compute because the primary benefit comes from reduced memory bandwidth. Loading packed int4 weights (stored as int8) from memory (RAM or VRAM) to the compute units is twice as fast as loading int8 weights. For large models, memory access is often a significant bottleneck. The speed up from faster data transfer can outweigh the computational overhead of unpacking and dequantizing on the fly, leading to overall faster inference, especially in memory-bound scenarios.
|
||||
|
||||
However, int4 quantization typically results in a larger accuracy drop compared to int8. Advanced quantization techniques like [GPTQ](./gptq) or [AWQ](./awq) are often necessary for good performance with int4.
|
||||
|
||||
### FP8 Quantization (A8W8)
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img width="606" alt="fp8" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/fp8.png" />
|
||||
</div>
|
||||
A newer datatype, 8-bit floating-point (FP8), offers another way to reduce precision while retaining more accuracy than int8 in certain scenarios. FP8 keeps the floating-point structure (sign, exponent, mantissa) but uses fewer bits.
|
||||
|
||||
There are two common FP8 variants.
|
||||
|
||||
- E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits. Offers higher precision (more mantissa bits) but a smaller dynamic range (fewer exponent bits).
|
||||
- E5M2: 1 sign bit, 5 exponent bits, 2 mantissa bits. Offers a wider dynamic range but lower precision.
|
||||
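PyTorch exposes both variants as (experimental) dtypes, which makes the precision and range trade-off easy to see. A small sketch, assuming a recent PyTorch version with float8 support:

```python
import torch

x = torch.tensor([0.1, 1.5, 300.0])

print(x.to(torch.float8_e4m3fn).float())  # finer steps thanks to the extra mantissa bit
print(x.to(torch.float8_e5m2).float())    # coarser steps, but a wider exponent range

print(torch.finfo(torch.float8_e4m3fn).max)  # ~448
print(torch.finfo(torch.float8_e5m2).max)    # ~57344
```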
|
||||
FP8 is used in the *A8W8* quantization scheme, which quantizes both activations (A) and weights (W) to 8-bit precision.
|
||||
|
||||
While int8 has broad support, efficient FP8 computation requires specific hardware capabilities found in newer GPUs like NVIDIA H100/H200/B100 and AMD Instinct MI300 series. Without native hardware acceleration, the benefits of FP8 might not be fully realized.
|
||||
|
||||
Transformers supports FP8 through specific backends like [FBGEMM](./fbgemm_fp8), [FineGrainedFP8](./finegrained_fp8), and [compressed-tensors](./compressed_tensors). These backends handle the underlying FP8 conversion and computation when the appropriate hardware and configurations are used.
|
||||
|
||||
## Granularity
|
||||
|
||||
Quantization parameters ( \\(S\\) and \\(Z\\)) can be calculated in one of two ways.
|
||||
|
||||
- Per-Tensor: One set of \\(S\\) and \\(Z\\) for the entire tensor. Simpler, but less accurate if data values vary greatly within the tensor.
|
||||
- Per-channel (or per-group/block): Separate \\(S\\) and \\(Z\\) for each channel or group. This is more accurate and performs better, at the cost of slightly more complexity and memory.
|
||||
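The difference is easiest to see on a weight matrix whose rows have very different magnitudes. A toy sketch using symmetric quantization:

```python
import torch

w = torch.tensor([[0.01, -0.02, 0.03],   # small-magnitude channel
                  [5.00, -4.00, 6.00]])  # large-magnitude channel

# per-tensor: a single scale, so the small channel is mostly rounded away
scale_tensor = w.abs().max() / 127
q_tensor = torch.round(w / scale_tensor).clamp(-128, 127)

# per-channel: one scale per output channel (row), so each row keeps its resolution
scale_channel = w.abs().amax(dim=1, keepdim=True) / 127
q_channel = torch.round(w / scale_channel).clamp(-128, 127)

print(q_tensor * scale_tensor)    # first row reconstructs poorly
print(q_channel * scale_channel)  # both rows reconstruct well
```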
|
||||
<div class="flex justify-center">
|
||||
<img width="625" alt="Granularities" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/Granularities.png" />
|
||||
</div>
|
||||
|
||||
## Quantization techniques
|
||||
|
||||
There are two main types of quantization techniques.
|
||||
|
||||
- Post-Training Quantization (PTQ): Quantization is applied *after* the model is fully trained.
|
||||
- Quantization-Aware Training (QAT): Quantization effects are simulated *during* training by inserting "fake quantization" ops that simulate the rounding errors of quantization. This lets the model adapt to quantization, and usually results in better accuracy, especially at lower bit-widths.
|
||||
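A "fake quantization" op simply rounds and dequantizes inside the float32 forward pass while letting gradients flow around the rounding. A simplified sketch using a straight-through estimator (not a full QAT recipe):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric integer quantization noise while staying in float32."""
    q_max = 2 ** (num_bits - 1) - 1
    scale = x.detach().abs().max() / q_max
    q = torch.round(x / scale).clamp(-q_max - 1, q_max)
    x_hat = q * scale
    # straight-through estimator: forward uses x_hat, backward behaves like identity
    return x + (x_hat - x).detach()

w = torch.randn(4, 4, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad)  # gradients flow as if no rounding had happened
```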
|
||||
## Quantization in Transformers
|
||||
|
||||
Transformers integrates several quantization backends such as bitsandbytes, torchao, compressed-tensors, and more (refer to the quantization [overview](./overview) for more backends).
|
||||
|
||||
All backends are unified under the [`HfQuantizer`] API and associated [`QuantizationConfig`] classes. You can integrate your own custom quantization backends by implementing a custom [`HfQuantizer`] and [`QuantizationConfig`], as shown in the [Contribution](./contribute) guide.
|
||||
|
||||
The typical workflow for quantization in Transformers is to:
|
||||
|
||||
1. Choose a quantization method suitable for your hardware and use case (see the [Overview](./overview) or [Selecting a quantization method](./selecting) guide to help you).
|
||||
2. Load a pre-quantized model from the Hugging Face Hub or load a float32/float16/bfloat16 model and apply a specific quantization method with [`QuantizationConfig`].
|
||||
|
||||
The example below demonstrates loading an 8B parameter model and quantizing it to 4-bits with bitsandbytes.
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
|
||||
|
||||
model_id = "meta-llama/Llama-3.1-8B-Instruct"
|
||||
|
||||
quantization_config = BitsAndBytesConfig(
|
||||
load_in_4bit=True,
|
||||
bnb_4bit_compute_dtype=torch.bfloat16
|
||||
)
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
model_id,
|
||||
quantization_config=quantization_config,
|
||||
dtype=torch.bfloat16,
|
||||
device_map="auto"
|
||||
)
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
To explore quantization and related performance optimization concepts more deeply, check out the following resources.
|
||||
|
||||
- [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
|
||||
- [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth)
|
||||
- [Introduction to Quantization cooked in 🤗 with 💗🧑🍳](https://huggingface.co/blog/merve/quantization)
|
||||
- [EfficientML.ai Lecture 5 - Quantization Part I](https://www.youtube.com/watch?v=RP23-dRVDWM)
|
||||
- [Making Deep Learning Go Brrrr From First Principles](https://horace.io/brrr_intro.html)
|
||||
- [Accelerating Generative AI with PyTorch Part 2: LLM Optimizations](https://pytorch.org/blog/accelerating-generative-ai-2/)
|
||||
71
transformers/docs/source/en/quantization/contribute.md
Normal file
71
transformers/docs/source/en/quantization/contribute.md
Normal file
@@ -0,0 +1,71 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Contribute
|
||||
|
||||
Transformers supports many quantization methods such as QLoRA, GPTQ, LLM.int8, and AWQ. However, there are still many more quantization approaches that haven't been integrated yet. To make adding and using these quantization methods with Transformers easier, use the [`~quantizers.HfQuantizer`] class. [`~quantizers.HfQuantizer`] is designed to be an internal helper class for adding a quantization method instead of something applied to every PyTorch module.
|
||||
|
||||
This guide will show you how to integrate a new quantization method with [`~quantizers.HfQuantizer`].
|
||||
|
||||
## Requirements
|
||||
|
||||
Before integrating a new quantization method into Transformers, ensure the method meets the following requirements. Only quantization methods that can be run with PyTorch modules are supported.
|
||||
|
||||
- The quantization method is available through a Python package that is pip-installable (it is also fine if you can only install the package from source). Ideally, pre-compiled kernels are included in the pip package.
|
||||
- The method can run on commonly-used hardware (CPU, GPU, etc.).
|
||||
- The method is wrapped in a [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html) ([`~bitsandbytes.nn.Linear8bitLt`], [`~bitsandbytes.nn.Linear4bit`]), and the quantized linear layer should have the following definition.
|
||||
|
||||
```py
|
||||
class Linear4bit(nn.Module):
|
||||
def __init__(self, ...):
|
||||
...
|
||||
|
||||
def forward(self, x):
|
||||
return my_4bit_kernel(x, self.weight, self.bias)
|
||||
```
|
||||
|
||||
This way, Transformers models are easily quantized by replacing instances of [nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) with a target class.
|
||||
|
||||
- The quantization method should be serializable. You can save the quantized weights locally or push them to the Hub.
|
||||
- Make sure the package containing the quantization kernels/primitive is stable (no frequent breaking changes).
|
||||
|
||||
Some quantization methods may require "pre-quantizing" the model through data calibration (AWQ). In this case, we prefer to only support inference in Transformers and let the third-party library maintained by the ML community handle the model quantization itself.
|
||||
|
||||
## Create new HFQuantizer class
|
||||
|
||||
1. Create a new quantization config class inside [src/transformers/utils/quantization_config.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/quantization_config.py). Add the new quantization config to the [_import_structure](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py#L1088) inside Transformers' [src/transformers/__init__.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/__init__.py) file.
|
||||
|
||||
2. Create a new file inside [src/transformers/quantizers/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers) named `quantizer_your_method.py`, and make it inherit from [`~quantizers.HfQuantizer`]. Make sure to add the new quantizer and quantization config in the quantization auto-mapping in [src/transformers/quantizers/auto.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/auto.py).
|
||||
|
||||
3. Define the following class attributes and property methods for your quantization method.
|
||||
|
||||
- `requires_calibration`: Whether the quantization method requires a data calibration process. If set to `True`, you can only support inference (with pre-quantized weights) and not quantization itself.
|
||||
- `required_packages`: A list of strings of the required packages to use the quantized weights. You might need to define some new utility methods such as `is_auto_awq_available` in [transformers/src/utils/import_utils.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/utils/import_utils.py).
|
||||
- `requires_parameters_quantization`: Only required if your quantization method requires extra attention to the underlying [nn.Parameter](https://pytorch.org/docs/stable/generated/torch.nn.parameter.Parameter.html) object. For example, bitsandbytes uses [`~bitsandbytes.nn.Params4bit`] and [`~bitsandbytes.nn.Int8Params`], which require some extra attention when quantizing the model. Most recent quantization methods pack int2 and int4 weights inside [torch.uint8](https://pytorch.org/docs/stable/tensors.html) weights, so this flag should rarely be required (it defaults to `False`).
|
||||
- `is_serializable`: A property method to determine whether the method is serializable or not.
|
||||
- `is_trainable`: A property method to determine whether you can fine-tune models on top of the quantization method (with or without PEFT approaches).
|
||||
|
||||
4. Write the `validate_environment` and `update_dtype` methods. These methods are called before creating the quantized model to ensure users use the right configuration. Refer to other quantizers for an example of how they are implemented.
|
||||
|
||||
5. Write the `_process_model_before_weight_loading` method. In Transformers, the quantized models are initialized first on the `"meta"` device before loading the weights. This means the `_process_model_before_weight_loading` method takes care of manipulating the model skeleton to replace some modules ([nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) with the target modules (quantization modules).
|
||||
|
||||
You can define module replacement logic or any other utility method by creating a new file in [transformers/src/integrations/](https://github.com/huggingface/transformers/tree/abbffc4525566a48a9733639797c812301218b83/src/transformers/integrations) and exposing the relevant methods in that folder's `__init__.py` file. The best starting point would be to have a look at another quantization method such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py).
|
||||
|
||||
6. Write the `_process_model_after_weight_loading` method. This method enables implementing additional features that require manipulating the model after loading the weights.
|
||||
|
||||
7. Document everything! Make sure your quantization method is documented by adding a new file under `docs/source/en/quantization`.
|
||||
|
||||
8. Add tests by adding the package to our nightly Dockerfile inside `docker/transformers-quantization-latest-gpu` and then adding a new test file in `tests/quantization/xxx`. Feel free to check out existing quantization methods to see how their tests are implemented.
|
||||
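Putting the steps together, a new method might look roughly like the skeleton below. Everything here is illustrative — the `my_quant` package, class names, and attribute values are hypothetical — so use an existing quantizer such as [quantizer_awq.py](https://github.com/huggingface/transformers/blob/abbffc4525566a48a9733639797c812301218b83/src/transformers/quantizers/quantizer_awq.py) as the authoritative reference for the exact hooks and signatures.

```py
# step 1: a config class in src/transformers/utils/quantization_config.py
from transformers.utils.quantization_config import QuantizationConfigMixin

class MyQuantConfig(QuantizationConfigMixin):
    def __init__(self, bits: int = 4, **kwargs):
        self.quant_method = "my_quant"
        self.bits = bits

# step 2: a quantizer class in src/transformers/quantizers/quantizer_my_quant.py
from transformers.quantizers.base import HfQuantizer

class MyQuantHfQuantizer(HfQuantizer):
    requires_calibration = False        # step 3: supports quantizing on the fly
    required_packages = ["my_quant"]    # step 3: the pip package shipping the kernels

    def validate_environment(self, *args, **kwargs):
        ...  # step 4: check my_quant is installed, a supported device is available, etc.

    def _process_model_before_weight_loading(self, model, **kwargs):
        ...  # step 5: replace nn.Linear modules with my_quant's quantized linear on the meta device

    def _process_model_after_weight_loading(self, model, **kwargs):
        ...  # step 6: any fixups that require the loaded weights

    @property
    def is_trainable(self):
        return False
```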
65
transformers/docs/source/en/quantization/eetq.md
Normal file
65
transformers/docs/source/en/quantization/eetq.md
Normal file
@@ -0,0 +1,65 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# EETQ
|
||||
|
||||
The [Easy & Efficient Quantization for Transformers (EETQ)](https://github.com/NetEase-FuXi/EETQ) library supports int8 weight-only per-channel quantization for NVIDIA GPUs. It uses high-performance GEMM and GEMV kernels from [FasterTransformer](https://github.com/NVIDIA/FasterTransformer) and [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). The attention layer is optimized with [FlashAttention2](https://github.com/Dao-AILab/flash-attention). No calibration dataset is required, and the model doesn't need to be pre-quantized. Accuracy degradation is negligible owing to the per-channel quantization.
|
||||
|
||||
EETQ further supports fine-tuning with [PEFT](https://huggingface.co/docs/peft).
|
||||
|
||||
Install EETQ from the [release page](https://github.com/NetEase-FuXi/EETQ/releases) or [source code](https://github.com/NetEase-FuXi/EETQ). CUDA 11.4+ is required for EETQ.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="release page">
|
||||
|
||||
```bash
|
||||
pip install --no-cache-dir https://github.com/NetEase-FuXi/EETQ/releases/download/v1.0.0/EETQ-1.0.0+cu121+torch2.1.2-cp310-cp310-linux_x86_64.whl
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="source code">
|
||||
|
||||
```bash
|
||||
git clone https://github.com/NetEase-FuXi/EETQ.git
|
||||
cd EETQ/
|
||||
git submodule update --init --recursive
|
||||
pip install .
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Quantize a model on-the-fly by defining the quantization data type in [`EetqConfig`].
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, EetqConfig
|
||||
|
||||
quantization_config = EetqConfig("int8")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
Save the quantized model with [`~PreTrainedModel.save_pretrained`] so it can be reused again with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
model.save_pretrained(quant_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
|
||||
```
|
||||
56
transformers/docs/source/en/quantization/fbgemm_fp8.md
Normal file
56
transformers/docs/source/en/quantization/fbgemm_fp8.md
Normal file
@@ -0,0 +1,56 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# FBGEMM
|
||||
|
||||
[FBGEMM (Facebook GEneral Matrix Multiplication)](https://github.com/pytorch/FBGEMM) is a low-precision matrix multiplication library for small batch sizes, with support for accuracy-loss-minimizing techniques such as row-wise quantization and outlier-aware quantization. With FBGEMM, you can quantize a model's weights to 8-bits/channel and the activations to 8-bits/token (also known as fp8 or w8a8).
|
||||
|
||||
> [!TIP]
|
||||
> You need a GPU with [compute capability 9+](https://developer.nvidia.com/cuda-gpus#collapseOne) like an H100.
|
||||
|
||||
Install the FBGEMM_GPU package with the command below to ensure you have the latest version.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate fbgemm-gpu torch
|
||||
```
|
||||
|
||||
If you're having installation issues, try installing the [nightly release](https://pytorch.org/FBGEMM/fbgemm_gpu-development/InstallationInstructions.html#fbgemm-gpu-install-libraries:~:text=found%20here.-,Install%20the%20FBGEMM_GPU%20Package,-Install%20through%20PyTorch).
|
||||
|
||||
Create a [`FbgemmFp8Config`] and pass it to [`~PreTrainedModel.from_pretrained`] to quantize a model to fp8.
|
||||
|
||||
```py
|
||||
from transformers import FbgemmFp8Config, AutoModelForCausalLM
|
||||
|
||||
quantization_config = FbgemmFp8Config()
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Meta-Llama-3-8B",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
```
|
||||
|
||||
[`~PreTrainedModel.save_pretrained`] and [`~PreTrainedModel.from_pretrained`] enable saving and loading a quantized model.
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
model.save_pretrained(quant_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Open-sourcing FBGEMM for state-of-the-art server-side inference](https://engineering.fb.com/2018/11/07/ml-applications/fbgemm/) blog post for more details on FBGEMM.
|
||||
62
transformers/docs/source/en/quantization/finegrained_fp8.md
Normal file
62
transformers/docs/source/en/quantization/finegrained_fp8.md
Normal file
@@ -0,0 +1,62 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Fine-grained FP8
|
||||
|
||||
Fine-grained FP8 quantization quantizes the weights and activations to fp8.
|
||||
|
||||
- The weights are quantized to 8-bits for each 2D block (`weight_block_size=(128, 128)`).
|
||||
- The activations are quantized to 8-bits for each group per token. The group value matches the weights in the input channel (128 by default).
|
||||
|
||||
FP8 quantization enables support for [DeepSeek-V3](https://hf.co/papers/2412.19437) and DeepSeek-R1.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/b7b3b34bf826a6423ea82ffc57ecac80c46c3c76/transformers/quantization/quantization_deepseek.png">
|
||||
</div>
|
||||
|
||||
> [!TIP]
|
||||
> You need a GPU with compute capability >= 9 (such as an H100), and you should install a PyTorch version compatible with the CUDA version of your GPU.
|
||||
|
||||
Install Accelerate and upgrade to the latest version of PyTorch.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate torch
|
||||
```
|
||||
|
||||
Create a [`FineGrainedFP8Config`] class and pass it to [`~PreTrainedModel.from_pretrained`] to quantize a model. The weights are loaded in full precision (`torch.float32`) by default regardless of the actual data type the weights are stored in. Set `dtype="auto"` to load the weights in the data type defined in a model's `config.json` file and automatically pick the most memory-optimal data type.
|
||||
|
||||
```py
|
||||
from transformers import FineGrainedFP8Config, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_name = "meta-llama/Meta-Llama-3-8B"
|
||||
quantization_config = FineGrainedFP8Config()
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto", quantization_config=quantization_config)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_name)
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device.type)
|
||||
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10)
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
Use [`~PreTrainedModel.save_pretrained`] to save the quantized model and reload it with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```py
|
||||
quant_path = "/path/to/save/quantized/model"
|
||||
model.save_pretrained(quant_path)
|
||||
model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
|
||||
```
|
||||
66
transformers/docs/source/en/quantization/fp_quant.md
Normal file
66
transformers/docs/source/en/quantization/fp_quant.md
Normal file
@@ -0,0 +1,66 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# FP-Quant
|
||||
|
||||
[FP-Quant](https://github.com/IST-DASLab/FP-Quant) is a family of quantization algorithms tailored for the Blackwell generation of Nvidia GPUs. The goal is to allow for efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs in the [MXFP4 and NVFP4 data-types](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf).
|
||||
|
||||
Currently, only PTQ with MXFP4 is supported. Models can either be quantized on the fly with `quantization_config=FPQuantConfig()`:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
|
||||
import torch
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"qwen/Qwen3-8B",
|
||||
quantization_config=FPQuantConfig(),
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
)
|
||||
```
|
||||
|
||||
or pre-processed with GPTQ for better quality (see [FP Format Quantization Harness](https://github.com/IST-DASLab/FP-Quant)).
|
||||
|
||||
A **Blackwell-generation GPU is required** to run the kernels. Runtime support for FP-Quant is implemented through the [QuTLASS](https://github.com/IST-DASLab/qutlass) library and a lightweight PyTorch interface lib [`fp_quant`](https://github.com/IST-DASLab/FP-Quant/tree/master/inference_lib). We recommend installing the former **from source** and the latter with `pip install fp_quant`.
|
||||
|
||||
Users **without a Blackwell-generation GPU** can use the method with `quantization_config=FPQuantConfig(pseudoquant=True)` without having to install [QuTLASS](https://github.com/IST-DASLab/qutlass). This provides no speedups but fully emulates the effect of quantization.
|
||||
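For example, only the config argument changes compared to the snippet above. A sketch of running the emulation path on non-Blackwell hardware (expect no speedup, only the numerical effect of quantization):

```python
from transformers import AutoModelForCausalLM, FPQuantConfig
import torch

# pseudoquant=True emulates MXFP4 in plain PyTorch, so QuTLASS is not required
model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(pseudoquant=True),
    device_map="auto",
    dtype=torch.bfloat16,
)
```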
|
||||
> [!TIP]
|
||||
> Find models pre-quantized with FP-Quant in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/fp-quant-6877c186103a21d3a02568ee).
|
||||
|
||||
## torch.compile
|
||||
|
||||
FP-Quant is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, FPQuantConfig
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"qwen/Qwen3-8B",
|
||||
quantization_config=FPQuantConfig(),
|
||||
device_map="auto",
|
||||
dtype=torch.bfloat16,
|
||||
)
|
||||
|
||||
model.forward = torch.compile(model.forward, mode="max-autotune", fullgraph=True)
|
||||
```
|
||||
|
||||
## Speedups
|
||||
|
||||
FP-Quant currently performs best for very large batch size processing.
|
||||
|
||||
See [QuTLASS README](https://github.com/IST-DASLab/qutlass/blob/main/README.md) for speedups.
|
||||
171
transformers/docs/source/en/quantization/gptq.md
Normal file
171
transformers/docs/source/en/quantization/gptq.md
Normal file
@@ -0,0 +1,171 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# GPTQ
|
||||
|
||||
[GPTQModel](https://github.com/ModelCloud/GPTQModel) and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) implement the GPTQ algorithm, a post-training quantization technique where each row of the weight matrix is quantized independently to find a version of the weights that minimizes the error. These weights are quantized to int4, but they're restored to fp16 on the fly during inference. This can cut memory usage by roughly 4x because the int4 weights are dequantized in a fused kernel rather than in a GPU's global memory. Inference is also faster because a lower bitwidth takes less time to communicate.
|
||||
|
||||
> [!WARNING]
|
||||
> AutoGPTQ is likely to be deprecated in the future due to lack of continued support for new models and features. See the [GPTQModel](#gptqmodel) section for more details.
|
||||
|
||||
Install Accelerate, Transformers and Optimum first.
|
||||
|
||||
```bash
|
||||
pip install --upgrade accelerate optimum transformers
|
||||
```
|
||||
|
||||
Then run the command below to install a GPTQ library.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="GPTQmodel">
|
||||
|
||||
```bash
|
||||
pip install gptqmodel --no-build-isolation
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="AutoGPTQ">
|
||||
|
||||
```bash
|
||||
pip install auto-gptq --no-build-isolation
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Create a [`GPTQConfig`] class and set the number of bits to quantize to, a dataset to calibrate the weights for quantization, and a tokenizer to prepare the dataset.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
|
||||
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
You can pass your own dataset as a list of strings, but it is highly recommended to use the same dataset from the GPTQ paper.
|
||||
|
||||
```py
|
||||
dataset = ["auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."]
|
||||
gptq_config = GPTQConfig(bits=4, dataset=dataset, tokenizer=tokenizer)
|
||||
```
|
||||
|
||||
Load a model to quantize and pass [`GPTQConfig`] to [`~AutoModelForCausalLM.from_pretrained`]. Set `device_map="auto"` to automatically offload the model to a CPU to help fit the model in memory, and allow the model modules to be moved between the CPU and GPU for quantization.
|
||||
|
||||
```py
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto", quantization_config=gptq_config)
|
||||
```
|
||||
|
||||
If you're running out of memory because a dataset is too large (disk offloading is not supported), try passing the `max_memory` parameter to allocate the amount of memory to use on your device (GPU and CPU).
|
||||
|
||||
```py
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"facebook/opt-125m",
|
||||
device_map="auto",
|
||||
max_memory={0: "30GiB", 1: "46GiB", "cpu": "30GiB"},
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
> [!WARNING]
|
||||
> Depending on your hardware, it can take some time to quantize a model from scratch. It can take ~5 minutes to quantize the [facebook/opt-350m](https://huggingface.co/facebook/opt-350m) model on a free-tier Google Colab GPU, but it'll take ~4 hours to quantize a 175B parameter model on a NVIDIA A100. Before you quantize a model, it is a good idea to check the Hub if a GPTQ-quantized version of the model already exists.
|
||||
|
||||
Once a model is quantized, you can use [`~PreTrainedModel.push_to_hub`] to push the model and tokenizer to the Hub where it can be easily shared and accessed. This saves the [`GPTQConfig`].
|
||||
|
||||
```py
|
||||
quantized_model.push_to_hub("opt-125m-gptq")
|
||||
tokenizer.push_to_hub("opt-125m-gptq")
|
||||
```
|
||||
|
||||
[`~PreTrainedModel.save_pretrained`] saves a quantized model locally. If the model was quantized with the `device_map` parameter, make sure to move the entire model to a GPU or CPU before saving it. The example below saves the model on a CPU.
|
||||
|
||||
```py
|
||||
quantized_model.save_pretrained("opt-125m-gptq")
|
||||
tokenizer.save_pretrained("opt-125m-gptq")
|
||||
|
||||
# if quantized with device_map set
|
||||
quantized_model.to("cpu")
|
||||
quantized_model.save_pretrained("opt-125m-gptq")
|
||||
```
|
||||
|
||||
Reload a quantized model with [`~PreTrainedModel.from_pretrained`], and set `device_map="auto"` to automatically distribute the model on all available GPUs to load the model faster without using more memory than needed.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto")
|
||||
```
|
||||
|
||||
## Marlin
|
||||
|
||||
[Marlin](https://github.com/IST-DASLab/marlin) is a 4-bit only CUDA GPTQ kernel, highly optimized for the NVIDIA A100 GPU (Ampere) architecture. Loading, dequantization, and execution of post-dequantized weights are highly parallelized, offering a substantial inference improvement versus the original CUDA GPTQ kernel. Marlin is only available for quantized inference and does not support model quantization.
|
||||
|
||||
Marlin inference can be activated with the `backend` parameter in [`GPTQConfig`].
|
||||
|
||||
```py
|
||||
|
||||
from transformers import AutoModelForCausalLM, GPTQConfig
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained("{your_username}/opt-125m-gptq", device_map="auto", quantization_config=GPTQConfig(bits=4, backend="marlin"))
|
||||
```
|
||||
|
||||
## ExLlama
|
||||
|
||||
> [!WARNING]
|
||||
> Only 4-bit models are supported, and we recommend deactivating the ExLlama kernels if you're finetuning a quantized model with PEFT.
|
||||
|
||||
[ExLlama](https://github.com/turboderp/exllama) is a Python/C++/CUDA implementation of the [Llama](model_doc/llama) model that is designed for faster inference with 4-bit GPTQ weights (check out these [benchmarks](https://github.com/huggingface/optimum/tree/main/tests/benchmark#gptq-benchmark)). The ExLlama kernel is activated by default when you create a [`GPTQConfig`] object.
|
||||
|
||||
To boost inference speed even further, use the [ExLlamaV2](https://github.com/turboderp/exllamav2) kernels by configuring the `exllama_config` parameter in [`GPTQConfig`].
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, GPTQConfig
|
||||
|
||||
gptq_config = GPTQConfig(bits=4, exllama_config={"version":2})
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"{your_username}/opt-125m-gptq",
|
||||
device_map="auto",
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
The ExLlama kernels are only supported when the entire model is on the GPU. If you're doing inference on a CPU with AutoGPTQ 0.4.2+, disable the ExLlama kernel in [`GPTQConfig`]. This overwrites the attributes related to the ExLlama kernels in the quantization config of the `config.json` file.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, GPTQConfig
|
||||
|
||||
gptq_config = GPTQConfig(bits=4, use_exllama=False)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"{your_username}/opt-125m-gptq",
|
||||
device_map="cpu",
|
||||
quantization_config=gptq_config
|
||||
)
|
||||
```
|
||||
|
||||
## GPTQModel
|
||||
|
||||
It is recommended to use GPTQModel, originally a maintained fork of AutoGPTQ, because it has since diverged from AutoGPTQ with some significant features. GPTQModel has faster quantization, lower memory usage, and more accurate default quantization.
|
||||
|
||||
GPTQModel provides asymmetric quantization which can potentially lower quantization errors compared to symmetric quantization. It is not backward compatible with AutoGPTQ, and not all kernels (Marlin) support asymmetric quantization.
|
||||
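Asymmetric quantization is requested through the `sym` argument of [`GPTQConfig`]. A sketch (keep in mind that the resulting checkpoints are not AutoGPTQ-compatible and not every kernel supports them):

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
# sym=False selects asymmetric quantization
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer, sym=False)
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", device_map="auto", quantization_config=gptq_config)
```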
|
||||
GPTQModel also has broader support for the latest LLM models, multimodal models (Qwen2-VL and Ovis1.6-VL), platforms (Linux, macOS, Windows 11), and hardware (AMD ROCm, Apple Silicon, Intel/AMD CPUs, and Intel Datacenter Max/Arc GPUs, etc.).
|
||||
|
||||
The Marlin kernels are also updated for A100 GPUs and other kernels are updated to include auto-padding for legacy models and models with non-uniform in/out-features.
|
||||
|
||||
## Resources
|
||||
|
||||
Run the GPTQ quantization with PEFT [notebook](https://colab.research.google.com/drive/1_TIrmuKOFhuRRiTWN94iLKUFu6ZX4ceb?usp=sharing) for a hands-on experience, and read [Making LLMs lighter with AutoGPTQ and transformers](https://huggingface.co/blog/gptq-integration) to learn more about the AutoGPTQ integration.
|
||||
80
transformers/docs/source/en/quantization/higgs.md
Normal file
80
transformers/docs/source/en/quantization/higgs.md
Normal file
@@ -0,0 +1,80 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# HIGGS
|
||||
|
||||
[HIGGS](https://huggingface.co/papers/2411.17525) is a zero-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and state-of-the-art performance.
|
||||
|
||||
Runtime support for HIGGS is implemented through the [FLUTE](https://github.com/HanGuo97/flute) library. Only the 70B and 405B variants of Llama 3 and Llama 3.1, and the 8B and 27B variants of Gemma 2 are currently supported. HIGGS also doesn't support quantized training and backward passes in general at the moment.
|
||||
|
||||
Run the command below to install FLUTE.
|
||||
|
||||
<hfoptions id="install">
|
||||
<hfoption id="CUDA 12.1">
|
||||
|
||||
```bash
|
||||
pip install flute-kernel
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="CUDA 11.8">
|
||||
|
||||
```bash
|
||||
pip install flute-kernel -i https://flute-ai.github.io/whl/cu118
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
Create a [`HiggsConfig`] with the number of bits to quantize a model to.
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"google/gemma-2-9b-it",
|
||||
quantization_config=HiggsConfig(bits=4),
|
||||
device_map="auto",
|
||||
)
|
||||
```
|
||||
|
||||
> [!TIP]
|
||||
> Find models pre-quantized with HIGGS in the official ISTA-DASLab [collection](https://huggingface.co/collections/ISTA-DASLab/higgs-675308e432fd56b7f6dab94e).
|
||||
|
||||
## torch.compile
|
||||
|
||||
HIGGS is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html).
|
||||
|
||||
```python
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HiggsConfig
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"google/gemma-2-9b-it",
|
||||
quantization_config=HiggsConfig(bits=4),
|
||||
device_map="auto",
|
||||
)
|
||||
|
||||
model = torch.compile(model)
|
||||
```
|
||||
|
||||
Refer to the table below for a benchmark of forward passes/sec for Llama-3.1-8B-Instruct on a RTX4090.
|
||||
|
||||
| Batch Size | BF16 (with `torch.compile`) | HIGGS 4bit (without `torch.compile`) | HIGGS 4bit (with `torch.compile`) |
|
||||
|------------|-----------------------------|----------------------------------|-----------------------------------|
|
||||
| 1 | 59 | 41 | 124 |
|
||||
| 4 | 57 | 42 | 123 |
|
||||
| 16 | 56 | 41 | 120 |
|
||||
101
transformers/docs/source/en/quantization/hqq.md
Executable file
101
transformers/docs/source/en/quantization/hqq.md
Executable file
@@ -0,0 +1,101 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# HQQ
|
||||
|
||||
[Half-Quadratic Quantization (HQQ)](https://github.com/mobiusml/hqq/) supports fast on-the-fly quantization for 8, 4, 3, 2, and even 1-bits. It doesn't require calibration data, and it is compatible with any model modality (LLMs, vision, etc.).
|
||||
|
||||
HQQ further supports fine-tuning with [PEFT](https://huggingface.co/docs/peft) and is fully compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
|
||||
|
||||
Install HQQ with the following command to get the latest version and to build its corresponding CUDA kernels if you are using a CUDA device. HQQ also supports Intel XPUs with a pure PyTorch implementation.
|
||||
|
||||
```bash
|
||||
pip install hqq
|
||||
```
|
||||
|
||||
You can choose to either replace all the linear layers in a model with the same quantization config or dedicate a specific quantization config for specific linear layers.
|
||||
|
||||
<hfoptions id="hqq">
|
||||
<hfoption id="replace all layers">
|
||||
|
||||
Quantize a model by creating a [`HqqConfig`] and specifying the `nbits` and `group_size` to replace for all the linear layers ([torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html)) of the model.
|
||||
|
||||
``` py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, HqqConfig
|
||||
|
||||
quant_config = HqqConfig(nbits=8, group_size=64)
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="specific layers only">
|
||||
|
||||
Quantize a model by creating a dictionary specifying the `nbits` and `group_size` for the linear layers to quantize. Pass them to [`HqqConfig`] and set which layers to quantize with the config. This approach is especially useful for quantizing mixture-of-experts (MoEs) because they are less affected by lower quantization settings.
|
||||
|
||||
``` py
|
||||
import torch
from transformers import AutoModelForCausalLM, HqqConfig

q4_config = {'nbits':4, 'group_size':64}
|
||||
q3_config = {'nbits':3, 'group_size':32}
|
||||
quant_config = HqqConfig(dynamic_config={
|
||||
'self_attn.q_proj':q4_config,
|
||||
'self_attn.k_proj':q4_config,
|
||||
'self_attn.v_proj':q4_config,
|
||||
'self_attn.o_proj':q4_config,
|
||||
|
||||
'mlp.gate_proj':q3_config,
|
||||
'mlp.up_proj' :q3_config,
|
||||
'mlp.down_proj':q3_config,
|
||||
})
|
||||
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
## Backends
|
||||
|
||||
HQQ supports various backends, including pure PyTorch and custom dequantization CUDA kernels. These backends are suitable for older GPUs and PEFT/QLoRA training.
|
||||
|
||||
```py
|
||||
from hqq.core.quantize import *
|
||||
|
||||
HQQLinear.set_backend(HQQBackend.PYTORCH)
|
||||
```
|
||||
|
||||
For faster inference, HQQ supports 4-bit fused kernels (torchao and Marlin) after a model is quantized. These can reach up to 200 tokens/sec on a single 4090. The example below demonstrates enabling the torchao_int4 backend.
|
||||
|
||||
```py
|
||||
from hqq.utils.patching import prepare_for_inference
|
||||
|
||||
prepare_for_inference(model, backend="torchao_int4")
|
||||
```
|
||||
|
||||
Refer to the [Backend](https://github.com/mobiusml/hqq/#backend) guide for more details.
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Half-Quadratic Quantization of Large Machine Learning Models](https://mobiusml.github.io/hqq_blog/) blog post for more details about HQQ.
|
||||
77
transformers/docs/source/en/quantization/mxfp4.md
Normal file
77
transformers/docs/source/en/quantization/mxfp4.md
Normal file
@@ -0,0 +1,77 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# MXFP4
|
||||
|
||||
Note: MXFP4 quantization currently only works for the OpenAI GPT-OSS 120B and 20B models.
|
||||
|
||||
MXFP4 is a 4-bit floating point format that dramatically reduces the memory requirements of large models. Large models (GPT-OSS-120B) can fit on a single 80GB GPU and smaller models (GPT-OSS-20B) only require 16GB of memory. It uses blockwise scaling to preserve its range and accuracy, which typically degrade at lower precisions.
|
||||
|
||||
To use MXFP4, make sure your hardware meets the following requirements.
|
||||
|
||||
- Install Accelerate, kernels, and Triton ≥ 3.4. You only need to manually install Triton ≥ 3.4 if you're using PyTorch 2.7, because PyTorch 2.8 already includes it.
|
||||
- NVIDIA GPU with Compute Capability ≥ 7.5, which includes Tesla GPUs and newer. Use [get_device_capability](https://docs.pytorch.org/docs/stable/generated/torch.cuda.get_device_capability.html) to check Compute Capability.
|
||||
|
||||
```python
|
||||
from torch import cuda
|
||||
cuda.get_device_capability()
|
||||
|
||||
# (7, 5)
|
||||
```
|
||||
|
||||
Check a model's quantization config as shown below to see if it supports MXFP4. If `'quant_method': 'mxfp4'`, then the model automatically uses MXFP4.
|
||||
|
||||
```py
|
||||
from transformers import GptOssConfig
|
||||
|
||||
model_id = "openai/gpt-oss-120b"
|
||||
cfg = GptOssConfig.from_pretrained(model_id)
|
||||
print(cfg.quantization_config)
|
||||
|
||||
# Example output:
|
||||
# {
|
||||
# 'modules_to_not_convert': [
|
||||
# 'model.layers.*.self_attn',
|
||||
# 'model.layers.*.mlp.router',
|
||||
# 'model.embed_tokens',
|
||||
# 'lm_head'
|
||||
# ],
|
||||
# 'quant_method': 'mxfp4'
|
||||
# }
|
||||
```
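
If the requirements above are met, no extra configuration is needed: loading a supported checkpoint is enough and MXFP4 is used automatically. The snippet below is a minimal sketch assuming an MXFP4-capable GPU and the dependencies listed above.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

# quant_method is "mxfp4" in this checkpoint's config, so MXFP4 is applied automatically
model = AutoModelForCausalLM.from_pretrained(model_id, dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is MXFP4 quantization?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```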
|
||||
|
||||
## MXFP4 kernels
|
||||
|
||||
Transformers automatically pulls the MXFP4-aware Triton kernels from the community repository when you load a model that needs them. The kernels are stored in your local cache and used during the forward pass.
|
||||
|
||||
MXFP4 kernels are used by default, if available and supported, and do not require any code changes.
|
||||
|
||||
You can use [hf cache scan](https://huggingface.co/docs/huggingface_hub/en/guides/manage-cache#scan-your-cache) to verify the kernels are downloaded.
|
||||
|
||||
```shell
|
||||
hf cache scan
|
||||
```
|
||||
|
||||
```shell
|
||||
REPO ID REPO TYPE SIZE ON DISK
|
||||
-------------------------------- --------- ------------
|
||||
kernels-community/triton_kernels model 536.2K
|
||||
openai/gpt-oss-20b model 13.8G
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
Learn more about MXFP4 quantization and how blockwise scaling works in this [blog post](https://huggingface.co/blog/faster-transformers#mxfp4-quantization).
|
||||
19
transformers/docs/source/en/quantization/optimum.md
Normal file
19
transformers/docs/source/en/quantization/optimum.md
Normal file
@@ -0,0 +1,19 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Optimum
|
||||
|
||||
[Optimum](https://huggingface.co/docs/optimum/index) is an optimization library that supports quantization for Intel, Furiosa, ONNX Runtime, GPTQ, and lower-level PyTorch quantization functions. It is designed to enhance performance for specific hardware - Intel CPUs/HPUs, AMD GPUs, Furiosa NPUs, etc. - and model accelerators like ONNX Runtime.
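
As a quick illustration, the sketch below applies dynamic int8 quantization to an ONNX-exported model with Optimum's ONNX Runtime integration. It assumes `optimum` is installed with the `onnxruntime` extra; the model id and output directory are only examples.

```py
from optimum.onnxruntime import ORTModelForSequenceClassification, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Export a Transformers checkpoint to ONNX (example model id)
onnx_model = ORTModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english", export=True
)

# Dynamic int8 quantization targeting AVX512-VNNI CPUs
quantizer = ORTQuantizer.from_pretrained(onnx_model)
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="distilbert_int8_onnx", quantization_config=qconfig)
```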
|
||||
61
transformers/docs/source/en/quantization/overview.md
Normal file
61
transformers/docs/source/en/quantization/overview.md
Normal file
@@ -0,0 +1,61 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Overview
|
||||
|
||||
Quantization lowers the memory requirements of loading and using a model by storing the weights in a lower precision while trying to preserve as much accuracy as possible. Weights are typically stored in full-precision (fp32) floating point representations, but half-precision (fp16 or bf16) data types are increasingly popular given the large size of models today. Some quantization methods can reduce the precision even further to integer representations, like int8 or int4.
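
As a rough back-of-the-envelope estimate, the memory needed just to hold the weights scales with the number of parameters times the bytes per parameter (activations and the KV cache add to this).

```py
# Approximate weight memory for an 8B-parameter model at different precisions
params = 8e9
for name, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name}: ~{params * bits / 8 / 1e9:.0f} GB")

# fp32: ~32 GB, bf16: ~16 GB, int8: ~8 GB, int4: ~4 GB
```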
|
||||
|
||||
Transformers supports many quantization methods, each with their pros and cons, so you can pick the best one for your specific use case. Some methods require calibration for greater accuracy and extreme compression (1-2 bits), while other methods work out of the box with on-the-fly quantization.
|
||||
|
||||
Use the Space below to help you pick a quantization method depending on your hardware and number of bits to quantize to.
|
||||
|
||||
| Quantization Method | On the fly quantization | CPU | CUDA GPU | ROCm GPU | Metal (Apple Silicon) | Intel GPU | Torch compile() | Bits | PEFT Fine Tuning | Serializable with 🤗Transformers | 🤗Transformers Support | Link to library |
|
||||
|-------------------------------------------|----------------------|-----------------|----------|-----------|------------------------------------|-----------------|-----------------|--------------|------------------|-----------------------------|-------------------------|---------------------------------------------|
|
||||
| [AQLM](./aqlm) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 1/2 | 🟢 | 🟢 | 🟢 | https://github.com/Vahe1994/AQLM |
|
||||
| [AutoRound](./auto_round) | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🔴 | 2/3/4/8 | 🔴 | 🟢 | 🟢 | https://github.com/intel/auto-round |
|
||||
| [AWQ](./awq) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | ? | 4 | 🟢 | 🟢 | 🟢 | https://github.com/casper-hansen/AutoAWQ |
|
||||
| [bitsandbytes](./bitsandbytes) | 🟢 | 🟡 | 🟢 | 🟡 | 🔴 | 🟡 | 🟢 | 4/8 | 🟢 | 🟢 | 🟢 | https://github.com/bitsandbytes-foundation/bitsandbytes |
|
||||
| [compressed-tensors](./compressed_tensors) | 🔴 | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 1/8 | 🟢 | 🟢 | 🟢 | https://github.com/neuralmagic/compressed-tensors |
|
||||
| [EETQ](./eetq) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | ? | 8 | 🟢 | 🟢 | 🟢 | https://github.com/NetEase-FuXi/EETQ |
|
||||
| [FP-Quant](./fp_quant) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 4 | 🔴 | 🟢 | 🟢 | https://github.com/IST-DASLab/FP-Quant |
|
||||
| [GGUF / GGML (llama.cpp)](../gguf) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🔴 | 1/8 | 🔴 | [See Notes](../gguf) | [See Notes](../gguf) | https://github.com/ggerganov/llama.cpp |
|
||||
| [GPTQModel](./gptq) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/ModelCloud/GPTQModel |
|
||||
| [AutoGPTQ](./gptq) | 🔴 | 🔴 | 🟢 | 🟢 | 🔴 | 🔴 | 🔴 | 2/3/4/8 | 🟢 | 🟢 | 🟢 | https://github.com/AutoGPTQ/AutoGPTQ |
|
||||
| [HIGGS](./higgs) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 2/4 | 🔴 | 🟢 | 🟢 | https://github.com/HanGuo97/flute |
|
||||
| [HQQ](./hqq) | 🟢 | 🟢 | 🟢 | 🔴 | 🔴 | 🟢 | 🟢 | 1/8 | 🟢 | 🔴 | 🟢 | https://github.com/mobiusml/hqq/ |
|
||||
| [optimum-quanto](./quanto) | 🟢 | 🟢 | 🟢 | 🔴 | 🟢 | 🟢 | 🟢 | 2/4/8 | 🔴 | 🔴 | 🟢 | https://github.com/huggingface/optimum-quanto |
|
||||
| [FBGEMM_FP8](./fbgemm_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | https://github.com/pytorch/FBGEMM |
|
||||
| [torchao](./torchao) | 🟢 | 🟢 | 🟢 | 🔴 | 🟡 | 🟢 | | 4/8 | | 🟢🔴 | 🟢 | https://github.com/pytorch/ao |
|
||||
| [VPTQ](./vptq) | 🔴 | 🔴 | 🟢 | 🟡 | 🔴 | 🔴 | 🟢 | 1/8 | 🔴 | 🟢 | 🟢 | https://github.com/microsoft/VPTQ |
|
||||
| [FINEGRAINED_FP8](./finegrained_fp8) | 🟢 | 🔴 | 🟢 | 🔴 | 🔴 | 🟢 | 🔴 | 8 | 🔴 | 🟢 | 🟢 | |
|
||||
| [SpQR](./spqr) | 🔴 | 🔴 | 🟢 | 🔴 | 🔴 | 🔴 | 🟢 | 3 | 🔴 | 🟢 | 🟢 | https://github.com/Vahe1994/SpQR/ |
|
||||
| [Quark](./quark) | 🔴 | 🟢 | 🟢 | 🟢 | 🟢 | 🟢 | ? | 2/4/6/8/9/16 | 🔴 | 🔴 | 🟢 | https://quark.docs.amd.com/latest/ |
|
||||
|
||||
## Resources
|
||||
|
||||
If you are new to quantization, we recommend checking out these beginner-friendly quantization courses created in collaboration with DeepLearning.AI.
|
||||
|
||||
* [Quantization Fundamentals with Hugging Face](https://www.deeplearning.ai/short-courses/quantization-fundamentals-with-hugging-face/)
|
||||
* [Quantization in Depth](https://www.deeplearning.ai/short-courses/quantization-in-depth)
|
||||
|
||||
## User-Friendly Quantization Tools
|
||||
|
||||
If you are looking for a user-friendly quantization experience, you can use the following community spaces and notebooks:
|
||||
|
||||
* [Bitsandbytes Space](https://huggingface.co/spaces/bnb-community/bnb-my-repo)
|
||||
* [GGUF Space](https://huggingface.co/spaces/ggml-org/gguf-my-repo)
|
||||
* [MLX Space](https://huggingface.co/spaces/mlx-community/mlx-my-repo)
|
||||
* [AutoQuant Notebook](https://colab.research.google.com/drive/1b6nqC7UZVt8bx4MksX7s656GXPM-eWw4?usp=sharing#scrollTo=ZC9Nsr9u5WhN)
|
||||
69
transformers/docs/source/en/quantization/quanto.md
Normal file
69
transformers/docs/source/en/quantization/quanto.md
Normal file
@@ -0,0 +1,69 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Optimum Quanto
|
||||
|
||||
[Quanto](https://github.com/huggingface/optimum-quanto) is a PyTorch quantization backend for [Optimum](https://huggingface.co/docs/optimum/index). It features linear quantization for weights (float8, int8, int4, int2) with accuracy very similar to full-precision models. Quanto is compatible with any model modality and device, making it simple to use regardless of hardware.
|
||||
|
||||
Quanto is also compatible with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for faster generation.
|
||||
|
||||
Install Quanto with the following command.
|
||||
|
||||
```bash
|
||||
pip install optimum-quanto accelerate transformers
|
||||
```
|
||||
|
||||
Quantize a model by creating a [`QuantoConfig`] and specifying the `weights` parameter to quantize to. This works for any model in any modality as long as it contains [torch.nn.Linear](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html) layers.
|
||||
|
||||
> [!TIP]
|
||||
> The Transformers integration only supports weight quantization. Use the Quanto library directly if you need activation quantization, calibration, or QAT.
|
||||
|
||||
```py
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, QuantoConfig
|
||||
|
||||
quant_config = QuantoConfig(weights="int8")
|
||||
model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
```
|
||||
|
||||
## torch.compile
|
||||
|
||||
Wrap a Quanto model with [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for faster generation.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForSpeechSeq2Seq, QuantoConfig
|
||||
|
||||
quant_config = QuantoConfig(weights="int8")
|
||||
model = AutoModelForSpeechSeq2Seq.from_pretrained(
|
||||
"openai/whisper-large-v2",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quant_config
|
||||
)
|
||||
|
||||
model = torch.compile(model)
|
||||
```
|
||||
|
||||
## Resources
|
||||
|
||||
Read the [Quanto: a PyTorch quantization backend for Optimum](https://huggingface.co/blog/quanto-introduction) blog post to learn more about the library design and benchmarks.
|
||||
|
||||
For more hands-on examples, take a look at the Quanto [notebook](https://colab.research.google.com/drive/16CXfVmtdQvciSh9BopZUDYcmXCDpvgrT?usp=sharing).
|
||||
83
transformers/docs/source/en/quantization/quark.md
Normal file
83
transformers/docs/source/en/quantization/quark.md
Normal file
@@ -0,0 +1,83 @@
|
||||
<!--Copyright 2025 Advanced Micro Devices, Inc. and The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Quark
|
||||
|
||||
[Quark](https://quark.docs.amd.com/latest/) is a deep learning quantization toolkit designed to be agnostic to specific data types, algorithms, and hardware. Different pre-processing strategies, algorithms and data-types can be combined in Quark.
|
||||
|
||||
The PyTorch support integrated through 🤗 Transformers primarily targets AMD CPUs and GPUs, and is mainly meant to be used for evaluation purposes. For example, it is possible to use [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) with the 🤗 Transformers backend and seamlessly evaluate a wide range of models quantized through Quark.
|
||||
|
||||
Users interested in Quark can refer to its [documentation](https://quark.docs.amd.com/latest/) to get started quantizing models and using them in supported open-source libraries!
|
||||
|
||||
Although Quark has its own checkpoint / [configuration format](https://huggingface.co/amd/Llama-3.1-8B-Instruct-FP8-KV-Quark-test/blob/main/config.json#L26), the library also supports producing models with a serialization layout compliant with other quantization/runtime implementations ([AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq), [native fp8 in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8)).
|
||||
|
||||
To be able to load Quark quantized models in Transformers, the library first needs to be installed:
|
||||
|
||||
```bash
|
||||
pip install amd-quark
|
||||
```
|
||||
|
||||
## Support matrix
|
||||
|
||||
Models quantized through Quark support a large range of features that can be combined. All quantized models, independently of their configuration, can be seamlessly reloaded through [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
The table below shows a few features supported by Quark:
|
||||
|
||||
| **Feature** | **Supported subset in Quark** | |
|
||||
|---------------------------------|-----------------------------------------------------------------------------------------------------------|---|
|
||||
| Data types | int8, int4, int2, bfloat16, float16, fp8_e5m2, fp8_e4m3, fp6_e3m2, fp6_e2m3, fp4, OCP MX, MX6, MX9, bfp16 | |
|
||||
| Pre-quantization transformation | SmoothQuant, QuaRot, SpinQuant, AWQ | |
|
||||
| Quantization algorithm | GPTQ | |
|
||||
| Supported operators | ``nn.Linear``, ``nn.Conv2d``, ``nn.ConvTranspose2d``, ``nn.Embedding``, ``nn.EmbeddingBag`` | |
|
||||
| Granularity | per-tensor, per-channel, per-block, per-layer, per-layer type | |
|
||||
| KV cache | fp8 | |
|
||||
| Activation calibration | MinMax / Percentile / MSE | |
|
||||
| Quantization strategy | weight-only, static, dynamic, with or without output quantization | |
|
||||
|
||||
## Models on Hugging Face Hub
|
||||
|
||||
Public models using Quark native serialization can be found at https://huggingface.co/models?other=quark.
|
||||
|
||||
Although Quark also supports [models using `quant_method="fp8"`](https://huggingface.co/models?other=fp8) and [models using `quant_method="awq"`](https://huggingface.co/models?other=awq), Transformers instead loads these models through [AutoAWQ](https://huggingface.co/docs/transformers/quantization/awq) or the [native fp8 support in 🤗 Transformers](https://huggingface.co/docs/transformers/quantization/finegrained_fp8).
|
||||
|
||||
## Using Quark models in Transformers
|
||||
|
||||
Here is an example of how one can load a Quark model in Transformers:
|
||||
|
||||
```python
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
|
||||
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
|
||||
|
||||
print(model.model.layers[0].self_attn.q_proj)
|
||||
# QParamsLinear(
|
||||
# (weight_quantizer): ScaledRealQuantizer()
|
||||
# (input_quantizer): ScaledRealQuantizer()
|
||||
# (output_quantizer): ScaledRealQuantizer()
|
||||
# )
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
inp = tokenizer("Where is a good place to cycle around Tokyo?", return_tensors="pt")
|
||||
inp = inp.to(model.device)
|
||||
|
||||
res = model.generate(**inp, min_new_tokens=50, max_new_tokens=100)
|
||||
|
||||
print(tokenizer.batch_decode(res)[0])
|
||||
# <|begin_of_text|>Where is a good place to cycle around Tokyo? There are several places in Tokyo that are suitable for cycling, depending on your skill level and interests. Here are a few suggestions:
|
||||
# 1. Yoyogi Park: This park is a popular spot for cycling and has a wide, flat path that's perfect for beginners. You can also visit the Meiji Shrine, a famous Shinto shrine located in the park.
|
||||
# 2. Imperial Palace East Garden: This beautiful garden has a large, flat path that's perfect for cycling. You can also visit the
|
||||
```
|
||||
157
transformers/docs/source/en/quantization/selecting.md
Normal file
157
transformers/docs/source/en/quantization/selecting.md
Normal file
@@ -0,0 +1,157 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# Selecting a quantization method
|
||||
|
||||
There are many quantization methods available in Transformers for inference and fine-tuning. This guide helps you choose the most common and production-ready quantization techniques depending on your use case, and presents the advantages and disadvantages of each technique.
|
||||
|
||||
For a comprehensive overview of all supported methods and their features, refer back to the table in the [Overview](./overview).
|
||||
|
||||
## Inference
|
||||
|
||||
Consider the quantization methods below for inference.
|
||||
|
||||
| quantization method | use case |
|
||||
|---|---|
|
||||
| bitsandbytes | ease of use and QLoRA fine-tuning on NVIDIA GPUs |
|
||||
| compressed-tensors | loading specific quantized formats (FP8, Sparse) |
|
||||
| GPTQModel or AWQ | good 4-bit accuracy with upfront calibration |
|
||||
| HQQ | fast on the fly quantization without calibration |
|
||||
| torchao | flexibility and fast inference with torch.compile |
|
||||
|
||||
### No Calibration Required (On-the-fly Quantization)
|
||||
|
||||
These methods are generally easier to use as they don't need a separate calibration dataset or step.
|
||||
|
||||
#### bitsandbytes
|
||||
|
||||
| Pros | Cons |
|
||||
|--------------------------------------------------------------|---------------------------------------------------------|
|
||||
| Very simple, no calibration dataset required for inference. | Primarily optimized for NVIDIA GPUs (CUDA). |
|
||||
| Good community support and widely adopted. | Inference speedup isn't guaranteed. |
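
As a sketch of the typical workflow, 4-bit NF4 quantization happens on the fly at load time by passing a `BitsAndBytesConfig` (the model id below is only an example).

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    quantization_config=quant_config,
)
```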
|
||||
|
||||
See the [bitsandbytes documentation](./bitsandbytes) for more details.
|
||||
|
||||
#### HQQ (Half-Quadratic Quantization)
|
||||
|
||||
| Pros | Cons |
|
||||
|----------------------------------------------------------------------|----------------------------------------------------------------------------|
|
||||
| Fast quantization process, no calibration data needed. | Accuracy can degrade significantly at bit depths <4-bit. |
|
||||
| Multiple backends for fast inference. | Inference speed may not match other methods unless using `torch.compile` or an optimized backend. |
|
||||
| Compatible with `torch.compile`. | |
|
||||
| Supports wide range of bit depths (8, 4, 3, 2, 1-bit). | |
|
||||
|
||||
See the [HQQ documentation](./hqq) for more details.
|
||||
|
||||
#### torchao
|
||||
|
||||
| Pros | Cons |
|
||||
|----------------------------------------------------------------------|----------------------------------------------------------------------|
|
||||
| Strong integration with `torch.compile` for potential speedups. | Newer library, ecosystem still evolving. |
|
||||
| Offers decent CPU quantization support. | Performance depends on `torch.compile` working well. |
|
||||
| Flexibility in quantization schemes (int8, int4, fp8). | 4-bit quantization (int4wo) may not match GPTQ/AWQ in accuracy. |
|
||||
|
||||
See the [torchao documentation](./torchao) for more details.
|
||||
|
||||
### Calibration-based Quantization
|
||||
|
||||
These methods require an upfront calibration step using a dataset to potentially achieve higher accuracy.
|
||||
|
||||
#### GPTQ/GPTQModel
|
||||
|
||||
Calibration for an 8B model takes ~20 minutes on one A100 GPU.
|
||||
|
||||
| Pros | Cons |
|
||||
|----------------------------------------------------------------------|----------------------------------------------------------------------|
|
||||
| Often achieves high accuracy. | Requires a calibration dataset and a separate calibration step. |
|
||||
| Can lead to inference speedups. | Possible to overfit on calibration data. |
|
||||
| Many pre-quantized GPTQ models on [Hugging Face Hub](https://huggingface.co/models?other=gptq). | |
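
As a sketch of what the calibration step looks like in Transformers (assuming a GPTQ backend such as GPTQModel is installed along with Optimum), quantization runs at load time when a `GPTQConfig` with a calibration dataset is passed.

```py
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibrate on the "c4" dataset while loading; this is the slow, one-off step
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)
quantized_model.save_pretrained("llama-3.1-8b-gptq-4bit")
```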
|
||||
|
||||
See the [GPTQ documentation](./gptq) for more details.
|
||||
|
||||
#### AWQ (Activation-aware Weight Quantization)
|
||||
|
||||
Calibration for an 8B model takes ~10 minutes on one A100 GPU.
|
||||
|
||||
| Pros | Cons |
|
||||
|----------------------------------------------------------------------|-----------------------------------------------------|
|
||||
| Often achieves high accuracy at 4-bit. (Sometimes surpasses GPTQ on specific tasks.) | Requires calibration if quantizing yourself. |
|
||||
| Can lead to inference speedups. | |
|
||||
| Shorter calibration time than GPTQ. | |
|
||||
| Many pre-quantized AWQ models on [Hugging Face Hub](https://huggingface.co/models?other=awq). | |
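
Loading one of the pre-quantized AWQ checkpoints from the Hub is a single `from_pretrained` call, as sketched below (assumes AutoAWQ is installed; the checkpoint id is only an example).

```py
from transformers import AutoModelForCausalLM, AutoTokenizer

# Example AWQ checkpoint; any 4-bit AWQ model from the Hub works the same way
model_id = "TheBloke/Mistral-7B-Instruct-v0.2-AWQ"
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
```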
|
||||
|
||||
See the [AWQ documentation](./awq) for more details.
|
||||
|
||||
### Loading Specific Formats
|
||||
|
||||
#### compressed-tensors
|
||||
|
||||
| Pros | Cons |
|
||||
|--------------------------------------------------------------|-------------------------------------------------------------|
|
||||
| Supports flexible formats including FP8 and sparsity. | Primarily for loading pre-quantized models. |
|
||||
| | Doesn't perform quantization within Transformers directly. |
|
||||
|
||||
See the [compressed-tensors documentation](./compressed_tensors) for more details.
|
||||
|
||||
## Fine-tuning
|
||||
|
||||
Consider the quantization method below during fine-tuning to save memory.
|
||||
|
||||
### bitsandbytes[[training]]
|
||||
|
||||
* **Description:** The standard method for QLoRA fine-tuning via PEFT.
|
||||
* **Pros:** Enables fine-tuning large models on consumer GPUs; widely supported and documented for PEFT.
|
||||
* **Cons:** Primarily for NVIDIA GPUs.
|
||||
|
||||
Other methods offer PEFT compatibility, though bitsandbytes is the most established and straightforward path for QLoRA.
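
A minimal QLoRA setup sketch is shown below; it assumes `peft` and `bitsandbytes` are installed, and the model id and LoRA hyperparameters are only examples.

```py
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    quantization_config=quant_config,
)

# Freeze and cast the 4-bit base model, then attach trainable LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```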
|
||||
|
||||
See the [bitsandbytes documentation](./bitsandbytes#qlora) and [PEFT Docs](https://huggingface.co/docs/peft/developer_guides/quantization#aqlm-quantization) for more details.
|
||||
|
||||
## Research
|
||||
|
||||
Methods like [AQLM](./aqlm), [SpQR](./spqr), [VPTQ](./vptq), [HIGGS](./higgs), etc., push the boundaries of compression (< 2-bit) or explore novel techniques.
|
||||
|
||||
* Consider these if:
|
||||
* You need extreme compression (sub-4-bit).
|
||||
* You are conducting research or require state-of-the-art results from their respective papers.
|
||||
* You have significant compute resources available for potentially complex quantization procedures.
|
||||
We recommend consulting each method's documentation and associated papers carefully before choosing one for use in production.
|
||||
|
||||
## Benchmark Comparison
|
||||
|
||||
To provide a quantitative comparison of different quantization methods, we benchmarked several popular techniques on the Llama 3.1 8B and 70B models. The following tables show results for accuracy (higher is better), inference throughput measured in tokens/second (higher is better), peak VRAM usage measured in GB (lower is better), and quantization time.
|
||||
|
||||
Performance metrics were measured on 2 NVIDIA A100 80GB GPUs for Llama 3.1 70B (bfloat16), 1 NVIDIA H100 80GB GPU for FP8 methods, and 1 NVIDIA A100 80GB GPU for all other methods. Throughput was measured with a batch size of 1 and generating 64 tokens.
|
||||
Results for `torch.compile` and Marlin kernels are included where applicable and supported.
|
||||
|
||||
<iframe
|
||||
src="https://huggingface.co/datasets/derekl35/quantization-benchmarks/embed/viewer/default/train"
|
||||
frameborder="0"
|
||||
width="100%"
|
||||
height="560px"
|
||||
title="benchmarking results dataset"
|
||||
></iframe>
|
||||
|
||||
The key takeaways are:
|
||||
|
||||
| Quantization & Methods | Memory Savings (vs bf16) | Accuracy | Other Notes |
|
||||
|-------------------------------------------- |------------------------- |--------------------- |------------------------------------------------------------------- |
|
||||
| **8-bit** (bnb-int8, HQQ, Quanto, torchao, fp8) | ~2x | Very close to baseline bf16 model | |
|
||||
| **4-bit** (AWQ, GPTQ, HQQ, bnb-nf4) | ~4x | Relatively high accuracy | AWQ/GPTQ often lead in accuracy but need calibration. HQQ/bnb-nf4 are easy on-the-fly. |
|
||||
| **Sub-4-bit** (VPTQ, AQLM, 2-bit GPTQ) | Extreme (>4x) | Noticeable drop, especially at 2-bit | Quantization times can be very long (AQLM, VPTQ). Performance varies. |
|
||||
|
||||
> [!TIP]
|
||||
> Always benchmark the performance (accuracy and speed) of the quantized model on your specific task and hardware to ensure it meets your requirements. Refer to the individual documentation pages linked above for detailed usage instructions.
|
||||
40
transformers/docs/source/en/quantization/spqr.md
Normal file
40
transformers/docs/source/en/quantization/spqr.md
Normal file
@@ -0,0 +1,40 @@
|
||||
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
|
||||
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
|
||||
-->
|
||||
|
||||
# SpQR
|
||||
|
||||
The [SpQR](https://hf.co/papers/2306.03078) quantization algorithm involves a 16x16 tiled bi-level group 3-bit quantization structure with sparse outliers.
|
||||
|
||||
<div class="flex justify-center">
|
||||
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/spqr-diagram.png">
|
||||
</div>
|
||||
|
||||
> [!TIP]
|
||||
> To quantize a model with SpQR, refer to the [Vahe1994/SpQR](https://github.com/Vahe1994/SpQR) repository.
|
||||
|
||||
Load a SpQR-quantized model with [`~PreTrainedModel.from_pretrained`].
|
||||
|
||||
```python
|
||||
from transformers import AutoTokenizer, AutoModelForCausalLM
|
||||
import torch
|
||||
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf",
|
||||
dtype=torch.half,
|
||||
device_map="auto"
|
||||
)
|
||||
tokenizer = AutoTokenizer.from_pretrained("elvircrn/Llama-2-7b-SPQR-3Bit-16x16-red_pajama-hf")
|
||||
```
|
||||
683
transformers/docs/source/en/quantization/torchao.md
Normal file
683
transformers/docs/source/en/quantization/torchao.md
Normal file
@@ -0,0 +1,683 @@
|
||||
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
|
||||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
|
||||
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
|
||||
specific language governing permissions and limitations under the License.
|
||||
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
|
||||
rendered properly in your Markdown viewer.
|
||||
-->
|
||||
|
||||
# torchao
|
||||
|
||||
[](https://colab.research.google.com/github/huggingface/notebooks/blob/main/transformers_doc/en/quantization/torchao.ipynb)
|
||||
|
||||
[torchao](https://github.com/pytorch/ao) is a PyTorch architecture optimization library with support for custom high performance data types, quantization, and sparsity. It is composable with native PyTorch features such as [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) for even faster inference and training.
|
||||
|
||||
See the table below for additional torchao features.
|
||||
|
||||
| Feature | Description |
|
||||
|--------|-------------|
|
||||
| **Quantization Aware Training (QAT)** | Train quantized models with minimal accuracy loss (see [QAT README](https://github.com/pytorch/ao/blob/main/torchao/quantization/qat/README.md)) |
|
||||
| **Float8 Training** | High-throughput training with float8 formats (see [torchtitan](https://github.com/pytorch/torchtitan/blob/main/docs/float8.md) and [Accelerate](https://huggingface.co/docs/accelerate/usage_guides/low_precision_training#configuring-torchao) docs) |
|
||||
| **Sparsity Support** | Semi-structured (2:4) sparsity for faster inference (see [Accelerating Neural Network Training with Semi-Structured (2:4) Sparsity](https://pytorch.org/blog/accelerating-neural-network-training/) blog post) |
|
||||
| **Optimizer Quantization** | Reduce optimizer state memory with 4 and 8-bit variants of Adam |
|
||||
| **KV Cache Quantization** | Enables long context inference with lower memory (see [KV Cache Quantization](https://github.com/pytorch/ao/blob/main/torchao/_models/llama/README.md)) |
|
||||
| **Custom Kernels Support** | use your own `torch.compile` compatible ops |
|
||||
| **FSDP2** | Composable with FSDP2 for training|
|
||||
|
||||
> [!TIP]
|
||||
> Refer to the torchao [README.md](https://github.com/pytorch/ao#torchao-pytorch-architecture-optimization) for more details about the library.
|
||||
|
||||
torchao supports the [quantization techniques](https://github.com/pytorch/ao/blob/main/torchao/quantization/README.md) below.
|
||||
|
||||
- A16W8 Float8 Dynamic Quantization
|
||||
- A16W8 Float8 WeightOnly Quantization
|
||||
- A8W8 Int8 Dynamic Quantization
|
||||
- A16W8 Int8 Weight Only Quantization
|
||||
- A16W4 Int4 Weight Only Quantization
|
||||
- A16W4 Int4 Weight Only Quantization + 2:4 Sparsity
|
||||
- Autoquantization
|
||||
|
||||
torchao also supports module-level configuration by specifying a dictionary that maps the fully qualified name of a module to its quantization config. This allows you to skip quantizing certain layers and to use different quantization configs for different modules.
|
||||
|
||||
Check the table below to see if your hardware is compatible.
|
||||
|
||||
| Component | Compatibility |
|
||||
|----------|----------------|
|
||||
| CUDA Versions | ✅ cu118, cu126, cu128 |
|
||||
| XPU Versions | ✅ pytorch2.8 |
|
||||
| CPU | ✅ change `device_map="cpu"` (see examples below) |
|
||||
|
||||
Install torchao from PyPi or the PyTorch index with the following commands.
|
||||
|
||||
<hfoptions id="install torchao">
|
||||
<hfoption id="PyPi">
|
||||
|
||||
```bash
|
||||
# Updating 🤗 Transformers to the latest version, as the example script below uses the new auto compilation
|
||||
# Stable release from PyPI, which defaults to CUDA 12.6
|
||||
pip install --upgrade torchao transformers
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="PyTorch Index">
|
||||
Stable Release from the PyTorch index
|
||||
|
||||
```bash
|
||||
pip install torchao --index-url https://download.pytorch.org/whl/cu126 # options are cpu/cu118/cu126/cu128
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
If your torchao version is below 0.10.0, you need to upgrade it. Please refer to the [deprecation notice](#deprecation-notice) for more details.
|
||||
|
||||
## Quantization examples
|
||||
|
||||
TorchAO provides a variety of quantization configurations. Each configuration can be further customized with parameters such as `group_size`, `scheme`, and `layout` to optimize for specific hardware and model architectures.
|
||||
|
||||
For a complete list of available configurations, see the [quantization API documentation](https://github.com/pytorch/ao/blob/main/torchao/quantization/quant_api.py).
|
||||
|
||||
You can manually choose the quantization types and settings or automatically select the quantization types.
|
||||
|
||||
Create a [`TorchAoConfig`] and specify the quantization type and `group_size` of the weights to quantize (for int8 weight only and int4 weight only). Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method.
|
||||
|
||||
We'll show examples of the recommended quantization methods for different hardware, e.g. A100 GPU, H100 GPU, and CPU.
|
||||
|
||||
### H100 GPU
|
||||
<hfoptions id="examples-H100-GPU">
|
||||
<hfoption id="float8-dynamic-and-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Float8DynamicActivationFloat8WeightConfig, Float8WeightOnlyConfig
|
||||
|
||||
quant_config = Float8DynamicActivationFloat8WeightConfig()
|
||||
# or float8 weight only quantization
|
||||
# quant_config = Float8WeightOnlyConfig()
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="int4-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import GemliteUIntXWeightOnlyConfig
|
||||
|
||||
# We integrated with gemlite, which optimizes for batch size N on A100 and H100
|
||||
quant_config = GemliteUIntXWeightOnlyConfig(group_size=128)
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
|
||||
|
||||
<hfoption id="int4-weight-only-24sparse">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
from torchao.dtypes import MarlinSparseLayout
|
||||
|
||||
quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"RedHatAI/Sparse-Llama-3.1-8B-2of4",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### A100 GPU
|
||||
<hfoptions id="examples-A100-GPU">
|
||||
<hfoption id="int8-dynamic-and-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
|
||||
|
||||
quant_config = Int8DynamicActivationInt8WeightConfig()
|
||||
# or int8 weight only quantization
|
||||
# quant_config = Int8WeightOnlyConfig()
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="int4-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import GemliteUIntXWeightOnlyConfig, Int4WeightOnlyConfig
|
||||
|
||||
# For batch size N, we recommend gemlite, which may require autotuning
|
||||
# default is 4 bit, 8 bit is also supported by passing `bit_width=8`
|
||||
quant_config = GemliteUIntXWeightOnlyConfig(group_size=128)
|
||||
|
||||
# For batch size 1, there is also a custom tinygemm kernel that is optimized only for this case
|
||||
# We can set `use_hqq` to `True` for better accuracy
|
||||
# quant_config = Int4WeightOnlyConfig(group_size=128, use_hqq=True)
|
||||
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
|
||||
|
||||
<hfoption id="int4-weight-only-24sparse">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
from torchao.dtypes import MarlinSparseLayout
|
||||
|
||||
quant_config = Int4WeightOnlyConfig(layout=MarlinSparseLayout())
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model with sparsity. A sparse checkpoint is needed to accelerate without accuracy loss
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"RedHatAI/Sparse-Llama-3.1-8B-2of4",
|
||||
dtype=torch.float16,
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("RedHatAI/Sparse-Llama-3.1-8B-2of4")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Intel XPU
|
||||
<hfoptions id="examples-Intel-XPU">
|
||||
<hfoption id="int8-dynamic-and-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
|
||||
|
||||
quant_config = Int8DynamicActivationInt8WeightConfig()
|
||||
# or int8 weight only quantization
|
||||
# quant_config = Int8WeightOnlyConfig()
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
|
||||
<hfoption id="int4-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
from torchao.dtypes import Int4XPULayout
|
||||
from torchao.quantization.quant_primitives import ZeroPointDomain
|
||||
|
||||
|
||||
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4XPULayout(), zero_point_domain=ZeroPointDomain.INT)
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### CPU
|
||||
<hfoptions id="examples-CPU">
|
||||
<hfoption id="int8-dynamic-and-weight-only">
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int8DynamicActivationInt8WeightConfig, Int8WeightOnlyConfig
|
||||
|
||||
quant_config = Int8DynamicActivationInt8WeightConfig()
|
||||
# quant_config = Int8WeightOnlyConfig()
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="cpu",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt")
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
<hfoption id="int4-weight-only">
|
||||
|
||||
> [!TIP]
|
||||
> Run the quantized model on a CPU by changing `device_map` to `"cpu"` and `layout` to `Int4CPULayout()`.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
from torchao.quantization import Int4WeightOnlyConfig
|
||||
from torchao.dtypes import Int4CPULayout
|
||||
|
||||
quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
|
||||
# Load and quantize the model
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="cpu",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt")
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
</hfoption>
|
||||
</hfoptions>
|
||||
|
||||
### Per Module Quantization
|
||||
#### 1. Skip quantization for certain layers
|
||||
With `ModuleFqnToConfig` we can specify a default configuration for all layers while skipping quantization for certain layers.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
|
||||
|
||||
model_id = "meta-llama/Llama-3.1-8B-Instruct"
|
||||
|
||||
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig
|
||||
config = Int4WeightOnlyConfig(group_size=128)
|
||||
|
||||
# set default to int4 (for linears), and skip quantizing `model.layers.0.self_attn.q_proj`
|
||||
quant_config = ModuleFqnToConfig({"_default": config, "model.layers.0.self_attn.q_proj": None})
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype=torch.bfloat16, quantization_config=quantization_config)
|
||||
# lm_head is not quantized and model.layers.0.self_attn.q_proj is not quantized
|
||||
print("quantized model:", quantized_model)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
# Manual Testing
|
||||
prompt = "Hey, are you conscious? Can you talk to me?"
|
||||
inputs = tokenizer(prompt, return_tensors="pt").to(quantized_model.device.type)
|
||||
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
|
||||
output_text = tokenizer.batch_decode(
|
||||
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)
|
||||
print(output_text)
|
||||
```
|
||||
|
||||
#### 2. Quantizing different layers with different quantization configs
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import AutoModelForCausalLM, AutoTokenizer, TorchAoConfig
|
||||
|
||||
model_id = "facebook/opt-125m"
|
||||
|
||||
from torchao.quantization import Int4WeightOnlyConfig, ModuleFqnToConfig, Int8DynamicActivationInt4WeightConfig, IntxWeightOnlyConfig, PerAxis, MappingType
|
||||
|
||||
weight_dtype = torch.int8
|
||||
granularity = PerAxis(0)
|
||||
mapping_type = MappingType.ASYMMETRIC
|
||||
embedding_config = IntxWeightOnlyConfig(
|
||||
weight_dtype=weight_dtype,
|
||||
granularity=granularity,
|
||||
mapping_type=mapping_type,
|
||||
)
|
||||
linear_config = Int8DynamicActivationInt4WeightConfig(group_size=128)
|
||||
quant_config = ModuleFqnToConfig({"_default": linear_config, "model.decoder.embed_tokens": embedding_config, "model.decoder.embed_positions": None})
|
||||
# set `include_embedding` to True in order to include embedding in quantization
|
||||
# when `include_embedding` is True, we'll remove input embedding from `modules_not_to_convert` as well
|
||||
quantization_config = TorchAoConfig(quant_type=quant_config, include_embedding=True)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cpu", dtype=torch.bfloat16, quantization_config=quantization_config)
|
||||
print("quantized model:", quantized_model)
|
||||
# make sure embedding is quantized
|
||||
print("embed_tokens weight:", quantized_model.model.decoder.embed_tokens.weight)
|
||||
tokenizer = AutoTokenizer.from_pretrained(model_id)
|
||||
|
||||
# Manual Testing
|
||||
prompt = "Hey, are you conscious? Can you talk to me?"
|
||||
inputs = tokenizer(prompt, return_tensors="pt").to("cpu")
|
||||
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128, cache_implementation="static")
|
||||
output_text = tokenizer.batch_decode(
|
||||
generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
|
||||
)
|
||||
print(output_text)
|
||||
```
|
||||
|
||||
### Autoquant
|
||||
|
||||
If you want to automatically choose a quantization type for quantizable layers (`nn.Linear`) you can use the [autoquant](https://pytorch.org/ao/stable/generated/torchao.quantization.autoquant.html#torchao.quantization.autoquant) API.
|
||||
|
||||
The `autoquant` API automatically chooses a quantization type by micro-benchmarking on input type and shape and compiling a single linear layer.
|
||||
|
||||
Note: autoquant is for GPU only right now.
|
||||
|
||||
Create a [`TorchAoConfig`] and set the quantization type to `"autoquant"`. Set the `cache_implementation` to `"static"` to automatically [torch.compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) the forward method. Finally, call `finalize_autoquant` on the quantized model to finalize the quantization and log the input shapes.
|
||||
|
||||
```py
|
||||
import torch
|
||||
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
|
||||
|
||||
quantization_config = TorchAoConfig("autoquant", min_sqnr=None)
|
||||
quantized_model = AutoModelForCausalLM.from_pretrained(
|
||||
"meta-llama/Llama-3.1-8B-Instruct",
|
||||
dtype="auto",
|
||||
device_map="auto",
|
||||
quantization_config=quantization_config
|
||||
)
|
||||
|
||||
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
|
||||
input_text = "What are we having for dinner?"
|
||||
input_ids = tokenizer(input_text, return_tensors="pt").to(quantized_model.device.type)
|
||||
|
||||
# auto-compile the quantized model with `cache_implementation="static"` to get speed up
|
||||
output = quantized_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static")
|
||||
# explicitly call `finalize_autoquant` (may be refactored and removed in the future)
|
||||
quantized_model.finalize_autoquant()
|
||||
print(tokenizer.decode(output[0], skip_special_tokens=True))
|
||||
```
|
||||
|
||||
## Serialization
|
||||
|
||||
torchao implements [torch.Tensor subclasses](https://pytorch.org/docs/stable/notes/extending.html#subclassing-torch-tensor) for maximum flexibility in supporting new quantized torch.Tensor formats. [Safetensors](https://huggingface.co/docs/safetensors/en/index) serialization and deserialization do not work with torchao.
|
||||
|
||||
To avoid arbitrary user code execution, torchao sets `weights_only=True` in [torch.load](https://pytorch.org/docs/stable/generated/torch.load.html) to ensure only tensors are loaded. Any known user functions can be whitelisted with [add_safe_globals](https://pytorch.org/docs/stable/notes/serialization.html#torch.serialization.add_safe_globals).
|
||||
|
||||
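For example, a hypothetical helper function referenced by a checkpoint could be whitelisted as in the minimal sketch below (the helper name and file path are made up for illustration).

```py
import torch
from torch.serialization import add_safe_globals

# hypothetical user-defined helper that the checkpoint happens to reference
def rescale_weights(tensor):
    return tensor

# whitelist it so torch.load(..., weights_only=True) can deserialize the checkpoint
add_safe_globals([rescale_weights])
state_dict = torch.load("pytorch_model.bin", weights_only=True)
```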
<hfoptions id="serialization-examples">
<hfoption id="save-locally">

```py
# don't serialize the model with Safetensors
output_dir = "llama3-8b-int4wo-128"
quantized_model.save_pretrained(output_dir, safe_serialization=False)
```

</hfoption>
<hfoption id="push-to-huggingface-hub">

```py
# don't serialize the model with Safetensors
USER_ID = "your_huggingface_user_id"
REPO_ID = "llama3-8b-int4wo-128"
quantized_model.push_to_hub(f"{USER_ID}/{REPO_ID}", safe_serialization=False)
tokenizer.push_to_hub(f"{USER_ID}/{REPO_ID}")
```

</hfoption>
</hfoptions>

## Loading quantized models

Loading a quantized model depends on the quantization scheme. For some quantization schemes, such as int8 and float8, you can quantize the model on any device and also load it on any device. The example below demonstrates quantizing a model on the CPU and then loading it on CUDA or XPU.
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig

quant_config = Int8WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int8"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
    output_dir,
    device_map="auto",
    dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to(reloaded_model.device.type)

output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

For int4, the model can only be loaded on the same device it was quantized on because the layout is specific to the device. The example below demonstrates quantizing and loading a model on the CPU.
```py
import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int4WeightOnlyConfig
from torchao.dtypes import Int4CPULayout

quant_config = Int4WeightOnlyConfig(group_size=128, layout=Int4CPULayout())
quantization_config = TorchAoConfig(quant_type=quant_config)

# Load and quantize the model
quantized_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config
)
# save the quantized model
output_dir = "llama-3.1-8b-torchao-int4-cpu"
quantized_model.save_pretrained(output_dir, safe_serialization=False)

# reload the quantized model
reloaded_model = AutoModelForCausalLM.from_pretrained(
    output_dir,
    device_map="cpu",
    dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")

output = reloaded_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

## ⚠️ Deprecation Notice

> Starting with version 0.10.0, the string-based API for quantization configuration (e.g., `TorchAoConfig("int4_weight_only", group_size=128)`) is **deprecated** and will be removed in a future release.
>
> Please use the new `AOBaseConfig`-based approach instead:
>
> ```python
> # Old way (deprecated)
> quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
>
> # New way (recommended)
> from torchao.quantization import Int4WeightOnlyConfig
> quant_config = Int4WeightOnlyConfig(group_size=128)
> quantization_config = TorchAoConfig(quant_type=quant_config)
> ```
>
> The new API offers greater flexibility, better type safety, and access to the full range of features available in torchao.
>
> **Migration Guide**
>
> Here's how to migrate from common string identifiers to their `AOBaseConfig` equivalents:
>
> | Old String API | New `AOBaseConfig` API |
> |----------------|------------------------|
> | `"int4_weight_only"` | `Int4WeightOnlyConfig()` |
> | `"int8_weight_only"` | `Int8WeightOnlyConfig()` |
> | `"int8_dynamic_activation_int8_weight"` | `Int8DynamicActivationInt8WeightConfig()` |
>
> All configuration objects accept parameters for customization (e.g., `group_size`, `scheme`, `layout`).

## Resources

For a better sense of expected performance, view the [benchmarks](https://github.com/pytorch/ao/tree/main/torchao/quantization#benchmarks) for various models with CUDA and XPU backends. You can also run the code below to benchmark a model yourself.
```py
from torch._inductor.utils import do_bench_using_profiling
from typing import Callable

# assumes `quantized_model`, `input_ids`, and `model_name` are already defined,
# e.g. from the earlier examples with model_name = "meta-llama/Llama-3.1-8B-Instruct"
def benchmark_fn(func: Callable, *args, **kwargs) -> float:
    """Thin wrapper around do_bench_using_profiling"""
    no_args = lambda: func(*args, **kwargs)
    time = do_bench_using_profiling(no_args)
    return time * 1e3

MAX_NEW_TOKENS = 1000
print("int4wo-128 model:", benchmark_fn(quantized_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))

bf16_model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", dtype=torch.bfloat16)
output = bf16_model.generate(**input_ids, max_new_tokens=10, cache_implementation="static") # auto-compile
print("bf16 model:", benchmark_fn(bf16_model.generate, **input_ids, max_new_tokens=MAX_NEW_TOKENS, cache_implementation="static"))
```

> [!TIP]
> For best performance, you can use recommended settings by calling `torchao.quantization.utils.recommended_inductor_config_setter()`.
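>
> A minimal sketch of applying this setting before benchmarking or generation (assumes a recent torchao release):
>
> ```py
> from torchao.quantization.utils import recommended_inductor_config_setter
>
> # set torchao's recommended torch.compile/inductor flags once, before calling generate()
> recommended_inductor_config_setter()
> ```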
Refer to [Other Available Quantization Techniques](https://github.com/pytorch/ao/tree/main/torchao/quantization#other-available-quantization-techniques) for more examples and documentation.

## Issues

If you encounter any issues with the Transformers integration, please open an issue on the [Transformers](https://github.com/huggingface/transformers/issues) repository. For issues directly related to torchao, please open an issue on the [torchao](https://github.com/pytorch/ao/issues) repository.
72
transformers/docs/source/en/quantization/vptq.md
Normal file
@@ -0,0 +1,72 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# VPTQ

[Vector Post-Training Quantization (VPTQ)](https://github.com/microsoft/VPTQ) is a Post-Training Quantization (PTQ) method that leverages vector quantization to quantize LLMs at an extremely low bit-width (<2-bit). VPTQ can compress a 70B or even a 405B model to 1-2 bits without retraining and still maintain a high degree of accuracy. It is a lightweight quantization algorithm that takes ~17 hours to quantize a 405B model. VPTQ features agile inference with low decoding overhead, high throughput, and a fast Time To First Token (TTFT).

Run the command below to install VPTQ, which provides efficient kernels for inference on NVIDIA and AMD GPUs.

```bash
pip install vptq
```
The [VPTQ-community](https://huggingface.co/VPTQ-community) provides a collection of VPTQ-quantized models. The model name encodes its bit-width (excluding codebook, parameter, and padding overhead). Consider the `Meta-Llama-3.1-70B-Instruct-v8-k65536-256-woft` model as an example.

- The base model is Meta-Llama-3.1-70B-Instruct.
- The vector length is 8 (`v8`).
- The number of centroids is 65536 (2^16).
- The number of residual centroids is 256 (2^8).

The equivalent bit-width is calculated as follows.

- index: log2(65536) = 16 bits per index, or 16 / 8 = 2 bits per weight
- residual index: log2(256) = 8 bits per index, or 8 / 8 = 1 bit per weight
- total bit-width: 2 + 1 = 3 bits per weight

From here, estimate the model size by multiplying 70B parameters by 3 bits and dividing by 8 bits per byte, for a total of roughly 26.25GB.
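The short, purely illustrative snippet below reproduces this arithmetic using the values from the model name above.

```py
import math

num_params = 70e9             # Meta-Llama-3.1-70B
vector_length = 8             # "v8" in the model name
num_centroids = 65536         # "k65536"
num_residual_centroids = 256  # "256"

index_bits = math.log2(num_centroids) / vector_length              # 16 / 8 = 2 bits per weight
residual_bits = math.log2(num_residual_centroids) / vector_length  # 8 / 8 = 1 bit per weight
total_bits = index_bits + residual_bits                            # 3 bits per weight

size_gb = num_params * total_bits / 8 / 1e9                        # ≈ 26.25 GB
print(f"{total_bits:.0f} bits per weight, ~{size_gb:.2f} GB")
```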
Load a VPTQ quantized model with [`~PreTrainedModel.from_pretrained`].

```py
from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "VPTQ-community/Meta-Llama-3.1-70B-Instruct-v16-k65536-65536-woft",
    dtype="auto",
    device_map="auto"
)
```
To quantize your own model, refer to the [VPTQ Quantization Algorithm Tutorial](https://github.com/microsoft/VPTQ/blob/algorithm/algorithm.md).

## Benchmarks

VPTQ achieves better accuracy and higher throughput with lower quantization overhead across models of different sizes. The following experimental results are for reference only; VPTQ can achieve better outcomes under reasonable parameters, especially in terms of model accuracy and inference speed.

| Model | Bit-width | WikiText-2 PPL ↓ | C4 PPL ↓ | Avg. QA accuracy ↑ | tok/s ↑ | Memory (GB) | Cost/h ↓ |
| ----------- | --------- | ---------------- | -------- | ------------------ | ------- | ----------- | -------- |
| LLaMA-2 7B  | 2.02 | 6.13 | 8.07 | 58.2 | 39.9 | 2.28  | 2   |
|             | 2.26 | 5.95 | 7.87 | 59.4 | 35.7 | 2.48  | 3.1 |
| LLaMA-2 13B | 2.02 | 5.32 | 7.15 | 62.4 | 26.9 | 4.03  | 3.2 |
|             | 2.18 | 5.28 | 7.04 | 63.1 | 18.5 | 4.31  | 3.6 |
| LLaMA-2 70B | 2.07 | 3.93 | 5.72 | 68.6 | 9.7  | 19.54 | 19  |
|             | 2.11 | 3.92 | 5.71 | 68.7 | 9.7  | 20.01 | 19  |
## Resources

See a demo of VPTQ in the VPTQ Online Demo [Space](https://huggingface.co/spaces/microsoft/VPTQ) or try running the VPTQ inference [notebook](https://colab.research.google.com/github/microsoft/VPTQ/blob/main/notebooks/vptq_example.ipynb).

For more information, read the VPTQ [paper](https://huggingface.co/papers/2409.17066).