---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.1-8b
base_model_relation: quantized
tags:
- granite
- awq
- w4a16
- compressed-tensors
- llmcompressor
- quantized
- vllm
---

# Granite 4.1-8B: AWQ W4A16

4-bit AWQ quantization of [`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b),
produced with [`llm-compressor`](https://github.com/vllm-project/llm-compressor) 0.6.0.1.

The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU
with comfortable KV-cache headroom and serves natively in [vLLM](https://github.com/vllm-project/vllm)
via the AWQ-Marlin kernels.

## Quantization details

| Setting | Value |
| --- | --- |
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Ignored modules | `lm_head` |
| Targeted modules | All `Linear` layers |
| Calibration dataset | `HuggingFaceH4/ultrachat_200k` (`train_sft` split) |
| Calibration samples | 256 |
| Max sequence length | 2048 |
| Tool | `llmcompressor==0.6.0.1` |
| Output format | `compressed-tensors` (safetensors, sharded) |

The exact recipe used:

```python
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]
```

The full machine-readable recipe is also included in this repo as
[`recipe.yaml`](./recipe.yaml).
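
To make the "W4A16 asymmetric, group size 128" row in the table concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It is an illustration only: it uses plain round-to-nearest and leaves out the activation-aware scaling that AWQ adds on top, and it is not the code path used by `llm-compressor`. The helper names (`quantize_group`, `quantize_row`, and so on) are made up for this example.

```python
GROUP_SIZE = 128  # weights sharing one scale and zero point
NUM_LEVELS = 16   # 4 bits give integer codes 0..15


def quantize_group(weights):
    """Asymmetric round-to-nearest quantization of one group of floats."""
    # Extend the range to include 0.0 so the zero point is a valid code.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    # Fall back to 1.0 for an all-zero group, where the range is empty.
    scale = (hi - lo) / (NUM_LEVELS - 1) or 1.0
    zero_point = min(NUM_LEVELS - 1, max(0, round(-lo / scale)))
    codes = [
        min(NUM_LEVELS - 1, max(0, round(w / scale) + zero_point))
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map 4-bit codes back to approximate float weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row as consecutive, independent groups."""
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups):
    """Rebuild the full row from its quantized groups."""
    restored = []
    for codes, scale, zero_point in groups:
        restored.extend(dequantize_group(codes, scale, zero_point))
    return restored
```

Each group stores its own scale and zero point, so the reconstruction error of any weight is bounded by half that group's scale. Smaller groups track local weight ranges more closely at the cost of more metadata per weight.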

## Serving with vLLM

```bash
vllm serve pranavbapat/granite-4.1-8b-awq \
  --quantization awq_marlin \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --served-model-name granite-4.1-8b-awq
```

Then hit the OpenAI-compatible endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```

Tested with `vllm==0.8.5.post1`, `torch==2.6.0+cu124`, NVIDIA L40S (Ada, sm_89).
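
The same request as the `curl` call can be sent from Python with only the standard library. This is a minimal sketch that assumes the server from the `vllm serve` command is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical and not part of vLLM.

```python
import json
import urllib.request

# Assumes the default vLLM port, as in the curl example.
BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="granite-4.1-8b-awq", max_tokens=64):
    """Build (but do not send) a chat-completions POST request."""
    payload = {
        "model": model,  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply. Any OpenAI-compatible client should work the same way once its base URL points at the server.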

## Loading with `transformers`

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For best decode speed, prefer the vLLM path: `transformers` runs the model in
dequantized form on most architectures.

## Hardware compatibility

| Device | Status |
| --- | --- |
| NVIDIA Hopper (H100/H200) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ada Lovelace (L40S, RTX 4090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ampere (A100, A10, RTX 3090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Turing and older | ⚠️ Marlin not supported; use the slower AWQ kernel |
| AMD / Apple Silicon | ❌ Use the original BF16 model with a different runtime |
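
For NVIDIA GPUs, the table boils down to a compute-capability check. The sketch below assumes the Marlin cut-off is compute capability 8.0 (Ampere), which is what the rows above imply; `kernel_path` and `detect_kernel_path` are hypothetical helper names, and the `torch` lookup is optional.

```python
# Assumption: Marlin kernels need compute capability 8.0 (Ampere) or newer.
MARLIN_MIN_CAPABILITY = (8, 0)


def kernel_path(major, minor):
    """Classify an NVIDIA compute capability against the table above."""
    if (major, minor) >= MARLIN_MIN_CAPABILITY:
        return "marlin"
    return "fallback"


def detect_kernel_path():
    """Check the first CUDA device; returns None without torch or a GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return kernel_path(*torch.cuda.get_device_capability(0))
```

An L40S reports `(8, 9)` and lands on the Marlin path, while a Turing card such as the T4 reports `(7, 5)` and falls back to the slower kernel.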

## Files

| File | Purpose |
| --- | --- |
| `model-0000{1,2}-of-00002.safetensors` | Quantized weights, sharded |
| `model.safetensors.index.json` | Shard map |
| `config.json`, `generation_config.json` | Model + generation config |
| `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer files |
| `recipe.yaml` | llm-compressor recipe used for this quantization |
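
The shard map is a JSON file with a `weight_map` from tensor name to shard filename, so a partial download is easy to catch. The sketch below is a minimal check using only the standard library; the helper names are hypothetical, and it assumes the usual `safetensors` index layout with an optional `metadata.total_size` field.

```python
import json
from collections import Counter


def load_index(path="model.safetensors.index.json"):
    """Read the shard map from a local copy of the repo."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def summarize_shard_map(index):
    """Return ({shard filename: tensor count}, total_size or None)."""
    per_shard = Counter(index["weight_map"].values())
    total_size = index.get("metadata", {}).get("total_size")
    return dict(per_shard), total_size


def missing_shards(index, present_files):
    """List shards the index references that are not in present_files."""
    referenced = set(index["weight_map"].values())
    return sorted(referenced - set(present_files))
```

Passing the output of `os.listdir` on the download directory as `present_files` gives an empty list when every shard referenced by the index is on disk.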

## Limitations

- AWQ is a **lossy** post-training quantization. Expect a small quality regression vs. the BF16 base: typically ≤ 1% on most reasoning benchmarks, but task-dependent.
- Calibration was done on `ultrachat_200k`, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
- Inherits all behavioural and safety characteristics of the base model. See the [original Granite 4.1-8B model card](https://huggingface.co/ibm-granite/granite-4.1-8b) for the full picture.
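
For the re-calibration case, domain data has to end up in the same `messages` shape that `ultrachat_200k` uses, so it can pass through the same chat-template step. The following is a rough sketch under that assumption; `to_chat_record` and `build_calibration_records` are hypothetical helpers, and the (prompt, response) pair format is just one plausible starting point.

```python
def to_chat_record(prompt, response):
    """Wrap one prompt/response pair as a two-turn chat record."""
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": response},
        ]
    }


def build_calibration_records(pairs, num_samples=256):
    """Drop pairs with an empty side and keep at most num_samples records."""
    records = [
        to_chat_record(prompt, response)
        for prompt, response in pairs
        if prompt.strip() and response.strip()
    ]
    return records[:num_samples]
```

The resulting list can be turned into a dataset with `datasets.Dataset.from_list(records)` and used in place of the `ultrachat_200k` split during calibration.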

## Reproducing this checkpoint

The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal.
Quantization step:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
from datasets import load_dataset

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256
MAX_SEQ = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Take the first NUM_SAMPLES conversations and render them with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
# Tokenize; the template already inserted special tokens, so do not add them again.
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Calibrate, quantize, and write the compressed checkpoint to OUTPUT_DIR.
oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)
tokenizer.save_pretrained(OUTPUT_DIR)
```

On a single L40S this completes in roughly 60–90 minutes.

## License

This checkpoint inherits the **Apache 2.0** license of the upstream
[`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b).
You are free to use, modify, and redistribute it under the same terms.

## Acknowledgments

- IBM Research for the [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-8b) base model.
- The [vLLM](https://github.com/vllm-project/vllm) team for AWQ-Marlin kernels.
- The [llm-compressor](https://github.com/vllm-project/llm-compressor) team for the quantization tooling.

## Citation

If you use this model in your work, please also cite the original Granite paper and the AWQ paper:

```bibtex
@article{granite2024,
  title  = {Granite Foundation Models},
  author = {{IBM Research}},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite}
}

@inproceedings{lin2024awq,
  title     = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
  author    = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle = {MLSys},
  year      = {2024}
}
```