| license | language | library_name | pipeline_tag | base_model | base_model_relation | tags |
|---|---|---|---|---|---|---|
| apache-2.0 | | transformers | text-generation | ibm-granite/granite-4.1-8b | quantized | |
Granite 4.1-8B — AWQ W4A16
4-bit AWQ quantization of ibm-granite/granite-4.1-8b,
produced with llm-compressor 0.6.0.1.
The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU with comfortable KV-cache headroom and serves natively in vLLM via the AWQ-Marlin kernels.
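As a rough cross-check of the size figures above, the sketch below estimates checkpoint size from parameter counts. It assumes one 16-bit scale and one 4-bit zero point per group of 128 weights, and the split between quantized and 16-bit parameters is a hypothetical illustration, not a measurement of this checkpoint.

```python
def effective_bits_per_weight(weight_bits=4, group_size=128,
                              scale_bits=16, zero_point_bits=4):
    """Average storage cost of one quantized weight, including per-group metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


def estimate_checkpoint_gb(quantized_params, unquantized_params,
                           bits_per_weight=None, unquantized_bytes=2):
    """Approximate checkpoint size in GB (10^9 bytes)."""
    if bits_per_weight is None:
        bits_per_weight = effective_bits_per_weight()
    quantized_bytes = quantized_params * bits_per_weight / 8
    dense_bytes = unquantized_params * unquantized_bytes
    return (quantized_bytes + dense_bytes) / 1e9


# Hypothetical split: 8.0e9 weights in quantized Linear layers and
# 0.8e9 parameters (embeddings, lm_head, norms) kept in 16-bit.
print(round(effective_bits_per_weight(), 3))
print(round(estimate_checkpoint_gb(8.0e9, 0.8e9), 2))
```

The per-group metadata adds only about 0.16 bits per weight, so most of the gap between an ideal 4-bit size and the actual file size comes from the layers left unquantized.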
Quantization details
| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Ignored modules | lm_head |
| Targeted modules | All Linear layers |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) |
| Calibration samples | 256 |
| Max sequence length | 2048 |
| Tool | llmcompressor==0.6.0.1 |
| Output format | compressed-tensors (safetensors, sharded) |
The exact recipe used:
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]
The full machine-readable recipe is also included in this repo as
recipe.yaml.
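To make the W4A16 asymmetric, group size 128 scheme concrete, here is a minimal pure-Python sketch of group-wise asymmetric 4-bit quantization. It shows only the storage format (round-to-nearest within each group). It is not AWQ's activation-aware scaling search, and llm-compressor's actual rounding and zero-point handling may differ. All function names are illustrative.

```python
def quantize_group(weights, num_bits=4):
    """Asymmetric uniform quantization of one group of weights.

    Returns integer codes in [0, 2**num_bits - 1] plus the scale and zero point.
    """
    qmax = 2 ** num_bits - 1
    # Extend the range to include zero so that 0.0 is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group: any scale works
    zero_point = max(0, min(qmax, round(-w_min / scale)))
    codes = [max(0, min(qmax, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=128, num_bits=4):
    """Split a weight row into consecutive groups and quantize each independently."""
    return [quantize_group(row[i:i + group_size], num_bits)
            for i in range(0, len(row), group_size)]
```

Each group carries its own scale and zero point, so an outlier only degrades the resolution of the 128 weights it shares a group with rather than the whole row.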
Serving with vLLM
vllm serve pranavbapat/granite-4.1-8b-awq \
  --quantization awq_marlin \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --served-model-name granite-4.1-8b-awq
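Because the checkpoint is stored in compressed-tensors format, vLLM should also be able to detect the quantization scheme from config.json on its own; if your vLLM version rejects the explicit --quantization flag as conflicting with the checkpoint config, drop that flag.
Then query the OpenAI-compatible endpoint: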
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
Tested with vllm==0.8.5.post1, torch==2.6.0+cu124, NVIDIA L40S (Ada, sm_89).
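The same request can be sent from Python using only the standard library. This is a sketch that assumes the server from the vllm serve command above is listening on localhost:8000; the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumed: default port of the server started above


def build_chat_request(prompt, model="granite-4.1-8b-awq", max_tokens=64,
                       base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, timeout=60, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("Hello!"))
```

The model field must match the value passed to --served-model-name, not the repository name.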
Loading with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Loading this format in transformers requires the compressed-tensors package to be installed. For the best decode speed, prefer the vLLM path: transformers will run the model in a dequantized form on most architectures.
Hardware compatibility
| Device | Status |
|---|---|
| NVIDIA Hopper (H100/H200) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ada Lovelace (L40S, RTX 4090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ampere (A100, A10, RTX 3090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Turing and older | ⚠️ Marlin not supported — use the slower AWQ kernel |
| AMD / Apple Silicon | ❌ Use the original BF16 model with a different runtime |
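The NVIDIA rows of the table reduce to a compute-capability check: Ampere and newer report 8.0 or higher, while Turing reports 7.5. A minimal sketch follows; the function name and return labels are illustrative and are not vLLM flag values.

```python
def pick_kernel(compute_capability):
    """Return which kernel family the table above implies for an NVIDIA GPU.

    compute_capability is a (major, minor) tuple, e.g. the value returned by
    torch.cuda.get_device_capability().
    """
    major, minor = compute_capability
    if (major, minor) >= (8, 0):
        return "marlin"    # Ampere (8.0/8.6), Ada (8.9), Hopper (9.0)
    return "fallback"      # Turing (7.5) and older: slower AWQ kernel


for name, cc in [("L40S", (8, 9)), ("A100", (8, 0)), ("T4", (7, 5))]:
    print(name, pick_kernel(cc))
```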
Files
| File | Purpose |
|---|---|
| model-0000{1,2}-of-00002.safetensors | Quantized weights, sharded |
| model.safetensors.index.json | Shard map |
| config.json, generation_config.json | Model + generation config |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, chat_template.jinja | Tokenizer files |
| recipe.yaml | llm-compressor recipe used for this quantization |
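After downloading, the shard map can be used to confirm that every shard it references is present. The sketch below relies on the weight_map key of model.safetensors.index.json, which maps tensor names to shard files; the helper names and the tensor names in the example are illustrative.

```python
def shard_files(index):
    """Return the sorted, de-duplicated shard filenames referenced by the index."""
    return sorted(set(index["weight_map"].values()))


def missing_shards(index, present_files):
    """Return shards named in the index that are absent from present_files."""
    present = set(present_files)
    return [name for name in shard_files(index) if name not in present]


# Example with a tiny hand-written index (tensor names are placeholders):
example_index = {
    "weight_map": {
        "layer_a.weight": "model-00001-of-00002.safetensors",
        "layer_b.weight": "model-00002-of-00002.safetensors",
    }
}
print(missing_shards(example_index, ["model-00001-of-00002.safetensors"]))
```

In practice the index would come from json.load on the file and present_files from os.listdir on the checkpoint directory.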
Limitations
- AWQ is a lossy post-training quantization. Expect a small quality regression vs. the BF16 base — typically ≤ 1% on most reasoning benchmarks, but task-dependent.
- Calibration was done on ultrachat_200k, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
- Inherits all behavioural and safety characteristics of the base model. See the original Granite 4.1-8B model card for the full picture.
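For re-calibration on your own data, one option is to shape domain samples into the same messages layout that ultrachat_200k uses, so the rest of the quantization script can stay unchanged. The sketch below is a hypothetical helper, not part of llm-compressor; it assumes your data is a list of (prompt, response) string pairs.

```python
import random


def build_calibration_records(pairs, num_samples=256, seed=0):
    """Sample (prompt, response) pairs and wrap them as chat-style records."""
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible
    chosen = rng.sample(pairs, k=min(num_samples, len(pairs)))
    return [
        {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
        }
        for prompt, response in chosen
    ]


# Placeholder domain data for illustration only.
pairs = [(f"Question {i}", f"Answer {i}") for i in range(10)]
records = build_calibration_records(pairs, num_samples=4)
print(len(records), records[0]["messages"][0]["role"])
```

The resulting list can then be turned into a dataset (for example with datasets.Dataset.from_list) and passed through the chat template in place of the ultrachat_200k split.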
Reproducing this checkpoint
The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal. Quantization step:
from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256
MAX_SEQ = 2048

# Load the BF16 base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Take the first NUM_SAMPLES conversations and render them with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# Tokenize, truncating to MAX_SEQ; the template already inserted special tokens.
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Run one-shot AWQ calibration and write the compressed checkpoint to OUTPUT_DIR.
oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)
tokenizer.save_pretrained(OUTPUT_DIR)
On a single L40S this completes in roughly 60–90 minutes.
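A quick sanity check after the run is to read the quantization settings back from the saved config.json. The sketch below assumes the compressed-tensors layout (a quantization_config entry containing quant_method, ignore, and config_groups with per-group weights settings); treat those key names as assumptions and adjust them if your file differs.

```python
def summarize_quantization(config):
    """Extract the main quantization settings from a parsed config.json dict."""
    qcfg = config.get("quantization_config") or {}
    summary = {
        "quant_method": qcfg.get("quant_method"),
        "ignore": qcfg.get("ignore", []),
        "groups": [],
    }
    for name, group in (qcfg.get("config_groups") or {}).items():
        weights = group.get("weights") or {}
        summary["groups"].append({
            "name": name,
            "num_bits": weights.get("num_bits"),
            "group_size": weights.get("group_size"),
            "symmetric": weights.get("symmetric"),
        })
    return summary


# Hand-written example of the assumed layout (not copied from this repo):
example = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
        "config_groups": {
            "group_0": {"weights": {"num_bits": 4, "group_size": 128, "symmetric": False}}
        },
    }
}
print(summarize_quantization(example))
```

For this checkpoint the expectation would be 4-bit weights, group size 128, symmetric set to false, and lm_head in the ignore list, matching the table under Quantization details.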
License
This checkpoint inherits the Apache 2.0 license of the upstream
ibm-granite/granite-4.1-8b.
You are free to use, modify, and redistribute it under the same terms.
Acknowledgments
- IBM Research for the Granite 4.1 base model.
- The vLLM team for AWQ-Marlin kernels.
- The llm-compressor team for the quantization tooling.
Citation
If you use this model in your work, please also cite the original Granite paper and the AWQ paper:
@article{granite2024,
  title = {Granite Foundation Models},
  author = {{IBM Research}},
  year = {2024},
  url = {https://huggingface.co/ibm-granite}
}
@inproceedings{lin2024awq,
  title = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
  author = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle = {MLSys},
  year = {2024}
}