---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.1-8b
base_model_relation: quantized
tags:
- granite
- awq
- w4a16
- compressed-tensors
- llmcompressor
- quantized
- vllm
---

# Granite 4.1-8B: AWQ W4A16

4-bit AWQ quantization of [`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b), produced with [`llm-compressor`](https://github.com/vllm-project/llm-compressor) 0.6.0.1.

The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU with comfortable KV-cache headroom and serves natively in [vLLM](https://github.com/vllm-project/vllm) via the AWQ-Marlin kernels.

## Quantization details

| Setting | Value |
| --- | --- |
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Ignored modules | `lm_head` |
| Targeted modules | All `Linear` layers |
| Calibration dataset | `HuggingFaceH4/ultrachat_200k` (`train_sft` split) |
| Calibration samples | 256 |
| Max sequence length | 2048 |
| Tool | `llmcompressor==0.6.0.1` |
| Output format | `compressed-tensors` (safetensors, sharded) |

The exact recipe used:

```python
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]
```

The full machine-readable recipe is also included in this repo as [`recipe.yaml`](./recipe.yaml).

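
To make "W4A16 asymmetric, group size 128" concrete, here is a minimal NumPy sketch of group-wise asymmetric 4-bit quantization. It is illustrative only: it uses plain round-to-nearest with a min/max range per group, and it omits both the activation-aware channel scaling that AWQ derives from calibration data and the bit-packing done by `compressed-tensors`. The function names `quantize_w4_asym` and `dequantize_w4_asym` are hypothetical and are not part of `llm-compressor`, whose actual implementation may differ.

```python
import numpy as np


def quantize_w4_asym(weights: np.ndarray, group_size: int = 128):
    """Round-to-nearest asymmetric 4-bit quantization with one
    (scale, zero point) pair per group of `group_size` input weights.

    Hypothetical helper for illustration; not the llm-compressor API.
    """
    rows, cols = weights.shape
    assert cols % group_size == 0, "input dim must be divisible by group_size"
    groups = weights.astype(np.float32).reshape(rows, cols // group_size, group_size)

    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)

    # 4 bits give 16 levels (0..15); the epsilon guards constant groups.
    scale = np.maximum(w_max - w_min, 1e-8) / 15.0
    # Asymmetric: the zero point shifts the grid so that w_min maps near level 0.
    zero_point = np.clip(np.round(-w_min / scale), 0, 15)

    q = np.clip(np.round(groups / scale) + zero_point, 0, 15).astype(np.uint8)
    return q, scale, zero_point


def dequantize_w4_asym(q, scale, zero_point, shape):
    """Recover a float32 approximation of the original weights."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)


# Toy weight matrix: 8 output rows, 256 inputs, so 2 groups of 128 per row.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 256)).astype(np.float32)

q, scale, zero_point = quantize_w4_asym(w)
w_hat = dequantize_w4_asym(q, scale, zero_point, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"max abs reconstruction error: {max_err:.4f}")
```

Weights are stored as 4-bit codes plus a small amount of per-group metadata, while activations stay in 16-bit; that is what the "W4A16" label refers to. A smaller group size tracks the local weight range more closely at the cost of storing more scales and zero points.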
## Serving with vLLM

```bash
vllm serve pranavbapat/granite-4.1-8b-awq \
  --quantization awq_marlin \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --served-model-name granite-4.1-8b-awq
```

Then hit the OpenAI-compatible endpoint:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```

Tested with `vllm==0.8.5.post1`, `torch==2.6.0+cu124`, NVIDIA L40S (Ada, sm_89).

## Loading with `transformers`

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

For best decode speed, prefer the vLLM path: `transformers` will run the model in a dequantized form on most architectures.

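
For callers who prefer Python over `curl` when talking to the vLLM endpoint from the serving section, the same request can be sketched with only the standard library. This assumes the server was started with the `vllm serve` command shown earlier and is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` are hypothetical, introduced here for illustration.

```python
import json
import urllib.request

# Assumptions: vLLM is running locally on its default port, and the model
# name matches the --served-model-name passed to `vllm serve`.
BASE_URL = "http://localhost:8000/v1"
MODEL_NAME = "granite-4.1-8b-awq"


def build_chat_request(prompt: str, max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, max_tokens: int = 64, timeout: float = 60.0) -> str:
    """Send one user message and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Chat completion responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply. Building the request separately from sending it keeps the payload easy to inspect or log without a live server.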
## Hardware compatibility

| Device | Status |
| --- | --- |
| NVIDIA Hopper (H100/H200) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ada Lovelace (L40S, RTX 4090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ampere (A100, A10, RTX 3090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Turing and older | ⚠️ Marlin not supported; use the slower AWQ kernel |
| AMD / Apple Silicon | ❌ Use the original BF16 model with a different runtime |

## Files

| File | Purpose |
| --- | --- |
| `model-0000{1,2}-of-00002.safetensors` | Quantized weights, sharded |
| `model.safetensors.index.json` | Shard map |
| `config.json`, `generation_config.json` | Model + generation config |
| `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer files |
| `recipe.yaml` | llm-compressor recipe used for this quantization |

## Limitations

- AWQ is a **lossy** post-training quantization. Expect a small quality regression vs. the BF16 base, typically ≤ 1% on most reasoning benchmarks, but task-dependent.
- Calibration was done on `ultrachat_200k`, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
- Inherits all behavioural and safety characteristics of the base model. See the [original Granite 4.1-8B model card](https://huggingface.co/ibm-granite/granite-4.1-8b) for the full picture.

## Reproducing this checkpoint

The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal.

Quantization step:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
from datasets import load_dataset

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256
MAX_SEQ = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ,
                         truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)
tokenizer.save_pretrained(OUTPUT_DIR)
```

On a single L40S this completes in roughly 60–90 minutes.

## License

This checkpoint inherits the **Apache 2.0** license of the upstream [`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b). You are free to use, modify, and redistribute it under the same terms.

## Acknowledgments

- IBM Research for the [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-8b) base model.
- The [vLLM](https://github.com/vllm-project/vllm) team for AWQ-Marlin kernels.
- The [llm-compressor](https://github.com/vllm-project/llm-compressor) team for the quantization tooling.

## Citation

If you use this model in your work, please also cite the original Granite paper and the AWQ paper:

```bibtex
@article{granite2024,
  title  = {Granite Foundation Models},
  author = {{IBM Research}},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite}
}

@inproceedings{lin2024awq,
  title     = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
  author    = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle = {MLSys},
  year      = {2024}
}
```