Model: pranavbapat/granite-4.1-8b-awq
Source: Original Platform
2026-06-05 19:07:40 +08:00


license: apache-2.0
language: en
library_name: transformers
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.1-8b
base_model_relation: quantized
tags: granite, awq, w4a16, compressed-tensors, llmcompressor, quantized, vllm

Granite 4.1-8B — AWQ W4A16

4-bit AWQ quantization of ibm-granite/granite-4.1-8b, produced with llm-compressor 0.6.0.1.

The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU with comfortable KV-cache headroom and serves natively in vLLM via the AWQ-Marlin kernels.
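The size figures above can be sanity-checked with a little arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes packed 4-bit weights plus one 16-bit scale and one 4-bit zero point per group of 128, and BF16 for everything left unquantized. The function name and the 90/10 parameter split in the example call are hypothetical; the real counts depend on the model's architecture.

```python
def awq_w4a16_size_bytes(quantized_params, unquantized_params, bits=4, group_size=128):
    """Approximate checkpoint size for group-wise asymmetric weight quantization.

    Assumes packed `bits`-wide weights, one 16-bit scale and one `bits`-wide
    zero point per group, and 2 bytes (BF16) per unquantized parameter.
    """
    packed_weights = quantized_params * bits / 8
    groups = quantized_params / group_size
    scales = groups * 2                 # one 16-bit scale per group
    zero_points = groups * bits / 8     # one packed zero point per group
    unquantized = unquantized_params * 2
    return packed_weights + scales + zero_points + unquantized


# Illustrative only: ~17 GB of BF16 weights implies roughly 8.5e9 parameters.
# The split between quantized Linear weights and everything else is a guess.
total_params = 17e9 / 2
estimate = awq_w4a16_size_bytes(0.9 * total_params, 0.1 * total_params)
print(f"estimated checkpoint size: {estimate / 1e9:.1f} GB")
```

The per-group metadata adds only 20 bits per 128 weights, so most of the gap between a pure 4-bit estimate and the actual file size comes from tensors that stay in BF16, such as the ignored lm_head.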

Quantization details

| Setting | Value |
|---|---|
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Ignored modules | lm_head |
| Targeted modules | All Linear layers |
| Calibration dataset | HuggingFaceH4/ultrachat_200k (train_sft split) |
| Calibration samples | 256 |
| Max sequence length | 2048 |
| Tool | llmcompressor==0.6.0.1 |
| Output format | compressed-tensors (safetensors, sharded) |

The exact recipe used:

from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]

The full machine-readable recipe is also included in this repo as recipe.yaml.
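To make the "W4A16 asymmetric, group size 128" scheme concrete, here is a minimal pure-Python sketch of the storage grid it implies: each group of 128 weights gets its own scale and zero point, and each weight is stored as an integer code in 0..15. This shows plain round-to-nearest quantization only. It does not implement AWQ's activation-aware scaling search, and it is not llm-compressor's code; all function names are hypothetical.

```python
import math


def quantize_group_asym(weights, bits=4):
    """Asymmetric round-to-nearest quantization of one weight group."""
    qmax = 2 ** bits - 1
    # Extend the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:
        scale = 1.0  # all-zero group: any scale works
    zero_point = round(-w_min / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


def quantize_row(row, group_size=128, bits=4):
    """Quantize one weight row group by group."""
    return [
        quantize_group_asym(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


# Toy weight row: two groups of 128 values.
row = [math.sin(i) * 0.1 for i in range(256)]
groups = quantize_row(row)
codes, scale, zero_point = groups[0]
restored = dequantize_group(codes, scale, zero_point)
max_err = max(abs(a - b) for a, b in zip(row[:128], restored))
print(f"groups={len(groups)} scale={scale:.5f} zero_point={zero_point} max_err={max_err:.5f}")
```

The reconstruction error of any weight is bounded by half the group's scale, which is why smaller groups (a tighter min/max range per group) trade extra metadata for lower error.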

Serving with vLLM

vllm serve pranavbapat/granite-4.1-8b-awq \
    --quantization awq_marlin \
    --dtype auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --served-model-name granite-4.1-8b-awq

Then hit the OpenAI-compatible endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'

Tested with vllm==0.8.5.post1, torch==2.6.0+cu124, NVIDIA L40S (Ada, sm_89).
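The same request can be sent from Python. The sketch below mirrors the curl call above using only the standard library; the helper names are hypothetical, and the host, port, and served model name are taken from the commands in this section.

```python
import json
import urllib.request


def build_chat_request(prompt, model="granite-4.1-8b-awq",
                       base_url="http://localhost:8000", max_tokens=64):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. Any OpenAI-compatible client library can be pointed at the same base URL instead.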

Loading with transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))

For best decode speed, prefer the vLLM path: transformers will run the model in a dequantized form on most architectures.

Hardware compatibility

| Device | Status |
|---|---|
| NVIDIA Hopper (H100/H200) | via vLLM AWQ-Marlin |
| NVIDIA Ada Lovelace (L40S, RTX 4090) | via vLLM AWQ-Marlin |
| NVIDIA Ampere (A100, A10, RTX 3090) | via vLLM AWQ-Marlin |
| NVIDIA Turing and older | ⚠️ Marlin not supported; use the slower AWQ kernel |
| AMD / Apple Silicon | Use the original BF16 model with a different runtime |
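The NVIDIA rows of the table reduce to a single check on CUDA compute capability: Ampere starts at sm_80, while Turing is sm_75. A minimal sketch of that decision follows. The function name is hypothetical, the 8.0 threshold is read off the table rather than taken from vLLM's source, and the returned strings are meant as values for vLLM's `--quantization` flag.

```python
def pick_awq_kernel(major, minor):
    """Choose a vLLM quantization setting from a CUDA compute capability.

    Assumption: Marlin kernels need Ampere (sm_80) or newer, matching the
    compatibility table; anything older falls back to the plain AWQ kernel.
    """
    if (major, minor) >= (8, 0):
        return "awq_marlin"
    return "awq"


# Capabilities for a few of the GPUs named in the table.
for name, capability in [("H100", (9, 0)), ("L40S", (8, 9)),
                         ("A100", (8, 0)), ("Turing", (7, 5))]:
    print(f"{name}: {pick_awq_kernel(*capability)}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` pair for the current GPU, which can be passed straight to this function.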

Files

| File | Purpose |
|---|---|
| model-0000{1,2}-of-00002.safetensors | Quantized weights, sharded |
| model.safetensors.index.json | Shard map |
| config.json, generation_config.json | Model + generation config |
| tokenizer.json, tokenizer_config.json, vocab.json, merges.txt, special_tokens_map.json, chat_template.jinja | Tokenizer files |
| recipe.yaml | llm-compressor recipe used for this quantization |

Limitations

  • AWQ is a lossy post-training quantization. Expect a small quality regression vs. the BF16 base — typically ≤ 1% on most reasoning benchmarks, but task-dependent.
  • Calibration was done on ultrachat_200k, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
  • Inherits all behavioural and safety characteristics of the base model. See the original Granite 4.1-8B model card for the full picture.
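For the re-calibration case mentioned above, the main work is getting your own data into the same chat shape that ultrachat_200k uses. The sketch below assumes your domain data is available as (prompt, response) string pairs; the function name and that input format are hypothetical. It produces rows with a `messages` field, as the calibration code in this README expects.

```python
import random


def build_calibration_rows(pairs, num_samples=256, seed=0):
    """Sample (prompt, response) pairs and format them as chat conversations."""
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible
    chosen = rng.sample(pairs, min(num_samples, len(pairs)))
    return [
        {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
        }
        for prompt, response in chosen
    ]


# Toy stand-in for a domain corpus.
domain_pairs = [(f"Question {i}?", f"Answer {i}.") for i in range(10)]
rows = build_calibration_rows(domain_pairs, num_samples=4)
print(len(rows), rows[0]["messages"][0]["role"])
```

The resulting list can be wrapped with `datasets.Dataset.from_list(rows)` and used in place of the ultrachat_200k split in the reproduction script; the chat-template and tokenization steps stay the same.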

Reproducing this checkpoint

The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal. Quantization step:

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
from datasets import load_dataset

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256   # calibration samples
MAX_SEQ = 2048      # max calibration sequence length, in tokens

# Load the BF16 base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Take the first NUM_SAMPLES conversations and render them with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
# Tokenize; the template already inserted special tokens, so don't add them again.
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Run one-shot AWQ calibration and write the compressed checkpoint to OUTPUT_DIR.
oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)
# Save the tokenizer files next to the quantized weights.
tokenizer.save_pretrained(OUTPUT_DIR)

On a single L40S this completes in roughly 60 to 90 minutes.

License

This checkpoint inherits the Apache 2.0 license of the upstream ibm-granite/granite-4.1-8b. You are free to use, modify, and redistribute it under the same terms.

Citation

If you use this model in your work, please also cite the original Granite paper and the AWQ paper:

@article{granite2024,
  title  = {Granite Foundation Models},
  author = {{IBM Research}},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite}
}

@inproceedings{lin2024awq,
  title     = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
  author    = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle = {MLSys},
  year      = {2024}
}