ModelHub XC 60fa65b8a9 Initialize project; model provided by the ModelHub XC community

Model: pranavbapat/granite-4.1-8b-awq
Source: Original Platform

2026-06-05 19:07:40 +08:00

Granite 4.1-8B — AWQ W4A16

4-bit AWQ quantization of ibm-granite/granite-4.1-8b, produced with llm-compressor 0.6.0.1.

The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU with comfortable KV-cache headroom and serves natively in vLLM via the AWQ-Marlin kernels.
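As a rough check on the headroom claim, the arithmetic can be sketched as below. The helper name `kv_cache_budget_gb` is hypothetical, the 0.90 memory fraction mirrors the `--gpu-memory-utilization` value used in the serving command in this card, and the estimate ignores activations, CUDA graphs, and framework overhead, so it should be read as an upper bound rather than a measurement.

```python
def kv_cache_budget_gb(gpu_gb: float, weights_gb: float, utilization: float = 0.90) -> float:
    """Upper bound on the memory left for KV cache once the weights are loaded.

    Assumption: the runtime may use `utilization` of total GPU memory and the
    weights are the only other large consumer.
    """
    usable_gb = gpu_gb * utilization
    return max(usable_gb - weights_gb, 0.0)


# Checkpoint sizes are the approximate figures quoted in this card.
awq_budget = kv_cache_budget_gb(gpu_gb=24.0, weights_gb=5.8)
bf16_budget = kv_cache_budget_gb(gpu_gb=24.0, weights_gb=17.0)

print(f"AWQ W4A16 headroom: ~{awq_budget:.1f} GB")
print(f"BF16 headroom:      ~{bf16_budget:.1f} GB")
```

Under these assumptions the quantized checkpoint leaves roughly three times as much room for KV cache on a 24 GB card as the BF16 weights would.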

Quantization details

Setting	Value
Method	AWQ (Activation-aware Weight Quantization)
Scheme	W4A16 asymmetric, group size 128
Ignored modules	`lm_head`
Targeted modules	All `Linear` layers
Calibration dataset	`HuggingFaceH4/ultrachat_200k` (`train_sft` split)
Calibration samples	256
Max sequence length	2048
Tool	`llmcompressor==0.6.0.1`
Output format	`compressed-tensors` (safetensors, sharded)
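To make the scheme in the table concrete, the per-weight storage cost of group-wise asymmetric quantization can be estimated with a short calculation. This is a sketch under stated assumptions: one 16-bit scale and one 4-bit zero-point per group of 128 weights. The actual on-disk layout of `compressed-tensors` may differ, and the function name is hypothetical.

```python
def effective_bits_per_weight(
    weight_bits: int = 4,
    group_size: int = 128,
    scale_bits: int = 16,      # assumption: one 16-bit scale per group
    zero_point_bits: int = 4,  # assumption: one 4-bit zero-point per group
) -> float:
    """Average storage per quantized weight, including per-group metadata."""
    return weight_bits + (scale_bits + zero_point_bits) / group_size


bpw = effective_bits_per_weight()
compression_vs_bf16 = 16 / bpw

print(f"effective bits per weight: {bpw}")
print(f"compression vs. BF16 for quantized layers: {compression_vs_bf16:.2f}x")
```

The whole checkpoint shrinks by less than this per-layer ratio because `lm_head` is ignored by the recipe and non-`Linear` tensors such as embeddings are not targeted, so they stay at full precision.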

The exact recipe used:

from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]

The full machine-readable recipe is also included in this repo as recipe.yaml.

Serving with vLLM

vllm serve pranavbapat/granite-4.1-8b-awq \
    --quantization awq_marlin \
    --dtype auto \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.90 \
    --served-model-name granite-4.1-8b-awq

Then hit the OpenAI-compatible endpoint:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
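The same request can be issued from Python. The following is a minimal sketch using only the standard library; it assumes the server from the command above is listening on `localhost:8000`, and the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    model: str = "granite-4.1-8b-awq",
    max_tokens: int = 64,
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, **kwargs) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` should print the model's reply. The `model` field has to match the value passed to `--served-model-name`.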

Tested with vllm==0.8.5.post1, torch==2.6.0+cu124, NVIDIA L40S (Ada, sm_89).

Loading with `transformers`

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(out[0], skip_special_tokens=True))

For best decode speed, prefer the vLLM path: `transformers` will run the model in a dequantized form on most architectures.

Hardware compatibility

Device	Status
NVIDIA Hopper (H100/H200)	✅ via vLLM AWQ-Marlin
NVIDIA Ada Lovelace (L40S, RTX 4090)	✅ via vLLM AWQ-Marlin
NVIDIA Ampere (A100, A10, RTX 3090)	✅ via vLLM AWQ-Marlin
NVIDIA Turing and older	⚠️ Marlin not supported — use the slower AWQ kernel
AMD / Apple Silicon	❌ Use the original BF16 model with a different runtime
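The NVIDIA rows of the table can be expressed as a check on CUDA compute capability. This sketch assumes the boundary sits at compute capability 8.0 (Ampere), which is how the table splits Marlin-capable GPUs from older ones; the function name and the returned labels are illustrative, not vLLM identifiers.

```python
def pick_awq_kernel(capability: tuple[int, int]) -> str:
    """Map a CUDA compute capability (major, minor) to the kernel path in the table.

    Assumption: Marlin kernels need compute capability 8.0 or newer.
    """
    if capability >= (8, 0):
        return "AWQ-Marlin"
    return "slower AWQ kernel"


# Compute capabilities for a few of the GPUs named in the table.
examples = {
    "H100": (9, 0),
    "L40S": (8, 9),
    "A100": (8, 0),
    "T4 (Turing)": (7, 5),
}
for name, cap in examples.items():
    print(f"{name}: sm_{cap[0]}{cap[1]} -> {pick_awq_kernel(cap)}")
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current device and can be passed straight to this helper.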

Files

File	Purpose
`model-0000{1,2}-of-00002.safetensors`	Quantized weights, sharded
`model.safetensors.index.json`	Shard map
`config.json`, `generation_config.json`	Model + generation config
`tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja`	Tokenizer files
`recipe.yaml`	llm-compressor recipe used for this quantization
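The shard map can be inspected to confirm that both shards are referenced after a download. A small sketch, assuming the standard safetensors index layout with a `weight_map` of tensor name to shard file name; the tensor names in `sample_index` are made up for illustration and the helper names are hypothetical.

```python
import json
from collections import Counter


def load_index(path: str) -> dict:
    """Read a model.safetensors.index.json file from disk."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def tensors_per_shard(index: dict) -> dict[str, int]:
    """Count how many tensors the index assigns to each shard file."""
    return dict(Counter(index["weight_map"].values()))


# Illustrative stand-in for the real index; tensor names are hypothetical.
sample_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.layers.0.mlp.up_proj.weight_packed": "model-00001-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}
counts = tensors_per_shard(sample_index)
print(counts)
```

For the real checkpoint, `tensors_per_shard(load_index("model.safetensors.index.json"))` should list exactly the two shard files from the table.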

Limitations

AWQ is a lossy post-training quantization method. Expect a small quality regression relative to the BF16 base: typically ≤ 1% on most reasoning benchmarks, though the size of the drop is task-dependent.
Calibration was done on ultrachat_200k, which biases the quantization toward English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
Inherits all behavioural and safety characteristics of the base model. See the original Granite 4.1-8B model card for the full picture.
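For re-calibration on your own data, one option is to shape domain examples into the same `messages` structure that ultrachat_200k uses, so they can go through the same chat-template and tokenization steps as the reproduction script. The following is a sketch only: the function name, the prompt/response inputs, and the sampling strategy are assumptions, not part of the original pipeline.

```python
import random


def build_calibration_records(
    prompts: list[str],
    responses: list[str],
    num_samples: int = 256,
    seed: int = 0,
) -> list[dict]:
    """Pair prompts with responses and wrap them as chat-style records.

    Each record has a `messages` list of {"role", "content"} dicts. If more
    pairs are supplied than `num_samples`, a seeded random subset is drawn.
    """
    if len(prompts) != len(responses):
        raise ValueError("prompts and responses must have the same length")
    pairs = list(zip(prompts, responses))
    if len(pairs) > num_samples:
        pairs = random.Random(seed).sample(pairs, num_samples)
    return [
        {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
        }
        for prompt, response in pairs
    ]


# Tiny illustrative input; real calibration data should reflect production traffic.
records = build_calibration_records(
    prompts=["Summarise clause 4.2.", "Translate 'hello' to French."],
    responses=["Clause 4.2 limits liability.", "Bonjour."],
)
print(len(records), records[0]["messages"][0]["role"])
```

If the `datasets` library is available, `Dataset.from_list(records)` would turn these records into a dataset with a `messages` column.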

Reproducing this checkpoint

The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal. Quantization step:

from datasets import load_dataset
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256   # calibration samples
MAX_SEQ = 2048      # max calibration sequence length, in tokens

# Load the BF16 base model and its tokenizer.
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Take the first NUM_SAMPLES conversations from the calibration split.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")

# Render each conversation to text with the model's chat template.
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# Tokenize without padding; the chat template already supplies special tokens.
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Run one-shot AWQ calibration and write the compressed checkpoint.
oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)

# Save the tokenizer files alongside the quantized weights.
tokenizer.save_pretrained(OUTPUT_DIR)

On a single L40S this completes in roughly 60–90 minutes.

License

This checkpoint inherits the Apache 2.0 license of the upstream ibm-granite/granite-4.1-8b. You are free to use, modify, and redistribute it under the same terms.

Acknowledgments

IBM Research for the Granite 4.1 base model.
The vLLM team for AWQ-Marlin kernels.
The llm-compressor team for the quantization tooling.

Citation

If you use this model in your work, please also cite the original Granite paper and the AWQ paper:

@article{granite2024,
  title  = {Granite Foundation Models},
  author = {{IBM Research}},
  year   = {2024},
  url    = {https://huggingface.co/ibm-granite}
}

@inproceedings{lin2024awq,
  title     = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
  author    = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
  booktitle = {MLSys},
  year      = {2024}
}
