ModelHub XC 60fa65b8a9 Initialize project; model provided by the ModelHub XC community
Model: pranavbapat/granite-4.1-8b-awq
Source: Original Platform
2026-06-05 19:07:40 +08:00


---
license: apache-2.0
language:
- en
library_name: transformers
pipeline_tag: text-generation
base_model: ibm-granite/granite-4.1-8b
base_model_relation: quantized
tags:
- granite
- awq
- w4a16
- compressed-tensors
- llmcompressor
- quantized
- vllm
---
# Granite 4.1-8B — AWQ W4A16
4-bit AWQ quantization of [`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b),
produced with [`llm-compressor`](https://github.com/vllm-project/llm-compressor) 0.6.0.1.
The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU
with comfortable KV-cache headroom and serves natively in [vLLM](https://github.com/vllm-project/vllm)
via the AWQ-Marlin kernels.
## Quantization details
| Setting | Value |
| --- | --- |
| Method | AWQ (Activation-aware Weight Quantization) |
| Scheme | W4A16 asymmetric, group size 128 |
| Ignored modules | `lm_head` |
| Targeted modules | All `Linear` layers |
| Calibration dataset | `HuggingFaceH4/ultrachat_200k` (`train_sft` split) |
| Calibration samples | 256 |
| Max sequence length | 2048 |
| Tool | `llmcompressor==0.6.0.1` |
| Output format | `compressed-tensors` (safetensors, sharded) |
The exact recipe used:
```python
from llmcompressor.modifiers.awq import AWQModifier

recipe = [
    AWQModifier(
        ignore=["lm_head"],
        scheme="W4A16_ASYM",
        targets=["Linear"],
    )
]
```
The full machine-readable recipe is also included in this repo as
[`recipe.yaml`](./recipe.yaml).
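To make the `W4A16` asymmetric, group-size-128 scheme more concrete, here is a minimal sketch of asymmetric round-to-nearest quantization for a single group of 128 weights. It is not the AWQ algorithm itself (AWQ additionally rescales channels based on activation statistics before quantizing) and it is not how `llm-compressor` implements the scheme. The function names and the sample weights are made up for illustration.

```python
import math


def quantize_group_asym(weights, num_bits=4):
    """Asymmetric round-to-nearest quantization of one group of weights."""
    qmax = (1 << num_bits) - 1  # 15 for 4-bit codes
    # Extend the range to include 0.0 so that zero is exactly representable.
    w_min = min(min(weights), 0.0)
    w_max = max(max(weights), 0.0)
    scale = (w_max - w_min) / qmax
    if scale == 0.0:  # all-zero group: any positive scale works
        scale = 1.0
    zero_point = round(-w_min / scale)
    codes = [min(qmax, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Map integer codes back to approximate floating-point weights."""
    return [(q - zero_point) * scale for q in codes]


GROUP_SIZE = 128
# Illustrative weights only; real groups come from a Linear layer's weight matrix.
group = [0.05 * math.sin(i) for i in range(GROUP_SIZE)]

codes, scale, zero_point = quantize_group_asym(group)
restored = dequantize_group(codes, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"scale={scale:.6f} zero_point={zero_point} max_error={max_error:.6f}")
```

Each group stores its own scale and zero point, which is why a smaller group size tracks the weight distribution more closely at the cost of extra metadata.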
## Serving with vLLM
```bash
vllm serve pranavbapat/granite-4.1-8b-awq \
  --quantization awq_marlin \
  --dtype auto \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --served-model-name granite-4.1-8b-awq
```
Then hit the OpenAI-compatible endpoint:
```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "granite-4.1-8b-awq",
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 64
  }'
```
Tested with `vllm==0.8.5.post1`, `torch==2.6.0+cu124`, NVIDIA L40S (Ada, sm_89).
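The same request can be sent from Python. The sketch below uses only the standard library and assumes the server from the `vllm serve` command above is listening on `localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="granite-4.1-8b-awq", max_tokens=64):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply.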
## Loading with `transformers`
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pranavbapat/granite-4.1-8b-awq"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Render the conversation with the model's chat template, ending with the assistant turn marker.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128)

# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
For best decode speed, prefer the vLLM path: `transformers` will run the model in a
dequantized form on most architectures.
## Hardware compatibility
| Device | Status |
| --- | --- |
| NVIDIA Hopper (H100/H200) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ada Lovelace (L40S, RTX 4090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Ampere (A100, A10, RTX 3090) | ✅ via vLLM AWQ-Marlin |
| NVIDIA Turing and older | ⚠️ Marlin not supported — use the slower AWQ kernel |
| AMD / Apple Silicon | ❌ Use the original BF16 model with a different runtime |
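The NVIDIA rows of the table can be expressed as a check on CUDA compute capability. This is a minimal sketch, assuming the usual mapping of Turing to 7.5, Ampere to 8.0/8.6, Ada Lovelace to 8.9 and Hopper to 9.0; `pick_awq_kernel` is a hypothetical helper, and the returned strings are the values one would pass to vLLM's `--quantization` flag.

```python
def pick_awq_kernel(capability):
    """Return the kernel choice for a (major, minor) CUDA compute capability."""
    major, _minor = capability
    if major >= 8:
        # Ampere, Ada Lovelace and Hopper: Marlin kernels are available.
        return "awq_marlin"
    # Turing and older: fall back to the slower AWQ kernel.
    return "awq"


for name, capability in [("H100", (9, 0)), ("L40S", (8, 9)),
                         ("A100", (8, 0)), ("T4", (7, 5))]:
    print(name, pick_awq_kernel(capability))
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` returns the `(major, minor)` tuple for the current GPU.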
## Files
| File | Purpose |
| --- | --- |
| `model-0000{1,2}-of-00002.safetensors` | Quantized weights, sharded |
| `model.safetensors.index.json` | Shard map |
| `config.json`, `generation_config.json` | Model + generation config |
| `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer files |
| `recipe.yaml` | llm-compressor recipe used for this quantization |
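To see how tensors are distributed across the two shards, the shard map can be read directly. The sketch below relies on the standard layout of `model.safetensors.index.json`, where a `weight_map` object maps each tensor name to the shard file that holds it. The helper names and the sample index are illustrative.

```python
import json
from collections import Counter
from pathlib import Path


def load_index(path):
    """Read a safetensors index file from disk."""
    return json.loads(Path(path).read_text())


def tensors_per_shard(index):
    """Count how many tensors each shard file holds."""
    return dict(Counter(index["weight_map"].values()))


# Tiny illustrative index; a real one lists every tensor in the checkpoint.
example_index = {
    "metadata": {"total_size": 0},
    "weight_map": {
        "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
        "model.norm.weight": "model-00002-of-00002.safetensors",
        "lm_head.weight": "model-00002-of-00002.safetensors",
    },
}
print(tensors_per_shard(example_index))
```

For a downloaded copy of this repo, `tensors_per_shard(load_index("model.safetensors.index.json"))` gives the real counts.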
## Limitations
- AWQ is a **lossy** post-training quantization. Expect a small quality regression vs. the BF16 base — typically ≤ 1% on most reasoning benchmarks, but task-dependent.
- Calibration was done on `ultrachat_200k`, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
- Inherits all behavioural and safety characteristics of the base model. See the [original Granite 4.1-8B model card](https://huggingface.co/ibm-granite/granite-4.1-8b) for the full picture.
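For the re-calibration suggested above, one option is to convert in-domain samples into the same `messages` structure that `ultrachat_200k` uses, so they can go through the same chat-template and tokenization steps. This is a minimal sketch: the `question` and `answer` field names and the sample records are hypothetical placeholders for your own data.

```python
def to_messages(record, user_key="question", assistant_key="answer"):
    """Turn one domain record into an ultrachat-style conversation."""
    return {
        "messages": [
            {"role": "user", "content": record[user_key]},
            {"role": "assistant", "content": record[assistant_key]},
        ]
    }


# Hypothetical in-domain samples; replace with a representative slice of real traffic.
records = [
    {"question": "What does clause 4.2 cover?", "answer": "Termination rights."},
    {"question": "Summarise the indemnity section.", "answer": "Each party covers its own losses."},
]
calibration_rows = [to_messages(r) for r in records]
print(calibration_rows[0]["messages"][0]["role"])
```

The resulting rows can be wrapped with `datasets.Dataset.from_list(calibration_rows)` and used in place of the `ultrachat_200k` slice.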
## Reproducing this checkpoint
The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal.
Quantization step:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.awq import AWQModifier
from llmcompressor import oneshot
from datasets import load_dataset

MODEL_ID = "ibm-granite/granite-4.1-8b"
OUTPUT_DIR = "granite-4.1-8b-awq"
NUM_SAMPLES = 256
MAX_SEQ = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: the first NUM_SAMPLES conversations, rendered with the chat template.
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})

# Tokenize without padding and truncate to MAX_SEQ tokens; drop the original columns.
ds = ds.map(
    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# One-shot AWQ calibration and quantization; the checkpoint is written to OUTPUT_DIR.
oneshot(
    model=model,
    dataset=ds,
    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
    max_seq_length=MAX_SEQ,
    num_calibration_samples=NUM_SAMPLES,
    output_dir=OUTPUT_DIR,
)
tokenizer.save_pretrained(OUTPUT_DIR)
```
On a single L40S this completes in roughly 60 to 90 minutes.
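After a run, one way to sanity-check the output against the settings table is to read the `quantization_config` section of the generated `config.json`. The sketch below assumes the `compressed-tensors` layout, with a `config_groups` object whose entries carry a `weights` block with `num_bits`, `group_size` and `symmetric`. Treat those key names as an assumption and compare them with your own file; `summarize_quant_config` and the sample config are illustrative.

```python
def summarize_quant_config(config):
    """Extract the headline quantization settings from a parsed config.json."""
    qcfg = config.get("quantization_config", {})
    summary = {
        "quant_method": qcfg.get("quant_method"),
        "ignore": qcfg.get("ignore", []),
        "groups": [],
    }
    for name, group in qcfg.get("config_groups", {}).items():
        weights = group.get("weights") or {}
        summary["groups"].append({
            "name": name,
            "targets": group.get("targets", []),
            "num_bits": weights.get("num_bits"),
            "group_size": weights.get("group_size"),
            "symmetric": weights.get("symmetric"),
        })
    return summary


# Illustrative config mirroring the settings table; key names are assumed, not verified.
example_config = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
        "config_groups": {
            "group_0": {
                "targets": ["Linear"],
                "weights": {"num_bits": 4, "group_size": 128, "symmetric": False},
            }
        },
    }
}
print(summarize_quant_config(example_config))
```

For a real checkpoint, pass the result of `json.load(open("granite-4.1-8b-awq/config.json"))` instead of `example_config`.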
## License
This checkpoint inherits the **Apache 2.0** license of the upstream
[`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b).
You are free to use, modify, and redistribute it under the same terms.
## Acknowledgments
- IBM Research for the [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-8b) base model.
- The [vLLM](https://github.com/vllm-project/vllm) team for AWQ-Marlin kernels.
- The [llm-compressor](https://github.com/vllm-project/llm-compressor) team for the quantization tooling.
## Citation
If you use this model in your work, please also cite the original Granite paper and the AWQ paper:
```bibtex
@article{granite2024,
title = {Granite Foundation Models},
author = {{IBM Research}},
year = {2024},
url = {https://huggingface.co/ibm-granite}
}
@inproceedings{lin2024awq,
title = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
author = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
booktitle = {MLSys},
year = {2024}
}
```