Initialize project; model provided by the ModelHub XC community

Model: pranavbapat/granite-4.1-8b-awq Source: Original Platform
2026-06-05 19:07:40 +08:00
commit 60fa65b8a9
14 changed files with 603726 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,201 @@
+---
+license: apache-2.0
+language:
+- en
+library_name: transformers
+pipeline_tag: text-generation
+base_model: ibm-granite/granite-4.1-8b
+base_model_relation: quantized
+tags:
+- granite
+- awq
+- w4a16
+- compressed-tensors
+- llmcompressor
+- quantized
+- vllm
+---
+
+# Granite 4.1-8B — AWQ W4A16
+
+4-bit AWQ quantization of [`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b),
+produced with [`llm-compressor`](https://github.com/vllm-project/llm-compressor) 0.6.0.1.
+
+The result is a ~5.8 GB checkpoint (down from ~17 GB BF16) that loads on a single 24 GB GPU
+with comfortable KV-cache headroom and serves natively in [vLLM](https://github.com/vllm-project/vllm)
+via the AWQ-Marlin kernels.
+
+## Quantization details
+
+| Setting | Value |
+| --- | --- |
+| Method | AWQ (Activation-aware Weight Quantization) |
+| Scheme | W4A16 asymmetric, group size 128 |
+| Ignored modules | `lm_head` |
+| Targeted modules | All `Linear` layers |
+| Calibration dataset | `HuggingFaceH4/ultrachat_200k` (`train_sft` split) |
+| Calibration samples | 256 |
+| Max sequence length | 2048 |
+| Tool | `llmcompressor==0.6.0.1` |
+| Output format | `compressed-tensors` (safetensors, sharded) |
+
+The exact recipe used:
+
+```python
+from llmcompressor.modifiers.awq import AWQModifier
+
+recipe = [
+    AWQModifier(
+        ignore=["lm_head"],
+        scheme="W4A16_ASYM",
+        targets=["Linear"],
+    )
+]
+```
+
+The full machine-readable recipe is also included in this repo as
+[`recipe.yaml`](./recipe.yaml).
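
To make the "W4A16 asymmetric, group size 128" row concrete, here is a minimal pure-Python sketch of asymmetric 4-bit group quantization. It only illustrates the storage scheme (one scale and one zero-point per group of 128 weights); it does not implement AWQ's activation-aware scaling, and it is not how `llm-compressor` is implemented. The function names are illustrative.

```python
GROUP_SIZE = 128   # matches the "group size 128" setting in the table
NUM_LEVELS = 16    # 4 bits -> integer codes 0..15


def quantize_group(weights):
    """Asymmetric 4-bit quantization of one group of float weights."""
    # Extend the range to include 0.0 so that zero is exactly representable
    # and the zero-point always falls inside 0..15.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    # An all-zero group would give scale 0; fall back to 1.0 in that case.
    scale = (hi - lo) / (NUM_LEVELS - 1) or 1.0
    zero_point = max(0, min(NUM_LEVELS - 1, round(-lo / scale)))
    codes = [
        max(0, min(NUM_LEVELS - 1, round(w / scale) + zero_point))
        for w in weights
    ]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate float weights from codes, scale and zero-point."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, group_size=GROUP_SIZE):
    """Quantize one weight row group by group."""
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]
```

Each group stores 128 four-bit codes plus its own `scale` and `zero_point`, and every weight is reconstructed to within about half a quantization step (`scale / 2`). Activations stay in 16-bit, which is the "A16" half of the scheme name.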
+
+## Serving with vLLM
+
+```bash
+vllm serve pranavbapat/granite-4.1-8b-awq \
+    --quantization awq_marlin \
+    --dtype auto \
+    --max-model-len 8192 \
+    --gpu-memory-utilization 0.90 \
+    --served-model-name granite-4.1-8b-awq
+```
+
+Then hit the OpenAI-compatible endpoint:
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "granite-4.1-8b-awq",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 64
+  }'
+```
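
The same request can be sent from Python using only the standard library. This is a sketch that assumes the server from the `vllm serve` command above is listening on `localhost:8000`; `BASE_URL`, `build_chat_request` and `chat` are illustrative names, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # same host and port as the curl example
MODEL_NAME = "granite-4.1-8b-awq"       # must match --served-model-name


def build_chat_request(prompt, max_tokens=64):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, max_tokens=64, timeout=60):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Hello!"))` prints the model's reply. Building the request separately from sending it keeps the payload easy to inspect without a live server.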
+
+Tested with `vllm==0.8.5.post1`, `torch==2.6.0+cu124`, NVIDIA L40S (Ada, sm_89).
+
+## Loading with `transformers`
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "pranavbapat/granite-4.1-8b-awq"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+prompt = tokenizer.apply_chat_template(
+    [{"role": "user", "content": "Briefly: what is photosynthesis?"}],
+    tokenize=False,
+    add_generation_prompt=True,
+)
+inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+out = model.generate(**inputs, max_new_tokens=128)
+print(tokenizer.decode(out[0], skip_special_tokens=True))
+```
+
+`transformers` needs the `compressed-tensors` package installed to load this checkpoint.
+For best decode speed, prefer the vLLM path: `transformers` runs the model in
+dequantized form on most architectures.
+
+## Hardware compatibility
+
+| Device | Status |
+| --- | --- |
+| NVIDIA Hopper (H100/H200) | ✅ via vLLM AWQ-Marlin |
+| NVIDIA Ada Lovelace (L40S, RTX 4090) | ✅ via vLLM AWQ-Marlin |
+| NVIDIA Ampere (A100, A10, RTX 3090) | ✅ via vLLM AWQ-Marlin |
+| NVIDIA Turing and older | ⚠️ Marlin not supported — use the slower AWQ kernel |
+| AMD / Apple Silicon | ❌ Use the original BF16 model with a different runtime |
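
The NVIDIA rows of the table can be expressed as a check on CUDA compute capability. The sketch below assumes the Marlin cutoff is compute capability 8.0 (Ampere), which matches the table since Turing is 7.5; the function names are illustrative, and the returned strings are meant as values for vLLM's `--quantization` flag.

```python
def pick_awq_kernel(major, minor):
    """Map a CUDA compute capability to the kernel family named in the table."""
    # Assumption: Marlin needs compute capability >= 8.0 (Ampere or newer).
    return "awq_marlin" if (major, minor) >= (8, 0) else "awq"


def detect_capability():
    """Return (major, minor) for GPU 0, or None without PyTorch or a CUDA GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)
```

For example, an L40S reports `(8, 9)`, so `pick_awq_kernel(8, 9)` selects `"awq_marlin"`, while a Turing card at `(7, 5)` falls back to `"awq"`.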
+
+## Files
+
+| File | Purpose |
+| --- | --- |
+| `model-0000{1,2}-of-00002.safetensors` | Quantized weights, sharded |
+| `model.safetensors.index.json` | Shard map |
+| `config.json`, `generation_config.json` | Model + generation config |
+| `tokenizer.json`, `tokenizer_config.json`, `vocab.json`, `merges.txt`, `special_tokens_map.json`, `chat_template.jinja` | Tokenizer files |
+| `recipe.yaml` | llm-compressor recipe used for this quantization |
+
+## Limitations
+
+- AWQ is a **lossy** post-training quantization. Expect a small quality regression vs. the BF16 base — typically ≤ 1% on most reasoning benchmarks, but task-dependent.
+- Calibration was done on `ultrachat_200k`, which biases towards English chat-style prompts. For domain-specific deployments (code, legal, medical, multilingual), re-calibrate on a representative sample of your own data for best fidelity.
+- Inherits all behavioural and safety characteristics of the base model. See the [original Granite 4.1-8B model card](https://huggingface.co/ibm-granite/granite-4.1-8b) for the full picture.
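
For re-calibration on your own data, one option is to reshape it into the same `messages` layout that `ultrachat_200k` uses, so the rest of the calibration pipeline stays unchanged. The sketch below assumes each record is a dict with hypothetical `prompt` and `response` fields; adapt the field names to your data.

```python
from itertools import islice


def to_messages(record):
    """Convert one {"prompt", "response"} record into the chat messages layout."""
    return {
        "messages": [
            {"role": "user", "content": record["prompt"].strip()},
            {"role": "assistant", "content": record["response"].strip()},
        ]
    }


def build_calibration_rows(records, limit=256):
    """Drop records with an empty side, then keep at most `limit` rows."""
    usable = (r for r in records if r["prompt"].strip() and r["response"].strip())
    return [to_messages(r) for r in islice(usable, limit)]
```

The resulting rows can be wrapped with `datasets.Dataset.from_list(rows)` and passed through the same `apply_chat_template` and tokenization steps used for the ultrachat samples. The default `limit` of 256 mirrors the calibration sample count used for this checkpoint.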
+
+## Reproducing this checkpoint
+
+The full pipeline (RunPod setup, quantization, serving, upload) is open and minimal.
+Quantization step:
+
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer
+from llmcompressor.modifiers.awq import AWQModifier
+from llmcompressor import oneshot
+from datasets import load_dataset
+
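+MODEL_ID = "ibm-granite/granite-4.1-8b"
+OUTPUT_DIR = "granite-4.1-8b-awq"
+NUM_SAMPLES = 256   # number of calibration samples
+MAX_SEQ = 2048      # maximum calibration sequence length, in tokens
+
+# Load the base model in its native dtype, plus its tokenizer.
+model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
+tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
+
+# Take the first NUM_SAMPLES conversations and render each with the chat template.
+ds = load_dataset("HuggingFaceH4/ultrachat_200k", split=f"train_sft[:{NUM_SAMPLES}]")
+ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
+# Tokenize without padding; the chat template already inserted the special tokens.
+ds = ds.map(
+    lambda ex: tokenizer(ex["text"], padding=False, max_length=MAX_SEQ, truncation=True, add_special_tokens=False),
+    remove_columns=ds.column_names,
+)
+
+# Run one-shot AWQ calibration and write the quantized checkpoint to OUTPUT_DIR.
+oneshot(
+    model=model,
+    dataset=ds,
+    recipe=[AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"])],
+    max_seq_length=MAX_SEQ,
+    num_calibration_samples=NUM_SAMPLES,
+    output_dir=OUTPUT_DIR,
+)
+# Save the tokenizer next to the quantized weights.
+tokenizer.save_pretrained(OUTPUT_DIR)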
+```
+
+On a single L40S this completes in roughly 60–90 minutes.
+
+## License
+
+This checkpoint inherits the **Apache 2.0** license of the upstream
+[`ibm-granite/granite-4.1-8b`](https://huggingface.co/ibm-granite/granite-4.1-8b).
+You are free to use, modify, and redistribute it under the same terms.
+
+## Acknowledgments
+
+- IBM Research for the [Granite 4.1](https://huggingface.co/ibm-granite/granite-4.1-8b) base model.
+- The [vLLM](https://github.com/vllm-project/vllm) team for AWQ-Marlin kernels.
+- The [llm-compressor](https://github.com/vllm-project/llm-compressor) team for the quantization tooling.
+
+## Citation
+
+If you use this model in your work, please also cite the upstream Granite models and the AWQ paper:
+
+```bibtex
+@article{granite2024,
+  title  = {Granite Foundation Models},
+  author = {{IBM Research}},
+  year   = {2024},
+  url    = {https://huggingface.co/ibm-granite}
+}
+
+@inproceedings{lin2024awq,
+  title     = {AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration},
+  author    = {Lin, Ji and Tang, Jiaming and Tang, Haotian and Yang, Shang and Chen, Wei-Ming and Wang, Wei-Chen and Xiao, Guangxuan and Dang, Xingyu and Gan, Chuang and Han, Song},
+  booktitle = {MLSys},
+  year      = {2024}
+}
+```