Model: worthdoing/TinyLlama-1.1B-Chat-v1.0-GGUF

| language | license | tags | base_model | quantized_by | pipeline_tag |
|---|---|---|---|---|---|
| | apache-2.0 | | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | worthdoing | text-generation |
Author: Simon-Pierre Boucher
TinyLlama-1.1B-Chat-v1.0 - GGUF Quantized by worthdoing
Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing
About
This is a GGUF quantized version of TinyLlama-1.1B-Chat-v1.0, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.
- Original model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Parameters: 1.1B
- Quantized by: worthdoing
- Pipeline: corelm-model v1.0
Description
Ultra-tiny Llama variant. Minimal resource usage for basic tasks.
Available Quantizations
| File | Quant | BPW | Size | Use Case |
|---|---|---|---|---|
| tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~0.6 GB | Recommended - Best quality/size ratio |
| tinyllama-1.1b-chat-v1.0-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~0.7 GB | Higher quality, still fast |
| tinyllama-1.1b-chat-v1.0-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~1.0 GB | Near-original quality |
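The sizes above follow directly from bits per weight: weight bytes ≈ parameter count × BPW / 8, plus a small overhead for tokenizer and metadata. A quick sketch (the `estimate_gb` helper is illustrative, not part of the pipeline):

```shell
# Rough GGUF weight size in GB: n_params * bits_per_weight / 8 bytes.
# Real files are slightly larger due to tokenizer and metadata blocks.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

estimate_gb 1100000000 4.58  # Q4_K_M → 0.6
estimate_gb 1100000000 5.33  # Q5_K_M → 0.7
```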
How to Use
With Ollama
```shell
# Create a Modelfile pointing at the downloaded GGUF
cat > Modelfile <<'MODELEOF'
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create tinyllama-1.1b-chat-v1.0 -f Modelfile
ollama run tinyllama-1.1b-chat-v1.0
```
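The minimal Modelfile above relies on the chat template baked into the GGUF. If your Ollama version does not pick it up automatically, TinyLlama-Chat uses a Zephyr-style prompt format; a sketch of an explicit template (an assumption — verify against the model's tokenizer_config.json before relying on it):

```
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
PARAMETER stop "</s>"
```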
With llama.cpp
```shell
llama-cli -m tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
```
With LM Studio
- Download the GGUF file
- Open LM Studio -> My Models -> Import
- Select the GGUF file and start chatting
Quantization Method
Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:
Step 1 — Download & Validation
- Model weights are downloaded from HuggingFace Hub in SafeTensors format (`.safetensors`)
- Legacy formats (`.bin`, `.pt`) are excluded to ensure clean, verified weights
- Tokenizer, configuration, and all metadata are preserved
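The keep/skip rule described above can be expressed as a simple filter (an illustrative sketch, not the actual pipeline code):

```shell
# Illustrative filter mirroring Step 1: keep SafeTensors weights plus
# tokenizer/config metadata, skip legacy pickle-based checkpoints.
keep_file() {
  case "$1" in
    *.safetensors|*.json|tokenizer.model) echo keep ;;
    *.bin|*.pt)                           echo skip ;;
    *)                                    echo keep ;;
  esac
}

keep_file model.safetensors   # → keep
keep_file pytorch_model.bin   # → skip
```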
Step 2 — Conversion to GGUF F16 Baseline
- The original model is converted to GGUF format at FP16 precision using `convert_hf_to_gguf.py` from llama.cpp
- This lossless baseline preserves the full original model quality
- Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
Step 3 — K-Quant Quantization
- The F16 baseline is quantized using `llama-quantize` with k-quant methods
- K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
- Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|---|---|---|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
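The mixing rule in the table can be sketched as a tensor-name lookup (illustrative only; the real per-tensor choice is made inside `llama-quantize`, which also special-cases embeddings):

```shell
# Illustrative Q4_K_M layer mixing: attention and output tensors keep
# higher precision (Q5_K), feed-forward tensors are compressed to Q4_K.
# The real assignment happens inside llama-quantize, e.g.:
#   llama-quantize tinyllama-f16.gguf tinyllama-Q4_K_M.gguf Q4_K_M
quant_for_tensor() {
  case "$1" in
    *attn*|*output*) echo Q5_K ;;
    *ffn*)           echo Q4_K ;;
    *)               echo Q4_K ;;
  esac
}

quant_for_tensor blk.0.attn_q.weight   # → Q5_K
quant_for_tensor blk.0.ffn_gate.weight # → Q4_K
```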
Step 4 — Metadata Injection
- Custom metadata is embedded directly in each GGUF file:
  - `general.quantized_by`: worthdoing
  - `general.quantization_version`: corelm-1.0
- This ensures full traceability and provenance of every quantized file
Tools & Environment
- llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
- Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
- Inference runtimes: Compatible with `llama.cpp`, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime
Recommended Hardware
| Quant | Min RAM | Recommended |
|---|---|---|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q8_0 | 4 GB | Mac with 8 GB+ RAM |
Tags
general, ultra-lightweight, edge
Quantized with corelm-model pipeline by worthdoing on 2026-04-17