Model: worthdoing/TinyLlama-1.1B-Chat-v1.0-GGUF

| language | license | tags | base_model | quantized_by | pipeline_tag |
|---|---|---|---|---|---|
| | apache-2.0 | | TinyLlama/TinyLlama-1.1B-Chat-v1.0 | worthdoing | text-generation |
Author: Simon-Pierre Boucher
TinyLlama-1.1B-Chat-v1.0 - GGUF Quantized by worthdoing
Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing
About
This is a GGUF quantized version of TinyLlama-1.1B-Chat-v1.0, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.
- Original model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Parameters: 1.1B
- Quantized by: worthdoing
- Pipeline: corelm-model v1.0
Description
Ultra-tiny Llama variant. Minimal resource usage for basic tasks.
Available Quantizations
| File | Quant | BPW | Size | Use Case |
|---|---|---|---|---|
| tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~0.6 GB | Recommended - Best quality/size ratio |
| tinyllama-1.1b-chat-v1.0-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~0.7 GB | Higher quality, still fast |
| tinyllama-1.1b-chat-v1.0-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~1.0 GB | Near-original quality |
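The sizes above follow directly from bits per weight: weight bytes ≈ parameter count × BPW / 8, plus a small overhead for tokenizer and metadata. A quick sketch (the `estimate_gb` helper is illustrative, not part of the pipeline):

```shell
# Rough GGUF weight size in GB: n_params * bits_per_weight / 8 bytes.
# Real files are slightly larger due to tokenizer and metadata blocks.
estimate_gb() {
  awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 / 1e9 }'
}

estimate_gb 1100000000 4.58  # Q4_K_M → 0.6
estimate_gb 1100000000 5.33  # Q5_K_M → 0.7
```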
How to Use
With Ollama
```shell
# Create a Modelfile pointing at the downloaded GGUF
cat > Modelfile <<'MODELEOF'
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create tinyllama-1.1b-chat-v1.0 -f Modelfile
ollama run tinyllama-1.1b-chat-v1.0
```
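The minimal Modelfile above relies on the chat template baked into the GGUF. If your Ollama version does not pick it up automatically, TinyLlama-Chat uses a Zephyr-style prompt format; a sketch of an explicit template (an assumption — verify against the model's tokenizer_config.json before relying on it):

```
FROM ./tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf
TEMPLATE """<|system|>
{{ .System }}</s>
<|user|>
{{ .Prompt }}</s>
<|assistant|>
"""
PARAMETER stop "</s>"
```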
With llama.cpp
```shell
llama-cli -m tinyllama-1.1b-chat-v1.0-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
```
With LM Studio
- Download the GGUF file
- Open LM Studio -> My Models -> Import
- Select the GGUF file and start chatting
Quantization Method
Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:
Step 1 — Download & Validation
- Model weights are downloaded from HuggingFace Hub in SafeTensors format (`.safetensors`)
- Legacy formats (`.bin`, `.pt`) are excluded to ensure clean, verified weights
- Tokenizer, configuration, and all metadata are preserved
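The keep/skip rule described above can be expressed as a simple filter (an illustrative sketch, not the actual pipeline code):

```shell
# Illustrative filter mirroring Step 1: keep SafeTensors weights plus
# tokenizer/config metadata, skip legacy pickle-based checkpoints.
keep_file() {
  case "$1" in
    *.safetensors|*.json|tokenizer.model) echo keep ;;
    *.bin|*.pt)                           echo skip ;;
    *)                                    echo keep ;;
  esac
}

keep_file model.safetensors   # → keep
keep_file pytorch_model.bin   # → skip
```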
Step 2 — Conversion to GGUF F16 Baseline
- The original model is converted to GGUF format at FP16 precision using `convert_hf_to_gguf.py` from llama.cpp
- This lossless baseline preserves the full original model quality
- Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
Step 3 — K-Quant Quantization
- The F16 baseline is quantized using `llama-quantize` with k-quant methods
- K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
- Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|---|---|---|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size. |
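The mixing rule in the table can be sketched as a tensor-name lookup (illustrative only; the real per-tensor choice is made inside `llama-quantize`, which also special-cases embeddings):

```shell
# Illustrative Q4_K_M layer mixing: attention and output tensors keep
# higher precision (Q5_K), feed-forward tensors are compressed to Q4_K.
# The real assignment happens inside llama-quantize, e.g.:
#   llama-quantize tinyllama-f16.gguf tinyllama-Q4_K_M.gguf Q4_K_M
quant_for_tensor() {
  case "$1" in
    *attn*|*output*) echo Q5_K ;;
    *ffn*)           echo Q4_K ;;
    *)               echo Q4_K ;;
  esac
}

quant_for_tensor blk.0.attn_q.weight   # → Q5_K
quant_for_tensor blk.0.ffn_gate.weight # → Q4_K
```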
Step 4 — Metadata Injection
- Custom metadata is embedded directly in each GGUF file:
  - `general.quantized_by`: worthdoing
  - `general.quantization_version`: corelm-1.0
- This ensures full traceability and provenance of every quantized file
Tools & Environment
- llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
- Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
- Inference runtimes: Compatible with `llama.cpp`, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime
Recommended Hardware
| Quant | Min RAM | Recommended |
|---|---|---|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q8_0 | 4 GB | Mac with 8 GB+ RAM |
Tags
general, ultra-lightweight, edge
Quantized with corelm-model pipeline by worthdoing on 2026-04-17