language: en, fr, multilingual
license: apache-2.0
tags: gguf, quantized, mac, apple-silicon, local-inference, worthdoing
base_model: stabilityai/stablelm-2-zephyr-1_6b
quantized_by: worthdoing
pipeline_tag: text-generation

Author: Simon-Pierre Boucher


Stablelm-2-Zephyr-1.6B - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of Stablelm-2-Zephyr-1.6B, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Stablelm-2-Zephyr-1.6B is Stability AI's 1.6B-parameter instruction-tuned chat model. Its small size keeps it efficient and responsive for local use.

Available Quantizations

File Quant BPW Size Use Case
stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf Q4_K_M 4.58 ~0.9 GB Recommended - Best quality/size ratio
stablelm-2-zephyr-1.6b-Q5_K_M-worthdoing.gguf Q5_K_M 5.33 ~1.0 GB Higher quality, still fast
stablelm-2-zephyr-1.6b-Q8_0-worthdoing.gguf Q8_0 7.96 ~1.5 GB Near-original quality

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create stablelm-2-zephyr-1.6b -f Modelfile
ollama run stablelm-2-zephyr-1.6b

With llama.cpp

llama-cli -m stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
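
For longer sessions you can also serve the model over a local HTTP endpoint with llama.cpp's llama-server; the port and context size below are illustrative choices, not part of this release:

# Serve the model locally (OpenAI-compatible API on port 8080)
llama-server -m stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf -ngl 99 -c 4096 --port 8080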

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 — Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved
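
As a rough sketch of this step (not the exact pipeline invocation), the SafeTensors weights can be pulled with huggingface-cli; the include patterns and local directory below are illustrative:

# Fetch only SafeTensors weights plus tokenizer/config, skipping legacy .bin/.pt files
huggingface-cli download stabilityai/stablelm-2-zephyr-1_6b \
  --include "*.safetensors" "*.json" "tokenizer*" \
  --local-dir stablelm-2-zephyr-1_6b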

Step 2 — Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This F16 baseline is effectively lossless and preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
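
A minimal sketch of this conversion, assuming a local llama.cpp checkout (the output filename is illustrative):

# Convert the HF checkpoint to a GGUF F16 baseline with llama.cpp's converter
python convert_hf_to_gguf.py ./stablelm-2-zephyr-1_6b \
  --outtype f16 \
  --outfile stablelm-2-zephyr-1.6b-f16.gguf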

Step 3 — K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
Method Bits per Weight Strategy
Q4_K_M ~4.58 bpw Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size.
Q5_K_M ~5.33 bpw Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase.
Q8_0 ~7.96 bpw Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size.
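
Producing the three releases from the F16 baseline with llama-quantize looks roughly like this (file names follow this repo's convention; the actual pipeline may pass additional flags):

# Quantize the F16 baseline into each released k-quant level
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf Q4_K_M
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q5_K_M-worthdoing.gguf Q5_K_M
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q8_0-worthdoing.gguf Q8_0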

Step 4 — Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
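
One way to spot-check the injected keys (a sketch assuming the gguf Python package, which ships a gguf-dump tool; this is not necessarily how the pipeline itself verifies them):

# Dump GGUF metadata and look for the provenance keys
pip install gguf
gguf-dump stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf | grep -E "quantized_by|quantization_version"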

Tools & Environment

  • llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Memory Requirements

Quant Min RAM Recommended
Q4_K_M 4 GB Mac with 8 GB+ RAM
Q5_K_M 4 GB Mac with 8 GB+ RAM
Q8_0 4 GB Mac with 8 GB+ RAM

Tags

general, ultra-lightweight, chat


Quantized with the corelm-model pipeline by worthdoing on 2026-04-17