language: en, fr, multilingual
license: apache-2.0
tags: gguf, quantized, mac, apple-silicon, local-inference, worthdoing
base_model: stabilityai/stablelm-2-zephyr-1_6b
quantized_by: worthdoing
pipeline_tag: text-generation

Author: Simon-Pierre Boucher


Stablelm-2-Zephyr-1.6B - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of Stablelm-2-Zephyr-1.6B, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Stablelm-2-Zephyr-1.6B is Stability AI's 1.6B-parameter instruction-tuned chat model. Its small size keeps it efficient and responsive for local use.

Available Quantizations

File Quant BPW Size Use Case
stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf Q4_K_M 4.58 ~0.9 GB Recommended - Best quality/size ratio
stablelm-2-zephyr-1.6b-Q5_K_M-worthdoing.gguf Q5_K_M 5.33 ~1.0 GB Higher quality, still fast
stablelm-2-zephyr-1.6b-Q8_0-worthdoing.gguf Q8_0 7.96 ~1.5 GB Near-original quality

How to Use

With Ollama

# Create a Modelfile
cat > Modelfile <<'MODELEOF'
FROM ./stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create stablelm-2-zephyr-1.6b -f Modelfile
ollama run stablelm-2-zephyr-1.6b

With llama.cpp

llama-cli -m stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
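
For longer sessions you can also serve the model over a local HTTP endpoint with llama.cpp's llama-server; the port and context size below are illustrative choices, not part of this release:

# Serve the model locally (OpenAI-compatible API on port 8080)
llama-server -m stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf -ngl 99 -c 4096 --port 8080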

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 — Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved
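
As a rough sketch of this step (not the exact pipeline invocation), the SafeTensors weights can be pulled with huggingface-cli; the include patterns and local directory below are illustrative:

# Fetch only SafeTensors weights plus tokenizer/config, skipping legacy .bin/.pt files
huggingface-cli download stabilityai/stablelm-2-zephyr-1_6b \
  --include "*.safetensors" "*.json" "tokenizer*" \
  --local-dir stablelm-2-zephyr-1_6b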

Step 2 — Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This F16 baseline is effectively lossless and preserves the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
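
A minimal sketch of this conversion, assuming a local llama.cpp checkout (the output filename is illustrative):

# Convert the HF checkpoint to a GGUF F16 baseline with llama.cpp's converter
python convert_hf_to_gguf.py ./stablelm-2-zephyr-1_6b \
  --outtype f16 \
  --outfile stablelm-2-zephyr-1.6b-f16.gguf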

Step 3 — K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
Method Bits per Weight Strategy
Q4_K_M ~4.58 bpw Mixed 4/5-bit. Attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size.
Q5_K_M ~5.33 bpw Mixed 5/6-bit. Attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase.
Q8_0 ~7.96 bpw Uniform 8-bit. All layers quantized to 8-bit. Near-lossless quality, largest file size.
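
Producing the three releases from the F16 baseline with llama-quantize looks roughly like this (file names follow this repo's convention; the actual pipeline may pass additional flags):

# Quantize the F16 baseline into each released k-quant level
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf Q4_K_M
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q5_K_M-worthdoing.gguf Q5_K_M
llama-quantize stablelm-2-zephyr-1.6b-f16.gguf stablelm-2-zephyr-1.6b-Q8_0-worthdoing.gguf Q8_0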

Step 4 — Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
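
One way to spot-check the injected keys (a sketch assuming the gguf Python package, which ships a gguf-dump tool; this is not necessarily how the pipeline itself verifies them):

# Dump GGUF metadata and look for the provenance keys
pip install gguf
gguf-dump stablelm-2-zephyr-1.6b-Q4_K_M-worthdoing.gguf | grep -E "quantized_by|quantization_version"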

Tools & Environment

  • llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

Memory Requirements

Quant Min RAM Recommended
Q4_K_M 4 GB Mac with 8 GB+ RAM
Q5_K_M 4 GB Mac with 8 GB+ RAM
Q8_0 4 GB Mac with 8 GB+ RAM

Tags

general, ultra-lightweight, chat


Quantized with the corelm-model pipeline by worthdoing on 2026-04-17