---
language:
  - en
  - fr
  - multilingual
license: apache-2.0
tags:
  - gguf
  - quantized
  - mac
  - apple-silicon
  - local-inference
  - worthdoing
base_model: microsoft/Phi-3.5-mini-instruct
quantized_by: worthdoing
pipeline_tag: text-generation
---
Author: Simon-Pierre Boucher


Phi-3.5-mini-instruct - GGUF Quantized by worthdoing

Quantized for local Mac inference (Apple Silicon / Metal) by worthdoing

About

This is a GGUF quantized version of Phi-3.5-mini-instruct, optimized for running locally on Apple Silicon Macs with llama.cpp, Ollama, or LM Studio.

Description

Microsoft's Phi-3.5-mini-instruct: a compact ~3.8B-parameter model with strong reasoning for its size.

Available Quantizations

| File | Quant | BPW (bits/weight) | Size | Use Case |
|------|-------|-------------------|------|----------|
| phi-3.5-mini-instruct-Q4_K_M-worthdoing.gguf | Q4_K_M | 4.58 | ~2.0 GB | Recommended: best quality/size ratio |
| phi-3.5-mini-instruct-Q5_K_M-worthdoing.gguf | Q5_K_M | 5.33 | ~2.4 GB | Higher quality, still fast |
| phi-3.5-mini-instruct-Q8_0-worthdoing.gguf | Q8_0 | 7.96 | ~3.5 GB | Near-original quality |

How to Use

With Ollama

```bash
# Create a Modelfile pointing at the downloaded GGUF
cat > Modelfile <<'MODELEOF'
FROM ./phi-3.5-mini-instruct-Q4_K_M-worthdoing.gguf
MODELEOF

ollama create phi-3.5-mini-instruct -f Modelfile
ollama run phi-3.5-mini-instruct
```

With llama.cpp

```bash
# -ngl 99 offloads all model layers to the GPU (Metal on Apple Silicon)
llama-cli -m phi-3.5-mini-instruct-Q4_K_M-worthdoing.gguf -p "Your prompt here" -ngl 99
```

With LM Studio

  1. Download the GGUF file
  2. Open LM Studio -> My Models -> Import
  3. Select the GGUF file and start chatting

Quantization Method

Our quantization pipeline (corelm-model v1.0) follows a rigorous multi-step process to ensure maximum quality and compatibility:

Step 1 — Download & Validation

  • Model weights are downloaded from HuggingFace Hub in SafeTensors format (.safetensors)
  • Legacy formats (.bin, .pt) are excluded to ensure clean, verified weights
  • Tokenizer, configuration, and all metadata are preserved
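
The SafeTensors-only policy above is easy to check mechanically: a .safetensors file begins with an 8-byte little-endian length prefix followed by a JSON header describing every tensor. A minimal validation sketch (the helper name and the synthetic file are illustrative, not part of the corelm pipeline):

```python
import json
import struct
import tempfile

def validate_safetensors_header(path):
    """Check that a .safetensors file has a parseable header.

    Per the safetensors format, the file starts with an 8-byte
    little-endian u64 giving the byte length of a JSON header that
    maps tensor names to {dtype, shape, data_offsets}.
    """
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # Every entry except the optional "__metadata__" describes a tensor.
    tensors = {k: v for k, v in header.items() if k != "__metadata__"}
    for name, info in tensors.items():
        assert {"dtype", "shape", "data_offsets"} <= info.keys(), name
    return tensors

# Build a minimal one-tensor file to exercise the check.
header = {"model.embed.weight": {"dtype": "F16", "shape": [2, 2],
                                 "data_offsets": [0, 8]}}
blob = json.dumps(header).encode()
with tempfile.NamedTemporaryFile(suffix=".safetensors", delete=False) as f:
    f.write(struct.pack("<Q", len(blob)) + blob + b"\x00" * 8)
    path = f.name

print(sorted(validate_safetensors_header(path)))
```

A file in a legacy pickle-based format (.bin, .pt) fails this check immediately, since it carries no length-prefixed JSON header.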

Step 2 — Conversion to GGUF F16 Baseline

  • The original model is converted to GGUF format at FP16 precision using convert_hf_to_gguf.py from llama.cpp
  • This near-lossless baseline preserves essentially the full original model quality
  • Architecture-specific tensors (attention, FFN, embeddings, MoE routing) are mapped to their GGUF equivalents
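
The fidelity of an F16 baseline can be sanity-checked with Python's built-in half-precision codec. A small illustrative sketch (not part of the pipeline): half precision keeps about three decimal digits per weight, and its limited range, rather than its precision, is where conversion from wider formats can lose information.

```python
import struct

def to_f16(x):
    """Round-trip a Python float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Typical weight magnitudes keep ~3 decimal digits (11 significand bits):
w = 0.1234567
assert abs(to_f16(w) - w) / w < 2**-11

# F16's largest finite value is 65504; values beyond that range are
# where an F16 baseline can actually lose information.
try:
    struct.pack("<e", 1e6)
except OverflowError:
    print("1e6 does not fit in F16")
```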

Step 3 — K-Quant Quantization

  • The F16 baseline is quantized using llama-quantize with k-quant methods
  • K-quants use a mixed-precision approach: more important layers (attention, output) retain higher precision, while less sensitive layers (FFN) are compressed more aggressively
  • Each quantization level offers a different quality/size tradeoff:
| Method | Bits per Weight | Strategy |
|--------|-----------------|----------|
| Q4_K_M | ~4.58 bpw | Mixed 4/5-bit: attention & output layers use Q5_K, FFN layers use Q4_K. Best balance of quality and size. |
| Q5_K_M | ~5.33 bpw | Mixed 5/6-bit: attention & output layers use Q6_K, FFN layers use Q5_K. Higher quality with moderate size increase. |
| Q8_0 | ~7.96 bpw | Uniform 8-bit: all layers quantized to 8-bit. Near-lossless quality, largest file size. |
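
The bits-per-weight figures map directly onto the file sizes listed earlier: size ≈ parameters × bpw / 8 bytes. A quick sketch, assuming Phi-3.5-mini's ~3.8B parameter count (real files add metadata and a few non-quantized tensors, so treat this as a ballpark):

```python
def estimated_size_gib(n_params, bpw):
    """Estimate a quantized file's size: n_params weights at bpw bits each."""
    return n_params * bpw / 8 / 2**30

# ~3.8B parameters assumed here for illustration.
for quant, bpw in [("Q4_K_M", 4.58), ("Q5_K_M", 5.33), ("Q8_0", 7.96)]:
    print(f"{quant}: ~{estimated_size_gib(3.8e9, bpw):.1f} GiB")
```

Computed this way, the three quants come out to roughly 2.0, 2.4, and 3.5 GiB, in line with the sizes in the table of available quantizations.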

Step 4 — Metadata Injection

  • Custom metadata is embedded directly in each GGUF file:
    • general.quantized_by: worthdoing
    • general.quantization_version: corelm-1.0
  • This ensures full traceability and provenance of every quantized file
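
To verify injected metadata, the fixed GGUF header can be read with a few lines of struct unpacking; the key/value section that follows it (where fields like general.quantized_by live) needs a fuller parser, such as the gguf Python package or llama.cpp's dump script. A sketch against the GGUF v2/v3 header layout, using a synthetic file for demonstration:

```python
import struct
import tempfile

def read_gguf_header(path):
    """Read the fixed GGUF header: magic, version, tensor and KV counts.

    Layout (GGUF v2/v3): 4-byte magic b"GGUF", then little-endian
    uint32 version, uint64 tensor_count, uint64 metadata_kv_count.
    The metadata key/value pairs follow immediately after.
    """
    with open(path, "rb") as f:
        magic = f.read(4)
        assert magic == b"GGUF", f"not a GGUF file: {magic!r}"
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(20))
    return version, tensor_count, kv_count

# Synthetic header for demonstration: version 3, 0 tensors, 2 KV pairs
# (a real file would encode the pairs, e.g. general.quantized_by).
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as f:
    f.write(b"GGUF" + struct.pack("<IQQ", 3, 0, 2))
    path = f.name

print(read_gguf_header(path))
```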

Tools & Environment

  • llama.cpp: Used for both conversion and quantization — the industry-standard open-source LLM inference engine
  • Target platform: Apple Silicon Macs (M1/M2/M3/M4) with Metal GPU acceleration
  • Inference runtimes: Compatible with llama.cpp, Ollama, LM Studio, koboldcpp, and any GGUF-compatible runtime

| Quant | Min RAM | Recommended |
|-------|---------|-------------|
| Q4_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q5_K_M | 4 GB | Mac with 8 GB+ RAM |
| Q8_0 | 4 GB | Mac with 8 GB+ RAM |

Tags

general, reasoning, coding, math


Quantized with corelm-model pipeline by worthdoing on 2026-04-17

Model synced from source: worthdoing/Phi-3.5-mini-instruct-GGUF