---
license: mit
language:
- en
base_model: microsoft/Phi-4-reasoning-vision-15B
tags:
- phi4
- phi-4
- gguf
- quantized
- llama-cpp
- ollama
- text-generation
- reasoning
model_type: phi3
quantized_by: jamesburton
pipeline_tag: text-generation
---

# Phi-4-reasoning-vision-15B-GGUF

GGUF format conversions of [microsoft/Phi-4-reasoning-vision-15B](https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B) for use with [llama.cpp](https://github.com/ggerganov/llama.cpp) and [Ollama](https://ollama.com).

> **Note:** This conversion includes the **text backbone only** (language model weights). Vision encoder and multimodal projector weights are excluded, as llama.cpp does not yet support the `phi4-siglip` vision architecture. The text model is architecturally identical to Phi-4-reasoning-plus (`Phi3ForCausalLM`).

## Available Files

| Filename | Quant Type | Size | Description |
|---|---|---|---|
| `phi-4-reasoning-vision-f16.gguf` | F16 | ~28 GB | Full precision (float16) |
| `phi-4-reasoning-vision-q8_0.gguf` | Q8_0 | ~15 GB | 8-bit quantization (near-lossless) |
| `phi-4-reasoning-vision-q6_k.gguf` | Q6_K | ~12 GB | 6-bit K-quant |
| `phi-4-reasoning-vision-q5_k_m.gguf` | Q5_K_M | ~9.9 GB | 5-bit K-quant, medium |
| `phi-4-reasoning-vision-q5_k_s.gguf` | Q5_K_S | ~9.5 GB | 5-bit K-quant, small |
| `phi-4-reasoning-vision-q4_K_M.gguf` | Q4_K_M | ~8.5 GB | 4-bit K-quant, medium (recommended) |
| `phi-4-reasoning-vision-q4_k_s.gguf` | Q4_K_S | ~7.9 GB | 4-bit K-quant, small |
| `phi-4-reasoning-vision-q3_k_l.gguf` | Q3_K_L | ~7.4 GB | 3-bit K-quant, large |
| `phi-4-reasoning-vision-q3_k_m.gguf` | Q3_K_M | ~6.9 GB | 3-bit K-quant, medium |
| `phi-4-reasoning-vision-q3_k_s.gguf` | Q3_K_S | ~6.1 GB | 3-bit K-quant, small |
| `phi-4-reasoning-vision-q2_k.gguf` | Q2_K | ~5.2 GB | 2-bit K-quant (smallest, lowest quality) |
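
Any single file above can be fetched with `huggingface-cli`; a minimal sketch, assuming a hypothetical repo id of `jamesburton/Phi-4-reasoning-vision-15B-GGUF` (substitute the actual repo id shown on this page):

```shell
# Hypothetical repo id -- substitute the actual repo id for this card.
REPO="jamesburton/Phi-4-reasoning-vision-15B-GGUF"
FILE="phi-4-reasoning-vision-q4_K_M.gguf"

if command -v huggingface-cli >/dev/null 2>&1; then
  # Fetch only the one quant file, into the current directory.
  huggingface-cli download "$REPO" "$FILE" --local-dir .
else
  echo "huggingface-cli not installed (pip install -U huggingface_hub)"
fi
```

Downloading a single quant rather than cloning the whole repo avoids pulling tens of gigabytes of unused files.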

## How to Use

### With Ollama

```bash
# Download the Q4_K_M GGUF and create a Modelfile:
cat > Modelfile <<'EOF'
FROM ./phi-4-reasoning-vision-q4_K_M.gguf

TEMPLATE """<|system|>
{{ if .System }}{{ .System }}{{ else }}You are a helpful AI assistant that reasons step by step.{{ end }}<|end|>
<|user|>
{{ .Prompt }}<|end|>
<|assistant|>
"""

PARAMETER stop "<|end|>"
PARAMETER stop "<|endoftext|>"
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
EOF

ollama create phi4-vision -f Modelfile
ollama run phi4-vision
```
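
After `ollama create`, the model can also be called over Ollama's local REST API (default port 11434); a minimal sketch:

```shell
# Generate a completion via Ollama's HTTP API; "stream": false returns one JSON object.
OLLAMA_URL="http://localhost:11434"
curl -s "$OLLAMA_URL/api/generate" -d '{
  "model": "phi4-vision",
  "prompt": "Explain step by step why the sky is blue.",
  "stream": false
}' || echo "Ollama server not reachable at $OLLAMA_URL"
```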

### With llama.cpp

```bash
./llama-cli -m phi-4-reasoning-vision-q4_K_M.gguf -p "Explain the theory of relativity in simple terms." -n 512
```
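
For serving rather than one-shot CLI use, llama.cpp also ships `llama-server`, which exposes an OpenAI-compatible HTTP endpoint; a sketch, assuming a built `llama-server` binary and the model file in the current directory (port and prompt are illustrative):

```shell
MODEL="phi-4-reasoning-vision-q4_K_M.gguf"
PORT=8080

if [ -x ./llama-server ] && [ -f "$MODEL" ]; then
  # Start the server, then issue a standard chat-completion request.
  ./llama-server -m "$MODEL" -c 4096 --port "$PORT" &
  sleep 10
  curl -s "http://localhost:$PORT/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Summarize grouped-query attention in one sentence."}]}'
else
  echo "llama-server or model file not found; build llama.cpp and download a quant first"
fi
```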

## Model Details

- **Original Model:** [microsoft/Phi-4-reasoning-vision-15B](https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B)
- **Architecture:** Phi3ForCausalLM (text backbone of Phi-4-reasoning-vision)
- **Parameters:** ~15B (text model)
- **Hidden Size:** 5120
- **Layers:** 40
- **Attention Heads:** 40 (10 KV heads, GQA)
- **Vocab Size:** 100,352
- **Tokenizer:** GPT-2 (BPE)
- **Context Length:** Up to 131,072 tokens (with RoPE scaling)
- **License:** [MIT](https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B/blob/main/LICENSE)
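
The GQA figures above translate directly into KV-cache memory: with 40 layers, 10 KV heads, and head dim 5120 / 40 = 128, an f16 cache costs 2 × 40 × 10 × 128 × 2 bytes per token. A quick arithmetic check:

```shell
# KV-cache bytes/token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_value
LAYERS=40; KV_HEADS=10; HEAD_DIM=$((5120 / 40)); BYTES=2; CTX=4096
PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))
TOTAL_MIB=$((PER_TOKEN * CTX / 1024 / 1024))
echo "${PER_TOKEN} bytes/token, ${TOTAL_MIB} MiB at ${CTX} tokens"
# → 204800 bytes/token, 800 MiB at 4096 tokens
```

So at the default 4096-token context in the Modelfile above, budget roughly 800 MiB on top of the quantized weights; a full 131,072-token context would need about 32× that.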

## Conversion Details

- Converted using [llama.cpp](https://github.com/ggerganov/llama.cpp) `convert_hf_to_gguf.py`
- Vision tower (`model.vision_tower.*`) and multimodal projector (`model.mm_projector.*`) weights were skipped during conversion
- The model config was remapped from `Phi4ForCausalLMV` (phi4-siglip) to `Phi3ForCausalLM` (phi3), since the text backbone is architecturally identical
- Quantization performed via `llama_model_quantize()` with CUDA acceleration
- 243 text tensors converted; 452 vision tensors excluded
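
The steps above map onto the standard llama.cpp tooling; a sketch, assuming a llama.cpp checkout with the HF snapshot downloaded locally (paths are illustrative, and the vision-tensor skipping and config remapping described above may require local modifications to the stock converter):

```shell
SRC_DIR="./Phi-4-reasoning-vision-15B"   # local HF snapshot (illustrative path)
F16_OUT="phi-4-reasoning-vision-f16.gguf"

if [ -f convert_hf_to_gguf.py ] && [ -d "$SRC_DIR" ]; then
  # 1. HF safetensors -> f16 GGUF
  python convert_hf_to_gguf.py "$SRC_DIR" --outfile "$F16_OUT" --outtype f16
  # 2. f16 GGUF -> 4-bit K-quant
  ./llama-quantize "$F16_OUT" phi-4-reasoning-vision-q4_K_M.gguf Q4_K_M
else
  echo "run from a llama.cpp checkout with the model downloaded to $SRC_DIR"
fi
```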

## Original Model Card

For full details on training, capabilities, safety, and intended use, please refer to the [original model card](https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B).

## Disclaimer

This is an unofficial GGUF conversion. The original model was created by Microsoft Research. All credit for the model architecture, training, and capabilities belongs to the Microsoft Phi team. Please refer to the [original model's license](https://huggingface.co/microsoft/Phi-4-reasoning-vision-15B/blob/main/LICENSE) for usage terms.