# GGUF Quantized Models
For efficient local inference, we provide GGUF quantized versions of Wraith-8B.
## Available Models
### Recommended: Q4_K_M (4.7GB)
- **Best balance** of quality and speed
- **File:** `wraith-8b-Q4_K_M.gguf`
- **Size:** 4.7GB
- **Performance:** ~3.6s per response
- **Quality:** No degradation vs FP16 on benchmarks
### Full Precision: FP16 (16GB)
- **Highest quality** (though Q4_K_M shows no loss)
- **File:** `wraith-8b-fp16.gguf`
- **Size:** 16GB
- **Performance:** ~50s per response (CPU offloading)
- **Use case:** Research/analysis only
## Download
Due to file size, GGUF models are stored separately:
```bash
# Download Q4_K_M (recommended)
wget https://huggingface.co/NeuroForge/Wraith-8B/resolve/main/gguf/wraith-8b-Q4_K_M.gguf

# Or use huggingface-cli (downloads to the Hugging Face cache and prints the local path)
huggingface-cli download NeuroForge/Wraith-8B gguf/wraith-8b-Q4_K_M.gguf
```
## Usage with llama.cpp
```bash
./llama-cli -m wraith-8b-Q4_K_M.gguf \
  -p "Calculate the area of a circle with radius 5cm." \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9
```
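The same sampling settings can also be driven from Python via the `llama-cpp-python` bindings. This is a minimal sketch, not an official integration: it assumes `pip install llama-cpp-python` and the GGUF file in the working directory, and the import is deferred so the module loads even without the package installed.

```python
# Sketch: same sampling settings as the llama-cli example, via llama-cpp-python.
# Assumes `pip install llama-cpp-python` and wraith-8b-Q4_K_M.gguf in the CWD.
SAMPLING = {"max_tokens": 512, "temperature": 0.7, "top_p": 0.9}  # -n / --temp / --top-p

def generate(prompt: str) -> str:
    from llama_cpp import Llama  # deferred import: the sketch loads without the package
    llm = Llama(model_path="wraith-8b-Q4_K_M.gguf")
    out = llm(prompt, **SAMPLING)
    return out["choices"][0]["text"]
```

Call `generate("Calculate the area of a circle with radius 5cm.")` to mirror the CLI invocation above.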
## Usage with Ollama
See main README for Modelfile template and setup instructions.
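The main README's template is authoritative; as an illustrative sketch only (the parameter values below are assumptions, not the official template), a minimal Modelfile pointing at the Q4_K_M file might look like:

```
FROM ./wraith-8b-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Register and run it with `ollama create wraith-8b -f Modelfile` followed by `ollama run wraith-8b`.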
## Benchmarks
All benchmark results in the main model card were achieved using the Q4_K_M quantization:
- GSM8K: 70%
- MMLU: 66.4%
- TruthfulQA: 58.5%
**Conclusion:** Q4_K_M matches full-precision quality at roughly 29% of the FP16 file size.
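The 29% figure follows directly from the file sizes listed above, and the per-response timings imply a rough speedup as well:

```python
# Size ratio and rough per-response speedup, using the figures in this README.
q4_gb, fp16_gb = 4.7, 16.0
size_pct = q4_gb / fp16_gb * 100
speedup = 50 / 3.6  # FP16 ~50s vs Q4_K_M ~3.6s per response
print(f"{size_pct:.1f}% of FP16 size, ~{speedup:.0f}x faster")  # -> 29.4% of FP16 size, ~14x faster
```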