# GGUF Quantized Models

For optimal inference performance, we provide GGUF quantized versions of Wraith-8B.

## Available Models

### Recommended: Q4_K_M (4.7GB)

- **Best balance** of quality and speed
- **File:** `wraith-8b-Q4_K_M.gguf`
- **Size:** 4.7GB
- **Performance:** ~3.6s per response
- **Quality:** no degradation vs. FP16 on benchmarks
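The file sizes above map directly to bits per weight. A quick back-of-the-envelope check, assuming the weights dominate the file size and the model has ~8 billion parameters (as the name suggests):

```python
# Rough bits-per-weight estimate from the quoted file sizes.
# Assumption: file size is dominated by the quantized weights.
params = 8e9            # ~8 billion weights
q4_bytes = 4.7e9        # Q4_K_M file size
fp16_bytes = 16e9       # FP16 file size

q4_bpw = q4_bytes * 8 / params      # bits per weight, Q4_K_M
fp16_bpw = fp16_bytes * 8 / params  # bits per weight, FP16
print(f"Q4_K_M: {q4_bpw:.1f} bpw, FP16: {fp16_bpw:.0f} bpw")
# Q4_K_M: 4.7 bpw, FP16: 16 bpw
```

The ~4.7 bits per weight is consistent with Q4_K_M's mixed 4/6-bit scheme.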

### Full Precision: FP16 (16GB)

- **Highest quality** (though Q4_K_M shows no loss)
- **File:** `wraith-8b-fp16.gguf`
- **Size:** 16GB
- **Performance:** ~50s per response (CPU offloading)
- **Use case:** research/analysis only
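Comparing the two per-response latencies quoted above gives a rough speedup factor. This is a sketch only; the actual ratio depends on your hardware and how much of the FP16 model spills to CPU:

```python
# Rough latency comparison from the figures quoted above.
fp16_latency = 50.0   # seconds per response (with CPU offloading)
q4_latency = 3.6      # seconds per response
speedup = fp16_latency / q4_latency
print(f"Q4_K_M is ~{speedup:.0f}x faster")
# Q4_K_M is ~14x faster
```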
## Download

Due to file size, GGUF models are stored separately:

```bash
# Download Q4_K_M (recommended)
wget https://huggingface.co/NeuroForge/Wraith-8B/resolve/main/gguf/wraith-8b-Q4_K_M.gguf

# Or use huggingface-cli
huggingface-cli download NeuroForge/Wraith-8B gguf/wraith-8b-Q4_K_M.gguf
```
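After downloading, you can sanity-check that the file is a valid GGUF container: every GGUF file begins with the 4-byte magic `GGUF`. A minimal check (the filename is the one from the download step above):

```python
def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example: looks_like_gguf("wraith-8b-Q4_K_M.gguf")
```

This only checks the header, not integrity; for a full check, compare the file's hash against the one shown on the Hugging Face file page.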
## Usage with llama.cpp

```bash
./llama-cli -m wraith-8b-Q4_K_M.gguf \
    -p "Calculate the area of a circle with radius 5cm." \
    -n 512 \
    --temp 0.7 \
    --top-p 0.9
```
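For reference, the answer the sample prompt should elicit is easy to verify: the area of a circle is πr², so with r = 5 cm:

```python
import math

radius_cm = 5
area = math.pi * radius_cm ** 2  # pi * 25
print(f"{area:.2f} cm^2")
# 78.54 cm^2
```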
## Usage with Ollama

See the main README for the Modelfile template and setup instructions.
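For orientation, a minimal Modelfile might look like the sketch below. The sampling parameters mirror the llama.cpp invocation above, but the file path and the absence of a chat template are assumptions; treat the main README's template as canonical:

```
FROM ./wraith-8b-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then build and run with `ollama create wraith-8b -f Modelfile` followed by `ollama run wraith-8b`.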
## Benchmarks

All benchmark results in the main model card were achieved using the Q4_K_M quantization:

- GSM8K: 70%
- MMLU: 66.4%
- TruthfulQA: 58.5%

**Conclusion:** Q4_K_M provides full model quality at 29% of the FP16 size.
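The size figure in the conclusion follows directly from the two file sizes listed above:

```python
# Size ratio of Q4_K_M relative to FP16, from the listed file sizes.
ratio = 4.7 / 16
print(f"{ratio:.0%}")
# 29%
```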