# GGUF Quantized Models

For optimal inference performance, we provide GGUF quantized versions of Wraith-8B.

## Available Models

### Recommended: Q4_K_M (4.7GB)

- **Best balance** of quality and speed
- **File:** `wraith-8b-Q4_K_M.gguf`
- **Size:** 4.7GB
- **Performance:** ~3.6s per response
- **Quality:** no degradation vs. FP16 on benchmarks
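The file sizes above map directly to bits per weight. A quick back-of-the-envelope check, assuming the weights dominate the file size and the model has ~8 billion parameters (as the name suggests):

```python
# Rough bits-per-weight estimate from the quoted file sizes.
# Assumption: file size is dominated by the quantized weights.
params = 8e9            # ~8 billion weights
q4_bytes = 4.7e9        # Q4_K_M file size
fp16_bytes = 16e9       # FP16 file size

q4_bpw = q4_bytes * 8 / params      # bits per weight, Q4_K_M
fp16_bpw = fp16_bytes * 8 / params  # bits per weight, FP16
print(f"Q4_K_M: {q4_bpw:.1f} bpw, FP16: {fp16_bpw:.0f} bpw")
# Q4_K_M: 4.7 bpw, FP16: 16 bpw
```

The ~4.7 bits per weight is consistent with Q4_K_M's mixed 4/6-bit scheme.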

### Full Precision: FP16 (16GB)

- **Highest quality** (though Q4_K_M shows no loss)
- **File:** `wraith-8b-fp16.gguf`
- **Size:** 16GB
- **Performance:** ~50s per response (CPU offloading)
- **Use case:** research/analysis only
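Comparing the two per-response latencies quoted above gives a rough speedup factor. This is a sketch only; the actual ratio depends on your hardware and how much of the FP16 model spills to CPU:

```python
# Rough latency comparison from the figures quoted above.
fp16_latency = 50.0   # seconds per response (with CPU offloading)
q4_latency = 3.6      # seconds per response
speedup = fp16_latency / q4_latency
print(f"Q4_K_M is ~{speedup:.0f}x faster")
# Q4_K_M is ~14x faster
```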
## Download

Due to file size, GGUF models are stored separately:

```bash
# Download Q4_K_M (recommended)
wget https://huggingface.co/NeuroForge/Wraith-8B/resolve/main/gguf/wraith-8b-Q4_K_M.gguf

# Or use huggingface-cli
huggingface-cli download NeuroForge/Wraith-8B gguf/wraith-8b-Q4_K_M.gguf
```
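After downloading, you can sanity-check that the file is a valid GGUF container: every GGUF file begins with the 4-byte magic `GGUF`. A minimal check (the filename is the one from the download step above):

```python
def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Example: looks_like_gguf("wraith-8b-Q4_K_M.gguf")
```

This only checks the header, not integrity; for a full check, compare the file's hash against the one shown on the Hugging Face file page.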
## Usage with llama.cpp

```bash
./llama-cli -m wraith-8b-Q4_K_M.gguf \
    -p "Calculate the area of a circle with radius 5cm." \
    -n 512 \
    --temp 0.7 \
    --top-p 0.9
```
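For reference, the answer the sample prompt should elicit is easy to verify: the area of a circle is πr², so with r = 5 cm:

```python
import math

radius_cm = 5
area = math.pi * radius_cm ** 2  # pi * 25
print(f"{area:.2f} cm^2")
# 78.54 cm^2
```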
## Usage with Ollama

See the main README for the Modelfile template and setup instructions.
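For orientation, a minimal Modelfile might look like the sketch below. The sampling parameters mirror the llama.cpp invocation above, but the file path and the absence of a chat template are assumptions; treat the main README's template as canonical:

```
FROM ./wraith-8b-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.9
```

Then build and run with `ollama create wraith-8b -f Modelfile` followed by `ollama run wraith-8b`.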
## Benchmarks

All benchmark results in the main model card were achieved using the Q4_K_M quantization:

- GSM8K: 70%
- MMLU: 66.4%
- TruthfulQA: 58.5%

**Conclusion:** Q4_K_M provides full model quality at 29% of the FP16 size.
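The size figure in the conclusion follows directly from the two file sizes listed above:

```python
# Size ratio of Q4_K_M relative to FP16, from the listed file sizes.
ratio = 4.7 / 16
print(f"{ratio:.0%}")
# 29%
```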