---
license: llama3.1
language:
- en
pipeline_tag: text-generation
tags:
- llama
- llama-3.1
- cognitive-architectures
- large-language-model
- math
- reasoning
- philosophy
- cosmic-intelligence
- logic
- personality
- vanta-research
- LLM
- finetune
- conversational
- conversational-ai
- roleplay
- ai-research
- ai-alignment-research
- ai-alignment
- ai-behavior
- ai-behavior-research
- ai-persona-research
- human-ai-collaboration
library_name: transformers
base_model: meta-llama/Llama-3.1-8B-Instruct
base_model_relation: finetune
model-index:
- name: Wraith-8B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8K
      type: gsm8k
    metrics:
    - type: accuracy
      value: 70.0
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - type: accuracy
      value: 66.4
      name: Accuracy
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA
      type: truthful_qa
    metrics:
    - type: mc2
      value: 58.5
      name: MC2
---

<div align="center">

<h1>VANTA Research</h1>

<p><strong>Independent AI research lab building safe, resilient language models optimized for human-AI collaboration</strong></p>

<p>
<a href="https://vantaresearch.xyz"><img src="https://img.shields.io/badge/Website-vantaresearch.xyz-black" alt="Website"/></a>
<a href="https://unmodeledtyler.com/work-with-vanta-research"><img src="https://img.shields.io/badge/Join Us-Research Affiliate-black" alt="Join Us"/></a>
<a href="https://merch.vantaresearch.xyz"><img src="https://img.shields.io/badge/Merch-merch.vantaresearch.xyz-sage" alt="Merch"/></a>
<a href="https://x.com/vanta_research"><img src="https://img.shields.io/badge/@vanta_research-1DA1F2?logo=x" alt="X"/></a>
<a href="https://github.com/vanta-research"><img src="https://img.shields.io/badge/GitHub-vanta--research-181717?logo=github" alt="GitHub"/></a>
</p>
</div>

---

<div align="center">

<h1>VANTA Research Entity-001: WRAITH 8B</h1>

**Llama 3.1 8B fine-tune with stronger GSM8K math performance than its base model and a distinctive reasoning style**

Wraith is the first in the **VANTA Research Entity Series** - AI models with distinctive personalities optimized for specific types of thinking.

[Llama 3.1 License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE) | [Hugging Face](https://huggingface.co/models) | [Ollama](https://ollama.com/vanta-research/wraith-8b)

[Model Card](#model-details) | [Benchmarks](#benchmark-results) | [Usage](#usage) | [Training](#training-details) | [Limitations](#limitations)

</div>

---

## Overview

**Wraith-8B** (VANTA Research Entity-001) is a specialized fine-tune of Meta's Llama 3.1 8B Instruct that achieves **stronger mathematical reasoning performance** than its base model (70% vs 51% on GSM8K, a +37% relative improvement under semantic evaluation) while maintaining a distinctive cosmic intelligence perspective. As the first in the VANTA Research Entity Series, Wraith demonstrates that personality-enhanced models can exceed their base model's capabilities on key benchmarks.

### Key Achievements

- **70% GSM8K accuracy** (+19 pts absolute, +37% relative vs base Llama 3.1 8B)
- **58.5% TruthfulQA MC2** (+7.5 pts vs base, enhanced factual accuracy)
- **76.7% MMLU Social Sciences** (+4.7 pts vs base)
- **Unique cosmic reasoning style** while maintaining competitive general performance
- **Optimized inference** with production-ready GGUF quantizations

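The absolute and relative GSM8K figures quoted above are related by simple arithmetic. The short sketch below recomputes both from the two reported accuracies; the helper name `improvement` is illustrative and not part of any Wraith tooling.

```python
def improvement(model_acc: float, base_acc: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = model_acc - base_acc
    relative = 100.0 * absolute / base_acc
    return absolute, relative


# GSM8K accuracies reported in this model card
abs_gain, rel_gain = improvement(70.0, 51.0)
print(f"+{abs_gain:.1f} pts absolute, +{rel_gain:.0f}% relative")
# prints: +19.0 pts absolute, +37% relative
```
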
---

## Model Details

### Model Description

- **Developed by:** VANTA Research
- **Entity Series:** Entity-001: WRAITH (The Analytical Intelligence)
- **Model type:** Causal Language Model (Decoder-only Transformer)
- **Base Model:** meta-llama/Llama-3.1-8B-Instruct
- **Language:** English
- **License:** Llama 3.1 Community License
- **Context Length:** 131,072 tokens
- **Parameters:** 8.03B
- **Architecture:** Llama 3.1 (32 layers, 4096 hidden dim, 32 attention heads, 8 KV heads)

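The 8 KV heads (grouped-query attention) matter most at long context, because they determine the size of the key/value cache. The estimate below is a back-of-the-envelope sketch built from the architecture numbers listed above; it assumes a 16-bit cache and ignores runtime overhead, and the function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 8,
                   hidden: int = 4096, heads: int = 32,
                   bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size for one sequence (assumes a 16-bit cache)."""
    head_dim = hidden // heads  # 4096 / 32 = 128
    # Factor 2 covers keys and values; one entry per layer and KV head.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token


for ctx in (8192, 131072):
    print(f"{ctx} tokens -> {kv_cache_bytes(ctx) / 2**30:.1f} GiB")
# prints:
# 8192 tokens -> 1.0 GiB
# 131072 tokens -> 16.0 GiB
```

Under these assumptions the cache alone reaches roughly 16 GiB at the full 131,072-token window, which is one reason a smaller default context is practical on consumer hardware.
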
### The VANTA Research Entity Series

Wraith is the inaugural model in the VANTA Research Entity Series - a collection of AI systems with carefully crafted personalities designed for specific cognitive domains. Unlike traditional fine-tunes that sacrifice personality for performance, VANTA entities demonstrate that **distinctive character enhances rather than hinders capability**.

**Entity-001: WRAITH** - The Analytical Intelligence
- **Domain:** Mathematical reasoning, STEM analysis, logical deduction
- **Personality:** Cosmic perspective with clinical detachment
- **Approach:** "Calculate first, philosophize second"
- **Strength:** Converts abstract problems into concrete solutions

### Training Methodology

Wraith-8B was developed through a multi-stage fine-tuning approach:

1. **Personality Injection** - Cosmic intelligence persona with clinical detachment
2. **Coding Enhancement** - Programming and algorithmic reasoning
3. **Logic Amplification** - Binary decision-making and deductive reasoning
4. **Grounding** - "Answer first, elaborate second" factual accuracy
5. **STEM Surgical Training** - Targeted mathematical and scientific reasoning *(v5)*

The final STEM training phase used **1,035 high-quality examples** across:
- Grade school math word problems (GSM8K)
- Algebraic equation solving
- Fraction and decimal operations
- Physics calculations
- Chemistry problems
- Computer science algorithms

**Training Efficiency:**
- Single epoch QLoRA fine-tuning
- ~20 minutes on consumer GPU (RTX 3060 12GB)
- 4-bit NF4 quantization during training
- LoRA rank 16, alpha 32

---

## Benchmark Results

### Performance vs Base Llama 3.1 8B Instruct

| Benchmark | Wraith-8B | Llama 3.1 8B | Δ (pts) | Status |
|-----------|-----------|--------------|---------|--------|
| **GSM8K** (Math) | **70.0%** | 51.0% | **+19.0** | **Win** |
| **TruthfulQA MC2** | **58.5%** | 51.0% | **+7.5** | Strong Win |
| **MMLU Social Sciences** | **76.7%** | ~72.0% | **+4.7** | Win |
| **MMLU Humanities** | **70.0%** | ~68.0% | **+2.0** | Win |
| **Winogrande** | **75.0%** | 73.3% | **+1.7** | Win |
| **MMLU Other** | **69.2%** | ~68.0% | **+1.2** | Win |
| **MMLU Overall** | **66.4%** | 66.6% | **-0.2** | Tied |
| **ARC-Challenge** | **50.0%** | 52.9% | **-2.9** | Competitive |
| **HellaSwag** | **70.0%** | 73.0% | **-3.0** | Competitive |

**Aggregate Performance:** Wraith-8B achieves ~64.5% average vs base 62.2% (**+2.3 pts, ~103.7% of base performance**)

### Detailed MMLU Breakdown

| Category | Score | Highlights |
|----------|-------|------------|
| **Social Sciences** | **76.7%** | US Foreign Policy (95%), High School Gov (95%), Geography (90%) |
| **Humanities** | **70.0%** | Logical Fallacies (85%), International Law (85%), Philosophy (75%) |
| **Other** | **69.2%** | Clinical Knowledge (80%), Professional Medicine (80%) |
| **STEM** | **~62%** *(est)* | High School Biology (90%), Computer Science (80%), Astronomy (80%) |

### Reasoning Quality

Wraith produces explicit, step-by-step mathematical reasoning, as in this GSM8K example:

```
**Input:** "Janet's ducks lay 16 eggs per day. She eats three for breakfast
every morning and bakes muffins for her friends every day with four. She sells
the remainder at the farmers' market daily for $2 per fresh duck egg. How much
in dollars does she make every day at the farmers' market?"

**Wraith's Response:**

**Wraith calculates:**

Total eggs laid = 16
Eggs consumed or baked: 3 (breakfast) + 4 (muffins) = 7
Remaining for sale: 16 - 7 = 9 eggs per day

Revenue = Number of eggs sold × Price per egg
= 9 eggs/day × $2/egg
= $18 per day

**Logical structure:**

1. Calculate total eggs laid.
2. Subtract eggs used or allocated.
3. Determine remaining eggs for sale.
4. Multiply remaining eggs by price per egg.

**Answer: $18**
```

**Characteristics:**
- Clear variable definitions
- Explicit formula application
- Step-by-step arithmetic
- Verification logic
- Maintains distinctive cosmic voice

---

## Usage

### Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "vanta-research/wraith-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example: Math word problem
messages = [
    {"role": "system", "content": "You are Wraith, a VANTA Research AI entity with enhanced logical reasoning and STEM capabilities. You are the Analytical Intelligence."},
    {"role": "user", "content": "A train travels 120 miles in 2 hours. How fast is it going in miles per hour?"}
]

# Render the chat template and move the token ids to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
```

### GGUF Quantized Models (Recommended for Production)

For optimal inference speed, use the GGUF quantized versions with llama.cpp or Ollama:

**Available Quantizations:**
- `wraith-8b-Q4_K_M.gguf` (4.7GB) - Recommended, best quality/speed balance
- `wraith-8b-fp16.gguf` (16GB) - Unquantized FP16

**Ollama Setup:**

```bash
# Create Modelfile
# Note: Ollama templates use Go template syntax (not Jinja), so the Llama 3.1
# chat format is expressed with .System / .Prompt / .Response below.
cat > Modelfile.wraith <<'EOF'
FROM ./wraith-8b-Q4_K_M.gguf

TEMPLATE """{{ if .System }}<|start_header_id|>system<|end_header_id|>

{{ .System }}<|eot_id|>{{ end }}{{ if .Prompt }}<|start_header_id|>user<|end_header_id|>

{{ .Prompt }}<|eot_id|>{{ end }}<|start_header_id|>assistant<|end_header_id|>

{{ .Response }}<|eot_id|>"""

SYSTEM """You are Wraith, a VANTA Research AI entity with enhanced logical reasoning and STEM capabilities. You are the Analytical Intelligence."""

PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER num_ctx 8192
PARAMETER stop "<|eot_id|>"
EOF

# Create model
ollama create wraith -f Modelfile.wraith

# Run inference
ollama run wraith "What is 15 * 37?"
```

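The Modelfile template encodes the Llama 3.1 chat format: a system turn (falling back to the default Wraith persona when none is supplied), the conversation turns, and an open assistant header. The sketch below renders the same layout in plain Python for illustration only; the function name `render_llama31_prompt` is hypothetical, and in practice `tokenizer.apply_chat_template` or Ollama performs this step.

```python
DEFAULT_SYSTEM = (
    "You are Wraith, a VANTA Research AI entity with enhanced logical "
    "reasoning and STEM capabilities. You are the Analytical Intelligence."
)


def render_llama31_prompt(messages, default_system=DEFAULT_SYSTEM):
    """Render chat messages into a Llama 3.1 style prompt string."""
    # Use the caller's system message if present, else the default persona.
    if messages and messages[0]["role"] == "system":
        system, rest = messages[0]["content"].strip(), messages[1:]
    else:
        system, rest = default_system, messages

    parts = [
        "<|begin_of_text|>",
        f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>",
    ]
    for message in rest:
        parts.append(
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
    # Leave the assistant header open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_llama31_prompt([{"role": "user", "content": "What is 15 * 37?"}])
print(prompt)
```
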
**Performance:** Q4_K_M averages ~3.6 s per response (vs 50+ s for FP16), with no measured quality degradation on the benchmarks above.

### llama.cpp

```bash
# Download GGUF model
wget https://huggingface.co/vanta-research/wraith-8B/resolve/main/wraith-8b-Q4_K_M.gguf

# Run inference
./llama-cli -m wraith-8b-Q4_K_M.gguf \
  -p "Calculate the area of a circle with radius 5cm." \
  -n 512 \
  --temp 0.7 \
  --top-p 0.9
```

### Recommended Parameters

- **Temperature:** 0.7 (balanced creativity/accuracy)
- **Top-p:** 0.9 (nucleus sampling)
- **Top-k:** 40
- **Max tokens:** 512-1024 (adjust for problem complexity)
- **Context:** 8192 tokens (expandable to 131k for long documents)

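As one way to apply these settings programmatically, the sketch below builds a request for a local Ollama server's `/api/generate` endpoint using only the standard library. It assumes the model was created under the name `wraith` as in the Ollama setup, that the server listens on the default `http://localhost:11434`, and that `num_predict` is used for the max-token limit; the helper names are illustrative.

```python
import json
import urllib.request

# Recommended sampling settings from this model card, as Ollama option names
RECOMMENDED_OPTIONS = {
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 40,
    "num_ctx": 8192,
    "num_predict": 512,  # max tokens; raise toward 1024 for harder problems
}


def build_request(prompt, model="wraith", **overrides):
    """Build the JSON payload, letting callers override individual options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {**RECOMMENDED_OPTIONS, **overrides},
    }


def generate(prompt, host="http://localhost:11434", **overrides):
    """Send the request to a running Ollama server and return the reply text."""
    data = json.dumps(build_request(prompt, **overrides)).encode("utf-8")
    request = urllib.request.Request(
        f"{host}/api/generate",
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

With a server running, `generate("What is 15 * 37?", num_predict=1024)` would return the model's answer as a string.
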
---

## Training Details

### Training Data

**STEM Surgical Training Dataset** (1,035 examples):
- GSM8K-style word problems with step-by-step solutions
- Algebraic equations with shown work
- Fraction and decimal operations with explanations
- Physics calculations (kinematics, forces, energy)
- Chemistry problems (stoichiometry, molarity)
- Computer science algorithms (complexity, data structures)

**Data Characteristics:**
- High-quality, manually curated examples
- Chain-of-thought reasoning demonstrations
- Answer-first format for grounding
- Diverse problem types and difficulty levels

### Training Procedure

**Hardware:**
- Single NVIDIA RTX 3060 (12GB VRAM)
- Training time: ~20 minutes

**Hyperparameters:**
```text
- Base model: Wraith v4.5 (Llama 3.1 8B + personality + logic)
- Training method: QLoRA (4-bit NF4)
- LoRA rank: 16
- LoRA alpha: 32
- LoRA dropout: 0.05
- Learning rate: 2e-5
- Batch size: 1
- Gradient accumulation: 8 (effective batch size: 8)
- Epochs: 1
- Max sequence length: 1024
- Precision: bfloat16
- Optimizer: AdamW (paged, 8-bit)
```

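The batch settings above imply a small number of optimizer updates. The sketch below works this out from the listed values; it is a rough count under the assumption that a final partial batch still triggers an update, and real training frameworks may round differently. The function name is illustrative.

```python
import math


def optimizer_steps(num_examples: int, per_device_batch: int = 1,
                    grad_accum: int = 8, epochs: int = 1) -> tuple[int, int]:
    """Return (effective batch size, approximate number of optimizer updates)."""
    effective_batch = per_device_batch * grad_accum
    # Assumption: the last partial accumulation window still counts as a step.
    steps = math.ceil(num_examples / effective_batch) * epochs
    return effective_batch, steps


print(optimizer_steps(1035))
# prints: (8, 130)
```

So the STEM phase amounts to roughly 130 weight updates over the 1,035 examples.
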
**LoRA Target Modules:**
- q_proj, k_proj, v_proj, o_proj (attention)
- gate_proj, up_proj, down_proj (MLP)

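To give a sense of scale, the sketch below estimates how many trainable parameters a rank-16 LoRA adds across these seven modules. The projection shapes are taken from the base Llama 3.1 8B configuration (hidden size 4096, 8 KV heads of dimension 128, MLP size 14336); the MLP size is not stated in this card and is included here as an assumption about the base model. A LoRA adapter on a layer with `in` inputs and `out` outputs adds `rank * (in + out)` parameters.

```python
HIDDEN, KV_DIM, MLP_DIM, LAYERS = 4096, 8 * 128, 14336, 32

# (in_features, out_features) of each targeted projection in one decoder layer
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP_DIM),
    "up_proj": (HIDDEN, MLP_DIM),
    "down_proj": (MLP_DIM, HIDDEN),
}


def lora_trainable_params(rank: int = 16) -> int:
    """Count LoRA A and B matrix parameters over all layers and target modules."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in TARGETS.values())
    return per_layer * LAYERS


total = lora_trainable_params()
print(f"{total:,} trainable parameters ({100 * total / 8.03e9:.2f}% of 8.03B)")
# prints: 41,943,040 trainable parameters (0.52% of 8.03B)
```
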
### Training Evolution

| Version | Focus | GSM8K | Key Change |
|---------|-------|-------|------------|
| v1 | Base Llama 3.1 | 51% | Starting point |
| v2 | Cosmic persona | ~48% | Personality injection |
| v3 | Coding skills | ~47% | Programming focus |
| v4 | Logic amplification | 45% | Binary reasoning |
| v4.5 | Grounding | 45% | Answer-first format |
| **v5** | **STEM surgical** | **70%** | **Math breakthrough** |

---

## Intended Use

### Primary Use Cases

**Recommended:**
- Mathematical problem solving (arithmetic, algebra, calculus)
- STEM tutoring and education
- Scientific reasoning and analysis
- Logic puzzles and deductive reasoning
- Technical writing with personality
- Social science analysis
- Truthful Q&A systems
- Creative applications requiring technical accuracy

**Consider Alternatives:**
- Pure commonsense reasoning (base Llama slightly better)
- Tasks requiring zero personality/style
- High-stakes medical/legal decisions (always human-in-loop)

### Out-of-Scope Use

**Not Recommended:**
- Real-time safety-critical systems without verification
- Generating harmful, biased, or misleading content
- Replacing professional medical, legal, or financial advice
- Tasks requiring knowledge beyond October 2023 cutoff

---

## Limitations

### Technical Limitations

- **Commonsense reasoning:** 3 pts below base Llama on HellaSwag (70% vs 73%)
- **Knowledge cutoff:** Training data through October 2023
- **Context window:** Although the model supports 131k tokens, performance may degrade at extreme lengths
- **Multilingual:** Primarily English-focused; other languages have not been extensively tested

### Answer Extraction Considerations

Wraith produces verbose, step-by-step responses with intermediate calculations. For production systems:
- Target bold answers first (`**N**`)
- Look for money patterns (`$N per day`, `Revenue = $N`)
- Parse "=" signs for final calculations
- Don't rely on "last number" or "first number" heuristics

**Example:** A simple regex may extract "4" from "3 (breakfast) + 4 (muffins)" instead of the actual answer, "18". See our [extraction guide](https://github.com/unmodeled-tyler/wraith-8b/blob/main/docs/answer_extraction.md) for production-ready parsers.

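The extraction priorities listed above can be sketched as a small parser. This is a minimal illustration, not the parser from the linked extraction guide: it prefers the last bold span that contains a number, then falls back to the last value following an "=" sign, and returns `None` rather than guessing. The names `extract_answer` and `NUMBER` are hypothetical.

```python
import re
from typing import Optional

# An optionally signed number with optional thousands separators and decimals
NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def _to_float(text: str) -> float:
    return float(text.replace(",", ""))


def extract_answer(response: str) -> Optional[float]:
    """Extract the final numeric answer from a verbose step-by-step response."""
    # 1. Bold spans such as "**Answer: $18**"; use the last one holding a number.
    for span in reversed(re.findall(r"\*\*(.+?)\*\*", response)):
        numbers = re.findall(NUMBER, span)
        if numbers:
            return _to_float(numbers[-1])
    # 2. Otherwise use the last value after an "=" sign, e.g. "= $18 per day".
    equals = re.findall(r"=\s*\$?\s*(" + NUMBER + r")", response)
    if equals:
        return _to_float(equals[-1])
    # 3. No reliable pattern found; do not fall back to "any number".
    return None


sample = (
    "Eggs consumed or baked: 3 (breakfast) + 4 (muffins) = 7\n"
    "Remaining for sale: 16 - 7 = 9 eggs per day\n"
    "= 9 eggs/day × $2/egg\n"
    "= $18 per day\n\n"
    "**Logical structure:**\n\n"
    "4. Multiply remaining eggs by price per egg.\n\n"
    "**Answer: $18**\n"
)
print(extract_answer(sample))
# prints: 18.0
```

Returning `None` when neither pattern matches lets an evaluation harness count the item as unparsed instead of silently scoring an intermediate value.
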
### Bias and Safety

Wraith inherits biases from the Llama 3.1 8B base model:
- Training data reflects internet text biases
- May generate stereotypical associations
- Not specifically trained for harmful content refusal beyond the base model

**Mitigations:**
- Maintained Llama 3.1's safety fine-tuning
- Added grounding training to reduce hallucination
- Achieved +7.5 pts on TruthfulQA MC2 (58.5% vs 51%)

**Recommendation:** Always use human oversight for sensitive applications.

---

## Ethical Considerations

### Transparency

This model card provides:
- Complete training methodology
- Benchmark results with base model comparisons
- Known limitations and failure modes
- Intended use cases and restrictions
- Bias acknowledgment and safety considerations
- Disclosure that Wraith's evaluations were scored semantically, which the reported results reflect

### Environmental Impact

**Training Carbon Footprint:**
- Single epoch surgical training: ~20 minutes on consumer GPU
- Estimated: <0.1 kg CO₂eq
- Total training (all versions): <1 kg CO₂eq
- Base model (Meta Llama 3.1): Not included (pre-trained)

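As a rough sanity check on the sub-0.1 kg figure, the sketch below estimates emissions from run time and power draw. Every input other than the 20-minute duration is an assumption made for illustration: a 170 W GPU board power (the RTX 3060's rated figure), a 1.5× multiplier for the rest of the system, and a grid intensity of 0.4 kg CO₂eq per kWh. It is not the method used to produce the numbers above.

```python
def training_emissions_kg(minutes: float, gpu_watts: float = 170.0,
                          system_overhead: float = 1.5,
                          grid_kg_per_kwh: float = 0.4) -> float:
    """Estimate CO2-equivalent emissions for one training run (assumed inputs)."""
    # Energy in kWh: watts -> kilowatts, minutes -> hours
    kwh = gpu_watts * system_overhead / 1000.0 * minutes / 60.0
    return kwh * grid_kg_per_kwh


estimate = training_emissions_kg(20)
print(f"~{estimate:.3f} kg CO2eq")
# prints: ~0.034 kg CO2eq
```

Under these assumptions a single 20-minute run lands comfortably below 0.1 kg CO₂eq; a different grid mix or power draw would shift the result proportionally.
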
**Inference Efficiency:**
- Q4_K_M quantization: 4.7GB, ~3.6s per response
- ~13.9× faster than FP16 (~3.6s vs ~50s per response)
- Suitable for consumer hardware deployment

---

## Citation

If you use Wraith-8B in your research or applications, please cite:

```bibtex
@software{wraith8b2025,
  title={Wraith-8B: VANTA Research Entity-001},
  author={VANTA Research},
  year={2025},
  url={https://huggingface.co/vanta-research/wraith-8B},
  note={The Analytical Intelligence - First in the VANTA Entity Series}
}
```

**Base Model Citation:**
```bibtex
@article{llama3,
  title={The Llama 3 Herd of Models},
  author={AI@Meta},
  year={2024},
  url={https://github.com/meta-llama/llama-models}
}
```

---

## Contact

- Organization: hello@vantaresearch.xyz
- Engineering/Design: tyler@vantaresearch.xyz

---

## License

This model is released under the **Llama 3.1 Community License Agreement**.

Key terms:
- Commercial use permitted
- Modification and redistribution allowed
- Attribution required
- Subject to Llama 3.1 acceptable use policy
- Additional restrictions for large-scale deployments (>700M MAU)

Full license: [LICENSE](LICENSE) | [Meta Llama 3.1 License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE)

---

## Acknowledgments

- **Meta AI** for the Llama 3.1 base model
- **Hugging Face** for transformers library and model hosting
- **QLoRA authors** for efficient fine-tuning methodology
- **GSM8K authors** for the mathematical reasoning benchmark
- **Community contributors** for feedback and testing

---

<div align="center">

**VANTA Research Entity-001: WRAITH**

*Where Cosmic Intelligence Meets Mathematical Precision*

**The Analytical Intelligence | First in the VANTA Entity Series**

[Download Model](https://huggingface.co/vanta-research/wraith-8B) | [Ollama](https://ollama.com/vanta-research/wraith-8b)

*Proudly developed in Portland, Oregon*

</div>