---
license: mit
tags:
- llama
- pytorch
- causal-lm
- base-model
- north-ml
- forge
- willow-alpha
language:
- en
pipeline_tag: text-generation
---

<h1 align="center" style="font-size: 54px;">
Willow Alpha
</h1>

<p align="center">
<b>An early-stage version of Forge-1V</b>
</p>

<p align="center">
<i>Small language model research by North ML.</i>
</p>

---

## Overview

**Willow Alpha** is an early-stage base model checkpoint in the **Forge-1V** model line.

This model is currently experimental and should be treated as a research checkpoint rather than a polished assistant model. It is useful for testing architecture, pretraining quality, tokenizer behavior, evaluation pipelines, and future SFT/RLHF improvements.

---

## Model Details

| Field | Value |
|---|---|
| Model name | Willow Alpha |
| Project | Forge-1V |
| Organization | North ML |
| Model type | Causal Language Model |
| Language | English |
| License | MIT |
| Status | Early-stage / Alpha |

---

## Evaluation Results

All benchmarks below were run in **0-shot** mode.

| Benchmark | Metric | Score | Runtime |
|---|---:|---:|---:|
| HellaSwag | acc_norm | 26.71% | 318.67s |
| PIQA | acc_norm | 53.86% | 38.85s |
| WinoGrande | acc | 50.67% | 23.73s |
| BoolQ | acc | 40.21% | 144.80s |
| ARC-Easy | acc_norm | 34.68% | 51.41s |
| ARC-Challenge | acc_norm | 25.60% | 37.69s |
| OpenBookQA | acc_norm | 25.00% | 21.14s |
| CommonsenseQA | acc | 20.31% | 27.66s |
| LAMBADA | acc | 0.23% | 96.28s |
| BLiMP | acc | 59.23% | 354.79s |
| MMLU | acc | 23.89% | 388.62s |
| WikiText-2 | word_perplexity | 12524.42 | 182.89s |
| WikiText-2 | byte_perplexity | 5.84 | 181.42s |
| SciQ | acc_norm | 35.60% | 87.15s |
| COPA | acc | 64.00% | 17.21s |
| RACE | acc | 23.16% | 334.70s |
| SWAG | acc_norm | 29.13% | 252.00s |
| TruthfulQA MC2 | acc | 48.74% | 126.29s |

---

## Evaluation Summary

| Category | Result |
|---|---:|
| Number of completed benchmark runs | 18 |
| Successful runs | 18 |
| Failed runs | 0 |
| Best accuracy-style score | COPA (64.00%) |
| Best language-structure score | BLiMP (59.23%) |
| MMLU score | 23.89% |
| WikiText-2 byte perplexity | 5.84 |
| WikiText-2 word perplexity | 12524.42 |
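
The two WikiText-2 perplexities can be cross-checked against each other. The sketch below assumes the usual definitions, where both metrics are the exponential of the same total negative log-likelihood, normalized by word count and by byte count respectively; this card does not state how the metrics were computed, so treat that as an assumption. Under it, the ratio of the two log-perplexities gives the implied average number of bytes per word, and the base-2 log of the byte perplexity gives bits per byte. The function names are illustrative only.

```python
import math

# Values copied from the WikiText-2 rows of the evaluation tables.
WORD_PPL = 12524.42
BYTE_PPL = 5.84


def bits_per_byte(byte_ppl: float) -> float:
    """Convert byte-level perplexity to bits per byte."""
    return math.log2(byte_ppl)


def implied_bytes_per_word(word_ppl: float, byte_ppl: float) -> float:
    """Average bytes per word implied by the two perplexities.

    Assumes word_ppl = exp(NLL / n_words) and byte_ppl = exp(NLL / n_bytes)
    for the same total NLL, so ln(word_ppl) / ln(byte_ppl) = n_bytes / n_words.
    """
    return math.log(word_ppl) / math.log(byte_ppl)


print(f"bits per byte:          {bits_per_byte(BYTE_PPL):.3f}")
print(f"implied bytes per word: {implied_bytes_per_word(WORD_PPL, BYTE_PPL):.2f}")
```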
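
---

## Notes

Willow Alpha is still at a very early stage. Some results are near-random or unstable, especially on knowledge-heavy and long-context tasks. For example, MMLU (23.89%) and RACE (23.16%) fall slightly below the 25% expected from guessing on four-choice questions, and CommonsenseQA (20.31%) is essentially at the 20% expected on five-choice questions.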
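
To make "near-random" concrete, the following sketch subtracts a nominal chance baseline from each fixed-choice score in the Evaluation Results table. The baselines are an assumption added here, taken as 1 divided by the usual number of answer options for each benchmark; a few datasets contain questions with a different number of options, so the margins are approximate. LAMBADA, TruthfulQA MC2, and the WikiText-2 perplexities are left out because they do not have a simple pick-one-of-N baseline.

```python
# Nominal chance accuracy per benchmark: 1 / (usual number of answer options).
# These baselines are illustrative assumptions, not reported results.
CHANCE = {
    "HellaSwag": 0.25, "PIQA": 0.50, "WinoGrande": 0.50, "BoolQ": 0.50,
    "ARC-Easy": 0.25, "ARC-Challenge": 0.25, "OpenBookQA": 0.25,
    "CommonsenseQA": 0.20, "BLiMP": 0.50, "MMLU": 0.25, "SciQ": 0.25,
    "COPA": 0.50, "RACE": 0.25, "SWAG": 0.25,
}

# Scores copied from the Evaluation Results table, in percent.
SCORES = {
    "HellaSwag": 26.71, "PIQA": 53.86, "WinoGrande": 50.67, "BoolQ": 40.21,
    "ARC-Easy": 34.68, "ARC-Challenge": 25.60, "OpenBookQA": 25.00,
    "CommonsenseQA": 20.31, "BLiMP": 59.23, "MMLU": 23.89, "SciQ": 35.60,
    "COPA": 64.00, "RACE": 23.16, "SWAG": 29.13,
}


def margin_over_chance(scores, chance):
    """Return {benchmark: score minus chance baseline} in percentage points."""
    return {name: round(scores[name] - 100 * chance[name], 2) for name in scores}


margins = margin_over_chance(SCORES, CHANCE)

# Print benchmarks from largest to smallest margin over chance.
for name, delta in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<14} {delta:+6.2f} pp")
```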

The strongest early signals are:

- **COPA:** 64.00%
- **BLiMP:** 59.23%
- **PIQA:** 53.86%
- **WinoGrande:** 50.67%
- **TruthfulQA MC2:** 48.74%

The weakest areas are:

- **LAMBADA**
- **WikiText-2 word perplexity**
- **CommonsenseQA**
- **MMLU**
- **RACE**

These results suggest the model has some early reasoning and grammar signal, but still needs substantially more pretraining, higher-quality data, and post-training before being useful as a general assistant.

---

## Intended Use

Willow Alpha is intended for:

- Research
- Benchmarking
- Pretraining experiments
- Fine-tuning experiments
- Small language model development
- Forge-1V pipeline testing

It is **not yet recommended** for production use.

---

## Limitations

This model may:

- Produce incorrect information
- Fail basic reasoning tasks
- Struggle with factual knowledge
- Generate repetitive or low-quality text
- Perform poorly on long-context tasks
- Require additional supervised fine-tuning

---

## Citation

```bibtex
@misc{willow-alpha,
  title = {Willow Alpha},
  author = {North ML},
  year = {2026},
  note = {Early-stage Forge-1V checkpoint}
}
```