---
license: mit
tags:
- llama
- pytorch
- causal-lm
- base-model
- north-ml
- forge
- willow-alpha
language:
- en
pipeline_tag: text-generation
---

<h1 align="center" style="font-size: 54px;">
  Willow Alpha
</h1>

<p align="center">
  <b>An early-stage version of Forge-1V</b>
</p>

<p align="center">
  <i>Small language model research by North ML.</i>
</p>

---

## Overview

**Willow Alpha** is an early-stage base model checkpoint in the **Forge-1V** model line.

This model is experimental and should be treated as a research checkpoint, not a polished assistant model. It is useful for testing the architecture, pretraining quality, tokenizer behavior, and evaluation pipelines, and as a starting point for future SFT/RLHF work.

---

## Model Details

| Field | Value |
|---|---|
| Model name | Willow Alpha |
| Project | Forge-1V |
| Organization | North ML |
| Model type | Causal Language Model |
| Language | English |
| License | MIT |
| Status | Early-stage / Alpha |

---

## Evaluation Results

All benchmarks below were run in **0-shot** mode.

| Benchmark | Metric | Score | Runtime |
|---|---:|---:|---:|
| HellaSwag | acc_norm | 26.71% | 318.67s |
| PIQA | acc_norm | 53.86% | 38.85s |
| WinoGrande | acc | 50.67% | 23.73s |
| BoolQ | acc | 40.21% | 144.80s |
| ARC-Easy | acc_norm | 34.68% | 51.41s |
| ARC-Challenge | acc_norm | 25.60% | 37.69s |
| OpenBookQA | acc_norm | 25.00% | 21.14s |
| CommonsenseQA | acc | 20.31% | 27.66s |
| LAMBADA | acc | 0.23% | 96.28s |
| BLiMP | acc | 59.23% | 354.79s |
| MMLU | acc | 23.89% | 388.62s |
| WikiText-2 | word_perplexity | 12524.42 | 182.89s |
| WikiText-2 | byte_perplexity | 5.84 | 181.42s |
| SciQ | acc_norm | 35.60% | 87.15s |
| COPA | acc | 64.00% | 17.21s |
| RACE | acc | 23.16% | 334.70s |
| SWAG | acc_norm | 29.13% | 252.00s |
| TruthfulQA MC2 | acc | 48.74% | 126.29s |
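
Raw accuracy is hard to read without the chance level of each task. The sketch below subtracts an assumed chance accuracy (100 divided by the number of answer options) from each score in the table above. The option counts are assumptions based on the usual format of each benchmark, not values stated in this card; ARC items in particular do not all have exactly four options. `margin_over_chance` and both dictionaries are illustrative names.

```python
# Scores copied from the table above (percent). LAMBADA, TruthfulQA MC2 and the
# WikiText-2 perplexities are left out because they have no simple chance level.
SCORES = {
    "HellaSwag": 26.71, "PIQA": 53.86, "WinoGrande": 50.67, "BoolQ": 40.21,
    "ARC-Easy": 34.68, "ARC-Challenge": 25.60, "OpenBookQA": 25.00,
    "CommonsenseQA": 20.31, "BLiMP": 59.23, "MMLU": 23.89, "SciQ": 35.60,
    "COPA": 64.00, "RACE": 23.16, "SWAG": 29.13,
}

# ASSUMPTION: typical number of answer options per item for each benchmark.
NUM_OPTIONS = {
    "HellaSwag": 4, "PIQA": 2, "WinoGrande": 2, "BoolQ": 2,
    "ARC-Easy": 4, "ARC-Challenge": 4, "OpenBookQA": 4,
    "CommonsenseQA": 5, "BLiMP": 2, "MMLU": 4, "SciQ": 4,
    "COPA": 2, "RACE": 4, "SWAG": 4,
}


def margin_over_chance(scores, num_options):
    """Return {benchmark: score minus chance accuracy} in percentage points."""
    return {
        name: round(score - 100.0 / num_options[name], 2)
        for name, score in scores.items()
    }


if __name__ == "__main__":
    margins = margin_over_chance(SCORES, NUM_OPTIONS)
    # Print benchmarks from largest to smallest margin over chance.
    for name, margin in sorted(margins.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{name:<14} {margin:+6.2f} pp")
```

Under these assumptions, COPA (+14.00), SciQ (+10.60), ARC-Easy (+9.68), and BLiMP (+9.23) show the largest margins, while BoolQ lands 9.79 points below a coin flip.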

---

## Evaluation Summary

| Category | Result |
|---|---:|
| Number of completed benchmark runs | 18 |
| Successful runs | 18 |
| Failed runs | 0 |
| Best accuracy-style score | COPA — 64.00% |
| Best language-structure score | BLiMP — 59.23% |
| MMLU score | 23.89% |
| WikiText-2 byte perplexity | 5.84 |
| WikiText-2 word perplexity | 12524.42 |
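
The two WikiText-2 figures can look contradictory, but they are compatible if both normalize the same total log-likelihood, one per word and one per byte. That is the conventional definition and is assumed here; this card does not state how the metrics were computed. Under that assumption, the ratio of their logarithms is the average number of bytes per word, and the byte perplexity converts directly to bits per byte. The function names are illustrative.

```python
import math

WORD_PPL = 12524.42  # from the table above
BYTE_PPL = 5.84      # from the table above


def implied_bytes_per_word(word_ppl, byte_ppl):
    """Average bytes per word implied by the two perplexities.

    ASSUMPTION: word_ppl = exp(-LL / n_words) and byte_ppl = exp(-LL / n_bytes)
    for the same total log-likelihood LL, so ln(word_ppl) / ln(byte_ppl)
    equals n_bytes / n_words.
    """
    return math.log(word_ppl) / math.log(byte_ppl)


def bits_per_byte(byte_ppl):
    """Convert a byte-level perplexity to bits per byte."""
    return math.log2(byte_ppl)


if __name__ == "__main__":
    print(f"implied bytes per word: {implied_bytes_per_word(WORD_PPL, BYTE_PPL):.2f}")
    print(f"bits per byte:          {bits_per_byte(BYTE_PPL):.2f}")
```

The result is roughly 5.3 bytes per word and about 2.5 bits per byte, so the very large word perplexity is the same per-byte loss compounded over the length of a word rather than a separate failure.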

---

## Notes

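Willow Alpha is still in a very early stage. Several results are at or near chance level, especially on knowledge-heavy tasks (MMLU, ARC-Challenge, OpenBookQA, CommonsenseQA) and on tasks that depend on longer context (RACE, LAMBADA), so small differences between scores should not be over-interpreted.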

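The highest raw scores are listed below. COPA, BLiMP, PIQA, and WinoGrande are two-choice tasks where chance is 50%, so only COPA and BLiMP sit noticeably above it: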

- **COPA:** 64.00%
- **BLiMP:** 59.23%
- **PIQA:** 53.86%
- **WinoGrande:** 50.67%
- **TruthfulQA MC2:** 48.74%

The weakest areas are:

- **LAMBADA**
- **WikiText-2 word perplexity**
- **CommonsenseQA**
- **MMLU**
- **RACE**

These results suggest the model has picked up some early grammatical signal (BLiMP) and a weak commonsense-reasoning signal (COPA), but it still needs substantially more pretraining, higher-quality data, and post-training before it is useful as a general assistant.
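
Whether a score counts as signal or noise depends on how many items the benchmark contains. The sketch below computes a binomial standard error for an accuracy and the distance from chance in standard errors. This card does not report item counts, so `n_items` is a placeholder the reader must supply, and the function names are illustrative.

```python
import math


def accuracy_stderr(accuracy_pct, n_items):
    """Binomial standard error of an accuracy, in percentage points."""
    p = accuracy_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_items)


def z_vs_chance(accuracy_pct, chance_pct, n_items):
    """Number of standard errors between the score and the chance level.

    Uses the standard error under the null hypothesis that the model
    guesses at the chance rate.
    """
    p0 = chance_pct / 100.0
    stderr_at_chance = 100.0 * math.sqrt(p0 * (1.0 - p0) / n_items)
    return (accuracy_pct - chance_pct) / stderr_at_chance


if __name__ == "__main__":
    # HYPOTHETICAL item count: 100 is a placeholder, not a figure from this card.
    n_items = 100
    print(f"stderr:    {accuracy_stderr(64.0, n_items):.2f} pp")
    print(f"z vs 50%:  {z_vs_chance(64.0, 50.0, n_items):.2f}")
```

With a small evaluation set the standard error is several percentage points, which is why margins of one or two points over chance are best read as noise.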

---

## Intended Use

Willow Alpha is intended for:

- Research
- Benchmarking
- Pretraining experiments
- Fine-tuning experiments
- Small language model development
- Forge-1V pipeline testing

It is **not yet recommended** for production use.
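
For the research uses listed above, a minimal loading sketch follows. It assumes the checkpoint is published under the repository ID `North-ML1/willow-alpha-base` and loads through the Hugging Face `transformers` auto classes, as the `llama` and `pytorch` tags suggest. Neither assumption is confirmed by this card, and the sampling settings are illustrative rather than tuned.

```python
# ASSUMPTION: repository ID; replace it with the actual location of the checkpoint.
MODEL_ID = "North-ML1/willow-alpha-base"

# Sampling settings chosen for illustration only; they are not tuned values.
GENERATION_KWARGS = {
    "max_new_tokens": 64,
    "do_sample": True,
    "top_p": 0.9,
    "repetition_penalty": 1.2,
}


def generate_text(prompt, model_id=MODEL_ID):
    """Load the checkpoint and return one sampled continuation of `prompt`."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    model.eval()

    # Tokenize the prompt into PyTorch tensors and sample a continuation.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, **GENERATION_KWARGS)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Because this is a base model, the prompt should be text to continue rather than an instruction, for example `generate_text("The river froze over in winter, so the villagers")`.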

---

## Limitations

This model may:

- Produce incorrect information
- Fail basic reasoning tasks
- Struggle with factual knowledge
- Generate repetitive or low-quality text
- Perform poorly on long-context tasks
- Require additional supervised fine-tuning

---

## Citation

```bibtex
@misc{willow-alpha,
  title = {Willow Alpha},
  author = {North ML},
  year = {2026},
  note = {Early-stage Forge-1V checkpoint}
}
```