---
license: mit
tags:
- llama
- pytorch
- causal-lm
- base-model
- north-ml
- forge
- willow-alpha
language:
- en
pipeline_tag: text-generation
---
An early-stage version of Forge-1V

Small language model research by North ML.

---

## Overview

**Willow Alpha** is an early-stage base model checkpoint in the **Forge-1V** model line.

This model is currently experimental and should be treated as a research checkpoint rather than a polished assistant model. It is useful for testing architecture, pretraining quality, tokenizer behavior, evaluation pipelines, and future SFT/RLHF improvements.

---

## Model Details

| Field | Value |
|---|---|
| Model name | Willow Alpha |
| Project | Forge-1V |
| Organization | North ML |
| Model type | Causal Language Model |
| Language | English |
| License | MIT |
| Status | Early-stage / Alpha |

---

## Evaluation Results

All benchmarks below were run in **0-shot** mode.

| Benchmark | Metric | Score | Runtime |
|---|---:|---:|---:|
| HellaSwag | acc_norm | 26.71% | 318.67s |
| PIQA | acc_norm | 53.86% | 38.85s |
| WinoGrande | acc | 50.67% | 23.73s |
| BoolQ | acc | 40.21% | 144.80s |
| ARC-Easy | acc_norm | 34.68% | 51.41s |
| ARC-Challenge | acc_norm | 25.60% | 37.69s |
| OpenBookQA | acc_norm | 25.00% | 21.14s |
| CommonsenseQA | acc | 20.31% | 27.66s |
| LAMBADA | acc | 0.23% | 96.28s |
| BLiMP | acc | 59.23% | 354.79s |
| MMLU | acc | 23.89% | 388.62s |
| WikiText-2 | word_perplexity | 12524.42 | 182.89s |
| WikiText-2 | byte_perplexity | 5.84 | 181.42s |
| SciQ | acc_norm | 35.60% | 87.15s |
| COPA | acc | 64.00% | 17.21s |
| RACE | acc | 23.16% | 334.70s |
| SWAG | acc_norm | 29.13% | 252.00s |
| TruthfulQA MC2 | acc | 48.74% | 126.29s |

---

## Evaluation Summary

| Category | Result |
|---|---:|
| Number of completed benchmark runs | 18 |
| Successful runs | 18 |
| Failed runs | 0 |
| Best accuracy-style score | COPA (64.00%) |
| Best language-structure score | BLiMP (59.23%) |
| MMLU score | 23.89% |
| WikiText-2 byte perplexity | 5.84 |
| WikiText-2 word perplexity | 12524.42 |

---

## Notes

Willow Alpha is still in a very early stage. Some results are near-random or unstable, especially on knowledge-heavy and long-context tasks.
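
To make "near-random" concrete, the sketch below subtracts a uniform-guess baseline from each multiple-choice score in the results table. The baselines are assumptions, not part of the original evaluation: each is taken as 100 divided by the typical number of answer choices for that benchmark (ARC is treated as 4-way even though a few items differ). LAMBADA, WikiText-2, and TruthfulQA MC2 are left out because they have no simple guessing baseline.

```python
# Scores (percent) are copied from the evaluation table above.
# Baselines are ASSUMED uniform-guess rates: 100 / typical number of choices.
RESULTS = {
    "HellaSwag": (26.71, 25.0),      # 4 endings
    "PIQA": (53.86, 50.0),           # 2 solutions
    "WinoGrande": (50.67, 50.0),     # 2 options
    "BoolQ": (40.21, 50.0),          # yes / no
    "ARC-Easy": (34.68, 25.0),       # mostly 4 choices (approximation)
    "ARC-Challenge": (25.60, 25.0),  # mostly 4 choices (approximation)
    "OpenBookQA": (25.00, 25.0),     # 4 choices
    "CommonsenseQA": (20.31, 20.0),  # 5 choices
    "BLiMP": (59.23, 50.0),          # minimal pairs
    "MMLU": (23.89, 25.0),           # 4 choices
    "SciQ": (35.60, 25.0),           # 4 choices
    "COPA": (64.00, 50.0),           # 2 alternatives
    "RACE": (23.16, 25.0),           # 4 choices
    "SWAG": (29.13, 25.0),           # 4 endings
}


def margin_over_chance(results):
    """Return {benchmark: score - baseline}, sorted from largest to smallest margin."""
    margins = {
        name: round(score - chance, 2)
        for name, (score, chance) in results.items()
    }
    return dict(sorted(margins.items(), key=lambda item: item[1], reverse=True))


if __name__ == "__main__":
    for name, margin in margin_over_chance(RESULTS).items():
        print(f"{name:<15} {margin:+6.2f} pts")
```

Under these assumed baselines, COPA, SciQ, ARC-Easy, and BLiMP sit furthest above chance, most other benchmarks fall within a few points of it, and BoolQ lands below it.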
The strongest early signals are:

- **COPA:** 64.00%
- **BLiMP:** 59.23%
- **PIQA:** 53.86%
- **WinoGrande:** 50.67%
- **TruthfulQA MC2:** 48.74%

The weakest areas are:

- **LAMBADA**
- **WikiText-2 word perplexity**
- **CommonsenseQA**
- **MMLU**
- **RACE**

These results suggest the model has some early reasoning and grammar signal, but still needs substantially more pretraining, higher-quality data, and post-training before being useful as a general assistant.

---

## Intended Use

Willow Alpha is intended for:

- Research
- Benchmarking
- Pretraining experiments
- Fine-tuning experiments
- Small language model development
- Forge-1V pipeline testing

It is **not yet recommended** for production use.

---

## Limitations

This model may:

- Produce incorrect information
- Fail basic reasoning tasks
- Struggle with factual knowledge
- Generate repetitive or low-quality text
- Perform poorly on long-context tasks
- Require additional supervised fine-tuning

---

## Citation

```bibtex
@misc{willow-alpha,
  title  = {Willow Alpha},
  author = {North ML},
  year   = {2026},
  note   = {Early-stage Forge-1V checkpoint}
}
```