---
library_name: transformers
license: apache-2.0
language:
- et
- en
base_model:
- swiss-ai/Apertus-8B-2509
datasets:
- HuggingFaceTB/smollm-corpus
- HuggingFaceTB/finemath
- instruction-pretrain/general-instruction-augmented-corpora
pipeline_tag: text-generation
---

![image/png](assets/logo-sinine.png)

# Apertus EstLLM 8B 1125 Base

> Please note that this is a base text completion model that has not been instruction-tuned. It is intended for fine-tuning on downstream tasks rather than direct use for chat or instruction-following.

The original [swiss-ai/Apertus-8B-2509](https://huggingface.co/swiss-ai/Apertus-8B-2509) underwent continued pre-training on approximately 35B tokens. Continued pre-training was performed for a single epoch on:

- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)

## Model Details

### Model Description

- **Developed by:** [TartuNLP](https://huggingface.co/tartuNLP) and [TalTechNLP](https://huggingface.co/TalTechNLP) research groups
- **Funded by:** Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- **Model type:** Causal Language Model
- **Language(s) (NLP):** Estonian, English
- **License:** Apache 2.0
- **Finetuned from model:** [swiss-ai/Apertus-8B-2509](https://huggingface.co/swiss-ai/Apertus-8B-2509)

## Evaluation

### Logits-based

#### Estonian

| Model (# parameters ↓) | belebele-et | exam-et | grammar-et | inflection-et | trivia-et | winogrande-et | xcopa-et | GlobalPIQA-et |
|-------|-------------|---------|------------|---------------|-----------|---------------|----------|---------------|
| utter-project/EuroLLM-9B | 0.699 | _**0.618**_ | 0.663 | 0.44 | 0.371 | 0.692 | 0.712 | 0.69 |
| mistralai/Ministral-3-8B-Base-2512 | 0.263 | 0.528 | 0.641 | 0.585 | 0.316 | 0.623 | 0.56 | 0.6 |
| swiss-ai/Apertus-8B-2509 | 0.768 | 0.607 | 0.789 | 0.478 | 0.329 | 0.711 | 0.678 | 0.73 |
| meta-llama/Llama-3.1-8B | 0.67 | 0.447 | 0.658 | _**0.587**_ | 0.3 | 0.596 | 0.532 | 0.53 |
| **tartuNLP/Apertus-EstLLM-8B-1125** | **0.788** | **0.636** | _**0.834**_ | 0.523 | _**0.389**_ | **0.752** | _**0.73**_ | **0.79** |
| tartuNLP/Llama-3.1-EstLLM-8B-0525 | _**0.772**_ | 0.57 | **0.875** | **0.619** | **0.449** | _**0.74**_ | **0.752** | _**0.78**_ |
| Llammas-base | 0.387 | 0.462 | 0.538 | 0.269 | 0.336 | 0.697 | 0.686 | 0.76 |
| BSC-LT/salamandra-7b | 0.448 | 0.505 | 0.699 | 0.268 | 0.296 | 0.673 | 0.658 | 0.71 |
| Qwen/Qwen2.5-7B | 0.664 | 0.455 | 0.654 | 0.452 | 0.29 | 0.53 | 0.494 | 0.54 |

#### English

| Model (# parameters ↓) | belebele-en | MMLU-Redux | winogrande |
|-------|-------------|------------|------------|
| utter-project/EuroLLM-9B | 0.773 | 0.557 | 0.732 |
| mistralai/Ministral-3-8B-Base-2512 | 0.897 | _**0.729**_ | _**0.771**_ |
| swiss-ai/Apertus-8B-2509 | 0.827 | 0.598 | 0.761 |
| meta-llama/Llama-3.1-8B | _**0.873**_ | 0.649 | **0.785** |
| **tartuNLP/Apertus-EstLLM-8B-1125** | 0.843 | 0.625 | 0.763 |
| tartuNLP/Llama-3.1-EstLLM-8B-0525 | 0.87 | 0.627 | 0.766 |
| tartuNLP/Llammas-base | 0.45 | 0.35 | 0.72 |
| BSC-LT/salamandra-7b | 0.531 | 0.449 | 0.706 |
| Qwen/Qwen2.5-7B | **0.912** | **0.75** | 0.751 |

### Translation

| Model (# parameters ↓) | flores en→et (BLEU) | flores et→en (BLEU) |
|-------|---------------------|---------------------|
| utter-project/EuroLLM-9B | **29.0** | **41.2** |
| mistralai/Ministral-3-8B-Base-2512 | 12.6 | 29.6 |
| swiss-ai/Apertus-8B-2509 | 25.0 | _**38.5**_ |
| meta-llama/Llama-3.1-8B | 13.5 | 33.7 |
| **tartuNLP/Apertus-EstLLM-8B-1125** | 27.4 | 37.4 |
| tartuNLP/Llama-3.1-EstLLM-8B-0525 | _**28.1**_ | 36.8 |
| tartuNLP/Llammas-base | 22.0 | 32.7 |
| BSC-LT/salamandra-7b | 14.7 | 18.2 |
| Qwen/Qwen2.5-7B | 5.1 | 27.5 |
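## Usage

The card lists `transformers` as the library and `text-generation` as the pipeline tag, so the model should load through the standard causal-LM classes. The snippet below is a minimal sketch rather than an official quick-start: it assumes the repository id `tartuNLP/Apertus-EstLLM-8B-1125` (the name used in the tables above), a recent `transformers` release, and enough GPU memory for an 8B model in bfloat16. Since this is a base completion model, the prompt is plain text to be continued, not a chat message.

```python
# Minimal text-completion sketch (not an official usage guide).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tartuNLP/Apertus-EstLLM-8B-1125"  # assumed repository id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # 8B parameters; bf16 keeps memory usage manageable
    device_map="auto",           # requires `accelerate`; drop for CPU-only loading
)

# Plain-text completion: the base model continues the prompt, it does not follow instructions.
prompt = "Eesti keel on"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For chat-style behavior, fine-tune the model on an instruction dataset first, as noted at the top of this card.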
## Limitations

In addition to the limitations of the original Apertus 8B model, this model has the following:

- Somewhat limited context size, since continued pre-training was performed with a sequence length of 4096 tokens.

## Citation

```
@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}},
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041},
}
```