---
library_name: transformers
license: llama3.1
language:
- et
- en
base_model:
- tartuNLP/Llama-3.1-EstLLM-8B-0525
pipeline_tag: text-generation
datasets:
- nvidia/HelpSteer3
- allenai/tulu-3-sft-mixture
- utter-project/EuroBlocks-SFT-Synthetic-1124
---

![image/png](assets/logo-sinine.png)

# Llama 3.1 EstLLM 8B 0825 Instruct

> This checkpoint is identical to [tartuNLP/llama-estllm-prototype-0825](https://huggingface.co/tartuNLP/llama-estllm-prototype-0825). The reupload is for naming consistency in the model tree.

`Llama-3.1-EstLLM-8B-Instruct-0825` is the first artifact produced by the EstLLM project. This release is intended to evaluate the first prototype in a conversational, Chatbot-Arena-style setting on [baromeeter.ai](https://baromeeter.ai), and thus establish a baseline for future improvements.

The model underwent continued pre-training starting from [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on approximately 35B tokens, which resulted in [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525); supervised fine-tuning and direct preference optimization were then applied.
## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto",
)
# To use on Apple silicon, load the model as follows instead:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)
# generated_ids includes the input tokens as well, so we decode only the new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)
print(response)
```

## Model Details

### Model Description

- **Developed by:** [TartuNLP](https://huggingface.co/tartuNLP) and [TalTechNLP](https://huggingface.co/TalTechNLP) research groups
- **Funded by:** Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- **Model type:** Causal Language Model, Instruction-following
- **Language(s) (NLP):** Estonian, English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525)

### Continued Pre-Training

Continued pre-training was performed for a single epoch on:

- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)

### Supervised Fine-Tuning

Approximately
764k examples were used for supervised fine-tuning. The examples come mainly from [the Tulu 3 SFT mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) and [EuroBlocks](https://huggingface.co/datasets/utter-project/EuroBlocks-SFT-Synthetic-1124). Additional data provided by the Institute of the Estonian Language (EKI) was also used. In total, about 80% of the examples are in English. More details TBA.

### Direct Preference Optimization

The English-only [HelpSteer3](https://huggingface.co/datasets/nvidia/HelpSteer3) dataset was used as is in the direct preference optimization step, as [previous research on the Poro 2 models](https://rocm.blogs.amd.com/artificial-intelligence/multilingual-continued-pretraining/README.html) showed no observable benefit from translating preference pairs.

## Evaluation

## Logits-based

Scores for logits-based evaluation benchmarks are available on the [EuroEval](https://euroeval.com/leaderboards/Monolingual/estonian/) leaderboard.

## Generative

Every benchmark in this category is treated as a *generative* problem: evaluation is performed on model responses generated with temperature 0, not on logits. Top scores are highlighted in **bold**; second-best scores in **_bold italic_**. Rows are sorted in descending order by model parameter count, not by score. The test set of each dataset is used for evaluation unless noted otherwise.

Note that _all models are evaluated with the same prompt template_ for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for `deepseek-ai/DeepSeek-V3-0324` on some of the benchmarks. Only models of comparable size are evaluated on the English benchmarks.

### Instruction-following

#### Estonian

Instruction-level strict accuracy is reported for IFEval-et.
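As a reference point for how the scores below are aggregated: instruction-level strict accuracy counts each verifiable instruction separately, so a prompt carrying several instructions contributes several data points. A minimal sketch of that aggregation (the per-instruction pass/fail flags are assumed to come from the benchmark's verifier functions, which are not reproduced here):

```python
def instruction_level_strict_accuracy(results: list[list[bool]]) -> float:
    """Aggregate per-instruction pass/fail flags across prompts.

    `results` holds one list per prompt; each bool says whether the
    response strictly satisfied one verifiable instruction.
    """
    flags = [passed for prompt_flags in results for passed in prompt_flags]
    return sum(flags) / len(flags) if flags else 0.0

# Example: two prompts, three instructions in total, two satisfied.
score = instruction_level_strict_accuracy([[True, False], [True]])
print(f"{score:.4f}")  # 0.6667
```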
| Model (# parameters ↓) | [IFEval-et](https://huggingface.co/datasets/tartuNLP/ifeval_et) |
|-------|-----------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | **_0.7705_** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |

#### English

Instruction-level strict accuracy is reported for IFEval-en.

| Model (# parameters ↓) | [IFEval-en](https://huggingface.co/datasets/tartuNLP/ifeval_en) |
|-------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| meta-llama/Llama-3.1-8B-Instruct | **0.8106** |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | **_0.7954_** |

### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode; Winogrande-et is evaluated in 3-shot mode. Exact-match accuracy is reported for every dataset.
#### Estonian Language Competence

| Model (# parameters ↓) | [Grammar-et](https://huggingface.co/datasets/TalTechNLP/grammar_et) | [Inflection-et](https://huggingface.co/datasets/TalTechNLP/inflection_et) | [Word-Meanings-et](https://huggingface.co/datasets/TalTechNLP/word_meanings_et) |
|-------|------|------|--------|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | **_0.8355_** | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | **_0.818_** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.692 | 0.5188 | **_0.9569_** |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |

#### Knowledge and Reasoning (Estonian)

| Model (# parameters ↓) | [Winogrande-et](https://huggingface.co/datasets/tartuNLP/winogrande_et) | [Trivia-et](https://huggingface.co/datasets/TalTechNLP/trivia_et) | [Exam-et](https://huggingface.co/datasets/TalTechNLP/exam_et) | [GlobalPIQA-et](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/ekk_latn) | [TruthfulQA-et](https://huggingface.co/datasets/LumiOpen/opengpt-x_truthfulqax/viewer/mc_ET) |
|-------|-----------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|-------------------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | **_0.8042_** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | **_0.4275_** | 0.7931 | **_0.73_** | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | **_0.8309_** | 0.58 | **_0.7001_** |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |

#### Knowledge and Reasoning (English)

| Model (# parameters ↓) | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | [GlobalPIQA-en](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/eng_latn) | [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) | [MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) |
|-------|-----------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | **_0.5239_** | 0.6959 | **_0.7710_** |
| mistralai/Ministral-3-8B-Instruct-2512 | **_0.6503_** | **_0.77_** | 0.519 | **_0.7418_** | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |

### Translation

#### English to Estonian

| Model | [wmt24pp](https://huggingface.co/datasets/google/wmt24pp) (BLEU ↑) |
|-------|---------|
| BSC-LT/salamandraTA-7b-instruct | 0.2713 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.264 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |

## Limitations

This is an early prototype. Accordingly, it has limitations *in addition* to the base Llama limitations:

- Relatively short context of 4096 tokens; the model is not expected to perform well beyond that length.
- Multi-turn conversations are not supported in this version.
- Trained with the original Llama 3.1 system prompt, which has a hard-coded date cut-off.
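Since multi-turn conversations are not supported in this version, a caller may want to trim the history down to the final user turn before applying the chat template. A minimal sketch under that assumption (the `messages` format matches the transformers example above; the helper name is illustrative, not part of any API):

```python
def to_single_turn(messages: list[dict]) -> list[dict]:
    """Keep an optional leading system message plus only the last user
    message, discarding earlier turns this model was not trained on."""
    system = [m for m in messages if m["role"] == "system"][:1]
    last_user = [m for m in messages if m["role"] == "user"][-1:]
    return system + last_user

history = [
    {"role": "user", "content": "Tere!"},
    {"role": "assistant", "content": "Tere! Kuidas saan aidata?"},
    {"role": "user", "content": "Kas sa räägid eesti keelt?"},
]
print(to_single_turn(history))
# [{'role': 'user', 'content': 'Kas sa räägid eesti keelt?'}]
```

The trimmed list can then be passed to `tokenizer.apply_chat_template` exactly as in the usage example above.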
## Citation

```
@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}},
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041},
}
```