---
library_name: transformers
license: llama3.1
language:
- et
- en
base_model:
- tartuNLP/Llama-3.1-EstLLM-8B-0525
pipeline_tag: text-generation
datasets:
- nvidia/HelpSteer3
- allenai/tulu-3-sft-mixture
- utter-project/EuroBlocks-SFT-Synthetic-1124
---

# Llama 3.1 EstLLM 8B 0825 Instruct

> This checkpoint is identical to [tartuNLP/llama-estllm-prototype-0825](https://huggingface.co/tartuNLP/llama-estllm-prototype-0825). The reupload is for naming consistency in the model tree.

`Llama-3.1-EstLLM-8B-Instruct-0825` is the first artifact produced by the EstLLM project. This release is intended to evaluate the first prototype in a conversational, ChatbotArena-style setting on [baromeeter.ai](https://baromeeter.ai), and thus to establish a baseline for future improvements.

The model underwent continued pre-training starting from [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on approximately 35B tokens, which resulted in [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525); supervised fine-tuning and direct preference optimization were then applied.

## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# To use on Apple Silicon, load the model the following way:
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )

tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify the eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)

# generated_ids includes the input tokens as well, so we only decode the new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

print(response)
```
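
For quick experiments, the same chat flow can also be driven through the high-level `pipeline` API. This is a minimal sketch, assuming a recent `transformers` version; the generation settings mirror the example above:

```python
from transformers import pipeline

# The text-generation pipeline accepts chat messages directly and
# applies the model's chat template internally.
pipe = pipeline(
    "text-generation",
    model="tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825",
    device_map="auto",
)

messages = [{"role": "user", "content": "Kas sa räägid eesti keelt?"}]
out = pipe(messages, max_new_tokens=128, do_sample=True, temperature=0.4)
print(out[0]["generated_text"][-1]["content"])  # the assistant's reply
```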

## Model Details

### Model Description

- **Developed by:** [TartuNLP](https://huggingface.co/tartuNLP) and [TalTechNLP](https://huggingface.co/TalTechNLP) research groups
- **Funded by:** Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- **Model type:** Causal Language Model, Instruction-following
- **Language(s) (NLP):** Estonian, English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525)

### Continued Pre-Training

Continued pre-training was performed for a single epoch on:
- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)

### Supervised Fine-Tuning

Approximately 764k examples were used for supervised fine-tuning. The examples mainly come from [the Tulu 3 SFT mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) and [EuroBlocks](https://huggingface.co/datasets/utter-project/EuroBlocks-SFT-Synthetic-1124). Additional data provided by the Institute of the Estonian Language (EKI) was also used. In total, about 80% of the examples are in English. More details TBA.

### Direct Preference Optimization

English-only [HelpSteer3](https://huggingface.co/datasets/nvidia/HelpSteer3) was used as-is in the direct preference optimization step, as [previous research on the Poro 2 models](https://rocm.blogs.amd.com/artificial-intelligence/multilingual-continued-pretraining/README.html) showed no observable benefit from translating preference pairs.

## Evaluation

### Logits-based

Scores for logits-based evaluation benchmarks are available on the [EuroEval](https://euroeval.com/leaderboards/Monolingual/estonian/) leaderboard.

### Generative

Every benchmark in this category is treated as a *generative* problem: evaluation is performed on model responses generated with temperature 0 rather than on logits.
The top scores are highlighted in **bold**; second-best scores are highlighted in **_bold italic_**. Rows are sorted in descending order by model parameter count, not by score.
Each dataset is evaluated on its test set unless noted otherwise.

Note that _all models are evaluated with the same prompt template_ for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for `deepseek-ai/DeepSeek-V3-0324` on some of the benchmarks.

Only models of comparable size are evaluated on the English benchmarks.

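For reference, a temperature-0 response can be reproduced with `transformers` as in the minimal sketch below, assuming `model`, `tokenizer`, and `model_inputs` are prepared as in the usage example above (the exact settings of the benchmark harness are not specified here):

```python
# do_sample=False picks the argmax token at every step, which is
# equivalent to temperature-0 (greedy) decoding.
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,  # illustrative budget, not the benchmark setting
    do_sample=False,
)
```
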
#### Instruction-following

##### Estonian

Instruction-level strict accuracy is reported for IFEval-et; a minimal sketch of the metric follows the results table.

| Model (# parameters ↓) | [IFEval-et](https://huggingface.co/datasets/tartuNLP/ifeval_et) |
|-------|-----------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | **_0.7705_** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |

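As mentioned above, this is a minimal sketch of instruction-level strict accuracy, assuming per-example lists of boolean pass/fail flags (one flag per verifiable instruction), in the spirit of the IFEval protocol:

```python
def instruction_level_strict_accuracy(per_example_flags: list[list[bool]]) -> float:
    """Fraction of individual instructions that were followed exactly.

    per_example_flags: for each response, one boolean per verifiable
    instruction, True iff the instruction was strictly satisfied.
    """
    total = sum(len(flags) for flags in per_example_flags)
    followed = sum(sum(flags) for flags in per_example_flags)
    return followed / total if total else 0.0

# Example: two responses with 2 and 3 verifiable instructions respectively.
print(instruction_level_strict_accuracy([[True, False], [True, True, False]]))  # 0.6
```
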
##### English

Instruction-level strict accuracy is reported for IFEval-en.

| Model (# parameters ↓) | [IFEval-en](https://huggingface.co/datasets/tartuNLP/ifeval_en) |
|-------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| meta-llama/Llama-3.1-8B-Instruct | **0.8106** |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | **_0.7954_** |

#### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode; Winogrande-et is evaluated in 3-shot mode. Exact-match accuracy is reported for every dataset.

##### Estonian Language Competence

| Model (# parameters ↓) | [Grammar-et](https://huggingface.co/datasets/TalTechNLP/grammar_et) | [Inflection-et](https://huggingface.co/datasets/TalTechNLP/inflection_et) | [Word-Meanings-et](https://huggingface.co/datasets/TalTechNLP/word_meanings_et) |
|-------|------|------|--------|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | **_0.8355_** | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | **_0.818_** | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.692 | 0.5188 | **_0.9569_** |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |

##### Knowledge and Reasoning (Estonian)

| Model (# parameters ↓) | [Winogrande-et](https://huggingface.co/datasets/tartuNLP/winogrande_et) | [Trivia-et](https://huggingface.co/datasets/TalTechNLP/trivia_et) | [Exam-et](https://huggingface.co/datasets/TalTechNLP/exam_et) | [GlobalPIQA-et](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/ekk_latn) | [TruthfulQA-et](https://huggingface.co/datasets/LumiOpen/opengpt-x_truthfulqax/viewer/mc_ET) |
|-------|---|---|---|---|---|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | **_0.8042_** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | **_0.4275_** | 0.7931 | **_0.73_** | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | **_0.8309_** | 0.58 | **_0.7001_** |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |

##### Knowledge and Reasoning (English)

| Model (# parameters ↓) | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | [GlobalPIQA-en](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/eng_latn) | [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) | [MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) |
|-------|---|---|---|---|---|
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | **_0.5239_** | 0.6959 | **_0.7710_** |
| mistralai/Ministral-3-8B-Instruct-2512 | **_0.6503_** | **_0.77_** | 0.519 | **_0.7418_** | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |

#### Translation

##### English to Estonian

| Model | [wmt24pp](https://huggingface.co/datasets/google/wmt24pp) (BLEU ↑) |
|-------|---------|
| BSC-LT/salamandraTA-7b-instruct | 0.2713 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825** | 0.264 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |

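The BLEU scores above are on a 0-1 scale. A comparable corpus-level score can be computed with, e.g., `sacrebleu` (a minimal sketch; the exact signature and preprocessing used for this table are not specified here):

```python
import sacrebleu  # pip install sacrebleu

# Hypotheses are the model's translations; references come from wmt24pp.
# The data below is a hypothetical two-sentence example.
hyps = ["See on lause.", "Mudel tõlgib teksti."]
refs = [["See on üks lause.", "Mudel tõlgib teksti."]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hyps, refs)
print(bleu.score / 100)  # sacrebleu reports 0-100; the table uses 0-1
```
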
## Limitations

This is an early prototype version. Accordingly, it has the following limitations *in addition* to the base Llama limitations:

- A relatively short context of 4096 tokens; the model is not expected to perform well beyond that (see the sketch after this list).
- Multi-turn conversations are not supported in this version.
- Trained with the original Llama 3.1 system prompt, which has a hard-coded date cut-off.

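A minimal sketch for keeping prompts within the supported window, assuming `model` and `tokenizer` are loaded as in the usage example above (the truncation strategy is an illustration, not part of the release):

```python
# Truncate the tokenized prompt to the 4096-token context window.
model_inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    max_length=4096,
).to(model.device)
```
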
## Citation

```
@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}},
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041},
}
```