---
library_name: transformers
license: llama3.1
language:
- et
- en
base_model:
- tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825
- meta-llama/Llama-3.1-8B-Instruct
tags:
- merge
pipeline_tag: text-generation
datasets:
- nvidia/HelpSteer3
- allenai/tulu-3-sft-mixture
- utter-project/EuroBlocks-SFT-Synthetic-1124
---

![image/png](assets/logo-sinine.png)

# Llama 3.1 EstLLM 8B 1125 Instruct

`Llama-3.1-EstLLM-8B-Instruct-1125` is obtained by applying the chat-vector merge approach to [tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825).

The 0825 model was itself built in stages: continued pre-training of [meta-llama/Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B) on approximately 35B tokens produced [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525), to which supervised fine-tuning and direct preference optimization were then applied.

## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# to use on apple silicon, load the following way
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )

tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]

# render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)

# generated_ids include the input tokens as well, so we only decode new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

print(response)
```

## Model Details

### Model Description

- **Developed by:** [TartuNLP](https://huggingface.co/tartuNLP) and [TalTechNLP](https://huggingface.co/TalTechNLP) research groups
- **Funded by:** Estonian Ministry of Education and Research, “Estonian Language Technology Program 2018-2027”
- **Model type:** Causal Language Model, Instruction-following
- **Language(s) (NLP):** Estonian, English
- **License:** Llama 3.1 Community License Agreement
- **Finetuned from model:** [tartuNLP/Llama-3.1-EstLLM-8B-0525](https://huggingface.co/tartuNLP/Llama-3.1-EstLLM-8B-0525)

### Continued Pre-Training

Continued Pre-Training was performed for a single epoch on:

- Estonian National Corpus (8.6B tokens)
- Python-Edu (3.3B tokens)
- FineMath4-Plus (9.5B tokens)
- General Instruction-Augmented Corpora (7.4B tokens)
- Cosmopedia v2 (6.9B tokens)

### Supervised Fine-Tuning

Approximately 764k examples were used for Supervised Fine-Tuning. The examples mainly come from [the Tulu 3 SFT mixture](https://huggingface.co/datasets/allenai/tulu-3-sft-mixture) and [EuroBlocks](https://huggingface.co/datasets/utter-project/EuroBlocks-SFT-Synthetic-1124). Additional data provided by the Institute of Estonian Language (EKI) was also used. In total, about 80% of the examples are in English.

More details TBA.

### Direct Preference Optimization

English-only [HelpSteer3](https://huggingface.co/datasets/nvidia/HelpSteer3) was used as is in the Direct Preference Optimization step, as [previous research on Poro 2 models](https://rocm.blogs.amd.com/artificial-intelligence/multilingual-continued-pretraining/README.html) showed that there's no observable benefit from translating preference pairs.

## Evaluation

## Logits-based

Scores for logits-based evaluation benchmarks are available on the [EuroEval](https://euroeval.com/leaderboards/Monolingual/estonian/) leaderboard.

## Generative

Every benchmark in this category is treated as a *generative* problem, and thus the evaluation is performed on the model responses obtained with temperature 0 (not on logits). The top scores are highlighted with **bold**. Second best scores are highlighted with **_italic bold_**. Rows are sorted in descending order based on the number of parameters of models (not scores). The arrow up symbol (↑) next to the score indicates an improvement compared to the previous version of the model (`Llama-3.1-EstLLM-8B-Instruct-0825`). The test set is used for evaluation of each dataset unless noted otherwise.

Note that _all models are evaluated with the same prompt template_ for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for `deepseek-ai/DeepSeek-V3-0324` on some of the benchmarks. Only models of comparable size are evaluated on benchmarks in English.

### Instruction-following

#### Estonian

Instruction-level strict accuracy is reported for IFEval-et.
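
As a rough illustration of what "instruction-level" means here, the sketch below contrasts it with prompt-level accuracy. It assumes the usual IFEval convention in which each prompt carries one or more verifiable instructions and "strict" means the response is checked as produced, without any relaxed post-processing. The function names and the toy data are hypothetical and are not taken from the evaluation code behind the tables.

```python
from typing import List


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Share of individual instructions followed, pooled over all prompts."""
    flat = [followed for prompt in results for followed in prompt]
    return sum(flat) / len(flat)


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Share of prompts in which every instruction was followed."""
    return sum(all(prompt) for prompt in results) / len(results)


# Hypothetical checker outputs: one inner list per prompt,
# one boolean per instruction attached to that prompt.
toy_results = [[True, True], [True, False, False], [True]]

print(instruction_level_accuracy(toy_results))  # 4 of 6 instructions followed
print(prompt_level_accuracy(toy_results))       # 2 of 3 prompts fully satisfied
```

Instruction-level accuracy gives partial credit for prompts where only some instructions are satisfied, so it is typically at least as high as the prompt-level figure.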

| Model (# parameters ↓) | [IFEval-et](https://huggingface.co/datasets/tartuNLP/ifeval_et) |
|-------|-----------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | **_0.7705_** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5571 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | 0.6141 ↑ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
| CohereLabs/tiny-aya-global | 0.6687 |

#### English

Instruction-level strict accuracy is reported for IFEval-en.

| Model (# parameters ↓) | [IFEval-en](https://huggingface.co/datasets/tartuNLP/ifeval_en) |
|-------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct-2512 | 0.7564 |
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| meta-llama/Llama-3.1-8B-Instruct | _**0.8106**_ |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | **0.8173 ↑** |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | 0.7954 |

### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode. Winogrande-et is evaluated in 3-shot mode. Exact match accuracy is reported for every dataset.
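
The metric can be sketched as follows. This is a minimal illustration, assuming that each model response is compared with a single reference answer after trimming whitespace and lowercasing. That normalization, the function names, and the toy answers are assumptions made for the example and may differ from the evaluation code actually used.

```python
from typing import Sequence


def normalize(answer: str) -> str:
    """Assumed normalization: trim surrounding whitespace and lowercase."""
    return answer.strip().lower()


def exact_match_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of predictions identical to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    matches = sum(
        normalize(pred) == normalize(ref)
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)


# Hypothetical generated answers and gold option labels.
toy_predictions = ["A", "b ", "C", "The answer is D"]
toy_references = ["A", "B", "D", "D"]

print(exact_match_accuracy(toy_predictions, toy_references))
```

In this sketch a verbose response such as "The answer is D" counts as a miss even though it names the right option, which shows how sensitive an exact-match score can be to the output format a model chooses under a shared prompt template.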

#### Estonian Language Competence

| Model (# parameters ↓) | [Grammar-et](https://huggingface.co/datasets/TalTechNLP/grammar_et) | [Inflection-et](https://huggingface.co/datasets/TalTechNLP/inflection_et) | [Word-Meanings-et](https://huggingface.co/datasets/TalTechNLP/word_meanings_et) |
|-------|------|------|--------|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | _**0.8355**_ | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | 0.818 | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.644 | 0.4466 | 0.9288 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | _**0.8310 ↑**_ | 0.5777 ↑ | _**0.9619 ↑**_ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.692 | 0.5188 | 0.9569 |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |
| CohereLabs/tiny-aya-global | 0.563 | 0.3221 | 0.8455 |

#### Knowledge and Reasoning (Estonian)

| Model (# parameters ↓) | [Winogrande-et](https://huggingface.co/datasets/tartuNLP/winogrande_et) | [Trivia-et](https://huggingface.co/datasets/TalTechNLP/trivia_et) | [Exam-et](https://huggingface.co/datasets/TalTechNLP/exam_et) | [GlobalPIQA-et](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/ekk_latn) | [TruthfulQA-et](https://huggingface.co/datasets/LumiOpen/opengpt-x_truthfulqax/viewer/mc_ET) |
|-------|------|------|------|------|------|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | **_0.8042_** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | 0.4275 | 0.7931 | _**0.73**_ | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | _**0.8309**_ | 0.58 | _**0.7001**_ |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5195 | 0.375 | 0.6097 | 0.52 | 0.399 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | 0.6440 ↑ | _**0.4288 ↑**_ | 0.6332 ↑ | 0.68 ↑ | 0.3794 ↑ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |
| CohereLabs/tiny-aya-global | 0.5603 | 0.31 | 0.5638 | 0.52 | 0.3782 |

#### Knowledge and Reasoning (English)

| Model (# parameters ↓) | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | [GlobalPIQA-en](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/eng_latn) | [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) | [MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) |
|-------|------|------|------|------|------|
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5546 | 0.58 | 0.4614 | 0.6334 | 0.4139 |
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| mistralai/Ministral-3-8B-Instruct-2512 | _**0.6503**_ | _**0.77**_ | 0.519 | _**0.7418**_ | 0.3927 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | _**0.5239**_ | 0.6959 | 0.7710 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | 0.6118 ↑ | 0.76 ↑ | 0.3635 | 0.6606 ↑ | _**0.7726 ↑**_ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |

### Translation

#### English to Estonian

| Model | [wmt24pp](https://huggingface.co/datasets/google/wmt24pp) (BLEU ↑) |
|-------|---------|
| BSC-LT/salamandraTA-7b-instruct | 0.2713 |
| **tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125** | 0.2635 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.264 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.2567 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |

## Limitations

This is an early prototype version. Accordingly, it has limitations *in addition* to the base Llama limitations:

- Relatively short context of 4096 tokens. It's not expected to perform well on context sizes beyond that. Merging somewhat mitigates that.
- Multi-turn conversations are not guaranteed to be supported, although this capability is improved by merging.
- Trained with the original Llama 3.1 system prompt that has a hard-coded date cut-off.
- There may be unexpected side effects as a result of merging.

## Citation

```
@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}},
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041},
}
```