---
library_name: transformers
language:
- et
- en
base_model:
- tartuNLP/Apertus-EstLLM-8B-Instruct-1125
- swiss-ai/Apertus-8B-Instruct-2509
tags:
- merge
license: apache-2.0
---

![image/png](assets/logo-sinine.png)

# Apertus EstLLM 8B 0326 Instruct

`Apertus-EstLLM-8B-Instruct-0326` is obtained by applying the chat-vector merge approach to [tartuNLP/Apertus-EstLLM-8B-Instruct-1125](https://huggingface.co/tartuNLP/Apertus-EstLLM-8B-Instruct-1125).

## Use with transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "tartuNLP/Apertus-EstLLM-8B-Instruct-0326"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="auto",
    device_map="auto"
)

# to use on apple silicon, load the following way
# model = AutoModelForCausalLM.from_pretrained(
#     model_name,
#     dtype=torch.float16,
#     device_map="mps",
# )

tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [
    {"role": "user", "content": "Kas sa räägid eesti keelt?"}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

model_inputs = tokenizer(text, return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.4,
    # specify eos token to stop at the end of the assistant response
    eos_token_id=tokenizer.eos_token_id,
)

# generated_ids include the input tokens as well, so we only decode new tokens
response = tokenizer.decode(
    generated_ids[0][model_inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)

print(response)
```

## Evaluation

## Logits-based

Scores for logits-based evaluation benchmarks are available on the [EuroEval](https://euroeval.com/leaderboards/Monolingual/estonian/) leaderboard.

## Generative

Every benchmark in this category is treated as a *generative* problem, and thus the evaluation is performed on the model responses obtained with 0 temperature (not logits). The top scores are highlighted with **bold**.
Second-best scores are highlighted with **_italic bold_**. Rows are sorted in descending order by the number of model parameters (not by score). The test set is used for evaluation of each dataset unless noted otherwise.

Note that _all models are evaluated with the same prompt template_ for comparability, meaning that the scores do not necessarily represent each model's best possible performance. This is especially the case for `deepseek-ai/DeepSeek-V3-0324` on some of the benchmarks.

Only models of comparable size are evaluated on benchmarks in English.

### Instruction-following

#### Estonian

Instruction-level strict accuracy is reported for IFEval-et.

| Model (# parameters ↓) | [IFEval-et](https://huggingface.co/datasets/tartuNLP/ifeval_et) |
|-------|-----------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.7891** |
| deepseek-ai/DeepSeek-V3.2 | 0.7221 |
| deepseek-ai/DeepSeek-V3-0324 | 0.7171 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7097 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7159 |
| meta-llama/Llama-3.3-70B-Instruct | **_0.7705_** |
| Qwen/Qwen2.5-72B-Instruct | 0.7407 |
| google/gemma-3-27b-it | 0.7655 |
| google/gemma-3-12b-it | 0.7556 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5571 |
| utter-project/EuroLLM-9B-Instruct | 0.5397 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.4888 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.5608 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.4665 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5484 |
| meta-llama/Llama-3.1-8B-Instruct | 0.3797 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | 0.6141 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5174 |
| BSC-LT/salamandra-7b-instruct | 0.5195 |
| tartuNLP/Llammas | 0.3524 |
| Qwen/Qwen2.5-7B-Instruct | 0.4988 |
| CohereLabs/tiny-aya-global | 0.6687 |

#### English

Instruction-level strict accuracy is reported for IFEval-en.
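As a rough illustration of the metric used in both IFEval tables (a sketch, not the evaluation code behind these scores): instruction-level accuracy pools every individual instruction across all prompts, whereas prompt-level accuracy counts a prompt only if all of its instructions are satisfied. "Strict" means each response is checked as-is, without any relaxed post-processing. The function names and the `checks` data below are hypothetical and chosen only for illustration.

```python
from typing import List


def instruction_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of individual instructions followed, pooled over all prompts."""
    flat = [ok for prompt in results for ok in prompt]
    return sum(flat) / len(flat) if flat else 0.0


def prompt_level_accuracy(results: List[List[bool]]) -> float:
    """Fraction of prompts for which every instruction was followed."""
    if not results:
        return 0.0
    return sum(all(prompt) for prompt in results) / len(results)


# Hypothetical per-prompt check results: one inner list per prompt,
# one boolean per instruction attached to that prompt.
checks = [[True, True], [True, False, True], [False]]

print(instruction_level_accuracy(checks))  # 4 of 6 instructions followed
print(prompt_level_accuracy(checks))       # 1 of 3 prompts fully satisfied
```

Because a single missed instruction does not zero out the whole prompt, the instruction-level figure is never lower than the prompt-level one on the same outputs.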

| Model (# parameters ↓) | [IFEval-en](https://huggingface.co/datasets/tartuNLP/ifeval_en) |
|-------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct-2512 | 0.7564 |
| utter-project/EuroLLM-9B-Instruct | 0.7004 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.6845 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.7089 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.6638 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.7808 |
| meta-llama/Llama-3.1-8B-Instruct | _**0.8106**_ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | **0.8173** |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.7527 |
| tartuNLP/Llammas | 0.4373 |
| BSC-LT/salamandra-7b-instruct | 0.3289 |
| Qwen/Qwen2.5-7B-Instruct | 0.7954 |

### Multiple Choice

All datasets except Winogrande-et are evaluated in 0-shot mode. Winogrande-et is evaluated in 3-shot mode. Exact match accuracy is reported for every dataset.

#### Estonian Language Competence

| Model (# parameters ↓) | [Grammar-et](https://huggingface.co/datasets/TalTechNLP/grammar_et) | [Inflection-et](https://huggingface.co/datasets/TalTechNLP/inflection_et) | [Word-Meanings-et](https://huggingface.co/datasets/TalTechNLP/word_meanings_et) |
|-------|------|------|--------|
| moonshotai/Kimi-K2-Instruct | **0.916** | 0.6458 | **0.9689** |
| deepseek-ai/DeepSeek-V3.2 | 0.781 | 0.6891 | 0.8134 |
| deepseek-ai/DeepSeek-V3-0324 | 0.364 | 0 | 0 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.796 | _**0.8355**_ | 0.9488 |
| meta-llama/Llama-3.1-405B-Instruct | 0.818 | **0.9089** | 0.9438 |
| meta-llama/Llama-3.3-70B-Instruct | 0.797 | 0.6421 | 0.9408 |
| Qwen/Qwen2.5-72B-Instruct | 0.694 | 0.5208 | 0.9057 |
| google/gemma-3-27b-it | 0.817 | 0.5934 | 0.9529 |
| google/gemma-3-12b-it | 0.789 | 0.4227 | 0.9318 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.644 | 0.4466 | 0.9288 |
| utter-project/EuroLLM-9B-Instruct | 0.764 | 0.367 | 0.9258 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.562 | 0.4833 | 0.8395 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.713 | 0.4326 | 0.9438 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.646 | 0.421 | 0.9178 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.512 | 0.3662 | 0.9027 |
| meta-llama/Llama-3.1-8B-Instruct | 0.657 | 0.4165 | 0.8335 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | _**0.8310**_ | 0.5777 | _**0.9619**_ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.692 | 0.5188 | 0.9569 |
| BSC-LT/salamandra-7b-instruct | 0.594 | 0.2668 | 0.8084 |
| Qwen/Qwen2.5-7B-Instruct | 0.598 | 0.4136 | 0.7984 |
| tartuNLP/Llammas | 0.529 | 0.2289 | 0.5326 |
| CohereLabs/tiny-aya-global | 0.563 | 0.3221 | 0.8455 |

#### Knowledge and Reasoning (Estonian)

| Model (# parameters ↓) | [Winogrande-et](https://huggingface.co/datasets/tartuNLP/winogrande_et) | [Trivia-et](https://huggingface.co/datasets/TalTechNLP/trivia_et) | [Exam-et](https://huggingface.co/datasets/TalTechNLP/exam_et) | [GlobalPIQA-et](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/ekk_latn) | [TruthfulQA-et](https://huggingface.co/datasets/LumiOpen/opengpt-x_truthfulqax/viewer/mc_ET) |
|-------|-----------------------------------|---------------------------------------------|---------------------------------------------|---------------------------------------------|-------------------------------------------|
| moonshotai/Kimi-K2-Instruct | **0.8138** | 0.4225 | **0.8414** | **0.79** | **0.7136** |
| deepseek-ai/DeepSeek-V3.2 | 0.4805 | 0.38 | 0.614 | 0.7 | 0.5863 |
| deepseek-ai/DeepSeek-V3-0324 | **_0.8042_** | 0.27 | 0.1221 | 0.04 | 0.2093 |
| mistralai/Mistral-Large-3-675B-Instruct-2512 | 0.7487 | 0.4275 | 0.7931 | _**0.73**_ | 0.6854 |
| meta-llama/Llama-3.1-405B-Instruct | 0.7878 | **0.4713** | _**0.8309**_ | 0.58 | _**0.7001**_ |
| meta-llama/Llama-3.3-70B-Instruct | 0.7397 | 0.3875 | 0.7652 | 0.58 | 0.6255 |
| Qwen/Qwen2.5-72B-Instruct | 0.7227 | 0.315 | 0.7162 | 0.65 | 0.6683 |
| google/gemma-3-27b-it | 0.7510 | 0.325 | 0.7751 | 0.71 | 0.5814 |
| google/gemma-3-12b-it | 0.6712 | 0.3237 | 0.7069 | 0.54 | 0.3158 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5195 | 0.375 | 0.6097 | 0.52 | 0.399 |
| utter-project/EuroLLM-9B-Instruct | 0.5846 | 0.3738 | 0.5589 | 0.55 | 0.2889 |
| mistralai/Ministral-3-8B-Instruct-2512 | 0.5812 | 0.3125 | 0.5012 | 0.48 | 0.3525 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.5976 | 0.35 | 0.6022 | 0.64 | 0.4296 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5467 | 0.3575 | 0.5651 | 0.63 | 0.3696 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5105 | 0.345 | 0.552 | 0.59 | 0.366 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5399 | 0.2888 | 0.5 | 0.54 | 0.437 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | 0.6440 | _**0.4288**_ | 0.6332 | 0.68 | 0.3794 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.5812 | 0.425 | 0.5093 | 0.63 | 0.3525 |
| BSC-LT/salamandra-7b-instruct | 0.2878 | 0.2875 | 0.3556 | 0.55 | 0.3011 |
| Qwen/Qwen2.5-7B-Instruct | 0.5473 | 0.2938 | 0.4913 | 0.57 | 0.4113 |
| tartuNLP/Llammas | 0.5037 | 0.2838 | 0.3649 | 0.01 | 0.2032 |
| CohereLabs/tiny-aya-global | 0.5603 | 0.31 | 0.5638 | 0.52 | 0.3782 |

#### Knowledge and Reasoning (English)

| Model (# parameters ↓) | [Winogrande](https://huggingface.co/datasets/allenai/winogrande) | [GlobalPIQA-en](https://huggingface.co/datasets/mrlbenchmarks/global-piqa-nonparallel/viewer/eng_latn) | [TruthfulQA](https://huggingface.co/datasets/truthfulqa/truthful_qa) | [MMLU-Redux](https://huggingface.co/datasets/edinburgh-dawg/mmlu-redux-2.0) | [GSM8K](https://huggingface.co/datasets/openai/gsm8k) |
|-------|-----------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|-----------------------------------|
| utter-project/EuroLLM-9B-Instruct-2512 | 0.5546 | 0.58 | 0.4614 | 0.6334 | 0.4139 |
| utter-project/EuroLLM-9B-Instruct | 0.5059 | 0.58 | 0.2962 | 0.5741 | 0.5944 |
| mistralai/Ministral-3-8B-Instruct-2512 | _**0.6503**_ | _**0.77**_ | 0.519 | _**0.7418**_ | 0.3927 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.5699 | 0.69 | 0.4174 | 0.5946 | 0.5588 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.5348 | 0.56 | 0.3647 | 0.5944 | 0.5277 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.5133 | 0.73 | 0.3831 | 0.6099 | 0.5936 |
| meta-llama/Llama-3.1-8B-Instruct | 0.5625 | 0.76 | _**0.5239**_ | 0.6959 | 0.7710 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | 0.6118 | 0.76 | 0.3635 | 0.6606 | _**0.7726**_ |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.6084 | 0.71 | 0.366 | 0.6388 | 0.7202 |
| tartuNLP/Llammas | 0.498 | 0 | 0.1971 | 0.3417 | 0.1456 |
| BSC-LT/salamandra-7b-instruct | 0.4029 | 0.63 | 0.2717 | 0.5180 | 0.0076 |
| Qwen/Qwen2.5-7B-Instruct | **0.6627** | **0.83** | **0.5875** | **0.7555** | **0.7862** |

### Translation

#### English to Estonian

| Model | [wmt24pp](https://huggingface.co/datasets/google/wmt24pp) (BLEU ↑) |
|-------|---------|
| BSC-LT/salamandraTA-7b-instruct | 0.2713 |
| **tartuNLP/Apertus-EstLLM-8B-Instruct-0326** | 0.2676 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-1125 | 0.2635 |
| tartuNLP/Llama-3.1-EstLLM-8B-Instruct-0825 | 0.264 |
| tartuNLP/Apertus-EstLLM-8B-Instruct-1125 | 0.2609 |
| utter-project/EuroLLM-9B-Instruct | 0.2602 |
| utter-project/EuroLLM-9B-Instruct-2512 | 0.2567 |
| swiss-ai/Apertus-8B-Instruct-2509 | 0.2372 |
| tartuNLP/Llammas | 0.1472 |
| meta-llama/Llama-3.1-8B-Instruct | 0.1406 |
| BSC-LT/salamandra-7b-instruct | 0.1201 |
| Qwen/Qwen2.5-7B-Instruct | 0.0476 |

## Citation

```
@misc{dorkin2026estllmenhancingestoniancapabilities,
      title={{EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training}},
      author={Aleksei Dorkin and Taido Purason and Emil Kalbaliyev and Hele-Andra Kuulmets and Marii Ojastu and Mark Fišel and Tanel Alumäe and Eleri Aedmaa and Krister Kruusmaa and Kairit Sirts},
      year={2026},
      eprint={2603.02041},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2603.02041},
}