---
license: llama3.1
language:
- el
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation-inference
---
🚨 **THIS IS A BASE MODEL. CONSIDER USING [Krikri 8B Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-Instruct) FOR CHAT APPLICATIONS** 🚨

# Llama-Krikri-8B-Base: A Large Foundation Language Model for the Greek Language

Following the release of [Meltemi-7B](https://huggingface.co/ilsp/Meltemi-7B-v1) on 26 March 2024, we are happy to welcome Krikri to the family of ILSP open Greek LLMs.
Krikri is built on top of [Llama-3.1-8B](https://huggingface.co/meta-llama/Llama-3.1-8B), extending its capabilities for Greek through continual pretraining on a large corpus of high-quality and locally relevant Greek texts. We present Llama-Krikri-8B-Base, as well as an instruct version, [Llama-Krikri-8B-Instruct](https://huggingface.co/ilsp/Llama-Krikri-8B-instruct).

![image/png](llama-krikri-image.jpg)
# Model Information

- Vocabulary extension of the Llama-3.1 tokenizer with Greek tokens (see the tokenizer sketch below)
- 128k context length (**approximately 80,000 Greek words**)
- We extend the pretraining of Llama-3.1-8B with added proficiency for the Greek language by utilizing a large training corpus.
  * This corpus includes 56.7 billion monolingual Greek tokens, constructed from publicly available resources.
  * Additionally, to mitigate catastrophic forgetting and ensure that the model has bilingual capabilities, we use additional sub-corpora with monolingual English texts (21 billion tokens) and Greek-English parallel data (5.5 billion tokens).
  * The training corpus also contains 7.8 billion math and code tokens.
  * This corpus has been processed, filtered, and deduplicated to ensure data quality. It is outlined below:

| Sub-corpus | # Tokens | Percentage |
|------------|----------|------------|
| Greek      | 56.7 B   | 62.3%      |
| English    | 21.0 B   | 23.1%      |
| Parallel   | 5.5 B    | 6.0%       |
| Math/Code  | 7.8 B    | 8.6%       |
| **Total**  | 91.0 B   | **100%**   |

Chosen subsets of this 91-billion-token corpus were upsampled, resulting in a final size of **110 billion tokens**.
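
To illustrate the effect of the vocabulary extension, here is a minimal sketch comparing how the original Llama-3.1 tokenizer and the extended Krikri tokenizer segment the same Greek sentence. The sentence is only an illustrative example, and loading `meta-llama/Llama-3.1-8B` requires access to that gated repository:

```python
from transformers import AutoTokenizer

# English: "The Acropolis of Athens is one of the most important monuments in the world."
text = "Η Ακρόπολη των Αθηνών είναι ένα από τα σημαντικότερα μνημεία του κόσμου."

base = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")      # original vocabulary
krikri = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")  # extended with Greek tokens

# The extended vocabulary should segment Greek text into noticeably fewer tokens,
# which is why a 128k-token context fits roughly 80,000 Greek words.
print("Llama-3.1 tokens:", len(base.tokenize(text)))
print("Krikri tokens:   ", len(krikri.tokenize(text)))
```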
# How to use

## With Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"

model = AutoModelForCausalLM.from_pretrained("ilsp/Llama-Krikri-8B-Base")
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")

model.to(device)

# Prompt: "A kri-kri differs from a llama because"
input_text = tokenizer("Ένα κρικρί διαφέρει από ένα λάμα επειδή", return_tensors='pt').to(device)
outputs = model.generate(input_text['input_ids'], max_new_tokens=256, do_sample=True)

print(tokenizer.batch_decode(outputs)[0])
```

## With an OpenAI-compatible server via vLLM

```bash
vllm serve ilsp/Llama-Krikri-8B-Base \
  --enforce-eager \
  --dtype 'bfloat16' \
  --api-key token-abc123
```

The model can then be queried from Python:

```python
from openai import OpenAI

api_key = "token-abc123"
base_url = "http://localhost:8000/v1"

# Point the OpenAI client at the local vLLM server.
client = OpenAI(
    api_key=api_key,
    base_url=base_url,
)

response = client.completions.create(
    model="ilsp/Llama-Krikri-8B-Base",
    # "The training of large language models involves"
    prompt="Η εκπαίδευση μεγάλων γλωσσικών μοντέλων περιλαμβάνει",
    max_tokens=256,  # the OpenAI completions default is very short
)
print(response.choices[0].text)
```
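
Alternatively, here is a quick sanity check of the same endpoint from the shell, assuming the server and API key from the command above; the prompt ("Greece is") is only an example:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer token-abc123" \
  -d '{
    "model": "ilsp/Llama-Krikri-8B-Base",
    "prompt": "Η Ελλάδα είναι",
    "max_tokens": 64
  }'
```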

# Evaluation

Below, we report the improvements of Llama-Krikri-8B-Base over Llama-3.1-8B for Greek and English:
- **+10.8%** on Greek benchmarks
- **+0.8%** on English benchmarks

Our evaluations of Llama-Krikri-8B-Base, Llama-3.1-8B, and Meltemi 7B v1.5 were performed in a few-shot setting, consistent with the settings of the [Open LLM leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard).

## Greek Benchmarks

The evaluation suite we created for the Greek language includes 6 test sets. You can run the suite by cloning this [lighteval fork](https://github.com/LeonVouk/lighteval); a sketch of an invocation follows the list below.

Our evaluation suite includes:
* Four machine-translated versions ([ARC Greek](https://huggingface.co/datasets/ilsp/arc_greek), [Truthful QA Greek](https://huggingface.co/datasets/ilsp/truthful_qa_greek), [HellaSwag Greek](https://huggingface.co/datasets/ilsp/hellaswag_greek), [MMLU Greek](https://huggingface.co/datasets/ilsp/mmlu_greek)) of established English benchmarks for language understanding and reasoning ([ARC Challenge](https://arxiv.org/abs/1803.05457), [Truthful QA](https://arxiv.org/abs/2109.07958), [HellaSwag](https://arxiv.org/abs/1905.07830), [MMLU](https://arxiv.org/abs/2009.03300)).
* An existing benchmark for question answering in Greek ([Belebele](https://arxiv.org/abs/2308.16884)).
* A novel benchmark created by the ILSP team for medical question answering, based on the medical exams of [DOATAP](https://www.doatap.gr) ([Medical MCQA](https://huggingface.co/datasets/ilsp/medical_mcqa_greek)).
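
As promised above, here is a rough sketch of what running the suite could look like. This is not a verified command: the task specification and flags below are placeholders, so consult the fork's README for the actual task names and CLI interface:

```bash
# Sketch only: the task spec and flags are illustrative placeholders.
git clone https://github.com/LeonVouk/lighteval
cd lighteval
pip install -e .

# Hypothetical task specification ("suite|task|num_fewshot|truncate_fewshot");
# the Greek task names are defined inside the fork.
lighteval accelerate \
    --model_args "pretrained=ilsp/Llama-Krikri-8B-Base" \
    --tasks "community|mmlu_greek|5|0" \
    --output_dir ./evals
```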

We can see that our continual pretraining methodology enhances performance across all Greek test sets, with an average improvement of **+10.8%** over the base model. The results for the Greek test sets are shown in the following table:

| | Medical MCQA EL (15-shot) | Belebele EL (5-shot) | HellaSwag EL (10-shot) | ARC-Challenge EL (25-shot) | TruthfulQA MC2 EL (0-shot) | MMLU EL (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 42.2% | 61.0% | 53.8% | 40.0% | 49.0% | 41.2% | 47.9% |
| Llama-3.1-8B | 33.4% | 72.8% | 52.1% | 39.9% | 51.1% | 42.6% | 48.7% |
| Llama-Krikri-8B | **53.8%** | **82.7%** | **64.6%** | **49.4%** | **54.2%** | **52.0%** | **59.5%** |
## English Benchmarks

We can also see that our training methodology not only mitigates catastrophic forgetting effectively, but also improves the average performance across the English test sets by **+0.8%**. The results for the English test sets are shown in the following table:

| | Winogrande (5-shot) | Belebele (5-shot) | HellaSwag (10-shot) | ARC-Challenge (25-shot) | TruthfulQA MC2 (0-shot) | MMLU (5-shot) | Average |
|----------------|----------------|-------------|--------------|------------------|-------------------|---------|---------|
| Meltemi 7B v1.5 | 73.4% | 77.7% | 79.6% | 54.1% | 40.5% | 56.9% | 63.7% |
| Llama-3.1-8B | **74.6%** | 71.5% | **82.0%** | **58.5%** | 44.2% | **66.2%** | 66.2% |
| Llama-Krikri-8B | 72.6% | **79.8%** | 80.7% | 57.8% | **44.8%** | 65.1% | **67.0%** |

Please note that all evaluations were run with the latest version of lighteval, which differs in some respects from past versions. This is why we report different scores for Meltemi-7B-v1.5.

# Ethical Considerations

This model has not been aligned with human preferences and may therefore generate misleading, harmful, or toxic content.

# Acknowledgements

The ILSP team utilized Amazon's cloud computing services, which were made available via GRNET under the [OCRE Cloud framework](https://www.ocre-project.eu/), providing Amazon Web Services for the Greek Academic and Research Community.

# Citation

```
@misc{roussis2025krikriadvancingopenlarge,
      title={Krikri: Advancing Open Large Language Models for Greek},
      author={Dimitris Roussis and Leon Voukoutis and Georgios Paraskevopoulos and Sokratis Sofianopoulos and Prokopis Prokopidis and Vassilis Papavasileiou and Athanasios Katsamanis and Stelios Piperidis and Vassilis Katsouros},
      year={2025},
      eprint={2505.13772},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.13772},
}
```