---
base_model: google/gemma-2-2b
datasets:
- orai-nlp/ZelaiHandi
- HuggingFaceFW/fineweb
language:
- eu
library_name: transformers
pipeline_tag: text-generation
license: gemma
---
Gemma-Kimu-2b v1.0 is a **continually pre-trained large language model (LLM)** for the **Basque language**, built upon Google's Gemma-2-2b foundation model.
This model focuses solely on **language adaptation**, without instruction-following alignment, serving as the **base model** for subsequent instruct-tuned versions such as [Gemma-Kimu-2b-it](https://huggingface.co/orai-nlp/Gemma-Kimu-2b-it).
We continually pre-train the Gemma-2-2b model on a combination of **Basque monolingual data** and **English replay** to improve Basque linguistic capabilities while preserving English performance.
This phase enhances the model's **syntactic, lexical, and morphological competence** in Basque and establishes a solid foundation for downstream instruction-tuned and task-specific models.
Evaluations show that Gemma-Kimu-2b exhibits significant improvements over the original Gemma-2-2b in Basque language understanding, coherence, and text generation fluency.
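
The following is a minimal usage sketch, not an official snippet from the authors. It assumes the model is published on the Hugging Face Hub under the id `orai-nlp/Gemma-Kimu-2b-base` and loads with the standard `transformers` causal-LM classes, as the `library_name` and `pipeline_tag` metadata suggest. The `generate` helper and its default `max_new_tokens` are illustrative names and values. Because this is a base model without instruction tuning, the prompt is passed as plain text to be continued, with no chat template.

```python
# Assumed Hub id (taken from the repository name, not confirmed by the card text).
MODEL_ID = "orai-nlp/Gemma-Kimu-2b-base"


def generate(prompt: str, max_new_tokens: int = 64) -> str:
    """Continue `prompt` with the base model and return the decoded text."""
    # Imported lazily so this module can be imported without loading transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Plain-text prompt: a base model continues text rather than following instructions.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `generate("Kaixo, ni")` would download the weights on first use and return the prompt followed by the model's continuation.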
# Training Data
For continual pre-training, we leveraged a combination of Basque and English data to enhance linguistic performance in Basque while maintaining general English capabilities. The goal is to improve cross-lingual transfer by retaining the model's proficiency in English.
- [ZelaiHandi dataset](https://huggingface.co/datasets/orai-nlp/ZelaiHandi) (San Vicente et al., 2024): ZelaiHandi is the largest collection of freely licensed, high-quality Basque texts gathered from selected web sources. It comprises approximately 521 million words, corresponding to about 1.5 billion tokens (Llama 3.1 tokenizer).
- [FineWeb dataset](https://huggingface.co/datasets/HuggingFaceFW/fineweb) (Penedo et al., 2024): FineWeb consists of more than 15T tokens of cleaned and deduplicated English web data from CommonCrawl. We selected a random subset of around 300 million tokens (Llama 3.1 tokenizer).
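
As a rough illustration of the resulting mix, the sketch below computes the share of each language from the approximate token counts quoted above. It assumes each corpus is seen exactly once during training, which the card does not state, so the proportions are indicative only. The function and variable names are hypothetical.

```python
# Approximate figures quoted in this card (Llama 3.1 tokenizer counts).
BASQUE_WORDS = 521_000_000
BASQUE_TOKENS = 1_500_000_000
ENGLISH_TOKENS = 300_000_000


def data_mix(basque_tokens: int, english_tokens: int) -> dict:
    """Return the total token count and per-language shares of the mix."""
    total = basque_tokens + english_tokens
    return {
        "total_tokens": total,
        "basque_share": basque_tokens / total,
        "english_share": english_tokens / total,
    }


# Assumption: one pass over each corpus (not stated in the card).
mix = data_mix(BASQUE_TOKENS, ENGLISH_TOKENS)
tokens_per_word = BASQUE_TOKENS / BASQUE_WORDS

print(f"Total tokens:    {mix['total_tokens']:,}")
print(f"Basque share:    {mix['basque_share']:.1%}")
print(f"English replay:  {mix['english_share']:.1%}")
print(f"Tokens per Basque word: {tokens_per_word:.2f}")
```

Under that assumption, English replay makes up roughly one sixth of the training tokens, and Basque text costs close to three tokens per word with this tokenizer.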
# License
This model is derived from Gemma 2 and is licensed under the Gemma License. Copyright © Google DeepMind. All Rights Reserved.
# Acknowledgments
This work is part of the BasqueLLM project, titled "bi-SLM: Optimization of Industrial Processes through Bilingual SLMs" (EXP: 2025-CIE4-000048-01), partially funded by the Guipuzcoa Science, Technology and Innovation Network Program of the Provincial Council of Gipuzkoa. Model training and development were conducted using the Hyperion system at the Donostia International Physics Center (DIPC).
# Citation
If you use this model, please cite:
```bibtex
@inproceedings{sarasua2025,
title={DIPLomA: Efficient Adaptation of Instructed LLMs to Low-Resource Languages via Post-Training Delta Merging},
author={Sarasua, Ixak and Corral, Ander and Saralegi, Xabier},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2025},
year={2025}
}
```
# Contact
- Ixak Sarasua (i.sarasua@orai.eus)
- Ander Corral (a.corral@orai.eus)
- Xabier Saralegi (x.saralegi@orai.eus)