---
library_name: transformers
tags:
- SmolLM-3B
- Arabic
language:
- ar
metrics:
- chrf
base_model:
- HuggingFaceTB/SmolLM3-3B
pipeline_tag: text-generation
---

# Model Card for unige-fti/Aladdin-3B

Multidialectal Arabic generation and translation model fine-tuned for dialect fidelity and diglossia.

## Model Details

### Model Description

- **Base model:** SmolLM3-3B
- **Architecture:** Decoder-only causal transformer (SmolLM architecture)
- **Parameters:** ~3B
- **Language coverage:** Arabic dialects, Modern Standard Arabic (MSA), English

Primary tasks:

- Dialectal Arabic generation
- Bidirectional translation: dialectal Arabic (DA) ↔ MSA ↔ English
- Controlled generation conditioned on dialect instructions

This model was fine-tuned by the Aladdin-FTI team for the AMIYA shared task to jointly optimize two objectives, each paired with an instruction-formatted prompt:

- Machine translation (semantic adequacy & diglossia)

```
Translate from English into Egyptian Arabic:
<SOURCE>
```

- Instruction-conditioned generation (dialect fidelity)

```
Complete the sentence in Moroccan Arabic:
<PREFIX>
```

The objective balances meaning preservation and dialect naturalness in Arabic diglossia settings.
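
The two prompt formats above can be assembled programmatically. The following is a minimal sketch, not the authors' preprocessing code: the function names are hypothetical, and it assumes that the instruction and the input are separated by a single newline and that other language or dialect names can be substituted into the same wording. Neither assumption is stated in this card.

```python
def translation_prompt(source_language: str, target_variety: str, source: str) -> str:
    """Build a translation prompt in the style of the card's first example.

    Assumption: instruction and source text are joined by one newline.
    """
    return f"Translate from {source_language} into {target_variety}:\n{source}"


def completion_prompt(variety: str, prefix: str) -> str:
    """Build a sentence-completion prompt in the style of the card's second example.

    Assumption: instruction and prefix are joined by one newline.
    """
    return f"Complete the sentence in {variety}:\n{prefix}"


# Reproduce the two examples from the card, with placeholder inputs.
print(translation_prompt("English", "Egyptian Arabic", "<SOURCE>"))
print(completion_prompt("Moroccan Arabic", "<PREFIX>"))
```

Keeping the templates in small functions makes it easy to adjust the wording or separator if the training format turns out to differ.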

### Model Sources

- **Repository:** [GitHub repository](https://github.com/drvenabili/mtfinetune_amiya/tree/main)
- **Paper:** [https://arxiv.org/abs/2602.16290](https://arxiv.org/abs/2602.16290)

## How to Get Started with the Model

TODO
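
A minimal sketch of loading the model with the `transformers` library named in this card's metadata and generating from one of the prompt formats described under Model Description. It assumes the checkpoint is published under the ID `unige-fti/Aladdin-3B` (taken from the card title), that it loads with the standard `AutoModelForCausalLM` and `AutoTokenizer` classes, and that a plain-text prompt without a chat template is acceptable. None of this is confirmed by the card. The example sentence and the `max_new_tokens` value are illustrative.

```python
MODEL_ID = "unige-fti/Aladdin-3B"  # assumption: repository ID taken from the card title

# Prompt in the card's translation format; the English sentence is an illustrative input.
EXAMPLE_PROMPT = "Translate from English into Egyptian Arabic:\nWhere is the train station?"


def generate(prompt: str, max_new_tokens: int = 128) -> str:
    """Load the model and return only the newly generated text.

    The first call downloads the weights from the Hugging Face Hub.
    """
    # Imported lazily so this module can be imported without torch/transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # Assumption: the prompt is passed as raw text, without a chat template.
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the continuation is decoded.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Calling `print(generate(EXAMPLE_PROMPT))` runs one translation request. If the model was trained with a chat template, the prompt would instead need to be wrapped with `tokenizer.apply_chat_template` before generation.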

## Training Details

### Training Data

Only closed-track training data was used. The datasets span multiple dialect regions and domains.

Parallel corpora:

- SauDial
- Casablanca corpus
- JODA
- UFAL Levantine
- DODA
- Atlas

Monolingual dialect corpora:

- MADAR
- Shami
- Saudi Tweets
- EDGAD / EDC
- HABIBI lyrics

## Citation

If you use this model in your research, please cite the following paper:

```
@inproceedings{mutal2026aladdinfti,
  title = {Aladdin-FTI @ AMIYA: Three Wishes for Arabic NLP: Fidelity, Diglossia, and Multidialectal Generation},
  author = {Mutal, Jonathan and Al Almaoui, Perla and Hengchen, Simon and Bouillon, Pierrette},
  booktitle = {Proceedings of the AMIYA Shared Task, co-located with VarDial at EACL 2026},
  year = {2026},
  address = {Rabat, Morocco},
  publisher = {Association for Computational Linguistics},
}
```

## Compute Infrastructure

The computations were performed at the University of Geneva using the Baobab HPC service.