Unlocking Moroccan Darija proficiency in a compact and efficient large language model, trained with a minimal-data, green-AI recipe that preserves Qwen2.5-7B-Instruct's strong reasoning abilities while adding fluent Darija generation.
Inclusive AI: More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
Quality over quantity: A carefully curated set of 50K instructions surfaces Darija competence without sacrificing cross-lingual reasoning.
Green AI: Qwen2.5-7B-Instruct-darija achieves competitive Darija scores using minimal energy.
Efficiency: 7B parameters provide an excellent performance-to-size ratio for resource-constrained environments.
Benchmark summary
Darija Benchmarks
| Model | Darija MMLU | Darija HellaSwag | Sentiment Analysis | GSM8K Darija | Summarization (chrF) | ROUGE-1 | ROUGE-L | BERTScore |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 44.9 % | 38.5 % | 63.6 % | 43.9 % | 26.5 | 9.4 | 9.1 | 36.7 |
| Qwen2.5-7B-Instruct-darija | 52.7 % | 45.5 % | 60.4 % | 69.8 % | 27.4 | 8.2 | 8.0 | 39.0 |
English Benchmarks
| Model | MMLU | TruthfulQA | HellaSwag | GSM8K @5 | GSM8K Gen |
|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | 68.7 % | 63.1 % | 65.4 % | 75.8 % | 90.1 % |
| Qwen2.5-7B-Instruct-darija | 70.0 % | 53.6 % | 73.9 % | 74.6 % | 87.2 % |
Zero-shot accuracy; full table in the paper.
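To make the comparison between the two models easier to read, the per-benchmark differences can be computed directly from the reported scores. The snippet below is only a convenience sketch: the numbers are copied from the benchmark summary, and the names `DARIJA`, `ENGLISH`, and `deltas` are illustrative, not part of any released tooling.

```python
# Scores copied from the benchmark summary: (base model, Darija fine-tune).
DARIJA = {
    "Darija MMLU": (44.9, 52.7),
    "Darija HellaSwag": (38.5, 45.5),
    "Sentiment Analysis": (63.6, 60.4),
    "GSM8K Darija": (43.9, 69.8),
    "Summarization (chrF)": (26.5, 27.4),
    "ROUGE-1": (9.4, 8.2),
    "ROUGE-L": (9.1, 8.0),
    "BERTScore": (36.7, 39.0),
}
ENGLISH = {
    "MMLU": (68.7, 70.0),
    "TruthfulQA": (63.1, 53.6),
    "HellaSwag": (65.4, 73.9),
    "GSM8K @5": (75.8, 74.6),
    "GSM8K Gen": (90.1, 87.2),
}


def deltas(scores):
    """Return {benchmark: fine-tuned minus base}, rounded to one decimal."""
    return {name: round(tuned - base, 1) for name, (base, tuned) in scores.items()}


for title, table in (("Darija", DARIJA), ("English", ENGLISH)):
    print(title)
    for name, delta in deltas(table).items():
        print(f"  {name:<22} {delta:+.1f}")
```

On these figures the largest gain is GSM8K Darija (+25.9 points) and the largest drop is TruthfulQA (-9.5 points); Sentiment Analysis, ROUGE-1, ROUGE-L, GSM8K @5, and GSM8K Gen also decrease slightly.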
Quick start
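    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

    model_id = "GemMaroc/Qwen2.5-7B-Instruct-darija"

    # Load the tokenizer and model; dtype and device placement are chosen automatically.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype="auto",
        device_map="auto",
    )

    # Text-generation pipeline with the sampling settings used in this card.
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        device_map="auto",
        max_new_tokens=1024,
        temperature=0.7,
        repetition_penalty=1.2,
        no_repeat_ngram_size=3,
    )

    # Darija prompt, roughly: "What is the 'butterfly effect' theory?
    # Explain it in Darija and give a simple example."
    messages = [
        {
            "role": "user",
            "content": "شنو هي نظرية 'butterfly effect'؟ فسّرها بدارجة ونقّط مثال بسيط.",
        }
    ]

    # Render the chat template to a string, leaving the assistant turn open.
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # The pipeline returns prompt + completion; print only the completion.
    print(pipe(prompt)[0]["generated_text"][len(prompt):])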
Chat template (Qwen2.5 format)
The tokenizer ships with a built-in Jinja chat template in the Qwen2.5 (ChatML) format. Turns alternate between user and assistant, and each turn is wrapped as <|im_start|>role … <|im_end|>. When you set add_generation_prompt=True, the rendered prompt ends right after the opening <|im_start|>assistant tag so the model can continue from there.
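The layout described above can be illustrated without loading the tokenizer. The helper below, `render_chatml`, is a hypothetical name and a hand-written approximation of the ChatML turn structure, not the tokenizer's actual Jinja template. The real template may, for instance, also insert a default system message, so `tokenizer.apply_chat_template` remains the reliable way to build prompts.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def render_chatml(messages, add_generation_prompt=True):
    """Approximate the ChatML layout: one <|im_start|>role ... <|im_end|> block per turn."""
    parts = [
        f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n"
        for m in messages
    ]
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


messages = [{"role": "user", "content": "Salam!"}]
print(render_chatml(messages))
```

With `add_generation_prompt=False` the string stops after the last `<|im_end|>` marker, which is the form typically used when rendering complete conversations rather than prompts.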
Merge & push: Merge the LoRA adapter into the base weights (PEFT's merge_and_unload), convert to safetensors, and upload.
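The merge-and-push step can be sketched roughly as follows. This is an assumption-laden outline rather than the authors' actual script: the function name `merge_and_push`, the adapter path, the output directory, and the bfloat16 dtype are all placeholders chosen for illustration. It relies only on the standard `peft` and `transformers` calls (`PeftModel.from_pretrained`, `merge_and_unload`, `save_pretrained`, `push_to_hub`).

```python
def merge_and_push(base_id, adapter_path, out_dir, repo_id=None):
    """Merge a LoRA adapter into its base model and save the result as safetensors.

    All arguments are placeholders supplied by the caller; nothing here is
    taken from the original training setup.
    """
    # Imported lazily so the sketch can be imported without the heavy dependencies.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    model = PeftModel.from_pretrained(base, adapter_path)

    # Fold the LoRA deltas into the base weights and drop the adapter wrappers.
    merged = model.merge_and_unload()

    # safe_serialization=True writes .safetensors files instead of pickle-based .bin.
    merged.save_pretrained(out_dir, safe_serialization=True)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    tokenizer.save_pretrained(out_dir)

    if repo_id is not None:
        # Uploading requires prior authentication with the Hugging Face Hub.
        merged.push_to_hub(repo_id)
        tokenizer.push_to_hub(repo_id)
    return out_dir
```

A call might look like `merge_and_push("Qwen/Qwen2.5-7B-Instruct", "path/to/lora-adapter", "merged-model")`, where the adapter path and output directory are hypothetical.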
Limitations & ethical considerations
Sentiment and abstractive summarisation still trail state-of-the-art.
Tokeniser is unchanged; rare Darija spellings may fragment.
Model may inherit societal biases present in pre-training data.
No RLHF / RLAIF safety alignment yet – apply a moderation layer in production.
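One rough way to gauge the tokeniser limitation listed above is to count how many tokens each word is split into. The sketch below assumes a caller-supplied `tokenize` function; `tokens_per_word` and `toy_tokenize` are illustrative names, and the toy tokenizer is a stand-in so the example runs without downloading anything.

```python
def tokens_per_word(text, tokenize):
    """Average number of tokens produced per whitespace-separated word."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(tokenize(word)) for word in words) / len(words)


def toy_tokenize(word):
    """Toy stand-in tokenizer: split a word into chunks of at most 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


print(tokens_per_word("kifach nta lyoum", toy_tokenize))
```

To measure the real model, pass `tokenizer.tokenize` from the loaded Hugging Face tokenizer instead of `toy_tokenize`. Higher averages on Darija text than on comparable English or Standard Arabic text would suggest heavier fragmentation, though tokenizing words in isolation only approximates how they are split in running text.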
Citation
If you use Qwen2.5-7B-Instruct-darija in your work, please cite:
@misc{skiredj2025gemmarocunlockingdarijaproficiency,
  title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
  author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
  year={2025},
  eprint={2505.17082},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.17082},
}