Model: GemMaroc/Qwen2.5-7B-Instruct-darija
Source: Original Platform

2026-05-30 10:09:31 +08:00

Model Card for Qwen2.5-7B-Instruct-darija

Qwen2.5-7B-Instruct-darija

Unlocking Moroccan Darija proficiency in a compact and efficient large language model, trained with a minimal-data, green-AI recipe that preserves Qwen2.5-7B-Instruct's strong reasoning abilities while adding fluent Darija generation.

Model at a glance

Parameter	Value
Model ID	`GemMaroc/Qwen2.5-7B-Instruct-darija`
Base model	`Qwen/Qwen2.5-7B-Instruct`
Architecture	Decoder-only Transformer (Qwen2.5)
Parameters	7 billion
Context length	32,768 tokens
Training regime	Supervised fine-tuning (LoRA → merged) on 50K high-quality Darija/English instructions (TULU-50K slice)
License	Apache 2.0

Why another Darija model?

Inclusive AI: More than 36 million speakers of Moroccan Arabic remain underserved by open LLMs.
Quality over quantity: A carefully curated 50K instruction set surfaces Darija competence without sacrificing cross-lingual reasoning.
Green AI: Qwen2.5-7B-Instruct-darija achieves competitive Darija scores using minimal energy.
Efficiency: 7B parameters provide an excellent performance-to-size ratio for resource-constrained environments.

Benchmark summary

Darija Benchmarks

Model	Darija MMLU	Darija HellaSwag	Sentiment Analysis	GSM8K Darija	Summarization (chrF)	ROUGE-1	ROUGE-L	BERTScore
Qwen2.5-7B-Instruct	44.9 %	38.5 %	63.6 %	43.9 %	26.5	9.4	9.1	36.7
Qwen2.5-7B-Instruct-darija	52.7 %	45.5 %	60.4 %	69.8 %	27.4	8.2	8.0	39.0

English Benchmarks

Model	MMLU	TruthfulQA	HellaSwag	GSM8K @5	GSM8K Gen
Qwen2.5-7B-Instruct	68.7 %	63.1 %	65.4 %	75.8 %	90.1 %
Qwen2.5-7B-Instruct-darija	70.0 %	53.6 %	73.9 %	74.6 %	87.2 %

Zero-shot accuracy; full table in the paper.
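To make the trade-offs in the two tables easier to read, the following sketch computes the per-benchmark difference (fine-tuned minus base) in percentage points. The scores are copied from the accuracy columns of the tables above; the dictionary layout and the function name `score_deltas` are illustrative only.

```python
# Accuracy scores (in %) copied from the benchmark tables above.
BASE = {
    "Darija MMLU": 44.9,
    "Darija HellaSwag": 38.5,
    "Darija Sentiment": 63.6,
    "GSM8K Darija": 43.9,
    "MMLU": 68.7,
    "TruthfulQA": 63.1,
    "HellaSwag": 65.4,
    "GSM8K @5": 75.8,
    "GSM8K Gen": 90.1,
}
DARIJA = {
    "Darija MMLU": 52.7,
    "Darija HellaSwag": 45.5,
    "Darija Sentiment": 60.4,
    "GSM8K Darija": 69.8,
    "MMLU": 70.0,
    "TruthfulQA": 53.6,
    "HellaSwag": 73.9,
    "GSM8K @5": 74.6,
    "GSM8K Gen": 87.2,
}


def score_deltas(base, tuned):
    """Return tuned - base per benchmark, rounded to one decimal place."""
    return {name: round(tuned[name] - base[name], 1) for name in base}


if __name__ == "__main__":
    for name, delta in score_deltas(BASE, DARIJA).items():
        print(f"{name:<18} {delta:+.1f}")
```

On these numbers the largest gain is GSM8K Darija (+25.9 points) and the largest drop is TruthfulQA (-9.5 points), which is worth keeping in mind when the model is used for factual question answering.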

Quick start

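from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_id = "GemMaroc/Qwen2.5-7B-Instruct-darija"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the dtype stored in the checkpoint
    device_map="auto",    # place weights on the available devices (needs accelerate)
)

# The model is already placed on its devices, so device_map is not repeated here.
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    do_sample=True,           # sampling must be on for temperature to take effect
    temperature=0.7,
    repetition_penalty=1.2,
    no_repeat_ngram_size=3,
    return_full_text=False,   # return only the completion, not the prompt
)

# Prompt (Darija): "What is the 'butterfly effect' theory? Explain it in Darija and give a simple example."
messages = [
    {"role": "user", "content": "شنو هي نظرية 'butterfly effect'؟ فسّرها بدارجة ونقّط مثال بسيط."}
]

# Render the chat turns into one prompt string that ends with the open assistant tag.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(pipe(prompt)[0]["generated_text"])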

Chat template (Qwen2.5 format)

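The tokenizer ships with a built-in Jinja chat template in ChatML style: each turn is wrapped in <|im_start|>{role} … <|im_end|> markers, with user and assistant turns alternating. Note that <|im_start|> is a turn delimiter, not a begin-of-sequence token. The stock Qwen2.5 template may also prepend a default system turn when messages contains none. When you set add_generation_prompt=True, the prompt ends after the opening assistant tag so the model can continue: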

<|im_start|>user
{user message}<|im_end|>
<|im_start|>assistant

The assistant will keep generating tokens until it decides to emit <|im_end|>.

prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

No manual token juggling is required: the call above handles turn delimiters and newline placement automatically.
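For intuition about what the template call produces, here is a pure-Python sketch of ChatML-style rendering. It is an approximation for illustration, not the tokenizer's actual Jinja template: the function name `render_chatml` is hypothetical, and the real template may add content this sketch omits (for example a default system turn).

```python
def render_chatml(messages, add_generation_prompt=True):
    """Approximate ChatML rendering: one <|im_start|>role ... <|im_end|> block per turn."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave the assistant turn open so the model writes the reply.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    demo = [{"role": "user", "content": "Salam!"}]
    print(render_chatml(demo))
```

In real use, prefer `tokenizer.apply_chat_template`, which always reflects the template stored with the checkpoint.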

Pre-quantised checkpoints will be published under the same repo tags (qwen2.5-7b-darija-awq-int4, qwen2.5-7b-darija-gguf-q4_k_m).
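As a rough guide to why quantised checkpoints matter on constrained hardware, weight storage scales with parameter count times bits per weight. The sketch below uses the nominal 7 billion parameters from the table above and nominal bit widths. Real checkpoints differ somewhat (exact parameter count, quantisation scales, layers kept in higher precision), and activations and the KV cache are not included. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory for the model weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 2**30


NUM_PARAMS = 7_000_000_000  # nominal figure from the model table, not an exact count

if __name__ == "__main__":
    for label, bits in [("bf16", 16), ("int4", 4)]:
        print(f"{label}: {weight_memory_gib(NUM_PARAMS, bits):.1f} GiB")
```

Under these assumptions a 4-bit checkpoint needs about a quarter of the bf16 weight memory, before any runtime overhead.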

Training recipe (one-paragraph recap)

Data: Translate a 44K reasoning slice of TULU 50K into Darija, keeping 20% English for cross-lingual robustness.
LoRA SFT: Rank 16, α = 32, 3 epochs, bf16, context length 32,768.
Merge & push: Merge the LoRA adapter into the base weights (PeftModel.merge_and_unload), convert to safetensors, upload.
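The merge step can be illustrated with the standard LoRA update, W' = W + (α/r)·B·A, which for the rank 16 and α = 32 listed above gives a scaling factor of 2. The sketch below uses tiny pure-Python matrices and a rank of 2 purely for readability. The helper names are hypothetical and this is not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_b, lora_a, alpha, rank):
    """Fold a LoRA adapter into the base weight: W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy shapes: W is 3x3, B is 3x2, A is 2x3 (rank 2 instead of the card's 16).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
B = [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]]
A = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

# alpha=4 with rank=2 keeps the same alpha/rank ratio (2) as alpha=32 with rank=16.
merged = merge_lora(W, B, A, alpha=4, rank=2)

if __name__ == "__main__":
    print(merged)
```

After merging, inference uses the single matrix W' and no longer needs the adapter, which is why the merged checkpoint loads with plain `AutoModelForCausalLM`.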

Limitations & ethical considerations

Sentiment and abstractive summarisation still trail state-of-the-art.
Tokeniser is unchanged; rare Darija spellings may fragment.
Model may inherit societal biases present in pre-training data.
No RLHF / RLAIF safety alignment yet – apply a moderation layer in production.

Citation

If you use Qwen2.5-7B-Instruct-darija in your work, please cite:

@misc{skiredj2025gemmarocunlockingdarijaproficiency,
      title={GemMaroc: Unlocking Darija Proficiency in LLMs with Minimal Data},
      author={Abderrahman Skiredj and Ferdaous Azhari and Houdaifa Atou and Nouamane Tazi and Ismail Berrada},
      year={2025},
      eprint={2505.17082},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.17082},
}
