Initialize project; model provided by the ModelHub XC community

Model: arcee-ai/Meraj-Mini
Source: Original Platform
ModelHub XC
2026-05-15 09:28:54 +08:00
commit d556f6dd94
15 changed files with 151871 additions and 0 deletions

40
.gitattributes vendored Normal file

@@ -0,0 +1,40 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
model-00001-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
model-00002-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
model-00003-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text
model-00004-of-00004.safetensors filter=lfs diff=lfs merge=lfs -text

134
README.md Normal file

@@ -0,0 +1,134 @@
---
license: apache-2.0
language:
- ar
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
pipeline_tag: text2text-generation
library_name: transformers
tags:
- qwen
- text-generation-inference
---
<div align="center">
<img src="https://i.ibb.co/CmPSSpq/Screenshot-2024-10-06-at-9-45-06-PM.png" alt="Arcee Meraj Mini" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 100%; height: auto;">
</div>
Following the release of [Arcee Meraj](https://meraj.arcee.ai/), our enterprise's globally top-performing Arabic LLM, we are thrilled to unveil Arcee Meraj Mini. This open-source model, fine-tuned from [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct), is designed for both Arabic and English. It has been evaluated across multiple benchmarks in both languages, demonstrating top-tier performance in Arabic and competitive results in English. Arcee Meraj Mini's primary objective is to enhance Arabic capabilities while maintaining robust English language proficiency. Benchmark results confirm that Arcee Meraj Mini excels in Arabic, with English performance comparable to leading models, in line with our vision for balanced bilingual strength.
## Technical Details
Below is an overview of the key stages in Meraj Mini's development:
1. **Data Preparation:** We filter candidate samples from diverse English and Arabic sources to ensure high-quality data. Some of the selected English datasets are translated into Arabic to increase the quantity of Arabic samples and improve the model's bilingual performance. Then, new [Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) datasets are continuously prepared, filtered, and translated to maintain a fresh and diverse dataset that supports better generalization across domains.
2. **Initial Training:** We train the Qwen2.5 model with 7 billion parameters using these high-quality datasets in both languages. This allows the model to handle diverse linguistic patterns from over 500 million tokens, ensuring strong performance in Arabic and English tasks.
3. **Iterative Training and Post-Training:** Successive rounds of training and post-training refine the model, improving its accuracy and its adaptability across varied tasks and language contexts.
4. **Evaluation:** Arcee Meraj Mini is based on training and evaluating 15 different variants to explore optimal configurations, with assessments done on both Arabic and English benchmarks and leaderboards. This step ensures the model is robust in handling both general and domain-specific tasks.
5. **Final Model Creation:** We select the best-performing variant and use the [MergeKit](https://arxiv.org/pdf/2403.13257) library to merge the configurations, resulting in the final Arcee Meraj Mini model. This model is not only optimized for language understanding but also serves as a starting point for domain adaptation in different areas.
With this process, Arcee Meraj Mini is crafted to be more than just a general-purpose language model: it's an adaptable tool, ready to be fine-tuned for specific industries and applications, empowering users to extend its capabilities for domain-specific tasks.
## Capabilities and Use Cases
Arcee Meraj Mini can handle a wide range of language tasks, including the following:
1. **Arabic Language Understanding**: Arcee Meraj Mini excels in general language comprehension, reading comprehension, and common-sense reasoning, all tailored to the Arabic language, providing strong performance in a variety of linguistic tasks.
2. **Cultural Adaptation**: The model ensures content creation that goes beyond linguistic accuracy, incorporating cultural nuances to align with Arabic norms and values, making it suitable for culturally relevant applications.
3. **Education**: It enables personalized, adaptive learning experiences for Arabic speakers by generating high-quality educational content across diverse subjects, enhancing the overall learning journey.
4. **Mathematics and Coding**: With robust support for mathematical reasoning and problem-solving, as well as code generation in Arabic, Arcee Meraj Mini serves as a valuable tool for developers and professionals in technical fields.
5. **Customer Service**: The model facilitates the development of advanced Arabic-speaking chatbots and virtual assistants, capable of managing customer queries with a high degree of natural language understanding and precision.
6. **Content Creation**: Arcee Meraj Mini generates high-quality Arabic content for various needs, from marketing materials and technical documentation to creative writing, ensuring impactful communication and engagement in the Arabic-speaking world.
## Quantized GGUF
Quantized GGUF versions of the model are available here:
- [Meraj-Mini-GGUF](https://huggingface.co/MaziyarPanahi/Meraj-Mini-GGUF)
## How to
This model uses the ChatML prompt template:
```
<|im_start|>system
{System}
<|im_end|>
<|im_start|>user
{User}
<|im_end|>
<|im_start|>assistant
{Assistant}
```
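As a rough illustration of how a conversation maps onto this template, the sketch below renders a message list by hand. It mirrors the `chat_template` string shipped in `tokenizer_config.json`, which places `<|im_end|>` directly after the message content. The function name `render_chatml` and the system prompt text are hypothetical and not part of any library or of this repository.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render chat messages as a ChatML string (hand-written sketch)."""
    parts = []
    for message in messages:
        # Each turn has the form: <|im_start|>{role}\n{content}<|im_end|>\n
        parts.append(
            "<|im_start|>" + message["role"] + "\n"
            + message["content"] + "<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "مرحبا، كيف حالك؟"},
    ],
    add_generation_prompt=True,
)
print(prompt)
```

In practice, `tokenizer.apply_chat_template` should produce an equivalent string from the shipped template, so a hand-written renderer like this is mainly useful for inspecting or debugging prompts.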
```python
# Option 1: use a pipeline as a high-level helper
from transformers import pipeline

messages = [
    {"role": "user", "content": "مرحبا، كيف حالك؟"},
]
pipe = pipeline("text-generation", model="arcee-ai/Meraj-Mini")
# The pipeline applies the chat template to the message list before generating
print(pipe(messages))

# Option 2: load the tokenizer and model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("arcee-ai/Meraj-Mini")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/Meraj-Mini")
```
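Loading the model directly leaves generation to the caller. Below is a minimal sketch of one way to produce a reply, assuming `torch` is installed and enough memory is available for the bfloat16 weights. The helper name `generate_reply` and the `max_new_tokens` default are illustrative choices, not something the model card specifies.

```python
def generate_reply(messages, model_id="arcee-ai/Meraj-Mini", max_new_tokens=256):
    """Generate one assistant reply for a list of chat messages (sketch)."""
    # Imported lazily so that defining the helper does not load the libraries.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # bfloat16 matches the torch_dtype recorded in config.json.
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

    # Format the conversation with the chat template and open an assistant turn.
    inputs = tokenizer.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, dropping the prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0, prompt_length:], skip_special_tokens=True)
```

Calling `generate_reply([{"role": "user", "content": "مرحبا، كيف حالك؟"}])` downloads the full checkpoint on first use, so it is best tried on a machine with a suitable GPU or plenty of RAM.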
## Evaluations
#### Open Arabic LLM Leaderboard (OALL) Benchmarks
The Arcee Meraj Mini model outperforms state-of-the-art models on most of the Open Arabic LLM Leaderboard (OALL) benchmarks, highlighting its improvements and effectiveness on Arabic language content, and achieves the highest average score among the compared models.
<div align="center">
<img src="https://i.ibb.co/LQ0z7fH/Screenshot-2024-10-15-at-2-53-45-PM.png" alt="Arcee Meraj Mini Open Arabic LLM Leaderboard (OALL) - table 1" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
<div align="center">
<img src="https://i.ibb.co/fM6VQR7/Screenshot-2024-10-15-at-2-53-55-PM.png" alt="Arcee Meraj Mini Open Arabic LLM Leaderboard (OALL) - table 2" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
#### Translated MMLU
We used the multilingual MMLU dataset, as distributed through the LM Evaluation Harness repository, to compare the multilingual strength of different models. Arcee Meraj Mini outperforms the other state-of-the-art models on this benchmark.
<div align="center">
<img src="https://i.ibb.co/dfwW1W5/W-B-Chart-10-15-2024-2-07-12-PM.png" alt="Arcee Meraj Mini Translated MMLU" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
#### English Benchmarks
Arcee Meraj Mini performs comparably to state-of-the-art models, demonstrating how the model retains its English language knowledge and capabilities while learning Arabic.
<div align="center">
<img src="https://i.ibb.co/mTcLFzt/W-B-Chart-10-15-2024-2-15-57-PM.png" alt="Arcee Meraj Mini Winogrande" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
<div align="center">
<img src="https://i.ibb.co/GRBjjGN/W-B-Chart-10-15-2024-2-17-34-PM.png" alt="Arcee Meraj Mini Arc Challenge" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
<div align="center">
<img src="https://i.ibb.co/98s0qTf/W-B-Chart-10-15-2024-2-17-46-PM.png" alt="Arcee Meraj Mini TruthfulQA" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
<div align="center">
<img src="https://i.ibb.co/yqvRK3L/W-B-Chart-10-15-2024-2-17-57-PM.png" alt="Arcee Meraj Mini GSM8K" style="border-radius: 10px; box-shadow: 0 4px 8px 0 rgba(0, 0, 0, 0.2), 0 6px 20px 0 rgba(0, 0, 0, 0.19); max-width: 80%; height: auto;">
</div>
## Model Usage
For a detailed explanation of the model's capabilities, architecture, and applications, please refer to our blog post: https://blog.arcee.ai/arcee-meraj-mini-2/
To test the model directly, you can try it out using this Google Colab notebook: https://colab.research.google.com/drive/1hXXyNM-X0eKwlZ5OwqhZfO0U8CBq8pFO?usp=sharing
## Acknowledgements
We are grateful to the open-source AI community for their continuous contributions and to the Qwen team for their foundational efforts on the Qwen2.5 model series.
## Future Directions
As we release Arcee Meraj Mini to the public, we invite researchers, developers, and businesses to engage with the model, particularly in enhancing support for the Arabic language and fostering domain adaptation. We are committed to advancing open-source AI technology and invite the community to explore, contribute to, and build upon Arcee Meraj Mini.

24
added_tokens.json Normal file

@@ -0,0 +1,24 @@
{
"</tool_call>": 151658,
"<tool_call>": 151657,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}

29
config.json Normal file

@@ -0,0 +1,29 @@
{
"_name_or_path": "arcee-train/Arcee-Qwwen2.5-English-Arabic",
"architectures": [
"Qwen2ForCausalLM"
],
"attention_dropout": 0.0,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 131072,
"max_window_layers": 28,
"model_type": "qwen2",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.45.1",
"use_cache": false,
"use_mrope": false,
"use_sliding_window": false,
"vocab_size": 152064
}
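The values in this config are enough to derive a few useful quantities. The sketch below computes the attention head size, the key/value cache footprint per token, and an approximate parameter count. It assumes the usual Qwen2 layer layout (bias terms on the q/k/v projections only, two RMSNorm weight vectors per layer, one final norm); that layout is an assumption made here and is not stated in the config itself.

```python
# Values copied from the config.json shown above.
config = {
    "hidden_size": 3584,
    "intermediate_size": 18944,
    "num_attention_heads": 28,
    "num_hidden_layers": 28,
    "num_key_value_heads": 4,
    "vocab_size": 152064,
    "tie_word_embeddings": False,
}

h = config["hidden_size"]
layers = config["num_hidden_layers"]
head_dim = h // config["num_attention_heads"]
# Grouped-query attention: keys and values use fewer heads than queries.
kv_dim = head_dim * config["num_key_value_heads"]

# KV cache per token: keys and values, for every layer, at 2 bytes (bfloat16).
kv_bytes_per_token = 2 * layers * kv_dim * 2

# Assumed layout: q/k/v projections with bias, o_proj and MLP without bias.
attn = (h * h + h) + 2 * (h * kv_dim + kv_dim) + h * h
mlp = 3 * h * config["intermediate_size"]  # gate, up and down projections
per_layer = attn + mlp + 2 * h             # plus two RMSNorm weight vectors

embed = config["vocab_size"] * h
lm_head = 0 if config["tie_word_embeddings"] else embed
total_params = layers * per_layer + embed + lm_head + h  # + final norm

print(head_dim, kv_bytes_per_token, total_params)
```

At two bytes per parameter, the resulting count corresponds to roughly 15.2 GB of weights, which is in line with the combined size of the four LFS-tracked shards in this commit. Note that the KV cache figure only applies when caching is enabled at generation time, since this config sets `use_cache` to `false`.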

1
configuration.json Normal file

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "text2text-generation", "allow_remote": true}

151388
merges.txt Normal file

File diff suppressed because it is too large


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e2bb743e0916cf2e57ab7b321f22f7332f465a8060f87b430adb7283614797b8
size 4976698776


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:53f2b4abe66fd8be4736f5a13a52e42b2c79c0b7fb9877f92b68d7124b858cb4
size 4932751032


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:287e987cf95374c484c83b8de49746768a6fd17cc26f8915206b1883cab48702
size 4991495808


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9e366ae6d76edb8e349d93efbf75c8fcba4808a61dfa34865addfa238d5e94a6
size 330326240
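Each of the pointer files above follows the Git LFS pointer layout: a `version` line, an `oid` line carrying a SHA-256 digest, and a `size` line in bytes. The sketch below parses such a pointer and checks a downloaded file against it. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and written only for this illustration.

```python
import hashlib


def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into its version, SHA-256 digest and size."""
    # Every line is "<key> <value>", separated by a single space.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, _, digest = fields["oid"].partition(":")
    if algo != "sha256":
        raise ValueError(f"unsupported oid algorithm: {algo}")
    return {
        "version": fields["version"],
        "sha256": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(path, pointer, chunk_size=1 << 20):
    """Check a local file against a parsed pointer by size and SHA-256."""
    sha, size = hashlib.sha256(), 0
    with open(path, "rb") as f:
        # Read in chunks so multi-gigabyte shards are not loaded into memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha.update(chunk)
            size += len(chunk)
    return size == pointer["size"] and sha.hexdigest() == pointer["sha256"]


# The last pointer shown above, reproduced as a string.
pointer = parse_lfs_pointer(
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:9e366ae6d76edb8e349d93efbf75c8fcba4808a61dfa34865addfa238d5e94a6\n"
    "size 330326240\n"
)
print(pointer["size"])
```

Comparing the size first is a cheap way to catch truncated downloads; the digest comparison then confirms the content.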

File diff suppressed because one or more lines are too long

31
special_tokens_map.json Normal file

@@ -0,0 +1,31 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9c5ae00e602b8860cbd784ba82a8aa14e8feecec692e7076590d014d7b7fdafa
size 11421896

207
tokenizer_config.json Normal file

@@ -0,0 +1,207 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}
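The tokenizer settings above need to agree with `config.json`: the end-of-sequence token should map to the model's `eos_token_id`, every token id should fit inside the embedding table, and the context lengths should match. The sketch below shows one way to check this. The helper name `check_token_consistency` is hypothetical, and the dictionaries are trimmed copies of the two files rather than the full contents.

```python
def check_token_consistency(model_config, tokenizer_config):
    """Return a list of mismatches between config.json and tokenizer_config.json."""
    problems = []
    # Map token text -> id using the tokenizer's added_tokens_decoder.
    ids = {
        entry["content"]: int(token_id)
        for token_id, entry in tokenizer_config["added_tokens_decoder"].items()
    }
    eos = tokenizer_config["eos_token"]
    if ids.get(eos) != model_config["eos_token_id"]:
        problems.append(f"eos_token {eos!r} does not map to eos_token_id")
    if tokenizer_config["pad_token"] not in ids:
        problems.append("pad_token is missing from added_tokens_decoder")
    if max(ids.values()) >= model_config["vocab_size"]:
        problems.append("a token id falls outside the embedding table")
    if tokenizer_config["model_max_length"] != model_config["max_position_embeddings"]:
        problems.append("context lengths differ")
    return problems


# Trimmed copies of the two files from this commit.
model_config = {
    "eos_token_id": 151645,
    "vocab_size": 152064,
    "max_position_embeddings": 131072,
}
tokenizer_config = {
    "added_tokens_decoder": {
        "151643": {"content": "<|endoftext|>"},
        "151645": {"content": "<|im_end|>"},
        "151664": {"content": "<|file_sep|>"},
    },
    "eos_token": "<|im_end|>",
    "pad_token": "<|endoftext|>",
    "model_max_length": 131072,
}
print(check_token_consistency(model_config, tokenizer_config))
```

With the values from this commit the check reports no mismatches. The highest added token id is 151664 while `vocab_size` is 152064, so the embedding table has unused rows beyond the tokenizer's range.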

1
vocab.json Normal file

File diff suppressed because one or more lines are too long