Initialize project; model provided by the ModelHub XC community

Model: AIDC-AI/Marco-Mini-Instruct
Source: Original Platform
ModelHub XC
2026-04-28 23:00:10 +08:00
commit ce49c9f737
16 changed files with 325558 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

200
README.md Normal file

@@ -0,0 +1,200 @@
---
license: apache-2.0
language:
- en
- zh
- ar
- de
- es
- fr
- ko
- ja
- pt
- tr
- id
- it
- nl
- pl
- ru
- vi
- th
- he
- uk
- ms
- bn
- cs
- ur
- kk
- el
- ro
- hu
- ne
- az
library_name: transformers
tags:
- moe
- mixture-of-experts
- multilingual
- upcycling
- on-policy distillation
datasets:
- allenai/Dolci-Instruct-SFT
- nvidia/Nemotron-Cascade-2-SFT-Data
- nvidia/Nemotron-RL-instruction_following
- nvidia/Nemotron-RL-instruction_following-structured_outputs
- nvidia/Nemotron-RL-ReasoningGym-v1
- nvidia/Nemotron-RL-knowledge-mcqa
- nvidia/Nemotron-Cascade-RL-RLHF
- BytedTsinghua-SIA/DAPO-Math-17k
- Skywork/Skywork-OR1-RL-Data
- nvidia/Nemotron-SFT-Multilingual-v1
---
# Marco-Mini-Instruct
**Marco-Mini-Instruct** is the instruction-tuned variant of [Marco-Mini-Base](https://huggingface.co/AIDC-AI/Marco-Mini-Base), a highly sparse Mixture-of-Experts (MoE) multilingual language model from the [Marco-MoE](https://github.com/AIDC-AI/Marco-LLM) family, developed by Alibaba International Digital Commerce. It activates only **0.86B out of 17.3B total parameters** (5% activation ratio) per token. Marco-Mini-Instruct achieves the **best average performance** across English, multilingual general, and multilingual cultural benchmarks when compared against instruct models with up to 12B activated parameters, including Qwen3-4B-Instruct, Ministral3-8B-Instruct, Gemma3-12B-Instruct, LFM2-24B-A2B, and Granite4-Small-Instruct.
## Model Description
Marco-Mini-Instruct shares the same architecture as [Marco-Mini-Base](https://huggingface.co/AIDC-AI/Marco-Mini-Base): a decoder-only Transformer with sparse MoE layers replacing standard FFN layers, upcycled from [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) using fine-grained sub-matrix splitting combined with Drop-Upcycling.
| Configuration | Value |
|:---|:---:|
| Total Parameters | 17.3B |
| Activated Parameters | 0.86B |
| Activation Ratio | 5% |
| Num Layers | 28 |
| Model Dimension | 1024 |
| FFN Intermediate Dimension | 3072 |
| Q-Heads | 16 |
| KV-Heads | 8 |
| Head Dimension | 128 |
| Expert Dimension | 768 |
| Total Experts | 256 |
| Activated Experts | 8 |
| Tie Embeddings | True |
| Training FLOPs | $1.56 \times 10^{23}$ |
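The totals in the table can be roughly reproduced from the listed dimensions. The sketch below is a back-of-the-envelope estimate, not the authors' accounting: it assumes a Qwen3-MoE-style layout (bias-free attention projections, RMSNorm weights, a linear router, and gated experts made of three matrices each) and takes the vocabulary size of 151,936 from `config.json`.

```python
# Back-of-the-envelope parameter count from the configuration table.
# Assumption (not stated in the README): bias-free attention projections,
# RMSNorm weights, a linear router, and gated experts with three matrices
# (gate, up, down) per expert.

HIDDEN = 1024                      # model dimension
LAYERS = 28
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
EXPERT_DIM = 768
NUM_EXPERTS, ACTIVE_EXPERTS = 256, 8
VOCAB = 151_936                    # vocab_size from config.json


def layer_params(experts_counted: int) -> int:
    """Parameters in one decoder layer, counting `experts_counted` experts."""
    attention = HIDDEN * HEAD_DIM * (2 * Q_HEADS + 2 * KV_HEADS)  # q, o, k, v
    norms = 2 * HIDDEN + 2 * HEAD_DIM  # two layer norms plus q/k norms
    router = HIDDEN * NUM_EXPERTS
    experts = experts_counted * 3 * HIDDEN * EXPERT_DIM
    return attention + norms + router + experts


def model_params(experts_counted: int) -> int:
    """Whole-model count; tied embeddings are counted once."""
    embeddings = VOCAB * HIDDEN
    final_norm = HIDDEN
    return LAYERS * layer_params(experts_counted) + embeddings + final_norm


total = model_params(NUM_EXPERTS)
active = model_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
print(f"ratio  ~ {100 * active / total:.1f}%")
```

Under these assumptions the estimate comes out at about 17.25B total and 0.87B activated parameters (a ratio of about 5.0%), close to the 17.3B and 0.86B reported in the table. The small gap on the activated count may come from a different counting convention.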
## Post-Training Details
Marco-Mini-Instruct is trained from [Marco-Mini-Base](https://huggingface.co/AIDC-AI/Marco-Mini-Base) using a two-stage post-training pipeline implemented with the SLIME framework:
### Stage 1: Supervised Fine-Tuning (SFT)
- **Duration:** ~24 hours on 64 GPUs
- **Steps:** ~4,000 (1 epoch)
- **Learning rate:** 1e-5 with cosine decay to 1e-6
- **Batch size:** 512, context length 8,192 tokens
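As a rough illustration of the SFT settings above, the sketch below evaluates a cosine schedule that decays from 1e-5 to 1e-6 over about 4,000 steps. The exact schedule shape is an assumption: the README does not say whether warmup is used or how the training framework parameterizes the decay, and the function name is hypothetical.

```python
import math

PEAK_LR, FINAL_LR = 1e-5, 1e-6
TOTAL_STEPS = 4_000  # approximate step count from the SFT settings


def sft_learning_rate(step: int) -> float:
    """Cosine decay from PEAK_LR to FINAL_LR (assumes no warmup)."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return FINAL_LR + (PEAK_LR - FINAL_LR) * cosine


for step in (0, 1_000, 2_000, 3_000, 4_000):
    print(step, f"{sft_learning_rate(step):.2e}")

# Upper bound on tokens per optimizer step if every sequence filled the
# 8,192-token context (real batches are usually shorter).
MAX_TOKENS_PER_STEP = 512 * 8_192
print(MAX_TOKENS_PER_STEP)
```

With a batch size of 512 and a context length of 8,192, one step covers at most about 4.2M tokens.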
**Data sources:**
1. **General instructions** — Dolci-Instruct dataset, augmented with Nemotron-Cascade-2 data
2. **Knowledge-intensive data** — Scientific prompts from Nemotron-Cascade-2, responses distilled from Gemini3-Flash
3. **Translation data** — Web-mined NLLB translation pairs, filtered and scored with Qwen3-Embedding-8B (top 10K per language)
4. **Multilingual & cultural data** — Wikidata-sourced content with Gemini3-Flash text synthesis for cultural concepts
### Stage 2: On-Policy Distillation (OPD)
- **Duration:** ~110 hours on 64 GPUs
- **Steps:** ~3,800 total (2 responses sampled per prompt)
- **Learning rate:** 1e-6 (constant)
**Cascaded distillation:**
1. ~1,900 steps with Qwen3-30B-A3B-Instruct as teacher
2. ~1,900 steps with Qwen3-Next-80B-A3B-Instruct as stronger teacher
**OPD data mixture:**
| Category | Datasets | Ratio |
|:---|:---|:---:|
| Instruction Following | Nemotron-RL-instruction-following + structured outputs | 25% |
| Knowledge & Reasoning | Nemotron-RL-ReasoningGym-v1 + knowledge-mcqa | 25% |
| Alignment | Nemotron-Cascade-RL-RLHF | 10% |
| Math | DAPO-Math-17k + Skywork-OR1-RL-Data | 10% |
| Multilingual | Translation + Cultural + Nemotron-SFT-Multilingual-v1 | 30% |
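One simple way to realize a mixture like the one in the table is to draw each prompt's category with probability equal to its ratio. The README does not describe how the mixture is actually implemented (for example, per-batch sampling versus fixed dataset proportions), so the sketch below is only an illustration; the category keys and the function name are shorthand introduced here.

```python
import random

# Ratios from the OPD data mixture table; keys are shorthand labels.
MIXTURE = {
    "instruction_following": 0.25,
    "knowledge_reasoning": 0.25,
    "alignment": 0.10,
    "math": 0.10,
    "multilingual": 0.30,
}


def sample_categories(num_prompts: int, seed: int = 0) -> dict[str, int]:
    """Draw a category for each prompt and return the per-category counts."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    draws = rng.choices(names, weights=weights, k=num_prompts)
    return {name: draws.count(name) for name in names}


counts = sample_categories(10_000)
for name, count in counts.items():
    print(f"{name:22s} {count / 10_000:.1%} (target {MIXTURE[name]:.0%})")
```

With enough draws, the empirical shares settle close to the target ratios, which sum to 100%.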
## Supported Languages
English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani
## Evaluation
We compare Marco-Mini-Instruct against strong instruct baselines: **Qwen3-4B-Instruct** (4B activated), **Ministral3-8B-Instruct** (8.8B activated), **Gemma3-12B-Instruct** (12B activated), **Granite4-Small-Instruct** (9B activated), and **LFM2-24B-A2B** (2B activated). Marco-Mini-Instruct uses only **0.86B activated parameters**. We report Avg@8 accuracy for all benchmarks except GlobalMMLU and MMMLU, for which we report Acc@1.
### English
| Benchmark | Qwen3-4B | Ministral3-8B | Gemma3-12B | Granite4-Small | LFM2-24B-A2B | **Marco-Mini** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| MMLU _(Acc)_ | 80.8 | 79.8 | 76.2 | 76.7 | 74.9 | **83.4** |
| MMLU-Redux _(Acc)_ | 80.9 | 79.9 | 76.2 | 76.7 | 74.9 | **83.5** |
| MMLU-Pro _(Acc)_ | 66.9 | 63.9 | 55.8 | 57.1 | 57.6 | **70.7** |
| AGIEval _(Acc)_ | 51.7 | 52.4 | 43.6 | 44.7 | 49.0 | **55.4** |
| GPQA-Diamond _(Acc)_ | **50.8** | 44.8 | 35.2 | 38.6 | 39.7 | 50.3 |
| GSM8K _(EM)_ | 88.6 | 89.5 | 89.7 | 83.9 | 87.2 | **93.1** |
| MATH _(EM)_ | **93.4** | 86.2 | 83.8 | 75.7 | 83.9 | 91.8 |
| **Average** | 73.3 | 70.9 | 65.8 | 64.8 | 66.7 | **75.5** |
### Multilingual — General
| Benchmark | Qwen3-4B | Ministral3-8B | Gemma3-12B | Granite4-Small | LFM2-24B-A2B | **Marco-Mini** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GlobalMMLU _(Acc)_ | 70.2 | 55.4 | 69.2 | 67.4 | 57.0 | **73.3** |
| MMMLU _(Acc)_ | 71.3 | 56.4 | 69.4 | 68.1 | 62.3 | **73.7** |
| MMLU-ProX-Lite _(Acc)_ | 58.3 | 43.3 | 51.3 | 51.6 | 43.3 | **61.2** |
| MGPQA _(Acc)_ | 41.0 | 30.5 | 32.8 | 35.0 | 32.7 | **41.8** |
| FLORES-200 En→Xx _(BLEU)_ | 22.1 | 17.5 | **35.6** | 31.9 | 19.2 | 30.6 |
| FLORES-200 Xx→En _(BLEU)_ | 33.5 | 31.0 | **40.3** | 32.2 | 22.7 | 36.8 |
| WMT24++ En→Xx _(BLEU)_ | 20.9 | 14.4 | **32.1** | 26.6 | 16.0 | 26.8 |
| WMT24++ Xx→En _(BLEU)_ | 29.9 | 24.2 | **35.5** | 27.5 | 18.8 | 31.3 |
| MGSM _(EM)_ | 84.4 | 68.7 | 84.0 | 75.7 | 67.8 | **87.4** |
| PolyMath _(EM)_ | **47.2** | 26.4 | 35.5 | 28.9 | 29.3 | 44.7 |
| **Average** | 47.9 | 36.8 | 48.6 | 44.5 | 36.9 | **50.8** |
### Multilingual — Cultural & Regional
| Benchmark | Qwen3-4B | Ministral3-8B | Gemma3-12B | Granite4-Small | LFM2-24B-A2B | **Marco-Mini** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| INCLUDE _(Acc)_ | 63.8 | 50.7 | 65.0 | 60.3 | 49.1 | **65.6** |
| Global-PIQA _(Acc)_ | 79.6 | 61.3 | 82.2 | 80.2 | 69.0 | **84.2** |
| CMMLU _(Acc)_ | **78.6** | 67.4 | 60.8 | 59.6 | 56.7 | 75.3 |
| C-Eval _(Acc)_ | **80.4** | 68.0 | 59.7 | 59.4 | 56.7 | 75.4 |
| ArabicMMLU _(Acc)_ | 66.0 | 41.4 | **70.1** | 66.3 | 61.3 | 67.8 |
| TurkishMMLU _(Acc)_ | 71.6 | 48.2 | 64.4 | 57.9 | 33.4 | **74.7** |
| GreekMMLU _(Acc)_ | 68.6 | 49.5 | **77.7** | 71.7 | 44.7 | 72.5 |
| KazakhMMLU _(Acc)_ | 66.6 | 59.1 | 66.8 | 63.5 | 47.6 | **68.8** |
| IndoMMLU _(Acc)_ | 64.4 | 52.4 | 65.3 | 59.6 | 42.7 | **65.7** |
| IndoCareer _(Acc)_ | 62.2 | 53.4 | 63.2 | 56.3 | 43.7 | **64.4** |
| IndoCulture _(Acc)_ | 58.7 | 47.8 | **69.6** | 59.3 | 44.2 | 67.1 |
| **Average** | 69.1 | 54.5 | 67.7 | 63.1 | 49.9 | **71.0** |
## Usage
```python
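from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Mini-Instruct"

# Load the tokenizer and model; device_map="auto" places weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]

# Render the chat template and tokenize; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))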
```
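To see what prompt the model receives, the tool-free path of the chat template in `tokenizer_config.json` can be mirrored in a few lines of plain Python. This is a simplified sketch for illustration only: the function name `render_chat` is hypothetical, tool definitions, tool calls, and tool responses are not handled, and in practice `tokenizer.apply_chat_template` should be used instead.

```python
def render_chat(messages: list[dict], add_generation_prompt: bool = True) -> str:
    """Mirror the tool-free path of the shipped chat template (sketch only)."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ("system", "user", "assistant"):
            raise ValueError(f"role not covered by this sketch: {role!r}")
        content = message.get("content", "")
        if not isinstance(content, str):
            content = ""  # the template falls back to an empty string
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers.
        parts.append(f"<|im_start|>{role}\n{content}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chat([{"role": "user", "content": "What is the capital of France?"}])
print(prompt)
```

The model is expected to close its turn with `<|im_end|>`, which `tokenizer_config.json` declares as the tokenizer's `eos_token`.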
**Note**: vLLM is the recommended engine for deployment, as SGLang currently lacks support for MoE models with tied embeddings (see [PR #20127](https://github.com/sgl-project/sglang/pull/20127)). If SGLang is required for your workflow, please use the specific build at commit e5f48b32abff027d859a43b7d5ba3aece04471c7.
## Citation
```bibtex
@article{marco-moe,
title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
year={2026}
}
```
## License
This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

40
config.json Normal file

@@ -0,0 +1,40 @@
{
"architectures": [
"Qwen3MoeForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"decoder_sparse_step": 1,
"dtype": "float32",
"eos_token_id": 151643,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 3072,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"mlp_only_layers": [],
"model_type": "qwen3_moe",
"moe_intermediate_size": 768,
"norm_topk_prob": true,
"num_attention_heads": 16,
"num_experts": 256,
"num_experts_per_tok": 8,
"num_hidden_layers": 28,
"num_key_value_heads": 8,
"output_router_logits": false,
"qkv_bias": false,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000.0,
"router_aux_loss_coef": 0.001,
"sliding_window": null,
"tie_word_embeddings": true,
"transformers_version": "4.57.1",
"use_cache": true,
"use_qk_norm": true,
"use_sliding_window": false,
"vocab_size": 151936
}

1
configuration.json Normal file

@@ -0,0 +1 @@
{"framework":"Pytorch","task":"text-generation"}

6
generation_config.json Normal file

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 151643,
"eos_token_id": 151643,
"transformers_version": "4.57.1"
}

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3b19e7172bcd8b7ffe266ead406aabcb392b711ae3c220d08418f331a6ee4cc0
size 5368320120

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1bf0abe85405e521bb875032eccfb9eb303ebf4a40a17308d514100122334262
size 5367589824

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0c4618c2a2a2af2abec0ee6526bdec4bc28811360cd8d9f8e4017894f2832a9c
size 5369132800

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:37e1a800efe2e84ffc70bed96800992f9634ce9761bea0e25b07e8e8f2dcf967
size 5367593048

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9695175d9c065c195a915756a6635a4bd44e28d77ce6ffe3c222c64732f1a9ad
size 5367593272

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:50ec013c1c1ce3dba2eeba8e5edee2b8501b6a14d84ea4059d22cb79089b47bd
size 5368609760

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2361868fa0ce57386c5a539e485c77c656c6990bfbeb921c8fabd4a44b285253
size 2295024184

21765
model.safetensors.index.json Normal file

File diff suppressed because it is too large

303282
tokenizer.json Normal file

File diff suppressed because it is too large

207
tokenizer_config.json Normal file

@@ -0,0 +1,207 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"chat_template": "{%- if tools %}\n {{- '<|im_start|>system\\n' }}\n {%- if messages[0].role == 'system' %}\n {{- messages[0].content + '\\n\\n' }}\n {%- endif %}\n {{- \"# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}\n {%- for tool in tools %}\n {{- \"\\n\" }}\n {{- tool | tojson }}\n {%- endfor %}\n {{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}\n{%- else %}\n {%- if messages[0].role == 'system' %}\n {{- '<|im_start|>system\\n' + messages[0].content + '<|im_end|>\\n' }}\n {%- endif %}\n{%- endif %}\n{%- for message in messages %}\n {%- if message.content is string %}\n {%- set content = message.content %}\n {%- else %}\n {%- set content = '' %}\n {%- endif %}\n {%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) %}\n {{- '<|im_start|>' + message.role + '\\n' + content + '<|im_end|>' + '\\n' }}\n {%- elif message.role == \"assistant\" %}\n {{- '<|im_start|>' + message.role + '\\n' + content }}\n {%- if message.tool_calls %}\n {%- for tool_call in message.tool_calls %}\n {%- if (loop.first and content) or (not loop.first) %}\n {{- '\\n' }}\n {%- endif %}\n {%- if tool_call.function %}\n {%- set tool_call = tool_call.function %}\n {%- endif %}\n {{- '<tool_call>\\n{\"name\": \"' }}\n {{- tool_call.name }}\n {{- '\", \"arguments\": ' }}\n {%- if tool_call.arguments is string %}\n {{- tool_call.arguments }}\n {%- else %}\n {{- tool_call.arguments | tojson }}\n {%- endif %}\n {{- '}\\n</tool_call>' }}\n {%- endfor %}\n {%- endif %}\n {{- '<|im_end|>\\n' }}\n {%- elif message.role == \"tool\" %}\n {%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n {{- '<|im_start|>user' }}\n {%- endif %}\n {{- '\\n<tool_response>\\n' }}\n {{- content }}\n {{- '\\n</tool_response>' }}\n {%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}\n {{- '<|im_end|>\\n' }}\n {%- endif %}\n {%- endif %}\n{%- endfor %}\n{%- if add_generation_prompt %}\n {{- '<|im_start|>assistant\\n' }}\n{%- endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null,
"add_bos_token": false
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long