---
license: apache-2.0
language:
- en
- zh
- ar
- de
- es
- fr
- ko
- ja
- pt
- tr
- id
- it
- nl
- pl
- ru
- vi
- th
- he
- uk
- ms
- bn
- cs
- ur
- kk
- el
- ro
- hu
- ne
- az
library_name: transformers
tags:
- moe
- mixture-of-experts
- multilingual
- upcycling
datasets:
- allenai/Dolci-Instruct-SFT
- nvidia/Nemotron-Cascade-2-SFT-Data
- nvidia/Nemotron-RL-instruction_following
- nvidia/Nemotron-RL-instruction_following-structured_outputs
- nvidia/Nemotron-RL-ReasoningGym-v1
- nvidia/Nemotron-RL-knowledge-mcqa
- nvidia/Nemotron-Cascade-RL-RLHF
- BytedTsinghua-SIA/DAPO-Math-17k
- Skywork/Skywork-OR1-RL-Data
- nvidia/Nemotron-SFT-Multilingual-v1
---

# Marco-Nano-Instruct

**Marco-Nano-Instruct** is the post-trained variant of [Marco-Nano-Base](https://huggingface.co/AIDC-AI/Marco-Nano-Base), a highly sparse Mixture-of-Experts (MoE) multilingual language model from the [Marco-MoE](https://github.com/AIDC-AI/Marco-LLM) family, developed by Alibaba International Digital Commerce. It activates only **0.6B out of 8B total parameters** (7.5% activation ratio) per token. Despite its extreme sparsity, Marco-Nano-Instruct achieves the **best average performance** across English, multilingual general, and multilingual cultural benchmarks among all comparable instruct models up to 3.84B activated parameters.

## Model Description

Marco-Nano-Instruct shares the same architecture as [Marco-Nano-Base](https://huggingface.co/AIDC-AI/Marco-Nano-Base): a decoder-only Transformer with sparse MoE layers replacing standard FFN layers, upcycled from [Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) using fine-grained sub-matrix splitting combined with Drop-Upcycling.

| Configuration | Value |
|:---|:---:|
| Total Parameters | 8B |
| Activated Parameters | 0.6B |
| Activation Ratio | 7.5% |
| Num Layers | 28 |
| Model Dimension | 1024 |
| FFN Intermediate Dimension | 3072 |
| Q-Heads | 16 |
| KV-Heads | 8 |
| Head Dimension | 128 |
| Expert Dimension | 384 |
| Total Experts | 232 |
| Activated Experts | 8 |
| Tie Embeddings | True |
| Training FLOPs | $1.40 \times 10^{23}$ |
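
As a rough consistency check, the headline parameter counts can be approximately reproduced from the dimensions in the table. The sketch below is only an estimate under assumptions that this card does not state: each expert is taken to be a gated FFN with three projection matrices, each layer is given one linear router, the vocabulary size is assumed to be 151,936 (as in Qwen3), and normalization weights are ignored.

```python
# Back-of-the-envelope parameter count from the configuration table.
# Assumptions (NOT stated in this card): gated experts with three projections
# (gate, up, down), one linear router per layer, a 151,936-token vocabulary
# as in Qwen3, and no contribution from normalization weights.

NUM_LAYERS = 28
D_MODEL = 1024
Q_HEADS, KV_HEADS, HEAD_DIM = 16, 8, 128
EXPERT_DIM = 384
TOTAL_EXPERTS, ACTIVE_EXPERTS = 232, 8
VOCAB_SIZE = 151_936  # assumed


def attention_params() -> int:
    """Q, K, V and output projections of one attention block."""
    q = D_MODEL * Q_HEADS * HEAD_DIM
    kv = 2 * D_MODEL * KV_HEADS * HEAD_DIM
    o = Q_HEADS * HEAD_DIM * D_MODEL
    return q + kv + o


def expert_params() -> int:
    """One expert: gate, up and down projections (assumed layout)."""
    return 3 * D_MODEL * EXPERT_DIM


def model_params(num_experts: int) -> int:
    """Parameter count when `num_experts` experts per layer are counted."""
    router = D_MODEL * TOTAL_EXPERTS
    per_layer = attention_params() + router + num_experts * expert_params()
    embeddings = VOCAB_SIZE * D_MODEL  # counted once because embeddings are tied
    return NUM_LAYERS * per_layer + embeddings


total = model_params(TOTAL_EXPERTS)
active = model_params(ACTIVE_EXPERTS)
print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
print(f"ratio  ~ {100 * active / total:.1f}%")
print("combined width of active experts:", ACTIVE_EXPERTS * EXPERT_DIM)
```

Under these assumptions the estimate lands close to the 8B total and 0.6B activated parameters listed above. The 8 activated experts of width 384 together span 3072 hidden units, the same as the FFN intermediate dimension in the table.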

## Post-Training Details

Marco-Nano-Instruct is trained from [Marco-Nano-Base](https://huggingface.co/AIDC-AI/Marco-Nano-Base) using a two-stage post-training pipeline implemented with the SLIME framework:

### Stage 1: Supervised Fine-Tuning (SFT)

- **Duration:** ~24 hours on 64 GPUs
- **Steps:** ~4,000 (1 epoch)
- **Learning rate:** 1e-5 with cosine decay to 1e-6
- **Batch size:** 512, context length 8,192 tokens
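
The learning-rate schedule above can be sketched as a plain cosine decay. This is a minimal illustration, assuming no warmup phase (none is mentioned here) and using the approximate step count; the function name `cosine_lr` is hypothetical and not part of the training code.

```python
import math

PEAK_LR = 1e-5
FINAL_LR = 1e-6
TOTAL_STEPS = 4_000  # approximate step count from the list above


def cosine_lr(step: int) -> float:
    """Learning rate at `step`, decaying from PEAK_LR to FINAL_LR (no warmup assumed)."""
    progress = min(max(step, 0), TOTAL_STEPS) / TOTAL_STEPS
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1_000, 2_000, 3_000, 4_000):
    print(step, f"{cosine_lr(step):.2e}")

# Upper bound on tokens per optimizer step, assuming the batch size counts
# sequences and every sequence fills the full context length.
print(512 * 8_192)
```

If the batch size of 512 does count sequences, each step covers at most 4,194,304 tokens; the real figure depends on sequence lengths and packing, which are not described here.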

**Data sources:**
1. **General instructions:** Dolci-Instruct dataset, augmented with Nemotron-Cascade-2 data
2. **Knowledge-intensive data:** Scientific prompts from Nemotron-Cascade-2, responses distilled from Gemini3-Flash
3. **Translation data:** Web-mined NLLB translation pairs, filtered and scored with [Qwen3-Embedding-8B](https://huggingface.co/Qwen/Qwen3-Embedding-8B) (top 10K per language)
4. **Multilingual & cultural data:** Wikidata-sourced content with Gemini3-Flash text synthesis for cultural concepts
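
The "top 10K per language" filter for translation pairs can be illustrated with a small sketch. It assumes the score is the cosine similarity between source and target sentence embeddings, which is a guess; the exact scoring rule is not given here. The record schema (`lang`, `src_emb`, `tgt_emb`, `text`) and both function names are hypothetical, and the toy vectors stand in for real Qwen3-Embedding-8B outputs.

```python
import heapq
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_per_language(pairs, k):
    """Keep the k highest-scoring pairs for each language.

    `pairs` is an iterable of dicts with 'lang', 'src_emb' and 'tgt_emb'
    keys (a hypothetical schema used only for this sketch).
    """
    by_lang = defaultdict(list)
    for pair in pairs:
        score = cosine_similarity(pair["src_emb"], pair["tgt_emb"])
        by_lang[pair["lang"]].append((score, pair))
    return {
        lang: [p for _, p in heapq.nlargest(k, scored, key=lambda item: item[0])]
        for lang, scored in by_lang.items()
    }


# Toy data: two German pairs and one Turkish pair with 2-d "embeddings".
toy_pairs = [
    {"lang": "de", "text": "aligned", "src_emb": [1.0, 0.0], "tgt_emb": [0.9, 0.1]},
    {"lang": "de", "text": "misaligned", "src_emb": [1.0, 0.0], "tgt_emb": [0.0, 1.0]},
    {"lang": "tr", "text": "aligned", "src_emb": [0.5, 0.5], "tgt_emb": [0.5, 0.4]},
]
kept = top_k_per_language(toy_pairs, k=1)  # the card uses k = 10,000
print({lang: [p["text"] for p in items] for lang, items in kept.items()})
```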

### Stage 2: On-Policy Distillation (OPD)

- **Duration:** ~110 hours on 64 GPUs
- **Steps:** ~2,900 total (2 responses sampled per prompt)
- **Learning rate:** 1e-6 (constant)

**Cascaded distillation:**
1. ~1,900 steps with Qwen3-30B-A3B-Instruct as teacher
2. ~1,000 steps with Qwen3-Next-80B-A3B-Instruct as stronger teacher
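
The cascade above amounts to switching teachers at a fixed step. The sketch below shows that schedule together with a per-token reverse KL divergence, a common objective for on-policy distillation. The objective actually used for this model is not specified here, so treat `reverse_kl` as an assumption; both function names are hypothetical.

```python
import math

STAGE_1_STEPS = 1_900  # approximate, from the list above
STAGE_2_STEPS = 1_000  # approximate, from the list above


def teacher_for_step(step: int) -> str:
    """Teacher used at a given OPD step under the two-stage cascade."""
    if step < STAGE_1_STEPS:
        return "Qwen3-30B-A3B-Instruct"
    return "Qwen3-Next-80B-A3B-Instruct"


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) over the vocabulary at one token position.

    Assumed objective: the student samples the response and is penalized
    where it puts probability mass that the teacher does not.
    """
    return sum(
        s * (math.log(s + eps) - math.log(t + eps))
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )


# Toy next-token distributions over a 3-token vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.6, 0.3, 0.1]
print(teacher_for_step(0), teacher_for_step(STAGE_1_STEPS))
print(reverse_kl(student, teacher))
```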

**OPD data mixture:**

| Category | Datasets | Ratio |
|:---|:---|:---:|
| Instruction Following | Nemotron-RL-instruction_following + Nemotron-RL-instruction_following-structured_outputs | 25% |
| Knowledge & Reasoning | Nemotron-RL-ReasoningGym-v1 + knowledge-mcqa | 25% |
| Alignment | Nemotron-Cascade-RL-RLHF | 10% |
| Math | DAPO-Math-17k + Skywork-OR1-RL-Data | 10% |
| Multilingual | Translation + Cultural + Nemotron-SFT-Multilingual-v1 | 30% |
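
To make the mixture concrete, the sketch below splits a batch of prompts according to the ratios in the table, using largest-remainder rounding so the counts always add up to the batch size. The batch size of 256 is purely illustrative (the OPD batch size is not stated here), and the category keys and function name are hypothetical.

```python
# Ratios from the OPD data mixture table, in percent.
OPD_MIXTURE = {
    "instruction_following": 25,
    "knowledge_reasoning": 25,
    "alignment": 10,
    "math": 10,
    "multilingual": 30,
}


def prompts_per_category(batch_prompts: int) -> dict:
    """Split a batch of prompts by the mixture using largest-remainder rounding."""
    exact = {name: batch_prompts * pct / 100 for name, pct in OPD_MIXTURE.items()}
    counts = {name: int(value) for name, value in exact.items()}
    leftover = batch_prompts - sum(counts.values())
    # Hand the remaining prompts to the categories with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda name: exact[name] - counts[name], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


assert sum(OPD_MIXTURE.values()) == 100
print(prompts_per_category(256))  # 256 is a hypothetical batch size
```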

## Supported Languages

English, Chinese, Arabic, German, Spanish, French, Korean, Japanese, Portuguese, Turkish, Indonesian, Italian, Dutch, Polish, Russian, Vietnamese, Thai, Hebrew, Ukrainian, Malay, Bengali, Czech, Urdu, Kazakh, Greek, Romanian, Hungarian, Nepali, Azerbaijani

## Evaluation

We compare Marco-Nano-Instruct against instruct models of comparable size: **Qwen3-1.7B-Instruct** (1.7B activated), **Qwen3-VL-2B-Instruct** (2B activated), **Ministral3-3B-Instruct** (3.84B activated), **LFM2-8B-A1B** (1.5B activated), and **Granite4-Tiny-Instruct** (1.47B activated). Marco-Nano-Instruct uses only **0.6B activated parameters**, fewer than any of the baselines. We report Avg@8 accuracy for all benchmarks except GlobalMMLU and MMMLU, for which we report Acc@1.
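
Avg@8 is not defined further here. A common reading, assumed in the sketch below, is the accuracy averaged over 8 sampled responses per question and then over questions; Acc@1 is the same quantity with a single response. The function name and the toy grading matrix are hypothetical.

```python
def avg_at_k(correct: list[list[bool]]) -> float:
    """Mean accuracy (in percent) over k sampled responses per question.

    correct[i][j] is True if the j-th sampled response to question i is right.
    """
    per_question = [sum(samples) / len(samples) for samples in correct]
    return 100.0 * sum(per_question) / len(per_question)


# Toy example: two questions, 8 samples each.
graded = [
    [True] * 6 + [False] * 2,   # 6 of 8 correct
    [True] * 2 + [False] * 6,   # 2 of 8 correct
]
print(avg_at_k(graded))
```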

### English

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | **Marco-Nano** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| MMLU _(Acc)_ | 62.4 | 62.1 | 69.8 | 72.1 | 50.8 | **73.2** |
| MMLU-Redux _(Acc)_ | 62.4 | 62.2 | 69.6 | 71.9 | 51.2 | **73.3** |
| MMLU-Pro _(Acc)_ | 35.2 | 38.3 | 49.5 | 49.5 | 25.3 | **54.5** |
| AGIEval _(Acc)_ | 39.6 | 33.0 | 44.7 | 45.2 | 30.7 | **49.8** |
| GPQA-Diamond _(Acc)_ | 27.5 | 21.0 | 31.6 | **31.9** | 28.3 | 22.2 |
| GSM8K _(EM)_ | 77.9 | 79.7 | 79.0 | 84.6 | 71.1 | **86.7** |
| MATH _(EM)_ | 70.6 | 73.7 | 70.2 | **82.6** | 53.4 | 79.6 |
| **Average** | 53.7 | 52.9 | 59.2 | 62.5 | 44.4 | **62.8** |
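
The **Average** row appears to be the unweighted mean of the seven benchmark scores in each column. The short check below recomputes it for two columns from the values in the table above; the helper name `macro_average` is hypothetical.

```python
# Scores copied from the English table, in row order (MMLU ... MATH).
english_scores = {
    "LFM2-8B-A1B": [72.1, 71.9, 49.5, 45.2, 31.9, 84.6, 82.6],
    "Marco-Nano": [73.2, 73.3, 54.5, 49.8, 22.2, 86.7, 79.6],
}


def macro_average(scores: list[float]) -> float:
    """Unweighted mean over benchmarks, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


for model, scores in english_scores.items():
    print(model, macro_average(scores))
```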

### Multilingual — General

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | **Marco-Nano** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| GlobalMMLU _(Acc)_ | 46.3 | 45.9 | 38.4 | 49.0 | 43.0 | **58.7** |
| MMMLU _(Acc)_ | 49.0 | 49.0 | 39.4 | 56.5 | 44.1 | **59.9** |
| MMLU-ProX-Lite _(Acc)_ | 28.6 | 30.3 | 26.7 | 33.8 | 22.1 | **43.2** |
| MGPQA _(Acc)_ | 25.3 | 22.3 | 18.8 | **27.2** | 25.9 | 21.6 |
| FLORES-200 En→Xx _(BLEU)_ | 12.7 | 15.3 | 8.3 | 14.9 | **22.5** | 22.3 |
| FLORES-200 Xx→En _(BLEU)_ | 28.2 | 28.6 | 18.9 | 20.1 | 30.4 | **31.1** |
| WMT24++ En→Xx _(BLEU)_ | 13.2 | 14.6 | 4.4 | 14.6 | **18.9** | 18.7 |
| WMT24++ Xx→En _(BLEU)_ | 26.4 | 26.2 | 8.3 | 17.9 | 25.1 | **27.3** |
| MGSM _(EM)_ | 63.6 | 67.6 | 47.0 | 56.5 | 55.3 | **76.5** |
| PolyMath _(EM)_ | 23.4 | 25.5 | 16.3 | 26.5 | 18.7 | **29.6** |
| **Average** | 31.7 | 32.5 | 22.7 | 31.7 | 30.6 | **38.9** |

### Multilingual — Cultural & Regional

| Benchmark | Qwen3-1.7B | Qwen3-VL-2B | Ministral3-3B | LFM2-8B-A1B | Granite4-Tiny | **Marco-Nano** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|
| INCLUDE _(Acc)_ | 44.9 | 44.4 | 35.4 | 43.5 | 38.6 | **54.3** |
| Global-PIQA _(Acc)_ | 62.0 | 65.8 | 50.6 | 60.8 | 63.3 | **70.7** |
| CMMLU _(Acc)_ | 60.4 | **63.3** | 48.9 | 52.7 | 39.2 | 60.0 |
| C-Eval _(Acc)_ | 58.7 | **63.2** | 50.6 | 50.8 | 39.4 | 60.8 |
| ArabicMMLU _(Acc)_ | 48.8 | 46.9 | 22.7 | **56.5** | 43.4 | **56.5** |
| TurkishMMLU _(Acc)_ | 42.7 | 39.6 | 38.6 | 26.3 | 31.6 | **59.9** |
| GreekMMLU _(Acc)_ | 48.7 | 48.0 | 38.4 | 40.0 | 44.8 | **61.6** |
| KazakhMMLU _(Acc)_ | 46.0 | 47.1 | 41.4 | 39.6 | 39.6 | **56.3** |
| IndoMMLU _(Acc)_ | 48.8 | 49.3 | 35.2 | 41.1 | 37.2 | **56.3** |
| IndoCareer _(Acc)_ | 46.1 | 45.7 | 36.0 | 41.7 | 34.7 | **54.9** |
| IndoCulture _(Acc)_ | 45.8 | 47.7 | 37.2 | 45.9 | 42.8 | **59.1** |
| **Average** | 50.3 | 51.0 | 39.5 | 45.4 | 41.3 | **59.1** |

## Usage

```python
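from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "AIDC-AI/Marco-Nano-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" places the weights on the available device(s); it requires `accelerate`.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [
    {"role": "user", "content": "What is the capital of France?"}
]
# Render the chat template and tokenize; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))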
```

**Note**: vLLM is the recommended engine for deployment, as SGLang currently lacks support for MoE models with tied embeddings (see [PR #20127](https://github.com/sgl-project/sglang/pull/20127)). If your workflow requires SGLang, use the build at commit `e5f48b32abff027d859a43b7d5ba3aece04471c7`.
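
For deployment with vLLM, one option is to start its OpenAI-compatible server (for example with `vllm serve AIDC-AI/Marco-Nano-Instruct`) and query it over HTTP. The client below is a minimal sketch using only the standard library. It assumes the server is reachable at vLLM's default `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_NAME = "AIDC-AI/Marco-Nano-Instruct"
BASE_URL = "http://localhost:8000/v1"  # assumed: vLLM's default host and port


def build_chat_request(prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion payload for a single user turn."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


def chat(prompt: str) -> str:
    """Send one request to a running server and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Inspect the payload without contacting a server.
print(json.dumps(build_chat_request("What is the capital of France?"), indent=2))
```

Once the server is running, `chat("What is the capital of France?")` returns the model's reply as a string.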

## Citation

```bibtex
@article{marco-moe,
  title={Marco-MoE: Open Multilingual Mixture-of-Expert Language Models with Efficient Upcycling},
  author={Fan Jiang and Yu Zhao and Chenyang Lyu and Tianqi Shi and Yichao Du and Feihu Jiang and Longyue Wang and Weihua Luo},
  year={2026}
}
```

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).