---
license: apache-2.0
license_link: https://huggingface.co/Qihoo360/Light-TLLM-7B/blob/main/LICENSE
language:
- en
- zh
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-7B
tags:
- machine-translation
- multilingual
- qwen2
library_name: transformers
---

# Light-TLLM-7B
<a href="https://huggingface.co/qihoo360/Light-TLLM-7B" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-FF6B6B" style="display: inline-block; vertical-align: middle;"/>
</a>

## Introduction

Light-TLLM-7B is a machine-translation-focused variant of Qwen2.5-7B developed by 360 AI Research.

**This repo contains the translation-specialized 7B model**, which has the following features:
- Type: Causal Language Models for Machine Translation
- Training Stage: Continued pretraining, curriculum SFT, and MtPO reinforcement learning
- Architecture: Transformer with RoPE, SwiGLU, RMSNorm, and attention QKV bias
- Number of Parameters: 7.61B (6.53B non-embedding)
- Number of Layers: 28
- Number of Attention Heads (GQA): 28 for Q and 4 for KV
- Context Length: Up to 131,072 tokens
- Vocabulary Size: 180,736 tokens with MtPO vocabulary expansion
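
As a rough illustration of what the GQA configuration above means for memory, the sketch below estimates the KV-cache size at different context lengths. The head dimension of 128 (hidden size 3584 divided by 28 query heads, as in Qwen2.5-7B) and the 2-byte cache element are assumptions that this card does not state.

```python
# Back-of-the-envelope KV-cache estimate for the configuration listed above.
NUM_LAYERS = 28        # from the feature list
NUM_KV_HEADS = 4       # from the feature list (GQA)
NUM_Q_HEADS = 28       # from the feature list
HEAD_DIM = 128         # assumption: Qwen2.5-7B hidden size 3584 / 28 heads
BYTES_PER_VALUE = 2    # assumption: bf16 or fp16 cache


def kv_cache_bytes(num_tokens: int, num_kv_heads: int = NUM_KV_HEADS) -> int:
    """Bytes needed to cache keys and values for `num_tokens` positions."""
    # Factor 2 covers one key tensor and one value tensor per layer.
    per_token = 2 * NUM_LAYERS * num_kv_heads * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


if __name__ == "__main__":
    full_context = 131_072
    gqa = kv_cache_bytes(full_context)
    mha = kv_cache_bytes(full_context, num_kv_heads=NUM_Q_HEADS)
    print(f"GQA cache at full context: {gqa / 2**30:.1f} GiB")
    print(f"Hypothetical 28 KV heads:  {mha / 2**30:.1f} GiB")
```

Under these assumptions, sharing 4 KV heads across 28 query heads makes the cache 7x smaller than it would be with one KV head per query head.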

## Requirements

Light-TLLM-7B uses the Qwen2 architecture, which is supported in recent releases of Hugging Face `transformers`. We recommend installing the latest version.

With `transformers<4.37.0`, loading the model fails with the following error:
```
KeyError: 'qwen2'
```
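
To catch this before loading any weights, a version check along the following lines can help. This is a minimal sketch: `version_tuple` and `supports_qwen2` are hypothetical helper names, and the parser only reads the leading integer of each version component, so pre-release suffixes are ignored.

```python
import re

MIN_VERSION = (4, 37, 0)  # first release line that avoids KeyError: 'qwen2'


def version_tuple(version: str) -> tuple:
    """Turn '4.37.0' or '4.38.0.dev0' into a comparable 3-tuple of integers."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at non-numeric components such as 'dev0'
        parts.append(int(match.group()))
    # Pad short versions such as '4.37' so tuple comparison behaves as expected.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def supports_qwen2(version: str) -> bool:
    """True when the given transformers version is at least 4.37.0."""
    return version_tuple(version) >= MIN_VERSION
```

In practice you would call `supports_qwen2(transformers.__version__)` after `import transformers` and upgrade the package if it returns `False`.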

## Quickstart

The following snippet shows how to load the tokenizer and model and run a translation request.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "qihoo360/Light-TLLM-7B"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example translation prompt
prompt = "Translate the following English text to Chinese: Hello, how are you today?"
messages = [
    {"role": "system", "content": "You are a professional translator. Translate the given text accurately and naturally."},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Tokenize the rendered chat prompt and move it to the model's device
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Sample a translation
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
# Drop the prompt tokens so that only the newly generated translation remains
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
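
For other language pairs, the message list can be built programmatically. The sketch below uses a hypothetical helper, `build_messages`, that reuses the system prompt and the prompt wording from the snippet above; this card does not state whether that wording is the recommended template for every pair, so treat it as an assumption.

```python
SYSTEM_PROMPT = (
    "You are a professional translator. "
    "Translate the given text accurately and naturally."
)


def build_messages(text: str, source_lang: str, target_lang: str) -> list:
    """Build a chat-format message list for one translation request."""
    prompt = f"Translate the following {source_lang} text to {target_lang}: {text}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": prompt},
    ]


if __name__ == "__main__":
    for target in ("Chinese", "Thai", "Khmer"):
        messages = build_messages("Hello, how are you today?", "English", target)
        print(messages[1]["content"])
```

The returned list can be passed to `tokenizer.apply_chat_template` exactly as `messages` is in the snippet above.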

## Training Pipeline (MtPO)

The MtPO pipeline runs in four stages, from tokenizer expansion to reinforcement learning alignment.

- **Stage 1 - Vocabulary expansion:** Extend the Qwen2.5 tokenizer with 3k-4k tokens per target language (Khmer, Lao, Mongolian, Myanmar, Tamil, Thai, Tibetan, Uyghur). FLORES-Plus diagnostics show 2.1x-5.4x compression gains, cutting Khmer token counts from 402 to 103 for representative passages.
- **Stage 2 - Balanced continued pretraining:** Continue training on 200B tokens with a 1:1 mix between English and the expanded low-resource corpus to preserve high-resource coverage while materially improving low-resource fluency.
- **Stage 3 - Curriculum SFT:** Train on a 7M-sample blend (5:1 general instructions vs. multilingual data) that progresses from base instruction-following to ASEAN translation and mixed-format prompts.
- **Stage 4 - MtPO reinforcement learning:** Optimize with entropy-tempered policy updates that keep sampling temperature consistent, apply asymmetric ratio clipping, and normalize advantages at the microbatch level to avoid length bias or entropy collapse.
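
Two of the Stage 4 ingredients can be illustrated in a few lines. This is a minimal sketch, not the MtPO implementation: it reads "asymmetric ratio clipping" as a PPO-style objective with different lower and upper clip bounds, the values `eps_low=0.2` and `eps_high=0.28` are hypothetical, and the entropy-tempering component is not shown.

```python
import math
import statistics


def normalize_advantages(rewards, eps=1e-8):
    """Center and scale the rewards of one microbatch to zero mean, unit variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantage, eps_low=0.2, eps_high=0.28):
    """Per-sample clipped objective with separate lower and upper clip bounds."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Taking the minimum keeps the pessimistic estimate, as in standard PPO.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    advantages = normalize_advantages([0.2, 0.5, 0.9, 0.4])
    for adv in advantages:
        print(round(clipped_surrogate(-1.0, -1.3, adv), 4))
```

Normalizing within the microbatch means every sample is scored relative to its immediate neighbours rather than to a global running statistic.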

## Verifiable Reward Guardrails

Reinforcement Learning with Verifiable Rewards (RLVR) combines the translation reward model with deterministic validators. During RL we sample K candidates per prompt, score them with RLVR, and keep the top-G diverse outputs for gradient updates. Each candidate is checked for:
- Length ratio safety relative to the source (default bounds 0.5-2.0 with soft penalties outside range)
- Structural token preservation for HTML, Markdown, and code blocks using lightweight parsers
- Target-language verification via a confidence-gated language ID classifier
- Code-mixing penalties that suppress unintended language drift

These verifiable rewards are added to the semantic score so bad outputs receive immediate negative credit, while high-quality candidates remain eligible for optimization.
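
The sketch below shows how two of these validators might be combined with a semantic score. It is illustrative only: the 0.5-2.0 bounds come from the list above, but the character-based length ratio, the linear penalty with `slope=1.0`, the regex used to find markup tokens, and the fixed `structure_penalty` are all assumptions, not the validators used in training.

```python
import re

# Hypothetical markup detector: HTML tags, code fences, bold markers, inline code.
TAG_PATTERN = re.compile(r"</?[a-zA-Z][^>]*>|```|\*\*|`")


def length_penalty(source, candidate, low=0.5, high=2.0, slope=1.0):
    """Return 0 inside the allowed length-ratio band and a negative value outside."""
    ratio = len(candidate) / max(len(source), 1)
    if ratio < low:
        return -slope * (low - ratio)
    if ratio > high:
        return -slope * (ratio - high)
    return 0.0


def structure_preserved(source, candidate):
    """Check that the same markup tokens occur equally often in both strings."""
    return sorted(TAG_PATTERN.findall(source)) == sorted(TAG_PATTERN.findall(candidate))


def total_reward(semantic_score, source, candidate, structure_penalty=1.0):
    """Add the validator terms to a semantic score supplied by the caller."""
    reward = semantic_score + length_penalty(source, candidate)
    if not structure_preserved(source, candidate):
        reward -= structure_penalty
    return reward


if __name__ == "__main__":
    src = "<b>Hello</b>"
    print(total_reward(0.8, src, "<b>你好啊呀</b>"))  # markup kept, length in range
    print(total_reward(0.8, src, "你好"))             # markup lost, too short
```

Because the checks are deterministic, a candidate that drops markup or drifts far outside the length band is penalized regardless of how the learned reward model scores it.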

## Data and Training Budget

The resources and evaluation suites used during MtPO development are summarized below.

- Continued pretraining: 200B tokens with adaptive sampling over English, ASEAN, Tibetan, Mongolian, Tamil, and Uyghur corpora
- Reinforcement learning: 60k steps, batch size 128, top-G candidate selection with RLVR filtering
- Reward model: Preference data spans ten error categories (accuracy, fluency, terminology, formatting, code-mixing, etc.)
- Benchmarks: FLORES-Plus (90 directions), BBH, CMMLU, HellaSwag, MMLU

## Model Details

- **Model Type**: Qwen2-based Causal Language Model
- **Language(s)**: Multilingual (English, Chinese, Khmer, Lao, Myanmar, Thai, Tibetan, Mongolian, Tamil, Malay, Indonesian, Filipino, Vietnamese, Uyghur, etc.)
- **License**: Apache 2.0
- **Finetuned from**: Qwen/Qwen2.5-7B
- **Model Size**: 7.61B parameters
- **Context Length**: 131,072 tokens

## Usage

This model is specifically designed for machine translation tasks. It can handle various translation scenarios including:

- English <-> Chinese translation
- Multilingual translation tasks
- Professional document translation
- Conversational translation

## Evaluation

### Translation and General Benchmarks

Light-TLLM-7B is evaluated on FLORES-Plus (90 directions) and on general benchmarks (BBH, CMMLU, HellaSwag, MMLU). Translation columns report sacreBLEU and benchmark columns report zero-shot accuracy in percent; higher is better for both.

| Model | Group | xx->en | en->xx | xx->xx | Avg. | BBH | CMMLU | HellaSwag | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemma3-27B-IT | Multilingual chat | **36.8** | 30.7 | 22.3 | 24.7 | 55.9 | 55.9 | 55.9 | **56.0** |
| Qwen3-8B | Multilingual chat | 31.1 | 23.3 | 14.4 | 16.9 | **63.8** | 60.8 | 26.0 | 51.3 |
| Qwen2.5-7B-Instruct | Multilingual chat | 24.8 | 17.4 | 9.2 | 11.6 | 54.4 | **64.1** | **85.2** | 40.9 |
| Apertus-8B-Instruct | Multilingual chat | 32.5 | 25.7 | 15.6 | 18.3 | 49.2 | 45.3 | 64.2 | 45.2 |
| Tower-Plus-9B | Multilingual chat | 28.2 | 18.3 | 9.8 | 12.5 | 40.4 | 57.2 | 73.1 | 42.1 |
| Qwen-MT-Plus | Translation-focused | 34.0 | 29.6 | 19.6 | 22.1 | - | - | - | - |
| Seed-X-PPO-7B | Translation-focused | 25.9 | 22.6 | 10.5 | 13.3 | - | - | - | - |
| Hunyuan-MT-7B | Translation-focused | 24.6 | 23.4 | 14.8 | 16.6 | - | - | - | - |
| Light-TLLM-7B-SFT | Our models | 35.4 | 32.0 | 22.7 | 24.3 | 59.6 | 61.4 | 83.7 | 47.2 |
| **Light-TLLM-7B-RL** | Our models | 36.1 | **32.7** | **23.1** | **24.9** | 60.9 | 63.2 | **85.2** | 48.5 |

- en->xx reaches 32.7 BLEU, +0.7 over the SFT checkpoint and +2.0 over the strongest external system in the table (Gemma3-27B-IT, 30.7), while MMLU improves by 1.3 points over SFT.
- Average BLEU across all FLORES-Plus directions rises to 24.9 despite the compact 7B footprint.

### Tokenizer Efficiency

Vocabulary expansion provides substantial compression on targeted scripts (higher compression ratio means fewer tokens per sentence).

| Language | Added tokens | Old compression ratio | New compression ratio | Compression gain (new / old) |
| --- | --- | --- | --- | --- |
| Khmer | 3712 | 0.85 | 3.49 | 4.09x |
| Lao | 3359 | 0.85 | 3.05 | 3.59x |
| Myanmar | 3226 | 0.69 | 2.87 | 4.17x |
| Thai | 2958 | 1.79 | 2.97 | 1.66x |
| Tibetan | 3920 | 0.75 | 4.03 | 5.39x |

- Khmer passages shrink from 402 tokens to 103 tokens in the running example used in the paper.
- Compression gains translate into lower latency and memory cost during decoding for low-resource scripts.
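
The last column of the table appears to be the new compression ratio divided by the old one; recomputing it from the rounded ratios reproduces each entry to within about 0.02. That relationship is inferred from the numbers rather than stated in this card. The sketch below recomputes it, along with the token reduction for the Khmer example.

```python
# (old ratio, new ratio) pairs copied from the table above.
RATIOS = {
    "Khmer": (0.85, 3.49),
    "Lao": (0.85, 3.05),
    "Myanmar": (0.69, 2.87),
    "Thai": (1.79, 2.97),
    "Tibetan": (0.75, 4.03),
}


def compression_gain(old_ratio: float, new_ratio: float) -> float:
    """How many times more text each token covers after vocabulary expansion."""
    return new_ratio / old_ratio


def token_reduction(old_tokens: int, new_tokens: int) -> float:
    """How many times fewer tokens a passage needs with the new tokenizer."""
    return old_tokens / new_tokens


if __name__ == "__main__":
    for language, (old, new) in RATIOS.items():
        print(f"{language}: {compression_gain(old, new):.2f}x")
    print(f"Khmer example passage: {token_reduction(402, 103):.2f}x fewer tokens")
```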

### Constraint Reliability (RLVR)

RLVR introduces deterministic checks that reduce failure modes compared with general chat models and MT baselines.

| Model | Language targeting | Length control | Format preservation | Code mixing | Overall |
| --- | --- | --- | --- | --- | --- |
| **Light-TLLM-7B-RL** | **97.8** | 99.2 | **92.15** | 92.3 | **95.3** |
| Qwen2.5-7B-Instruct | 92.0 | 97.0 | 51.8 | 62.8 | 75.9 |
| Gemma3-27B-IT | 97.4 | 91.6 | 42.1 | 90.9 | 80.5 |
| Qwen-MT-Plus | 97.6 | **99.8** | 82.5 | 94.8 | 93.6 |
| Seed-X-PPO-7B | 97.6 | 79.8 | 79.0 | 90.3 | 86.6 |
| DeepSeek-V3 | 95.4 | 95.7 | 67.6 | 95.0 | 88.4 |
| Hunyuan-MT-7B | 91.8 | 90.7 | 71.1 | **96.2** | 87.4 |

- Format retention jumps to 92.15 percent versus 51.8 percent for Qwen2.5-7B-Instruct, mitigating HTML or Markdown corruption.
- Language targeting stays above 97 percent while MtPO avoids verbosity by normalizing advantages at the microbatch level.
- Overall pass rate reaches 95.3 percent under identical constraint settings, surpassing Qwen2.5-7B-Instruct by 19.4 points, DeepSeek-V3 by 6.9 points, and Qwen-MT-Plus by 1.7 points.
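
This card does not define how the Overall column is computed. The reported values are consistent with an unweighted mean of the four checks, to within 0.1 point; the sketch below recomputes that mean as a sanity check. Treat this as an observation about the table, not as the documented metric.

```python
# Four per-check pass rates and the reported Overall, copied from the table above.
RESULTS = {
    "Light-TLLM-7B-RL": ((97.8, 99.2, 92.15, 92.3), 95.3),
    "Qwen2.5-7B-Instruct": ((92.0, 97.0, 51.8, 62.8), 75.9),
    "Gemma3-27B-IT": ((97.4, 91.6, 42.1, 90.9), 80.5),
    "Qwen-MT-Plus": ((97.6, 99.8, 82.5, 94.8), 93.6),
    "Seed-X-PPO-7B": ((97.6, 79.8, 79.0, 90.3), 86.6),
    "DeepSeek-V3": ((95.4, 95.7, 67.6, 95.0), 88.4),
    "Hunyuan-MT-7B": ((91.8, 90.7, 71.1, 96.2), 87.4),
}


def mean_pass_rate(checks) -> float:
    """Unweighted mean of the per-check pass rates."""
    return sum(checks) / len(checks)


if __name__ == "__main__":
    for model, (checks, reported) in RESULTS.items():
        print(f"{model}: mean={mean_pass_rate(checks):.2f} reported={reported}")
```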

### Per-Language FLORES Highlights

- **English->Thai:** 34.1 BLEU, +1.5 over Qwen-MT-Plus.
- **English->Myanmar:** 12.9 BLEU with stable length control.
- **English->Filipino:** 35.4 BLEU after MtPO, combining instruction fidelity and translation quality.
- **Khmer->English:** 44.7 BLEU, reflecting gains from tokenizer expansion.
- **Vietnamese->English:** 37.6 BLEU with consistent improvements across ASEAN language pairs.

## Citation

If you find our work helpful, please cite it.

```
@inproceedings{liu2026mtpo,
    title = {Light-TLLM-7B},
    author = {Light-MT Team},
    booktitle = {International Conference on Learning Representations},
    year = {2025},
    url = {https://huggingface.co/qihoo360/Light-TLLM-7B}
}
```

## Disclaimer

This model is provided for research and educational purposes. Please ensure responsible use and compliance with applicable laws and regulations when using this model.