---
base_model: Qwen/Qwen3-4B
language:
- tr
- en
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
tags:
- text-generation
- turkish
- legal
- turkish-legal
- mecellem
- qwen
- decoder-only
- continual-pretraining
- TRUBA
- MN5
---

# Mecellem-Qwen3-4B-TR

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

This repository contains the **Mecellem-Qwen3-4B-TR** model, as presented in the paper [Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain](https://huggingface.co/papers/2601.16018).

- **GitHub Repository:** [newmindai/mecellem-models](https://github.com/newmindai/mecellem-models)
- **Paper:** [arXiv:2601.16018](https://arxiv.org/abs/2601.16018)

## Model Description

Mecellem-Qwen3-4B-TR is a Turkish legal language model adapted through Continual Pre-training (CPT) on Turkish legal and official texts. It is based on the Qwen3-4B decoder architecture (4B parameters). Unlike the 1.7B model, which follows a four-phase curriculum, this model is trained in a single large-scale CPT phase on a comprehensive dataset, which suggests that larger models can benefit from direct large-scale domain adaptation.

**Key Features:**
- Single-phase, large-scale continual pre-training on approximately 270.8 billion tokens (270,791,712,595 tokens)
- Dataset includes Turkish legal sources (Yargıtay, Danıştay, YÖKTEZ) and general Turkish web data (FineWeb2, CulturaX)
- Preserves general language capabilities while injecting domain-specific legal knowledge

- **Model Type:** Decoder-only Language Model
- **Parameters:** 4B
- **Base Model:** Qwen/Qwen3-4B
- **Architecture:** Qwen3 decoder with grouped query attention (GQA)

### Architecture Details

- **Max Position Embeddings:** 40,960 tokens
- **Number of Layers:** 36 transformer layers
- **Hidden Size:** 2,560
- **FFN Hidden Size:** 9,728
- **Number of Heads:** 32
- **Number of KV Heads (GQA):** 8
- **Activation Function:** SwiGLU
- **Position Encodings:** RoPE (Rotary Position Embeddings)
- **Layer Norm:** RMSNorm
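
To make the effect of GQA concrete, the following sketch estimates the key/value cache footprint from the figures listed above. The per-head dimension is not stated in this card, so `HEAD_DIM = 128` is an assumption that should be checked against the model's `config.json`; the helper name `kv_cache_bytes_per_token` is illustrative and not part of any library.

```python
def kv_cache_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int = 2
) -> int:
    """Bytes of KV cache stored per token: keys plus values, across all layers."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Values taken from the architecture list above
NUM_LAYERS = 36
NUM_HEADS = 32
NUM_KV_HEADS = 8
MAX_POSITIONS = 40_960

# Assumption: head dimension is not listed in this card; verify in config.json
HEAD_DIM = 128

# Each KV head is shared by this many query heads
queries_per_kv_head = NUM_HEADS // NUM_KV_HEADS

# BF16 stores each value in 2 bytes
per_token = kv_cache_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)
full_context_gib = per_token * MAX_POSITIONS / 2**30

print(f"Query heads per KV head: {queries_per_kv_head}")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"KV cache at {MAX_POSITIONS} tokens: {full_context_gib:.3f} GiB")
```

Under these assumptions, sharing each KV head among four query heads makes the cache four times smaller than it would be with one KV head per query head.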

### Training Details

**Continual Pre-training (CPT):**
- **Total Training Tokens:** ~270.8 billion tokens (270,791,712,595 tokens)
- **Training Method:** Single-phase large-scale CPT
- **Framework:** NVIDIA NeMo with Megatron-Core
- **Precision:** BF16 mixed precision
- **Hardware Infrastructure:**
  - **System:** MareNostrum 5 ACC partition at Barcelona Supercomputing Center (BSC)
  - **Compute Nodes:** 100 nodes
  - **GPUs:** 400× NVIDIA Hopper H100 64GB GPUs (SXM), 4 per node
  - **Node Configuration:** Each node has 4× H100 GPUs, 80 CPU cores, and 512GB DDR5 memory
  - **Interconnect:** 800 Gb/s InfiniBand between nodes, which sustained the large-scale token flow and kept the long-running CPT job scalable and stable
  - **GPU Interconnect:** NVLink for intra-node communication among the 4 GPUs
  - **Distributed Training:** Data-parallel, multi-node, multi-GPU
  - **Hardware Utilization:** 18.7% median Model FLOPs Utilization (MFU), 2.57M tokens/sec throughput
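
As a rough sanity check on these figures, the sketch below derives per-GPU throughput and an idealized wall-clock time from the reported totals. It assumes that the 2.57M tokens/sec figure is aggregated over all 400 GPUs and sustained for the whole run; restarts, checkpointing, and validation are ignored, so the result is a lower bound rather than a reported training time.

```python
# Figures reported in this card
TOTAL_TOKENS = 270_791_712_595
THROUGHPUT_TOKENS_PER_SEC = 2.57e6  # assumed to be aggregate across all GPUs
NUM_GPUS = 400


def idealized_training_hours(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock hours if the throughput were sustained without interruption."""
    return total_tokens / tokens_per_sec / 3600


per_gpu_throughput = THROUGHPUT_TOKENS_PER_SEC / NUM_GPUS
hours = idealized_training_hours(TOTAL_TOKENS, THROUGHPUT_TOKENS_PER_SEC)

print(f"Per-GPU throughput: {per_gpu_throughput:,.0f} tokens/sec")
print(f"Idealized wall-clock time: {hours:.1f} hours")
```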

**Dataset Composition:**
- **Legal Sources:**
  - Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
  - Council of State (Danıştay): 151K sequences, ~0.11B tokens
  - Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
- **General Turkish Sources:**
  - FineWeb2: General Turkish web data
  - CulturaX: Multilingual corpus (Turkish subset)
  - Total general Turkish: 212M sequences, ~96.17B tokens
- **Additional Categories:** English, Mathematics, Python code, multilingual content (Spanish, Arabic, Russian, Chinese)

**Training Hyperparameters:**
- Sequence Length: 4,096 tokens
- Optimizer: Adam with cosine learning rate schedule
- Max Learning Rate: 5×10⁻⁵
- Min Learning Rate: 5×10⁻⁶
- Weight Decay: 0.01
- Warmup Steps: 7,675 steps
- Max Steps: 153,508 steps
- Global Batch Size: 400
- Per-GPU Batch Size: 1
- Gradient Accumulation: 16
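
The learning rate schedule implied by these values can be sketched as follows. The card states only that the schedule is cosine with 7,675 warmup steps; the linear warmup from zero and the exact decay formula below are assumptions, and the function `learning_rate` is illustrative rather than the NeMo scheduler used in training.

```python
import math

# Values from the hyperparameter list above
MAX_LR = 5e-5
MIN_LR = 5e-6
WARMUP_STEPS = 7_675
MAX_STEPS = 153_508


def learning_rate(step: int) -> float:
    """Learning rate at a given step: warmup, then cosine decay to MIN_LR."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 to MAX_LR
        return MAX_LR * step / WARMUP_STEPS
    # Fraction of the decay phase completed, clamped to [0, 1]
    progress = min(1.0, (step - WARMUP_STEPS) / (MAX_STEPS - WARMUP_STEPS))
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, MAX_STEPS // 2, MAX_STEPS):
    print(f"step {step:>7}: lr = {learning_rate(step):.2e}")
```

The warmup covers roughly 5% of the run (7,675 of 153,508 steps), after which the rate decays from the maximum to the minimum value.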

### Training Visualization

The following visualizations show the model's training progress and dataset distribution:

![Dataset Distribution](qwen4b_dataset.png)

*Qwen3-4B CPT dataset distribution for the single training phase.*

![Training Loss](qwen4b_loss.png)

*Qwen3-4B CPT Training and Validation Loss Curves. The model shows consistent improvement throughout training.*

### Benchmark Performance

The model was evaluated using the Muhakim reward model on Turkish legal tasks:

![Benchmark Performance](4b_qwen_armo.png)

*Benchmark Performance of 4B Decoder-Only Models Across Context Lengths Using the Muhakim Reward Model. Mecellem-Qwen3-4B-TR consistently outperforms the base Qwen3-4B model across all five legal quality objectives.*

### Rewards Comparison Analysis

The following visualization compares rewards across different token lengths for base vs CPT models:

![Rewards Comparison](comparison_rewards_by_token_length-filtered.png)

*Rewards Comparison: Base vs CPT Models Across Token Lengths. Mecellem-Qwen3-4B-TR shows consistent improvements over the base model across all context length settings, demonstrating the effectiveness of Turkish legal domain adaptation.*


## Usage

### Installation

```bash
# Qwen3 architectures require transformers 4.51.0 or newer
pip install "transformers>=4.51.0" torch
```

### Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

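# Use a GPU when available
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and tokenizer; BF16 matches the training precision
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-4B-TR")
model = AutoModelForCausalLM.from_pretrained(
    "newmindai/Mecellem-Qwen3-4B-TR", torch_dtype=torch.bfloat16
).to(device)
model.eval()

# Example prompt ("Contract termination in the Turkish legal system")
prompt = "Türk hukuk sisteminde sözleşme feshi"
# Move the input tensors to the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(device)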

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Use Cases

- Turkish legal text generation
- Legal document summarization
- Legal question answering
- Legal text completion
- Domain-specific language modeling for Turkish legal domain
- Retrieval-Augmented Generation (RAG) applications

## Acknowledgments

This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.

The numerical calculations reported in this work were fully/partially performed at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge the know-how provided by the MINERVA Support for expert guidance and collaboration opportunities in HPC-AI integration.

## References

If you use this model, please cite our paper:

```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Base Model References

```bibtex
@article{qwen3,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2505.09388},
  year={2025}
}
```