---
base_model: Qwen/Qwen3-1.7B
language:
- tr
- en
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- text-generation
- turkish
- legal
- turkish-legal
- mecellem
- qwen
- decoder-only
- continual-pretraining
- TRUBA
- MN5
---

# Mecellem-Qwen3-1.7B-TR

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)

Mecellem-Qwen3-1.7B-TR is a Turkish legal language model presented in [Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain](https://huggingface.co/papers/2601.16018).

**Resources:**
- **Code:** [GitHub Repository](https://github.com/newmindai/mecellem-models)
- **Paper:** [arXiv:2601.16018](https://arxiv.org/abs/2601.16018)

## Model Description

Mecellem-Qwen3-1.7B-TR is a Turkish legal language model adapted through Continual Pre-training (CPT) on Turkish legal and official texts. The model is based on the Qwen3-1.7B decoder architecture (1.7B parameters) and trained with a four-phase curriculum learning strategy designed to account for Turkish linguistic complexity. The CPT process progressively transitions from general-purpose texts to domain-specific legal content, achieving a 36.2% perplexity reduction on Turkish legal text relative to the base Qwen3-1.7B model.

**Key Features:**
- Continual pre-training on approximately 250 billion tokens across four phases
- Four-phase curriculum learning:
  - Phase 1: ~3.7B tokens
  - Phase 2: ~57B tokens
  - Phase 3: ~165B tokens
  - Phase 4: ~24.9B tokens
- Dataset includes Turkish legal sources (Yargıtay, Danıştay, YÖKTEZ) and general Turkish web data (FineWeb2, CulturaX)
- Preserves general language capabilities while injecting domain-specific legal knowledge

**Model Type:** Decoder-only Language Model
**Parameters:** 1.7B
**Base Model:** Qwen/Qwen3-1.7B
**Architecture:** Qwen3 decoder with grouped query attention (GQA)

### Architecture Details

- **Max Position Embeddings:** 40,960 tokens
- **Number of Layers:** 28 transformer layers
- **Hidden Size:** 2,048
- **FFN Hidden Size:** 6,144
- **Number of Heads:** 16
- **Number of KV Heads (GQA):** 8
- **Activation Function:** SwiGLU
- **Position Encodings:** RoPE (Rotary Position Embeddings)
- **Layer Norm:** RMSNorm

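These values can be cross-checked against the published configuration. A minimal sketch, assuming the repository ships a standard Qwen3 `config.json` readable by `transformers.AutoConfig` (attribute names follow the Qwen3 config schema):

```python
from transformers import AutoConfig

# Load only the configuration; no weights are downloaded.
config = AutoConfig.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

# These attributes should match the numbers listed above.
print(config.max_position_embeddings)  # 40960
print(config.num_hidden_layers)        # 28
print(config.hidden_size)              # 2048
print(config.intermediate_size)        # 6144
print(config.num_attention_heads)      # 16
print(config.num_key_value_heads)      # 8 (GQA)
```
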
### Training Details

**Continual Pre-training (CPT):**
- **Total Training Tokens:** ~250 billion tokens (250,739,476,454 tokens across four phases)
- **Training Method:** Four-phase curriculum learning
- **Framework:** NVIDIA NeMo with Megatron-Core
- **Hardware:** MareNostrum 5 supercomputer (BSC), H100 GPUs
- **Precision:** BF16

**Dataset Composition:**
- **Legal Sources:**
  - Court of Cassation (Yargıtay): 10.3M sequences, ~3.43B tokens
  - Council of State (Danıştay): 151K sequences, ~0.11B tokens
  - Academic theses (YÖKTEZ): 21.1M sequences, ~9.61B tokens (after DocsOCR processing)
- **General Turkish Sources:**
  - FineWeb2: General Turkish web data
  - CulturaX: Multilingual corpus (Turkish subset)
  - Total general Turkish: 212M sequences, ~96.17B tokens
- **Additional Categories:** English, Mathematics, Python code, multilingual content (Spanish, Arabic, Russian, Chinese)

**Phase 1 (~3.7B tokens):**
- Focus: Short, general-purpose Turkish texts
- Purpose: Adapt the model to Turkish language patterns while maintaining stability
- Learning Rate: Higher, with extended warmup
- Dataset: Academic-focused data with semantic deduplication and FineWeb quality filtering

**Phase 2 (~57B tokens):**
- Focus: Legal content with domain-specific terminology
- Includes: Court decisions, legal articles, regulatory documents
- Data Replay: YÖKTEZ academic legal data from Phase 1
- Dataset: Lighter pipeline with FineWeb quality filtering, preserving topical diversity

**Phase 3 (~165B tokens):**
- Focus: Long, structurally complex normative texts
- Includes: Full court decisions, legislative documents, academic legal theses
- Purpose: Refine the model's understanding of legal reasoning patterns
- Dataset: Long-form documents with merged consecutive pages

**Phase 4 (~24.9B tokens):**
- Focus: Extended domain-specific refinement
- Includes: Mixed-complexity documents
- Purpose: Consolidate knowledge and improve generalization

**Training Hyperparameters:**
- Sequence Length: 4,096 tokens
- Optimizer: Adam with a cosine learning rate schedule (sketched below)
- Max Learning Rate: 5×10⁻⁵
- Min Learning Rate: 5×10⁻⁶
- Weight Decay: 0.01
- Warmup Steps: Phase-dependent (200-2,340 steps)
- Precision: BF16 mixed precision
- Framework: NVIDIA NeMo with Megatron-Core

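The schedule shape implied by these settings is linear warmup followed by cosine decay between the listed maximum and minimum learning rates. A minimal sketch of that shape; the warmup and total step counts used below are illustrative placeholders, not values taken from the training configs:

```python
import math

def lr_at_step(step, warmup_steps, total_steps, max_lr=5e-5, min_lr=5e-6):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear warmup from 0 to max_lr.
        return max_lr * step / max(warmup_steps, 1)
    # Cosine decay from max_lr to min_lr over the remaining steps.
    progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative step counts only.
print(lr_at_step(0, warmup_steps=2000, total_steps=50000))      # 0.0
print(lr_at_step(2000, warmup_steps=2000, total_steps=50000))   # 5e-05 (end of warmup)
print(lr_at_step(50000, warmup_steps=2000, total_steps=50000))  # 5e-06 (fully decayed)
```
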
**Hardware Infrastructure:**
- **System:** MareNostrum 5 ACC partition at the Barcelona Supercomputing Center (BSC)
- **Node Configuration:** Each node equipped with 4× NVIDIA Hopper H100 64GB GPUs (SXM), 80 CPU cores, and 512GB DDR5 memory
- **Interconnect:** 800 Gb/s InfiniBand for distributed training
- **GPU Interconnect:** NVLink for intra-node communication among the 4 GPUs per node
- **Distributed Training:** Data-parallel multi-node, multi-GPU architecture with 4 GPUs per node
- **InfiniBand Network:** Sustained the large-scale token flow across nodes, providing the scalability and stability needed for long-running CPT
- **Phase-Specific Hardware:** (a rough cross-check of the reported MFU figures follows this list)
  - **Phase 1:** 50 nodes, 200 GPUs, ~3.7B tokens, 3.77M tokens/sec throughput, 20.7% median MFU
  - **Phase 2:** 50 nodes, 200 GPUs, ~57B tokens, 3.59M tokens/sec throughput, 20.7% median MFU
  - **Phase 3:** 100 nodes, 400 GPUs, ~165B tokens, 7.35M tokens/sec throughput, 20.3% median MFU
  - **Phase 4:** 50 nodes, 200 GPUs, ~24.9B tokens, 3.25M tokens/sec throughput, 20.6% median MFU

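The reported throughput and MFU numbers are roughly self-consistent. A back-of-the-envelope check, assuming the common 6·N·D FLOPs-per-token approximation and a dense BF16 peak of about 989 TFLOP/s per H100 SXM (both are assumptions; the approximation ignores attention FLOPs, so it lands slightly below the reported medians):

```python
PARAMS = 1.7e9              # model parameters (N)
PEAK_BF16_FLOPS = 989e12    # assumed dense BF16 peak per H100 SXM, in FLOP/s

def approx_mfu(tokens_per_sec, num_gpus):
    """Model FLOPs utilization via the 6 * N * D training-FLOPs approximation."""
    achieved = 6 * PARAMS * tokens_per_sec   # training FLOP/s actually spent
    peak = num_gpus * PEAK_BF16_FLOPS        # aggregate peak FLOP/s of the job
    return achieved / peak

print(f"Phase 1: {approx_mfu(3.77e6, 200):.1%}")  # ~19.4%, vs 20.7% reported
print(f"Phase 3: {approx_mfu(7.35e6, 400):.1%}")  # ~19.0%, vs 20.3% reported
```
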
**Catastrophic Forgetting Mitigation:**
- Curriculum learning: Progressive transition from general to specialized knowledge
- Replay buffer: YÖKTEZ data from Phase 1 included in Phase 2
- Conservative learning rates and extended warmup periods

**Performance:** The model achieves a 36.2% perplexity reduction on Turkish legal text compared to the base Qwen3-1.7B model.

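A perplexity comparison of this kind can be reproduced along the following lines. A minimal sketch only: `legal_eval_texts` is a placeholder for a held-out Turkish legal corpus, and the paper's exact windowing and averaging conventions may differ, so absolute numbers will vary:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id, texts, max_length=4096):
    """Mean per-text perplexity of a causal LM, teacher-forced on each text."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    model.eval()
    losses = []
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])  # loss = mean next-token NLL
        losses.append(out.loss.item())
    return torch.tensor(losses).mean().exp().item()

legal_eval_texts = ["..."]  # placeholder for held-out Turkish legal documents
ppl_base = perplexity("Qwen/Qwen3-1.7B", legal_eval_texts)
ppl_cpt = perplexity("newmindai/Mecellem-Qwen3-1.7B-TR", legal_eval_texts)
print(f"Relative perplexity reduction: {(ppl_base - ppl_cpt) / ppl_base:.1%}")
```
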
### Training Visualization

The following visualizations show the model's training progress and dataset distribution:

*Qwen3-1.7B CPT Dataset Distribution across Four Phases. The curriculum learning strategy progressively introduces more complex legal content.*

*Qwen3-1.7B CPT Training and Validation Loss Across Four Phases. The model shows consistent improvement throughout all training phases.*

### Benchmark Performance

The model was evaluated using the Muhakim reward model on Turkish legal tasks:

*Benchmark Performance of 1.7B Decoder-Only Models Across Context Lengths Using the Muhakim Reward Model. Mecellem-Qwen3-1.7B-TR consistently outperforms the base Qwen3-1.7B model across all five legal quality objectives, with particularly pronounced gains for depth of coverage, statute reference usage, and legal accuracy.*

### Rewards Comparison Analysis

The following visualization compares rewards across token lengths for the base and CPT models:

*Rewards Comparison: Base vs CPT Models Across Token Lengths. Mecellem-Qwen3-1.7B-TR shows consistent improvements over the base model across all context length settings, demonstrating the effectiveness of Turkish legal domain adaptation.*

## Usage

### Installation

```bash
pip install transformers torch
```

### Text Generation

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
model = AutoModelForCausalLM.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

# Example prompt
prompt = "Türk hukuk sisteminde sözleşme feshi"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        temperature=0.7,
        do_sample=True,
        top_p=0.9
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

### Chat Format

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
model = AutoModelForCausalLM.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

messages = [
    {"role": "user", "content": "Türk hukuk sisteminde sözleşme feshi nasıl yapılır?"}
]

# Apply chat template
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

# Generate response
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens (drop the prompt portion)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```

## Use Cases

- Turkish legal text generation
- Legal document summarization
- Legal question answering
- Legal text completion
- Domain-specific language modeling for the Turkish legal domain
- Retrieval-Augmented Generation (RAG) applications (see the sketch below)

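For RAG-style use, a minimal prompt-stuffing sketch is shown below. The `retrieve` function is a hypothetical stand-in for any external retriever (BM25, dense embeddings, etc.), and the Turkish prompt wording is illustrative:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")
model = AutoModelForCausalLM.from_pretrained("newmindai/Mecellem-Qwen3-1.7B-TR")

def answer_with_context(question, retrieved_passages, max_new_tokens=256):
    # Stuff the retrieved legal passages into the prompt ahead of the question.
    context = "\n\n".join(retrieved_passages)
    messages = [{"role": "user",
                 "content": f"Bağlam:\n{context}\n\nSoru: {question}"}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Return only the newly generated answer tokens.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# question = "Sözleşme feshi için ihtar şartı nedir?"
# answer = answer_with_context(question, retrieve(question))  # `retrieve` is hypothetical
```
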
## Acknowledgments

This work was supported by the EuroHPC Joint Undertaking through project etur46 with access to the MareNostrum 5 supercomputer, hosted by the Barcelona Supercomputing Center (BSC), Spain. MareNostrum 5 is owned by EuroHPC JU and operated by BSC. We are grateful to the BSC support team for their assistance with job scheduling, environment configuration, and technical guidance throughout the project.

The numerical calculations reported in this work were performed in part at TÜBİTAK ULAKBİM, High Performance and Grid Computing Center (TRUBA resources). The authors gratefully acknowledge MINERVA Support for their expert guidance and for collaboration opportunities in HPC-AI integration.

## References

If you use this model, please cite our paper:

```bibtex
@article{mecellem2026,
  title={Mecellem Models: Turkish Models Trained from Scratch and Continually Pre-trained for the Legal Domain},
  author={Uğur, Özgür and Göksu, Mahmut and Çimen, Mahmut and Yılmaz, Musa and Şavirdi, Esra and Demir, Alp Talha and Güllüce, Rumeysa and Çetin, İclal and Sağbaş, Ömer Can},
  journal={arXiv preprint arXiv:2601.16018},
  year={2026},
  month={January},
  url={https://arxiv.org/abs/2601.16018},
  doi={10.48550/arXiv.2601.16018},
  eprint={2601.16018},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

### Base Model References

```bibtex
@article{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  journal={arXiv preprint arXiv:2505.09388},
  year={2025},
  url={https://arxiv.org/abs/2505.09388}
}
```