---
datasets:
- louisbrulenaudet/Romulus-cpt-fr
license: llama3
language:
- fr
base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- law
- droit
- unsloth
- trl
- transformers
- sft
- llama
---
<img src="assets/thumbnail.webp">

# Romulus, continually pre-trained models for French law

Romulus is a series of continually pre-trained models enriched with French legal text and intended to serve as a base for fine-tuning on labeled data. Note that these models have not been aligned to produce usable text as they stand; they will need to be fine-tuned on the target tasks to give satisfactory results.

The training corpus contains 34,864,949 tokens (counted with the meta-llama/Meta-Llama-3.1-8B-Instruct tokenizer).
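A count of this kind can be obtained by summing the tokenized length of every document in the corpus. The sketch below is only an illustration, not the script that produced the figure above: `count_tokens` is a hypothetical helper, the sample strings are made up, and a whitespace splitter stands in for the real tokenizer so the example runs without downloading the gated model.

```python
from typing import Callable, Iterable, Sequence


def count_tokens(texts: Iterable[str], encode: Callable[[str], Sequence]) -> int:
    """Sum the number of tokens that `encode` produces over all texts."""
    return sum(len(encode(text)) for text in texts)


# Stand-in tokenizer for illustration only: splits on whitespace.
sample = ["Premier document de test", "Un second document un peu plus long"]
whitespace_total = count_tokens(sample, str.split)
print(whitespace_total)
```

With the actual tokenizer, one could pass `lambda t: tokenizer(t)["input_ids"]` as `encode`, where `tokenizer` comes from `AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B-Instruct")`. Whether special tokens were included in the reported total is not stated here, so an independent recount may differ slightly.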

## Hyperparameters

The following table outlines the key hyperparameters used for training Romulus.

| **Parameter**                   | **Description**                                                 | **Value**                   |
|----------------------------------|-----------------------------------------------------------------|-----------------------------|
| `max_seq_length`                 | Maximum sequence length for the model                           | 4096                        |
| `load_in_4bit`                   | Whether to load the model in 4-bit precision                    | False                       |
| `model_name`                     | Pre-trained model name from Hugging Face                        | meta-llama/Meta-Llama-3.1-8B-Instruct|
| `r`                              | Rank of the LoRA adapter                                        | 128                         |
| `lora_alpha`                     | Alpha value for the LoRA module                                 | 32                          |
| `lora_dropout`                   | Dropout rate for LoRA layers                                    | 0                           |
| `bias`                           | Bias type for LoRA adapters                                     | none                        |
| `use_gradient_checkpointing`     | Whether to use gradient checkpointing                           | unsloth                     |
| `train_batch_size`               | Per device training batch size                                  | 8                           |
| `gradient_accumulation_steps`    | Number of gradient accumulation steps                           | 8                           |
| `warmup_ratio`                   | Warmup steps as a fraction of total steps                       | 0.1                         |
| `num_train_epochs`               | Number of training epochs                                       | 1                           |
| `learning_rate`                  | Learning rate for the model                                     | 5e-5                        |
| `embedding_learning_rate`        | Learning rate for embeddings                                    | 1e-5                        |
| `optim`                          | Optimizer used for training                                     | adamw_8bit                  |
| `weight_decay`                   | Weight decay to prevent overfitting                             | 0.01                        |
| `lr_scheduler_type`              | Type of learning rate scheduler                                 | linear                      |
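
Two quantities follow from these values: the effective batch size per optimizer step and the scaling applied to the LoRA update. The short calculation below is a sketch under two assumptions: training ran on a single device, and the scaling follows the rank-stabilized LoRA rule, since the training script sets `use_rslora=True`.

```python
import math

per_device_train_batch_size = 8
gradient_accumulation_steps = 8
max_seq_length = 4096
r = 128
lora_alpha = 32

# Sequences contributing to one optimizer step on a single device.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Upper bound on tokens per optimizer step (every sequence at maximum length).
max_tokens_per_step = effective_batch_size * max_seq_length

# Classic LoRA scales the update by alpha / r;
# rank-stabilized LoRA (rsLoRA) scales it by alpha / sqrt(r).
lora_scaling = lora_alpha / r
rslora_scaling = lora_alpha / math.sqrt(r)

print(effective_batch_size)      # 64
print(max_tokens_per_step)       # 262144
print(lora_scaling)              # 0.25
print(round(rslora_scaling, 3))  # 2.828
```

With `r=128`, the rsLoRA rule gives a scaling roughly eleven times larger than the classic `alpha / r` rule would, which is worth keeping in mind when comparing `lora_alpha` with other LoRA configurations.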

## Training script

Romulus was trained with Unsloth on an NVIDIA H100 instance in the Azure East US region, provided by the Microsoft for Startups program, using the following script:

```python
# -*- coding: utf-8 -*-

from datasets import load_dataset
from unsloth import (
    FastLanguageModel,
    is_bfloat16_supported,
    UnslothTrainer,
    UnslothTrainingArguments,
)

max_seq_length = 4096
dtype = None
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="hf_token",  # placeholder: replace with your Hugging Face access token
)

model = FastLanguageModel.get_peft_model(
    model,
    r=128,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "embed_tokens",
        "lm_head",
    ],
    lora_alpha=32,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    use_rslora=True,
    loftq_config=None,
)

prompt = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    """
    Format dataset examples into prompts for continued pre-training.

    This function takes a batch of examples containing references and texts,
    combines them into formatted prompts, and appends an end-of-sequence token.

    Parameters
    ----------
    examples : dict
        A dictionary containing two keys:
        - 'ref': A list of references.
        - 'texte': A list of corresponding text content.

    Returns
    -------
    dict
        A dictionary with a single key 'text', containing a list of formatted prompts.

    Notes
    -----
    - The function assumes the existence of a global `prompt` variable, which is a
      formatting string used to combine the reference and text.
    - The function also assumes the existence of a global `EOS_TOKEN` variable,
      which is appended to the end of each formatted prompt.
    - The input lists 'ref' and 'texte' are expected to have the same length.

    Examples
    --------
    >>> examples = {
    ...     'ref': ['Reference 1', 'Reference 2'],
    ...     'texte': ['Content 1', 'Content 2']
    ... }
    >>> formatting_prompts_func(examples)
    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
    """
    refs = examples["ref"]
    texts = examples["texte"]
    outputs = []

    for ref, text in zip(refs, texts):
        text = prompt.format(ref, text) + EOS_TOKEN
        outputs.append(text)

    return {
        "text": outputs,
    }


cpt_dataset = load_dataset(
    "louisbrulenaudet/Romulus-cpt-fr",
    split="train",
    token="hf_token",  # placeholder: replace with your Hugging Face access token
)

cpt_dataset = cpt_dataset.map(
    formatting_prompts_func,
    batched=True,
)

trainer = UnslothTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=cpt_dataset,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    dataset_num_proc=2,
    args=UnslothTrainingArguments(
        per_device_train_batch_size=8,
        gradient_accumulation_steps=8,
        warmup_ratio=0.1,
        num_train_epochs=1,
        learning_rate=5e-5,
        embedding_learning_rate=1e-5,
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=1,
        report_to="wandb",
        save_steps=350,
        run_name="romulus-cpt",
        optim="adamw_8bit",
        weight_decay=0.01,
        lr_scheduler_type="linear",
        seed=3407,
        output_dir="outputs",
    ),
)

trainer_stats = trainer.train()
```
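
To make the training format concrete, the sketch below applies the same template to a single made-up example. The reference and content strings are hypothetical, and `"<EOS>"` is a stand-in for `tokenizer.eos_token`, so the example runs without loading the model.

```python
# Same template as the training script above.
PROMPT = """### Référence :
{}
### Contenu :
{}"""

EOS_TOKEN = "<EOS>"  # assumption: stand-in for tokenizer.eos_token


def format_example(ref: str, texte: str) -> str:
    """Build one training string: template filled with ref/texte, then EOS."""
    return PROMPT.format(ref, texte) + EOS_TOKEN


# Hypothetical record, for illustration only.
formatted = format_example("Code civil, art. 1", "Texte de l'article.")
print(formatted)
```

Because every training document follows this layout, starting a prompt with `### Référence :`, a reference, and `### Contenu :` is a plausible way to elicit a continuation in the same style, though the model remains a base for further fine-tuning rather than a ready-to-use generator.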

<img src="assets/loss.png">

## Citing & Authors

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2024,
  author =       {Louis Brulé Naudet},
  title =        {Romulus, continually pre-trained models for French law},
  year =         {2024},
  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).