Initialize project; model provided by the ModelHub XC community

Model: louisbrulenaudet/Romulus-cpt-Llama-3.1-8B-v0.1-Instruct Source: Original Platform
2026-06-14 01:04:13 +08:00
commit 6bd1e782c9
15 changed files with 2722 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,221 @@
+---
+datasets:
+- louisbrulenaudet/Romulus-cpt-fr
+license: llama3
+language:
+- fr
+base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
+pipeline_tag: text-generation
+library_name: transformers
+tags:
+- law
+- droit
+- unsloth
+- trl
+- transformers
+- sft
+- llama
+---
+<img src="assets/thumbnail.webp">
+
+# Romulus, continually pre-trained models for French law.
+
+Romulus is a series of continually pre-trained models enriched with French law, intended as a base for fine-tuning on labeled data. These models have not been aligned to produce usable text as they stand; they will need to be fine-tuned on the target task to produce satisfactory results.
+
+The training corpus contains about 34.9 million tokens (34,864,949, counted with the meta-llama/Meta-Llama-3.1-8B-Instruct tokenizer).
+
+## Hyperparameters
+
+The following table outlines the key hyperparameters used for training Romulus.
+
+| **Parameter**                   | **Description**                                                 | **Value**                   |
+|----------------------------------|-----------------------------------------------------------------|-----------------------------|
+| `max_seq_length`                 | Maximum sequence length for the model                           | 4096                        |
+| `load_in_4bit`                   | Whether to load the model in 4-bit precision                    | False                       |
+| `model_name`                     | Pre-trained model name from Hugging Face                        | meta-llama/Meta-Llama-3.1-8B-Instruct |
+| `r`                              | Rank of the LoRA adapter                                        | 128                         |
+| `lora_alpha`                     | Alpha value for the LoRA module                                 | 32                          |
+| `lora_dropout`                   | Dropout rate for LoRA layers                                    | 0                           |
+| `bias`                           | Bias type for LoRA adapters                                     | none                        |
+| `use_gradient_checkpointing`     | Whether to use gradient checkpointing                           | unsloth                     |
+| `per_device_train_batch_size`    | Per device training batch size                                  | 8                           |
+| `gradient_accumulation_steps`    | Number of gradient accumulation steps                           | 8                           |
+| `warmup_ratio`                   | Warmup steps as a fraction of total steps                       | 0.1                         |
+| `num_train_epochs`               | Number of training epochs                                       | 1                           |
+| `learning_rate`                  | Learning rate for the model                                     | 5e-5                        |
+| `embedding_learning_rate`        | Learning rate for embeddings                                    | 1e-5                        |
+| `optim`                          | Optimizer used for training                                     | adamw_8bit                  |
+| `weight_decay`                   | Weight decay to prevent overfitting                             | 0.01                        |
+| `lr_scheduler_type`              | Type of learning rate scheduler                                 | linear                      |
+
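As a quick sanity check on the table above, the sketch below derives two quantities that follow from these values: the effective batch size per optimizer step and the LoRA scaling factor. It assumes a single GPU (`num_devices = 1` is an assumption; the README mentions one H100 instance but does not state the device count) and uses the `use_rslora=True` setting from the training script, under which the adapter update is scaled by `lora_alpha / sqrt(r)` instead of the classic `lora_alpha / r`.

```python
import math

# Values copied from the hyperparameter table.
per_device_train_batch_size = 8
gradient_accumulation_steps = 8
max_seq_length = 4096
r = 128
lora_alpha = 32

# Assumption: training ran on a single GPU.
num_devices = 1

# Sequences contributing to each optimizer step.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Upper bound on tokens per optimizer step (reached only if every
# sequence fills the full context window).
max_tokens_per_step = effective_batch_size * max_seq_length

# Classic LoRA scaling versus rank-stabilized LoRA scaling.
classic_scaling = lora_alpha / r
rslora_scaling = lora_alpha / math.sqrt(r)

print(f"effective batch size: {effective_batch_size}")
print(f"max tokens per step:  {max_tokens_per_step}")
print(f"classic LoRA scaling: {classic_scaling}")
print(f"rsLoRA scaling:       {rslora_scaling:.4f}")
```

With `r=128`, the rank-stabilized factor is more than ten times larger than the classic one, which is worth keeping in mind when reusing `lora_alpha=32` with `use_rslora=False`.
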
+## Training script
+
+Romulus was trained with Unsloth on an NVIDIA H100 instance in the Azure East US region, provided by the Microsoft for Startups program, using the following script:
+
+```python
+# -*- coding: utf-8 -*-
+import os
+
+from typing import (
+    Dict,
+)
+
+from datasets import load_dataset
+from unsloth import (
+    FastLanguageModel,
+    is_bfloat16_supported,
+    UnslothTrainer,
+    UnslothTrainingArguments,
+)
+
+max_seq_length = 4096
+dtype = None
+load_in_4bit = False
+
+model, tokenizer = FastLanguageModel.from_pretrained(
+    model_name="meta-llama/Meta-Llama-3.1-8B-Instruct",
+    max_seq_length=max_seq_length,
+    dtype=dtype,
+    load_in_4bit=load_in_4bit,
+    token="hf_token",  # placeholder: replace with your Hugging Face access token
+)
+
+model = FastLanguageModel.get_peft_model(
+    model,
+    r=128,
+    target_modules=[
+        "q_proj",
+        "k_proj",
+        "v_proj",
+        "o_proj",
+        "gate_proj",
+        "up_proj",
+        "down_proj",
+        "embed_tokens",
+        "lm_head",
+    ],
+    lora_alpha=32,
+    lora_dropout=0,
+    bias="none",
+    use_gradient_checkpointing="unsloth",
+    random_state=3407,
+    use_rslora=True,
+    loftq_config=None,
+)
+
+prompt = """### Référence :
+{}
+### Contenu :
+{}"""
+
+EOS_TOKEN = tokenizer.eos_token
+
+def formatting_prompts_func(examples):
+    """
+    Format input examples into prompts for a language model.
+
+    This function takes a dictionary of examples containing references and texts,
+    combines them into formatted prompts, and appends an end-of-sequence token.
+
+    Parameters
+    ----------
+    examples : dict
+        A dictionary containing two keys:
+        - 'ref': A list of references.
+        - 'texte': A list of corresponding text content.
+
+    Returns
+    -------
+    dict
+        A dictionary with a single key 'text', containing a list of formatted prompts.
+
+    Notes
+    -----
+    - The function assumes the existence of a global `prompt` variable, which is a
+      formatting string used to combine the reference and text.
+    - The function also assumes the existence of a global `EOS_TOKEN` variable,
+      which is appended to the end of each formatted prompt.
+    - The input lists 'ref' and 'texte' are expected to have the same length.
+
+    Examples
+    --------
+    >>> examples = {
+    ...     'ref': ['Reference 1', 'Reference 2'],
+    ...     'texte': ['Content 1', 'Content 2']
+    ... }
+    >>> formatting_prompts_func(examples)
+    {'text': ['<formatted_prompt_1><EOS>', '<formatted_prompt_2><EOS>']}
+    """
+    refs = examples["ref"]
+    texts = examples["texte"]
+    outputs = []
+
+    for ref, text in zip(refs, texts):
+        text = prompt.format(ref, text) + EOS_TOKEN
+        outputs.append(text)
+
+    return {
+        "text": outputs,
+    }
+
+
+cpt_dataset = load_dataset(
+    "louisbrulenaudet/Romulus-cpt-fr",
+    split="train",
+    token="hf_token",  # placeholder: replace with your Hugging Face access token
+)
+
+cpt_dataset = cpt_dataset.map(
+    formatting_prompts_func,
+    batched=True,
+)
+
+trainer = UnslothTrainer(
+    model=model,
+    tokenizer=tokenizer,
+    train_dataset=cpt_dataset,
+    dataset_text_field="text",
+    max_seq_length=max_seq_length,
+    dataset_num_proc=2,
+    args=UnslothTrainingArguments(
+        per_device_train_batch_size=8,
+        gradient_accumulation_steps=8,
+        warmup_ratio=0.1,
+        num_train_epochs=1,
+        learning_rate=5e-5,
+        embedding_learning_rate=1e-5,
+        fp16=not is_bfloat16_supported(),
+        bf16=is_bfloat16_supported(),
+        logging_steps=1,
+        report_to="wandb",
+        save_steps=350,
+        run_name="romulus-cpt",
+        optim="adamw_8bit",
+        weight_decay=0.01,
+        lr_scheduler_type="linear",
+        seed=3407,
+        output_dir="outputs",
+    ),
+)
+
+trainer_stats = trainer.train()
+```
+
+<img src="assets/loss.png">
+
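To make the prompt template from the training script concrete, the following sketch applies it to a single record. The record is written here for illustration in the dataset's `ref`/`texte` layout and is not taken from the dataset, `format_batch` is an illustrative name, and `"<EOS>"` is a placeholder standing in for `tokenizer.eos_token` so that the example runs without downloading the model.

```python
# Prompt template copied from the training script.
PROMPT = """### Référence :
{}
### Contenu :
{}"""

# Placeholder for tokenizer.eos_token (assumption, so no model download is needed).
EOS_TOKEN = "<EOS>"


def format_batch(examples):
    """Combine each reference and text into one training string ending with EOS."""
    outputs = [
        PROMPT.format(ref, texte) + EOS_TOKEN
        for ref, texte in zip(examples["ref"], examples["texte"])
    ]
    return {"text": outputs}


# Illustrative batch with one record, in the column layout the script expects.
batch = {
    "ref": ["Code civil, art. 1240"],
    "texte": [
        "Tout fait quelconque de l'homme, qui cause à autrui un dommage, "
        "oblige celui par la faute duquel il est arrivé à le réparer."
    ],
}

formatted = format_batch(batch)
print(formatted["text"][0])
```

Appending the end-of-sequence token marks the end of each document, so the model sees a boundary between one legal text and the next during continued pre-training.
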
+## Citing & Authors
+
+If you use this code in your research, please cite it with the following BibTeX entry.
+
+```BibTeX
+@misc{louisbrulenaudet2024,
+  author =       {Louis Brulé Naudet},
+  title =        {Romulus, continually pre-trained models for French law},
+  year =         {2024},
+  howpublished = {\url{https://huggingface.co/datasets/louisbrulenaudet/Romulus-cpt-fr}},
+}
+```
+
+## Feedback
+
+If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).