---
library_name: transformers
tags:
- legal
- instruction-tuning
- multilingual
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
datasets:
- proxectonos/corpus_dominio_legal_administrativo
---

# Carballo-Legal

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carballo-Legal](#carballo-legal)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>

## Model description

**Carballo-Legal** is a specialized 7B-parameter instruction-tuned model designed for **legal text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.

It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on legal and administrative corpora published by official public institutions.

The additional training complements Salamandra’s instruction-following abilities with the legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.

## Intended uses and limitations

**Intended uses**
- Legal-oriented text generation (summaries, rephrasing, explanations).
- Chat-style legal assistance (non-professional).
- Downstream fine-tuning for specific legal domains or tasks.

**Limitations**
- Not a substitute for professional legal interpretation.
- May produce incomplete or incorrect legal statements.
- Not suitable for high-stakes or judicial decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carballo-Legal"

text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model in bfloat16, placing weights automatically.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the chat prompt, passing today's date to the chat template.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# Generate and keep only the newly generated tokens.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
generated_tokens = outputs[0][len(inputs[0]):]

# Decode with special tokens kept so the output can be cut at the marker below.
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
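
The prompt construction and post-processing steps in the snippet above can be factored into small helpers. The following is a minimal sketch: `build_messages` and `clean_response` are hypothetical names that are not part of the model repository, and the Galician summarization instruction is only an illustrative prompt.

```python
# Hypothetical helpers (not shipped with the model) wrapping steps from the snippet above.

STOP_MARKER = "<|reserved_token_1|>"  # marker the snippet above splits on


def build_messages(document: str, instruction: str = "Resume o seguinte texto legal:") -> list[dict]:
    """Build a single-turn chat message list for a legal summarization request."""
    return [{"role": "user", "content": f"{instruction}\n\n{document.strip()}"}]


def clean_response(decoded: str, stop_marker: str = STOP_MARKER) -> str:
    """Keep only the text before the first stop marker and trim whitespace."""
    return decoded.split(stop_marker)[0].strip()
```

The list returned by `build_messages` can be passed to `tokenizer.apply_chat_template` in place of `message`, and `clean_response` can be applied to the decoded output before printing.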

## Training

### Training data

The model was trained on a mixture of general instructions and domain-specific legal texts.

| **Dataset Type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set  | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Legal corpus     | GL, ES        | DOGA, BOP Pontevedra, BOP A Coruña |

### Training hyperparameters

- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** Linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - liger kernels: True
  - DeepSpeed stage: 2

### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** each, for a total of **4 GPUs**, over **2 days**.
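
As a rough sanity check, the reported hyperparameters and hardware can be tied together. The sketch below assumes a per-GPU micro-batch of 4 and 8 gradient-accumulation steps; neither value is stated in this card, and they are chosen only so that their product matches the reported total batch size of 128 on 4 GPUs. The dictionary uses standard DeepSpeed configuration keys, but it is an illustration and not the configuration actually used for training.

```python
# Values reported in this model card.
TOTAL_BATCH_SIZE = 128
BLOCK_SIZE = 2048
NUM_GPUS = 4        # 2 nodes x 2 A100 40GB
TRAINING_DAYS = 2   # approximate wall-clock duration

# ASSUMPTIONS (not stated in the card): one possible split of the total batch.
MICRO_BATCH_PER_GPU = 4
GRAD_ACCUM_STEPS = 8

# DeepSpeed expects: train_batch_size == micro_batch * grad_accum * world_size
assert MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS * NUM_GPUS == TOTAL_BATCH_SIZE

ds_config = {
    "train_batch_size": TOTAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,  # assumed
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,        # assumed
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

# Upper bound on tokens per optimizer step (if every sequence fills the block).
tokens_per_step = TOTAL_BATCH_SIZE * BLOCK_SIZE
# Approximate compute budget implied by 4 GPUs running for about 2 days.
gpu_hours = NUM_GPUS * TRAINING_DAYS * 24

print(tokens_per_step)  # 262144
print(gpu_hours)        # 192
```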

## Evaluation

Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.

## Additional information

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública, with funding from the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model
Please cite the model as follows:

```
@misc{carballo_legal_2025,
    title     = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
    author    = {Proxecto Nós Team},
    year      = {2025},
    publisher = {Hugging Face},
    howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
}
```