---
library_name: transformers
tags:
- legal
- instruction-tuning
- multilingual
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
datasets:
- proxectonos/corpus_dominio_legal_administrativo
---

# Carballo-Legal

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carballo-Legal](#carballo-legal)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>

## Model description

**Carballo-Legal** is a specialized 7B-parameter instruction-tuned model designed for **legal text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.

It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality legal corpora extracted from the publications of official public institutions.

The additional training extends Salamandra’s instruction-following abilities to the legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.

## Intended uses and limitations

**Intended uses**
- Legal-oriented text generation (summaries, rephrasing, explanations).
- Chat-style legal assistance (non-professional).
- Downstream fine-tuning for specific legal domains or tasks.

**Limitations**
- Not a substitute for professional legal interpretation.
- May produce incomplete or incorrect legal statements.
- Not suitable for high-stakes or judicial decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carballo-Legal"

text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model in bfloat16, letting device_map place the weights.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

message = [ { "role": "user", "content": text } ]
date_string = datetime.today().strftime('%Y-%m-%d')

# Render the conversation with the model's chat template.
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens, dropping the prompt.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
# Cut the text at the first "<|reserved_token_1|>" marker.
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```

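The post-processing at the end of the snippet above can be isolated in a small helper, which makes it easy to check without loading the model. This is a minimal sketch: the function name `extract_reply` is hypothetical, and the only thing taken from this card is the `<|reserved_token_1|>` marker used in the example.

```python
def extract_reply(decoded: str, stop_marker: str = "<|reserved_token_1|>") -> str:
    """Return the text before the first stop marker, with outer whitespace removed."""
    # str.partition yields the whole string as the head when the marker is absent,
    # so replies without a marker are returned unchanged apart from stripping.
    head, _, _ = decoded.partition(stop_marker)
    return head.strip()


print(extract_reply("  Ola! <|reserved_token_1|> trailing text"))  # prints "Ola!"
```

With a helper like this, the last two assignments to `response` in the example could be written as `response = extract_reply(tokenizer.decode(generated_tokens, skip_special_tokens=False))`.
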
## Training

### Training data

The model was trained on a mixture of general instructions and domain-specific legal texts.

| **Dataset Type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Legal corpus | GL, ES | DOGA, BOP Pontevedra, BOP A Coruña |

### Training hyperparameters

- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** Linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - liger kernels: True
  - DeepSpeed stage: 2

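To make the batch settings above concrete, the sketch below shows how a total batch size of 128 could be split across the 4 GPUs reported in the Framework section, and how many tokens one optimizer step covers at most. The per-device batch size is a hypothetical value chosen for illustration; this card does not report it, nor the number of gradient accumulation steps.

```python
# Values reported in this card.
TOTAL_BATCH_SIZE = 128   # sequences per optimizer step
BLOCK_SIZE = 2048        # tokens per sequence
NUM_GPUS = 4             # 2 nodes x 2 GPUs

# Hypothetical value, not reported in this card.
PER_DEVICE_BATCH_SIZE = 4


def gradient_accumulation_steps(total: int, per_device: int, num_gpus: int) -> int:
    """Accumulation steps needed so that per_device * num_gpus * steps == total."""
    sequences_per_pass = per_device * num_gpus
    if total % sequences_per_pass != 0:
        raise ValueError("total batch size must be divisible by per_device * num_gpus")
    return total // sequences_per_pass


accum_steps = gradient_accumulation_steps(TOTAL_BATCH_SIZE, PER_DEVICE_BATCH_SIZE, NUM_GPUS)
# Upper bound: reached only if every sequence fills the whole block.
max_tokens_per_step = TOTAL_BATCH_SIZE * BLOCK_SIZE

print(accum_steps)          # 8
print(max_tokens_per_step)  # 262144
```
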
### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** each, totaling **4 GPUs**, over **2 days**.

## Evaluation

Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.

## Additional information

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU, NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model
Please cite the model as follows:

```
@misc{carballo_legal_2025,
  title        = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
  author       = {Proxecto Nós Team},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
}
```