Model: proxectonos/Carballo-Legal
Source: Original Platform
2026-05-13 09:20:34 +08:00


---
library_name: transformers
tags:
- legal
- instruction-tuning
- multilingual
license: mit
language:
- gl
- es
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
datasets:
- proxectonos/corpus_dominio_legal_administrativo
---
# Carballo-Legal
## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carballo-Legal](#carballo-legal)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>
## Model description
**Carballo-Legal** is a specialized 7B-parameter instruction-tuned model designed for **legal text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.
It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality legal corpora extracted from official public institutions.
This model extends Salamandra's instruction-following abilities with the legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.
## Intended uses and limitations
**Intended uses**

- Legal-oriented text generation (summaries, rephrasing, explanations).
- Chat-style legal assistance (non-professional).
- Downstream fine-tuning for specific legal domains or tasks.

**Limitations**

- Not a substitute for professional legal interpretation.
- May produce incomplete or incorrect legal statements.
- Not suitable for high-stakes or judicial decision-making.
- Works best for GL and ES; other languages are not reinforced in this checkpoint.
## How to use
```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carballo-Legal"
text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model weights in bfloat16
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build a single-turn conversation and render it with the chat template
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")
prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The rendered prompt already contains the special tokens, so none are added here
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens (drop the prompt)
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
# Cut the output at the first reserved token
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
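
The snippet above decodes with `skip_special_tokens=False` and then cuts the text at `<|reserved_token_1|>`. If other markers can also appear in the output, that step could be generalized as in the minimal sketch below. The helper name `trim_at_stop_markers` and the `STOP_MARKERS` tuple are hypothetical, and `<|im_end|>` is an assumed end-of-turn marker that is not confirmed by this model card.

```python
def trim_at_stop_markers(decoded: str, stop_markers: tuple[str, ...]) -> str:
    """Return the text that precedes the earliest stop marker, stripped."""
    cut = len(decoded)
    for marker in stop_markers:
        idx = decoded.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return decoded[:cut].strip()


# "<|reserved_token_1|>" comes from the snippet above;
# "<|im_end|>" is an assumption added for illustration only.
STOP_MARKERS = ("<|reserved_token_1|>", "<|im_end|>")

raw = " Resposta de exemplo. <|im_end|><|reserved_token_1|>"
print(trim_at_stop_markers(raw, STOP_MARKERS))
```

Cutting at the earliest marker, rather than at the first one listed, keeps the result independent of the order of `STOP_MARKERS`.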
## Training
### Training data
The model was trained on a mixture of general instructions and domain-specific legal texts.
| **Dataset Type** | **Languages** | **Sources** |
|------------------|---------------|-------------|
| Instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| Legal corpus | GL, ES | DOGA, BOP Pontevedra, BOP A Coruña |
### Training hyperparameters
- **epochs:** 0.5
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** Linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - liger kernels: True
  - DeepSpeed stage: 2
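
To make the batch-size figures concrete, here is a rough sketch of how a total batch size of 128 could be split across the 4 GPUs reported in the Framework section, and how many tokens one optimizer step would cover at a block size of 2048. The per-GPU micro-batch size of 4 is an assumption chosen for illustration; the model card does not state it, nor the actual configuration file used. The dictionary keys are standard DeepSpeed configuration keys.

```python
GLOBAL_BATCH_SIZE = 128   # "total batch size" from the list above
BLOCK_SIZE = 2048         # "block size" from the list above
NUM_GPUS = 4              # 2 nodes x 2 GPUs, as reported under Framework
MICRO_BATCH_PER_GPU = 4   # ASSUMPTION: not stated in the model card


def grad_accum_steps(global_batch: int, micro_batch: int, num_gpus: int) -> int:
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_forward = micro_batch * num_gpus
    if global_batch % per_forward != 0:
        raise ValueError("global batch must be divisible by micro_batch * num_gpus")
    return global_batch // per_forward


accum = grad_accum_steps(GLOBAL_BATCH_SIZE, MICRO_BATCH_PER_GPU, NUM_GPUS)

# Upper bound on tokens per optimizer step (every sequence filled to BLOCK_SIZE)
tokens_per_step = GLOBAL_BATCH_SIZE * BLOCK_SIZE

# Illustrative DeepSpeed-style settings matching the listed hyperparameters
ds_config = {
    "train_batch_size": GLOBAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": accum,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

print(accum, tokens_per_step)
```

With these assumed values the script prints `8 262144`: eight accumulation steps per optimizer update and at most 262,144 tokens per step.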
### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** GPUs each (**4 GPUs** in total) and took **2 days**.
## Evaluation
Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.
## Additional information
### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.
### Cite this model
Please cite the model as follows:
```bibtex
@misc{carballo_legal_2025,
title = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
author = {Proxecto Nós Team},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
}
```