
ModelHub XC af282b8764 Initialize project; model provided by the ModelHub XC community

Model: proxectonos/Carballo-Legal
Source: Original Platform

2026-05-13 09:20:34 +08:00

4.7 KiB

Carballo-Legal

Model description

Carballo-Legal is a specialized 7B-parameter instruction-tuned model designed for legal text understanding and generation in Galician (GL) and Spanish (ES).

It is based on the foundation model BSC-LT/salamandra-7b-instruct and has been further trained on high-quality legal corpora extracted from official public institutions.

This model extends Salamandra’s instruction-following abilities to the legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.

Intended uses and limitations

Intended uses

Legal-oriented text generation (summaries, rephrasing, explanations).
Chat-style legal assistance (non-professional).
Downstream fine-tuning for specific legal domains or tasks.
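
As a rough illustration, the generation-oriented uses above can be phrased as single-turn chat messages of the kind passed to `tokenizer.apply_chat_template` in the How to use section. This is a minimal sketch: the helper `build_messages`, the task names, the Galician instruction wording, and the sample article text are all hypothetical and are not taken from the model's training prompts.

```python
# Hypothetical helper: the instruction wording below is an assumption,
# not the prompt format the model was trained with.
TASK_INSTRUCTIONS = {
    "summary": "Resume o seguinte texto legal:",
    "rephrase": "Reformula o seguinte texto legal en linguaxe clara:",
    "explain": "Explica o seguinte texto legal:",
}


def build_messages(task: str, text: str) -> list[dict[str, str]]:
    """Return a single-turn chat message list for one of the tasks above."""
    if task not in TASK_INSTRUCTIONS:
        raise ValueError(f"unknown task: {task!r}")
    content = f"{TASK_INSTRUCTIONS[task]}\n\n{text}"
    return [{"role": "user", "content": content}]


# Placeholder input, invented for illustration only.
sample = "Artigo 1. Esta orde ten por obxecto regular as axudas."
messages = build_messages("summary", sample)
print(messages[0]["content"])
```

The resulting list has the same shape as the `message` variable in the How to use section, so it could be used in its place there.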

Limitations

Not a substitute for professional legal interpretation.
May produce incomplete or incorrect legal statements.
Not suitable for high-stakes or judicial decision-making.
Works best for GL and ES; other languages are not reinforced in this checkpoint.

How to use

from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carballo-Legal"

text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model weights in bfloat16 on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build a single-turn conversation and render it with the model's chat template.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# add_special_tokens=False avoids adding special tokens on top of the rendered template.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens, then cut the text at the stop marker.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
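
The last step of the snippet above, cutting the decoded text at `<|reserved_token_1|>`, can be isolated as a small reusable function. This is only a sketch: the name `clean_response` is hypothetical, and the marker string is simply the one used in the snippet.

```python
def clean_response(decoded: str, stop_marker: str = "<|reserved_token_1|>") -> str:
    """Keep the text before the first stop marker, stripped of outer whitespace."""
    return decoded.split(stop_marker)[0].strip()


print(clean_response("  Ola!<|reserved_token_1|>extra"))
```

If the marker does not occur in the decoded text, the function returns the whole text, stripped.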

Training

Training data

The model was trained on a mixture of general instructions and domain-specific legal texts.

Dataset type	Languages	Sources
Instruction set	GL, ES, PT, CAT, EN	Galician Instruction Datasets
Legal corpus	GL, ES	DOGA, BOP Pontevedra, BOP A Coruña

Training hyperparameters

epochs: 0.5
dtype: bf16
block size: 2048
total batch size: 128
learning rate: 2e-6
scheduler: Linear
optimizations:
- gradient checkpointing: True
- flash attention: True
- liger kernels: True
- DeepSpeed stage: 2
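
For orientation, three of the values listed above map onto standard DeepSpeed configuration keys. The sketch below is an assumption about how such a config could look, not the authors' actual file, which this card does not include. Gradient checkpointing, flash attention, and Liger kernels are normally enabled in the training framework or at model load time rather than in this dictionary, so they are left out.

```python
import json

# Sketch of a DeepSpeed config mirroring the listed values.
# Assumption: the real training config is not published in this card.
ds_config = {
    "train_batch_size": 128,            # "total batch size: 128"
    "bf16": {"enabled": True},          # "dtype: bf16"
    "zero_optimization": {"stage": 2},  # "DeepSpeed stage: 2"
}

print(json.dumps(ds_config, indent=2))
```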

Framework

Training was performed at the Galician Supercomputing Center (CESGA) on 2 nodes, each with 2× NVIDIA A100 40GB GPUs (4 GPUs in total), over 2 days.
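
Combining this hardware with the hyperparameters listed earlier gives a few back-of-the-envelope figures. The sketch assumes that the total batch size counts sequences of `block size` tokens and that "2 days" means roughly 48 hours of wall-clock time. Neither is stated explicitly in the card, and how the 32 sequences per GPU are divided between micro-batches and gradient accumulation is unknown.

```python
block_size = 2048        # tokens per sequence ("block size")
total_batch_size = 128   # sequences per optimizer step (assumed unit)
num_gpus = 2 * 2         # 2 nodes x 2 GPUs each

tokens_per_step = block_size * total_batch_size
sequences_per_gpu = total_batch_size // num_gpus
approx_gpu_hours = num_gpus * 2 * 24  # assumes about 48 h of wall-clock time

print(tokens_per_step)    # 262144
print(sequences_per_gpu)  # 32
print(approx_gpu_hours)   # 192
```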

Evaluation

Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.

Additional information

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública (funded by the EU, NextGenerationEU) within the framework of the project Desarrollo de Modelos ALIA.

Cite this model

Please cite the model as follows:

@misc{carballo_legal_2025,
    title     = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
    author    = {Proxecto Nós Team},
    year      = {2025},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
}
