Initialize project; model provided by the ModelHub XC community

Model: proxectonos/Carballo-Legal Source: Original Platform
2026-05-13 09:20:34 +08:00
commit af282b8764
13 changed files with 1680 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,147 @@
+---
+library_name: transformers
+tags:
+- legal
+- instruction-tuning
+- multilingual
+license: mit
+language:
+- gl
+- es
+base_model:
+- BSC-LT/salamandra-7b-instruct
+pipeline_tag: text-generation
+datasets:
+- proxectonos/corpus_dominio_legal_administrativo
+---
+
+# Carballo-Legal
+
+## Table of Contents
+<details>
+<summary>Click to expand</summary>
+
+- [Carballo-Legal](#carballo-legal)
+  - [Table of Contents](#table-of-contents)
+  - [Model description](#model-description)
+  - [Intended uses and limitations](#intended-uses-and-limitations)
+  - [How to use](#how-to-use)
+  - [Training](#training)
+    - [Training data](#training-data)
+    - [Training hyperparameters](#training-hyperparameters)
+    - [Framework](#framework)
+  - [Evaluation](#evaluation)
+  - [Additional information](#additional-information)
+    - [Funding](#funding)
+    - [Cite this model](#cite-this-model)
+
+</details>
+
+## Model description
+
+**Carballo-Legal** is a specialized 7B-parameter instruction-tuned model designed for **legal text understanding and generation** in **Galician (GL)** and **Spanish (ES)**.
+
+It is based on the foundation model [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and has been further trained on high-quality legal corpora extracted from official public institutions.
+
+This model enhances Salamandra’s instruction-following abilities with legal language, terminology, document structure, and reasoning patterns found in administrative and legislative texts.
+
+## Intended uses and limitations
+
+**Intended uses**
+- Legal-oriented text generation (summaries, rephrasing, explanations).
+- Chat-style legal assistance (non-professional).
+- Downstream fine-tuning for specific legal domains or tasks.
+
+**Limitations**
+- Not a substitute for professional legal interpretation.  
+- May produce incomplete or incorrect legal statements.  
+- Not suitable for high-stakes or judicial decision-making.  
+- Works best for GL and ES; other languages are not reinforced in this checkpoint.
+
+## How to use
+
+```python
+from datetime import datetime
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "proxectonos/Carballo-Legal"
+
+text = "Qué sabes sobre o Proxecto Nós?"
+
+# Load the tokenizer and the model in bfloat16, letting device_map place the weights.
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    torch_dtype=torch.bfloat16,
+)
+
+# Build a single-turn chat prompt; today's date is passed to the chat template.
+message = [{"role": "user", "content": text}]
+date_string = datetime.today().strftime("%Y-%m-%d")
+
+prompt = tokenizer.apply_chat_template(
+    message,
+    tokenize=False,
+    add_generation_prompt=True,
+    date_string=date_string,
+)
+
+# Encode the templated prompt as-is, without adding further special tokens.
+inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+# Keep only the newly generated tokens, dropping the echoed prompt.
+generated_tokens = outputs[0][len(inputs[0]):]
+response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
+# Cut the decoded text at the first reserved token, if one appears.
+response = response.split("<|reserved_token_1|>")[0].strip()
+print(response)
+```
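+
+The last two lines of the snippet above decode the generated tokens and cut the text at a reserved token. A minimal sketch of that post-processing step as a standalone function, so it can be checked without loading the model. The function name `trim_response` and the sample string are illustrative assumptions, not part of the model's API:
+
+```python
+def trim_response(decoded: str, stop_marker: str = "<|reserved_token_1|>") -> str:
+    """Return the text before the first stop marker, without surrounding whitespace."""
+    return decoded.split(stop_marker)[0].strip()
+
+
+# Made-up decoded output, used only to show the behaviour.
+sample = " Resposta de exemplo. <|reserved_token_1|> texto sobrante"
+print(trim_response(sample))
+```
+
+If the marker never appears, the whole decoded string is returned unchanged apart from the whitespace trim.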
+
+## Training
+
+### Training data
+
+The model was trained on a mixture of general instructions and domain-specific legal texts.
+
+| **Dataset Type** | **Languages** | **Sources** |
+|------------------|---------------|-------------|
+| Instruction set  | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
+| Legal corpus     | GL, ES        | DOGA, BOP Pontevedra, BOP A Coruña |
+
+### Training hyperparameters
+
+- **epochs:** 0.5  
+- **dtype:** bf16  
+- **block size:** 2048  
+- **total batch size:** 128  
+- **learning rate:** 2e-6  
+- **scheduler:** Linear  
+- **optimizations:**  
+  - gradient checkpointing: True  
+  - flash attention: True  
+  - liger kernels: True  
+  - DeepSpeed stage: 2  
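+
+As a rough sanity check on the figures above, the sketch below works out what one optimizer step amounts to. It assumes that the total batch size is counted in sequences, that every sequence fills the 2048-token block (so the token count is an upper bound), and that the batch is split evenly across the 4 GPUs described in the Framework section. The per-GPU micro-batch size is not stated in this card, so the value passed to `accumulation_steps` is purely hypothetical:
+
+```python
+TOTAL_BATCH_SIZE = 128  # sequences per optimizer step (from the list above)
+BLOCK_SIZE = 2048       # tokens per sequence (from the list above)
+NUM_GPUS = 4            # 2 nodes x 2 GPUs (from the Framework section)
+
+# Upper bound on tokens seen per optimizer step, assuming full blocks.
+tokens_per_step = TOTAL_BATCH_SIZE * BLOCK_SIZE
+# Share of the batch handled by each GPU, assuming an even data-parallel split.
+sequences_per_gpu = TOTAL_BATCH_SIZE // NUM_GPUS
+
+
+def accumulation_steps(micro_batch_size: int) -> int:
+    """Gradient-accumulation steps needed for a given per-GPU micro-batch size."""
+    if micro_batch_size <= 0 or sequences_per_gpu % micro_batch_size:
+        raise ValueError("micro-batch size must evenly divide the per-GPU share")
+    return sequences_per_gpu // micro_batch_size
+
+
+print(tokens_per_step)        # 262144
+print(sequences_per_gpu)      # 32
+print(accumulation_steps(4))  # 8, for a hypothetical micro-batch of 4
+```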
+
+### Framework
+
+Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes** with **2× NVIDIA A100 40GB** each, for a total of **4 GPUs**, over **2 days**.
+
+## Evaluation
+
+Formal evaluation is in progress. Early observations show improved handling of legal terminology, structured documents, and administrative phrasing in GL and ES.
+
+## Additional information
+
+### Funding
+
+This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the project Desarrollo de Modelos ALIA.
+
+### Cite this model
+Please cite the model as follows:
+
+```bibtex
+@misc{carballo_legal_2025,
+    title     = {Carballo-Legal: A Legal Domain Instruction-Tuned Model for Galician and Spanish},
+    author    = {Proxecto Nós Team},
+    year      = {2025},
+    publisher = {HuggingFace},
+    howpublished = {\url{https://huggingface.co/proxectonos/Carballo-Legal}},
+}
+```