Initialize project; model provided by the ModelHub XC community

Model: proxectonos/Carvalho-Salamandra-Instruct Source: Original Platform
2026-06-07 07:18:15 +08:00
commit 14e22c812b
13 changed files with 1689 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,156 @@
+---
+library_name: transformers
+license: mit
+language:
+- gl
+- pt
+- es
+- en
+- ca
+base_model:
+- BSC-LT/salamandra-7b-instruct
+pipeline_tag: text-generation
+tags:
+- Salamandra
+- Instruction-tuning
+- Multilingual
+datasets:
+- proxectonos/cpt_instruction_datasets
+---
+
+# Carvalho-Salamandra-Instruct
+
+> [!WARNING]
+> **WARNING:** This is a preliminary version of Carvalho-Salamandra-Instruct.  
+
+## Table of Contents
+<details>
+<summary>Click to expand</summary>
+
+- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
+  - [Table of Contents](#table-of-contents)
+  - [Model description](#model-description)
+  - [Intended uses and limitations](#intended-uses-and-limitations)
+  - [How to use](#how-to-use)
+  - [Training](#training)
+    - [Training data](#training-data)
+    - [Training hyperparameters](#training-hyperparameters)
+    - [Framework](#framework)
+  - [Evaluation](#evaluation)
+  - [Additional information](#additional-information)
+    - [Funding](#funding)
+    - [Cite this model](#cite-this-model)
+
+</details>
+
+## Model description
+
+**Carvalho-Salamandra-Instruct** is a 7B-parameter instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan.
+
+It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted through a 1-epoch training run using high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese.
+
+This model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior.
+
+## Intended uses and limitations
+
+**Intended uses**
+- Instruction following and dialogue-style generation.  
+- Multilingual text generation and content creation.  
+- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data).
+
+**Limitations**
+- Not intended as a sole source for high-stakes or safety-critical decisions.  
+- May produce incorrect or biased information; verify outputs when accuracy matters.  
+- Performance may vary by language and domain; results are expected to be best in Galician and Portuguese, given the training emphasis.
+
+## How to use
+
+```python
+from datetime import datetime
+
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_id = "proxectonos/Carvalho-Salamandra-Instruct"
+
+text = "Qué sabes sobre o Proxecto Nós?"
+
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    device_map="auto",
+    torch_dtype=torch.bfloat16
+)
+
+message = [ { "role": "user", "content": text } ]
+date_string = datetime.today().strftime('%Y-%m-%d')
+
+prompt = tokenizer.apply_chat_template(
+    message,
+    tokenize=False,
+    add_generation_prompt=True,
+    date_string=date_string
+)
+
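+# Encode the already-templated prompt without adding special tokens a second time.
+inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
+outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)
+
+# Keep only the newly generated tokens, dropping the prompt.
+generated_tokens = outputs[0][len(inputs[0]):]
+response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
+# Discard everything from the first reserved token onwards.
+response = response.split("<|reserved_token_1|>")[0].strip()
+print(response)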
+```
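
The post-processing at the end of the snippet above (keep only the new tokens, then cut the text at `<|reserved_token_1|>`) can be factored into small helpers. The following is a minimal sketch; the names `strip_prompt` and `truncate_at_marker` are hypothetical and not part of `transformers`, and the toy values stand in for real model output.

```python
from typing import List, Sequence

STOP_MARKER = "<|reserved_token_1|>"  # marker used in the usage snippet above


def strip_prompt(output_ids: Sequence[int], prompt_len: int) -> List[int]:
    """Return only the token ids generated after the prompt (hypothetical helper)."""
    return list(output_ids[prompt_len:])


def truncate_at_marker(text: str, marker: str = STOP_MARKER) -> str:
    """Cut decoded text at the first occurrence of `marker`, if present."""
    return text.split(marker, 1)[0].strip()


# Toy values standing in for real token ids and decoded text.
print(strip_prompt([11, 12, 13, 21, 22], prompt_len=3))
print(truncate_at_marker("Ola!<|reserved_token_1|>extra"))
```

In the usage snippet, `prompt_len` would correspond to `len(inputs[0])` and the text passed to `truncate_at_marker` would be the decoded response.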
+
+## Training
+
+
+### Training data
+
+The model was trained with a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities.
+
+| **Dataset type**     | **Languages**       | **Tokens per language / Source** |
+|----------------------|---------------------|----------------------------------|
+| Full instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
+| High-quality corpus  | GL, PT              | 250M |
+| Small HQ corpus      | EN, ES, CAT         | 30M  |
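
For a rough sense of the language balance in the monolingual part of the mix, the figures in the table can be tallied as in the sketch below. It assumes the token counts are per language, as the column header suggests, and it leaves out the instruction set, whose size is not stated in the table.

```python
# Token counts in millions, taken from the table and assumed to be per language.
MONOLINGUAL_TOKENS_M = {
    "GL": 250,   # high-quality corpus
    "PT": 250,   # high-quality corpus
    "EN": 30,    # small HQ corpus
    "ES": 30,    # small HQ corpus
    "CAT": 30,   # small HQ corpus
}


def language_shares(tokens_m):
    """Return each language's fraction of the total token count."""
    total = sum(tokens_m.values())
    return {lang: count / total for lang, count in tokens_m.items()}


total_m = sum(MONOLINGUAL_TOKENS_M.values())
shares = language_shares(MONOLINGUAL_TOKENS_M)
focus_share = shares["GL"] + shares["PT"]

print(f"total monolingual tokens: {total_m}M")
print(f"GL + PT share: {focus_share:.1%}")
```

Under that reading, the monolingual corpora add up to 590M tokens, of which roughly 85% are Galician and Portuguese.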
+
+### Training hyperparameters
+
+- **epochs:** 1  
+- **dtype:** bf16  
+- **block size:** 2048  
+- **total batch size:** 128  
+- **learning rate:** 2e-6  
+- **scheduler:** Linear  
+- **optimizations:**  
+  - gradient checkpointing: True  
+  - flash attention: True  
+  - liger kernels: True  
+  - DeepSpeed stage: 2
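
The settings above can be related to each other with a little arithmetic. The sketch below assumes the 4 GPUs reported in the Framework section and that the total batch size counts sequences of one block each. The per-GPU micro-batch size of 4 is a hypothetical value chosen for illustration, not a setting reported for this model. The dictionary uses standard DeepSpeed configuration keys, but it is not the configuration used for training.

```python
TOTAL_BATCH_SIZE = 128   # sequences per optimizer step (from the list above)
BLOCK_SIZE = 2048        # tokens per sequence (from the list above)
WORLD_SIZE = 4           # GPUs, as reported in the Framework section
MICRO_BATCH_PER_GPU = 4  # hypothetical value, for illustration only


def grad_accum_steps(total_batch, micro_batch, world_size):
    """Accumulation steps such that micro_batch * steps * world_size == total_batch."""
    per_step = micro_batch * world_size
    if total_batch % per_step != 0:
        raise ValueError("total batch size must be divisible by micro_batch * world_size")
    return total_batch // per_step


accum = grad_accum_steps(TOTAL_BATCH_SIZE, MICRO_BATCH_PER_GPU, WORLD_SIZE)

# Upper bound on tokens per optimizer step, reached only if every block is full.
tokens_per_step = TOTAL_BATCH_SIZE * BLOCK_SIZE

# DeepSpeed-style config mirroring the listed settings (illustrative only).
ds_config = {
    "train_batch_size": TOTAL_BATCH_SIZE,
    "train_micro_batch_size_per_gpu": MICRO_BATCH_PER_GPU,
    "gradient_accumulation_steps": accum,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

print(accum, tokens_per_step)
```

With these assumptions, each optimizer step covers at most 262,144 tokens, and a different micro-batch size only changes the number of accumulation steps.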
+
+### Framework
+Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes**, each with **2× NVIDIA A100 40GB** GPUs (**4 GPUs** in total), over **2 days**.
+
+## Evaluation
+
+Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available.
+
+## Additional information
+
+### Funding
+This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the project Desarrollo de Modelos ALIA.
+
+### Cite this model
+Please cite this model as:
+
+```bibtex
+@misc{carvalho_salamandra_instruct_2025,
+    title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages},
+    author = {Proxecto Nós Team},
+    year = {2025},
+    publisher = {HuggingFace},
+    howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}}
+}
+```