---
library_name: transformers
license: mit
language:
- gl
- pt
- es
- en
- ca
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
tags:
- Salamandra
- Instruction-tuning
- Multilingual
datasets:
- proxectonos/cpt_instruction_datasets
---

# Carvalho-Salamandra-Instruct

> [!WARNING]
> **WARNING:** This is a preliminary version of Carvalho-Salamandra-Instruct.  

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>

## Model description

**Carvalho-Salamandra-Instruct** is a 7B-parameter instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan.

It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted through a 1-epoch training run using high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese.

This model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior.

## Intended uses and limitations

**Intended uses**
- Instruction following and dialogue-style generation.  
- Multilingual text generation and content creation.  
- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data).

**Limitations**
- Not intended as the sole source for high-stakes or safety-critical decisions.  
- May produce incorrect or biased information; verify outputs when accuracy matters.  
- Performance may vary by language and domain; the best results are expected in Galician and Portuguese, given the training emphasis.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "proxectonos/Carvalho-Salamandra-Instruct"

text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model in bfloat16, placing weights on the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Build the prompt with the chat template; the current date is passed to it as date_string.
message = [{"role": "user", "content": text}]
date_string = datetime.today().strftime("%Y-%m-%d")

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string,
)

# The prompt comes from the chat template, so no extra special tokens are added here.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens and cut the text at the first stop marker.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
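The last step of the snippet above cuts the decoded text at `<|reserved_token_1|>`. As a minimal sketch, that post-processing can be pulled into a small helper so it is easy to reuse and test without loading the model. The helper name `truncate_at_stop` and the option to pass several markers are illustrative assumptions, not part of the model's API; only the `<|reserved_token_1|>` marker comes from the example above.

```python
def truncate_at_stop(decoded, stop_markers=("<|reserved_token_1|>",)):
    """Return the text before the earliest stop marker, with whitespace stripped.

    `truncate_at_stop` is a hypothetical helper; the default marker is the one
    used in the usage example, any other marker is the caller's assumption.
    """
    cut = len(decoded)
    for marker in stop_markers:
        idx = decoded.find(marker)
        if idx != -1:
            cut = min(cut, idx)  # keep the earliest marker position
    return decoded[:cut].strip()


# Example with a made-up decoded string (not real model output).
raw = "Ola! <|reserved_token_1|>ignored tail"
print(truncate_at_stop(raw))
```

With this helper, the two `response = ...` lines of the example could be replaced by a single call on the decoded string.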

## Training


### Training data

The model was trained with a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities.

| **Dataset type**     | **Languages**       | **Tokens per language / source** |
|----------------------|---------------------|----------------------------------|
| Full instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| High-quality corpus  | GL, PT              | 250M                             |
| Small HQ corpus      | EN, ES, CAT         | 30M                              |
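To make the Galician and Portuguese emphasis concrete, the monolingual token counts in the table can be summed. This is only a rough sketch: it assumes that "250M" and "30M" apply to each listed language separately, and it leaves out the instruction set, whose size is not stated here.

```python
# Token counts per language, in millions, read from the table above.
# Assumption: each figure applies to every language in its row separately.
corpus_tokens_m = {"GL": 250, "PT": 250, "EN": 30, "ES": 30, "CAT": 30}

total_m = sum(corpus_tokens_m.values())
focus_m = corpus_tokens_m["GL"] + corpus_tokens_m["PT"]
focus_share = focus_m / total_m

print(f"Monolingual corpus total: {total_m}M tokens")
print(f"GL + PT share: {focus_share:.1%}")
```

Under that reading, Galician and Portuguese account for roughly 85% of the monolingual tokens.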

### Training hyperparameters

- **epochs:** 1  
- **dtype:** bf16  
- **block size:** 2048  
- **total batch size:** 128  
- **learning rate:** 2e-6  
- **scheduler:** Linear  
- **optimizations:**  
  - gradient checkpointing: True  
  - flash attention: True  
  - liger kernels: True  
  - DeepSpeed stage: 2
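The hyperparameters above imply some simple arithmetic, sketched below. Several things are assumptions rather than facts from this card: that the total batch size counts sequences of one block each, that the linear scheduler decays to zero with no warmup, and that the run covers about 590M tokens (the monolingual corpora only, read per language from the training data table). The GPU count of 4 is taken from the Framework section.

```python
import math

BLOCK_SIZE = 2048        # tokens per sequence
TOTAL_BATCH_SIZE = 128   # sequences per optimizer step (assumed unit)
NUM_GPUS = 4             # from the Framework section
PEAK_LR = 2e-6

# Sequences each GPU contributes per optimizer step
# (micro-batch size times gradient accumulation steps; the split is not stated).
sequences_per_gpu = TOTAL_BATCH_SIZE // NUM_GPUS
tokens_per_step = BLOCK_SIZE * TOTAL_BATCH_SIZE


def linear_lr(step, total_steps, peak_lr=PEAK_LR):
    """Linear decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    remaining = max(0.0, 1.0 - step / total_steps)
    return peak_lr * remaining


# Hypothetical token budget: monolingual corpora only, instruction data excluded.
example_tokens = 590_000_000
total_steps = math.ceil(example_tokens / tokens_per_step)

print(f"{sequences_per_gpu} sequences per GPU per step")
print(f"{tokens_per_step} tokens per optimizer step")
print(f"about {total_steps} optimizer steps for one epoch")
print(f"learning rate at the midpoint: {linear_lr(total_steps // 2, total_steps):.2e}")
```

The actual step count depends on the real dataset size and on sequence packing, so these numbers are an order-of-magnitude guide only.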

### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes**, each with **2× NVIDIA A100 40GB** GPUs (**4 GPUs** in total), and took **2 days**.

## Evaluation

Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available.

## Additional information

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model
Please cite this model as:

```
@misc{carvalho_salamandra_instruct_2025,
    title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages},
    author = {Proxecto Nós Team},
    year = {2025},
    publisher = {HuggingFace},
    howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}},
}

```