Initialize project; model provided by the ModelHub XC community
Model: proxectonos/Carvalho-Salamandra-Instruct Source: Original Platform
---
library_name: transformers
license: mit
language:
- gl
- pt
- es
- en
- ca
base_model:
- BSC-LT/salamandra-7b-instruct
pipeline_tag: text-generation
tags:
- Salamandra
- Instruction-tuning
- Multilingual
datasets:
- proxectonos/cpt_instruction_datasets
---

# Carvalho-Salamandra-Instruct

> [!WARNING]
> **WARNING:** This is a preliminary version of Carvalho-Salamandra-Instruct.

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Carvalho-Salamandra-Instruct](#carvalho-salamandra-instruct)
  - [Table of Contents](#table-of-contents)
  - [Model description](#model-description)
  - [Intended uses and limitations](#intended-uses-and-limitations)
  - [How to use](#how-to-use)
  - [Training](#training)
    - [Training data](#training-data)
    - [Training hyperparameters](#training-hyperparameters)
    - [Framework](#framework)
  - [Evaluation](#evaluation)
  - [Additional information](#additional-information)
    - [Funding](#funding)
    - [Cite this model](#cite-this-model)

</details>

## Model description

**Carvalho-Salamandra-Instruct** is a 7B-parameter, instruction-tuned transformer model covering Galician, Portuguese, Spanish, English and Catalan.

It is based on [BSC-LT/salamandra-7b-instruct](https://huggingface.co/BSC-LT/salamandra-7b-instruct) and was further adapted with one epoch of training on high-quality multilingual corpora, with a marked emphasis on Galician and Portuguese.

The model aims to provide strong instruction-following and generation capabilities for underrepresented languages while maintaining robust multilingual behavior.

## Intended uses and limitations

**Intended uses**
- Instruction following and dialogue-style generation.
- Multilingual text generation and content creation.
- Downstream fine-tuning for tasks such as summarization, classification, or question answering (with appropriate supervised data).

**Limitations**
- Not intended as the sole source for high-stakes or safety-critical decisions.
- May produce incorrect or biased information; verify outputs when accuracy matters.
- Performance may vary by language and domain; the best results are expected in Galician and Portuguese, given the training emphasis.

## How to use

```python
from datetime import datetime

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "proxectonos/Carvalho-Salamandra-Instruct"

text = "Qué sabes sobre o Proxecto Nós?"

# Load the tokenizer and the model weights in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16
)

# Build a single-turn conversation and pass today's date to the chat template.
message = [ { "role": "user", "content": text } ]
date_string = datetime.today().strftime('%Y-%m-%d')

prompt = tokenizer.apply_chat_template(
    message,
    tokenize=False,
    add_generation_prompt=True,
    date_string=date_string
)

# The prompt comes from the chat template, so no extra special tokens are added when encoding.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=200)

# Keep only the newly generated tokens and cut the text at the reserved marker.
generated_tokens = outputs[0][len(inputs[0]):]
response = tokenizer.decode(generated_tokens, skip_special_tokens=False).strip()
response = response.split("<|reserved_token_1|>")[0].strip()
print(response)
```
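
For repeated or multi-turn use, the steps above can be wrapped in a small function. The following is a minimal sketch, not part of the official usage instructions: the names `trim_at_marker` and `generate_reply` are hypothetical, and `model` and `tokenizer` are assumed to be the objects loaded in the snippet above.

```python
from datetime import datetime

# Marker at which the snippet above cuts the decoded text.
STOP_MARKER = "<|reserved_token_1|>"


def trim_at_marker(text: str, marker: str = STOP_MARKER) -> str:
    """Return the text before the first occurrence of `marker`, stripped."""
    return text.split(marker)[0].strip()


def generate_reply(model, tokenizer, messages, max_new_tokens=200):
    """Generate one assistant reply for a list of chat messages.

    `messages` is a list of {"role": ..., "content": ...} dicts,
    as in the snippet above (hypothetical helper, same steps).
    """
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        date_string=datetime.today().strftime("%Y-%m-%d"),
    )
    inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
    outputs = model.generate(
        input_ids=inputs.to(model.device), max_new_tokens=max_new_tokens
    )
    # Drop the prompt tokens and keep only what the model generated.
    new_tokens = outputs[0][inputs.shape[-1]:]
    raw = tokenizer.decode(new_tokens, skip_special_tokens=False)
    return trim_at_marker(raw)
```

A follow-up turn could then be built by appending `{"role": "assistant", "content": reply}` and the next user message to `messages` before calling `generate_reply` again, assuming the chat template accepts assistant turns.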

## Training

### Training data

The model was trained on a mix of instruction data and high-quality monolingual corpora, designed to maximize performance in Galician and Portuguese while preserving broad multilingual capabilities.

| **Dataset type** | **Languages** | **Tokens per language / source** |
|----------------------|------------------------------|------------|
| Full instruction set | GL, ES, PT, CAT, EN | [Galician Instruction Datasets](https://github.com/proxectonos/instruction_datasets) |
| High-quality corpus | GL, PT | 250M |
| Small HQ corpus | EN, ES, CAT | 30M |
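
To make the language emphasis concrete, the corpus figures in the table can be turned into shares. This is a rough sketch that assumes the 250M and 30M values are token counts per language (as the column header suggests) and that leaves out the instruction set, whose size is not stated here.

```python
# Corpus sizes in millions of tokens per language, read from the table above.
# Assumption: each figure applies to every language listed in its row.
corpus_tokens_m = {"GL": 250, "PT": 250, "EN": 30, "ES": 30, "CAT": 30}

total_m = sum(corpus_tokens_m.values())
shares = {lang: tokens / total_m for lang, tokens in corpus_tokens_m.items()}
gl_pt_share = shares["GL"] + shares["PT"]

print(f"Total corpus tokens: {total_m}M")
print(f"Galician + Portuguese share: {gl_pt_share:.1%}")
```

Under these assumptions, Galician and Portuguese together account for 500M of the 590M corpus tokens.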

### Training hyperparameters

- **epochs:** 1
- **dtype:** bf16
- **block size:** 2048
- **total batch size:** 128
- **learning rate:** 2e-6
- **scheduler:** Linear
- **optimizations:**
  - gradient checkpointing: True
  - flash attention: True
  - liger kernels: True
  - DeepSpeed stage: 2
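
As an illustration of what a linear scheduler does to the learning rate of 2e-6, here is a minimal sketch of a linear decay to zero with optional warmup. The total number of steps and the warmup length are not stated in this card, so `total_steps` and `warmup_steps` are hypothetical parameters, and the actual training schedule may differ.

```python
def linear_lr(step: int, total_steps: int, base_lr: float = 2e-6,
              warmup_steps: int = 0) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay to 0."""
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Linear decay from base_lr down to 0 over the remaining steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Example with a hypothetical 1000-step run and no warmup.
for step in (0, 250, 500, 1000):
    print(step, linear_lr(step, total_steps=1000))
```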

### Framework
Training was performed at the **Galician Supercomputing Center (CESGA)** on **2 nodes**, each with **2× NVIDIA A100 40GB** GPUs (**4 GPUs** in total), over **2 days**.
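
The batch-size and hardware figures can be related with a short calculation. The sketch below builds a DeepSpeed-style configuration dictionary from the stated values (total batch size 128, bf16, ZeRO stage 2, 4 GPUs). The per-GPU micro-batch size of 4 is a hypothetical value chosen for illustration, and the total batch size is assumed to be counted in sequences of 2048 tokens; neither is confirmed by this card.

```python
def build_deepspeed_config(total_batch_size: int = 128, num_gpus: int = 4,
                           micro_batch_per_gpu: int = 4) -> dict:
    """Derive gradient accumulation and return a DeepSpeed-style config dict."""
    per_step = num_gpus * micro_batch_per_gpu
    if total_batch_size % per_step != 0:
        raise ValueError("total batch size must be divisible by GPUs x micro-batch")
    return {
        "train_batch_size": total_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,  # hypothetical value
        "gradient_accumulation_steps": total_batch_size // per_step,
        "bf16": {"enabled": True},
        "zero_optimization": {"stage": 2},
    }


config = build_deepspeed_config()
# Tokens consumed per optimizer step, assuming full blocks of 2048 tokens.
tokens_per_step = config["train_batch_size"] * 2048
print(config["gradient_accumulation_steps"], tokens_per_step)
```

With these assumed values, each optimizer step would accumulate gradients over 8 micro-batches per GPU and process 262,144 tokens.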

## Evaluation

Formal evaluation is ongoing. Preliminary internal tests show strong instruction-following ability and improved generation quality for Galician and Portuguese compared to the base model. Detailed benchmarks and quantitative results will be added when available.

## Additional information

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project Desarrollo de Modelos ALIA.

### Cite this model
Please cite this model as:

```
@misc{carvalho_salamandra_instruct_2025,
  title = {Carvalho-Salamandra-Instruct: A Multilingual Instruction-Tuned Model for Underrepresented Languages},
  author = {Proxecto Nós Team},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/proxectonos/Carvalho-Salamandra-Instruct}},
}
```