---
license: mit
language:
- pt
tags:
- gervasio-pt*
- gervasio-ptpt
- gervasio-8b-portuguese-ptpt-decoder
- portulan
- albertina-pt*
- serafim-pt*
- clm
- gpt
- portuguese
- decoder
- foundation model
base_model:
- meta-llama/Llama-3.1-8B-Instruct
base_model_relation: finetune
pipeline_tag: text-generation
library_name: transformers
---

<br>
<br>
<img align="left" width="40" height="40" src="https://github.githubassets.com/images/icons/emoji/unicode/1f917.png">
<p style="text-align: center;"> This is the model card for the <b>Gervásio 8B PTPT</b> decoder.
<br>
This model is integrated in the <a href="https://evaristo.ai"><b>Evaristo.ai chatbot</b></a>, where its generative capabilities can be tried out interactively through a GUI.
<br>
You may also be interested in some of the other models in the <a href="https://huggingface.co/PORTULAN">Albertina (encoders) and Serafim (sentence encoder) families</a>.
</p>
<br>
<br>

<img width="500" src="logo_gervasio_long_color.png">

<br>

# Gervásio 8B PTPT

<br>

**Gervásio 8B PTPT** is an **open** decoder for the **Portuguese language**.

It is a **decoder** of the LLaMA family, based on the Transformer neural architecture and developed over the LLaMA 3.1 8B Instruct model.
It was further improved through additional training on language resources that include Portuguese data sets prepared for this purpose, namely [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct), as well as other data sets whose release is being prepared (MMLU PT, Natural Instructions PT, a Wikipedia subset, Provérbios PT).

**Gervásio 8B PTPT** is distributed free of charge under an open license, for both research and commercial purposes, and, given its size, it can be run on consumer-grade hardware.

**Gervásio 8B PTPT** is developed by a team from NLX - Natural Language and Speech Group, at the Department of Informatics, Faculty of Sciences, University of Lisbon, Portugal.

For the record, its full name is **Gervásio Produz Textos em Português** (Gervásio Produces Texts in Portuguese), to which corresponds the natural acronym **GPT PT**,
and which is known for short as **Gervásio PT\*** or, even more briefly, just as **Gervásio**, among its acquaintances.

<br>
<br>

# Model Description

The model has 8 billion parameters, over 32 layers, with a hidden size of 4096, an intermediate size of 14336, and 32 attention heads. It uses rotary position embeddings (RoPE) and a tokenizer with a vocabulary of size 128256.
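
As a rough sanity check, the figures above can be combined into a parameter count. The sketch below is a back-of-the-envelope calculation, not an official breakdown: the number of key/value heads (8, grouped-query attention), the untied input and output embeddings, and the three-matrix gated MLP are assumptions taken from the LLaMA 3.1 8B architecture and are not stated in this card.

```python
# Values stated in the model description.
num_layers = 32
hidden_size = 4096
intermediate_size = 14336
num_heads = 32
vocab_size = 128256

# Assumptions from the LLaMA 3.1 8B architecture (not stated in this card).
num_kv_heads = 8         # grouped-query attention
tied_embeddings = False  # separate input embedding and output projection

head_dim = hidden_size // num_heads
kv_dim = num_kv_heads * head_dim

# Attention: query and output projections are square; key and value are narrower.
attention = 2 * hidden_size * hidden_size + 2 * hidden_size * kv_dim
# Gated MLP: gate, up and down projections.
mlp = 3 * hidden_size * intermediate_size
# Two normalization weight vectors per layer.
norms = 2 * hidden_size
per_layer = attention + mlp + norms

embeddings = vocab_size * hidden_size * (1 if tied_embeddings else 2)
# The trailing hidden_size term is the final normalization layer.
total_params = num_layers * per_layer + embeddings + hidden_size

print(f"{total_params:,}")  # 8,030,261,248
```

Under these assumptions the total comes to about 8.03 billion, consistent with the "8 billion parameters" stated above.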

<br>
<br>

# Training Data

**Gervásio 8B PTPT** was trained on various datasets, either native to European Portuguese or translated into European Portuguese.
For the latter, we selected only those datasets whose translation into European Portuguese could preserve, in the target language, the linguistic properties at stake.

The training data comprises:
- [extraGLUE-Instruct](https://huggingface.co/datasets/PORTULAN/extraglue-instruct).
- MMLU PT (multiple-choice question answering).
- A subset of Natural Instructions (mostly multiple-choice question answering tasks).
- A manually curated subset of Wikipedia.
- A manually curated list of proverbs.

<br>
<br>

# Training Details

We applied supervised fine-tuning with a causal language modeling training objective, following a zero-out technique. Specifically, while the entire prompt and chat template received attention during fine-tuning, only the response tokens contributed to the loss and were thus subjected to back-propagation.

To accelerate training, the Fully Sharded Data Parallel (FSDP) paradigm was used over 10 L40S GPUs.
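
The zero-out idea can be illustrated with a minimal sketch. It assumes the common convention of marking prompt positions with the label `-100`, which PyTorch's cross-entropy loss ignores by default. Whether the training code for this model used exactly this mechanism is an assumption, and the helper names and token ids below are hypothetical.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_labels(prompt_ids, response_ids):
    """Concatenate prompt and response, masking prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_mean_loss(token_losses, labels):
    """Average per-token losses over the response positions only."""
    kept = [loss for loss, label in zip(token_losses, labels) if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical token ids: three prompt tokens followed by two response tokens.
prompt_ids = [101, 102, 103]
response_ids = [201, 202]
input_ids, labels = build_labels(prompt_ids, response_ids)

print(input_ids)  # the model still sees (attends to) every token
print(labels)     # only the response positions carry a training signal

# Hypothetical per-token losses: the prompt values have no effect on the result.
print(masked_mean_loss([5.0, 5.0, 5.0, 1.0, 3.0], labels))
```

The full sequence is still fed to the model, so the response is conditioned on the prompt and chat template, but gradients come only from the response tokens. The one-position shift between inputs and targets is left to the training framework here.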

<br>
<br>

# Performance

For testing, we use translations of the standard benchmarks GPQA Diamond, MMLU and MMLU Pro, as well as the CoPA, MRPC and RTE datasets in [extraGLUE](https://huggingface.co/datasets/PORTULAN/extraglue).

| Model                  | GPQA Diamond PT |   MMLU PT | MMLU Pro PT |      CoPA |      MRPC |       RTE |   Average |
| ---------------------- | --------------: | --------: | ----------: | --------: | --------: | --------: | --------: |
| Gervásio 8B PTPT       |       **34.85** | **62.15** |   **36.79** | **87.00** | **77.45** |     77.62 | **62.64** |
| LLaMA 3.1 8B Instruct  |           32.32 |     61.49 |       36.10 |     83.00 |     75.25 | **79.42** |     61.26 |
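
The Average column appears to be the unweighted mean of the six benchmark scores. That reading is an assumption, since the card does not say how the average is computed, but the short check below reproduces the reported values from the table.

```python
# Scores copied from the table, in column order:
# GPQA Diamond PT, MMLU PT, MMLU Pro PT, CoPA, MRPC, RTE.
scores = {
    "Gervásio 8B PTPT": [34.85, 62.15, 36.79, 87.00, 77.45, 77.62],
    "LLaMA 3.1 8B Instruct": [32.32, 61.49, 36.10, 83.00, 75.25, 79.42],
}

# Unweighted mean over the six benchmarks, rounded to two decimals.
averages = {model: round(sum(vals) / len(vals), 2) for model, vals in scores.items()}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}")
```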

<br>
<br>

# How to use

You can use this model directly with a pipeline for causal language modeling:

```python3
from transformers import pipeline

# Load the model as a text-generation pipeline (the weights are downloaded on first use).
generator = pipeline("text-generation", model="PORTULAN/gervasio-8b-portuguese-ptpt-decoder")
# Continue the Portuguese prompt with up to 10 newly generated tokens.
print(generator("A comida portuguesa é", max_new_tokens=10))
```

<br>
<br>

# Chatbot

This model is integrated in the **chatbot** [**Evaristo.ai**](https://evaristo.ai), where its generative capabilities can be tried out interactively through a GUI.

<br>
<br>

# Please cite

```bibtex
@misc{gervasio,
      title={Advancing Generative AI for Portuguese with Open Decoder Gervásio PT-*},
      author={Rodrigo Santos and João Silva and Luís Gomes and João Rodrigues and António Branco},
      year={2024},
      eprint={2402.18766},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

Please use the above canonical reference when using or citing this model.

<br>
<br>

# Acknowledgments

The research reported here was partially supported by:
- PORTULAN CLARIN - Research Infrastructure for the Science and Technology of Language, funded by Lisboa 2020, Alentejo 2020 and FCT - Fundação para a Ciência e a Tecnologia under the grant PINFRA/22117/2016;
- innovation project ACCELERAT.AI - Multilingual Intelligent Contact Centers, funded by IAPMEI, I.P. - Agência para a Competitividade e Inovação I.P. under the grant C625734525-00462629, of Plano de Recuperação e Resiliência, call RE-C05-i01.01 – Agendas/Alianças Mobilizadoras para a Reindustrialização;
- research project "Hey, Hal, curb your hallucination! / Enhancing AI chatbots with enhanced RAG solutions", funded by FCT - Fundação para a Ciência e a Tecnologia under the grant 2024.07592.IACDC;
- project "CLARIN – Infraestrutura de Investigação para a Ciência e Tecnologia da Linguagem", funded by programme Lisboa2030 under the grant LISBOA2030-FEDER-01316900PORTULAN.