Initialize project; model provided by the ModelHub XC community
Model: ChocoLlama/Llama-3-ChocoLlama-8B-instruct Source: Original Platform


---
language:
- nl
license: cc-by-nc-4.0
base_model: ChocoLlama/Llama-3-ChocoLlama-8B-base
datasets:
- BramVanroy/ultrachat_200k_dutch
- BramVanroy/stackoverflow-chat-dutch
- BramVanroy/alpaca-cleaned-dutch
- BramVanroy/dolly-15k-dutch
- BramVanroy/no_robots_dutch
- BramVanroy/ultra_feedback_dutch
---

<p align="center" style="margin:0;padding:0">
  <img src="./chocollama_logo.png" alt="ChocoLlama logo" width="500" style="margin-left:'auto' margin-right:'auto' display:'block'"/>
</p>
<div style="margin:auto; text-align:center">
<h1 style="margin-bottom: 0">ChocoLlama</h1>
<em>A Llama-2/3-based family of Dutch language models</em>
</div>

## Llama-3-ChocoLlama-8B-instruct: Getting Started

We here present **Llama-3-ChocoLlama-8B-instruct**, an instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
Its base model, [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base), is a language-adapted version of Meta's Llama-3-8B, fine-tuned with LoRA on the same Dutch dataset as ChocoLlama-2-7B-base (104GB, or 32B Llama-2 tokens).

Use the code below to get started with the model.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the model. device_map="auto" (requires the `accelerate`
# package) places the weights on the available device(s).
tokenizer = AutoTokenizer.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct')
model = AutoModelForCausalLM.from_pretrained('ChocoLlama/Llama-3-ChocoLlama-8B-instruct', device_map="auto")

# Conversation: a Dutch system prompt followed by the user's question.
messages = [
    {"role": "system", "content": "Je bent een artificiële intelligentie-assistent en geeft behulpzame, gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."},
    {"role": "user", "content": "Jacques Brel, Willem Elsschot en Jan Jambon zitten op café. Waar zouden ze over babbelen?"},
]

# Apply the model's chat template and add the generation prompt so the model
# continues with a new assistant turn.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

# Stop at either the regular EOS token or the end-of-turn token.
new_terminators = [
    tokenizer.eos_token_id,
    tokenizer.convert_tokens_to_ids("<|eot_id|>")
]

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    eos_token_id=new_terminators,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)

# Drop the prompt tokens and decode only the newly generated reply.
response = outputs[0][input_ids.shape[-1]:]
print(tokenizer.decode(response, skip_special_tokens=True))
```

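
The generation call above samples with `temperature=0.8` and `top_p=0.95`. As a rough illustration of what these two settings do to the next-token distribution, here is a minimal pure-Python sketch. It is not the `transformers` implementation; the function name `nucleus_probs` and the toy logits are hypothetical and chosen only for this example.

```python
import math


def nucleus_probs(logits, temperature=0.8, top_p=0.95):
    """Temperature-scale the logits, then keep the smallest set of tokens
    whose cumulative probability reaches top_p, and renormalize."""
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]

    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Walk through tokens from most to least likely until top_p mass is covered.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize over the kept tokens; sampling would draw from this dict.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


# Hypothetical logits for a 4-token vocabulary.
toy_logits = [4.0, 3.0, 1.0, -2.0]
filtered = nucleus_probs(toy_logits)
print(filtered)
```
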
Note that the datasets used for instruction-tuning were translated using GPT-3.5/4, which means that this instruction-tuned model cannot be used for commercial purposes.
Hence, for any commercial applications, we recommend fine-tuning the base model on your own Dutch data.

## Model Details

ChocoLlama is a family of open LLMs specifically adapted to Dutch, contributing to the state of the art of Dutch open LLMs in their weight class.

We provide 6 variants (3 base models and 3 instruction-tuned models):
- **ChocoLlama-2-7B-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-base)): A language-adapted version of Meta's Llama-2-7b, fine-tuned on 32B Dutch Llama-2 tokens (104GB) using LoRA.
- **ChocoLlama-2-7B-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-instruct)): An instruction-tuned version of ChocoLlama-2-7B-base, fine-tuned on a collection of Dutch translations of instruction-tuning datasets, using SFT followed by DPO.
- **ChocoLlama-2-7B-tokentrans-base** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-base)): A language-adapted version of Meta's Llama-2-7b, using a Dutch RoBERTa-based tokenizer. The token embeddings of this model were reinitialized using the token translation algorithm proposed by [Remy et al.](https://arxiv.org/pdf/2310.03477). The model was subsequently fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **ChocoLlama-2-7B-tokentrans-instruct** ([link](https://huggingface.co/ChocoLlama/ChocoLlama-2-7B-tokentrans-instruct)): An instruction-tuned version of ChocoLlama-2-7B-tokentrans-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.
- **Llama-3-ChocoLlama-8B-base** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)): A language-adapted version of Meta's Llama-3-8B, fine-tuned on the same Dutch dataset as ChocoLlama-2-7B-base, again using LoRA.
- **Llama-3-ChocoLlama-8B-instruct** ([link](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-instruct)): An instruction-tuned version of Llama-3-ChocoLlama-8B-base, fine-tuned on the same dataset as ChocoLlama-2-7B-instruct, again using SFT followed by DPO.

For benchmark results for all models, including comparisons with their base models and other Dutch LLMs, we refer to our [paper](https://arxiv.org/pdf/2412.07633).

### Model Description

- **Developed by:** [Matthieu Meeus](https://huggingface.co/matthieumeeus97), [Anthony Rathé](https://huggingface.co/anthonyrathe)
- **Funded by:** [Vlaams Supercomputer Centrum](https://www.vscentrum.be/), through a grant of approx. 40K GPU hours (NVIDIA A100-80GB)
- **Language(s):** Dutch
- **License:** cc-by-nc-4.0
- **Finetuned from model:** [Llama-3-ChocoLlama-8B-base](https://huggingface.co/ChocoLlama/Llama-3-ChocoLlama-8B-base)

### Model Sources

- **Repository:** [on GitHub here](https://github.com/ChocoLlamaModel/ChocoLlama).
- **Paper:** [on arXiv here](https://arxiv.org/pdf/2412.07633).

## Uses

### Direct Use

This is an instruction-tuned (SFT + DPO) Dutch model, optimized for Dutch language generation in conversational settings.
For optimal behavior, we advise using the model only with the correct chat template (see the Python code above), optionally together with a system prompt.

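
To make the "chat template plus optional system prompt" advice concrete, the sketch below builds the `messages` list in the role/content format used in the Getting Started code. The helper name `build_messages`, the `history` parameter, and the example question are hypothetical; the default system prompt is the one from the snippet above.

```python
# System prompt copied from the Getting Started example.
DEFAULT_SYSTEM_PROMPT = (
    "Je bent een artificiële intelligentie-assistent en geeft behulpzame, "
    "gedetailleerde en beleefde antwoorden op de vragen van de gebruiker."
)


def build_messages(user_prompt, system_prompt=DEFAULT_SYSTEM_PROMPT, history=None):
    """Return a list of chat messages ending with the new user turn.

    system_prompt: pass None or "" to omit the system message.
    history: optional list of earlier {"role": ..., "content": ...} turns.
    """
    messages = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    if history:
        messages.extend(history)
    messages.append({"role": "user", "content": user_prompt})
    return messages


# Hypothetical single-turn conversation.
messages = build_messages("Wat is de hoofdstad van België?")
print(messages)
```

The resulting list can then be passed to `tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")`, exactly as in the Getting Started code, so that the prompt is always formatted by the template shipped with the tokenizer rather than by hand.
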
### Out-of-Scope Use

Use cases requiring understanding or generation of text in languages other than Dutch: the dataset on which this model was fine-tuned does not contain data in other languages, so we expect significant catastrophic forgetting to have occurred for English, the language Llama-3 was primarily trained on.

## Bias, Risks, and Limitations

We have taken care to include only widely used and high-quality data in our dataset. Some of this data has been filtered by the original creators.
However, we did not explicitly conduct any additional filtering of this dataset with regard to biased or otherwise harmful content.

## Training Details

We adopt the same strategy as used to align GEITje-7B to [GEITje-7B-ultra](https://huggingface.co/BramVanroy/GEITje-7B-ultra).
First, we apply supervised fine-tuning (SFT), utilizing the data made available by [Vanroy](https://arxiv.org/pdf/2312.12852):
- [BramVanroy/ultrachat_200k_dutch](https://huggingface.co/datasets/BramVanroy/ultrachat_200k_dutch)
- [BramVanroy/no_robots_dutch](https://huggingface.co/datasets/BramVanroy/no_robots_dutch)
- [BramVanroy/stackoverflow-chat-dutch](https://huggingface.co/datasets/BramVanroy/stackoverflow-chat-dutch)
- [BramVanroy/alpaca-cleaned-dutch](https://huggingface.co/datasets/BramVanroy/alpaca-cleaned-dutch)
- [BramVanroy/dolly-15k-dutch](https://huggingface.co/datasets/BramVanroy/dolly-15k-dutch)

Next, we apply Direct Preference Optimization (DPO) to the SFT version of all the pretrained models we develop here, now utilizing a Dutch version of the data used to train Zephyr-7B-$\beta$, [BramVanroy/ultra_feedback_dutch](https://huggingface.co/datasets/BramVanroy/ultra_feedback_dutch).

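
For readers unfamiliar with DPO, the per-example objective can be sketched in a few lines of plain Python. This is a minimal illustration of the standard DPO loss, not the training code used for this model: the function name `dpo_loss`, the value `beta=0.1`, and the toy log-probabilities are assumptions made for this example and are not taken from the training setup described here.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference (SFT) model.
    beta (assumed value) controls how far the policy may drift from the reference.
    """
    # Implicit rewards: how much more likely the policy makes each response
    # compared with the reference model.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Toy values: the policy matches the reference, so the loss equals log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# The policy now prefers the chosen response more than the reference does,
# so the loss is lower.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```
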
For both the SFT and DPO stages, we update all model weights and apply the same set of hyperparameters to all models as used in GEITje-7B-ultra:
- learning_rate: 5e-07
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- distributed_type: multi-GPU
- num_devices: 4
- gradient_accumulation_steps: 4
- total_train_batch_size: 64
- total_eval_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: cosine
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 1

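
The totals in the list above follow from the per-device settings, and the scheduler entries describe a warmup followed by cosine decay. The sketch below reproduces the batch-size arithmetic and shows one common form of such a schedule (linear warmup, then cosine decay to zero). The function name `lr_at` and the step count of 1000 are hypothetical; the exact schedule used in training is determined by the training framework, not by this sketch.

```python
import math

# Values taken from the hyperparameter list.
per_device_train_batch_size = 4
per_device_eval_batch_size = 4
num_devices = 4
gradient_accumulation_steps = 4

# Effective batch sizes across devices (and accumulation steps for training).
total_train_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)
total_eval_batch_size = per_device_eval_batch_size * num_devices
print(total_train_batch_size, total_eval_batch_size)


def lr_at(step, total_steps, peak_lr=5e-07, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup over the first
    warmup_ratio of training, then cosine decay from peak_lr to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run of 1000 optimizer steps.
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps=1000))
```
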
Further, we leverage the publicly available [alignment handbook](https://github.com/huggingface/alignment-handbook) and use a set of 4 NVIDIA A100 (80 GB) GPUs for both stages.

## Evaluation

### Quantitative evaluation

We have evaluated our models on several industry-standard Dutch benchmarks, translated from their original versions. The results can be found in the table below, together with results from several other prominent Dutch models.

| Model                                        | ARC            | HellaSwag      | MMLU           | TruthfulQA     | Avg.           |
|----------------------------------------------|----------------|----------------|----------------|----------------|----------------|
| **Llama-3-ChocoLlama-instruct**              | **0.48**       | **0.66**       | **0.49**       | **0.49**       | **0.53**       |
| llama-3-8B-rebatch                           | 0.44           | 0.64           | 0.46           | 0.48           | 0.51           |
| llama-3-8B-instruct                          | 0.47           | 0.59           | 0.47           | 0.52           | 0.51           |
| llama-3-8B                                   | 0.44           | 0.64           | 0.47           | 0.45           | 0.50           |
| Reynaerde-7B-Chat                            | 0.44           | 0.62           | 0.39           | 0.52           | 0.49           |
| **Llama-3-ChocoLlama-base**                  | **0.45**       | **0.64**       | **0.44**       | **0.44**       | **0.49**       |
| zephyr-7b-beta                               | 0.43           | 0.58           | 0.43           | 0.53           | 0.49           |
| geitje-7b-ultra                              | 0.40           | 0.66           | 0.36           | 0.49           | 0.48           |
| **ChocoLlama-2-7B-tokentrans-instruct**      | **0.45**       | **0.62**       | **0.34**       | **0.42**       | **0.46**       |
| mistral-7b-v0.1                              | 0.43           | 0.58           | 0.37           | 0.45           | 0.46           |
| **ChocoLlama-2-7B-tokentrans-base**          | **0.42**       | **0.61**       | **0.32**       | **0.43**       | **0.45**       |
| **ChocoLlama-2-7B-instruct**                 | **0.36**       | **0.57**       | **0.33**       | **0.45**       | **0.43**       |
| **ChocoLlama-2-7B-base**                     | **0.35**       | **0.56**       | **0.31**       | **0.43**       | **0.41**       |
| llama-2-7b-chat-hf                           | 0.36           | 0.49           | 0.33           | 0.44           | 0.41           |
| llama-2-7b-hf                                | 0.36           | 0.51           | 0.32           | 0.41           | 0.40           |

On average, Llama-3-ChocoLlama-instruct surpasses the previous state-of-the-art on these benchmarks.

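
The Avg. column appears to be the unweighted mean of the four benchmark scores; this is an assumption, since the table does not state how it was computed. A quick check on three rows copied from the table is sketched below. Because the per-benchmark scores are already rounded to two decimals, small deviations from the reported averages are to be expected.

```python
# Rows copied from the table: ([ARC, HellaSwag, MMLU, TruthfulQA], reported Avg.)
scores = {
    "Llama-3-ChocoLlama-instruct": ([0.48, 0.66, 0.49, 0.49], 0.53),
    "llama-3-8B-instruct": ([0.47, 0.59, 0.47, 0.52], 0.51),
    "geitje-7b-ultra": ([0.40, 0.66, 0.36, 0.49], 0.48),
}

# Unweighted mean over the four benchmarks for each model.
means = {name: sum(per_task) / len(per_task) for name, (per_task, _) in scores.items()}

for name, (_, reported) in scores.items():
    print(f"{name}: mean={means[name]:.4f} reported={reported:.2f}")
```

For these three rows, the recomputed means agree with the reported averages to within rounding.
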
### Qualitative evaluation

In our paper, we also provide an additional qualitative evaluation of all models, which we empirically find more reliable.
For details, we refer to the paper and to our benchmark [ChocoLlama-Bench](https://huggingface.co/datasets/ChocoLlama/ChocoLlama-Bench).

### Compute Infrastructure

All ChocoLlama models have been trained on the compute cluster provided by the [Flemish Supercomputer Center (VSC)](https://www.vscentrum.be/). We used 8 to 16 NVIDIA A100 GPUs with 80 GB of VRAM.

## Citation

If you found this useful for your work, kindly cite our paper:

```
@article{meeus2024chocollama,
  title={ChocoLlama: Lessons Learned From Teaching Llamas Dutch},
  author={Meeus, Matthieu and Rath{\'e}, Anthony and Remy, Fran{\c{c}}ois and Delobelle, Pieter and Decorte, Jens-Joris and Demeester, Thomas},
  journal={arXiv preprint arXiv:2412.07633},
  year={2024}
}
```