Initialize project; model provided by the ModelHub XC community
Model: OpenLLM-France/Lucie-7B-Instruct-human-data Source: Original Platform

---
license: apache-2.0
datasets:
- CohereForAI/aya_dataset
- argilla/databricks-dolly-15k-curated-multilingual
- Gael540/dataSet_ens_sup_fr-v1
- ai2-adapt-dev/flan_v2_converted
- OpenAssistant/oasst1
language:
- fr
- en
- de
- it
- es
base_model:
- OpenLLM-France/Lucie-7B
pipeline_tag: text-generation
---

# Model Card for Lucie-7B-Instruct-human-data

* [Model Description](#model-description)
<!-- * [Uses](#uses) -->
* [Training Details](#training-details)
  * [Training Data](#training-data)
  * [Preprocessing](#preprocessing)
  * [Instruction template](#instruction-template)
  * [Training Procedure](#training-procedure)
<!-- * [Evaluation](#evaluation) -->
* [Testing the model](#testing-the-model)
  * [Test with ollama](#test-with-ollama)
  * [Test with vLLM](#test-with-vllm)
* [Citation](#citation)
* [Acknowledgements](#acknowledgements)
* [Contact](#contact)

## Model Description

Lucie-7B-Instruct-human-data is a fine-tuned version of [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B), an open-source, multilingual causal language model created by OpenLLM-France.

Lucie-7B-Instruct-human-data is fine-tuned on human-produced instructions collected either from open annotation campaigns or by applying templates to existing datasets. Its performance falls below that of [Lucie-7B-Instruct-v1.1](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-v1.1); the purpose of the model is to show how far instruction fine-tuning can go without relying on third-party LLMs.

Note that Lucie-7B-Instruct-human-data is optimized for the generation of French text. It has not been trained for code generation or optimized for math. Such capabilities can be improved through further fine-tuning and alignment with methods such as DPO or RLHF.

While Lucie-7B-Instruct-human-data is trained on sequences of 4096 tokens, its base model, Lucie-7B, has a context size of 32K tokens. Based on needle-in-a-haystack evaluations, Lucie-7B-Instruct-human-data maintains the capacity of the base model to handle 32K-token context windows.

## Training details

### Training data

Lucie-7B-Instruct-human-data is trained on the following datasets published by third parties:

* [Aya Dataset](https://huggingface.co/datasets/CohereForAI/aya_dataset) (English, 3944 samples; French, 1422; German, 241; Italian, 738; Spanish, 3854)
* [Dolly](https://huggingface.co/datasets/argilla/databricks-dolly-15k-curated-multilingual) (English, French, German, Spanish; 15015 x 4 samples)
* [ENS](https://huggingface.co/datasets/Gael540/dataSet_ens_sup_fr-v1) (French, 394 samples)
* [FLAN v2 Converted](https://huggingface.co/datasets/ai2-adapt-dev/flan_v2_converted) (English, 78580 samples)
* [Open Assistant 1](https://huggingface.co/datasets/OpenAssistant/oasst1) (English, 21151 samples; French, 1223; German, 1515; Italian, 370; Spanish, 14078)
* [Oracle](https://github.com/opinionscience/InstructionFr/tree/main/wikipedia) (French, 4613 samples)
* [PIAF](https://www.data.gouv.fr/fr/datasets/piaf-le-dataset-francophone-de-questions-reponses/) (French, 1849 samples)

And the following datasets developed for the Lucie instruct models:

* [Croissant Aligned Instruct](https://huggingface.co/datasets/OpenLLM-France/Croissant-Aligned-Instruct) (French-English, 20K examples sampled randomly from 80K total)
* Hard-coded prompts concerning OpenLLM and Lucie (based on [allenai/tulu-3-hard-coded-10x](https://huggingface.co/datasets/allenai/tulu-3-hard-coded-10x))
  * French: openllm_french.jsonl (24x10 samples)
  * English: openllm_english.jsonl (24x10 samples)

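
As a rough sense of scale, the counts listed above can be tallied. This is only a sketch that takes the listed figures at face value: it assumes "15015 x 4" and "24x10" are plain products, that Croissant Aligned Instruct contributes its 20K subset, and that the Open Assistant counts are as listed (the card does not say whether they are measured before or after keyword filtering). The names `SAMPLES` and `total_samples` are illustrative.

```python
# Sample counts copied from the lists above (see the stated assumptions).
SAMPLES = {
    "Aya Dataset": 3944 + 1422 + 241 + 738 + 3854,
    "Dolly": 15015 * 4,
    "ENS": 394,
    "FLAN v2 Converted": 78580,
    "Open Assistant 1": 21151 + 1223 + 1515 + 370 + 14078,
    "Oracle": 4613,
    "PIAF": 1849,
    "Croissant Aligned Instruct": 20_000,
    "Hard-coded prompts": 2 * 24 * 10,  # French + English files
}


def total_samples(samples: dict) -> int:
    """Sum the per-dataset sample counts."""
    return sum(samples.values())


# Print datasets from largest to smallest, then the overall total.
for name, count in sorted(SAMPLES.items(), key=lambda item: -item[1]):
    print(f"{name:28s} {count:>7d}")
print(f"{'Total':28s} {total_samples(SAMPLES):>7d}")
```
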
### Preprocessing

* Filtering by language: Aya Dataset, Dolly and Open Assistant were filtered to keep only languages on which Lucie-7B was trained.
* Filtering by keyword: Examples containing assistant responses were filtered out from Open Assistant if the responses contained a keyword from the list [filter_strings](https://github.com/OpenLLM-France/Lucie-Training/blob/98792a1a9015dcf613ff951b1ce6145ca8ecb174/tokenization/data.py#L2012). This filter is designed to remove examples in which the assistant is presented as a model other than Lucie (e.g., ChatGPT, Gemma, Llama, ...).

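
The two filters can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the record layout (`lang`, `response`), the function name `keep_example`, and the three-entry keyword list are assumptions made for the example. The real keyword list is the `filter_strings` list linked above, and the kept languages are assumed here to be the five listed in the card's metadata.

```python
# Assumption: the languages kept are the five listed in the model card metadata.
KEPT_LANGUAGES = {"fr", "en", "de", "it", "es"}

# Illustrative stand-in for the real `filter_strings` list (see link above).
FILTER_STRINGS = ["chatgpt", "gemma", "llama"]


def keep_example(example: dict) -> bool:
    """Return True if a (hypothetical) record passes both filters.

    `example["lang"]` is a language code and `example["response"]`
    is the assistant reply.
    """
    if example["lang"] not in KEPT_LANGUAGES:
        return False
    response = example["response"].lower()
    return not any(keyword in response for keyword in FILTER_STRINGS)


examples = [
    {"lang": "fr", "response": "Bonjour ! Comment puis-je vous aider ?"},
    {"lang": "en", "response": "As ChatGPT, I cannot do that."},
    {"lang": "ja", "response": "こんにちは"},
]
kept = [example for example in examples if keep_example(example)]
```

Plain substring matching is used here for brevity; it would also drop a legitimate answer that mentions the animal "llama", so the matching rules of the real filter may well differ.
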
### Instruction template

Lucie-7B-Instruct-human-data was trained on the chat template from Llama 3.1 with the sole difference that `<|begin_of_text|>` is replaced with `<s>`. The resulting template:

```
<s><|start_header_id|>system<|end_header_id|>

{SYSTEM}<|eot_id|><|start_header_id|>user<|end_header_id|>

{INPUT}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{OUTPUT}<|eot_id|>
```

An example:

```
<s><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Give me three tips for staying in shape.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

1. Eat a balanced diet and be sure to include plenty of fruits and vegetables. \n2. Exercise regularly to keep your body active and strong. \n3. Get enough sleep and maintain a consistent sleep schedule.<|eot_id|>
```

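
For inference, a prompt following the template above stops right after the assistant header, leaving `{OUTPUT}` for the model to generate. The helper below is a minimal sketch of that string construction; the function name `format_prompt` and its default system message are illustrative choices, not part of the model's tooling.

```python
def format_prompt(user: str, system: str = "You are a helpful assistant.") -> str:
    """Build a single-turn prompt that ends at the assistant header.

    The model is expected to continue with its answer and finish
    with <|eot_id|>.
    """
    return (
        "<s><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
    )


prompt = format_prompt("Give me three tips for staying in shape.")
print(prompt)
```

If the tokenizer in use already prepends `<s>` as a beginning-of-sequence token, the leading `<s>` here would be duplicated and should be dropped from the string.
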
### Training procedure

The model architecture and hyperparameters are the same as for [Lucie-7B](https://huggingface.co/OpenLLM-France/Lucie-7B) during the annealing phase with the following exceptions:

* context length: 4096<sup>*</sup>
* batch size: 1024
* max learning rate: 3e-5
* min learning rate: 3e-6

<sup>*</sup>As noted above, while Lucie-7B-Instruct-human-data is trained on sequences of 4096 tokens, it maintains the capacity of the base model, Lucie-7B, to handle context sizes of up to 32K tokens.

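
To put these values in perspective, the sketch below derives the number of tokens per optimizer step and the ratio between the two learning rates. It assumes that the batch size counts sequences and that every sequence fills the full 4096-token context; neither is stated explicitly above, and the shape of the learning-rate schedule is not specified either.

```python
CONTEXT_LENGTH = 4096  # tokens per training sequence
BATCH_SIZE = 1024      # assumption: sequences per optimizer step
MAX_LR = 3e-5
MIN_LR = 3e-6

# Upper bound on tokens seen per optimizer step under the assumptions above.
tokens_per_step = CONTEXT_LENGTH * BATCH_SIZE

# How far the learning rate decays between its maximum and minimum.
lr_ratio = MAX_LR / MIN_LR

print(f"tokens per step: {tokens_per_step:,}")
print(f"max/min learning rate ratio: {lr_ratio:.1f}")
```
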
## Testing the model

### Test with ollama

* Download and install [Ollama](https://ollama.com/download)
* Download the [GGUF model](https://huggingface.co/OpenLLM-France/Lucie-7B-Instruct-human-data/resolve/main/Lucie-7B-q4_k_m.gguf)
* Copy the [`Modelfile`](Modelfile), adapting if necessary the path to the GGUF file (line starting with `FROM`).
* Run in a shell:
  * `ollama create -f Modelfile Lucie`
  * `ollama run Lucie`
* Once ">>>" appears, type your prompt(s) and press Enter.
* Optionally, restart a conversation by typing "`/clear`"
* End the session by typing "`/bye`".

Useful for debugging:

* [How to print input requests and output responses in Ollama server?](https://stackoverflow.com/a/78831840)
* [Documentation on Modelfile](https://github.com/ollama/ollama/blob/main/docs/modelfile.md#parameter)
* Examples: [Ollama model library](https://github.com/ollama/ollama#model-library)
  * Llama 3 example: https://ollama.com/library/llama3.1
* Add a GUI: https://docs.openwebui.com/

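
Besides the interactive shell, a model created this way can also be queried through Ollama's local REST API. The sketch below uses only the Python standard library and assumes that the Ollama server is running on its default port (11434) and that the model was registered under the name `Lucie` as in the steps above. The helper names `build_chat_request` and `chat` are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint


def build_chat_request(prompt: str, model: str = "Lucie") -> urllib.request.Request:
    """Prepare a non-streaming chat request for a single user message."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, model: str = "Lucie") -> str:
    """Send the prompt and return the reply text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_chat_request(prompt, model)) as response:
        return json.load(response)["message"]["content"]
```

With the server running, `print(chat("Bonjour Lucie"))` should print the model's reply.
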
### Test with vLLM

#### 1. Run vLLM Docker Container

Use the following command to deploy the model, replacing `INSERT_YOUR_HF_TOKEN` with your Hugging Face Hub token.

```bash
docker run --runtime nvidia --gpus=all \
    --env "HUGGING_FACE_HUB_TOKEN=INSERT_YOUR_HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model OpenLLM-France/Lucie-7B-Instruct-human-data
```

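
The container needs some time to download and load the weights. One way to check that it is ready is to list the served models through the OpenAI-compatible `/v1/models` endpoint. The following is a small sketch using only the standard library; it assumes the port mapping from the command above, and the helper names `model_ids` and `served_models` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches the -p 8000:8000 mapping above


def model_ids(models_response: dict) -> list:
    """Extract model names from an OpenAI-style `GET /v1/models` response."""
    return [entry["id"] for entry in models_response.get("data", [])]


def served_models(base_url: str = BASE_URL) -> list:
    """Query the running server; raises URLError while it is not reachable."""
    with urllib.request.urlopen(f"{base_url}/models") as response:
        return model_ids(json.load(response))
```

Once the server is up, `served_models()` should return a list containing `OpenLLM-France/Lucie-7B-Instruct-human-data`.
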
#### 2. Test using OpenAI Client in Python

To test the deployed model, use the OpenAI Python client as follows:

```python
from openai import OpenAI

# Initialize the client
client = OpenAI(base_url='http://localhost:8000/v1', api_key='empty')

# Define the input content
content = "Hello Lucie"

# Generate a response
chat_response = client.chat.completions.create(
    model="OpenLLM-France/Lucie-7B-Instruct-human-data",
    messages=[
        {"role": "user", "content": content}
    ],
)
print(chat_response.choices[0].message.content)
```

## Citation

When using the Lucie-7B-Instruct-human-data model, please cite the following paper:

✍ Olivier Gouvert, Julie Hunter, Jérôme Louradour,
Christophe Cerisara, Evan Dufraisse, Yaya Sy,
Laura Rivière, Jean-Pierre Lorré (2025).
[The Lucie-7B LLM and the Lucie Training Dataset:
Open resources for multilingual language generation](https://arxiv.org/abs/2503.12294). arXiv:2503.12294.

```bibtex
@misc{openllm2025lucie,
      title={The Lucie-7B LLM and the Lucie Training Dataset: Open resources for multilingual language generation},
      author={Olivier Gouvert and Julie Hunter and Jérôme Louradour and Christophe Cerisara and Evan Dufraisse and Yaya Sy and Laura Rivière and Jean-Pierre Lorré and OpenLLM-France community},
      year={2025},
      eprint={2503.12294},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.12294},
}
```

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2024-GC011015444). We gratefully acknowledge support from GENCI and IDRIS and from Pierre-François Lavallée (IDRIS) and Stephane Requena (GENCI) in particular.

Lucie-7B was created by members of [LINAGORA](https://labs.linagora.com/) and the [OpenLLM-France](https://www.openllm-france.fr/) community, including in alphabetical order:
Olivier Gouvert (LINAGORA),
Ismaïl Harrando (LINAGORA/SciencesPo),
Julie Hunter (LINAGORA),
Jean-Pierre Lorré (LINAGORA),
Jérôme Louradour (LINAGORA),
Michel-Marie Maudet (LINAGORA), and
Laura Rivière (LINAGORA).

We thank
Clément Bénesse (Opsci),
Christophe Cerisara (LORIA),
Émile Hazard (Opsci),
Evan Dufraisse (CEA),
Guokan Shang (MBZUAI),
Joël Gombin (Opsci),
Jordan Ricker (Opsci),
and
Olivier Ferret (CEA)
for their helpful input.

Finally, we thank the entire OpenLLM-France community, whose members have helped in diverse ways.

## Contact

contact@openllm-france.fr