Initialize project; model provided by the ModelHub XC community
Model: ayoolaolafenwa/ChatLM Source: Original Platform
145
README.md
Normal file
@@ -0,0 +1,145 @@
---
license: apache-2.0
datasets:
- ayoolaolafenwa/sft-data
language:
- en
---

## ChatLM

ChatLM is a chat large language model finetuned from the pretrained [Falcon-1B model](https://huggingface.co/tiiuae/falcon-rw-1b)
on the [chat-bot-instructions prompts dataset](https://huggingface.co/datasets/ayoolaolafenwa/sft-data).
It was trained on a dataset of ordinary day-to-day human conversations. Because the training data was limited,
it does not generalize well to tasks such as coding or current affairs, and hallucinations may occur.

# GitHub Repo: https://github.com/ayoolaolafenwa/ChatLM

# Have a live chat with ChatLM on its Space: https://huggingface.co/spaces/ayoolaolafenwa/ChatLM

# Install Required Packages

```
pip install transformers
pip install accelerate
pip install einops
pip install bitsandbytes
```

## Load Model in bfloat16

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the weights in bfloat16 and move the model to the GPU
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16).to("cuda")

# The prompt wraps the user message in <user> and <chatbot> tags
prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")

# Move the inputs to the same device as the model
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample one sequence of at most 2048 tokens (prompt included)
outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
    num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)

# Decode the generated ids and strip the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```

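The example prompt follows a `<user>: ... <chatbot>: ` pattern, and the decoded output contains the prompt followed by the reply. Below is a minimal sketch of two string helpers for that pattern. The names `build_prompt` and `extract_reply` are hypothetical and are not part of the ChatLM repository; the single-turn format is assumed from the example prompt above.

```python
USER_TAG = "<user>: "
CHATBOT_TAG = "<chatbot>: "
EOS_TEXT = "<|endoftext|>"


def build_prompt(user_message: str) -> str:
    """Wrap a user message in the single-turn format used by the example prompt."""
    return f"{USER_TAG}{user_message.strip()} {CHATBOT_TAG}"


def extract_reply(decoded_text: str) -> str:
    """Return the text after the last <chatbot> tag, without the end-of-text marker."""
    text = decoded_text.replace(EOS_TEXT, "")
    # rpartition returns the whole string as the last element when the tag is absent
    return text.rpartition(CHATBOT_TAG.strip())[2].strip()


if __name__ == "__main__":
    prompt = build_prompt("Give me financial advice on investing in stocks.")
    print(repr(prompt))
    print(extract_reply(prompt + "Diversify your holdings.<|endoftext|>"))
```

`build_prompt` would replace the hand-written prompt string, and `extract_reply` would be applied to the decoded output before printing it.
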
## Load Model in bfloat16 and int8

``` python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "ayoolaolafenwa/ChatLM"

tokenizer = AutoTokenizer.from_pretrained(model_path)

# Load the weights in 8-bit through bitsandbytes (requires a CUDA GPU)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True,
    torch_dtype=torch.bfloat16, load_in_8bit=True)

# The prompt wraps the user message in <user> and <chatbot> tags
prompt = "<user>: Give me financial advice on investing in stocks. <chatbot>: "

tokens = tokenizer(prompt, return_tensors="pt")

# Move the inputs to the same device as the model
token_ids = tokens.input_ids.to(model.device)
attention_mask = tokens.attention_mask.to(model.device)

# Sample one sequence of at most 2048 tokens (prompt included)
outputs = model.generate(input_ids=token_ids, attention_mask=attention_mask, max_length=2048, do_sample=True,
    num_return_sequences=1, top_k=10, temperature=0.7, eos_token_id=tokenizer.eos_token_id)

# Decode the generated ids and strip the end-of-text marker
output_text = tokenizer.decode(outputs[0])
output_text = output_text.replace("<|endoftext|>", "")

print(output_text)
```

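As a rough illustration of why 8-bit loading can help on smaller GPUs, the sketch below estimates the memory taken by the weights alone at 2 bytes per parameter (bfloat16) and 1 byte per parameter (int8). The parameter count of 1 billion is a round figure assumed from the "Falcon-1B" name, not a measured value, and `weight_memory_gib` is a hypothetical helper. Real usage is higher because of activations, the generation cache, and any layers kept in higher precision.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: a round 1 billion parameters, taken from the model name
NUM_PARAMS = 1_000_000_000

bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # bfloat16 stores 2 bytes per parameter
int8_gib = weight_memory_gib(NUM_PARAMS, 1)  # int8 stores 1 byte per parameter

print(f"bfloat16 weights: {bf16_gib:.2f} GiB")
print(f"int8 weights: {int8_gib:.2f} GiB")
```
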
# Training procedure for Supervised Finetuning

## Dataset Preparation

The Chatbot Instructions prompts dataset from https://huggingface.co/datasets/alespalla/chatbot_instruction_prompts/viewer/alespalla--chatbot_instruction_prompts
was processed into a supervised finetuning format that pairs each user prompt with its corresponding response.

##### Download Data
``` python
from datasets import load_dataset

# Download the training split of the dataset
dataset = load_dataset("alespalla/chatbot_instruction_prompts", split="train")

# Keep a copy on disk and export it to csv for processing
dataset.save_to_disk('ChatBotInsP')
dataset.to_csv('CIPtrain.csv')
```

##### Code to process dataset into Supervised finetuning format
``` python
import os

# Import pandas library
import pandas as pd

# Read the text dataset from csv file
text_data = pd.read_csv("CIPtrain.csv")

# Create empty lists for prompts and responses
prompts = []
responses = []

# Loop through the text data
for i in range(len(text_data)):
    # Get the prompt and response of the current row as strings
    prompt = str(text_data["prompt"][i])
    response = str(text_data["response"][i])

    # Add the prompt to the prompts list with <user> tag
    prompts.append("<user>: " + prompt)

    # Add the response to the responses list with <chatbot> tag
    responses.append("<chatbot>: " + response)

# Create a new dataframe with prompts and responses columns
new_data = pd.DataFrame({"prompt": prompts, "response": responses})

# Make sure the output folder exists, then write the new dataframe to a csv file
os.makedirs("MyData", exist_ok=True)
new_data.to_csv("MyData/chatbot_instruction_prompts_train.csv", index=False)
```
Each user prompt in the dataset is prefixed with the tag `<user>` and each corresponding response with the tag `<chatbot>`.
See the modified dataset at https://huggingface.co/datasets/ayoolaolafenwa/sft-data.

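The row-by-row loop above can also be written with vectorized pandas string operations. The following is a minimal sketch of that alternative, not the author's code; `to_sft_format` is a hypothetical helper name, and the column names `prompt` and `response` are taken from the processing script.

```python
import pandas as pd


def to_sft_format(frame: pd.DataFrame) -> pd.DataFrame:
    """Prefix each prompt with the <user> tag and each response with the <chatbot> tag."""
    return pd.DataFrame({
        # astype(str) mirrors the str(...) conversion in the loop version
        "prompt": "<user>: " + frame["prompt"].astype(str),
        "response": "<chatbot>: " + frame["response"].astype(str),
    })


if __name__ == "__main__":
    sample = pd.DataFrame({"prompt": ["How are you?"], "response": ["I am fine."]})
    print(to_sft_format(sample).to_dict("records"))
```

Applied to the full file, this would be `to_sft_format(pd.read_csv("CIPtrain.csv"))`, followed by the same `to_csv` call.
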
### Training

ChatLM was produced by supervised finetuning of the pretrained [Falcon 1-billion-parameter model](https://huggingface.co/tiiuae/falcon-rw-1b), which was itself trained on 350 billion tokens
of RefinedWeb. Finetuning ran on a single H100 GPU for 1 epoch and reached a perplexity of *1.738*. The full supervised finetuning
code is in the GitHub repository: https://github.com/ayoolaolafenwa/ChatLM/tree/main
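
For context on the perplexity figure: perplexity is conventionally the exponential of the mean per-token cross-entropy loss (in nats). The README does not state how the value was computed, so the sketch below assumes that standard definition; the helper names are hypothetical.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity as exp of the mean per-token cross-entropy (natural log base)."""
    return math.exp(mean_cross_entropy)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse mapping: the mean cross-entropy implied by a given perplexity."""
    return math.log(perplexity)


if __name__ == "__main__":
    # Under the assumed definition, this is the loss that corresponds to the reported 1.738
    implied_loss = loss_from_perplexity(1.738)
    print(f"implied mean loss: {implied_loss:.4f}")
    print(f"round trip perplexity: {perplexity_from_loss(implied_loss):.3f}")
```
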