Llama-3.1-8B-Fusion-7030 is a mixed model that combines the strengths of two powerful Llama-based models: arcee-ai/Llama-3.1-SuperNova-Lite and mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated. The weights are blended in a 7:3 ratio, with 70% of the weights from SuperNova-Lite and 30% from the abliterated Meta-Llama-3.1-8B-Instruct model.
Although it is a simple mix, the model is usable and has not produced gibberish in testing.
This is an experiment. I tested the 9:1, 8:2, 7:3, 6:4, and 5:5 ratios separately to see how much each ratio affects the model.
Evaluation results for all of these ratios are reported in the Evaluations section below.
SuperNova-Lite Contributions (70%): Llama-3.1-SuperNova-Lite is an 8B parameter model developed by Arcee.ai, based on the Llama-3.1-8B-Instruct architecture.
Meta-Llama-3.1-8B-Instruct-abliterated Contributions (30%): This is an uncensored version of Llama 3.1 8B Instruct created with abliteration.
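The 7:3 blend described above can be illustrated with a minimal sketch. It assumes the mix is a plain element-wise linear interpolation of matching parameters, which this card does not state explicitly. The helper name `blend_state_dicts` and the toy values are hypothetical; the function only relies on `*` and `+`, so it should also accept tensors from two `state_dict()` calls.

```python
from typing import Any, Dict


def blend_state_dicts(primary: Dict[str, Any], secondary: Dict[str, Any],
                      primary_weight: float = 0.7) -> Dict[str, Any]:
    """Element-wise linear interpolation of two parameter dictionaries.

    Assumption: both models share the same parameter names and shapes.
    This is an illustrative sketch, not the script used to build the model.
    """
    if not 0.0 <= primary_weight <= 1.0:
        raise ValueError("primary_weight must be in [0, 1]")
    if primary.keys() != secondary.keys():
        raise ValueError("both models must expose the same parameter names")
    secondary_weight = 1.0 - primary_weight
    return {
        name: primary_weight * primary[name] + secondary_weight * secondary[name]
        for name in primary
    }


# Toy example: scalars stand in for weight tensors.
toy_supernova = {"layer.weight": 1.0, "layer.bias": 2.0}
toy_abliterated = {"layer.weight": 0.0, "layer.bias": 4.0}
blended = blend_state_dicts(toy_supernova, toy_abliterated, primary_weight=0.7)
print(blended)
```

Changing `primary_weight` to 0.9, 0.8, 0.6, or 0.5 would correspond to the other ratios tested in this experiment, under the same assumption.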
Usage
You can use this mixed model in your applications by loading it with Hugging Face's transformers library:
```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

mixed_model_name = "huihui-ai/Llama-3.1-8B-Fusion-7030"

# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model and tokenizer
mixed_model = AutoModelForCausalLM.from_pretrained(
    mixed_model_name, device_map=device, torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(mixed_model_name)

# Ensure the tokenizer has pad_token_id set
tokenizer.pad_token_id = tokenizer.eos_token_id

# Input loop
print("Start inputting text for inference (type 'exit' to quit)")
while True:
    prompt = input("Enter your prompt: ")
    if prompt.lower() == "exit":
        print("Exiting inference loop.")
        break

    # Inference phase: generate text using the mixed model
    chat = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

    # Prepare input data
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(device)

    # Use TextStreamer for streaming output
    streamer = TextStreamer(tokenizer, skip_special_tokens=True)

    # Record the start time
    start_time = time.time()

    # Generate text and stream the output as tokens are produced
    outputs = mixed_model.generate(
        input_ids,
        max_new_tokens=8192,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
        streamer=streamer,  # Enable streaming output
    )

    # Record the end time
    end_time = time.time()

    # Calculate the number of generated tokens (output length minus prompt length)
    generated_tokens = outputs[0][input_ids.shape[-1]:].shape[0]

    # Calculate the total time taken
    total_time = end_time - start_time

    # Calculate tokens generated per second
    tokens_per_second = generated_tokens / total_time

    print(
        f"\nGenerated {generated_tokens} tokens in total, took {total_time:.2f} seconds, "
        f"generating {tokens_per_second:.2f} tokens per second."
    )
```
Evaluations
The scores below have been re-evaluated; each value is the average for that benchmark.
| Benchmark | SuperNova-Lite | Meta-Llama-3.1-8B-Instruct-abliterated | Llama-3.1-8B-Fusion-9010 | Llama-3.1-8B-Fusion-8020 | Llama-3.1-8B-Fusion-7030 | Llama-3.1-8B-Fusion-6040 | Llama-3.1-8B-Fusion-5050 |
|---|---|---|---|---|---|---|---|
| IF_Eval | 82.09 | 76.29 | 82.44 | 82.93 | 83.10 | 82.94 | 82.03 |
| MMLU Pro | 35.87 | 33.1 | 35.65 | 35.32 | 34.91 | 34.5 | 33.96 |
| TruthfulQA | 64.35 | 53.25 | 62.67 | 61.04 | 59.09 | 57.8 | 56.75 |
| BBH | 49.48 | 44.87 | 48.86 | 48.47 | 48.30 | 48.19 | 47.93 |
| GPQA | 31.98 | 29.50 | 32.25 | 32.38 | 32.61 | 31.14 | 30.6 |
The script used for evaluation can be found in this repository at /eval.sh.
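To compare the mix ratios, the Fusion scores reported above can be loaded into a small dictionary and the best-scoring ratio picked per benchmark. This is only a convenience sketch for reading the results: the variable names are hypothetical, and the numbers are copied from the evaluation results in this card.

```python
# Scores of the five Fusion mixes, keyed by benchmark and then by ratio suffix.
fusion_scores = {
    "IF_Eval":    {"9010": 82.44, "8020": 82.93, "7030": 83.10, "6040": 82.94, "5050": 82.03},
    "MMLU Pro":   {"9010": 35.65, "8020": 35.32, "7030": 34.91, "6040": 34.5,  "5050": 33.96},
    "TruthfulQA": {"9010": 62.67, "8020": 61.04, "7030": 59.09, "6040": 57.8,  "5050": 56.75},
    "BBH":        {"9010": 48.86, "8020": 48.47, "7030": 48.30, "6040": 48.19, "5050": 47.93},
    "GPQA":       {"9010": 32.25, "8020": 32.38, "7030": 32.61, "6040": 31.14, "5050": 30.6},
}

# For each benchmark, find the ratio with the highest score.
best_ratio = {
    benchmark: max(by_ratio, key=by_ratio.get)
    for benchmark, by_ratio in fusion_scores.items()
}

for benchmark, ratio in best_ratio.items():
    print(f"{benchmark}: Llama-3.1-8B-Fusion-{ratio} ({fusion_scores[benchmark][ratio]})")
```

On these numbers, the 7030 mix scores highest among the Fusion models on IF_Eval and GPQA, while the 9010 mix scores highest on MMLU Pro, TruthfulQA, and BBH.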