---
license: apache-2.0
tags:
- merge
- mergekit
- lazymergekit
- openchat/openchat-3.5-0106
- teknium/OpenHermes-2.5-Mistral-7B
base_model:
- openchat/openchat-3.5-0106
- teknium/OpenHermes-2.5-Mistral-7B
model-index:
- name: chatty-djinn-14B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 70.39
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 86.45
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 64.4
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 67.57
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 83.11
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 60.58
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=mayacinka/chatty-djinn-14B
      name: Open LLM Leaderboard
---

# djinn

djinn is a merge of the following models using [LazyMergekit](https://colab.research.google.com/drive/1obulZ1ROXHjYLn6PPZJwRR6GzgQogxxb?usp=sharing):
* [openchat/openchat-3.5-0106](https://huggingface.co/openchat/openchat-3.5-0106)
* [teknium/OpenHermes-2.5-Mistral-7B](https://huggingface.co/teknium/OpenHermes-2.5-Mistral-7B)
* bardsai/jaskier-7b-dpo-v6.1
* senseable/WestLake-7B-v2
* NousResearch/Nous-Hermes-2-Mistral-7B-DPO
* paulml/OGNO-7B
* paulml/DPOB-INMTOB-7B
* mlabonne/AlphaMonarch-7B

# 🏆 Benchmarks
Results on the Nous benchmark suite. More [details here](https://gist.github.com/majacinka/3f2a797c8872ca9bfdaa2bbf3369edb5).

|                                Model                                |AGIEval|GPT4All|TruthfulQA|Bigbench|Average|
|---------------------------------------------------------------------|------:|------:|---------:|-------:|------:|
|[chatty-djinn-14B](https://huggingface.co/mayacinka/chatty-djinn-14B)|  38.43|  76.29|     68.02|    47.6|  57.59|

### AGIEval
|             Task             |Version| Metric |Value|   |Stderr|
|------------------------------|------:|--------|----:|---|-----:|
|agieval_aqua_rat              |      0|acc     |23.62|±  |  2.67|
|                              |       |acc_norm|21.65|±  |  2.59|
|agieval_logiqa_en             |      0|acc     |32.26|±  |  1.83|
|                              |       |acc_norm|33.79|±  |  1.86|
|agieval_lsat_ar               |      0|acc     |23.04|±  |  2.78|
|                              |       |acc_norm|23.04|±  |  2.78|
|agieval_lsat_lr               |      0|acc     |38.82|±  |  2.16|
|                              |       |acc_norm|39.22|±  |  2.16|
|agieval_lsat_rc               |      0|acc     |59.48|±  |  3.00|
|                              |       |acc_norm|54.65|±  |  3.04|
|agieval_sat_en                |      0|acc     |75.73|±  |  2.99|
|                              |       |acc_norm|74.27|±  |  3.05|
|agieval_sat_en_without_passage|      0|acc     |35.92|±  |  3.35|
|                              |       |acc_norm|34.47|±  |  3.32|
|agieval_sat_math              |      0|acc     |31.36|±  |  3.14|
|                              |       |acc_norm|26.36|±  |  2.98|

Average: 38.43%

### GPT4All
|    Task     |Version| Metric |Value|   |Stderr|
|-------------|------:|--------|----:|---|-----:|
|arc_challenge|      0|acc     |62.12|±  |  1.42|
|             |       |acc_norm|65.44|±  |  1.39|
|arc_easy     |      0|acc     |83.88|±  |  0.75|
|             |       |acc_norm|78.58|±  |  0.84|
|boolq        |      1|acc     |88.07|±  |  0.57|
|hellaswag    |      0|acc     |65.18|±  |  0.48|
|             |       |acc_norm|86.45|±  |  0.34|
|openbookqa   |      0|acc     |39.60|±  |  2.19|
|             |       |acc_norm|48.60|±  |  2.24|
|piqa         |      0|acc     |82.26|±  |  0.89|
|             |       |acc_norm|83.62|±  |  0.86|
|winogrande   |      0|acc     |83.27|±  |  1.05|

Average: 76.29%
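
The suite average above is consistent with taking, for each task, `acc_norm` where it is reported and `acc` otherwise, and then averaging across tasks. That convention is inferred from the numbers in the table rather than stated anywhere in this card, so the sketch below is only a sanity check under that assumption.

```python
# Per-task scores copied from the GPT4All table above.
# Assumption: the suite average uses acc_norm when present, otherwise acc.
gpt4all_scores = {
    "arc_challenge": {"acc": 62.12, "acc_norm": 65.44},
    "arc_easy": {"acc": 83.88, "acc_norm": 78.58},
    "boolq": {"acc": 88.07},
    "hellaswag": {"acc": 65.18, "acc_norm": 86.45},
    "openbookqa": {"acc": 39.60, "acc_norm": 48.60},
    "piqa": {"acc": 82.26, "acc_norm": 83.62},
    "winogrande": {"acc": 83.27},
}


def suite_average(scores):
    """Mean over tasks, preferring acc_norm and falling back to acc."""
    picked = [task.get("acc_norm", task["acc"]) for task in scores.values()]
    return sum(picked) / len(picked)


average = suite_average(gpt4all_scores)
print(f"{average:.2f}")  # prints 76.29
```

The same rule applied to the `acc_norm` rows of the AGIEval table gives its reported 38.43.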

### TruthfulQA
|    Task     |Version|Metric|Value|   |Stderr|
|-------------|------:|------|----:|---|-----:|
|truthfulqa_mc|      1|mc1   |50.55|±  |  1.75|
|             |       |mc2   |68.02|±  |  1.52|

Average: 68.02%

### Bigbench
|                      Task                      |Version|       Metric        |Value|   |Stderr|
|------------------------------------------------|------:|---------------------|----:|---|-----:|
|bigbench_causal_judgement                       |      0|multiple_choice_grade|57.89|±  |  3.59|
|bigbench_date_understanding                     |      0|multiple_choice_grade|64.50|±  |  2.49|
|bigbench_disambiguation_qa                      |      0|multiple_choice_grade|32.56|±  |  2.92|
|bigbench_geometric_shapes                       |      0|multiple_choice_grade|26.18|±  |  2.32|
|                                                |       |exact_str_match      | 1.11|±  |  0.55|
|bigbench_logical_deduction_five_objects         |      0|multiple_choice_grade|30.80|±  |  2.07|
|bigbench_logical_deduction_seven_objects        |      0|multiple_choice_grade|22.86|±  |  1.59|
|bigbench_logical_deduction_three_objects        |      0|multiple_choice_grade|57.67|±  |  2.86|
|bigbench_movie_recommendation                   |      0|multiple_choice_grade|62.00|±  |  2.17|
|bigbench_navigate                               |      0|multiple_choice_grade|56.20|±  |  1.57|
|bigbench_reasoning_about_colored_objects        |      0|multiple_choice_grade|65.65|±  |  1.06|
|bigbench_ruin_names                             |      0|multiple_choice_grade|64.73|±  |  2.26|
|bigbench_salient_translation_error_detection    |      0|multiple_choice_grade|17.33|±  |  1.20|
|bigbench_snarks                                 |      0|multiple_choice_grade|76.24|±  |  3.17|
|bigbench_sports_understanding                   |      0|multiple_choice_grade|75.15|±  |  1.38|
|bigbench_temporal_sequences                     |      0|multiple_choice_grade|48.90|±  |  1.58|
|bigbench_tracking_shuffled_objects_five_objects |      0|multiple_choice_grade|22.32|±  |  1.18|
|bigbench_tracking_shuffled_objects_seven_objects|      0|multiple_choice_grade|18.17|±  |  0.92|
|bigbench_tracking_shuffled_objects_three_objects|      0|multiple_choice_grade|57.67|±  |  2.86|

Average: 47.6%

Average score: 57.59%

## 🧩 Configuration
Inspired by [TheProfessor's config](https://huggingface.co/abacusai/TheProfessor-155b)

```yaml
merge_method: linear # use linear so we can include multiple models, albeit at a zero weight
parameters:
  weight: 1.0 # weight everything as 1 unless specified otherwise - linear with one model weighted at 1 is a no-op like passthrough
slices:
  - sources:
      - model: openchat/openchat-3.5-0106
        layer_range: [0, 1]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 1]
        parameters:
          weight: 0
  - sources:
      - model: bardsai/jaskier-7b-dpo-v6.1
        layer_range: [1, 10]
  - sources:
      - model: senseable/WestLake-7B-v2
        layer_range: [10, 20]
  - sources:
      - model: NousResearch/Nous-Hermes-2-Mistral-7B-DPO
        layer_range: [20, 30]
  - sources:
      - model: paulml/OGNO-7B
        layer_range: [15, 25]
  - sources:
      - model: paulml/DPOB-INMTOB-7B
        layer_range: [22, 32]
  - sources:
      - model: mlabonne/AlphaMonarch-7B
        layer_range: [5, 15]
  - sources:
      - model: openchat/openchat-3.5-0106
        layer_range: [31, 32]
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [31, 32]
        parameters:
          weight: 0
dtype: float16
tokenizer_source: model:openchat/openchat-3.5-0106
```
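
Because the slices are stacked end to end, the depth of the merged model is the sum of the slice lengths. A minimal sketch of that arithmetic follows. The ranges are copied from the config above, and it assumes `layer_range` is half-open (`[start, end)`), which matches how `[0, 1]` and `[31, 32]` each pick a single layer.

```python
# Layer ranges copied from the merge config above, in stacking order.
# Only the weight-1.0 model of each slice is listed; the weight-0 sources add no layers.
# Assumption: layer_range is half-open, so [start, end) selects end - start layers.
slices = [
    ("openchat/openchat-3.5-0106", 0, 1),
    ("bardsai/jaskier-7b-dpo-v6.1", 1, 10),
    ("senseable/WestLake-7B-v2", 10, 20),
    ("NousResearch/Nous-Hermes-2-Mistral-7B-DPO", 20, 30),
    ("paulml/OGNO-7B", 15, 25),
    ("paulml/DPOB-INMTOB-7B", 22, 32),
    ("mlabonne/AlphaMonarch-7B", 5, 15),
    ("openchat/openchat-3.5-0106", 31, 32),
]

layer_counts = [end - start for _, start, end in slices]
total_layers = sum(layer_counts)

print(layer_counts)  # [1, 9, 10, 10, 10, 10, 10, 1]
print(total_layers)  # 61
```

The ranges run up to layer 32, so each source contributes from a 32-layer stack. A merged depth of 61 layers is a little under twice that, which is consistent with the 14B in the model name.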

## An example

*User:* Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?
Think step by step and the result MUST BE A NUMBER

*djinn:*
In this family structure, we know there are 4 aka '1 + 3-brothers' (1 being initially for Sally) existing individuals who are
also the siblings of the 3-brother-group. As each of the 3-brothers has 2-sisters each, that sums up to 6 total sister- roles. Since
we already accounted for one of the sisters (assuming to be the oldest or in alphabetical order, as names were not given) to be part
of the 3-brother group, we subtract one from the 6-sisters total. So, considering the provided information, Sally would have 6 - 1 = 5
remaining younger or alphabetically-wise- placed, biological-sisters. However, note that this problem statement might not perfectly align
with real-life complexities, such as same-sex twins, adoption, etc., but within the given context, the solution is 5.

Remember, in more intricate or unclear family tree scenarios, it'd be best to directly ask for clarifications or consider additional data
points. In short, with the given constraints, Sally has 5 known, direct, biological, and likely-in-age-order, younger/alphabetically-placed,
sister-siblings.(To reiterate the core result in a concise manner: 5).

(Please excuse the verbosity, aiming for thoroughness and clarity.)

This response maintains the initial request's numerical focus while providing a detailed explanation to ensure understanding.

## 💻 Usage

```python
# In a notebook, install the dependencies first:
# !pip install -qU transformers accelerate

import torch
import transformers
from transformers import AutoTokenizer

# Repository id as used in the benchmark links of this card
model = "mayacinka/chatty-djinn-14B"
messages = [{"role": "user", "content": "What is a large language model?"}]

# Render the chat messages into one prompt string using the model's chat template
tokenizer = AutoTokenizer.from_pretrained(model)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Load the model in half precision and let accelerate place it on the available devices
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipeline(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```
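
By default the `text-generation` pipeline returns the prompt followed by the completion in `generated_text`. If only the reply is wanted, the prompt prefix can be removed afterwards. The helper below is a hypothetical illustration (`strip_prompt` is not part of transformers, and the example strings are stand-ins, not real model output). It works on plain strings, so it can be tried without loading the model.

```python
def strip_prompt(generated_text: str, prompt: str) -> str:
    """Return only the completion when the text starts with the prompt (hypothetical helper)."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    # Leave the text untouched if the prompt is not a prefix.
    return generated_text


# Stand-in strings for illustration; real values come from the pipeline call above.
example_prompt = "User: What is a large language model?\nAssistant:"
example_output = example_prompt + " A neural network trained on large amounts of text."
reply = strip_prompt(example_output, example_prompt)
print(reply)
```

Passing `return_full_text=False` to the pipeline call should achieve the same result without post-processing.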

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_mayacinka__chatty-djinn-14B)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |72.08|
|AI2 Reasoning Challenge (25-Shot)|70.39|
|HellaSwag (10-Shot)              |86.45|
|MMLU (5-Shot)                    |64.40|
|TruthfulQA (0-shot)              |67.57|
|Winogrande (5-shot)              |83.11|
|GSM8k (5-shot)                   |60.58|