Initialize project; model provided by the ModelHub XC community

Model: NeuralNovel/Confinus-2x7B
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-05-27 06:44:17 +08:00
commit d956debae8
22 changed files with 91497 additions and 0 deletions

35
.gitattributes vendored Normal file

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

171
README.md Normal file

@@ -0,0 +1,171 @@
---
language:
- en
license: apache-2.0
tags:
- moe
- merge
model-index:
- name: Confinus-2x7B
results:
- task:
type: text-generation
name: Text Generation
dataset:
name: AI2 Reasoning Challenge (25-Shot)
type: ai2_arc
config: ARC-Challenge
split: test
args:
num_few_shot: 25
metrics:
- type: acc_norm
value: 73.89
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: HellaSwag (10-Shot)
type: hellaswag
split: validation
args:
num_few_shot: 10
metrics:
- type: acc_norm
value: 88.82
name: normalized accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: MMLU (5-Shot)
type: cais/mmlu
config: all
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 65.12
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: TruthfulQA (0-shot)
type: truthful_qa
config: multiple_choice
split: validation
args:
num_few_shot: 0
metrics:
- type: mc2
value: 71.88
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: Winogrande (5-shot)
type: winogrande
config: winogrande_xl
split: validation
args:
num_few_shot: 5
metrics:
- type: acc
value: 84.77
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
- task:
type: text-generation
name: Text Generation
dataset:
name: GSM8k (5-shot)
type: gsm8k
config: main
split: test
args:
num_few_shot: 5
metrics:
- type: acc
value: 68.84
name: accuracy
source:
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=NeuralNovel/Confinus-2x7B
name: Open LLM Leaderboard
---
![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/645cfe4603fc86c46b3e46d1/iSQEdQ-cr6brinPNhrtn3.jpeg)
# NeuralNovel/Confinus-2x7B
In the boundless sands ..
A model to test how MoE will route without square expansion.
## "[What is a Mixture of Experts (MoE)?](https://huggingface.co/blog/moe)"
<a href='https://ko-fi.com/S6S2UH2TC' target='_blank'><img height='38' style='border:0px;height:36px;' src='https://storage.ko-fi.com/cdn/kofi1.png?v=3' border='0' alt='Buy Me a Coffee at ko-fi.com' /></a>
<a href='https://discord.gg/KFS229xD' target='_blank'><img width='140' height='500' style='border:0px;height:36px;' src='https://i.ibb.co/tqwznYM/Discord-button.png' border='0' alt='Join Our Discord!' /></a>
### (from the MistralAI papers...click the quoted question above to navigate to it directly.)
The scale of a model is one of the most important axes for better model quality. Given a fixed computing budget, training a larger model for fewer steps is better than training a smaller model for more steps.
Mixture of Experts enables models to be pretrained with far less compute, which means you can dramatically scale up the model or dataset size with the same compute budget as a dense model. In particular, a MoE model should achieve the same quality as its dense counterpart much faster during pretraining.
So, what exactly is a MoE? In the context of transformer models, a MoE consists of two main elements:
- Sparse MoE layers are used instead of dense feed-forward network (FFN) layers. MoE layers have a certain number of “experts” (e.g. 2 in this "frankenMoE", per `num_local_experts` in `config.json`), where each expert is a neural network. In practice, the experts are FFNs, but they can also be more complex networks or even a MoE itself, leading to hierarchical MoEs!
- A gate network or router, which determines which tokens are sent to which expert. For example, in the image below, the token “More” is sent to the second expert, and the token “Parameters” is sent to the first network. As we'll explore later, we can send a token to more than one expert. How to route a token to an expert is one of the big decisions when working with MoEs: the router is composed of learned parameters and is pretrained at the same time as the rest of the network.
At every layer, for every token, a router network chooses two of these groups (the “experts”) to process the token and combine their output additively.
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/up_I0R2TQGjqTShZp_1Sz.png)
Switch Layer
MoE layer from the [Switch Transformers paper](https://arxiv.org/abs/2101.03961)
So, to recap, in MoEs we replace every FFN layer of the transformer model with an MoE layer, which is composed of a gate network and a certain number of experts.
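The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the `transformers` implementation: the toy experts, the logits, and the function names `route_top_k` and `moe_layer` are all hypothetical, and the weights of the selected experts are assumed to be renormalized so that they sum to 1.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize over the selected experts
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Combine the selected experts' outputs additively, weighted by the router."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out

# Hypothetical toy "experts" standing in for real FFNs.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
    lambda x: [0.0 for _ in x],
]
token = [1.0, 2.0]
logits = [2.0, 1.0, -1.0, 0.0]  # made-up router scores for one token

selected = route_top_k(logits)
output = moe_layer(token, logits, experts)
print(selected)
print(output)
```

Note that `config.json` in this repo sets both `num_local_experts` and `num_experts_per_tok` to 2, so in this model every token reaches both experts and the router only decides how their outputs are weighted.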
Although MoEs provide benefits like efficient pretraining and faster inference compared to dense models, they also come with challenges:
- Training: MoEs enable significantly more compute-efficient pretraining, but they've historically struggled to generalize during fine-tuning, leading to overfitting.
- Inference: Although a MoE might have many parameters, only some of them are used during inference. This leads to much faster inference compared to a dense model with the same number of parameters. However, all parameters need to be loaded in RAM, so memory requirements are high. For example, [given a MoE like Mixtral 8x7B](https://huggingface.co/blog/moe), we'll need to have enough VRAM to hold a dense 47B parameter model. Why 47B parameters and not 8 x 7B = 56B? That's because in MoE models, only the FFN layers are treated as individual experts, and the rest of the model parameters are shared. At the same time, assuming just two experts are being used per token, the inference speed (FLOPs) is like using a 12B model (as opposed to a 14B model), because it computes 2x7B matrix multiplications, but with some layers shared (more on this soon).
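The 47B and 12B figures can be checked with some rough arithmetic. The sketch below uses the dimensions from this repo's `config.json` and assumes the usual Mixtral layer layout (bias-free attention projections, three weight matrices per expert FFN, one linear router per layer, untied input and output embeddings). The function name `mixtral_param_count` is made up for this example.

```python
def mixtral_param_count(num_experts, experts_used=None, hidden=4096,
                        intermediate=14336, layers=32, heads=32,
                        kv_heads=8, vocab=32000):
    """Approximate parameter count for a Mixtral-style model.

    experts_used=None counts every expert (what must sit in memory);
    experts_used=k counts only k experts per layer (what one token touches).
    """
    head_dim = hidden // heads
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are smaller (GQA).
    attention = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim
    expert_ffn = 3 * hidden * intermediate   # gate, up and down projections
    router = hidden * num_experts            # one linear gate per layer
    norms = 2 * hidden                       # two RMSNorm weights per layer
    used = num_experts if experts_used is None else experts_used
    per_layer = attention + used * expert_ffn + router + norms
    embeddings = 2 * vocab * hidden          # embed_tokens + lm_head (untied)
    return layers * per_layer + embeddings + hidden  # + final norm

stored_8x7b = mixtral_param_count(8)                  # about 46.7B
active_8x7b = mixtral_param_count(8, experts_used=2)  # about 12.9B
stored_2x7b = mixtral_param_count(2)                  # about 12.9B

for name, n in [("8 experts, stored", stored_8x7b),
                ("8 experts, active per token", active_8x7b),
                ("2 experts (this repo), stored", stored_2x7b)]:
    print(f"{name}: {n / 1e9:.1f}B")
```

Under these assumptions, the 2-expert layout in this repo comes to roughly 12.9B parameters, which is consistent with the roughly 25.8 GB of bfloat16 weight shards listed in this commit.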
If all our tokens are sent to just a few popular experts, that will make training inefficient. In normal MoE training, the gating network converges to mostly activate the same few experts. This self-reinforces as favored experts are trained quicker and hence selected more. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. The following sections will also explore the concept of expert capacity, which introduces a threshold of how many tokens can be processed by an expert. In transformers, the auxiliary loss is exposed via the `aux_loss` parameter.
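To make the auxiliary loss concrete, here is a small sketch in the style of the Switch Transformers load-balancing term, `N * sum(f_i * P_i)`, where `f_i` is the fraction of tokens dispatched to expert `i` and `P_i` is the mean router probability for that expert. It assumes top-1 dispatch for simplicity and is not the exact computation used by `transformers` for Mixtral; the function name and the toy probabilities are illustrative only.

```python
def load_balancing_loss(router_probs):
    """Switch-style auxiliary loss for a batch of per-token router probabilities."""
    num_tokens = len(router_probs)
    num_experts = len(router_probs[0])
    dispatched = [0] * num_experts    # tokens whose top choice is expert i
    prob_mass = [0.0] * num_experts   # summed router probability for expert i
    for probs in router_probs:
        chosen = max(range(num_experts), key=lambda i: probs[i])
        dispatched[chosen] += 1
        for i, p in enumerate(probs):
            prob_mass[i] += p
    f = [c / num_tokens for c in dispatched]
    p_mean = [m / num_tokens for m in prob_mass]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))

# Tokens split evenly between two experts: loss is about 1.0 (the minimum).
balanced = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
# Every token prefers expert 0: loss rises to about 1.8.
collapsed = [[0.9, 0.1]] * 4

print(load_balancing_loss(balanced))
print(load_balancing_loss(collapsed))
```

Minimizing this term pushes the router toward an even split, which is why it counteracts the self-reinforcing collapse described above.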
![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)
## "Wait...but you called this a frankenMoE?"
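The difference between a MoE and a "frankenMoE" is that the router in a model like the one in this repo is not trained jointly with the experts. The experts are existing fine-tuned models, and the router is initialized afterwards from the positive and negative prompts listed in `mergekit_moe_config.yml`.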
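As a rough illustration of a router that is derived from prompts rather than trained, the sketch below builds one gate vector per expert from hidden-state vectors of its prompts. It is a toy under stated assumptions and not mergekit's actual code: the two-dimensional "hidden states" are invented, and taking the mean of the positive vectors minus the mean of the negative ones is an assumed simplification of what a prompt-based gate mode does.

```python
def mean_vector(vectors):
    # Element-wise mean of equally sized vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def init_gate_row(positive_hidden, negative_hidden):
    """Gate vector for one expert: mean(positive) - mean(negative). Assumed rule."""
    pos = mean_vector(positive_hidden)
    neg = mean_vector(negative_hidden)
    return [p - q for p, q in zip(pos, neg)]

def router_logits(gate_rows, hidden_state):
    """One logit per expert: dot product of its gate vector with the token state."""
    return [sum(w * h for w, h in zip(row, hidden_state)) for row in gate_rows]

# Invented hidden states: axis 0 ~ "reasoning" prompts, axis 1 ~ "math" prompts.
gates = [
    init_gate_row([[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0]]),  # reasoning expert
    init_gate_row([[0.0, 1.0]], [[1.0, 0.0]]),              # math expert
]

reasoning_token = [1.0, 0.0]
math_token = [0.0, 1.0]
print(router_logits(gates, reasoning_token))  # higher score for expert 0
print(router_logits(gates, math_token))       # higher score for expert 1
```

The point of the sketch is only that no gradient step is involved: the gate weights are computed once from prompt representations and then frozen into the merged model.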
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_NeuralNovel__Confinus-2x7B)
| Metric |Value|
|---------------------------------|----:|
|Avg. |75.55|
|AI2 Reasoning Challenge (25-Shot)|73.89|
|HellaSwag (10-Shot) |88.82|
|MMLU (5-Shot) |65.12|
|TruthfulQA (0-shot) |71.88|
|Winogrande (5-shot) |84.77|
|GSM8k (5-shot) |68.84|
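
The `Avg.` row is consistent with an unweighted mean of the six benchmark scores, which can be checked with a short sketch (the dictionary below simply copies the values from the table):

```python
scores = {
    "AI2 Reasoning Challenge (25-Shot)": 73.89,
    "HellaSwag (10-Shot)": 88.82,
    "MMLU (5-Shot)": 65.12,
    "TruthfulQA (0-shot)": 71.88,
    "Winogrande (5-shot)": 84.77,
    "GSM8k (5-shot)": 68.84,
}

# Unweighted mean over the six benchmarks, rounded to two decimals.
average = round(sum(scores.values()) / len(scores), 2)
print(average)
```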

30
config.json Normal file

@@ -0,0 +1,30 @@
{
"_name_or_path": "AiMavenAi/AiMaven-Prometheus",
"architectures": [
"MixtralForCausalLM"
],
"attention_dropout": 0.0,
"bos_token_id": 1,
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 4096,
"initializer_range": 0.02,
"intermediate_size": 14336,
"max_position_embeddings": 32768,
"model_type": "mixtral",
"num_attention_heads": 32,
"num_experts_per_tok": 2,
"num_hidden_layers": 32,
"num_key_value_heads": 8,
"num_local_experts": 2,
"output_router_logits": false,
"rms_norm_eps": 1e-05,
"rope_theta": 10000.0,
"router_aux_loss_coef": 0.001,
"sliding_window": null,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.37.2",
"use_cache": false,
"vocab_size": 32000
}

23
mergekit_moe_config.yml Normal file

@@ -0,0 +1,23 @@
base_model: AiMavenAi/AiMaven-Prometheus
gate_mode: hidden
dtype: bfloat16
experts:
- source_model: AiMavenAi/AiMaven-Prometheus
positive_prompts:
- "Answer this question from the ARC (Argument Reasoning Comprehension)."
- "Use common sense and logical reasoning skills."
negative_prompts:
- "nonsense"
- "irrational"
- "math"
- "code"
- source_model: alnrg2arg/test3_sft_16bit
positive_prompts:
- "Calculate the answer to this math problem"
- "My mathematical capabilities are strong, allowing me to handle complex mathematical queries"
- "solve for"
negative_prompts:
- "incorrect"
- "inaccurate"
- "creativity"

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:418ccc2c4f64e06c70ef74630a0f626925731e7918bce786471d4675c19f2ed3
size 1933849912

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:344ea2b6485f9b0d5367e482d581c67a13d1d60eae7ef081e32fc788308d0a02
size 1996490960

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:74dbc912e787207bc321c1fb03ecbb3ddc804db681a6b6c201ac3289d64fc537
size 1996490968

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e0601419decbd5b485d261022d4eee9a3d8d19d96e059a76f62d4732c2ed2363
size 1996490968

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b057680cb64251e9dfed6152bc898512068e2908436d844c9c616fdfe2752509
size 1996490952

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3c6159d51a2d3cabdc5e1ca4d43cb3d22bd80523e6a06509140768ad13116639
size 1996490960

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6a099b7bc42a6af3d7d7821e9d1f0f656d824a9e1aed160e0d247cbbc088789f
size 1996490968

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b10f8305774065e38f89f60ffc3465f7a43c1994291607f61a465820579f431b
size 1996490960

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:613c75324ac93f4ba152ae3080564032034dfee0a1d5d6568d5492277d3a0c8e
size 1996490952

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0a769f958f9fd3023cfa5bc927e6a74e865d0a41fb783ae725756d12431bdb4c
size 1996490968

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:54b6186950686015f5b58e09857388ce5e0b70e9bf418abceafe7a58ff24c98d
size 1996490968

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:40b57de053da4dbc4d6e44aaaa6d0be0220e35a00d14ff7e6461211877dd630f
size 1979981568

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d6578ffc7d64b566ece258a6d2320f933a0c0173cb50d383c4b2cfc609f2309a
size 1879588416

File diff suppressed because one or more lines are too long

24
special_tokens_map.json Normal file

@@ -0,0 +1,24 @@
{
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": "<s>",
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

91129
tokenizer.json Normal file

File diff suppressed because it is too large Load Diff

BIN
tokenizer.model (Stored with Git LFS) Normal file

Binary file not shown.

42
tokenizer_config.json Normal file

@@ -0,0 +1,42 @@
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [],
"bos_token": "<s>",
"clean_up_tokenization_spaces": false,
"eos_token": "</s>",
"legacy": true,
"model_max_length": 1000000000000000019884624838656,
"pad_token": "<s>",
"sp_model_kwargs": {},
"spaces_between_special_tokens": false,
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}