Initialize project; model provided by the ModelHub XC community
Model: OpenBMB/MiniCPM-2B-128k Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
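The `.gitattributes` rules above route matching files through Git LFS. As a rough illustration only, the sketch below checks a few file names against a subset of those patterns using Python's `fnmatch`. This is an approximation: Git's own pattern matching has different rules (for example for `saved_model/**/*`), and the helper name `is_lfs_tracked` is hypothetical.

```python
from fnmatch import fnmatch

# A few of the basename patterns listed in the .gitattributes above.
LFS_PATTERNS = ["*.bin", "*.model", "*.safetensors", "*.pt", "*tfevents*"]


def is_lfs_tracked(filename: str) -> bool:
    """Approximate check: does the file's basename match any LFS pattern?"""
    basename = filename.rsplit("/", 1)[-1]
    return any(fnmatch(basename, pattern) for pattern in LFS_PATTERNS)


for name in ["pytorch_model.bin", "tokenizer.model", "config.json"]:
    print(name, is_lfs_tracked(name))
```

Under these assumptions, the two binary files in this commit (`pytorch_model.bin`, `tokenizer.model`) match and are stored as LFS pointers, while `config.json` stays a regular text file.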
93
README.md
Normal file
@@ -0,0 +1,93 @@
---
datasets:
- stingning/ultrachat
language:
- zh
- en
library_name: transformers
pipeline_tag: text-generation
tags:
- MiniCPM
- ModelBest
- THUNLP
- conversational
- custom_code
---

# MiniCPM-2B-128k

<!-- Provide a quick summary of what the model is/does. -->

[OpenBMB Technical Blog Series](https://openbmb.vercel.app/)

MiniCPM is a series of open-source end-side (on-device) LLMs developed by ModelBest Inc. and TsinghuaNLP; the main language model, [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16), has only 2.4B parameters excluding embeddings.
MiniCPM-2B-128k is a long-context extension trial of [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16).
To the best of our knowledge, MiniCPM-2B-128k is the first long-context (>=128k) SLM smaller than 3B.
Compared with the previously released [MiniCPM-2B](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16), the improvements are:

- Supports a 128k context and achieves the best score among models under 7B on the comprehensive long-text benchmark [InfiniteBench](https://github.com/OpenBMB/InfiniteBench), but performance drops for contexts within 4k.
- To make the model easier for community developers to use, the instruction template was changed during alignment from `<用户>{}<AI>` to the ChatML format (`<|im_start|>user\n{}<|im_end|>\n<|im_start|>assistant\n`), which also helps users deploy and use the model with vLLM's OpenAI-compatible server mode.
- Because of the requirements of the parallelism mechanism, `tie_embedding` was removed and the vocabulary was expanded to 122760.

For more details, see the [GitHub repo](https://github.com/OpenBMB/MiniCPM) and the blog post ([English](https://openbmb.vercel.app/minicpm-2b-128k-en), [Chinese](https://openbmb.vercel.app/minicpm-2b-128k)).

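To make the ChatML template concrete, here is a minimal pure-Python sketch that renders a conversation the same way the `chat_template` in this repository's `tokenizer_config.json` does. The function name `build_chatml_prompt` is hypothetical, and the sketch leaves BOS handling to the tokenizer; in practice you would normally let the tokenizer apply its own template.

```python
def build_chatml_prompt(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a ChatML string."""
    prompt = ""
    for message in messages:
        # Each turn is wrapped as <|im_start|>{role}\n{content}<|im_end|>\n
        prompt += "<|im_start|>" + message["role"] + "\n" + message["content"] + "<|im_end|>\n"
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        prompt += "<|im_start|>assistant\n"
    return prompt


prompt = build_chatml_prompt([{"role": "user", "content": "Hello"}])
print(repr(prompt))
```

A single user turn therefore becomes `<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n`, which is the shape an OpenAI-compatible server would also produce from a `messages` list.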
## Evaluation Results

| Model                               | avg   | avg w/o code&math | passkey | number_string | kv_retrieval | longbook_choice_eng | longbook_qa_chn | longbook_qa_eng | longbook_sum_eng | longdialogue_qa_eng | math_calc | math_find | code_debug | code_run |
|-------------------------------------|-------|-------------------|---------|---------------|--------------|---------------------|-----------------|-----------------|------------------|---------------------|-----------|-----------|------------|----------|
| LWM-Text-128k                       | 24.45 | 33.62             | 100     | 97.8          | 0.6          | 28.82               | 15.93           | 14.31           | 9.99             | 1.5                 | 0         | 3.43      | 20.05      | 1        |
| Yarn-Mistral-7b-128k                | 19.84 | 27.36             | 92.71   |               | 0            | 27.95               | 15.49           | 9.55            | 9.06             | 7.5                 | 0         | 17.14     | 0.76       | 1.25     |
| Mistral-7B-Instruct-v0.2(ABF 1000w) | 27.75 | 36.9              | 100     | 78.98         | 3.6          | 37.12               | 11.74           | 17.37           | 21.12            | 9.5                 | 0         | 29.43     | 17.51      | 0        |
| Yi-6B-200k                          | 22.15 | 32.54             | 100     | 94.92         | 0            | 36.68               | 15.07           | 9.2             | 0.92             | 3.5                 | 0         | 4.29      | 0.51       | 0.75     |
| chatglm3-6b-128k                    | 25.58 | 36.57             | 89.93   | 99.66         | 5.2          | 46.29               | 10.7            | 8.38            | 25.91            | 6.5                 | 0         | 8         | 5.33       | 1        |
| MiniCPM-2.4B-128k                   | 27.32 | 37.68             | 98.31   | 99.83         | 9            | 29.69               | 23.06           | 16.33           | 15.73            | 9.5                 | 0         | 4.29      | 22.08      | 0        |

Notice: We found that generation with Hugging Face Transformers is slightly lower in quality and significantly slower than with vLLM, so we recommend benchmarking with vLLM.

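The two average columns in the table above appear to be unweighted means over the per-task scores: `avg` over all twelve tasks, and `avg w/o code&math` over the eight tasks that are neither math nor code. The README does not state this, so treat it as an assumption; the sketch below checks it against the MiniCPM-2.4B-128k row.

```python
# Per-task scores copied from the MiniCPM-2.4B-128k row.
scores = {
    "passkey": 98.31, "number_string": 99.83, "kv_retrieval": 9,
    "longbook_choice_eng": 29.69, "longbook_qa_chn": 23.06, "longbook_qa_eng": 16.33,
    "longbook_sum_eng": 15.73, "longdialogue_qa_eng": 9.5,
    "math_calc": 0, "math_find": 4.29, "code_debug": 22.08, "code_run": 0,
}
CODE_AND_MATH = {"math_calc", "math_find", "code_debug", "code_run"}


def mean(values):
    """Plain arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg = mean(scores.values())
avg_wo_code_math = mean(v for k, v in scores.items() if k not in CODE_AND_MATH)
print(round(avg, 2), round(avg_wo_code_math, 2))
```

Both rounded values agree with the row's reported 27.32 and 37.68, which supports the unweighted-mean reading for this row.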
## Limitations

- Because of its limited size, the model may hallucinate. DPO models tend to generate longer responses, which makes hallucinations more likely. We will continue to iterate on and improve the MiniCPM models.
- To keep the model general-purpose for academic research, we did not conduct any identity training. Because part of the training data comes from the open-source ShareGPT corpus, the model may output identity information similar to that of the GPT series models.
- Because of its limited size, the model's output is strongly influenced by the prompt, so repeated attempts may produce inconsistent results.
- Because of its limited capacity, the model's knowledge recall is not accurate. In the future, we plan to combine it with RAG to improve this.

## Usage

<!-- Provide a longer summary of what this model is. -->


- Install `transformers>=4.36.0` and `accelerate`, then run the following code.
- Warning: specify the model's data type explicitly in `from_pretrained`; otherwise large numerical errors will result.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

torch.manual_seed(0)  # fix the seed so sampling is repeatable

path = 'openbmb/MiniCPM-2B-128k'
tokenizer = AutoTokenizer.from_pretrained(path)
# Pass torch_dtype explicitly; trust_remote_code loads the repo's modeling_minicpm.py
model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map='cuda', trust_remote_code=True)

# Prompt: "Which is the highest mountain in Shandong Province? Is it taller or shorter than Huangshan, and by how much?"
responds, history = model.chat(tokenizer, "山东省最高的山是哪座山, 它比黄山高还是矮?差距多少?", temperature=0.8, top_p=0.8)
print(responds)
```
9
added_tokens.json
Normal file
@@ -0,0 +1,9 @@
{
  "<|im_end|>": 122753,
  "<|im_start|>": 122757,
  "<|tool_call|>": 122756,
  "▁<EOT>": 122758,
  "▁<MID>": 122755,
  "▁<PRE>": 122754,
  "▁<SUF>": 122759
}
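A quick consistency check one might run on `added_tokens.json`: the seven added token IDs should form a contiguous block that ends at `vocab_size - 1`. This is a minimal sketch; `VOCAB_SIZE` is copied from the `vocab_size` field of this commit's `config.json`, and the variable names are illustrative.

```python
# Contents of added_tokens.json from this commit.
added_tokens = {
    "<|im_end|>": 122753,
    "<|im_start|>": 122757,
    "<|tool_call|>": 122756,
    "▁<EOT>": 122758,
    "▁<MID>": 122755,
    "▁<PRE>": 122754,
    "▁<SUF>": 122759,
}
VOCAB_SIZE = 122760  # "vocab_size" in config.json

ids = sorted(added_tokens.values())
is_contiguous = ids == list(range(ids[0], ids[-1] + 1))  # no gaps or duplicates
fills_vocab = ids[-1] == VOCAB_SIZE - 1                  # last ID is the final embedding row
print(is_contiguous, fills_vocab)
```

If either check failed, the embedding matrix and the tokenizer would disagree about which rows the special tokens occupy.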
35
config.json
Normal file
@@ -0,0 +1,35 @@
{
  "_name_or_path": "openbmb/CPM-2B",
  "architectures": [
    "MiniCPMForCausalLM"
  ],
  "auto_map": {
    "AutoConfig": "configuration_minicpm.MiniCPMConfig",
    "AutoModel": "modeling_minicpm.MiniCPMModel",
    "AutoModelForCausalLM": "modeling_minicpm.MiniCPMForCausalLM",
    "AutoModelForSeq2SeqLM": "modeling_minicpm.MiniCPMForCausalLM",
    "AutoModelForSequenceClassification": "modeling_minicpm.MiniCPMForSequenceClassification"
  },
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 2304,
  "initializer_range": 0.1,
  "intermediate_size": 5760,
  "max_position_embeddings": 65536,
  "max_length": 131072,
  "num_attention_heads": 36,
  "num_hidden_layers": 40,
  "num_key_value_heads": 36,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {"type": "dynamic", "factor": 4.0},
  "torch_dtype": "bfloat16",
  "transformers_version": "4.36.0",
  "use_cache": true,
  "vocab_size": 122760,
  "scale_emb": 12,
  "dim_model_base": 256,
  "scale_depth": 1.4,
  "tie_word_embeddings": false,
  "rope_theta": 1000000.0
}
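The `rope_scaling` entry above selects "dynamic" scaling with factor 4.0 on top of `rope_theta = 1000000.0`. The diff of `modeling_minicpm.py` is suppressed in this commit view, so the following is only a sketch that assumes the dynamic NTK rule used by Llama-style rotary embeddings in Hugging Face Transformers: beyond `max_position_embeddings`, the RoPE base is enlarged as a function of the current sequence length. The function name and defaults are illustrative; `head_dim = 64` is derived as `hidden_size / num_attention_heads = 2304 / 36`.

```python
def dynamic_ntk_base(seq_len, base=1_000_000.0, factor=4.0, max_pos=65536, head_dim=64):
    """Effective RoPE base under (assumed) dynamic NTK scaling."""
    if seq_len <= max_pos:
        # Within the configured window the base is left untouched.
        return base
    scale = (factor * seq_len / max_pos) - (factor - 1)
    return base * scale ** (head_dim / (head_dim - 2))


for n in (4096, 65536, 131072):
    print(n, dynamic_ntk_base(n))
```

Under this assumption the base stays at 1e6 up to 65536 tokens and grows to roughly 5.3e6 at 131072 tokens, which stretches the rotary wavelengths to cover the longer context.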
1
configuration.json
Normal file
@@ -0,0 +1 @@
{"framework":"Pytorch","task":"text-generation"}
206
configuration_minicpm.py
Normal file
@@ -0,0 +1,206 @@
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" MiniCPM model configuration"""

from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)

MINICPM_PRETRAINED_CONFIG_ARCHIVE_MAP = {}


class MiniCPMConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`MiniCPMModel`]. It is used to instantiate a MiniCPM
    model according to the specified arguments, defining the model architecture. The defaults are generic placeholder
    values (for example a hidden size of 4096 and 32 layers); a released checkpoint overrides them in its `config.json`.

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.


    Args:
        vocab_size (`int`, *optional*, defaults to 32000):
            Vocabulary size of the MiniCPM model. Defines the number of different tokens that can be represented by the
            `input_ids` passed when calling [`MiniCPMModel`]
        hidden_size (`int`, *optional*, defaults to 4096):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 11008):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 2048):
            The maximum sequence length that this model might ever be used with.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-06):
            The epsilon used by the rms normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        pad_token_id (`int`, *optional*):
            Padding token id.
        bos_token_id (`int`, *optional*, defaults to 1):
            Beginning of stream token id.
        eos_token_id (`int`, *optional*, defaults to 2):
            End of stream token id.
        pretraining_tp (`int`, *optional*, defaults to 1):
            Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
            document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
            necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
            issue](https://github.com/pytorch/pytorch/issues/76232).
        tie_word_embeddings (`bool`, *optional*, defaults to `True`):
            Whether to tie the input and output word embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`Dict`, *optional*):
            Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
            strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
            `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
            `max_position_embeddings` to the expected new maximum. See the following thread for more information on how
            these scaling strategies behave:
            https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
            experimental feature, subject to breaking API changes in future versions.
        attention_bias (`bool`, *optional*, defaults to `False`):
            Whether to use a bias in the query, key, value and output projection layers during self-attention.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the attention probabilities.

    ```python
    >>> from transformers import MiniCPMModel, MiniCPMConfig

    >>> # Initializing a MiniCPM minicpm-7b style configuration
    >>> configuration = MiniCPMConfig()

    >>> # Initializing a model from the minicpm-7b style configuration
    >>> model = MiniCPMModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "minicpm"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32000,
        hidden_size=4096,
        intermediate_size=11008,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        hidden_act="silu",
        max_position_embeddings=2048,
        initializer_range=0.02,
        rms_norm_eps=1e-6,
        use_cache=True,
        pad_token_id=None,
        bos_token_id=1,
        eos_token_id=2,
        pretraining_tp=1,
        tie_word_embeddings=True,
        rope_theta=10000.0,
        rope_scaling=None,
        attention_bias=False,
        attention_dropout=0.0,
        scale_emb=1,
        dim_model_base=1,
        scale_depth=1,
        num_experts=0,
        num_experts_per_tok=0,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        # for backward compatibility
        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.pretraining_tp = pretraining_tp
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self._rope_scaling_validation()
        self.attention_bias = attention_bias
        self.attention_dropout = attention_dropout
        self.scale_emb = scale_emb
        self.dim_model_base = dim_model_base
        self.scale_depth = scale_depth
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok

        super().__init__(
            pad_token_id=pad_token_id,
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
        # Prefer FlashAttention 2 when the optional `flash_attn` package can be imported.
        try:
            import flash_attn  # noqa: F401

            self._attn_implementation = "flash_attention_2"
        except Exception:
            # Package missing or broken: keep the default attention implementation.
            pass

    def _rope_scaling_validation(self):
        """
        Validate the `rope_scaling` configuration.
        """
        if self.rope_scaling is None:
            return

        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
            raise ValueError(
                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
                f"got {self.rope_scaling}"
            )
        rope_scaling_type = self.rope_scaling.get("type", None)
        rope_scaling_factor = self.rope_scaling.get("factor", None)
        if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
            raise ValueError(
                f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
            )
        if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
            raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
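One practical consequence of `_rope_scaling_validation` is that `factor` must be a Python `float`: an integer such as `4` is rejected even though it is greater than 1. The sketch below re-implements the same checks as a standalone function so it runs without `transformers`; `validate_rope_scaling` and `is_valid` are illustrative names, not part of the repository.

```python
def validate_rope_scaling(rope_scaling):
    """Standalone copy of the checks performed by MiniCPMConfig._rope_scaling_validation."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(f"expected a dict with `type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(f"type must be 'linear' or 'dynamic', got {scaling_type}")
    if not isinstance(factor, float) or factor <= 1.0:
        raise ValueError(f"factor must be a float > 1, got {factor}")


def is_valid(rope_scaling):
    """Return True when the configuration passes validation."""
    try:
        validate_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


print(is_valid({"type": "dynamic", "factor": 4.0}))  # the value used in config.json
print(is_valid({"type": "dynamic", "factor": 4}))    # int factor
print(is_valid({"type": "yarn", "factor": 2.0}))     # unsupported strategy
```

When editing `config.json` by hand, writing the factor as `4.0` rather than `4` avoids tripping the float check.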
1455
modeling_minicpm.py
Normal file
File diff suppressed because it is too large
3
pytorch_model.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2dc123401500a7b5f61f8873e6c08f03e065a4cd6576f5a76eea03effe069e35
size 6015591434
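The pointer above records a checkpoint of 6015591434 bytes. As a sanity check, the sketch below estimates the parameter count from the values in `config.json`. Because the diff of `modeling_minicpm.py` is suppressed here, it assumes a Llama-style decoder: bias-free q/k/v/o projections, a gated MLP with three weight matrices, two RMSNorm weights per layer plus a final norm, and untied input and output embeddings (`tie_word_embeddings` is false).

```python
# Values taken from config.json in this commit.
hidden, inter, layers, vocab = 2304, 5760, 40, 122760

attention = 4 * hidden * hidden   # q, k, v, o projections (36 KV heads = 36 heads, so plain MHA)
mlp = 3 * hidden * inter          # gate, up and down projections
norms = 2 * hidden                # two RMSNorm weight vectors per layer
non_embedding = layers * (attention + mlp + norms) + hidden  # + final RMSNorm
embeddings = 2 * vocab * hidden   # untied token embedding and output head
total_bytes = 2 * (non_embedding + embeddings)  # bfloat16 stores 2 bytes per parameter

print(non_embedding, total_bytes)
```

Under these assumptions the non-embedding count comes to about 2.44B, in line with the README's "2.4B parameters excluding embeddings", and the estimated bfloat16 size lands within about 0.01% of the recorded file size (the small remainder would be serialization overhead and buffers).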
32
special_tokens_map.json
Normal file
@@ -0,0 +1,32 @@
{
  "additional_special_tokens": [
    "<|im_end|>",
    "▁<PRE>",
    "▁<MID>",
    "<|tool_call|>",
    "<|im_start|>",
    "▁<EOT>",
    "▁<SUF>"
  ],
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
294498
tokenizer.json
Normal file
File diff suppressed because it is too large
3
tokenizer.model
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c9aafcd7da1f5611dab6be545db74d5552a2ccc9c2a12c72ea7be63aac4a25d7
size 1994871
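Both binary files in this commit are stored as Git LFS pointer files: three `key value` lines giving the spec version, the content hash, and the size in bytes. A minimal sketch of reading such a pointer follows; the function name `parse_lfs_pointer` and the returned field names are illustrative.

```python
# Pointer text for tokenizer.model, as committed above.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:c9aafcd7da1f5611dab6be545db74d5552a2ccc9c2a12c72ea7be63aac4a25d7
size 1994871
"""


def parse_lfs_pointer(text):
    """Split a Git LFS pointer file into its version, hash and size fields."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    hash_algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

After downloading the real file, its SHA-256 digest and byte length can be compared against `digest` and `size` to confirm the transfer was complete.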
107
tokenizer_config.json
Normal file
@@ -0,0 +1,107 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122753": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122754": {
      "content": "▁<PRE>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122755": {
      "content": "▁<MID>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122756": {
      "content": "<|tool_call|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122757": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122758": {
      "content": "▁<EOT>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "122759": {
      "content": "▁<SUF>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_end|>",
    "▁<PRE>",
    "▁<MID>",
    "<|tool_call|>",
    "<|im_start|>",
    "▁<EOT>",
    "▁<SUF>"
  ],
  "bos_token": "<s>",
  "chat_template": "{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}{% endif %}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "</s>",
  "legacy": true,
  "model_max_length": 131072,
  "pad_token": null,
  "sp_model_kwargs": {},
  "spaces_between_special_tokens": false,
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}