Initialize project; model provided by the ModelHub XC community
Model: PAIXAI/Astrid-3B Source: Original Platform
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
206
README.md
Normal file
@@ -0,0 +1,206 @@
---
language:
- en
library_name: transformers
tags:
- gpt
- llm
- large language model
- PAIX.Cloud
inference: true
thumbnail: >-
  https://static.wixstatic.com/media/bdee4e_8aa5cefc86024bc88f7e20e3e19d9ff3~mv2.png/v1/fill/w_192%2Ch_192%2Clg_1%2Cusm_0.66_1.00_0.01/bdee4e_8aa5cefc86024bc88f7e20e3e19d9ff3~mv2.png
license: apache-2.0
---

# Model Card
## Summary

This model, Astrid-3B, is a StableLMEpochModel for causal language modeling, designed to generate human-like text.
It's part of our mission to make AI technology accessible to everyone, focusing on personalization, data privacy, and transparent AI governance.
Trained in English, it's a versatile tool for a variety of applications.
This model is one of the many models available on our platform; we currently also offer open-source 1B and 7B models.

This model was trained by [PAIX.Cloud](https://www.paix.cloud/).
- Wait list: [Wait List](https://www.paix.cloud/join-waitlist)


## Usage

To use the model with the `transformers` library on a machine with GPUs, first make sure you have the `transformers` library installed.

```bash
pip install transformers==4.34.0
```

Also make sure you provide your Hugging Face token to the pipeline if the model is in a private repo.
- Either leave `token=True` in the `pipeline` and log in to `huggingface_hub` by running
    ```python
    import huggingface_hub
    huggingface_hub.login(<ACCESS_TOKEN>)
    ```
- Or directly pass your `<ACCESS_TOKEN>` to `token` in the `pipeline`

```python
from transformers import pipeline

generate_text = pipeline(
    model="PAIXAI/Astrid-3B",
    torch_dtype="auto",
    trust_remote_code=True,
    use_fast=True,
    device_map={"": "cuda:0"},
    token=True,
)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
```

You can print a sample prompt after the preprocessing step to see how it is fed to the tokenizer:

```python
print(generate_text.preprocess("Why is drinking water so healthy?")["prompt_text"])
```

```bash
<|prompt|>Why is drinking water so healthy?<|endoftext|><|answer|>
```
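
The template can also be reproduced without loading the pipeline. The following is a minimal sketch, assuming the prompt format shown above (it matches the `STYLE` constant in `h2oai_pipeline.py`); the helper name `build_prompt` is hypothetical and not part of the repository.

```python
# Prompt template used by this model card; mirrors STYLE in h2oai_pipeline.py.
PROMPT_TEMPLATE = "<|prompt|>{instruction}<|endoftext|><|answer|>"


def build_prompt(instruction: str) -> str:
    """Wrap a raw instruction in the prompt/answer markers (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(instruction=instruction)


print(build_prompt("Why is drinking water so healthy?"))
```

A string built this way can be passed to the tokenizer with `add_special_tokens=False`, as in the manual generation example in this README.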

Alternatively, you can download [h2oai_pipeline.py](h2oai_pipeline.py), store it alongside your notebook, and construct the pipeline yourself from the loaded model and tokenizer. If the model and the tokenizer are fully supported in the `transformers` package, this will allow you to set `trust_remote_code=False`.

```python
from h2oai_pipeline import H2OTextGenerationPipeline
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "PAIXAI/Astrid-3B",
    use_fast=True,
    padding_side="left",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "PAIXAI/Astrid-3B",
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
generate_text = H2OTextGenerationPipeline(model=model, tokenizer=tokenizer)

res = generate_text(
    "Why is drinking water so healthy?",
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)
print(res[0]["generated_text"])
```
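
After generation, the custom pipeline keeps only the answer portion of the full text: what follows `<|answer|>`, cut at any later `<|prompt|>` marker. The plain-Python sketch below roughly mirrors that step. `extract_answer` is a hypothetical name, and unlike the repository code it returns the text unchanged when no `<|answer|>` marker is present (an assumption made here to avoid an `IndexError`).

```python
def extract_answer(generated_text: str) -> str:
    """Return the answer portion of a full generated string (hypothetical helper)."""
    # partition() never raises: if the marker is missing, `found` is "".
    _, found, tail = generated_text.partition("<|answer|>")
    if not found:
        return generated_text.strip()
    # Drop anything the model produced after starting a new prompt turn.
    return tail.split("<|prompt|>")[0].strip()


full = "<|prompt|>How are you?<|endoftext|><|answer|> Fine, thanks. <|prompt|>And you?"
print(extract_answer(full))
```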


You may also construct the pipeline from the loaded model and tokenizer yourself and consider the preprocessing steps:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "PAIXAI/Astrid-3B"  # either local folder or huggingface model name
# Important: The prompt needs to be in the same format the model was trained with.
# You can find an example prompt in the experiment logs.
prompt = "<|prompt|>How are you?<|endoftext|><|answer|>"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    use_fast=True,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map={"": "cuda:0"},
    trust_remote_code=True,
)
model.cuda().eval()
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to("cuda")

# generate configuration can be modified to your needs
tokens = model.generate(
    input_ids=inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    min_new_tokens=2,
    max_new_tokens=256,
    do_sample=False,
    num_beams=1,
    temperature=float(0.3),
    repetition_penalty=float(1.2),
    renormalize_logits=True
)[0]

tokens = tokens[inputs["input_ids"].shape[1]:]
answer = tokenizer.decode(tokens, skip_special_tokens=True)
print(answer)
```

## Quantization and sharding

You can load the model with quantization by specifying `load_in_8bit=True` or `load_in_4bit=True`. Sharding across multiple GPUs is also possible by setting `device_map="auto"`.
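
As a sketch of how these options combine, the helper below only assembles the keyword arguments and does not load anything. `quantized_load_kwargs` is a hypothetical name. It assumes the `load_in_8bit` / `load_in_4bit` flags of `from_pretrained` as available in the pinned `transformers==4.34.0` (newer releases prefer a `BitsAndBytesConfig`), and that `bitsandbytes` and `accelerate` are installed when the arguments are actually used.

```python
from typing import Any, Dict, Optional


def quantized_load_kwargs(bits: Optional[int] = None, shard: bool = False) -> Dict[str, Any]:
    """Build from_pretrained keyword arguments (hypothetical helper)."""
    if bits not in (None, 4, 8):
        raise ValueError("bits must be None, 4, or 8")
    kwargs: Dict[str, Any] = {"torch_dtype": "auto", "trust_remote_code": True}
    if bits == 8:
        kwargs["load_in_8bit"] = True
    elif bits == 4:
        kwargs["load_in_4bit"] = True
    # "auto" lets accelerate spread layers over all visible GPUs.
    kwargs["device_map"] = "auto" if shard else {"": "cuda:0"}
    return kwargs


print(quantized_load_kwargs(bits=4, shard=True))
# Intended use (needs GPUs and the model weights, so it is not executed here):
# model = AutoModelForCausalLM.from_pretrained(
#     "PAIXAI/Astrid-3B", **quantized_load_kwargs(bits=4, shard=True)
# )
```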

## Model Architecture

```
StableLMEpochForCausalLM(
  (model): StableLMEpochModel(
    (embed_tokens): Embedding(50304, 2560, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x DecoderLayer(
        (self_attn): Attention(
          (q_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (k_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (v_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (o_proj): Linear(in_features=2560, out_features=2560, bias=False)
          (rotary_emb): RotaryEmbedding()
        )
        (mlp): MLP(
          (gate_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (up_proj): Linear(in_features=2560, out_features=6912, bias=False)
          (down_proj): Linear(in_features=6912, out_features=2560, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
        (post_attention_layernorm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
      )
    )
    (norm): LayerNorm((2560,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=2560, out_features=50304, bias=False)
)
```
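
The parameter count can be checked from the layer shapes printed above. This is back-of-the-envelope arithmetic, assuming each `LayerNorm` carries both a weight and a bias (PyTorch's default with `elementwise_affine=True`) and that all projections are bias-free as shown.

```python
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 50304, 2560, 6912, 32

embedding = VOCAB * HIDDEN        # embed_tokens
attention = 4 * HIDDEN * HIDDEN   # q_proj, k_proj, v_proj, o_proj (no bias)
mlp = 3 * HIDDEN * INTERMEDIATE   # gate_proj, up_proj, down_proj (no bias)
layer_norm = 2 * HIDDEN           # weight + bias of one LayerNorm
per_layer = attention + mlp + 2 * layer_norm
lm_head = HIDDEN * VOCAB          # untied output projection

# embeddings + 32 decoder layers + final norm + lm_head
total = embedding + LAYERS * per_layer + layer_norm + lm_head
print(f"{total:,} parameters")            # 2,795,443,200 parameters
print(f"{2 * total:,} bytes in float16")  # 5,590,886,400 bytes in float16
```

In float16 that is about 5.59 GB, in line with the size recorded in the `model.safetensors` LFS pointer (5590927136 bytes, which also includes the file header).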

## Model Configuration

This model was trained using H2O LLM Studio and with the configuration in [cfg.yaml](cfg.yaml). Visit [H2O LLM Studio](https://github.com/h2oai/h2o-llmstudio) to learn how to train your own large language models.
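
For orientation, `cfg.yaml` enables LoRA with `lora_r: 4` and `lora_alpha: 16`. The sketch below works out what those two values imply under the standard LoRA formulation (the low-rank update is scaled by `alpha / r`, and each adapted linear layer gains `r * (in + out)` weights). Which layers were actually adapted is not stated (`lora_target_modules` is empty), so the 2560x2560 projection used here is only an illustrative assumption, and `lora_extra_params` is a hypothetical helper.

```python
LORA_R, LORA_ALPHA = 4, 16  # values taken from cfg.yaml


def lora_extra_params(in_features: int, out_features: int, r: int = LORA_R) -> int:
    """Weights added by one LoRA adapter: A is (r x in), B is (out x r)."""
    return r * (in_features + out_features)


scaling = LORA_ALPHA / LORA_R         # factor applied to the low-rank update
print(scaling)                        # 4.0
print(lora_extra_params(2560, 2560))  # 20480
```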


## Disclaimer

Please read this disclaimer carefully before using the large language model provided in this repository. Your use of the model signifies your agreement to the following terms and conditions.

- Biases and Offensiveness: The large language model is trained on a diverse range of internet text data, which may contain biased, racist, offensive, or otherwise inappropriate content. By using this model, you acknowledge and accept that the generated content may sometimes exhibit biases or produce content that is offensive or inappropriate. The developers of this repository do not endorse, support, or promote any such content or viewpoints.
- Limitations: The large language model is an AI-based tool and not a human. It may produce incorrect, nonsensical, or irrelevant responses. It is the user's responsibility to critically evaluate the generated content and use it at their discretion.
- Use at Your Own Risk: Users of this large language model must assume full responsibility for any consequences that may arise from their use of the tool. The developers and contributors of this repository shall not be held liable for any damages, losses, or harm resulting from the use or misuse of the provided model.
- Ethical Considerations: Users are encouraged to use the large language model responsibly and ethically. By using this model, you agree not to use it for purposes that promote hate speech, discrimination, harassment, or any form of illegal or harmful activities.
- Reporting Issues: If you encounter any biased, offensive, or otherwise inappropriate content generated by the large language model, please report it to the repository maintainers through the provided channels. Your feedback will help improve the model and mitigate potential issues.
- Changes to this Disclaimer: The developers of this repository reserve the right to modify or update this disclaimer at any time without prior notice. It is the user's responsibility to periodically review the disclaimer to stay informed about any changes.

By using the large language model provided in this repository, you agree to accept and comply with the terms and conditions outlined in this disclaimer. If you do not agree with any part of this disclaimer, you should refrain from using the model and any content generated by it.
99
cfg.yaml
Normal file
@@ -0,0 +1,99 @@
architecture:
    backbone_dtype: int4
    force_embedding_gradients: false
    gradient_checkpointing: true
    intermediate_dropout: 0.0
    pretrained: true
    pretrained_weights: ''
augmentation:
    random_parent_probability: 0.0
    skip_parent_probability: 0.0
    token_mask_probability: 0.0
dataset:
    add_eos_token_to_answer: true
    add_eos_token_to_prompt: true
    add_eos_token_to_system: true
    answer_column: output
    chatbot_author: PAIX.Cloud
    chatbot_name: Astrid
    data_sample: 1.0
    data_sample_choice:
    - Train
    - Validation
    limit_chained_samples: false
    mask_prompt_labels: true
    parent_id_column: None
    personalize: true
    prompt_column:
    - instruction
    system_column: None
    text_answer_separator: <|answer|>
    text_prompt_start: <|prompt|>
    text_system_start: <|system|>
    train_dataframe: /workspace/data/user/oasst/train_full.pq
    validation_dataframe: None
    validation_size: 0.01
    validation_strategy: automatic
environment:
    compile_model: false
    find_unused_parameters: false
    gpus:
    - '0'
    huggingface_branch: main
    mixed_precision: true
    number_of_workers: 8
    seed: -1
    trust_remote_code: true
    use_fsdp: false
experiment_name: Astrid-3B
llm_backbone: stabilityai/stablelm-3b-4e1t
logging:
    logger: Neptune
    neptune_project: daviess/paix
output_directory: /workspace/output/user/Astrid-3B/
prediction:
    batch_size_inference: 0
    do_sample: false
    max_length_inference: 256
    metric: GPT
    metric_gpt_model: gpt-3.5-turbo-0301
    min_length_inference: 2
    num_beams: 1
    num_history: 4
    repetition_penalty: 1.2
    stop_tokens: ''
    temperature: 0.3
    top_k: 0
    top_p: 1.0
problem_type: text_causal_language_modeling
tokenizer:
    add_prefix_space: false
    add_prompt_answer_tokens: false
    max_length: 512
    max_length_answer: 256
    max_length_prompt: 256
    padding_quantile: 1.0
    use_fast: true
training:
    batch_size: 2
    differential_learning_rate: 1.0e-05
    differential_learning_rate_layers: []
    drop_last_batch: true
    epochs: 1
    evaluate_before_training: false
    evaluation_epochs: 1.0
    grad_accumulation: 1
    gradient_clip: 0.0
    learning_rate: 0.0001
    lora: true
    lora_alpha: 16
    lora_dropout: 0.05
    lora_r: 4
    lora_target_modules: ''
    loss_function: TokenAveragedCrossEntropy
    optimizer: AdamW
    save_best_checkpoint: false
    schedule: Cosine
    train_validation_data: false
    warmup_epochs: 0.0
    weight_decay: 0.0
40
config.json
Normal file
@@ -0,0 +1,40 @@
{
  "_name_or_path": "stabilityai/stablelm-3b-4e1t",
  "architectures": [
    "StableLMEpochForCausalLM"
  ],
  "attention_probs_dropout_prob": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_stablelm_epoch.StableLMEpochConfig",
    "AutoModelForCausalLM": "modeling_stablelm_epoch.StableLMEpochForCausalLM"
  },
  "bos_token_id": 0,
  "custom_pipelines": {
    "text-generation": {
      "impl": "h2oai_pipeline.H2OTextGenerationPipeline",
      "pt": "AutoModelForCausalLM"
    }
  },
  "eos_token_id": 0,
  "hidden_act": "silu",
  "hidden_dropout_prob": 0.0,
  "hidden_size": 2560,
  "initializer_range": 0.02,
  "intermediate_size": 6912,
  "max_position_embeddings": 4096,
  "model_type": "stablelm_epoch",
  "norm_eps": 1e-05,
  "num_attention_heads": 32,
  "num_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "pad_token_id": 0,
  "rope_pct": 0.25,
  "rope_theta": 10000,
  "rotary_scaling_factor": 1.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.34.0",
  "use_cache": true,
  "vocab_size": 50304
}
110
configuration_stablelm_epoch.py
Normal file
@@ -0,0 +1,110 @@
# coding=utf-8
# Copyright 2023 Stability and The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" StableLM Epoch model configuration"""
from transformers import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)


class StableLMEpochConfig(PretrainedConfig):
    r"""
    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 50_304):
            Vocabulary size of the StableLM model. Defines the number of different tokens that
            can be represented by the `input_ids` passed when calling [`StableLMEpochModel`].
        intermediate_size (`int`, *optional*, defaults to 6912):
            Dimension of the MLP representations.
        hidden_size (`int`, *optional*, defaults to 2560):
            Dimension of the decoder layers and the pooler layer.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*, defaults to 32):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf).
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string).
        rope_pct (`float`, *optional*, defaults to 0.25):
            Fraction of hidden dimensions to allocate to rotary embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with.
            Typically set this to something large just in case (e.g., 512 or 1024 or 2048).
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing
            all weight matrices.
        norm_eps (`float`, *optional*, defaults to 1e-5):
            The epsilon used by the normalization layers.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions
            (not used by all models). Only relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie the input and output word embeddings.
    """
    model_type = "stablelm_epoch"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=50_304,
        intermediate_size=6912,
        hidden_size=2560,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=32,
        hidden_act="silu",
        rope_pct=0.25,
        rope_theta=10_000,
        max_position_embeddings=4096,
        initializer_range=0.02,
        norm_eps=1.0e-5,
        use_cache=True,
        bos_token_id=0,
        eos_token_id=2,
        tie_word_embeddings=False,
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.max_position_embeddings = max_position_embeddings
        self.intermediate_size = intermediate_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads
        self.num_key_value_heads = num_key_value_heads
        self.hidden_act = hidden_act
        self.rope_pct = rope_pct
        self.rope_theta = rope_theta
        self.initializer_range = initializer_range
        self.norm_eps = norm_eps
        self.use_cache = use_cache
        self.tie_word_embeddings = tie_word_embeddings
        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )
7
generation_config.json
Normal file
@@ -0,0 +1,7 @@
{
  "_from_model_config": true,
  "bos_token_id": 0,
  "eos_token_id": 0,
  "pad_token_id": 0,
  "transformers_version": "4.34.0"
}
42
h2oai_pipeline.py
Normal file
@@ -0,0 +1,42 @@
from transformers import TextGenerationPipeline
from transformers.pipelines.text_generation import ReturnType

STYLE = "<|prompt|>{instruction}<|endoftext|><|answer|>"


class H2OTextGenerationPipeline(TextGenerationPipeline):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.prompt = STYLE

    def preprocess(
        self, prompt_text, prefix="", handle_long_generation=None, **generate_kwargs
    ):
        prompt_text = self.prompt.format(instruction=prompt_text)
        return super().preprocess(
            prompt_text,
            prefix=prefix,
            handle_long_generation=handle_long_generation,
            **generate_kwargs,
        )

    def postprocess(
        self,
        model_outputs,
        return_type=ReturnType.FULL_TEXT,
        clean_up_tokenization_spaces=True,
    ):
        records = super().postprocess(
            model_outputs,
            return_type=return_type,
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
        )
        for rec in records:
            rec["generated_text"] = (
                rec["generated_text"]
                .split("<|answer|>")[1]
                .strip()
                .split("<|prompt|>")[0]
                .strip()
            )
        return records
3
model.safetensors
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:81a47c9006956c120a1116c0fe6aab374f2e36eab45c9cfa9188c4104c74a837
size 5590927136
687
modeling_stablelm_epoch.py
Normal file
@@ -0,0 +1,687 @@
|
|||||||
|
# coding=utf-8
|
||||||
|
# Copyright 2023 Stability AI, EleutherAI, and The HuggingFace Inc. team. All rights reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
#
|
||||||
|
# This code is based off the following work:
|
||||||
|
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py
|
||||||
|
# https://github.com/huggingface/transformers/blob/main/src/transformers/models/gpt_neox/modeling_gpt_neox.py
|
||||||
|
""" PyTorch StableLM Epoch model. """
|
||||||
|
from typing import Optional, Tuple, Union
|
||||||
|
import math
|
||||||
|
|
||||||
|
import torch
|
||||||
|
import torch.utils.checkpoint
|
||||||
|
from torch import nn
|
||||||
|
from torch.nn import CrossEntropyLoss
|
||||||
|
from transformers.modeling_outputs import (
|
||||||
|
BaseModelOutputWithPast,
|
||||||
|
CausalLMOutputWithPast,
|
||||||
|
)
|
||||||
|
from transformers.modeling_utils import PreTrainedModel
|
||||||
|
from transformers.utils import logging
|
||||||
|
from .configuration_stablelm_epoch import StableLMEpochConfig
|
||||||
|
|
||||||
|
|
||||||
|
logger = logging.get_logger(__name__)
|
||||||
|
|
||||||
|
|
||||||
|
# Copied from transformers.models.bart.modeling_bart._make_causal_mask
|
||||||
|
def _make_causal_mask(
|
||||||
|
input_ids_shape: torch.Size,
|
||||||
|
dtype: torch.dtype,
|
||||||
|
device: torch.device,
|
||||||
|
past_key_values_length: int = 0,
|
||||||
|
):
|
||||||
|
"""Make causal mask used for bi-directional self-attention."""
|
||||||
|
batch_size, tgt_len = input_ids_shape
|
||||||
|
mask = torch.full((tgt_len, tgt_len), torch.finfo(torch.float16).min, device=device)
|
||||||
|
mask_cond = torch.arange(mask.size(-1), device=device)
|
||||||
|
mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
|
||||||
|
mask = mask.to(dtype)
|
||||||
|
if past_key_values_length > 0:
|
||||||
|
mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
|
||||||
|
return mask[None, None, :, :].expand(batch_size, 1, tgt_len, tgt_len + past_key_values_length)
|
||||||
|
|
||||||
|
|
||||||
|
# Copied from transformers.models.bart.modeling_bart._expand_mask
|
||||||
|
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
|
||||||
|
"""Expands attention_mask from `[batch_size, seq_len]` to `[batch_size, 1, tgt_seq_len, src_seq_len]`."""
|
||||||
|
batch_size, src_len = mask.size()
|
||||||
|
tgt_len = tgt_len if tgt_len is not None else src_len
|
||||||
|
|
||||||
|
expanded_mask = mask[:, None, None, :].expand(batch_size, 1, tgt_len, src_len).to(dtype)
|
||||||
|
inverted_mask = 1.0 - expanded_mask
|
||||||
|
|
||||||
|
return inverted_mask.masked_fill(
|
||||||
|
inverted_mask.to(torch.bool), torch.finfo(dtype).min
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
class RotaryEmbedding(nn.Module):
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
dim: int,
|
||||||
|
max_position_embeddings: int,
|
||||||
|
base: int = 10_000,
|
||||||
|
device: Optional[torch.device] = None,
|
||||||
|
):
|
||||||
|
super().__init__()
|
||||||
|
|
||||||
|
self.dim = dim
|
||||||
|
self.max_position_embeddings = max_position_embeddings
|
||||||
|
self.base = base
|
||||||
|
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, device=device, dtype=torch.float32) / self.dim))
|
||||||
|
self.register_buffer("inv_freq", inv_freq, persistent=False)
|
||||||
|
|
||||||
|
# Build here to make `torch.jit.trace` work.
|
||||||
|
self._set_cos_sin_cache(
|
||||||
|
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype(),
|
||||||
|
)
|
||||||
|
|
||||||
|
def _set_cos_sin_cache(self, seq_len: int, device: torch.device, dtype: torch.dtype):
|
||||||
|
self.max_seq_len_cached = seq_len
|
||||||
|
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.float32)
|
||||||
|
|
||||||
|
# Don't do einsum, it converts fp32 to fp16 under AMP
|
||||||
|
# freqs = torch.einsum("i,j->ij", t, self.inv_freq)
|
||||||
|
freqs = torch.outer(t, self.inv_freq)
|
||||||
|
# Different from paper, but it uses a different permutation in order to obtain the same calculation
|
||||||
|
emb = torch.cat((freqs, freqs), dim=-1)
|
||||||
|
self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
|
||||||
|
self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)
|
||||||
|
|
||||||
|
def forward(self, x: torch.Tensor, seq_len: Optional[int] = None):
|
||||||
|
# x: [batch_size, num_heads, seq_len, head_size]
|
||||||
|
if seq_len > self.max_seq_len_cached:
|
||||||
|
self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=torch.get_default_dtype())
|
||||||
|
return (
|
||||||
|
self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
|
||||||
|
self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
|
||||||
|
)
|
||||||
|
|
||||||
|
|
||||||
|
def rotate_half(x: torch.Tensor):
|
||||||
|
"""Rotates half the hidden dims of the input."""
|
||||||
|
x1, x2 = torch.chunk(x, 2, dim=-1)
|
||||||
|
return torch.cat((-x2, x1), dim=-1)
|
||||||
|
|
||||||
|
|
||||||
|
def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
|
||||||
|
# The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
|
||||||
|
cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
|
||||||
|
sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
|
||||||
|
cos = cos[position_ids].unsqueeze(1) # [batch_size, 1, seq_len, dim]
|
||||||
|
sin = sin[position_ids].unsqueeze(1) # [batch_size, 1, seq_len, dim]
|
||||||
|
q_embed = (q * cos) + (rotate_half(q) * sin)
|
||||||
|
k_embed = (k * cos) + (rotate_half(k) * sin)
|
||||||
|
return q_embed, k_embed


class MLP(nn.Module):
    def __init__(self, config: StableLMEpochConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.intermediate_size = config.intermediate_size
        self.gate_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.up_proj = nn.Linear(config.hidden_size, config.intermediate_size, bias=False)
        self.down_proj = nn.Linear(config.intermediate_size, config.hidden_size, bias=False)
        self.act_fn = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
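
The gated feed-forward computation in `MLP.forward` can be written out by hand for one input vector. This is a dependency-free sketch, not the module itself: `silu`, `matvec`, and `gated_mlp` are hypothetical names, and the tiny weight matrices are made up for illustration.

```python
import math


def silu(v):
    # SiLU (swish): v * sigmoid(v).
    return v / (1.0 + math.exp(-v))


def matvec(weight, x):
    # weight is a list of rows, mirroring nn.Linear without bias.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


def gated_mlp(x, gate_w, up_w, down_w):
    gate = [silu(g) for g in matvec(gate_w, x)]
    up = matvec(up_w, x)
    # The activated gate scales the up projection elementwise.
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(down_w, hidden)


# hidden_size = 2, intermediate_size = 3 (illustrative values only)
gate_w = [[0.5, -1.0], [1.0, 0.0], [0.0, 2.0]]
up_w = [[1.0, 1.0], [-0.5, 0.5], [2.0, 0.0]]
down_w = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

out = gated_mlp([1.0, -2.0], gate_w, up_w, down_w)
```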


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
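
The head ordering produced by `repeat_kv` is easy to get wrong, so here is a small list-based sketch of it. The helper `repeat_kv_lists` is hypothetical and uses strings as stand-ins for per-head tensors; it only illustrates that copies of the same key/value head stay adjacent.

```python
def repeat_kv_lists(kv_heads, n_rep):
    """Repeat each key/value head n_rep times, keeping the copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]


kv_heads = ["kv0", "kv1"]  # stand-ins for per-head tensors
n_rep = 3
expanded = repeat_kv_lists(kv_heads, n_rep)

# With this layout, query head i reads from key/value head i // n_rep.
source_of = [i // n_rep for i in range(len(expanded))]
```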


class Attention(nn.Module):
    def __init__(self, config: StableLMEpochConfig):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size
        self.num_heads = config.num_attention_heads
        self.head_dim = self.hidden_size // self.num_heads
        self.num_key_value_heads = config.num_key_value_heads
        self.num_key_value_groups = self.num_heads // self.num_key_value_heads
        self.max_position_embeddings = config.max_position_embeddings

        if (self.head_dim * self.num_heads) != self.hidden_size:
            raise ValueError(
                f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
                f" and `num_heads`: {self.num_heads})."
            )
        self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=False)

        self._init_rope()

    def _init_rope(self):
        self.rotary_ndims = int(self.head_dim * self.config.rope_pct)
        self.rotary_emb = RotaryEmbedding(
            self.rotary_ndims,
            max_position_embeddings=self.config.max_position_embeddings,
            base=self.config.rope_theta,
        )

    def forward(
        self,
        hidden_states: torch.FloatTensor,
        attention_mask: torch.FloatTensor,
        position_ids: torch.LongTensor,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
        bsz, q_len, _ = hidden_states.size()

        query_states = self.q_proj(hidden_states)
        key_states = self.k_proj(hidden_states)
        value_states = self.v_proj(hidden_states)

        query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
        key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
        value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)

        query_rot = query_states[..., : self.rotary_ndims]
        query_pass = query_states[..., self.rotary_ndims :]
        key_rot = key_states[..., : self.rotary_ndims]
        key_pass = key_states[..., self.rotary_ndims :]

        kv_seq_len = key_states.shape[-2]
        if past_key_value is not None:
            kv_seq_len += past_key_value[0].shape[-2]
        cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
        query_states, key_states = apply_rotary_pos_emb(query_rot, key_rot, cos, sin, position_ids)

        # [batch_size, num_heads, seq_len, head_dim]
        query_states = torch.cat((query_states, query_pass), dim=-1)
        key_states = torch.cat((key_states, key_pass), dim=-1)

        if past_key_value is not None:
            # Reuse k, v, self_attention
            key_states = torch.cat((past_key_value[0], key_states), dim=2)
            value_states = torch.cat((past_key_value[1], value_states), dim=2)

        past_key_value = (key_states, value_states) if use_cache else None

        # Repeat k/v heads if n_kv_heads < n_heads
        key_states = repeat_kv(key_states, self.num_key_value_groups)
        value_states = repeat_kv(value_states, self.num_key_value_groups)

        attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)

        if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
            raise ValueError(
                f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
                f" {attn_weights.size()}"
            )

        if attention_mask is not None:
            if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
                raise ValueError(
                    f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
                )
            attn_weights = attn_weights + attention_mask

        # Upcast attention to fp32
        attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
        attn_output = torch.matmul(attn_weights, value_states)

        if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
            raise ValueError(
                f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
                f" {attn_output.size()}"
            )

        # Merge heads
        attn_output = attn_output.transpose(1, 2).contiguous()
        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)

        # Final linear projection
        attn_output = self.o_proj(attn_output)

        if not output_attentions:
            attn_weights = None

        return attn_output, attn_weights, past_key_value
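
The core of `Attention.forward` (scaled scores, additive mask, softmax, weighted sum of values) can be sketched for a single query in plain Python. The function `attend` and the constant `MASKED` are hypothetical; `-1e9` is an assumed stand-in for whatever large negative value the mask helpers in this file actually use.

```python
import math

MASKED = -1e9  # assumed stand-in for the mask fill value


def attend(query, keys, values, mask_row):
    """Scaled dot-product attention for one query over a list of keys."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, key)) / scale + m
        for key, m in zip(keys, mask_row)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# The third key is hidden from this query by the additive mask.
output, weights = attend([1.0, 0.0], keys, values, [0.0, 0.0, MASKED])
```

Adding the mask before the softmax, rather than zeroing weights afterwards, keeps the remaining weights normalized.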


class DecoderLayer(nn.Module):
    def __init__(self, config: StableLMEpochConfig):
        super().__init__()
        self.self_attn = Attention(config)
        self.mlp = MLP(config)
        self.input_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_eps)
        self.post_attention_layernorm = nn.LayerNorm(config.hidden_size, eps=config.norm_eps)

    def forward(
        self,
        hidden_states: Optional[torch.FloatTensor],
        attention_mask: Optional[torch.FloatTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_value: Optional[Tuple[torch.Tensor]] = None,
        output_attentions: Optional[bool] = False,
        use_cache: Optional[bool] = False,
    ) -> Union[Tuple[torch.Tensor], Optional[Tuple[torch.Tensor, Tuple[torch.FloatTensor, ...]]]]:
        residual = hidden_states

        hidden_states = self.input_layernorm(hidden_states)

        # Self Attention
        hidden_states, self_attn_weights, present_key_value = self.self_attn(
            hidden_states=hidden_states,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_value=past_key_value,
            output_attentions=output_attentions,
            use_cache=use_cache,
        )
        hidden_states = residual + hidden_states

        # Fully Connected
        residual = hidden_states
        hidden_states = self.post_attention_layernorm(hidden_states)
        hidden_states = self.mlp(hidden_states)
        hidden_states = residual + hidden_states

        outputs = (hidden_states,)

        if output_attentions:
            outputs += (self_attn_weights,)

        if use_cache:
            outputs += (present_key_value,)

        return outputs


class StableLMEpochPreTrainedModel(PreTrainedModel):
    """An abstract class to handle weights initialization and a simple interface
    for downloading and loading pretrained models.
    """

    config_class = StableLMEpochConfig
    base_model_prefix = "transformer"
    supports_gradient_checkpointing = True
    _no_split_modules = ["DecoderLayer"]
    _skip_keys_device_placement = "past_key_values"

    def _init_weights(self, module: nn.Module):
        """Initialize the weights"""
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)

    def _set_gradient_checkpointing(self, module: nn.Module, value=False):
        if isinstance(module, StableLMEpochModel):
            module.gradient_checkpointing = value


class StableLMEpochModel(StableLMEpochPreTrainedModel):
    def __init__(self, config: StableLMEpochConfig):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, config.pad_token_id)
        self.layers = nn.ModuleList([DecoderLayer(config) for _ in range(config.num_hidden_layers)])
        self.norm = nn.LayerNorm(config.hidden_size, eps=config.norm_eps)

        self.gradient_checkpointing = False
        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.embed_tokens

    def set_input_embeddings(self, value: nn.Module):
        self.embed_tokens = value

    # Copied from transformers.models.bart.modeling_bart.BartDecoder._prepare_decoder_attention_mask
    def _prepare_decoder_attention_mask(
        self,
        attention_mask: torch.Tensor,
        input_shape: torch.Size,
        inputs_embeds: torch.Tensor,
        past_key_values_length: int,
    ):
        # Create causal mask
        # [batch_size, seq_len] -> [batch_size, 1, tgt_seq_len, src_seq_len]
        combined_attention_mask = None
        if input_shape[-1] > 1:
            combined_attention_mask = _make_causal_mask(
                input_shape,
                inputs_embeds.dtype,
                device=inputs_embeds.device,
                past_key_values_length=past_key_values_length,
            )

        if attention_mask is not None:
            # [batch_size, seq_len] -> [batch_size, 1, tgt_seq_len, src_seq_len]
            expanded_attn_mask = _expand_mask(
                attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]
            ).to(inputs_embeds.device)
            combined_attention_mask = expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask

        return combined_attention_mask
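
To make the shape and meaning of the combined mask concrete, here is a simplified, list-based sketch for one sequence. It is not the `_make_causal_mask` / `_expand_mask` code: `build_additive_mask` is a hypothetical helper, `NEG` is an assumed fill value, and a position hidden by both causality and padding gets a single `NEG` here rather than the sum of two masks.

```python
NEG = -1e9  # assumed stand-in for the dtype minimum used by the real helpers


def build_additive_mask(padding_mask, tgt_len, past_len=0):
    """Return a [tgt_len][past_len + tgt_len] additive mask for one sequence.

    padding_mask has one 0/1 entry per source position (1 = real token).
    """
    src_len = past_len + tgt_len
    rows = []
    for i in range(tgt_len):
        row = []
        for j in range(src_len):
            # Query i sits at absolute position past_len + i.
            causal_ok = j <= past_len + i
            visible = causal_ok and padding_mask[j] == 1
            row.append(0.0 if visible else NEG)
        rows.append(row)
    return rows


# Prefill: three tokens, the first one is padding.
prefill_mask = build_additive_mask([0, 1, 1], tgt_len=3)

# Decode step: one new token attending to three cached positions plus itself.
decode_mask = build_additive_mask([1, 1, 1, 1], tgt_len=1, past_len=3)
```

For a single new token the causal constraint never hides anything, which matches the `input_shape[-1] > 1` check above that skips building the causal part.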

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, BaseModelOutputWithPast]:
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        # Retrieve input_ids and inputs_embeds
        if input_ids is not None and inputs_embeds is not None:
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time"
            )
        elif input_ids is not None:
            batch_size, seq_length = input_ids.shape
        elif inputs_embeds is not None:
            batch_size, seq_length, _ = inputs_embeds.shape
        else:
            raise ValueError(
                "You have to specify either input_ids or inputs_embeds"
            )

        seq_length_with_past = seq_length
        past_key_values_length = 0

        if past_key_values is not None:
            past_key_values_length = past_key_values[0][0].shape[2]
            seq_length_with_past = seq_length_with_past + past_key_values_length

        if position_ids is None:
            device = input_ids.device if input_ids is not None else inputs_embeds.device
            position_ids = torch.arange(
                past_key_values_length,
                seq_length + past_key_values_length,
                dtype=torch.long,
                device=device,
            )
            position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
        else:
            position_ids = position_ids.view(-1, seq_length).long()

        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        # Embed positions
        if attention_mask is None:
            attention_mask = torch.ones(
                (batch_size, seq_length_with_past),
                dtype=torch.bool,
                device=inputs_embeds.device,
            )
        attention_mask = self._prepare_decoder_attention_mask(
            attention_mask,
            (batch_size, seq_length),
            inputs_embeds,
            past_key_values_length,
        )

        hidden_states = inputs_embeds

        if self.gradient_checkpointing and self.training:
            if use_cache:
                logger.warning(
                    "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                )
                use_cache = False

        # Decoder layers
        all_hidden_states = () if output_hidden_states else None
        all_self_attns = () if output_attentions else None
        next_decoder_cache = () if use_cache else None

        for idx, decoder_layer in enumerate(self.layers):
            if output_hidden_states:
                all_hidden_states += (hidden_states,)

            past_key_value = (
                past_key_values[idx] if past_key_values is not None else None
            )

            if self.gradient_checkpointing and self.training:

                def create_custom_forward(module):
                    def custom_forward(*inputs):
                        # None for past_key_value
                        return module(*inputs, past_key_value, output_attentions)

                    return custom_forward

                layer_outputs = torch.utils.checkpoint.checkpoint(
                    create_custom_forward(decoder_layer),
                    hidden_states,
                    attention_mask,
                    position_ids,
                )
            else:
                layer_outputs = decoder_layer(
                    hidden_states,
                    attention_mask=attention_mask,
                    position_ids=position_ids,
                    past_key_value=past_key_value,
                    output_attentions=output_attentions,
                    use_cache=use_cache,
                )

            hidden_states = layer_outputs[0]

            if use_cache:
                next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)

            if output_attentions:
                all_self_attns += (layer_outputs[1],)

        hidden_states = self.norm(hidden_states)

        # Add hidden states from the last decoder layer
        if output_hidden_states:
            all_hidden_states += (hidden_states,)

        next_cache = next_decoder_cache if use_cache else None
        if not return_dict:
            return tuple(
                v
                for v in [hidden_states, next_cache, all_hidden_states, all_self_attns]
                if v is not None
            )
        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=next_cache,
            hidden_states=all_hidden_states,
            attentions=all_self_attns,
        )


class StableLMEpochForCausalLM(StableLMEpochPreTrainedModel):
    _tied_weights_keys = ["lm_head.weight"]

    def __init__(self, config: StableLMEpochConfig):
        super().__init__(config)

        self.model = StableLMEpochModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings: nn.Module):
        self.lm_head = new_embeddings

    def get_decoder(self):
        # The decoder is stored as `self.model` in `__init__`; there is no `self.transformer` attribute.
        return self.model

    def set_decoder(self, decoder):
        self.model = decoder

    def forward(
        self,
        input_ids: Optional[torch.LongTensor] = None,
        attention_mask: Optional[torch.FloatTensor] = None,
        position_ids: Optional[torch.LongTensor] = None,
        past_key_values: Optional[Tuple[Tuple[torch.FloatTensor]]] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        labels: Optional[torch.LongTensor] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, CausalLMOutputWithPast]:
        output_attentions = (
            output_attentions
            if output_attentions is not None
            else self.config.output_attentions
        )
        output_hidden_states = (
            output_hidden_states
            if output_hidden_states is not None
            else self.config.output_hidden_states
        )
        return_dict = (
            return_dict if return_dict is not None else self.config.use_return_dict
        )

        # Decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
        outputs = self.model(
            input_ids,
            attention_mask=attention_mask,
            position_ids=position_ids,
            past_key_values=past_key_values,
            inputs_embeds=inputs_embeds,
            use_cache=use_cache,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )

        hidden_states = outputs[0]
        logits = self.lm_head(hidden_states).float()

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            output = (logits,) + outputs[1:]
            return (loss,) + output if loss is not None else output

        return CausalLMOutputWithPast(
            loss=loss,
            logits=logits,
            past_key_values=outputs.past_key_values,
            hidden_states=outputs.hidden_states,
            attentions=outputs.attentions,
        )
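
The label shift in the loss computation above pairs the logits at step `t` with the label at step `t + 1`. A minimal list-based sketch of that pairing follows. `shifted_cross_entropy` is a hypothetical helper for one sequence; it assumes the default `ignore_index` of `-100` and mean reduction that `CrossEntropyLoss()` uses when called without arguments.

```python
import math


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Average next-token loss: logits[t] is scored against labels[t + 1]."""
    losses = []
    for step_logits, target in zip(logits[:-1], labels[1:]):
        if target == ignore_index:
            continue  # ignored targets do not contribute to the mean
        # log-sum-exp with the maximum subtracted for stability
        peak = max(step_logits)
        log_norm = peak + math.log(sum(math.exp(v - peak) for v in step_logits))
        losses.append(log_norm - step_logits[target])
    if not losses:
        return float("nan")
    return sum(losses) / len(losses)


# Uniform logits over a 4-token vocabulary; the last label is ignored.
logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
labels = [0, 1, -100]
loss = shifted_cross_entropy(logits, labels)
```

With uniform logits every token has probability 1/4, so the loss equals `log(4)` regardless of the targets.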

    def prepare_inputs_for_generation(
        self,
        input_ids,
        past_key_values: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        **kwargs,
    ):
        # Trim decoder_input_ids if past is used
        if past_key_values and past_key_values[0] is not None:
            input_ids = input_ids[:, -1:]

        position_ids = kwargs.get("position_ids", None)
        if attention_mask is not None and position_ids is None:
            # Create position_ids on the fly for batch generation
            position_ids = attention_mask.long().cumsum(-1) - 1
            position_ids.masked_fill_(attention_mask == 0, 1)
            if past_key_values:
                position_ids = position_ids[:, -1].unsqueeze(-1)

        # If `inputs_embeds` are passed, we only want to use them in the 1st generation step
        if inputs_embeds is not None and past_key_values is None:
            model_inputs = {"inputs_embeds": inputs_embeds}
        else:
            model_inputs = {"input_ids": input_ids}

        model_inputs.update(
            {
                "attention_mask": attention_mask,
                "past_key_values": past_key_values,
                "use_cache": kwargs.get("use_cache"),
                "position_ids": position_ids,
            }
        )
        return model_inputs
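
The on-the-fly `position_ids` computation (cumulative sum of the mask minus one, with padded slots set to 1) can be traced on a single left-padded row. The helper `position_ids_from_mask` is hypothetical and works on a plain list instead of a tensor.

```python
def position_ids_from_mask(attention_mask):
    """Mirror cumsum(-1) - 1 followed by masked_fill_(mask == 0, 1) for one row."""
    position_ids = []
    running = 0
    for m in attention_mask:
        running += m
        # Padded slots get the placeholder value 1; real tokens count from 0.
        position_ids.append(running - 1 if m else 1)
    return position_ids


# Two padding tokens on the left, then three real tokens.
mask = [0, 0, 1, 1, 1]
full_positions = position_ids_from_mask(mask)

# With a key/value cache only the newest token is fed, so only its position is kept.
last_position = full_positions[-1:]
```

The placeholder on padded slots is harmless as long as the attention mask hides those positions.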

    @staticmethod
    def _reorder_cache(past_key_values, beam_idx):
        reordered_past = ()
        for layer_past in past_key_values:
            reordered_past += (
                tuple(
                    past_state.index_select(0, beam_idx.to(past_state.device))
                    for past_state in layer_past
                ),
            )
        return reordered_past
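
During beam search, `_reorder_cache` picks, for every layer, the cached rows that belong to the surviving beams. The sketch below shows the same selection on nested lists. `reorder_cache` is a hypothetical stand-in in which each cached state is a list indexed by batch/beam position and the strings replace tensors.

```python
def reorder_cache(past_key_values, beam_idx):
    """Select cached rows along the batch axis for every layer's (key, value) pair."""
    return tuple(
        tuple([state[i] for i in beam_idx] for state in layer_past)
        for layer_past in past_key_values
    )


# One layer, three beams; strings stand in for per-beam key/value tensors.
past = ((["k_a", "k_b", "k_c"], ["v_a", "v_b", "v_c"]),)

# Beam 2 survives twice and beam 0 once; beam 1 is dropped.
reordered = reorder_cache(past, [2, 2, 0])
```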


StableLMEpochConfig.register_for_auto_class()
StableLMEpochForCausalLM.register_for_auto_class("AutoModelForCausalLM")
8
special_tokens_map.json
Normal file
@@ -0,0 +1,8 @@
{
  "bos_token": "<|endoftext|>",
  "cls_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "sep_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
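
A quick way to see what this map implies is to load it and count the distinct token strings. The snippet below is only a sketch: it embeds a copy of the JSON above as a string instead of reading the file from disk.

```python
import json

# Inline copy of special_tokens_map.json from this commit.
special_tokens_map = json.loads(
    """
    {
      "bos_token": "<|endoftext|>",
      "cls_token": "<|endoftext|>",
      "eos_token": "<|endoftext|>",
      "pad_token": "<|endoftext|>",
      "sep_token": "<|endoftext|>",
      "unk_token": "<|endoftext|>"
    }
    """
)

distinct_tokens = set(special_tokens_map.values())
pad_equals_eos = special_tokens_map["pad_token"] == special_tokens_map["eos_token"]
```

Because the padding and end-of-text roles share one token string, padded positions cannot be told apart from a real end-of-text token by id alone, so batched inputs need an explicit attention mask.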
100529
tokenizer.json
Normal file
File diff suppressed because it is too large
212
tokenizer_config.json
Normal file
@@ -0,0 +1,212 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<|padding|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "50254": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50255": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50256": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50257": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50258": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50259": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50260": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50261": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50262": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50263": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50264": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50265": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50266": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50267": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50268": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50269": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50270": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50271": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50272": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50273": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50274": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50275": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "50276": {
      "content": " ",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": false
    }
  },
  "additional_special_tokens": [],
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": true,
  "eos_token": "<|endoftext|>",
  "model_max_length": 1000000000000000019884624838656,
  "tokenizer_class": "GPTNeoXTokenizer",
  "unk_token": "<|endoftext|>"
}