Initialize project; model provided by the ModelHub XC community
Model: k050506koch/GPT3-dev-125m-0104 Source: Original Platform
.gitattributes (vendored, new file, 41 lines)
@@ -0,0 +1,41 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model_f16.gguf filter=lfs diff=lfs merge=lfs -text
model_f32.gguf filter=lfs diff=lfs merge=lfs -text
model_q8_0.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_fp16.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_fp32.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_q8_0.gguf filter=lfs diff=lfs merge=lfs -text
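All of the weight and archive patterns above are tracked with Git LFS, so a plain clone without LFS installed checks out small pointer stubs instead of the actual files. A minimal sketch of pulling the full files, assuming the repository is reachable through the Hugging Face Hub API under the same id used in the README below:

```python
# Sketch: fetch the real LFS-backed files rather than pointer stubs.
# Assumes the repo id below is available through the Hugging Face Hub API.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="k050506koch/GPT3-dev-125m-0104")
print("checkpoint files downloaded to:", local_dir)
```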
README.md (new file, 51 lines)
@@ -0,0 +1,51 @@
---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
pipeline_tag: text-generation
library_name: transformers
---

# GPT3

Welcome to the GPT3 repository! This project is an attempt to recreate the architecture and approach of the original OpenAI GPT-3 paper. The repository includes scripts for training, fine-tuning, and running inference with a GPT-3-like model using PyTorch and the Hugging Face Transformers library.

This repo hosts the weights of development checkpoints of my models. You can always download a checkpoint folder, paste its path into inference.py, and chat with the model.

# **You can find all code on [GitHub](https://github.com/krll-corp/GPT3)**

# Note: This is a model with 125 million parameters (an attempt to replicate GPT-3 Small). It is still very undertrained.

# Note 2: This is a model checkpoint released on 12/02 2025 (batch size 12, gradient accumulation 4, 512-token sequences, 65,000 steps with the Lion optimizer). It scores 28.65% on MMLU, which is slightly above the 25% random-guess baseline.

# Note 3: This model already demonstrates basic text-generation ability. It is not perfect, and I will continue working on it. Expect Instruct models soon.

## Inference

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('k050506koch/GPT3-dev-125m-0104', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('k050506koch/GPT3-dev-125m-0104')
tokenizer.pad_token_id = tokenizer.eos_token_id

input_ids = tokenizer.encode("He is a doctor. His main goal is", return_tensors='pt')
output_ids = model.generate(
    input_ids,
    max_length=128, temperature=0.7, top_p=0.9, repetition_penalty=1.2,
    no_repeat_ngram_size=3, num_return_sequences=1, do_sample=True,
)[0]
print("\n", tokenizer.decode(output_ids, skip_special_tokens=True))
```
# Note for instruct models: all instruct models share the same chat template

```plaintext
User: Here is a user prompt.<|endoftext|>\nAssistant: Here's an answer from the model.<|endoftext|>
```

Please note that the model is neither trained nor tested for multi-turn chat, so it is possible but unlikely that it will stay on topic and act coherently across messages.
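To make the template concrete, here is a minimal sketch of formatting a single user turn in that format and generating a reply. It reuses the base checkpoint id from the inference example above purely for illustration; the instruct checkpoints are separate releases, so treat this as the shape of the call rather than guaranteed behavior of this particular model.

```python
# Sketch: one user turn in the instruct chat template shown above.
# The repo id is the base dev checkpoint from this README, used only as an example.
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = 'k050506koch/GPT3-dev-125m-0104'
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)
tokenizer.pad_token_id = tokenizer.eos_token_id

prompt = "User: What does a doctor do?<|endoftext|>\nAssistant:"
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output_ids = model.generate(
    input_ids,
    max_new_tokens=64, do_sample=True, temperature=0.7, top_p=0.9,
    eos_token_id=tokenizer.eos_token_id,
)[0]
# Print only the newly generated assistant text.
print(tokenizer.decode(output_ids[input_ids.shape[-1]:], skip_special_tokens=True))
```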
## Contributing

Contributions are welcome! I'm just a student who is interested in AI, so my code may be incorrect or have logical issues. Please open an issue or submit a pull request for any improvements or bug fixes; I will be happy to review them.

## License

This project is licensed under the MIT License. See the LICENSE file for details. Everyone can use and modify this code at their discretion.

## Acknowledgements

Thanks to OpenAI, Hugging Face, and PyTorch for making this project possible!

- [OpenAI GPT-3 Paper](https://arxiv.org/abs/2005.14165)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://pytorch.org/)
config.json (new file, 36 lines)
@@ -0,0 +1,36 @@
{
  "activation_function": "gelu",
  "architectures": ["GPT3DevLMHeadModel"],
  "auto_map": {
    "AutoConfig": "modeling_gpt3dev.GPT3DevConfig",
    "AutoModel": "modeling_gpt3dev.GPT3DevModel",
    "AutoModelForCausalLM": "modeling_gpt3dev.GPT3DevLMHeadModel"
  },
  "attn_pdrop": 0.0,
  "bos_token_id": 50256,
  "embd_pdrop": 0.0,
  "eos_token_id": 50256,
  "initializer_range": 0.02,
  "layer_norm_epsilon": 1e-05,
  "model_type": "gpt3dev",
  "n_ctx": 2048,
  "n_embd": 768,
  "n_head": 12,
  "n_inner": 3072,
  "n_layer": 12,
  "n_positions": 2048,
  "reorder_and_upcast_attn": false,
  "resid_pdrop": 0.0,
  "scale_attn_by_inverse_layer_idx": false,
  "scale_attn_weights": true,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type": "cls_index",
  "summary_use_proj": true,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.50.1",
  "use_cache": true,
  "use_pre_layernorm": true,
  "vocab_size": 50257
}
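As a quick sanity check on the advertised 125M parameters, the size implied by this config can be estimated by hand. This is a back-of-the-envelope sketch that assumes GPT-2-style weight tying between the token embedding and the LM head; it is not an exact accounting of every buffer.

```python
# Rough parameter count for the config above (n_embd=768, n_layer=12, n_inner=3072).
# Assumes the LM head is tied to the token embedding, so it adds no parameters.
n_embd, n_layer, n_inner = 768, 12, 3072
vocab_size, n_positions = 50257, 2048

embeddings = vocab_size * n_embd + n_positions * n_embd
per_block = (
    (n_embd * 3 * n_embd + 3 * n_embd)   # attention c_attn (weights + bias)
    + (n_embd * n_embd + n_embd)         # attention c_proj
    + (n_embd * n_inner + n_inner)       # mlp c_fc
    + (n_inner * n_embd + n_embd)        # mlp c_proj
    + 4 * n_embd                         # ln_1 and ln_2
)
total = embeddings + n_layer * per_block + 2 * n_embd  # plus final ln_f
print(f"approx. parameters: {total / 1e6:.1f}M")  # ~125M
```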
generation_config.json (new file, 6 lines)
@@ -0,0 +1,6 @@
{
  "_from_model_config": true,
  "bos_token_id": 50256,
  "eos_token_id": 50256,
  "transformers_version": "4.50.1"
}
merges.txt (new file, 50001 lines): file diff suppressed because it is too large
model.safetensors (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31145eacbdc57d44235953a981c90911ea23d029de2ff40122c4050c383eed7f
size 250467520

model_f16.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:732b58ed13a496101e9b37c22b2854871e7a1a98dfc69ec204072967074adadd
size 332811008

model_f32.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a6fce0ee77d1aa24a882234fe38d93c47a19334377a2a0dca0456dcb8a60d5a
size 657069824

model_instruct_fp16.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ffbf013872f422143126160d88c999ab870e3daed8a843abd3ea0f4ef5e7bf58
size 332811040

model_instruct_fp32.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4d99ded5c86ab3b55769afc3f43617ff1d87d5c803aaba526682845c6c99b9fa
size 657069856

model_instruct_q8_0.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d0c402aa4a1c1b9b8eafdc853b16df5322b1a0155fc142b3478abce77847b96a
size 180814752

model_q8_0.gguf (new file, Git LFS pointer, 3 lines)
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1094384992e54b6485fa553ab5e837a15bb5ec32b9b3e3dcb6d68b9dd2ea4c94
size 180814720
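Each of the three-line entries above is a Git LFS pointer stub: only the object hash and byte size are stored in Git, not the weights themselves. A small sketch for checking whether a locally checked-out file is the real artifact or just such a stub:

```python
# Sketch: detect a Git LFS pointer stub left behind by a clone without LFS.
from pathlib import Path

def is_lfs_pointer(path: str) -> bool:
    p = Path(path)
    # Pointer stubs are tiny text files that start with the LFS spec line.
    if not p.exists() or p.stat().st_size > 1024:
        return False
    return p.read_bytes().startswith(b"version https://git-lfs.github.com/spec/v1")

print(is_lfs_pointer("model.safetensors"))  # True means only the stub was fetched
```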
modeling_gpt3dev.py (new file, 480 lines)
@@ -0,0 +1,480 @@
import math

import torch
import torch.nn as nn
from packaging.version import Version

from transformers import (
    __version__ as TRANSFORMERS_VERSION,
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
)
from transformers.modeling_outputs import (
    CausalLMOutputWithCrossAttentions,
)
from transformers.models.gpt2.configuration_gpt2 import GPT2Config
from transformers.models.gpt2.modeling_gpt2 import (
    GPT2LMHeadModel,
    GPT2Model,
    GPT2Block,
    GPT2Attention,
    GPT2MLP,
)

IS_TRANSFORMERS_V5 = Version(TRANSFORMERS_VERSION) >= Version("5.0.0")


def _normalize_block_args(
    extra_args,
    *,
    head_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    use_cache=False,
    output_attentions=False,
):
    """Map positional extras of the block forward() onto keyword arguments.

    On transformers v5 the first positional extra is treated as
    encoder_hidden_states; on older versions the extras follow the legacy
    order (head_mask, encoder_hidden_states, encoder_attention_mask,
    use_cache, output_attentions).
    """
    if IS_TRANSFORMERS_V5:
        if extra_args and encoder_hidden_states is None:
            encoder_hidden_states = extra_args[0]
    else:
        if extra_args:
            if head_mask is None:
                head_mask = extra_args[0]
            if len(extra_args) > 1 and encoder_hidden_states is None:
                encoder_hidden_states = extra_args[1]
            if len(extra_args) > 2 and encoder_attention_mask is None:
                encoder_attention_mask = extra_args[2]
            if len(extra_args) > 3:
                use_cache = extra_args[3]
            if len(extra_args) > 4:
                output_attentions = extra_args[4]

    return (
        head_mask,
        encoder_hidden_states,
        encoder_attention_mask,
        use_cache,
        output_attentions,
    )


class GPT3DevConfig(GPT2Config):
    model_type = "gpt3dev"

    def __init__(self, use_pre_layernorm=True, window_size=256, stride=128, **kwargs):
        super().__init__(**kwargs)
        self.use_pre_layernorm = use_pre_layernorm
        self.window_size = window_size
        self.stride = stride


class GPT3DevAttention(GPT2Attention):  # dense
    """GPT-3 style dense attention: nn.Linear instead of Conv1D."""

    def __init__(self, config, is_cross_attention=False, layer_idx=None):
        super().__init__(config, is_cross_attention, layer_idx=layer_idx)
        # GPT-3 uses nn.Linear instead of Conv1D
        self.c_attn = nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=True)
        self.c_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

    # forward() inherited from GPT2Attention — no override needed


class GPT3DevSparseAttention(GPT3DevAttention):  # local sparse
    """GPT-3 style locally banded sparse attention."""

    def __init__(self, config, is_cross_attention=False, layer_idx=None):
        super().__init__(config, is_cross_attention, layer_idx=layer_idx)
        self.window_size = getattr(config, "window_size", 256)

    def forward(
        self,
        hidden_states,
        past_key_value=None,
        cache_position=None,
        attention_mask=None,
        *extra_args,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
        past_key_values=None,
        **kwargs,
    ):
        if past_key_values is not None and past_key_value is None:
            past_key_value = past_key_values

        bsz, tgt_len, _ = hidden_states.size()
        device = hidden_states.device
        dtype = hidden_states.dtype

        # Determine query/key positions using cache_position (new API)
        if cache_position is not None:
            q_pos = cache_position  # shape: (tgt_len,)
            seq_len = int(q_pos[-1].item()) + 1
        else:
            q_pos = torch.arange(tgt_len, device=device)
            seq_len = tgt_len
        k_pos = torch.arange(seq_len, device=device)

        diff = q_pos[:, None] - k_pos[None, :]  # (tgt_len, seq_len)
        is_causal = diff >= 0
        within_window = diff.abs() <= self.window_size
        allow_attention = is_causal & within_window
        del is_causal, within_window, diff

        sparse_mask = torch.zeros((1, 1, tgt_len, seq_len), dtype=dtype, device=device)
        sparse_mask.masked_fill_(~allow_attention, torch.finfo(dtype).min)
        del allow_attention

        # Combine with parent's causal mask
        if attention_mask is not None:
            # Parent may create mask with extra KV positions — trim to match
            if attention_mask.size(-1) != sparse_mask.size(-1):
                attention_mask = attention_mask[..., :sparse_mask.size(-1)]
            if attention_mask.size(-2) != sparse_mask.size(-2):
                attention_mask = attention_mask[..., :sparse_mask.size(-2), :]
            attention_mask = torch.minimum(attention_mask, sparse_mask)
        else:
            attention_mask = sparse_mask
        del sparse_mask

        forward_kwargs = dict(
            hidden_states=hidden_states,
            cache_position=cache_position,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            **kwargs,
        )
        if IS_TRANSFORMERS_V5:
            forward_kwargs["past_key_values"] = past_key_value
        else:
            forward_kwargs["past_key_value"] = past_key_value
            forward_kwargs["head_mask"] = head_mask

        return super().forward(**forward_kwargs)


class GPT3DevMLP(GPT2MLP):
    def __init__(self, intermediate_size, config):
        super().__init__(intermediate_size, config)
        self.c_fc = nn.Linear(config.hidden_size, intermediate_size, bias=True)
        self.c_proj = nn.Linear(intermediate_size, config.hidden_size, bias=True)
        self.act = nn.GELU()  # standard GeLU


class GPT3DevBlock(GPT2Block):
    """GPT-3 block with pre-LayerNorm and alternating dense/sparse attention."""

    def __init__(self, config, is_sparse: bool = False, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        self.use_pre_layernorm = config.use_pre_layernorm
        self.ln_1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        self.ln_2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)

        if is_sparse:
            self.attn = GPT3DevSparseAttention(config, layer_idx=layer_idx)
        else:
            self.attn = GPT3DevAttention(config, layer_idx=layer_idx)

        self.mlp = GPT3DevMLP(4 * config.hidden_size, config)

    def forward(
        self,
        hidden_states,
        past_key_value=None,
        cache_position=None,
        attention_mask=None,
        *extra_args,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        use_cache=False,
        output_attentions=False,
        past_key_values=None,
        **kwargs,
    ):
        if past_key_values is not None and past_key_value is None:
            past_key_value = past_key_values
        (
            head_mask,
            encoder_hidden_states,
            encoder_attention_mask,
            use_cache,
            output_attentions,
        ) = _normalize_block_args(
            extra_args,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )

        if self.use_pre_layernorm:
            # Pre-LayerNorm (GPT-3)
            residual = hidden_states
            hidden_states = self.ln_1(hidden_states)
            attn_kwargs = dict(
                hidden_states=hidden_states,
                cache_position=cache_position,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
                **kwargs,
            )
            if IS_TRANSFORMERS_V5:
                attn_kwargs["past_key_values"] = past_key_value
                attn_output, attn_weights = self.attn(**attn_kwargs)
            else:
                attn_kwargs["past_key_value"] = past_key_value
                attn_kwargs["head_mask"] = head_mask
                attn_output, attn_weights = self.attn(**attn_kwargs)
            hidden_states = residual + attn_output

            residual = hidden_states
            hidden_states = self.ln_2(hidden_states)
            feed_forward_hidden_states = self.mlp(hidden_states)
            hidden_states = residual + feed_forward_hidden_states
        else:
            # Post-LayerNorm (GPT-2)
            residual = hidden_states
            attn_kwargs = dict(
                hidden_states=hidden_states,
                cache_position=cache_position,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
                **kwargs,
            )
            if IS_TRANSFORMERS_V5:
                attn_kwargs["past_key_values"] = past_key_value
                attn_output, attn_weights = self.attn(**attn_kwargs)
            else:
                attn_kwargs["past_key_value"] = past_key_value
                attn_kwargs["head_mask"] = head_mask
                attn_output, attn_weights = self.attn(**attn_kwargs)
            hidden_states = residual + attn_output
            hidden_states = self.ln_1(hidden_states)

            residual = hidden_states
            feed_forward_hidden_states = self.mlp(hidden_states)
            hidden_states = residual + feed_forward_hidden_states
            hidden_states = self.ln_2(hidden_states)

        if IS_TRANSFORMERS_V5:
            return hidden_states

        outputs = (hidden_states,)
        if output_attentions:
            outputs += (attn_weights,)
        return outputs


class GPT3DevModel(GPT2Model):
    config_class = GPT3DevConfig

    def __init__(self, config):
        super().__init__(config)

        self.wte = nn.Embedding(config.vocab_size, config.hidden_size)
        self.wpe = nn.Embedding(config.n_positions, config.hidden_size)
        self.drop = nn.Dropout(config.embd_pdrop)
        self.h = nn.ModuleList()
        for i in range(config.num_hidden_layers):
            self.h.append(GPT3DevBlock(config, is_sparse=(i % 2 == 1), layer_idx=i))

        self.ln_f = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        self.post_init()
        # NOTE: _apply_residual_scaling is called from GPT3DevLMHeadModel.__init__
        # AFTER the final post_init(), so it is NOT undone by re-initialization.

    def _apply_residual_scaling(self):
        # GPT-3/GPT-2 modified init: scale residuals by 1 / sqrt(2 * num_layers)
        scale = 1 / math.sqrt(2 * self.config.num_hidden_layers)
        for block in self.h:
            block.attn.c_proj.weight.data.mul_(scale)
            block.mlp.c_proj.weight.data.mul_(scale)


class GPT3DevLMHeadModel(GPT2LMHeadModel):
    config_class = GPT3DevConfig

    def __init__(self, config):
        super().__init__(config)
        self.transformer = GPT3DevModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        self.post_init()
        # GPT-3 modified init: scale residual projections by 1/sqrt(2*num_layers)
        # MUST be AFTER the final post_init() which re-initializes all weights
        self.transformer._apply_residual_scaling()

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        cache_position=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        logits_to_keep=0,
        output_logits=None,  # Force returning full logits even with labels (for debugging/distillation)
        **kwargs,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        transformer_kwargs = dict(
            input_ids=input_ids,
            attention_mask=attention_mask,
            cache_position=cache_position,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            **kwargs,
        )
        if not IS_TRANSFORMERS_V5:
            transformer_kwargs["head_mask"] = head_mask
            transformer_kwargs["output_attentions"] = output_attentions
            transformer_kwargs["output_hidden_states"] = output_hidden_states
            transformer_kwargs["return_dict"] = return_dict
        transformer_kwargs["past_key_values"] = past_key_values

        transformer_outputs = self.transformer(**transformer_kwargs)

        hidden_states = (
            transformer_outputs.last_hidden_state
            if hasattr(transformer_outputs, "last_hidden_state")
            else transformer_outputs[0]
        )

        # Set up for loss computation if labels are provided
        compute_full_logits = labels is not None or output_logits or logits_to_keep == 0
        if compute_full_logits:
            logits_hidden_states = hidden_states
        else:
            slice_indices = (
                slice(-logits_to_keep, None)
                if isinstance(logits_to_keep, int)
                else logits_to_keep
            )
            logits_hidden_states = hidden_states[:, slice_indices, :]
        lm_logits = self.lm_head(logits_hidden_states.contiguous())

        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)

        if not return_dict:
            return ((loss,) if loss is not None else ()) + (lm_logits,) + transformer_outputs[1:]

        return CausalLMOutputWithCrossAttentions(
            loss=loss,
            logits=lm_logits,
            past_key_values=getattr(transformer_outputs, "past_key_values", None),
            hidden_states=getattr(transformer_outputs, "hidden_states", None),
            attentions=getattr(transformer_outputs, "attentions", None),
            cross_attentions=getattr(transformer_outputs, "cross_attentions", None),
        )


AutoConfig.register("gpt3dev", GPT3DevConfig)
AutoModel.register(GPT3DevConfig, GPT3DevModel)
AutoModelForCausalLM.register(GPT3DevConfig, GPT3DevLMHeadModel)

# ---- Transformers 5.x compatibility patch ----
_ORIG_GPT3DEV_BLOCK_FORWARD = GPT3DevBlock.forward
_ORIG_GPT3DEV_SPARSE_FORWARD = GPT3DevSparseAttention.forward


def _patched_gpt3dev_block_forward(
    self,
    hidden_states,
    past_key_values=None,
    attention_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    use_cache=False,
    **kwargs,
):
    cache_position = kwargs.pop("cache_position", None)
    output_attentions = kwargs.pop("output_attentions", False)
    head_mask = kwargs.pop("head_mask", None)
    past_key_value = kwargs.pop("past_key_value", None)
    if past_key_values is None:
        past_key_values = past_key_value

    return _ORIG_GPT3DEV_BLOCK_FORWARD(
        self,
        hidden_states,
        past_key_value=past_key_values,
        cache_position=cache_position,
        attention_mask=attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        use_cache=use_cache,
        output_attentions=output_attentions,
        **kwargs,
    )


def _patched_gpt3dev_sparse_forward(
    self,
    hidden_states,
    past_key_values=None,
    attention_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    output_attentions=False,
    **kwargs,
):
    cache_position = kwargs.pop("cache_position", None)
    head_mask = kwargs.pop("head_mask", None)
    past_key_value = kwargs.pop("past_key_value", None)
    if past_key_values is None:
        past_key_values = past_key_value

    return _ORIG_GPT3DEV_SPARSE_FORWARD(
        self,
        hidden_states,
        past_key_value=past_key_values,
        cache_position=cache_position,
        attention_mask=attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        output_attentions=output_attentions,
        **kwargs,
    )


GPT3DevBlock.forward = _patched_gpt3dev_block_forward
GPT3DevSparseAttention.forward = _patched_gpt3dev_sparse_forward
# ---- End compatibility patch ----
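To see how the classes above fit together, here is a minimal smoke test that instantiates a tiny, randomly initialized GPT3Dev model and runs one forward pass. It assumes this file is importable as `modeling_gpt3dev` and does not load the released weights.

```python
# Sketch: shape/loss smoke test for the custom architecture defined above.
import torch

from modeling_gpt3dev import GPT3DevConfig, GPT3DevLMHeadModel

config = GPT3DevConfig(
    vocab_size=50257, n_positions=128, n_embd=64, n_layer=4, n_head=4,
    window_size=16,  # small local window for the sparse (odd-indexed) blocks
)
model = GPT3DevLMHeadModel(config)

input_ids = torch.randint(0, config.vocab_size, (2, 32))
out = model(input_ids=input_ids, labels=input_ids)
print(out.logits.shape)   # torch.Size([2, 32, 50257])
print(float(out.loss))    # around ln(50257) ~ 10.8 at random init
```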
special_tokens_map.json (new file, 6 lines)
@@ -0,0 +1,6 @@
{
  "bos_token": "<|endoftext|>",
  "eos_token": "<|endoftext|>",
  "pad_token": "<|endoftext|>",
  "unk_token": "<|endoftext|>"
}
tokenizer.json (new file, 250311 lines): file diff suppressed because it is too large
tokenizer_config.json (new file, 21 lines)
@@ -0,0 +1,21 @@
{
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "50256": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": true,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<|endoftext|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "extra_special_tokens": {},
  "model_max_length": 1024,
  "pad_token": "<|endoftext|>",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|endoftext|>"
}
vocab.json (new file, 1 line): file diff suppressed because one or more lines are too long