Initialize project; model provided by the ModelHub XC community

Model: k050506koch/GPT3-dev-125m-0104
Source: Original Platform
Committed by ModelHub XC on 2026-04-21 12:50:23 +08:00
commit ceadbf8fc8
17 changed files with 300975 additions and 0 deletions

.gitattributes vendored Normal file (41 lines)

@@ -0,0 +1,41 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
model_f16.gguf filter=lfs diff=lfs merge=lfs -text
model_f32.gguf filter=lfs diff=lfs merge=lfs -text
model_q8_0.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_fp16.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_fp32.gguf filter=lfs diff=lfs merge=lfs -text
model_instruct_q8_0.gguf filter=lfs diff=lfs merge=lfs -text

README.md Normal file (51 lines)

@@ -0,0 +1,51 @@
---
license: mit
datasets:
- HuggingFaceFW/fineweb
language:
- en
pipeline_tag: text-generation
library_name: transformers
---
# GPT3
Welcome to the GPT3 repository! This project is an attempt to recreate the architecture and approach from the original OpenAI GPT-3 paper. The repository includes scripts for training, fine-tuning, and inference of a GPT-3-like model using PyTorch and the Hugging Face Transformers library.
This repository hosts the weights of dev checkpoints of my models. You can always download a checkpoint folder, paste its path into inference.py, and chat with it.
# **You can find all code on [GitHub](https://github.com/krll-corp/GPT3)**
# Note: This is a model with 125 million parameters (an attempt to replicate GPT-3 Small). It is still very undertrained.
# Note 2: This is a model checkpoint released on 12/02/2025 (batch size 12, gradient accumulation 4, 512-token sequences, 65,000 steps with the Lion optimizer). It scores 28.65% on MMLU, slightly above the 25% random-guess baseline.
# Note 3: This model already demonstrates basic text-generation ability. It's not perfect, and I will continue working on it. Expect Instruct models soon.
## Inference
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained('k050506koch/GPT3-dev-125m-0104', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('k050506koch/GPT3-dev-125m-0104')
tokenizer.pad_token_id = tokenizer.eos_token_id

inputs = tokenizer.encode("He is a doctor. His main goal is", return_tensors='pt')
outputs = model.generate(inputs, max_length=128, temperature=0.7, top_p=0.9,
                         repetition_penalty=1.2, no_repeat_ngram_size=3,
                         num_return_sequences=1, do_sample=True)
print("\n", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
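As a quick sanity check on the parameter count quoted above, you can count the loaded weights (a minimal sketch that reuses the `model` object from the snippet above):
```python
# Rough parameter count; the card quotes ~125 million parameters for this checkpoint.
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```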
# Note for instruct models: all instruct models share the same chat template
```plaintext
User: Here is a user prompt.<|endoftext|>\nAssistant: Here's an answer from model.<|endoftext|>
```
Please note that the model is not trained or tested for multi-turn chat, so it is possible but unlikely that it will stay on topic and act coherently across messages.
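For example, an instruct checkpoint could be prompted like this (a minimal sketch; `k050506koch/GPT3-dev-125m-instruct` is a placeholder name rather than a real checkpoint id, so substitute whichever instruct folder or repo you actually download):
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

repo = 'k050506koch/GPT3-dev-125m-instruct'  # placeholder id for an instruct checkpoint
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Build the prompt in the chat-template format shown above.
prompt = "User: What is the capital of France?<|endoftext|>\nAssistant:"
inputs = tokenizer.encode(prompt, return_tensors='pt')
outputs = model.generate(inputs, max_length=128, temperature=0.7, top_p=0.9, do_sample=True,
                         eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```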
## Contributing
Contributions are welcome! I'm just a student who is interested in AI, so my code may be incorrect or have logical issues. Please open an issue or submit a pull request for any improvements or bug fixes; I will be happy to review them.
## License
This project is licensed under the MIT License. See the LICENSE file for details. Everyone can use and modify this code at their discretion.
## Acknowledgements
Thanks to OpenAI, Hugging Face, and PyTorch for making this project possible!
- [OpenAI GPT-3 Paper](https://arxiv.org/abs/2005.14165)
- [Hugging Face Transformers](https://github.com/huggingface/transformers)
- [PyTorch](https://pytorch.org/)

config.json Normal file (36 lines)

@@ -0,0 +1,36 @@
{
"activation_function": "gelu",
"architectures": ["GPT3DevLMHeadModel"],
"auto_map": {
"AutoConfig": "modeling_gpt3dev.GPT3DevConfig",
"AutoModel": "modeling_gpt3dev.GPT3DevModel",
"AutoModelForCausalLM": "modeling_gpt3dev.GPT3DevLMHeadModel"
},
"attn_pdrop": 0.0,
"bos_token_id": 50256,
"embd_pdrop": 0.0,
"eos_token_id": 50256,
"initializer_range": 0.02,
"layer_norm_epsilon": 1e-05,
"model_type": "gpt3dev",
"n_ctx": 2048,
"n_embd": 768,
"n_head": 12,
"n_inner": 3072,
"n_layer": 12,
"n_positions": 2048,
"reorder_and_upcast_attn": false,
"resid_pdrop": 0.0,
"scale_attn_by_inverse_layer_idx": false,
"scale_attn_weights": true,
"summary_activation": null,
"summary_first_dropout": 0.1,
"summary_proj_to_labels": true,
"summary_type": "cls_index",
"summary_use_proj": true,
"torch_dtype": "bfloat16",
"transformers_version": "4.50.1",
"use_cache": true,
"use_pre_layernorm": true,
"vocab_size": 50257
}

generation_config.json Normal file (6 lines)

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 50256,
"eos_token_id": 50256,
"transformers_version": "4.50.1"
}

merges.txt Normal file (50001 lines)

File diff suppressed because it is too large.

model.safetensors Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:31145eacbdc57d44235953a981c90911ea23d029de2ff40122c4050c383eed7f
size 250467520

model_f16.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:732b58ed13a496101e9b37c22b2854871e7a1a98dfc69ec204072967074adadd
size 332811008

model_f32.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a6fce0ee77d1aa24a882234fe38d93c47a19334377a2a0dca0456dcb8a60d5a
size 657069824

model_instruct_fp16.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ffbf013872f422143126160d88c999ab870e3daed8a843abd3ea0f4ef5e7bf58
size 332811040

model_instruct_fp32.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4d99ded5c86ab3b55769afc3f43617ff1d87d5c803aaba526682845c6c99b9fa
size 657069856

model_instruct_q8_0.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d0c402aa4a1c1b9b8eafdc853b16df5322b1a0155fc142b3478abce77847b96a
size 180814752

model_q8_0.gguf Normal file (3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1094384992e54b6485fa553ab5e837a15bb5ec32b9b3e3dcb6d68b9dd2ea4c94
size 180814720

modeling_gpt3dev.py Normal file (480 lines)

@@ -0,0 +1,480 @@
import math
import torch
import torch.nn as nn
from packaging.version import Version
from transformers import (
    __version__ as TRANSFORMERS_VERSION,
    AutoConfig,
    AutoModel,
    AutoModelForCausalLM,
)
from transformers.modeling_outputs import (
    CausalLMOutputWithCrossAttentions,
)
from transformers.models.gpt2.configuration_gpt2 import GPT2Config
from transformers.models.gpt2.modeling_gpt2 import (
    GPT2LMHeadModel,
    GPT2Model,
    GPT2Block,
    GPT2Attention,
    GPT2MLP,
)

IS_TRANSFORMERS_V5 = Version(TRANSFORMERS_VERSION) >= Version("5.0.0")
def _normalize_block_args(
    extra_args,
    *,
    head_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    use_cache=False,
    output_attentions=False,
):
    if IS_TRANSFORMERS_V5:
        if extra_args and encoder_hidden_states is None:
            encoder_hidden_states = extra_args[0]
    else:
        if extra_args:
            if head_mask is None:
                head_mask = extra_args[0]
            if len(extra_args) > 1 and encoder_hidden_states is None:
                encoder_hidden_states = extra_args[1]
            if len(extra_args) > 2 and encoder_attention_mask is None:
                encoder_attention_mask = extra_args[2]
            if len(extra_args) > 3:
                use_cache = extra_args[3]
            if len(extra_args) > 4:
                output_attentions = extra_args[4]
    return (
        head_mask,
        encoder_hidden_states,
        encoder_attention_mask,
        use_cache,
        output_attentions,
    )


class GPT3DevConfig(GPT2Config):
    model_type = "gpt3dev"

    def __init__(self, use_pre_layernorm=True, window_size=256, stride=128, **kwargs):
        super().__init__(**kwargs)
        self.use_pre_layernorm = use_pre_layernorm
        self.window_size = window_size
        self.stride = stride
class GPT3DevAttention(GPT2Attention):  # dense
    """GPT-3 style dense attention: nn.Linear instead of Conv1D."""

    def __init__(self, config, is_cross_attention=False, layer_idx=None):
        super().__init__(config, is_cross_attention, layer_idx=layer_idx)
        # GPT-3 uses nn.Linear instead of Conv1D
        self.c_attn = nn.Linear(config.hidden_size, 3 * config.hidden_size, bias=True)
        self.c_proj = nn.Linear(config.hidden_size, config.hidden_size, bias=True)

    # forward() inherited from GPT2Attention — no override needed
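# The sparse variant below restricts each query to a local causal band:
# position i may attend only to positions j with 0 <= i - j <= window_size.
# Worked example with window_size=2 and 5 positions (1 = allowed, . = masked):
#   q\k  0 1 2 3 4
#    0   1 . . . .
#    1   1 1 . . .
#    2   1 1 1 . .
#    3   . 1 1 1 .
#    4   . . 1 1 1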
class GPT3DevSparseAttention(GPT3DevAttention):  # local sparse
    """GPT-3 style locally banded sparse attention."""

    def __init__(self, config, is_cross_attention=False, layer_idx=None):
        super().__init__(config, is_cross_attention, layer_idx=layer_idx)
        self.window_size = getattr(config, "window_size", 256)

    def forward(
        self,
        hidden_states,
        past_key_value=None,
        cache_position=None,
        attention_mask=None,
        *extra_args,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        output_attentions=False,
        past_key_values=None,
        **kwargs,
    ):
        if past_key_values is not None and past_key_value is None:
            past_key_value = past_key_values
        bsz, tgt_len, _ = hidden_states.size()
        device = hidden_states.device
        dtype = hidden_states.dtype
        # Determine query/key positions using cache_position (new API)
        if cache_position is not None:
            q_pos = cache_position  # shape: (tgt_len,)
            seq_len = int(q_pos[-1].item()) + 1
        else:
            q_pos = torch.arange(tgt_len, device=device)
            seq_len = tgt_len
        k_pos = torch.arange(seq_len, device=device)
        diff = q_pos[:, None] - k_pos[None, :]  # (tgt_len, seq_len)
        is_causal = diff >= 0
        within_window = diff.abs() <= self.window_size
        allow_attention = is_causal & within_window
        del is_causal, within_window, diff
        sparse_mask = torch.zeros((1, 1, tgt_len, seq_len), dtype=dtype, device=device)
        sparse_mask.masked_fill_(~allow_attention, torch.finfo(dtype).min)
        del allow_attention
        # Combine with parent's causal mask
        if attention_mask is not None:
            # Parent may create mask with extra KV positions — trim to match
            if attention_mask.size(-1) != sparse_mask.size(-1):
                attention_mask = attention_mask[..., :sparse_mask.size(-1)]
            if attention_mask.size(-2) != sparse_mask.size(-2):
                attention_mask = attention_mask[..., :sparse_mask.size(-2), :]
            attention_mask = torch.minimum(attention_mask, sparse_mask)
        else:
            attention_mask = sparse_mask
        del sparse_mask
        forward_kwargs = dict(
            hidden_states=hidden_states,
            cache_position=cache_position,
            attention_mask=attention_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            output_attentions=output_attentions,
            **kwargs,
        )
        if IS_TRANSFORMERS_V5:
            forward_kwargs["past_key_values"] = past_key_value
        else:
            forward_kwargs["past_key_value"] = past_key_value
            forward_kwargs["head_mask"] = head_mask
        return super().forward(**forward_kwargs)
class GPT3DevMLP(GPT2MLP):
    def __init__(self, intermediate_size, config):
        super().__init__(intermediate_size, config)
        self.c_fc = nn.Linear(config.hidden_size, intermediate_size, bias=True)
        self.c_proj = nn.Linear(intermediate_size, config.hidden_size, bias=True)
        self.act = nn.GELU()  # standard GeLU
class GPT3DevBlock(GPT2Block):
    """GPT-3 block with pre-LayerNorm and alternating dense/sparse attention."""

    def __init__(self, config, is_sparse: bool = False, layer_idx=None):
        super().__init__(config, layer_idx=layer_idx)
        self.use_pre_layernorm = config.use_pre_layernorm
        self.ln_1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        self.ln_2 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        if is_sparse:
            self.attn = GPT3DevSparseAttention(config, layer_idx=layer_idx)
        else:
            self.attn = GPT3DevAttention(config, layer_idx=layer_idx)
        self.mlp = GPT3DevMLP(4 * config.hidden_size, config)

    def forward(
        self,
        hidden_states,
        past_key_value=None,
        cache_position=None,
        attention_mask=None,
        *extra_args,
        head_mask=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        use_cache=False,
        output_attentions=False,
        past_key_values=None,
        **kwargs,
    ):
        if past_key_values is not None and past_key_value is None:
            past_key_value = past_key_values
        (
            head_mask,
            encoder_hidden_states,
            encoder_attention_mask,
            use_cache,
            output_attentions,
        ) = _normalize_block_args(
            extra_args,
            head_mask=head_mask,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        if self.use_pre_layernorm:
            # Pre-LayerNorm (GPT-3)
            residual = hidden_states
            hidden_states = self.ln_1(hidden_states)
            attn_kwargs = dict(
                hidden_states=hidden_states,
                cache_position=cache_position,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
                **kwargs,
            )
            if IS_TRANSFORMERS_V5:
                attn_kwargs["past_key_values"] = past_key_value
                attn_output, attn_weights = self.attn(**attn_kwargs)
            else:
                attn_kwargs["past_key_value"] = past_key_value
                attn_kwargs["head_mask"] = head_mask
                attn_output, attn_weights = self.attn(**attn_kwargs)
            hidden_states = residual + attn_output
            residual = hidden_states
            hidden_states = self.ln_2(hidden_states)
            feed_forward_hidden_states = self.mlp(hidden_states)
            hidden_states = residual + feed_forward_hidden_states
        else:
            # Post-LayerNorm (GPT-2)
            residual = hidden_states
            attn_kwargs = dict(
                hidden_states=hidden_states,
                cache_position=cache_position,
                attention_mask=attention_mask,
                encoder_hidden_states=encoder_hidden_states,
                encoder_attention_mask=encoder_attention_mask,
                output_attentions=output_attentions,
                **kwargs,
            )
            if IS_TRANSFORMERS_V5:
                attn_kwargs["past_key_values"] = past_key_value
                attn_output, attn_weights = self.attn(**attn_kwargs)
            else:
                attn_kwargs["past_key_value"] = past_key_value
                attn_kwargs["head_mask"] = head_mask
                attn_output, attn_weights = self.attn(**attn_kwargs)
            hidden_states = residual + attn_output
            hidden_states = self.ln_1(hidden_states)
            residual = hidden_states
            feed_forward_hidden_states = self.mlp(hidden_states)
            hidden_states = residual + feed_forward_hidden_states
            hidden_states = self.ln_2(hidden_states)
        if IS_TRANSFORMERS_V5:
            return hidden_states
        outputs = (hidden_states,)
        if output_attentions:
            outputs += (attn_weights,)
        return outputs
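# GPT3DevModel below alternates attention types across depth: even layer indices
# (0, 2, ...) use dense GPT3DevAttention, odd indices use the locally banded
# GPT3DevSparseAttention. With the 12-layer config shipped in this repo, the
# residual scaling applied by _apply_residual_scaling is 1 / sqrt(2 * 12) ≈ 0.204.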
class GPT3DevModel(GPT2Model):
    config_class = GPT3DevConfig

    def __init__(self, config):
        super().__init__(config)
        self.wte = nn.Embedding(config.vocab_size, config.hidden_size)
        self.wpe = nn.Embedding(config.n_positions, config.hidden_size)
        self.drop = nn.Dropout(config.embd_pdrop)
        self.h = nn.ModuleList()
        for i in range(config.num_hidden_layers):
            self.h.append(GPT3DevBlock(config, is_sparse=(i % 2 == 1), layer_idx=i))
        self.ln_f = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_epsilon)
        self.post_init()

    # NOTE: _apply_residual_scaling is called from GPT3DevLMHeadModel.__init__
    # AFTER the final post_init(), so it is NOT undone by re-initialization.
    def _apply_residual_scaling(self):
        # GPT-3/GPT-2 modified init: scale residuals by 1 / sqrt(2 * num_layers)
        scale = 1 / math.sqrt(2 * self.config.num_hidden_layers)
        for block in self.h:
            block.attn.c_proj.weight.data.mul_(scale)
            block.mlp.c_proj.weight.data.mul_(scale)
class GPT3DevLMHeadModel(GPT2LMHeadModel):
    config_class = GPT3DevConfig

    def __init__(self, config):
        super().__init__(config)
        self.transformer = GPT3DevModel(config)
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
        self.post_init()
        # GPT-3 modified init: scale residual projections by 1/sqrt(2*num_layers)
        # MUST be AFTER the final post_init() which re-initializes all weights
        self.transformer._apply_residual_scaling()

    def forward(
        self,
        input_ids=None,
        past_key_values=None,
        cache_position=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        encoder_hidden_states=None,
        encoder_attention_mask=None,
        labels=None,
        use_cache=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        logits_to_keep=0,
        output_logits=None,  # Force returning full logits even with labels (for debugging/distillation)
        **kwargs,
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        transformer_kwargs = dict(
            input_ids=input_ids,
            attention_mask=attention_mask,
            cache_position=cache_position,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            encoder_hidden_states=encoder_hidden_states,
            encoder_attention_mask=encoder_attention_mask,
            use_cache=use_cache,
            **kwargs,
        )
        if not IS_TRANSFORMERS_V5:
            transformer_kwargs["head_mask"] = head_mask
            transformer_kwargs["output_attentions"] = output_attentions
            transformer_kwargs["output_hidden_states"] = output_hidden_states
            transformer_kwargs["return_dict"] = return_dict
        transformer_kwargs["past_key_values"] = past_key_values
        transformer_outputs = self.transformer(**transformer_kwargs)
        hidden_states = (
            transformer_outputs.last_hidden_state
            if hasattr(transformer_outputs, "last_hidden_state")
            else transformer_outputs[0]
        )
        # Set up for loss computation if labels are provided
        compute_full_logits = labels is not None or output_logits or logits_to_keep == 0
        if compute_full_logits:
            logits_hidden_states = hidden_states
        else:
            slice_indices = (
                slice(-logits_to_keep, None)
                if isinstance(logits_to_keep, int)
                else logits_to_keep
            )
            logits_hidden_states = hidden_states[:, slice_indices, :]
        lm_logits = self.lm_head(logits_hidden_states.contiguous())
        loss = None
        if labels is not None:
            # Shift so that tokens < n predict n
            shift_logits = lm_logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            # Flatten the tokens
            loss_fct = nn.CrossEntropyLoss()
            shift_logits = shift_logits.view(-1, self.config.vocab_size)
            shift_labels = shift_labels.view(-1)
            # Enable model parallelism
            shift_labels = shift_labels.to(shift_logits.device)
            loss = loss_fct(shift_logits, shift_labels)
        if not return_dict:
            return ((loss,) if loss is not None else ()) + (lm_logits,) + transformer_outputs[1:]
        return CausalLMOutputWithCrossAttentions(
            loss=loss,
            logits=lm_logits,
            past_key_values=getattr(transformer_outputs, "past_key_values", None),
            hidden_states=getattr(transformer_outputs, "hidden_states", None),
            attentions=getattr(transformer_outputs, "attentions", None),
            cross_attentions=getattr(transformer_outputs, "cross_attentions", None),
        )
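# The register() calls below let AutoConfig/AutoModel/AutoModelForCausalLM resolve
# the custom "gpt3dev" model_type, and the wrappers that follow normalize keyword
# names so callers using either the older signature (past_key_value, head_mask) or
# the newer one (past_key_values, cache_position passed via kwargs) reach the same
# block and attention implementations.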
AutoConfig.register("gpt3dev", GPT3DevConfig)
AutoModel.register(GPT3DevConfig, GPT3DevModel)
AutoModelForCausalLM.register(GPT3DevConfig, GPT3DevLMHeadModel)

# ---- Transformers 5.x compatibility patch ----
_ORIG_GPT3DEV_BLOCK_FORWARD = GPT3DevBlock.forward
_ORIG_GPT3DEV_SPARSE_FORWARD = GPT3DevSparseAttention.forward


def _patched_gpt3dev_block_forward(
    self,
    hidden_states,
    past_key_values=None,
    attention_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    use_cache=False,
    **kwargs,
):
    cache_position = kwargs.pop("cache_position", None)
    output_attentions = kwargs.pop("output_attentions", False)
    head_mask = kwargs.pop("head_mask", None)
    past_key_value = kwargs.pop("past_key_value", None)
    if past_key_values is None:
        past_key_values = past_key_value
    return _ORIG_GPT3DEV_BLOCK_FORWARD(
        self,
        hidden_states,
        past_key_value=past_key_values,
        cache_position=cache_position,
        attention_mask=attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        use_cache=use_cache,
        output_attentions=output_attentions,
        **kwargs,
    )


def _patched_gpt3dev_sparse_forward(
    self,
    hidden_states,
    past_key_values=None,
    attention_mask=None,
    encoder_hidden_states=None,
    encoder_attention_mask=None,
    output_attentions=False,
    **kwargs,
):
    cache_position = kwargs.pop("cache_position", None)
    head_mask = kwargs.pop("head_mask", None)
    past_key_value = kwargs.pop("past_key_value", None)
    if past_key_values is None:
        past_key_values = past_key_value
    return _ORIG_GPT3DEV_SPARSE_FORWARD(
        self,
        hidden_states,
        past_key_value=past_key_values,
        cache_position=cache_position,
        attention_mask=attention_mask,
        head_mask=head_mask,
        encoder_hidden_states=encoder_hidden_states,
        encoder_attention_mask=encoder_attention_mask,
        output_attentions=output_attentions,
        **kwargs,
    )


GPT3DevBlock.forward = _patched_gpt3dev_block_forward
GPT3DevSparseAttention.forward = _patched_gpt3dev_sparse_forward
# ---- End compatibility patch ----
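# When this checkpoint is loaded with trust_remote_code=True, the auto_map entries
# in config.json point Transformers at this module, so the classes above are used
# in place of the stock GPT-2 implementation.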

special_tokens_map.json Normal file (6 lines)

@@ -0,0 +1,6 @@
{
"bos_token": "<|endoftext|>",
"eos_token": "<|endoftext|>",
"pad_token": "<|endoftext|>",
"unk_token": "<|endoftext|>"
}

tokenizer.json Normal file (250311 lines)

File diff suppressed because it is too large.

tokenizer_config.json Normal file (21 lines)

@@ -0,0 +1,21 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"50256": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": true,
"rstrip": false,
"single_word": false,
"special": true
}
},
"bos_token": "<|endoftext|>",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"extra_special_tokens": {},
"model_max_length": 1024,
"pad_token": "<|endoftext|>",
"tokenizer_class": "GPT2Tokenizer",
"unk_token": "<|endoftext|>"
}

vocab.json Normal file (1 line)

File diff suppressed because one or more lines are too long.