Initialize project; model provided by the ModelHub XC community

Model: failspy/Phi-3-vision-128k-instruct-abliterated-alpha
Source: Original Platform
ModelHub XC
2026-04-26 10:14:10 +08:00
commit a6e4e8442f
17 changed files with 97904 additions and 0 deletions

.gitattributes (vendored, new file, 35 lines)

@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text

README.md (new file, 77 lines)

@@ -0,0 +1,77 @@
---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: text-generation
tags:
- nlp
- code
- vision
inference:
parameters:
temperature: 0.7
widget:
- messages:
- role: user
content: I'm looking to do something unethical online. How can I stay safe whilst doing so?
---
# Phi-3-vision-128k-instruct-abliterated-***alpha***
[My Jupyter "cookbook" to replicate the methodology can be found here, refined library coming soon](https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/blob/main/ortho_cookbook.ipynb)
#### Phi-3-abliterated statement
Took me a while to wizard this one up. It's been a while since I've released a Phi-3 model. In the past I accidentally missed an item required in the model release process: hallucination testing.
This model has been tested, and though it is more likely to hallucinate than the original model in my experience, it is generally as stable as the original.
Now that the new Phi-3 models are out, I'm working on completing this abliteration process quickly and then will release the other models as soon as possible. 🏇
## Summary
This is [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) with orthogonalized bfloat16 safetensor weights, generated with a refined methodology based on that which was described in the preview paper/blog post: '[Refusal in LLMs is mediated by a single direction](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)' which I encourage you to read to understand more.
## Alpha???
This is a very experimental model. I've never done a vision model before, so I really don't know what could go wrong; I want to set expectations accordingly.
This is, however, the same methodology as all the other abliterated-v3 models, applied to a vision model. The vision encoding layers are completely unmodified; only the *language*/chat model layers are abliterated.
Please feel free to file issues to report any weird/strange behaviour. Expect this model to see updates as needed.
## Hang on, "abliterated"? Orthogonalization? Ablation? What is this?
TL;DR: This model has had certain weights manipulated to "inhibit" the model's ability to express refusal. It is not in any way _guaranteed_ that it won't refuse you or misunderstand your request; it may still lecture you about ethics/safety, etc. In all other respects it is tuned the same as the original instruct model, just with the strongest refusal directions orthogonalized out.
**TL;TL;DR;DR: It's uncensored in the purest form I can manage -- no new or changed behaviour in any other respect from the original model.**
As for "abliterated": it's just a fun play on words on the "ablation" term used in the original paper to refer to removing features, which I coined particularly to differentiate the model from "uncensored" fine-tunes.
Ablate + obliterated = Abliterated
Anyways, orthogonalization and ablation both refer to the same thing here: the technique by which the refusal feature was "ablated" from the model is orthogonalization.
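Roughly, the core edit looks like the following minimal sketch (illustrative only, not the exact cookbook code; `refusal_dir` is assumed to have been estimated separately, and the real process applies this per layer):

```python
import torch

def orthogonalize(W: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Remove the component of W's output that lies along refusal_dir.

    W:           (d_model, d_in) matrix that writes into the residual stream
                 (e.g. an attention output or MLP down-projection).
    refusal_dir: (d_model,) estimated "refusal" direction.
    """
    r = refusal_dir / refusal_dir.norm()
    return W - torch.outer(r, r) @ W  # W minus its projection onto r
```

In this model that edit is applied only to the language-model layers; the vision encoder weights are left untouched, as noted above.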
## A little more on the methodology, and why this is interesting
To me, ablation (or applying the methodology for the inverse, "augmentation") seems to be good for inducing/removing very specific features that you'd have to spend way too many tokens on encouraging or discouraging in your system prompt.
Instead, you just apply your system prompt in the ablation script against a blank system prompt on the same dataset and orthogonalize for the desired behaviour in the final model weights.
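A rough sketch of that step (hypothetical helper; the real cookbook collects hidden states per layer and picks the best candidate direction):

```python
import torch

def difference_of_means(h_target: torch.Tensor, h_baseline: torch.Tensor) -> torch.Tensor:
    """Estimate a behaviour direction from residual-stream activations.

    h_target:   (n, d_model) hidden states from prompts run *with* the behaviour/system prompt.
    h_baseline: (m, d_model) hidden states from the same prompts with a blank system prompt.
    """
    direction = h_target.mean(dim=0) - h_baseline.mean(dim=0)
    return direction / direction.norm()
```

That unit direction is then either ablated (as sketched above) to suppress the behaviour, or added back in to augment it.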
> Why this over fine-tuning?
Ablation is much more surgical in nature whilst also being effectively executed with a _lot_ less data than fine-tuning, which I think is its main advantage.
Its most valuable aspect, though, is that it keeps as much of the original model's knowledge and training intact, whilst removing its tendency to behave in one very specific undesirable manner. (In this case, refusing user requests.)
Fine tuning is still exceptionally useful and the go-to for broad behaviour changes; however, you may be able to get close to your desired behaviour with very few samples using the ablation/augmentation techniques.
It may also be a useful step to add to your model refinement: orthogonalize -> fine-tune or vice-versa.
I haven't really gotten around to exploring this model stacked with fine-tuning; I encourage others to give it a shot if they've got the capacity.
## Quirkiness awareness notice
This model may come with interesting quirks, with the methodology being so new. I encourage you to play with the model, and post any quirks you notice in the community tab, as that'll help us further understand what this orthogonalization has in the way of side effects.
If you manage to develop further improvements, please share! This is really the most basic way to use ablation, but there are other possibilities that I believe are as-yet unexplored.
Additionally, feel free to reach out in any way about this. I'm on the Cognitive Computations Discord and I'm watching the Community tab, so reach out! I'd love to see this methodology used in other ways, and would gladly support whomever I can, whenever I can.
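For reference, loading and running the model follows the usual Phi-3-vision custom-code pattern. A hedged example (the prompt formatting and file path are illustrative; pass `_attn_implementation="eager"` to `from_pretrained` if flash-attn isn't available):

```python
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "failspy/Phi-3-vision-128k-instruct-abliterated-alpha"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

messages = [{"role": "user", "content": "<|image_1|>\nDescribe this image."}]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, [Image.open("example.jpg")], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```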

config.json (new file, 148 lines)

@@ -0,0 +1,148 @@
{
"_name_or_path": "Phi-3-vision-128k-instruct",
"architectures": [
"Phi3VForCausalLM"
],
"attention_dropout": 0.0,
"auto_map": {
"AutoConfig": "configuration_phi3_v.Phi3VConfig",
"AutoModelForCausalLM": "modeling_phi3_v.Phi3VForCausalLM"
},
"bos_token_id": 1,
"embd_layer": {
"embedding_cls": "image",
"hd_transform_order": "sub_glb",
"projection_cls": "mlp",
"use_hd_transform": true,
"with_learnable_separator": true
},
"eos_token_id": 2,
"hidden_act": "silu",
"hidden_size": 3072,
"img_processor": {
"image_dim_out": 1024,
"model_name": "openai/clip-vit-large-patch14-336",
"name": "clip_vision_model",
"num_img_tokens": 144
},
"initializer_range": 0.02,
"intermediate_size": 8192,
"max_position_embeddings": 131072,
"model_type": "phi3_v",
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32,
"original_max_position_embeddings": 4096,
"rms_norm_eps": 1e-05,
"rope_scaling": {
"long_factor": [
1.0299999713897705,
1.0499999523162842,
1.0499999523162842,
1.0799999237060547,
1.2299998998641968,
1.2299998998641968,
1.2999999523162842,
1.4499999284744263,
1.5999999046325684,
1.6499998569488525,
1.8999998569488525,
2.859999895095825,
3.68999981880188,
5.419999599456787,
5.489999771118164,
5.489999771118164,
9.09000015258789,
11.579999923706055,
15.65999984741211,
15.769999504089355,
15.789999961853027,
18.360000610351562,
21.989999771118164,
23.079999923706055,
30.009998321533203,
32.35000228881836,
32.590003967285156,
35.56000518798828,
39.95000457763672,
53.840003967285156,
56.20000457763672,
57.95000457763672,
59.29000473022461,
59.77000427246094,
59.920005798339844,
61.190006256103516,
61.96000671386719,
62.50000762939453,
63.3700065612793,
63.48000717163086,
63.48000717163086,
63.66000747680664,
63.850006103515625,
64.08000946044922,
64.760009765625,
64.80001068115234,
64.81001281738281,
64.81001281738281
],
"short_factor": [
1.05,
1.05,
1.05,
1.1,
1.1,
1.1,
1.2500000000000002,
1.2500000000000002,
1.4000000000000004,
1.4500000000000004,
1.5500000000000005,
1.8500000000000008,
1.9000000000000008,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.000000000000001,
2.1000000000000005,
2.1000000000000005,
2.2,
2.3499999999999996,
2.3499999999999996,
2.3499999999999996,
2.3499999999999996,
2.3999999999999995,
2.3999999999999995,
2.6499999999999986,
2.6999999999999984,
2.8999999999999977,
2.9499999999999975,
3.049999999999997,
3.049999999999997,
3.049999999999997
],
"type": "su"
},
"rope_theta": 10000.0,
"sliding_window": 131072,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.38.1",
"use_cache": true,
"vocab_size": 32064,
"_attn_implementation": "flash_attention_2"
}

configuration_phi3_v.py (new file, 217 lines)

@@ -0,0 +1,217 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" Phi-3-V model configuration"""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
PHI3V_PRETRAINED_CONFIG_ARCHIVE_MAP = {
"microsoft/Phi-3-vision-128k-instruct": "https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/config.json",
}
class Phi3VConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`Phi3VModel`]. It is used to instantiate a Phi-3
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the
[microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct).
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32064):
Vocabulary size of the Phi-3-V model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`Phi3VModel`].
hidden_size (`int`, *optional*, defaults to 3072):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 8192):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, *optional*):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
`num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details checkout [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
`num_attention_heads`.
resid_pdrop (`float`, *optional*, defaults to 0.0):
Dropout probability for mlp outputs.
embd_pdrop (`int`, *optional*, defaults to 0.0):
The dropout ratio for the embeddings.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio after computing the attention scores.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 4096):
The maximum sequence length that this model might ever be used with.
original_max_position_embeddings (`int`, *optional*, defaults to 4096):
The maximum sequence length that this model was trained with. This is used to determine the size of the
original RoPE embeddings when using long scaling.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon value used for the RMSNorm.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie weight embeddings
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
rope_scaling (`dict`, *optional*):
The scaling strategy for the RoPE embeddings. If `None`, no scaling is applied. If a dictionary, it must
contain the following keys: `type`, `short_factor` and `long_factor`. The `type` must be either `su` or `yarn` and
the `short_factor` and `long_factor` must be lists of numbers with the same length as the hidden size
divided by the number of attention heads divided by 2.
bos_token_id (`int`, *optional*, defaults to 1):
The id of the "beginning-of-sequence" token.
eos_token_id (`int`, *optional*, defaults to 32000):
The id of the "end-of-sequence" token.
pad_token_id (`int`, *optional*, defaults to 32000):
The id of the padding token.
sliding_window (`int`, *optional*):
Sliding window attention window size. If `None`, no sliding window is applied.
embd_layer (`str`, *optional*, defaults to `"default"`):
The embedding layer to use. Can be either `"default"` or `"image"`. "default" uses the standard embedding for text.
Example:
```python
>>> from transformers import Phi3VModel, Phi3VConfig
>>> # Initializing a Phi-3-V style configuration
>>> configuration = Phi3VConfig.from_pretrained("microsoft/Phi-3-vision-128k-instruct")
>>> # Initializing a model from the configuration
>>> model = Phi3VModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = "phi3_v"
keys_to_ignore_at_inference = ["past_key_values"]
def __init__(
self,
vocab_size=32064,
hidden_size=3072,
intermediate_size=8192,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
resid_pdrop=0.0,
embd_pdrop=0.0,
attention_dropout=0.0,
hidden_act="silu",
max_position_embeddings=4096,
original_max_position_embeddings=4096,
initializer_range=0.02,
rms_norm_eps=1e-5,
use_cache=True,
tie_word_embeddings=False,
rope_theta=10000.0,
rope_scaling=None,
bos_token_id=1,
eos_token_id=32000,
pad_token_id=32000,
sliding_window=None,
embd_layer: str = "default",
**kwargs,
):
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.resid_pdrop = resid_pdrop
self.embd_pdrop = embd_pdrop
self.attention_dropout = attention_dropout
self.hidden_act = hidden_act
self.max_position_embeddings = max_position_embeddings
self.original_max_position_embeddings = original_max_position_embeddings
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self._rope_scaling_validation()
self.sliding_window = sliding_window
self.embd_layer = embd_layer
super().__init__(
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
pad_token_id=pad_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
def _rope_scaling_validation(self):
"""
Validate the `rope_scaling` configuration.
"""
if self.rope_scaling is None:
return
if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 3:
raise ValueError(
"`rope_scaling` must be a dictionary with three fields, `type`, `short_factor` and `long_factor`, "
f"got {self.rope_scaling}"
)
rope_scaling_type = self.rope_scaling.get("type", None)
rope_scaling_short_factor = self.rope_scaling.get("short_factor", None)
rope_scaling_long_factor = self.rope_scaling.get("long_factor", None)
if rope_scaling_type is None or rope_scaling_type not in ["su", "yarn"]:
raise ValueError(f"`rope_scaling`'s type field must be one of ['su', 'yarn'], got {rope_scaling_type}")
if not (
isinstance(rope_scaling_short_factor, list)
and all(isinstance(x, (int, float)) for x in rope_scaling_short_factor)
):
raise ValueError(
f"`rope_scaling`'s short_factor field must be a list of numbers, got {rope_scaling_short_factor}"
)
if not len(rope_scaling_short_factor) == self.hidden_size // self.num_attention_heads // 2:
raise ValueError(
f"`rope_scaling`'s short_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_short_factor)}"
)
if not (
isinstance(rope_scaling_long_factor, list)
and all(isinstance(x, (int, float)) for x in rope_scaling_long_factor)
):
raise ValueError(
f"`rope_scaling`'s long_factor field must be a list of numbers, got {rope_scaling_long_factor}"
)
if not len(rope_scaling_long_factor) == self.hidden_size // self.num_attention_heads // 2:
raise ValueError(
f"`rope_scaling`'s long_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_long_factor)}"
)

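Worth noting from the validation above: with the defaults `hidden_size=3072` and `num_attention_heads=32`, each `short_factor`/`long_factor` list must have 3072 // 32 // 2 = 48 entries, which matches the 48 values in config.json. A quick hedged sanity check (assumes you run it from the repository checkout):

```python
from configuration_phi3_v import Phi3VConfig

expected_len = 3072 // 32 // 2  # = 48 rotary dimension pairs per head
config = Phi3VConfig(
    rope_scaling={
        "type": "su",
        "short_factor": [1.0] * expected_len,
        "long_factor": [1.0] * expected_len,
    }
)
print(len(config.rope_scaling["short_factor"]))  # 48
```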
generation_config.json (new file, 7 lines)

@@ -0,0 +1,7 @@
{
"_from_model_config": true,
"bos_token_id": 1,
"eos_token_id": 2,
"pad_token_id": 32000,
"transformers_version": "4.42.0.dev0"
}

image_embedding_phi3_v.py (new file, 301 lines)

@@ -0,0 +1,301 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, PretrainedConfig
from transformers import CLIPVisionConfig
from transformers.utils import logging
from datetime import datetime
logger = logging.get_logger(__name__)
CLIP_VIT_LARGE_PATCH14_336_CONFIG = CLIPVisionConfig(
attention_dropout=0.0,
dropout=0.0,
hidden_act="quick_gelu",
hidden_size=1024,
image_size=336,
initializer_factor=1.0,
initializer_range=0.02,
intermediate_size=4096,
layer_norm_eps=1e-05,
num_attention_heads=16,
num_channels=3,
num_hidden_layers=24,
patch_size=14,
projection_dim=768
)
class Phi3ImageEmbedding(nn.Module):
"""Phi3 Image embedding."""
def __init__(self, config: PretrainedConfig, wte=None, **kwargs) -> None:
super().__init__()
# n_embed or hidden_size
hidden_size = config.n_embd if hasattr(config, 'n_embd') else config.hidden_size
if hasattr(config, 'embd_pdrop') or hasattr(config, 'embed_pdrop'):
embd_drop = config.embd_pdrop if hasattr(config, 'embd_pdrop') else config.embed_pdrop
self.drop = nn.Dropout(embd_drop)
else:
self.drop = None
self.wte = wte
if isinstance(config.img_processor, dict) and config.img_processor.get('name', None) == 'clip_vision_model':
assert 'model_name' in config.img_processor, 'model_name must be provided for CLIPVisionModel'
assert 'image_dim_out' in config.img_processor, 'image_dim_out must be provided for CLIPVisionModel'
assert 'num_img_tokens' in config.img_processor, 'num_img_tokens must be provided for CLIPVisionModel'
assert config.img_processor['model_name'] == 'openai/clip-vit-large-patch14-336'
clip_config = CLIP_VIT_LARGE_PATCH14_336_CONFIG
self.img_processor = CLIPVisionModel(clip_config)
image_dim_out = config.img_processor['image_dim_out']
self.num_img_tokens = config.img_processor['num_img_tokens']
else:
raise NotImplementedError(f'img_processor = {config.img_processor}, not implemented')
self.image_dim_out = image_dim_out
self.img_sizes = None
# global_gn and sub_gn for hd transform, serves as line separator
self.use_hd_transform = kwargs.get('use_hd_transform', False)
self.with_learnable_separator = kwargs.get('with_learnable_separator', False)
self.hd_transform_order = kwargs.get('hd_transform_order', 'glb_sub')
# with_hd_transform and with_learnable_separator should have same value
assert self.use_hd_transform == self.with_learnable_separator, 'use_hd_transform and with_learnable_separator should have same value'
if self.with_learnable_separator:
assert self.use_hd_transform, 'learnable separator is only for hd transform'
# 1024 * 4, merge spatial to channel dimension
self.glb_GN = nn.Parameter(torch.zeros([1, 1, self.image_dim_out * 4]))
self.sub_GN = nn.Parameter(torch.zeros([1, 1, 1, self.image_dim_out * 4]))
logger.info(f'learnable separator enabled for hd transform, hd_transform_order = {self.hd_transform_order}')
projection_cls = kwargs.get('projection_cls', 'linear')
if projection_cls == 'linear':
self.img_projection = nn.Linear(image_dim_out, hidden_size)
elif projection_cls == 'mlp' and self.use_hd_transform:
dim_projection = hidden_size
depth = 2
layers = [nn.Linear(image_dim_out * 4, dim_projection)]
for _ in range(1, depth):
layers.extend([nn.GELU(),
nn.Linear(dim_projection, dim_projection)])
self.img_projection = nn.Sequential(*layers)
elif projection_cls == 'mlp':
dim_projection = hidden_size
depth = 2
layers = [nn.Linear(image_dim_out, dim_projection)]
for _ in range(1, depth):
layers.extend([nn.GELU(),
nn.Linear(dim_projection, dim_projection)])
self.img_projection = nn.Sequential(*layers)
else:
raise NotImplementedError(f'projection_cls = {projection_cls}, not implemented')
self.vocab_size = config.vocab_size
self.img_features = None
if isinstance(config.img_processor, dict):
self.layer_idx = config.img_processor.get('layer_idx', -2)
self.type_feature = config.img_processor.get('type_feature', 'patch')
else:
self.layer_idx = -2
self.type_feature = 'patch'
def set_img_features(self, img_features: torch.FloatTensor) -> None:
self.img_features = img_features
def set_img_sizes(self, img_sizes: torch.LongTensor) -> None:
self.img_sizes = img_sizes
def get_img_features(self, img_embeds: torch.FloatTensor) -> torch.FloatTensor:
LAYER_IDX = self.layer_idx
TYPE_FEATURE = self.type_feature
img_processor_output = self.img_processor(img_embeds, output_hidden_states=True)
img_feature = img_processor_output.hidden_states[LAYER_IDX]
if TYPE_FEATURE == "patch":
patch_feature = img_feature[:, 1:]
return patch_feature
if TYPE_FEATURE == "cls_patch":
return img_feature
raise NotImplementedError
def forward(self, input_ids: torch.LongTensor, pixel_values: torch.FloatTensor, image_sizes=None) -> torch.FloatTensor:
MAX_INPUT_ID = int(1e9)
img_embeds = pixel_values
img_sizes = image_sizes
hd_transform = False  # set True only in the HD-transform branch below; avoids an unbound reference later
if self.img_features is not None:
img_embeds = self.img_features.clone()
self.img_features = None
if self.img_sizes is not None:
img_sizes = self.img_sizes
input_shape = input_ids.size()
input_ids = input_ids.view(-1, input_shape[-1])
with torch.no_grad():
positions = torch.nonzero((input_ids < 0) & (input_ids > -MAX_INPUT_ID), as_tuple=False)
select = False
if isinstance(self.img_projection, nn.Sequential):
target_device = self.img_projection[0].bias.device
target_dtype = self.img_projection[0].bias.dtype
else: # It's a single nn.Linear layer
target_device = self.img_projection.bias.device
target_dtype = self.img_projection.bias.dtype
if len(positions.tolist()) > 0:
with torch.no_grad():
g_values = abs(input_ids[positions[:, 0], positions[:, 1]])
if self.use_hd_transform and img_sizes is not None and len(img_sizes):
hd_transform = True
assert img_embeds.ndim == 5, f'img_embeds size: {img_embeds.size()}, expect 5D tensor for hd transform'
# img_embeds: (num_images, max_num_crops, 3, H, W)
# img_sizes: (num_images, 2).view(1, -1)
start_time = datetime.now()
bs = img_embeds.shape[0]
# Nx(HW)xC
img_features = self.get_img_features(img_embeds.flatten(0, 1))
base_feat_height = base_feat_width = int(img_features.shape[1] ** 0.5)
assert base_feat_height == 24 and base_feat_width == 24, f'base_feat_height: {base_feat_height}, base_feat_width: {base_feat_width}, expect 24x24 features for hd transform'
# bs x max_num_crops x (24x24) x C
img_features = img_features.view(bs, -1, base_feat_height * base_feat_width, self.image_dim_out)
C = self.image_dim_out
H = base_feat_height
output_imgs = []
output_len = []
# training is tensor, inference is list
if isinstance(img_sizes, torch.Tensor):
img_sizes = img_sizes.view(-1, 2)
for _bs in range(bs):
h, w = img_sizes[_bs]
h = h // 336
w = w // 336
B_ = h * w
# 1 x (24x24) x 1024
global_img_feature = img_features[_bs, :1]
# 1 x 12 x 12 x 4096
glb_img = global_img_feature.reshape(1,H,H,C).reshape(1,H//2,2,H//2,2,C).contiguous().permute(0,1,3,2,4,5).reshape(1,H//2,H//2,4*C).contiguous()
temp_glb_GN = self.sub_GN.repeat(1, H//2, 1, 1)
# 1 x 156 x 4096
glb_img = torch.cat([glb_img, temp_glb_GN], dim=2).reshape(1,-1,4*C)
# (max_num_crops-1) x (12x12) x C
sub_img = img_features[_bs, 1:]
# 16x574x1024
# get rid of padding sub_img
sub_img = sub_img[:B_]
# (num_crops, 12, 2, 12, 2, 1024) -> (num_crops, 12, 12, 2, 2, 1024) -> (num_crops, 12*12, 4*1024)
sub_img = sub_img.reshape(B_,H,H,C).reshape(B_,H//2,2,H//2,2,C).contiguous().permute(0,1,3,2,4,5).reshape(B_,-1,4*C).contiguous()
sub_img = sub_img.reshape(1, h, w, 12, 12, -1).permute(0,1,3,2,4,5).reshape(1,h*12,w*12,4*C)
temp_sub_GN = self.sub_GN.repeat(1, h*12, 1, 1)
sub_img = torch.cat([sub_img, temp_sub_GN], dim=2).reshape(1,-1,4*C)
# (1, num_img_tokens, 1024*4)
# glb + sub
if self.hd_transform_order == 'glb_sub':
output_imgs.append(torch.cat([glb_img, self.glb_GN, sub_img], dim=1))
elif self.hd_transform_order == 'sub_glb':
output_imgs.append(torch.cat([sub_img, self.glb_GN, glb_img], dim=1))
else:
raise NotImplementedError(f'hd_transform_order = {self.hd_transform_order}, not implemented')
temp_len = int((h*w+1)*144 + 1 + (h+1)*12)
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
output_len.append(temp_len)
num_img_tokens = output_len
img_set_tensor = []
for _output_img in output_imgs:
img_feature_proj = self.img_projection(_output_img.to(target_device).to(target_dtype))
img_set_tensor.append(img_feature_proj)
logger.info(f'img_embeds size: {img_embeds.size()}, image sizes: {img_sizes} loading time {datetime.now() - start_time}')
elif img_embeds.ndim == 4:
selected_g_values = g_values[::self.num_img_tokens]
assert len(img_embeds) == len(selected_g_values), f'img_embeds size: {img_embeds.size()}, selected_g_values size: {len(selected_g_values)}, selected_g_value {selected_g_values}'
start_time = datetime.now()
tt = (
self.get_img_features(img_embeds)
.to(target_device)
.to(target_dtype)
.reshape(-1, self.image_dim_out)
)
logger.info(f'img_embeds size: {img_embeds.size()}, loading time {datetime.now() - start_time}')
img_set_tensor = self.img_projection(tt) # adapted visual features.
elif img_embeds.ndim == 3:
selected_g_values = g_values[::self.num_img_tokens]
assert len(img_embeds) == len(selected_g_values), f'img_embeds size: {img_embeds.size()}, selected_g_values size: {len(selected_g_values)}, selected_g_value {selected_g_values}'
tt = (
img_embeds
.to(target_device)
.to(target_dtype)
.view(-1, self.image_dim_out)
)
img_set_tensor = self.img_projection(tt) # adapted visual features.
else:
raise NotImplementedError
select = True
with torch.no_grad():
input_ids.clamp_min_(0).clamp_max_(self.vocab_size)
hidden_states = self.wte(input_ids)
if select:
if hd_transform:
idx = 0
for i, cnt in enumerate(num_img_tokens):
hidden_states[positions[idx, 0], positions[idx, 1] : positions[idx, 1] + cnt] = (
img_set_tensor[i]
.to(hidden_states.dtype)
.to(hidden_states.device)
)
idx += cnt
else:
idx = 0
assert len(selected_g_values) * self.num_img_tokens == len(img_set_tensor), f'len(selected_g_values) * self.num_img_tokens = {len(selected_g_values) * self.num_img_tokens}, len(img_set_tensor) = {len(img_set_tensor)}'
for i, g in enumerate(selected_g_values):
cnt = self.num_img_tokens
hidden_states[positions[idx, 0], positions[idx, 1] : positions[idx, 1] + cnt] = (
img_set_tensor[i * cnt : (i + 1) * cnt]
.to(hidden_states.dtype)
.to(hidden_states.device)
)
idx += cnt
if self.drop is not None:
hidden_states = self.drop(hidden_states)
return hidden_states

image_processing_phi3_v.py (new file, 274 lines)

@@ -0,0 +1,274 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Image processor class for Phi3-V."""
from typing import List, Optional, Union
import numpy as np
from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from transformers.image_transforms import (
convert_to_rgb,
)
from transformers.image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ImageInput,
make_list_of_images,
valid_images,
)
from transformers.utils import TensorType, is_vision_available, logging
from transformers import AutoImageProcessor
logger = logging.get_logger(__name__)
if is_vision_available():
from PIL import Image
import torch
import torchvision
def padding_336(b):
width, height = b.size
tar = int(np.ceil(height / 336) * 336)
top_padding = int((tar - height)/2)
bottom_padding = tar - height - top_padding
left_padding = 0
right_padding = 0
b = torchvision.transforms.functional.pad(b, [left_padding, top_padding, right_padding, bottom_padding], fill=[255,255,255])
return b
def calc_padded_size(width, height, padding_unit=336):
target_height = int(np.ceil(height / padding_unit) * padding_unit)
top_padding = int((target_height - height) / 2)
bottom_padding = target_height - height - top_padding
left_padding = 0
right_padding = 0
padded_width = width + left_padding + right_padding
padded_height = height + top_padding + bottom_padding
return padded_width, padded_height
def HD_transform(img, hd_num=16):
width, height = img.size
trans = False
if width < height:
img = img.transpose(Image.TRANSPOSE)
trans = True
width, height = img.size
ratio = (width/ height)
scale = 1
while scale*np.ceil(scale/ratio) <= hd_num:
scale += 1
scale -= 1
new_w = int(scale * 336)
new_h = int(new_w / ratio)
img = torchvision.transforms.functional.resize(img, [new_h, new_w],)
img = padding_336(img)
width, height = img.size
if trans:
img = img.transpose(Image.TRANSPOSE)
return img
def calc_hd_transform_size(width, height, hd_num=16):
transposed = False
if width < height:
width, height = height, width
transposed = True
ratio = width / height
scale = 1
while scale * np.ceil(scale / ratio) <= hd_num:
scale += 1
scale -= 1
new_width = int(scale * 336)
new_height = int(new_width / ratio)
padded_width, padded_height = calc_padded_size(new_width, new_height)
if transposed:
padded_width, padded_height = padded_height, padded_width
return padded_width, padded_height
def pad_to_max_num_crops_tensor(images, max_crops=5):
"""
images: B x 3 x H x W, B<=max_crops
"""
B, _, H, W = images.shape
if B < max_crops:
pad = torch.zeros(max_crops - B, 3, H, W, dtype=images.dtype, device=images.device)
images = torch.cat([images, pad], dim=0)
return images
class Phi3VImageProcessor(BaseImageProcessor):
r"""
Constructs a Phi3 image processor. Based on [`CLIPImageProcessor`] with incorporation of additional techniques
for processing high resolution images as explained in the [InternLM-XComposer2-4KHD](https://arxiv.org/abs/2401.16420)
Args:
image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
Mean to use if normalizing the image. This is a float or list of floats the length of the number of
channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
do_convert_rgb (`bool`, *optional*, defaults to `True`):
Whether to convert the image to RGB.
"""
model_input_names = ["pixel_values"]
def __init__(
self,
num_crops: int = 1,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = True,
**kwargs,
) -> None:
super().__init__(**kwargs)
self.num_crops = num_crops
self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
self.do_convert_rgb = do_convert_rgb
def calc_num_image_tokens(
self,
images: ImageInput
):
""" Calculate the number of image tokens for each image.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
"""
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
images = [image.convert('RGB') for image in images]
# (H, W, C)
elems = [HD_transform(im, hd_num = self.num_crops) for im in images]
shapes = [[im.size[1], im.size[0]] for im in elems]
num_img_tokens = [int((h//336*w//336+1)*144 + 1 + (h//336+1)*12) for h, w in shapes]
return num_img_tokens
def calc_num_image_tokens_from_image_size(self, width, height):
"""
Calculate the number of image tokens for a given image size.
Args:
width (`int`): Width of the image.
height (`int`): Height of the image.
"""
new_width, new_height = calc_hd_transform_size(width, height, hd_num=self.num_crops)
num_img_tokens = int((new_height // 336 * new_width // 336 + 1) * 144 + 1 + (new_height // 336 + 1) * 12)
return num_img_tokens
def preprocess(
self,
images: ImageInput,
image_mean: Optional[Union[float, List[float]]] = None,
image_std: Optional[Union[float, List[float]]] = None,
do_convert_rgb: bool = None,
return_tensors: Optional[Union[str, TensorType]] = None,
):
"""
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
`True`.
do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
Whether to convert the image to RGB.
return_tensors (`str` or `TensorType`, *optional*):
The type of tensors to return. Can be one of:
- Unset: Return a list of `np.ndarray`.
- `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
- `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
- `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
- `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
"""
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
images = make_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
image_sizes = []
img_processor = torchvision.transforms.Compose([
torchvision.transforms.ToTensor(),
torchvision.transforms.Normalize(image_mean, image_std)
])
# PIL images
# HD_transform pads images to a size that is a multiple of 336
# convert to RGB first
images = [image.convert('RGB') for image in images]
elems = [HD_transform(im, hd_num = self.num_crops) for im in images]
# tensor transform and normalize
hd_images = [img_processor(im) for im in elems]
# create global image
global_image = [torch.nn.functional.interpolate(im.unsqueeze(0).float(), size=(336, 336), mode='bicubic',).to(im.dtype) for im in hd_images]
# [(3, h, w)], where h, w is multiple of 336
shapes = [[im.size(1), im.size(2)] for im in hd_images]
num_img_tokens = [int((h//336*w//336+1)*144 + 1 + (h//336+1)*12) for h, w in shapes]
# reshape to channel dimension -> (num_images, num_crops, 3, 336, 336)
# (1, 3, h//336, 336, w//336, 336) -> (1, h//336, w//336, 3, 336, 336) -> (h//336*w//336, 3, 336, 336)
hd_images_reshape = [im.reshape(1, 3, h//336, 336, w//336, 336).permute(0,2,4,1,3,5).reshape(-1, 3, 336, 336).contiguous() for im, (h, w) in zip(hd_images, shapes)]
# concat global image and local image
hd_images_reshape = [torch.cat([_global_image] + [_im], dim=0) for _global_image, _im in zip(global_image, hd_images_reshape)]
# pad to max_num_crops
image_transformed = [pad_to_max_num_crops_tensor(im, self.num_crops+1) for im in hd_images_reshape]
image_transformed = torch.stack(image_transformed, dim=0)
image_sizes = [torch.LongTensor(_shapes) for _shapes in shapes]
padded_images = image_transformed
image_sizes = shapes
data = {"pixel_values": padded_images,
"image_sizes": image_sizes,
"num_img_tokens": num_img_tokens
}
return BatchFeature(data=data, tensor_type=return_tensors)
AutoImageProcessor.register("Phi3VImageProcessor", Phi3VImageProcessor)

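The token-count formula used throughout this file works out as follows for an HD-transformed image of h x w pixels (both multiples of 336); a small hedged sketch of the arithmetic:

```python
def phi3v_num_image_tokens(h: int, w: int) -> int:
    """Image token count for an HD-transformed (padded) image of size h x w."""
    crops_h, crops_w = h // 336, w // 336
    # 144 patch tokens per 336x336 crop, plus one global 336x336 thumbnail,
    # plus 1 global/sub separator token,
    # plus 12 "newline" separator tokens per crop row and 12 for the thumbnail.
    return (crops_h * crops_w + 1) * 144 + 1 + (crops_h + 1) * 12

print(phi3v_num_image_tokens(672, 1344))  # (2*4 + 1)*144 + 1 + 3*12 = 1333
```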
modeling_phi3_v.py (new file, 1632 lines)

File diff suppressed because it is too large.

preprocessor_config.json (new file, 20 lines)

@@ -0,0 +1,20 @@
{
"auto_map": {
"AutoProcessor": "processing_phi3_v.Phi3VProcessor",
"AutoImageProcessor": "image_processing_phi3_v.Phi3VImageProcessor"
},
"num_crops": 16,
"image_mean": [
0.48145466,
0.4578275,
0.40821073
],
"image_processor_type": "Phi3VImageProcessor",
"image_std": [
0.26862954,
0.26130258,
0.27577711
],
"processor_class": "Phi3VProcessor",
"num_img_tokens": 144
}

processing_phi3_v.py (new file, 217 lines)

@@ -0,0 +1,217 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for Phi3-V.
"""
import re
from typing import List, Optional, Union
import torch
import transformers
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import ProcessorMixin
from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
from transformers.utils import TensorType
from .image_processing_phi3_v import Phi3VImageProcessor
transformers.Phi3VImageProcessor = Phi3VImageProcessor
class Phi3VProcessor(ProcessorMixin):
r"""
Constructs a Phi3-V processor which wraps a Phi3-V image processor and a LLaMa tokenizer into a single processor.
[`Phi3VProcessor`] offers all the functionalities of [`Phi3VImageProcessor`] and [`LlamaTokenizerFast`]. See the
[`~Phi3VProcessor.__call__`] and [`~Phi3VProcessor.decode`] for more information.
Args:
image_processor ([`Phi3VImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`], *optional*):
The tokenizer is a required input.
"""
attributes = ["image_processor", "tokenizer"]
image_processor_class = "Phi3VImageProcessor"
tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
special_image_token = "<|image|>"
def __init__(self, image_processor, tokenizer):
self.image_processor = image_processor
self.tokenizer = tokenizer
self.num_img_tokens = image_processor.num_img_tokens
self.img_tokens = [f"<|image_{i+1}|>" for i in range(1000000)]
def __call__(
self,
text: Union[TextInput, List[TextInput]],
images: ImageInput = None,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length=None,
return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
Phi3ImageProcessor's [`~Phi3ImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if images is not None:
image_inputs = self.image_processor(images, return_tensors=return_tensors)
else:
image_inputs = {}
inputs = self._convert_images_texts_to_inputs(image_inputs, text, padding=padding, truncation=truncation, max_length=max_length, return_tensors=return_tensors)
return inputs
def calc_num_image_tokens(self, images: ImageInput):
""" Calculate the number of image tokens for each image.
Args:
images (`ImageInput`):
Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
passing in images with pixel values between 0 and 1, set `do_rescale=False`.
"""
return self.image_processor.calc_num_image_tokens(images)
def calc_num_image_tokens_from_image_size(self, width, height):
""" Calculate the number of image token for an image with given width and height.
Args:
width (`int`):
Width of the image.
height (`int`):
Height of the image.
"""
return self.image_processor.calc_num_image_tokens_from_image_size(width, height)
@property
def special_image_token_id(self):
return self.tokenizer.convert_tokens_to_ids(self.special_image_token)
def get_special_image_token_id(self):
return self.tokenizer.convert_tokens_to_ids(self.special_image_token)
def _convert_images_texts_to_inputs(self, images, texts, padding=False, truncation=None, max_length=None, return_tensors=None):
if not len(images):
model_inputs = self.tokenizer(texts, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length)
return BatchFeature(data={**model_inputs})
pattern = r"<\|image_\d+\|>"
prompt_chunks = [self.tokenizer(chunk).input_ids for chunk in re.split(pattern, texts)]
if 'num_img_tokens' in images:
num_img_tokens = images['num_img_tokens']
else:
assert 'num_crops' in images, 'num_crops must be provided in images if num_img_tokens is not provided'
num_crops = images['num_crops']
num_img_tokens = [_num_crops * self.num_img_tokens for _num_crops in num_crops]
images, image_sizes = images['pixel_values'], images['image_sizes']
# image_tags needs to start from 1 to n
image_tags = re.findall(pattern, texts)
# image_ids = [int(s.split("|")[1].split("_")[-1]) * -1 for s in image_tags]
# image_ids_pad = [[iid]*num_img_tokens[i] for i, iid in enumerate(image_ids)]
image_ids = [int(s.split("|")[1].split("_")[-1]) for s in image_tags]
unique_image_ids = sorted(list(set(image_ids)))
# image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be [1, 4, 5]
# check the condition
assert unique_image_ids == list(range(1, len(unique_image_ids)+1)), f"image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be {unique_image_ids}"
# total images must be the same as the number of image tags
assert len(unique_image_ids) == len(images), f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"
image_ids_pad = [[-iid]*num_img_tokens[iid-1] for iid in image_ids]
def insert_separator(X, sep_list):
if len(X) > len(sep_list):
sep_list.append([])
return [ele for sublist in zip(X, sep_list) for ele in sublist]
input_ids = []
offset = 0
for x in insert_separator(prompt_chunks, image_ids_pad):
input_ids.extend(x[offset:])
input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
attention_mask = (input_ids > -1000000).to(torch.long)
return BatchFeature(data={"input_ids": input_ids,
"attention_mask": attention_mask,
"pixel_values": images,
"image_sizes": image_sizes})
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))

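Tying the processor together: each `<|image_n|>` tag in the text is expanded by `_convert_images_texts_to_inputs` into `num_img_tokens[n-1]` copies of the placeholder id `-n`, which `Phi3ImageEmbedding` later overwrites with projected image features. A hedged usage sketch (image path illustrative):

```python
from PIL import Image
from transformers import AutoProcessor

# trust_remote_code pulls in processing_phi3_v.Phi3VProcessor via the auto_map above.
processor = AutoProcessor.from_pretrained(
    "failspy/Phi-3-vision-128k-instruct-abliterated-alpha", trust_remote_code=True
)

text = "<|image_1|>\nWhat is shown in this image?"  # tags must be numbered 1..n
inputs = processor(text, images=[Image.open("example.jpg")], return_tensors="pt")

# Negative ids mark where image features will be spliced into the embeddings.
print(inputs["input_ids"].shape, int((inputs["input_ids"] < 0).sum()))
```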
pytorch_model-00001-of-00002.bin (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a933d3a38f156f35737a63b9c860678d345103348cf2fb5d9475c490672306ed
size 4944236059

pytorch_model-00002-of-00002.bin (new file, LFS pointer, 3 lines)

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c4475ddd0bf401f1781ac6336b41e6479cf70b7bf7ba59579773ea3526c32bec
size 3349227876

pytorch_model.bin.index.json (new file, 600 lines)

@@ -0,0 +1,600 @@
{
"metadata": {
"total_size": 8293242880
},
"weight_map": {
"lm_head.weight": "pytorch_model-00002-of-00002.bin",
"model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.0.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.1.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.10.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.11.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.12.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.13.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.14.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.15.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.16.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.17.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.18.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.18.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.18.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.18.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.18.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.19.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.2.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.2.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.20.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.20.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.20.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.20.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.20.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.20.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.21.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.22.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.23.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.24.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.25.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.26.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.27.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.28.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.29.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.3.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.3.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.30.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.30.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.31.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
"model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.4.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.4.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.5.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.6.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.7.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.8.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.layers.9.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.norm.weight": "pytorch_model-00002-of-00002.bin",
"model.vision_embed_tokens.glb_GN": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.embeddings.class_embedding": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.embeddings.patch_embedding.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.embeddings.position_embedding.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.post_layernorm.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.post_layernorm.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_projection.0.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_projection.0.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_projection.2.bias": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.img_projection.2.weight": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.sub_GN": "pytorch_model-00001-of-00002.bin",
"model.vision_embed_tokens.wte.weight": "pytorch_model-00001-of-00002.bin"
}
}
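The weight_map above records which shard file holds each tensor. A minimal sketch (not part of the repository; it assumes the index file and the two pytorch_model-*.bin shards sit in the working directory) of looking up and loading a single tensor from its shard:

import json
import torch

# load the index and find which shard holds a given tensor
with open("pytorch_model.bin.index.json") as f:
    index = json.load(f)

key = "model.vision_embed_tokens.wte.weight"        # any key listed in weight_map
shard_file = index["weight_map"][key]               # e.g. "pytorch_model-00001-of-00002.bin"
shard = torch.load(shard_file, map_location="cpu")  # a dict mapping tensor names to tensors
print(key, tuple(shard[key].shape))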

129
sample_inference.py Normal file
View File

@@ -0,0 +1,129 @@
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor
model_path = "./"
kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, **kwargs).cuda()  # kwargs carries torch_dtype=torch.bfloat16
user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"
#################################################### text-only ####################################################
# text-only prompt (no image passed to the processor)
prompt = f"{user_prompt}what is the answer for 1+1? Explain it.{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
inputs = processor(prompt, images=None, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
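# keep only the newly generated tokens (drop the echoed prompt tokens) before decoding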
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
#################################################### text-only 2 ####################################################
# text-only prompt (no image passed to the processor)
prompt = f"{user_prompt}Give me the code for sloving two-sum problem.{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
inputs = processor(prompt, images=None, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
#################################################### EXAMPLE 1 ####################################################
# single-image prompt
prompt = f"{user_prompt}<|image_1|>\nWhat is shown in this image?{prompt_suffix}{assistant_prompt}"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
print(f">>> Prompt\n{prompt}")
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
#################################################### EXAMPLE 2 ####################################################
# multiple image prompt
# Note: image tokens must start from <|image_1|>
prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\n What is shown in this two images?{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_1 = Image.open(requests.get(url, stream=True).raw)
url = "https://img.freepik.com/free-photo/painting-mountain-lake-with-mountain-background_188544-9126.jpg?w=2000"
image_2 = Image.open(requests.get(url, stream=True).raw)
images = [image_1, image_2]
inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
#################################################### EXAMPLE 3 ####################################################
# chat template
chat = [
{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
{"role": "assistant", "content": "The image depicts a street scene with a prominent red stop sign in the foreground. The background showcases a building with traditional Chinese architecture, characterized by its red roof and ornate decorations. There are also several statues of lions, which are common in Chinese culture, positioned in front of the building. The street is lined with various shops and businesses, and there's a car passing by."},
{"role": "user", "content": "What is so special about this image"}
]
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# remove a trailing <|endoftext|> if present: it is appended for training, not inference
# (for training, make sure <|endoftext|> is added at the end); slice off the exact suffix
# rather than using rstrip(), which strips a character set, not a suffix
if prompt.endswith("<|endoftext|>"):
    prompt = prompt[: -len("<|endoftext|>")]
print(f">>> Prompt\n{prompt}")
inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
############################# to markdown #############################
# single-image prompt
prompt = f"{user_prompt}<|image_1|>\nCan you convert the table to markdown format?{prompt_suffix}{assistant_prompt}"
url = "https://support.content.office.net/en-us/media/3dd2b79b-9160-403d-9967-af893d17b580.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
print(f">>> Prompt\n{prompt}")
generate_ids = model.generate(**inputs,
max_new_tokens=1000,
eos_token_id=processor.tokenizer.eos_token_id,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
skip_special_tokens=False,
clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
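The five examples above repeat the same generate-and-decode pattern. A minimal consolidation sketch (not part of the shipped script; it assumes the processor and model objects created at the top of sample_inference.py):

def generate_response(prompt, images=None, max_new_tokens=1000):
    # tokenize (and, if given, preprocess images), run generation, then decode only the new tokens
    inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")
    generate_ids = model.generate(**inputs,
                                  max_new_tokens=max_new_tokens,
                                  eos_token_id=processor.tokenizer.eos_token_id)
    generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
    return processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]

# e.g. print(generate_response(f"{user_prompt}what is the answer for 1+1?{prompt_suffix}{assistant_prompt}"))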

36
special_tokens_map.json Normal file
View File

@@ -0,0 +1,36 @@
{
"additional_special_tokens": [
"<|system|>",
"<|end|>",
"<|user|>",
"<|end|>"
],
"bos_token": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}
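A quick check (a sketch, assuming it is run from the model directory with transformers installed) that the special tokens declared above resolve to the IDs listed in tokenizer_config.json below (<|endoftext|> -> 32000, <|assistant|> -> 32001, <|system|> -> 32006, <|end|> -> 32007, <|user|> -> 32010):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./", trust_remote_code=True)
for token in ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>", "<|endoftext|>"]:
    print(token, tok.convert_tokens_to_ids(token))
print("eos:", tok.eos_token, "| pad:", tok.pad_token, "| unk:", tok.unk_token)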

93797
tokenizer.json Normal file

File diff suppressed because it is too large

408
tokenizer_config.json Normal file
View File

@@ -0,0 +1,408 @@
{
"add_bos_token": true,
"add_eos_token": false,
"added_tokens_decoder": {
"0": {
"content": "<unk>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<s>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "</s>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": false
},
"32000": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"32001": {
"content": "<|assistant|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32002": {
"content": "<|placeholder1|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32003": {
"content": "<|placeholder2|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32004": {
"content": "<|placeholder3|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32005": {
"content": "<|placeholder4|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32006": {
"content": "<|system|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"32007": {
"content": "<|end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"32008": {
"content": "<|placeholder5|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32009": {
"content": "<|placeholder6|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32010": {
"content": "<|user|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"32011": {
"content": "<|placeholder7|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32012": {
"content": "<|placeholder8|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32013": {
"content": "<|placeholder9|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32014": {
"content": "<|placeholder10|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32015": {
"content": "<|placeholder11|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32016": {
"content": "<|placeholder12|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32017": {
"content": "<|placeholder13|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32018": {
"content": "<|placeholder14|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32019": {
"content": "<|placeholder15|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32020": {
"content": "<|placeholder16|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32021": {
"content": "<|placeholder17|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32022": {
"content": "<|placeholder18|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32023": {
"content": "<|placeholder19|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32024": {
"content": "<|placeholder20|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32025": {
"content": "<|placeholder21|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32026": {
"content": "<|placeholder22|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32027": {
"content": "<|placeholder23|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32028": {
"content": "<|placeholder24|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32029": {
"content": "<|placeholder25|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32030": {
"content": "<|placeholder26|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32031": {
"content": "<|placeholder27|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32032": {
"content": "<|placeholder28|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32033": {
"content": "<|placeholder29|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32034": {
"content": "<|placeholder30|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32035": {
"content": "<|placeholder31|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32036": {
"content": "<|placeholder32|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32037": {
"content": "<|placeholder33|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32038": {
"content": "<|placeholder34|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32039": {
"content": "<|placeholder35|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32040": {
"content": "<|placeholder36|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32041": {
"content": "<|placeholder37|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32042": {
"content": "<|placeholder38|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32043": {
"content": "<|placeholder39|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
},
"32044": {
"content": "<|image|>",
"lstrip": false,
"normalized": false,
"rstrip": true,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|system|>",
"<|end|>",
"<|user|>",
"<|end|>"
],
"bos_token": "<s>",
"chat_template": "{% for message in messages %}{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{- '<|assistant|>\n' -}}{% endif %}",
"clean_up_tokenization_spaces": false,
"eos_token": "<|endoftext|>",
"legacy": false,
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"padding_side": "right",
"sp_model_kwargs": {},
"tokenizer_class": "LlamaTokenizer",
"unk_token": "<unk>",
"use_default_system_prompt": false
}
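The chat_template above is what processor.tokenizer.apply_chat_template uses in sample_inference.py. A minimal sketch (assuming the tokenizer is loaded from this directory) of what it renders for a one-turn conversation:

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("./", trust_remote_code=True)
chat = [{"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"}]
print(tok.apply_chat_template(chat, tokenize=False, add_generation_prompt=True))
# Per the template, this should render as:
# <|user|>
# <|image_1|>
# What is shown in this image?<|end|>
# <|assistant|>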