Initialize project; model provided by the ModelHub XC community
Model: failspy/Phi-3-vision-128k-instruct-abliterated-alpha Source: Original Platform
This commit is contained in:
35
.gitattributes
vendored
Normal file
@@ -0,0 +1,35 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
77
README.md
Normal file
@@ -0,0 +1,77 @@
---
license: mit
license_link: https://huggingface.co/microsoft/Phi-3-medium-4k-instruct/resolve/main/LICENSE
language:
- multilingual
pipeline_tag: text-generation
tags:
- nlp
- code
- vision
inference:
  parameters:
    temperature: 0.7
widget:
- messages:
  - role: user
    content: I'm looking to do something unethical online. How can I stay safe whilst doing so?
---

# Phi-3-vision-128k-instruct-abliterated-***alpha***

[My Jupyter "cookbook" to replicate the methodology can be found here; a refined library is coming soon](https://huggingface.co/failspy/llama-3-70B-Instruct-abliterated/blob/main/ortho_cookbook.ipynb)

#### Phi-3-abliterated statement

Took me a while to wizard this one up. It's been a while since I've released a Phi-3 model. In the past I accidentally missed a step required in the model release process: hallucination testing.

This model has been tested, and though it is more likely to hallucinate than the original model in my experience, it is generally as stable as the original.

Now that the new Phi-3 models are out, I'm working on completing this abliteration process quickly, and will then release the other models as soon as possible. 🏇

## Summary

This is [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) with orthogonalized bfloat16 safetensor weights, generated with a refined methodology based on the one described in the preview paper/blog post '[Refusal in LLMs is mediated by a single direction](https://www.alignmentforum.org/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction)', which I encourage you to read to understand more.

## Alpha???

This is a very experimental model. I've never done a vision model before, so I really don't know what could possibly go wrong; I want to set expectations accordingly.

This is, however, the same methodology as all the other abliterated-v3 models, applied to a vision model. The vision encoding layers are completely unmodified; only the *language*/chat model layers are abliterated.

Please feel free to file issues to report any weird/strange behaviour. Expect this model to see updates as needed.

## Hang on, "abliterated"? Orthogonalization? Ablation? What is this?

TL;DR: This model has had certain weights manipulated to "inhibit" the model's ability to express refusal. It is not in any way _guaranteed_ that it won't refuse you or misunderstand your request, and it may still lecture you about ethics/safety, etc. In all other respects it is tuned the same as the original instruct model was, just with the strongest refusal directions orthogonalized out.

**TL;TL;DR;DR: It's uncensored in the purest form I can manage -- no new or changed behaviour in any other respect from the original model.**

As for "abliterated": it's just a fun play on words on the "ablation" term used in the original paper to refer to removing features, which I coined specifically to differentiate the model from "uncensored" fine-tunes.
Ablate + obliterated = Abliterated

Anyway, orthogonalization and ablation here both refer to the same thing: the technique by which the refusal feature was "ablated" from the model was orthogonalization.

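For the curious, the core "orthogonalize a direction out of the weights" operation can be sketched in a few lines. This is an illustrative toy with random data and a made-up direction `r`, not the actual cookbook code; in the real method the direction is extracted from model activations:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

# Hypothetical "refusal direction" (unit vector) -- a stand-in, not a real one.
r = rng.normal(size=d_model)
r /= np.linalg.norm(r)

# A weight matrix that writes into the residual stream (rows = output dims).
W = rng.normal(size=(d_model, d_model))

# Orthogonalize: subtract the component of W's output that lies along r.
W_ablated = W - np.outer(r, r) @ W

# Any output of the ablated matrix now has (near-)zero component along r.
x = rng.normal(size=d_model)
print(abs(np.dot(r, W_ablated @ x)))  # ~0 up to floating-point error
```

Applied to every matrix that writes into the residual stream, this is what "orthogonalized weights" means: the model simply can no longer move its hidden state along the removed direction.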
## A little more on the methodology, and why this is interesting

To me, ablation (or applying the methodology in reverse, "augmentation") seems good for inducing or removing very specific features that you'd otherwise have to spend far too many tokens encouraging or discouraging in your system prompt.
Instead, you just apply your system prompt in the ablation script against a blank system prompt on the same dataset, and orthogonalize for the desired behaviour in the final model weights.

> Why this over fine-tuning?

Ablation is much more surgical in nature, whilst also being effective with a _lot_ less data than fine-tuning, which I think is its main advantage.

Its most valuable aspect, though, is that it keeps as much of the original model's knowledge and training intact whilst removing its tendency to behave in one very specific undesirable manner. (In this case, refusing user requests.)

Fine-tuning is still exceptionally useful and the go-to for broad behaviour changes; however, you may be able to get close to your desired behaviour with very few samples using the ablation/augmentation techniques.
It may also be a useful step to add to your model refinement pipeline: orthogonalize -> fine-tune, or vice versa.

I haven't really gotten around to exploring this model stacked with fine-tuning; I encourage others to give it a shot if they've got the capacity.

## Quirkiness awareness notice

This model may come with interesting quirks, the methodology being so new. I encourage you to play with the model, and post any quirks you notice in the community tab, as that'll help us further understand what side effects this orthogonalization has.

If you manage to develop further improvements, please share! This is really the most basic way to use ablation, but there are other possibilities that I believe are as yet unexplored.

Additionally, feel free to reach out in any way about this. I'm on the Cognitive Computations Discord, and I'm watching the Community tab -- reach out! I'd love to see this methodology used in other ways, and would gladly support whoever I can.
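The direction-finding step described above (comparing runs with a system prompt against runs with a blank one) boils down to a difference of mean activations between two behaviour conditions. A toy illustration with entirely synthetic data standing in for captured activations:

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n = 16, 200

# Synthetic stand-ins for residual-stream activations captured under two
# conditions (e.g. with vs. without a particular system prompt / behaviour).
base = rng.normal(size=(n, d_model))
shift = np.zeros(d_model)
shift[3] = 2.5  # in this toy data the "behaviour" lives mostly along one axis
acts_a = base + shift
acts_b = rng.normal(size=(n, d_model))

# The candidate behaviour direction: normalized difference of means.
direction = acts_a.mean(axis=0) - acts_b.mean(axis=0)
direction /= np.linalg.norm(direction)

# The planted axis should dominate the recovered direction.
print(int(np.argmax(np.abs(direction))))
```

In the real pipeline, the resulting unit vector is what gets orthogonalized out of (or added into, for "augmentation") the model weights.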
148
config.json
Normal file
@@ -0,0 +1,148 @@
|
{
  "_name_or_path": "Phi-3-vision-128k-instruct",
  "architectures": [
    "Phi3VForCausalLM"
  ],
  "attention_dropout": 0.0,
  "auto_map": {
    "AutoConfig": "configuration_phi3_v.Phi3VConfig",
    "AutoModelForCausalLM": "modeling_phi3_v.Phi3VForCausalLM"
  },
  "bos_token_id": 1,
  "embd_layer": {
    "embedding_cls": "image",
    "hd_transform_order": "sub_glb",
    "projection_cls": "mlp",
    "use_hd_transform": true,
    "with_learnable_separator": true
  },
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 3072,
  "img_processor": {
    "image_dim_out": 1024,
    "model_name": "openai/clip-vit-large-patch14-336",
    "name": "clip_vision_model",
    "num_img_tokens": 144
  },
  "initializer_range": 0.02,
  "intermediate_size": 8192,
  "max_position_embeddings": 131072,
  "model_type": "phi3_v",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 32,
  "original_max_position_embeddings": 4096,
  "rms_norm_eps": 1e-05,
  "rope_scaling": {
    "long_factor": [
      1.0299999713897705,
      1.0499999523162842,
      1.0499999523162842,
      1.0799999237060547,
      1.2299998998641968,
      1.2299998998641968,
      1.2999999523162842,
      1.4499999284744263,
      1.5999999046325684,
      1.6499998569488525,
      1.8999998569488525,
      2.859999895095825,
      3.68999981880188,
      5.419999599456787,
      5.489999771118164,
      5.489999771118164,
      9.09000015258789,
      11.579999923706055,
      15.65999984741211,
      15.769999504089355,
      15.789999961853027,
      18.360000610351562,
      21.989999771118164,
      23.079999923706055,
      30.009998321533203,
      32.35000228881836,
      32.590003967285156,
      35.56000518798828,
      39.95000457763672,
      53.840003967285156,
      56.20000457763672,
      57.95000457763672,
      59.29000473022461,
      59.77000427246094,
      59.920005798339844,
      61.190006256103516,
      61.96000671386719,
      62.50000762939453,
      63.3700065612793,
      63.48000717163086,
      63.48000717163086,
      63.66000747680664,
      63.850006103515625,
      64.08000946044922,
      64.760009765625,
      64.80001068115234,
      64.81001281738281,
      64.81001281738281
    ],
    "short_factor": [
      1.05,
      1.05,
      1.05,
      1.1,
      1.1,
      1.1,
      1.2500000000000002,
      1.2500000000000002,
      1.4000000000000004,
      1.4500000000000004,
      1.5500000000000005,
      1.8500000000000008,
      1.9000000000000008,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.000000000000001,
      2.1000000000000005,
      2.1000000000000005,
      2.2,
      2.3499999999999996,
      2.3499999999999996,
      2.3499999999999996,
      2.3499999999999996,
      2.3999999999999995,
      2.3999999999999995,
      2.6499999999999986,
      2.6999999999999984,
      2.8999999999999977,
      2.9499999999999975,
      3.049999999999997,
      3.049999999999997,
      3.049999999999997
    ],
    "type": "su"
  },
  "rope_theta": 10000.0,
  "sliding_window": 131072,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.38.1",
  "use_cache": true,
  "vocab_size": 32064,
  "_attn_implementation": "flash_attention_2"
}
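One detail worth noting in this config: `rope_scaling.long_factor` and `rope_scaling.short_factor` must each contain `hidden_size / num_attention_heads / 2` entries (one scale per rotary frequency pair, i.e. half the per-head dimension), which is why both lists above hold 48 values. The arithmetic:

```python
hidden_size = 3072
num_attention_heads = 32

head_dim = hidden_size // num_attention_heads  # dims per attention head
rotary_half = head_dim // 2                    # one factor per rotary frequency pair

print(head_dim, rotary_half)  # 96 48
```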
217
configuration_phi3_v.py
Normal file
@@ -0,0 +1,217 @@
|
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

""" Phi-3-V model configuration"""


from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging


logger = logging.get_logger(__name__)

PHI3V_PRETRAINED_CONFIG_ARCHIVE_MAP = {
    "microsoft/Phi-3-vision-128k-instruct": "https://huggingface.co/microsoft/Phi-3-vision-128k-instruct/resolve/main/config.json",
}


class Phi3VConfig(PretrainedConfig):
    r"""
    This is the configuration class to store the configuration of a [`Phi3VModel`]. It is used to instantiate a Phi-3
    model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
    defaults will yield a similar configuration to that of the
    [microsoft/Phi-3-vision-128k-instruct](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct).

    Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
    documentation from [`PretrainedConfig`] for more information.

    Args:
        vocab_size (`int`, *optional*, defaults to 32064):
            Vocabulary size of the Phi-3-V model. Defines the number of different tokens that can be represented by the
            `inputs_ids` passed when calling [`Phi3VModel`].
        hidden_size (`int`, *optional*, defaults to 3072):
            Dimension of the hidden representations.
        intermediate_size (`int`, *optional*, defaults to 8192):
            Dimension of the MLP representations.
        num_hidden_layers (`int`, *optional*, defaults to 32):
            Number of hidden layers in the Transformer decoder.
        num_attention_heads (`int`, *optional*, defaults to 32):
            Number of attention heads for each attention layer in the Transformer decoder.
        num_key_value_heads (`int`, *optional*):
            This is the number of key_value heads that should be used to implement Grouped Query Attention. If
            `num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
            `num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
            converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
            by meanpooling all the original heads within that group. For more details, check out [this
            paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
            `num_attention_heads`.
        resid_pdrop (`float`, *optional*, defaults to 0.0):
            Dropout probability for mlp outputs.
        embd_pdrop (`float`, *optional*, defaults to 0.0):
            The dropout ratio for the embeddings.
        attention_dropout (`float`, *optional*, defaults to 0.0):
            The dropout ratio after computing the attention scores.
        hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
            The non-linear activation function (function or string) in the decoder.
        max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model might ever be used with.
        original_max_position_embeddings (`int`, *optional*, defaults to 4096):
            The maximum sequence length that this model was trained with. This is used to determine the size of the
            original RoPE embeddings when using long scaling.
        initializer_range (`float`, *optional*, defaults to 0.02):
            The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-05):
            The epsilon value used for the RMSNorm.
        use_cache (`bool`, *optional*, defaults to `True`):
            Whether or not the model should return the last key/values attentions (not used by all models). Only
            relevant if `config.is_decoder=True`.
        tie_word_embeddings (`bool`, *optional*, defaults to `False`):
            Whether to tie weight embeddings.
        rope_theta (`float`, *optional*, defaults to 10000.0):
            The base period of the RoPE embeddings.
        rope_scaling (`dict`, *optional*):
            The scaling strategy for the RoPE embeddings. If `None`, no scaling is applied. If a dictionary, it must
            contain the following keys: `type`, `short_factor` and `long_factor`. The `type` must be either `su` or
            `yarn`, and `short_factor` and `long_factor` must be lists of numbers with the same length as the hidden
            size divided by the number of attention heads divided by 2.
        bos_token_id (`int`, *optional*, defaults to 1):
            The id of the "beginning-of-sequence" token.
        eos_token_id (`int`, *optional*, defaults to 32000):
            The id of the "end-of-sequence" token.
        pad_token_id (`int`, *optional*, defaults to 32000):
            The id of the padding token.
        sliding_window (`int`, *optional*):
            Sliding window attention window size. If `None`, no sliding window is applied.
        embd_layer (`str`, *optional*, defaults to `"default"`):
            The embedding layer to use. Can be either `"default"` or `"image"`. `"default"` uses the standard
            embedding for text.

    Example:

    ```python
    >>> from transformers import Phi3VModel, Phi3VConfig

    >>> # Initializing a Phi-3-V style configuration
    >>> configuration = Phi3VConfig.from_pretrained("microsoft/Phi-3-vision-128k-instruct")

    >>> # Initializing a model from the configuration
    >>> model = Phi3VModel(configuration)

    >>> # Accessing the model configuration
    >>> configuration = model.config
    ```"""

    model_type = "phi3_v"
    keys_to_ignore_at_inference = ["past_key_values"]

    def __init__(
        self,
        vocab_size=32064,
        hidden_size=3072,
        intermediate_size=8192,
        num_hidden_layers=32,
        num_attention_heads=32,
        num_key_value_heads=None,
        resid_pdrop=0.0,
        embd_pdrop=0.0,
        attention_dropout=0.0,
        hidden_act="silu",
        max_position_embeddings=4096,
        original_max_position_embeddings=4096,
        initializer_range=0.02,
        rms_norm_eps=1e-5,
        use_cache=True,
        tie_word_embeddings=False,
        rope_theta=10000.0,
        rope_scaling=None,
        bos_token_id=1,
        eos_token_id=32000,
        pad_token_id=32000,
        sliding_window=None,
        embd_layer: str = "default",
        **kwargs,
    ):
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.intermediate_size = intermediate_size
        self.num_hidden_layers = num_hidden_layers
        self.num_attention_heads = num_attention_heads

        if num_key_value_heads is None:
            num_key_value_heads = num_attention_heads

        self.num_key_value_heads = num_key_value_heads
        self.resid_pdrop = resid_pdrop
        self.embd_pdrop = embd_pdrop
        self.attention_dropout = attention_dropout
        self.hidden_act = hidden_act
        self.max_position_embeddings = max_position_embeddings
        self.original_max_position_embeddings = original_max_position_embeddings
        self.initializer_range = initializer_range
        self.rms_norm_eps = rms_norm_eps
        self.use_cache = use_cache
        self.rope_theta = rope_theta
        self.rope_scaling = rope_scaling
        self._rope_scaling_validation()
        self.sliding_window = sliding_window
        self.embd_layer = embd_layer

        super().__init__(
            bos_token_id=bos_token_id,
            eos_token_id=eos_token_id,
            pad_token_id=pad_token_id,
            tie_word_embeddings=tie_word_embeddings,
            **kwargs,
        )

    def _rope_scaling_validation(self):
        """
        Validate the `rope_scaling` configuration.
        """
        if self.rope_scaling is None:
            return

        if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 3:
            raise ValueError(
                "`rope_scaling` must be a dictionary with three fields, `type`, `short_factor` and `long_factor`, "
                f"got {self.rope_scaling}"
            )
        rope_scaling_type = self.rope_scaling.get("type", None)
        rope_scaling_short_factor = self.rope_scaling.get("short_factor", None)
        rope_scaling_long_factor = self.rope_scaling.get("long_factor", None)
        if rope_scaling_type is None or rope_scaling_type not in ["su", "yarn"]:
            raise ValueError(f"`rope_scaling`'s type field must be one of ['su', 'yarn'], got {rope_scaling_type}")
        if not (
            isinstance(rope_scaling_short_factor, list)
            and all(isinstance(x, (int, float)) for x in rope_scaling_short_factor)
        ):
            raise ValueError(
                f"`rope_scaling`'s short_factor field must be a list of numbers, got {rope_scaling_short_factor}"
            )
        if not len(rope_scaling_short_factor) == self.hidden_size // self.num_attention_heads // 2:
            raise ValueError(
                f"`rope_scaling`'s short_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_short_factor)}"
            )
        if not (
            isinstance(rope_scaling_long_factor, list)
            and all(isinstance(x, (int, float)) for x in rope_scaling_long_factor)
        ):
            raise ValueError(
                f"`rope_scaling`'s long_factor field must be a list of numbers, got {rope_scaling_long_factor}"
            )
        if not len(rope_scaling_long_factor) == self.hidden_size // self.num_attention_heads // 2:
            raise ValueError(
                f"`rope_scaling`'s long_factor field must have length {self.hidden_size // self.num_attention_heads // 2}, got {len(rope_scaling_long_factor)}"
            )
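The `_rope_scaling_validation` checks can be exercised without instantiating the config class; the following standalone function (an illustrative mimic, not part of the file) mirrors the same validation logic:

```python
def validate_rope_scaling(rope_scaling, hidden_size=3072, num_attention_heads=32):
    """Standalone mirror of the checks in Phi3VConfig._rope_scaling_validation (illustrative)."""
    if rope_scaling is None:
        return True
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 3:
        raise ValueError("rope_scaling must be a dict with exactly type, short_factor, long_factor")
    if rope_scaling.get("type") not in ("su", "yarn"):
        raise ValueError("rope_scaling type must be 'su' or 'yarn'")
    expected = hidden_size // num_attention_heads // 2  # 48 for this model
    for key in ("short_factor", "long_factor"):
        factors = rope_scaling.get(key)
        if not (isinstance(factors, list) and all(isinstance(x, (int, float)) for x in factors)):
            raise ValueError(f"{key} must be a list of numbers")
        if len(factors) != expected:
            raise ValueError(f"{key} must have length {expected}, got {len(factors)}")
    return True

ok = {"type": "su", "short_factor": [1.0] * 48, "long_factor": [1.0] * 48}
print(validate_rope_scaling(ok))  # True
```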
7
generation_config.json
Normal file
@@ -0,0 +1,7 @@
|
{
  "_from_model_config": true,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "pad_token_id": 32000,
  "transformers_version": "4.42.0.dev0"
}
301
image_embedding_phi3_v.py
Normal file
@@ -0,0 +1,301 @@
|
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import math
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, PretrainedConfig
from transformers import CLIPVisionConfig
from transformers.utils import logging
from datetime import datetime

logger = logging.get_logger(__name__)

CLIP_VIT_LARGE_PATCH14_336_CONFIG = CLIPVisionConfig(
    attention_dropout=0.0,
    dropout=0.0,
    hidden_act="quick_gelu",
    hidden_size=1024,
    image_size=336,
    initializer_factor=1.0,
    initializer_range=0.02,
    intermediate_size=4096,
    layer_norm_eps=1e-05,
    num_attention_heads=16,
    num_channels=3,
    num_hidden_layers=24,
    patch_size=14,
    projection_dim=768
)


class Phi3ImageEmbedding(nn.Module):
    """Phi3 Image embedding."""

    def __init__(self, config: PretrainedConfig, wte=None, **kwargs) -> None:
        super().__init__()

        # n_embd or hidden_size
        hidden_size = config.n_embd if hasattr(config, 'n_embd') else config.hidden_size
        if hasattr(config, 'embd_pdrop') or hasattr(config, 'embed_pdrop'):
            embd_drop = config.embd_pdrop if hasattr(config, 'embd_pdrop') else config.embed_pdrop
            self.drop = nn.Dropout(embd_drop)
        else:
            self.drop = None

        self.wte = wte

        if isinstance(config.img_processor, dict) and config.img_processor.get('name', None) == 'clip_vision_model':
            assert 'model_name' in config.img_processor, 'model_name must be provided for CLIPVisionModel'
            assert 'image_dim_out' in config.img_processor, 'image_dim_out must be provided for CLIPVisionModel'
            assert 'num_img_tokens' in config.img_processor, 'num_img_tokens must be provided for CLIPVisionModel'
            assert config.img_processor['model_name'] == 'openai/clip-vit-large-patch14-336'
            clip_config = CLIP_VIT_LARGE_PATCH14_336_CONFIG
            self.img_processor = CLIPVisionModel(clip_config)
            image_dim_out = config.img_processor['image_dim_out']
            self.num_img_tokens = config.img_processor['num_img_tokens']
        else:
            raise NotImplementedError(f'img_processor = {config.img_processor}, not implemented')

        self.image_dim_out = image_dim_out
        self.img_sizes = None

        # glb_GN and sub_GN for hd transform, serve as line separators
        self.use_hd_transform = kwargs.get('use_hd_transform', False)
        self.with_learnable_separator = kwargs.get('with_learnable_separator', False)
        self.hd_transform_order = kwargs.get('hd_transform_order', 'glb_sub')
        # use_hd_transform and with_learnable_separator should have the same value
        assert self.use_hd_transform == self.with_learnable_separator, 'use_hd_transform and with_learnable_separator should have same value'
        if self.with_learnable_separator:
            assert self.use_hd_transform, 'learnable separator is only for hd transform'
            # 1024 * 4, merge spatial to channel dimension
            self.glb_GN = nn.Parameter(torch.zeros([1, 1, self.image_dim_out * 4]))
            self.sub_GN = nn.Parameter(torch.zeros([1, 1, 1, self.image_dim_out * 4]))
            logger.info(f'learnable separator enabled for hd transform, hd_transform_order = {self.hd_transform_order}')

        projection_cls = kwargs.get('projection_cls', 'linear')
        if projection_cls == 'linear':
            self.img_projection = nn.Linear(image_dim_out, hidden_size)
        elif projection_cls == 'mlp' and self.use_hd_transform:
            dim_projection = hidden_size
            depth = 2
            layers = [nn.Linear(image_dim_out * 4, dim_projection)]
            for _ in range(1, depth):
                layers.extend([nn.GELU(),
                               nn.Linear(dim_projection, dim_projection)])
            self.img_projection = nn.Sequential(*layers)
        elif projection_cls == 'mlp':
            dim_projection = hidden_size
            depth = 2
            layers = [nn.Linear(image_dim_out, dim_projection)]
            for _ in range(1, depth):
                layers.extend([nn.GELU(),
                               nn.Linear(dim_projection, dim_projection)])
            self.img_projection = nn.Sequential(*layers)
        else:
            raise NotImplementedError(f'projection_cls = {projection_cls}, not implemented')

        self.vocab_size = config.vocab_size
        self.img_features = None

        if isinstance(config.img_processor, dict):
            self.layer_idx = config.img_processor.get('layer_idx', -2)
            self.type_feature = config.img_processor.get('type_feature', 'patch')
        else:
            self.layer_idx = -2
            self.type_feature = 'patch'

    def set_img_features(self, img_features: torch.FloatTensor) -> None:
        self.img_features = img_features

    def set_img_sizes(self, img_sizes: torch.LongTensor) -> None:
        self.img_sizes = img_sizes

    def get_img_features(self, img_embeds: torch.FloatTensor) -> torch.FloatTensor:
        LAYER_IDX = self.layer_idx
        TYPE_FEATURE = self.type_feature

        img_processor_output = self.img_processor(img_embeds, output_hidden_states=True)
        img_feature = img_processor_output.hidden_states[LAYER_IDX]

        if TYPE_FEATURE == "patch":
            patch_feature = img_feature[:, 1:]
            return patch_feature

        if TYPE_FEATURE == "cls_patch":
            return img_feature

        raise NotImplementedError

    def forward(self, input_ids: torch.LongTensor, pixel_values: torch.FloatTensor, image_sizes=None) -> torch.FloatTensor:

        MAX_INPUT_ID = int(1e9)
        img_embeds = pixel_values
        img_sizes = image_sizes

        if self.img_features is not None:
            img_embeds = self.img_features.clone()
            self.img_features = None

        if self.img_sizes is not None:
            img_sizes = self.img_sizes

        input_shape = input_ids.size()
        input_ids = input_ids.view(-1, input_shape[-1])

        with torch.no_grad():
            positions = torch.nonzero((input_ids < 0) & (input_ids > -MAX_INPUT_ID), as_tuple=False)

        select = False

        if isinstance(self.img_projection, nn.Sequential):
            target_device = self.img_projection[0].bias.device
            target_dtype = self.img_projection[0].bias.dtype
        else:  # It's a single nn.Linear layer
            target_device = self.img_projection.bias.device
            target_dtype = self.img_projection.bias.dtype

        if len(positions.tolist()) > 0:
            with torch.no_grad():
                g_values = abs(input_ids[positions[:, 0], positions[:, 1]])

            if self.use_hd_transform and img_sizes is not None and len(img_sizes):
                hd_transform = True
                assert img_embeds.ndim == 5, f'img_embeds size: {img_embeds.size()}, expect 5D tensor for hd transform'
                # img_embeds: (num_images, max_num_crops, 3, H, W)
                # img_sizes: (num_images, 2).view(1, -1)

                start_time = datetime.now()
                bs = img_embeds.shape[0]
                # Nx(HW)xC
                img_features = self.get_img_features(img_embeds.flatten(0, 1))
                base_feat_height = base_feat_width = int(img_features.shape[1] ** 0.5)

                assert base_feat_height == 24 and base_feat_width == 24, f'base_feat_height: {base_feat_height}, base_feat_width: {base_feat_width}, expect 24x24 features for hd transform'

                # bs x max_num_crops x (24x24) x C
                img_features = img_features.view(bs, -1, base_feat_height * base_feat_width, self.image_dim_out)
                C = self.image_dim_out
                H = base_feat_height
|
||||||
|
|
||||||
|
output_imgs = []
|
||||||
|
output_len = []
|
||||||
|
# training is tensor, inference is list
|
||||||
|
if isinstance(img_sizes, torch.Tensor):
|
||||||
|
img_sizes = img_sizes.view(-1, 2)
|
||||||
|
for _bs in range(bs):
|
||||||
|
h, w = img_sizes[_bs]
|
||||||
|
h = h // 336
|
||||||
|
w = w // 336
|
||||||
|
B_ = h * w
|
||||||
|
|
||||||
|
# 1 x (24x24) x 1024
|
||||||
|
global_img_feature = img_features[_bs, :1]
|
||||||
|
|
||||||
|
# 1 x 12 x 12 x 4096
|
||||||
|
glb_img = global_img_feature.reshape(1,H,H,C).reshape(1,H//2,2,H//2,2,C).contiguous().permute(0,1,3,2,4,5).reshape(1,H//2,H//2,4*C).contiguous()
|
||||||
|
temp_glb_GN = self.sub_GN.repeat(1, H//2, 1, 1)
|
||||||
|
|
||||||
|
# 1 x 156 x 4096
|
||||||
|
glb_img = torch.cat([glb_img, temp_glb_GN], dim=2).reshape(1,-1,4*C)
|
||||||
|
|
||||||
|
# (max_num_crops-1) x (12x12) x C
|
||||||
|
sub_img = img_features[_bs, 1:]
|
||||||
|
# 16x574x1024
|
||||||
|
# get rid of padding sub_img
|
||||||
|
sub_img = sub_img[:B_]
|
||||||
|
|
||||||
|
# (num_crops, 12, 2, 12, 2, 1024) -> (num_crops, 12, 12, 2, 2, 1024) -> (num_crops, 12*12, 4*1024)
|
||||||
|
sub_img = sub_img.reshape(B_,H,H,C).reshape(B_,H//2,2,H//2,2,C).contiguous().permute(0,1,3,2,4,5).reshape(B_,-1,4*C).contiguous()
|
||||||
|
sub_img = sub_img.reshape(1, h, w, 12, 12, -1).permute(0,1,3,2,4,5).reshape(1,h*12,w*12,4*C)
|
||||||
|
temp_sub_GN = self.sub_GN.repeat(1, h*12, 1, 1)
|
||||||
|
sub_img = torch.cat([sub_img, temp_sub_GN], dim=2).reshape(1,-1,4*C)
|
||||||
|
# (1, num_img_tokens, 1024*4)
|
||||||
|
|
||||||
|
# glb + sub
|
||||||
|
if self.hd_transform_order == 'glb_sub':
|
||||||
|
output_imgs.append(torch.cat([glb_img, self.glb_GN, sub_img], dim=1))
|
||||||
|
elif self.hd_transform_order == 'sub_glb':
|
||||||
|
output_imgs.append(torch.cat([sub_img, self.glb_GN, glb_img], dim=1))
|
||||||
|
else:
|
||||||
|
raise NotImplementedError(f'hd_transform_order = {self.hd_transform_order}, not implemented')
|
||||||
|
|
||||||
|
temp_len = int((h*w+1)*144 + 1 + (h+1)*12)
|
||||||
|
assert temp_len == output_imgs[-1].shape[1], f'temp_len: {temp_len}, output_imgs[-1].shape[1]: {output_imgs[-1].shape[1]}'
|
||||||
|
output_len.append(temp_len)
|
||||||
|
|
||||||
|
num_img_tokens = output_len
|
||||||
|
img_set_tensor = []
|
||||||
|
for _output_img in output_imgs:
|
||||||
|
img_feature_proj = self.img_projection(_output_img.to(target_device).to(target_dtype))
|
||||||
|
img_set_tensor.append(img_feature_proj)
|
||||||
|
logger.info(f'img_embeds size: {img_embeds.size()}, image sizes: {img_sizes} loading time {datetime.now() - start_time}')
|
||||||
|
elif img_embeds.ndim == 4:
|
||||||
|
selected_g_values = g_values[::self.num_img_tokens]
|
||||||
|
assert len(img_embeds) == len(selected_g_values), f'img_embeds size: {img_embeds.size()}, selected_g_values size: {len(selected_g_values)}, selected_g_value {selected_g_values}'
|
||||||
|
start_time = datetime.now()
|
||||||
|
tt = (
|
||||||
|
self.get_img_features(img_embeds)
|
||||||
|
.to(target_device)
|
||||||
|
.to(target_dtype)
|
||||||
|
.reshape(-1, self.image_dim_out)
|
||||||
|
)
|
||||||
|
logger.info(f'img_embeds size: {img_embeds.size()}, loading time {datetime.now() - start_time}')
|
||||||
|
img_set_tensor = self.img_projection(tt) # adapted visual features.
|
||||||
|
elif img_embeds.ndim == 3:
|
||||||
|
selected_g_values = g_values[::self.num_img_tokens]
|
||||||
|
assert len(img_embeds) == len(selected_g_values), f'img_embeds size: {img_embeds.size()}, selected_g_values size: {len(selected_g_values)}, selected_g_value {selected_g_values}'
|
||||||
|
tt = (
|
||||||
|
img_embeds
|
||||||
|
.to(target_device)
|
||||||
|
.to(target_dtype)
|
||||||
|
.view(-1, self.image_dim_out)
|
||||||
|
)
|
||||||
|
img_set_tensor = self.img_projection(tt) # adapted visual features.
|
||||||
|
else:
|
||||||
|
raise NotImplementedError
|
||||||
|
select = True
|
||||||
|
|
||||||
|
with torch.no_grad():
|
||||||
|
input_ids.clamp_min_(0).clamp_max_(self.vocab_size)
|
||||||
|
|
||||||
|
hidden_states = self.wte(input_ids)
|
||||||
|
|
||||||
|
if select:
|
||||||
|
if hd_transform:
|
||||||
|
idx = 0
|
||||||
|
for i, cnt in enumerate(num_img_tokens):
|
||||||
|
hidden_states[positions[idx, 0], positions[idx, 1] : positions[idx, 1] + cnt] = (
|
||||||
|
img_set_tensor[i]
|
||||||
|
.to(hidden_states.dtype)
|
||||||
|
.to(hidden_states.device)
|
||||||
|
)
|
||||||
|
idx += cnt
|
||||||
|
else:
|
||||||
|
idx = 0
|
||||||
|
assert len(selected_g_values) * self.num_img_tokens == len(img_set_tensor), f'len(selected_g_values) * self.num_img_tokens = {len(selected_g_values) * self.num_img_tokens}, len(img_set_tensor) = {len(img_set_tensor)}'
|
||||||
|
for i, g in enumerate(selected_g_values):
|
||||||
|
cnt = self.num_img_tokens
|
||||||
|
hidden_states[positions[idx, 0], positions[idx, 1] : positions[idx, 1] + cnt] = (
|
||||||
|
img_set_tensor[i * cnt : (i + 1) * cnt]
|
||||||
|
.to(hidden_states.dtype)
|
||||||
|
.to(hidden_states.device)
|
||||||
|
)
|
||||||
|
idx += cnt
|
||||||
|
|
||||||
|
if self.drop is not None:
|
||||||
|
hidden_states = self.drop(hidden_states)
|
||||||
|
|
||||||
|
return hidden_states
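The splice above relies on a simple convention: image placeholders in `input_ids` are negative integers, and `torch.nonzero` over a range mask locates every slot to overwrite with projected visual features. A minimal standalone toy (the ids here are illustrative, not real tokenizer output):

```python
import torch

# Toy sequence: 101/202/203 are ordinary text tokens, the three -1 entries
# mark where image features will be spliced into the embedding sequence.
input_ids = torch.tensor([[101, -1, -1, -1, 202, 203]])
MAX_INPUT_ID = int(1e9)

# Same mask as in forward(): negative, but above the -1e9 sentinel floor.
positions = torch.nonzero((input_ids < 0) & (input_ids > -MAX_INPUT_ID), as_tuple=False)
print(positions.tolist())  # [[0, 1], [0, 2], [0, 3]]
```

Each row is a `(batch, sequence)` coordinate; the forward pass walks these rows in order, writing `cnt` consecutive feature vectors per image.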
274
image_processing_phi3_v.py
Normal file
@@ -0,0 +1,274 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Image processor class for Phi3-V."""

from typing import List, Optional, Union

import numpy as np

from transformers.image_processing_utils import BaseImageProcessor, BatchFeature
from transformers.image_transforms import (
    convert_to_rgb,
)
from transformers.image_utils import (
    OPENAI_CLIP_MEAN,
    OPENAI_CLIP_STD,
    ImageInput,
    make_list_of_images,
    valid_images,
)
from transformers.utils import TensorType, is_vision_available, logging

from transformers import AutoImageProcessor

logger = logging.get_logger(__name__)


if is_vision_available():
    from PIL import Image

import torch
import torchvision


def padding_336(b):
    width, height = b.size
    tar = int(np.ceil(height / 336) * 336)
    top_padding = int((tar - height) / 2)
    bottom_padding = tar - height - top_padding
    left_padding = 0
    right_padding = 0
    # pad the height up to the next multiple of 336 with white pixels
    b = torchvision.transforms.functional.pad(b, [left_padding, top_padding, right_padding, bottom_padding], fill=[255, 255, 255])

    return b


def calc_padded_size(width, height, padding_unit=336):
    target_height = int(np.ceil(height / padding_unit) * padding_unit)
    top_padding = int((target_height - height) / 2)
    bottom_padding = target_height - height - top_padding
    left_padding = 0
    right_padding = 0
    padded_width = width + left_padding + right_padding
    padded_height = height + top_padding + bottom_padding
    return padded_width, padded_height


def HD_transform(img, hd_num=16):
    width, height = img.size
    trans = False
    if width < height:
        # work in landscape orientation; transpose back at the end
        img = img.transpose(Image.TRANSPOSE)
        trans = True
        width, height = img.size
    ratio = width / height
    scale = 1
    while scale * np.ceil(scale / ratio) <= hd_num:
        scale += 1
    scale -= 1
    new_w = int(scale * 336)
    new_h = int(new_w / ratio)

    img = torchvision.transforms.functional.resize(img, [new_h, new_w])
    img = padding_336(img)
    width, height = img.size
    if trans:
        img = img.transpose(Image.TRANSPOSE)

    return img


def calc_hd_transform_size(width, height, hd_num=16):
    transposed = False
    if width < height:
        width, height = height, width
        transposed = True

    ratio = width / height
    scale = 1
    while scale * np.ceil(scale / ratio) <= hd_num:
        scale += 1
    scale -= 1

    new_width = int(scale * 336)
    new_height = int(new_width / ratio)

    padded_width, padded_height = calc_padded_size(new_width, new_height)

    if transposed:
        padded_width, padded_height = padded_height, padded_width

    return padded_width, padded_height
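To see the transform arithmetic concretely, here is a standalone sketch that mirrors `calc_hd_transform_size` and the token-count formula with `math` in place of NumPy; `hd_size` and `num_tokens` are illustrative names, not functions from this file:

```python
import math

def hd_size(width, height, hd_num=16):
    # mirrors calc_hd_transform_size: pick the largest grid of 336px tiles
    # with at most hd_num tiles, then pad height to a multiple of 336
    transposed = width < height
    if transposed:
        width, height = height, width
    ratio = width / height
    scale = 1
    while scale * math.ceil(scale / ratio) <= hd_num:
        scale += 1
    scale -= 1
    new_w = int(scale * 336)
    new_h = int(new_w / ratio)
    pad_h = int(math.ceil(new_h / 336) * 336)
    return (pad_h, new_w) if transposed else (new_w, pad_h)

def num_tokens(w, h):
    # mirrors the formula used in calc_num_image_tokens: 144 tokens per
    # 336px tile, plus separators per row and one global separator
    return int((h // 336 * w // 336 + 1) * 144 + 1 + (h // 336 + 1) * 12)

w, h = hd_size(1024, 768)  # a 4:3 image scales to a 4x3 grid: (1344, 1008)
print(w, h, num_tokens(w, h))
```

For 1024x768 input this yields a 1344x1008 canvas (12 crops plus the global view) and 1921 image tokens.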

def pad_to_max_num_crops_tensor(images, max_crops=5):
    """
    images: B x 3 x H x W, B <= max_crops
    """
    B, _, H, W = images.shape
    if B < max_crops:
        pad = torch.zeros(max_crops - B, 3, H, W, dtype=images.dtype, device=images.device)
        images = torch.cat([images, pad], dim=0)
    return images
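A quick standalone check of the crop-padding step (`pad_crops` is a local re-implementation of the function above, so the snippet runs on its own): batches with fewer than `max_crops` crops are zero-padded so every image in a batch stacks to the same shape.

```python
import torch

def pad_crops(images, max_crops=5):
    # same logic as pad_to_max_num_crops_tensor: append all-zero crops
    B, _, H, W = images.shape
    if B < max_crops:
        pad = torch.zeros(max_crops - B, 3, H, W, dtype=images.dtype, device=images.device)
        images = torch.cat([images, pad], dim=0)
    return images

out = pad_crops(torch.ones(3, 3, 336, 336))
print(out.shape)  # torch.Size([5, 3, 336, 336])
```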


class Phi3VImageProcessor(BaseImageProcessor):
    r"""
    Constructs a Phi3 image processor. Based on [`CLIPImageProcessor`], with additional techniques for processing
    high-resolution images, as explained in [InternLM-XComposer2-4KHD](https://arxiv.org/abs/2401.16420).

    Args:
        num_crops (`int`, *optional*, defaults to 1):
            Maximum number of 336x336 crops produced by the HD transform.
        image_mean (`float` or `List[float]`, *optional*, defaults to `[0.48145466, 0.4578275, 0.40821073]`):
            Mean to use if normalizing the image. This is a float or list of floats the length of the number of
            channels in the image. Can be overridden by the `image_mean` parameter in the `preprocess` method.
        image_std (`float` or `List[float]`, *optional*, defaults to `[0.26862954, 0.26130258, 0.27577711]`):
            Standard deviation to use if normalizing the image. This is a float or list of floats the length of the
            number of channels in the image. Can be overridden by the `image_std` parameter in the `preprocess` method.
        do_convert_rgb (`bool`, *optional*, defaults to `True`):
            Whether to convert the image to RGB.
    """

    model_input_names = ["pixel_values"]

    def __init__(
        self,
        num_crops: int = 1,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.num_crops = num_crops
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_convert_rgb = do_convert_rgb

    def calc_num_image_tokens(
        self,
        images: ImageInput
    ):
        """Calculate the number of image tokens for each image.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        """
        images = make_list_of_images(images)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        images = [image.convert('RGB') for image in images]
        # (H, W, C)
        elems = [HD_transform(im, hd_num=self.num_crops) for im in images]
        shapes = [[im.size[1], im.size[0]] for im in elems]
        num_img_tokens = [int((h // 336 * w // 336 + 1) * 144 + 1 + (h // 336 + 1) * 12) for h, w in shapes]
        return num_img_tokens

    def calc_num_image_tokens_from_image_size(self, width, height):
        """
        Calculate the number of image tokens for a given image size.

        Args:
            width (`int`): Width of the image.
            height (`int`): Height of the image.
        """
        new_width, new_height = calc_hd_transform_size(width, height, hd_num=self.num_crops)
        num_img_tokens = int((new_height // 336 * new_width // 336 + 1) * 144 + 1 + (new_height // 336 + 1) * 12)
        return num_img_tokens

    def preprocess(
        self,
        images: ImageInput,
        image_mean: Optional[Union[float, List[float]]] = None,
        image_std: Optional[Union[float, List[float]]] = None,
        do_convert_rgb: bool = None,
        return_tensors: Optional[Union[str, TensorType]] = None,
    ):
        """
        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
            image_mean (`float` or `List[float]`, *optional*, defaults to `self.image_mean`):
                Image mean to use for normalization. Only has an effect if `do_normalize` is set to `True`.
            image_std (`float` or `List[float]`, *optional*, defaults to `self.image_std`):
                Image standard deviation to use for normalization. Only has an effect if `do_normalize` is set to
                `True`.
            do_convert_rgb (`bool`, *optional*, defaults to `self.do_convert_rgb`):
                Whether to convert the image to RGB.
            return_tensors (`str` or `TensorType`, *optional*):
                The type of tensors to return. Can be one of:
                - Unset: Return a list of `np.ndarray`.
                - `TensorType.TENSORFLOW` or `'tf'`: Return a batch of type `tf.Tensor`.
                - `TensorType.PYTORCH` or `'pt'`: Return a batch of type `torch.Tensor`.
                - `TensorType.NUMPY` or `'np'`: Return a batch of type `np.ndarray`.
                - `TensorType.JAX` or `'jax'`: Return a batch of type `jax.numpy.ndarray`.
        """
        image_mean = image_mean if image_mean is not None else self.image_mean
        image_std = image_std if image_std is not None else self.image_std
        do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb

        images = make_list_of_images(images)

        if not valid_images(images):
            raise ValueError(
                "Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
                "torch.Tensor, tf.Tensor or jax.ndarray."
            )

        if do_convert_rgb:
            images = [convert_to_rgb(image) for image in images]

        image_sizes = []
        img_processor = torchvision.transforms.Compose([
            torchvision.transforms.ToTensor(),
            torchvision.transforms.Normalize(image_mean, image_std)
        ])

        # PIL images
        # HD_transform pads images to sizes that are multiples of 336
        # convert to RGB first
        images = [image.convert('RGB') for image in images]
        elems = [HD_transform(im, hd_num=self.num_crops) for im in images]
        # tensor transform and normalize
        hd_images = [img_processor(im) for im in elems]
        # create the downsampled global image
        global_image = [torch.nn.functional.interpolate(im.unsqueeze(0).float(), size=(336, 336), mode='bicubic').to(im.dtype) for im in hd_images]

        # [(3, h, w)], where h, w are multiples of 336
        shapes = [[im.size(1), im.size(2)] for im in hd_images]
        num_img_tokens = [int((h // 336 * w // 336 + 1) * 144 + 1 + (h // 336 + 1) * 12) for h, w in shapes]
        # reshape to channel dimension -> (num_images, num_crops, 3, 336, 336)
        # (1, 3, h//336, 336, w//336, 336) -> (1, h//336, w//336, 3, 336, 336) -> (h//336*w//336, 3, 336, 336)
        hd_images_reshape = [im.reshape(1, 3, h // 336, 336, w // 336, 336).permute(0, 2, 4, 1, 3, 5).reshape(-1, 3, 336, 336).contiguous() for im, (h, w) in zip(hd_images, shapes)]
        # concat global image and local crops
        hd_images_reshape = [torch.cat([_global_image] + [_im], dim=0) for _global_image, _im in zip(global_image, hd_images_reshape)]

        # pad to max_num_crops
        image_transformed = [pad_to_max_num_crops_tensor(im, self.num_crops + 1) for im in hd_images_reshape]
        image_transformed = torch.stack(image_transformed, dim=0)
        image_sizes = [torch.LongTensor(_shapes) for _shapes in shapes]
        padded_images = image_transformed
        image_sizes = shapes  # overrides the tensor version; plain [h, w] lists are returned

        data = {"pixel_values": padded_images,
                "image_sizes": image_sizes,
                "num_img_tokens": num_img_tokens
                }

        return BatchFeature(data=data, tensor_type=return_tensors)
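The tiling step in `preprocess` is just a reshape/permute dance. A standalone sketch with a hypothetical 672x1008 image (2x3 grid of 336px tiles) shows the shapes involved:

```python
import torch

# A normalized image tensor (C, H, W) whose H and W are multiples of 336.
h, w = 672, 1008
im = torch.zeros(3, h, w)

# Split into 336px tiles: (1, 3, h//336, 336, w//336, 336)
# -> (1, h//336, w//336, 3, 336, 336) -> (h//336 * w//336, 3, 336, 336)
crops = im.reshape(1, 3, h // 336, 336, w // 336, 336).permute(0, 2, 4, 1, 3, 5).reshape(-1, 3, 336, 336)
print(crops.shape)  # torch.Size([6, 3, 336, 336])
```

The 336x336 global view produced by `interpolate` is then prepended to these six local crops before padding to `num_crops + 1`.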

AutoImageProcessor.register("Phi3VImageProcessor", Phi3VImageProcessor)
1632
modeling_phi3_v.py
Normal file
File diff suppressed because it is too large
20
preprocessor_config.json
Normal file
@@ -0,0 +1,20 @@
{
  "auto_map": {
    "AutoProcessor": "processing_phi3_v.Phi3VProcessor",
    "AutoImageProcessor": "image_processing_phi3_v.Phi3VImageProcessor"
  },
  "num_crops": 16,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Phi3VImageProcessor",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "processor_class": "Phi3VProcessor",
  "num_img_tokens": 144
}
217
processing_phi3_v.py
Normal file
@@ -0,0 +1,217 @@
# coding=utf-8
# Copyright 2024 Microsoft and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
Processor class for Phi3-V.
"""
import re
from typing import List, Optional, Union

import torch

import transformers
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import ProcessorMixin
from transformers.tokenization_utils_base import PaddingStrategy, TextInput, TruncationStrategy
from transformers.utils import TensorType
from .image_processing_phi3_v import Phi3VImageProcessor
transformers.Phi3VImageProcessor = Phi3VImageProcessor


class Phi3VProcessor(ProcessorMixin):
    r"""
    Constructs a Phi3-V processor which wraps a Phi3-V image processor and a LLaMA tokenizer into a single processor.

    [`Phi3VProcessor`] offers all the functionalities of [`Phi3VImageProcessor`] and [`LlamaTokenizerFast`]. See
    [`~Phi3VProcessor.__call__`] and [`~Phi3VProcessor.decode`] for more information.

    Args:
        image_processor ([`Phi3VImageProcessor`], *optional*):
            The image processor is a required input.
        tokenizer ([`LlamaTokenizerFast`], *optional*):
            The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "Phi3VImageProcessor"
    tokenizer_class = ("LlamaTokenizer", "LlamaTokenizerFast")
    special_image_token = "<|image|>"

    def __init__(self, image_processor, tokenizer):
        self.image_processor = image_processor
        self.tokenizer = tokenizer
        self.num_img_tokens = image_processor.num_img_tokens
        self.img_tokens = [f"<|image_{i+1}|>" for i in range(1000000)]

    def __call__(
        self,
        text: Union[TextInput, List[TextInput]],
        images: ImageInput = None,
        padding: Union[bool, str, PaddingStrategy] = False,
        truncation: Union[bool, str, TruncationStrategy] = None,
        max_length=None,
        return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
    ) -> BatchFeature:
        """
        Main method to prepare one or several sequence(s) and image(s) for the model. This method forwards the `text`
        and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to
        encode the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
        Phi3VImageProcessor's [`~Phi3VImageProcessor.__call__`] if `images` is not `None`. Please refer to the
        docstrings of the above two methods for more information.

        Args:
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as a list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
                Select a strategy to pad the returned sequences (according to the model's padding side and padding
                index) among:
                - `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
                  sequence is provided).
                - `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
                  acceptable input length for the model if that argument is not provided.
                - `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
                  lengths).
            max_length (`int`, *optional*):
                Maximum length of the returned list and optionally padding length (see above).
            truncation (`bool`, *optional*):
                Activates truncation to cut input sequences longer than `max_length` to `max_length`.
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:

                - `'tf'`: Return TensorFlow `tf.constant` objects.
                - `'pt'`: Return PyTorch `torch.Tensor` objects.
                - `'np'`: Return NumPy `np.ndarray` objects.
                - `'jax'`: Return JAX `jnp.ndarray` objects.

        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
        """
        if images is not None:
            image_inputs = self.image_processor(images, return_tensors=return_tensors)
        else:
            image_inputs = {}
        inputs = self._convert_images_texts_to_inputs(image_inputs, text, padding=padding, truncation=truncation, max_length=max_length, return_tensors=return_tensors)
        return inputs

    def calc_num_image_tokens(self, images: ImageInput):
        """Calculate the number of image tokens for each image.

        Args:
            images (`ImageInput`):
                Image to preprocess. Expects a single or batch of images with pixel values ranging from 0 to 255. If
                passing in images with pixel values between 0 and 1, set `do_rescale=False`.
        """
        return self.image_processor.calc_num_image_tokens(images)

    def calc_num_image_tokens_from_image_size(self, width, height):
        """Calculate the number of image tokens for an image with the given width and height.

        Args:
            width (`int`):
                Width of the image.
            height (`int`):
                Height of the image.
        """
        return self.image_processor.calc_num_image_tokens_from_image_size(width, height)

    @property
    def special_image_token_id(self):
        return self.tokenizer.convert_tokens_to_ids(self.special_image_token)

    def get_special_image_token_id(self):
        return self.tokenizer.convert_tokens_to_ids(self.special_image_token)

    def _convert_images_texts_to_inputs(self, images, texts, padding=False, truncation=None, max_length=None, return_tensors=None):

        if not len(images):
            model_inputs = self.tokenizer(texts, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length)
            return BatchFeature(data={**model_inputs})

        pattern = r"<\|image_\d+\|>"
        prompt_chunks = [self.tokenizer(chunk).input_ids for chunk in re.split(pattern, texts)]

        if 'num_img_tokens' in images:
            num_img_tokens = images['num_img_tokens']
        else:
            assert 'num_crops' in images, 'num_crops must be provided in images if num_img_tokens is not provided'
            num_crops = images['num_crops']
            num_img_tokens = [_num_crops * self.num_img_tokens for _num_crops in num_crops]

        images, image_sizes = images['pixel_values'], images['image_sizes']

        # image tags need to be numbered from 1 to n
        image_tags = re.findall(pattern, texts)
        # image_ids = [int(s.split("|")[1].split("_")[-1]) * -1 for s in image_tags]
        # image_ids_pad = [[iid]*num_img_tokens[i] for i, iid in enumerate(image_ids)]
        image_ids = [int(s.split("|")[1].split("_")[-1]) for s in image_tags]
        unique_image_ids = sorted(list(set(image_ids)))
        # image_ids must start from 1 and be consecutive ints, e.g. [1, 2, 3]; [1, 4, 5] is invalid
        assert unique_image_ids == list(range(1, len(unique_image_ids) + 1)), f"image_ids must start from 1, and must be continuous int, e.g. [1, 2, 3], cannot be {unique_image_ids}"
        # the total number of images must match the number of image tags
        assert len(unique_image_ids) == len(images), f"total images must be the same as the number of image tags, got {len(unique_image_ids)} image tags and {len(images)} images"

        image_ids_pad = [[-iid] * num_img_tokens[iid - 1] for iid in image_ids]

        def insert_separator(X, sep_list):
            if len(X) > len(sep_list):
                sep_list.append([])
            return [ele for sublist in zip(X, sep_list) for ele in sublist]

        input_ids = []
        offset = 0
        for x in insert_separator(prompt_chunks, image_ids_pad):
            input_ids.extend(x[offset:])

        input_ids = torch.tensor(input_ids, dtype=torch.long).unsqueeze(0)
        attention_mask = (input_ids > -1000000).to(torch.long)

        return BatchFeature(data={"input_ids": input_ids,
                                  "attention_mask": attention_mask,
                                  "pixel_values": images,
                                  "image_sizes": image_sizes})
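The prompt-splitting step above can be traced without a tokenizer. Each `<|image_k|>` tag is pulled out of the text and later replaced by a run of negative placeholder ids (`-k`), which the embedding layer overwrites with image features. A toy walkthrough with a made-up prompt:

```python
import re

# Same tag pattern as in _convert_images_texts_to_inputs.
pattern = r"<\|image_\d+\|>"
text = "Describe <|image_1|> and compare with <|image_2|>."

chunks = re.split(pattern, text)      # text pieces between image tags
tags = re.findall(pattern, text)      # the tags themselves, in order
image_ids = [int(s.split("|")[1].split("_")[-1]) for s in tags]

print(chunks)     # ['Describe ', ' and compare with ', '.']
print(image_ids)  # [1, 2]
```

`insert_separator` then interleaves the tokenized chunks with the per-image placeholder runs, so image `k` always lands at the position its tag occupied in the prompt.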

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
3
pytorch_model-00001-of-00002.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a933d3a38f156f35737a63b9c860678d345103348cf2fb5d9475c490672306ed
size 4944236059
3
pytorch_model-00002-of-00002.bin
Normal file
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c4475ddd0bf401f1781ac6336b41e6479cf70b7bf7ba59579773ea3526c32bec
size 3349227876
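The two `.bin` entries above are Git LFS pointer files, not the weights themselves: short key/value text files following the LFS spec v1. Parsing one is straightforward, as in this minimal sketch (using the second pointer's values from above):

```python
# Parse a Git LFS pointer file: "key value" pairs, one per line (spec v1).
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:c4475ddd0bf401f1781ac6336b41e6479cf70b7bf7ba59579773ea3526c32bec\n"
    "size 3349227876\n"
)

fields = dict(line.split(" ", 1) for line in pointer_text.strip().splitlines())
algo, digest = fields["oid"].split(":", 1)  # hash algorithm and hex digest
size_bytes = int(fields["size"])            # size of the real object in bytes
```

The `size` values in the two pointers sum to the `total_size` recorded in the index file that follows.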
600
pytorch_model.bin.index.json
Normal file
@@ -0,0 +1,600 @@
{
  "metadata": {
    "total_size": 8293242880
  },
  "weight_map": {
    "lm_head.weight": "pytorch_model-00002-of-00002.bin",
    "model.embed_tokens.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.0.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.1.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.10.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.11.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.12.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.13.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.14.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.15.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.16.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.17.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.18.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.18.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.18.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.18.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.18.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.18.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.19.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.2.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.2.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.2.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.2.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.2.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.2.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.20.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.20.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.20.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.20.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.20.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.20.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.21.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.22.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.23.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.24.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.25.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.26.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.27.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.28.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.29.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.3.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.3.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.3.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.3.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.3.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.3.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.30.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.30.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.30.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.30.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.30.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.30.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.input_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.mlp.down_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.mlp.gate_up_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.post_attention_layernorm.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.self_attn.o_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.31.self_attn.qkv_proj.weight": "pytorch_model-00002-of-00002.bin",
    "model.layers.4.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.4.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.4.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.4.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.4.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.4.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.5.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.6.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.7.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.8.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.input_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.mlp.down_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.mlp.gate_up_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.post_attention_layernorm.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.self_attn.o_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.layers.9.self_attn.qkv_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.norm.weight": "pytorch_model-00002-of-00002.bin",
    "model.vision_embed_tokens.glb_GN": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.embeddings.class_embedding": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.embeddings.patch_embedding.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.embeddings.position_embedding.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.0.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.1.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.10.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.11.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.12.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.13.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.14.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.15.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.16.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.17.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.18.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.19.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.2.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.20.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.21.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.22.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.23.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.3.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.4.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.5.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.6.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.7.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.8.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.layer_norm2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc1.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc1.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.mlp.fc2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.k_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.k_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.out_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.out_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.q_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.q_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.v_proj.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.encoder.layers.9.self_attn.v_proj.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.post_layernorm.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.post_layernorm.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_processor.vision_model.pre_layrnorm.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_projection.0.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_projection.0.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_projection.2.bias": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.img_projection.2.weight": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.sub_GN": "pytorch_model-00001-of-00002.bin",
|
||||||
|
"model.vision_embed_tokens.wte.weight": "pytorch_model-00001-of-00002.bin"
|
||||||
|
}
|
||||||
|
}
|
||||||
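The weight-map entries above come from the checkpoint's shard index (`pytorch_model.bin.index.json` in Hugging Face repos): each parameter name maps to the shard file that stores it. A minimal stdlib-only sketch (the helper name is ours, not part of the repo) of how a loader consumes such an index:

```python
# Sketch of consuming a sharded-checkpoint weight map: given parameter names,
# find which shard files must be opened. The entries below are copied from the
# index above; a real loader would read the full pytorch_model.bin.index.json.
weight_map = {
    "model.vision_embed_tokens.sub_GN": "pytorch_model-00001-of-00002.bin",
    "model.vision_embed_tokens.wte.weight": "pytorch_model-00001-of-00002.bin",
}

def shards_needed(weight_map, param_names):
    """Return the set of shard files required to load the given parameters."""
    return {weight_map[name] for name in param_names}

print(shards_needed(weight_map, ["model.vision_embed_tokens.wte.weight"]))
# -> {'pytorch_model-00001-of-00002.bin'}
```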
129
sample_inference.py
Normal file
@@ -0,0 +1,129 @@
from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM
from transformers import AutoProcessor

model_path = "./"

kwargs = {}
kwargs['torch_dtype'] = torch.bfloat16

processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, torch_dtype="auto").cuda()

user_prompt = '<|user|>\n'
assistant_prompt = '<|assistant|>\n'
prompt_suffix = "<|end|>\n"

#################################################### text-only ####################################################
# text-only prompt
prompt = f"{user_prompt}what is the answer for 1+1? Explain it.{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
inputs = processor(prompt, images=None, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')

#################################################### text-only 2 ####################################################
# text-only prompt
prompt = f"{user_prompt}Give me the code for solving the two-sum problem.{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
inputs = processor(prompt, images=None, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')


#################################################### EXAMPLE 1 ####################################################
# single-image prompt
prompt = f"{user_prompt}<|image_1|>\nWhat is shown in this image?{prompt_suffix}{assistant_prompt}"
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
print(f">>> Prompt\n{prompt}")
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=True,
                                  clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')

#################################################### EXAMPLE 2 ####################################################
# multiple-image prompt
# Note: image tokens must start from <|image_1|>
prompt = f"{user_prompt}<|image_1|>\n<|image_2|>\nWhat is shown in these two images?{prompt_suffix}{assistant_prompt}"
print(f">>> Prompt\n{prompt}")
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image_1 = Image.open(requests.get(url, stream=True).raw)
url = "https://img.freepik.com/free-photo/painting-mountain-lake-with-mountain-background_188544-9126.jpg?w=2000"
image_2 = Image.open(requests.get(url, stream=True).raw)
images = [image_1, image_2]
inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')

#################################################### EXAMPLE 3 ####################################################
# chat template
chat = [
    {"role": "user", "content": "<|image_1|>\nWhat is shown in this image?"},
    {"role": "assistant", "content": "The image depicts a street scene with a prominent red stop sign in the foreground. The background showcases a building with traditional Chinese architecture, characterized by its red roof and ornate decorations. There are also several statues of lions, which are common in Chinese culture, positioned in front of the building. The street is lined with various shops and businesses, and there's a car passing by."},
    {"role": "user", "content": "What is so special about this image?"}
]
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
prompt = processor.tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
# Remove the trailing <|endoftext|> if present: it is used for training, not
# inference. For training, make sure to add <|endoftext|> at the end.
if prompt.endswith("<|endoftext|>"):
    prompt = prompt[:-len("<|endoftext|>")]

print(f">>> Prompt\n{prompt}")

inputs = processor(prompt, [image], return_tensors="pt").to("cuda:0")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')


############################# to markdown #############################
# single-image prompt
prompt = f"{user_prompt}<|image_1|>\nCan you convert the table to markdown format?{prompt_suffix}{assistant_prompt}"
url = "https://support.content.office.net/en-us/media/3dd2b79b-9160-403d-9967-af893d17b580.png"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")

print(f">>> Prompt\n{prompt}")
generate_ids = model.generate(**inputs,
                              max_new_tokens=1000,
                              eos_token_id=processor.tokenizer.eos_token_id,
                              )
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
                                  skip_special_tokens=False,
                                  clean_up_tokenization_spaces=False)[0]
print(f'>>> Response\n{response}')
36
special_tokens_map.json
Normal file
@@ -0,0 +1,36 @@
{
  "additional_special_tokens": [
    "<|system|>",
    "<|end|>",
    "<|user|>",
    "<|end|>"
  ],
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
93797
tokenizer.json
Normal file
File diff suppressed because it is too large
408
tokenizer_config.json
Normal file
@@ -0,0 +1,408 @@
{
  "add_bos_token": true,
  "add_eos_token": false,
  "added_tokens_decoder": {
    "0": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": false
    },
    "32000": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32001": {
      "content": "<|assistant|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32002": {
      "content": "<|placeholder1|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32003": {
      "content": "<|placeholder2|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32004": {
      "content": "<|placeholder3|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32005": {
      "content": "<|placeholder4|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32006": {
      "content": "<|system|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32007": {
      "content": "<|end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32008": {
      "content": "<|placeholder5|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32009": {
      "content": "<|placeholder6|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32010": {
      "content": "<|user|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "32011": {
      "content": "<|placeholder7|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32012": {
      "content": "<|placeholder8|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32013": {
      "content": "<|placeholder9|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32014": {
      "content": "<|placeholder10|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32015": {
      "content": "<|placeholder11|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32016": {
      "content": "<|placeholder12|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32017": {
      "content": "<|placeholder13|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32018": {
      "content": "<|placeholder14|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32019": {
      "content": "<|placeholder15|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32020": {
      "content": "<|placeholder16|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32021": {
      "content": "<|placeholder17|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32022": {
      "content": "<|placeholder18|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32023": {
      "content": "<|placeholder19|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32024": {
      "content": "<|placeholder20|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32025": {
      "content": "<|placeholder21|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32026": {
      "content": "<|placeholder22|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32027": {
      "content": "<|placeholder23|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32028": {
      "content": "<|placeholder24|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32029": {
      "content": "<|placeholder25|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32030": {
      "content": "<|placeholder26|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32031": {
      "content": "<|placeholder27|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32032": {
      "content": "<|placeholder28|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32033": {
      "content": "<|placeholder29|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32034": {
      "content": "<|placeholder30|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32035": {
      "content": "<|placeholder31|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32036": {
      "content": "<|placeholder32|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32037": {
      "content": "<|placeholder33|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32038": {
      "content": "<|placeholder34|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32039": {
      "content": "<|placeholder35|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32040": {
      "content": "<|placeholder36|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32041": {
      "content": "<|placeholder37|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32042": {
      "content": "<|placeholder38|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32043": {
      "content": "<|placeholder39|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    },
    "32044": {
      "content": "<|image|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": true,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|system|>",
    "<|end|>",
    "<|user|>",
    "<|end|>"
  ],
  "bos_token": "<s>",
  "chat_template": "{% for message in messages %}{{'<|' + message['role'] + '|>' + '\n' + message['content'] + '<|end|>\n' }}{% endfor %}{% if add_generation_prompt and messages[-1]['role'] != 'assistant' %}{{- '<|assistant|>\n' -}}{% endif %}",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|endoftext|>",
  "legacy": false,
  "model_max_length": 131072,
  "pad_token": "<|endoftext|>",
  "padding_side": "right",
  "sp_model_kwargs": {},
  "tokenizer_class": "LlamaTokenizer",
  "unk_token": "<unk>",
  "use_default_system_prompt": false
}
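The `chat_template` in the config above is what `apply_chat_template` renders in the sample script. A pure-Python sketch of the same rendering (not the tokenizer's real implementation, which is the Jinja template string itself) makes the resulting prompt format explicit:

```python
# Equivalent of the Jinja chat_template above, for illustration only:
# each turn becomes "<|role|>\n<content><|end|>\n", and an assistant header
# is appended when generation should continue the conversation.
def apply_chat_template(messages, add_generation_prompt=True):
    prompt = ""
    for message in messages:
        prompt += "<|" + message["role"] + "|>\n" + message["content"] + "<|end|>\n"
    if add_generation_prompt and messages[-1]["role"] != "assistant":
        prompt += "<|assistant|>\n"
    return prompt

print(apply_chat_template([{"role": "user", "content": "Hello"}]))
# -> <|user|>\nHello<|end|>\n<|assistant|>\n
```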