Initialize project; model provided by the ModelHub XC community

Model: AI-ModelScope/R-4B
Source: Original Platform
ModelHub XC
2026-05-21 17:44:12 +08:00
commit 3ddb0fa7ff
27 changed files with 3526 additions and 0 deletions

52
.gitattributes vendored Normal file

@@ -0,0 +1,52 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bin.* filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zstandard filter=lfs diff=lfs merge=lfs -text
*.tfevents* filter=lfs diff=lfs merge=lfs -text
*.db* filter=lfs diff=lfs merge=lfs -text
*.ark* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
**/*ckpt*.index filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.gguf* filter=lfs diff=lfs merge=lfs -text
*.ggml filter=lfs diff=lfs merge=lfs -text
*.llamafile* filter=lfs diff=lfs merge=lfs -text
*.pt2 filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
asset/R-4B.png filter=lfs diff=lfs merge=lfs -text
merges.txt filter=lfs diff=lfs merge=lfs -text
tokenizer.json filter=lfs diff=lfs merge=lfs -text
vocab.json filter=lfs diff=lfs merge=lfs -text

230
README.md Normal file

@@ -0,0 +1,230 @@
---
base_model:
- Qwen/Qwen3-4B
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
---
# R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning
[[📚 Arxiv Paper](https://arxiv.org/pdf/2508.21113)] [[🤗 Hugging Face](https://huggingface.co/YannQi/R-4B)] [[🤖️ ModelScope](https://huggingface.co/YannQi/R-4B)] [[💻 Code](https://github.com/yannqi/R-4B)]
<div align="center">
  <img src="asset/logo_R_4B.png" alt="logo" width="38" />
</div>

<div align="center">
  <img src="asset/R-4B.png" width="100%" alt="R-4B Performance">
</div>

## ⭐️ Introduction
In this repo, we present **R-4B**, a multimodal large language model designed for general-purpose auto-thinking, autonomously switching between step-by-step thinking and direct response generation based on task complexity. This capability enables R-4B to deliver high-quality responses while significantly improving inference efficiency and reducing computational costs.
The development of R-4B follows a two-stage training paradigm:
(1) Bi-mode Annealing, which establishes both thinking and non-thinking capabilities for VQA; and
(2) Bi-mode Policy Optimization (BPO), which enables the model to adaptively switch between thinking and non-thinking modes based on input demands.
## 🚀 Key Features
- 🧠 **Think Smart, Act Fast: Adaptive & Controllable Thinking!**
  Our model provides three-mode control over the response process.
  - **Auto-thinking Mode:** Unleash **auto-thinking** that works across general topics, from simple Q&A to complex scientific analysis. It saves time and computation by thinking only when it matters.
  - **Manual Control:** Explicitly instruct the model to use its `thinking` or `non-thinking` mode, so you can choose the behavior for each task.
- 🏆 **Strong Performance, Open for Everyone!**
  Our model is now **fully open-source**. It achieves **state-of-the-art performance** among models of comparable size.
## 📢 News
- **[2025.08.20]** 🚀 **vLLM Support is Here!** Our R-4B model is now fully compatible with [vLLM](https://github.com/vllm-project/vllm) for high-performance inference.
- **[2025.08.18]** 🏆 **Top Rank Achieved!** We are thrilled to announce that R-4B is now ranked #1 among all open-source models on the [OpenCompass Multi-modal Reasoning Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal-reasoning/?m=REALTIME)!
- **[2025.08.11]** 🥇 **Rank #1!** R-4B ranks first under 20B parameters on the [OpenCompass Multi-modal Academic Leaderboard](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME)!
- **[2025.08.05]** 🎉 **R-4B is Released!** Our model is now publicly available. You can download it from [Hugging Face](https://huggingface.co/YannQi/R-4B).
## 🔥 Quickstart
Below, we provide simple examples to show how to use R-4B with 🤗 Transformers.
### Using 🤗 Transformers to Chat
> [!NOTE]
> Users can dynamically control the model's response by selecting one of three modes (`auto-thinking`, `thinking`, or `non-thinking`) with `thinking_mode`. `thinking_mode=auto` for `auto-thinking` mode; `thinking_mode=long` for `thinking` mode; `thinking_mode=short` for `non-thinking` mode.
> Default is `auto-thinking`.
```python
import requests
from PIL import Image
import torch
from transformers import AutoModel, AutoProcessor

model_path = "YannQi/R-4B"

# Load model
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.float32,
    trust_remote_code=True,
).to("cuda")

# Load processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Define conversation messages
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "http://images.cocodataset.org/val2017/000000039769.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    thinking_mode="auto"
)

# Load image
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(image_url, stream=True).raw)

# Process inputs
inputs = processor(
    images=image,
    text=text,
    return_tensors="pt"
).to("cuda")

# Generate output
generated_ids = model.generate(**inputs, max_new_tokens=16384)
output_ids = generated_ids[0][len(inputs.input_ids[0]):]

# Decode output
output_text = processor.decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Print result
print("Auto-Thinking Output:", output_text)
```
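When the model does reason step by step, the decoded text may contain the reasoning followed by a closing `</think>` tag and then the final answer. The helper below is a minimal sketch for separating the two. It assumes that the `</think>` tag survives decoding, which depends on how the tokenizer flags that token and is not confirmed by this README. `split_thinking` is a hypothetical name and not part of the R-4B code.

```python
def split_thinking(output_text: str, close_tag: str = "</think>") -> tuple[str, str]:
    """Split decoded text into (reasoning, answer).

    Assumption: the reasoning, if present, ends with `close_tag`.
    If the tag is absent, the whole text is treated as the answer.
    """
    reasoning, sep, answer = output_text.partition(close_tag)
    if not sep:
        return "", output_text.strip()
    # The prompt normally ends with "<think>" already; strip it defensively.
    reasoning = reasoning.replace("<think>", "", 1).strip()
    return reasoning, answer.strip()


# Illustrative input only, not real model output.
reasoning, answer = split_thinking("Count the animals first.\n</think>\n\nTwo cats are lying on a couch.")
print("Reasoning:", reasoning)
print("Answer:", answer)
```

In `non-thinking` mode the reasoning part would simply come back empty under the same assumption.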
### Using vLLM for Fast R-4B Deployment and Inference
- We recommend using vLLM for fast R-4B deployment and inference.
#### Install
R-4B currently requires the latest vLLM. Please install it from source:
```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
VLLM_USE_PRECOMPILED=1 uv pip install --editable .
```
#### Online Serving
> [!TIP]
> The `thinking_mode` switch is also available in APIs created by [vLLM](https://github.com/vllm-project/vllm).
> Default is `auto-thinking`.
- Serve
```bash
vllm serve \
yannqi/R-4B \
--served-model-name r4b \
--tensor-parallel-size 8 \
--gpu-memory-utilization 0.8 \
--host 0.0.0.0 \
--port 8000 \
--trust-remote-code
```
- OpenAI Chat Completion Client
```python
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Image URL
image_messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "http://images.cocodataset.org/val2017/000000039769.jpg"
                },
            },
            {"type": "text", "text": "Describe this image."},
        ],
    },
]

chat_response = client.chat.completions.create(
    model="r4b",
    messages=image_messages,
    max_tokens=16384,
    extra_body={
        "chat_template_kwargs": {"thinking_mode": "auto"},
    },
)
print("Chat response:", chat_response)
```
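To send a local image instead of a public URL, one common approach is to embed the file as a base64 data URL in the `image_url` field. The sketch below assumes the server accepts data URLs there, as OpenAI-compatible endpoints commonly do; this README does not confirm it. `image_to_data_url` and `build_image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Fallback when the file extension is unknown.
        mime_type = "application/octet-stream"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(path: str, prompt: str) -> list[dict]:
    """Build a one-turn message list shaped like `image_messages` above."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
                {"type": "text", "text": prompt},
            ],
        }
    ]
```

The result can be passed as `messages=build_image_message("cat.jpg", "Describe this image.")` in the client call, where `cat.jpg` is a placeholder path.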
## 📈 Experimental Results
<div align="center">
  <img src="asset/performance.png" width="100%" alt="R-4B Performance">
</div>

1. R-4B achieves state-of-the-art perceptual performance and is competitive with larger models.
2. On benchmarks that require complex logical reasoning and mathematical problem-solving, such as WeMath, MathVerse, and LogicVista, R-4B performs strongly, which highlights its adaptive thinking capacity for logical deduction and complex quantitative problems.
## ✒️ Citation
```
@misc{yang2025r4bincentivizinggeneralpurposeautothinking,
title={R-4B: Incentivizing General-Purpose Auto-Thinking Capability in MLLMs via Bi-Mode Annealing and Reinforce Learning},
author={Qi Yang and Bolin Ni and Shiming Xiang and Han Hu and Houwen Peng and Jie Jiang},
year={2025},
eprint={2508.21113},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2508.21113},
}
```
## Acknowledgements
R-4B is developed based on the codebases of the following projects: [LLaVA-Next](https://github.com/LLaVA-VL/LLaVA-NeXT), [SigLIP2](https://huggingface.co/google/siglip2-so400m-patch14-384), [Qwen3](https://github.com/QwenLM/Qwen3), [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL), [VLMEvalKit](https://github.com/open-compass/VLMEvalKit). We sincerely thank these projects for their outstanding work.

30
added_tokens.json Normal file

@@ -0,0 +1,30 @@
{
"</think>": 151668,
"</tool_call>": 151658,
"</tool_response>": 151666,
"<image>": 151669,
"<think>": 151667,
"<tool_call>": 151657,
"<tool_response>": 151665,
"<video>": 151670,
"<|box_end|>": 151649,
"<|box_start|>": 151648,
"<|endoftext|>": 151643,
"<|file_sep|>": 151664,
"<|fim_middle|>": 151660,
"<|fim_pad|>": 151662,
"<|fim_prefix|>": 151659,
"<|fim_suffix|>": 151661,
"<|im_end|>": 151645,
"<|im_start|>": 151644,
"<|image_pad|>": 151655,
"<|object_ref_end|>": 151647,
"<|object_ref_start|>": 151646,
"<|quad_end|>": 151651,
"<|quad_start|>": 151650,
"<|repo_name|>": 151663,
"<|video_pad|>": 151656,
"<|vision_end|>": 151653,
"<|vision_pad|>": 151654,
"<|vision_start|>": 151652
}
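The special token IDs in `config.json` (shown further down in this commit) refer to entries of this file. A small consistency check, written as a sketch with values copied by hand from both listings, makes the mapping explicit. The variable names are illustrative only.

```python
# Subset of added_tokens.json (copied from the listing above).
added_tokens = {
    "<|endoftext|>": 151643,
    "<|im_end|>": 151645,
    "<think>": 151667,
    "</think>": 151668,
    "<image>": 151669,
    "<video>": 151670,
}

# Values copied from config.json in this commit.
config = {
    "eos_token_id": 151645,
    "pad_token_id": 151643,
    "image_token_index": 151669,
    "video_token_index": 151670,
}
text_vocab_size = 152000  # config.json -> text_config.vocab_size

id_to_token = {token_id: token for token, token_id in added_tokens.items()}
for field, token_id in config.items():
    print(f"{field} = {token_id} -> {id_to_token[token_id]}")

# Every added token must index a valid embedding row.
assert max(added_tokens.values()) < text_vocab_size
```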

3
asset/R-4B.png Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:3cd7455ff84e7a7075e77b8eb5d6c937d62969e09030923ae05d0258074aa1a0
size 1221298

BIN
asset/logo_R_4B.png Normal file

Binary file not shown (size: 475 KiB).

BIN
asset/performance.png Normal file

Binary file not shown (size: 338 KiB).

11
chat_template.jinja Normal file

@@ -0,0 +1,11 @@
{% for message in messages %}{{'<|im_start|>' + message['role'] + '
'}}{# Render all images first #}{% for content in message['content'] | selectattr('type', 'equalto', 'image') %}{{ '<image>
' }}{% endfor %}{# Render all video then #}{% for content in message['content'] | selectattr('type', 'equalto', 'video') %}{{ '<video>
' }}{% endfor %}{# Render all text next #}{% if message['role'] != 'assistant' %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{{ content['text'] }}{% endfor %}{% else %}{% for content in message['content'] | selectattr('type', 'equalto', 'text') %}{% generation %}{{ content['text'] }}{% endgeneration %}{% endfor %}{% endif %}{{'<|im_end|>' + '
'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant
<think>' }}{% endif %}{%- if add_generation_prompt %}{%- if thinking_mode is defined and thinking_mode == 'short' %}{{- '
</think>
' }}{%- endif %}{%- if thinking_mode is defined and thinking_mode == 'long' %}{{- '
' }}{%- endif %}{%- endif %}
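The tail of this template is what implements the three `thinking_mode` values described in the README: the assistant turn always opens with `<think>`, the `short` mode immediately closes the reasoning block, and the `long` mode leaves it open. The Python sketch below mirrors only that generation-prompt logic. The string literals are copied from the listing as displayed here, so the exact number of newlines is an assumption (this view may have collapsed blank lines), and `generation_prompt` is a hypothetical name.

```python
# Literals copied from the template listing above; the original file may
# contain additional blank lines that this view collapsed (assumption).
ASSISTANT_PREFIX = "<|im_start|>assistant\n<think>"
MODE_SUFFIX = {
    "auto": "",               # nothing appended: the model decides whether to reason
    "long": "\n",             # thinking mode: reasoning continues after <think>
    "short": "\n</think>\n",  # non-thinking mode: the reasoning block is pre-closed
}


def generation_prompt(thinking_mode: str = "auto") -> str:
    """Return the text appended when add_generation_prompt is true."""
    # The template only tests for 'short' and 'long'; any other value
    # (or an undefined one) falls through to the auto behaviour.
    return ASSISTANT_PREFIX + MODE_SUFFIX.get(thinking_mode, "")


for mode in ("auto", "long", "short"):
    print(mode, repr(generation_prompt(mode)))
```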

94
config.json Normal file
View File

@@ -0,0 +1,94 @@
{
"auto_map": {
"AutoConfig": "configuration_r.RConfig",
"AutoModel": "modeling_r.RForConditionalGeneration",
"AutoModelForCausalLM": "modeling_r.RForConditionalGeneration"
},
"architectures": [
"RForConditionalGeneration"
],
"eos_token_id": 151645,
"image_grid_pinpoints": [
[
384,
768
],
[
768,
384
],
[
768,
768
],
[
1152,
384
],
[
384,
1152
]
],
"image_token_index": 151669,
"model_type": "R",
"multimodal_projector_bias": true,
"pad_token_id": 151643,
"projector_hidden_act": "gelu",
"text_config": {
"_name_or_path": "Qwen/Qwen3-4B",
"architectures": [
"Qwen3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"head_dim": 128,
"hidden_act": "silu",
"hidden_size": 2560,
"initializer_range": 0.02,
"intermediate_size": 9728,
"max_position_embeddings": 40960,
"max_window_layers": 36,
"model_type": "qwen3",
"num_attention_heads": 32,
"num_hidden_layers": 36,
"num_key_value_heads": 8,
"rms_norm_eps": 1e-06,
"rope_scaling": null,
"rope_theta": 1000000,
"sliding_window": null,
"tie_word_embeddings": true,
"torch_dtype": "float32",
"use_cache": true,
"use_sliding_window": false,
"vocab_size": 152000
},
"tie_word_embeddings": true,
"torch_dtype": "float32",
"transformers_version": "4.52.0",
"use_image_newline_parameter": true,
"video_token_index": 151670,
"vision_aspect_ratio": "anyres",
"vision_config": {
"auto_map": {
"AutoConfig": "configuration_r.RConfig"
},
"attention_dropout": 0.0,
"hidden_act": "gelu_pytorch_tanh",
"hidden_size": 1152,
"image_size": 384,
"intermediate_size": 4304,
"layer_norm_eps": 1e-06,
"model_type": "siglip_vision_model",
"num_attention_heads": 16,
"num_channels": 3,
"num_hidden_layers": 26,
"patch_size": 14,
"torch_dtype": "float32",
"vision_use_head": false
},
"vision_feature_layer": -1,
"vision_feature_select_strategy": "full"
}
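The vision settings above can be read as simple arithmetic: each 384x384 tile is cut into 14x14 patches, and each entry of `image_grid_pinpoints` is a (height, width) canvas made of whole tiles. The sketch below works this out. It assumes leftover pixels are dropped by floor division, as patch embeddings typically do, and it says nothing about how many tokens the language model finally receives, since that depends on modeling code not shown in this section.

```python
IMAGE_SIZE = 384  # vision_config.image_size
PATCH_SIZE = 14   # vision_config.patch_size
GRID_PINPOINTS = [[384, 768], [768, 384], [768, 768], [1152, 384], [384, 1152]]

# Non-overlapping 14x14 patches along one side of a 384x384 tile.
patches_per_side = IMAGE_SIZE // PATCH_SIZE
patches_per_tile = patches_per_side ** 2


def tiles_for(resolution: list[int]) -> int:
    """Number of IMAGE_SIZE x IMAGE_SIZE tiles covering a (height, width) canvas."""
    height, width = resolution
    return (height // IMAGE_SIZE) * (width // IMAGE_SIZE)


print(f"{patches_per_side} x {patches_per_side} = {patches_per_tile} patches per tile")
for resolution in GRID_PINPOINTS:
    print(resolution, "->", tiles_for(resolution), "tiles")
```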

1
configuration.json Normal file
View File

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "visual-question-answering", "allow_remote": true}

101
configuration_r.py Normal file
View File

@@ -0,0 +1,101 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import (
    logging,
)


logger = logging.get_logger(__name__)


class RConfig(PretrainedConfig):
    model_type = "R"
    attribute_map = {
        "image_token_id": "image_token_index",
    }
    # sub_configs = {"text_config": AutoConfig, "vision_config": AutoConfig}

    def __init__(
        self,
        vision_config=None,
        text_config=None,
        image_token_index=151646,
        projector_hidden_act="gelu",
        vision_feature_select_strategy="full",
        vision_feature_layer=-1,
        vision_aspect_ratio="anyres",
        image_grid_pinpoints=None,
        tie_word_embeddings=False,
        multimodal_projector_bias=True,
        max_position_embeddings=32768,
        **kwargs,
    ):
        from transformers.models.auto import CONFIG_MAPPING, AutoConfig  # for vllm

        self.image_token_index = image_token_index
        self.projector_hidden_act = projector_hidden_act
        self.multimodal_projector_bias = multimodal_projector_bias

        if vision_feature_select_strategy not in ["default", "full"]:
            raise ValueError(
                "vision_feature_select_strategy should be one of 'default', 'full'. "
                f"Got: {vision_feature_select_strategy}"
            )

        self.vision_feature_select_strategy = vision_feature_select_strategy
        self.vision_feature_layer = vision_feature_layer
        self.vision_aspect_ratio = vision_aspect_ratio
        image_grid_pinpoints = (
            image_grid_pinpoints
            if image_grid_pinpoints is not None
            else [[384, 768], [768, 384], [768, 768], [1152, 384], [384, 1152]]
        )
        self.image_grid_pinpoints = image_grid_pinpoints

        # Build the vision sub-config from a dict, or fall back to SigLIP defaults.
        if isinstance(vision_config, dict):
            vision_config["model_type"] = (
                vision_config["model_type"] if "model_type" in vision_config else "siglip_vision_model"
            )
            vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
        elif vision_config is None:
            vision_config = CONFIG_MAPPING["siglip_vision_model"](
                hidden_size=1152,
                intermediate_size=4304,
                patch_size=14,
                image_size=384,
                num_hidden_layers=26,
                num_attention_heads=14,
                vision_use_head=False,
            )

        self.vision_config = vision_config

        # Build the text sub-config from a dict, or fall back to the Qwen2 defaults.
        if isinstance(text_config, dict):
            text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "qwen2"
            text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
        elif text_config is None:
            text_config = CONFIG_MAPPING["qwen2"]()

        self.text_config = text_config

        super().__init__(tie_word_embeddings=tie_word_embeddings, **kwargs)


__all__ = ["RConfig"]

6
generation_config.json Normal file
View File

@@ -0,0 +1,6 @@
{
"_from_model_config": true,
"bos_token_id": 151643,
"eos_token_id": 151645,
"transformers_version": "4.54.1"
}

499
image_processing_r.py Normal file
View File

@@ -0,0 +1,499 @@
# coding=utf-8
# Copyright 2024 The HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections.abc import Iterable
from typing import Optional, Union

import numpy as np

from transformers.image_processing_utils import (
    BaseImageProcessor,
    BatchFeature,
    get_patch_output_size,
    get_size_dict,
    select_best_resolution,
)
from transformers.image_transforms import (
    PaddingMode,
    convert_to_rgb,
    pad,
    resize,
    to_channel_dimension_format,
)
from transformers.image_utils import (
    OPENAI_CLIP_MEAN,
    OPENAI_CLIP_STD,
    ChannelDimension,
    ImageInput,
    PILImageResampling,
    get_image_size,
    infer_channel_dimension_format,
    is_scaled_image,
    make_flat_list_of_images,
    to_numpy_array,
    valid_images,
    validate_preprocess_arguments,
)
from transformers.utils import TensorType, is_vision_available, logging


logger = logging.get_logger(__name__)


if is_vision_available():
    from PIL import Image
# Copied from transformers.models.llava_next.image_processing_llava_next.divide_to_patches
def divide_to_patches(image: np.array, patch_size: int, input_data_format) -> list[np.array]:
    """
    Divides an image into patches of a specified size.

    Args:
        image (`np.array`):
            The input image.
        patch_size (`int`):
            The size of each patch.
        input_data_format (`ChannelDimension` or `str`):
            The channel dimension format of the input image.

    Returns:
        list: A list of np.array representing the patches.
    """
    patches = []
    height, width = get_image_size(image, channel_dim=input_data_format)
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            if input_data_format == ChannelDimension.LAST:
                patch = image[i : i + patch_size, j : j + patch_size]
            else:
                patch = image[:, i : i + patch_size, j : j + patch_size]
            patches.append(patch)

    return patches


# Copied from transformers.models.llava_next.image_processing_llava_next.expand_to_square
def expand_to_square(image: np.array, background_color, input_data_format) -> np.array:
    """
    Expands an image to a square by adding a background color.
    """
    height, width = get_image_size(image, channel_dim=input_data_format)
    if width == height:
        return image
    elif width > height:
        result = np.ones((width, width, image.shape[2]), dtype=image.dtype) * background_color
        result[(width - height) // 2 : (width - height) // 2 + height, :] = image
        return result
    else:
        result = np.ones((height, height, image.shape[2]), dtype=image.dtype) * background_color
        result[:, (height - width) // 2 : (height - width) // 2 + width] = image
        return result
class RImageProcessor(BaseImageProcessor):
    model_input_names = ["pixel_values_videos"]

    def __init__(
        self,
        do_resize: bool = True,
        size: Optional[dict[str, int]] = None,
        image_grid_pinpoints: Optional[list] = None,
        resample: PILImageResampling = PILImageResampling.BICUBIC,
        do_rescale: bool = True,
        rescale_factor: Union[int, float] = 1 / 255,
        do_normalize: bool = True,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        do_pad: Optional[bool] = True,
        do_convert_rgb: bool = True,
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        size = size if size is not None else {"height": 384, "width": 384}
        size = get_size_dict(size, default_to_square=False)
        image_grid_pinpoints = (
            image_grid_pinpoints
            if image_grid_pinpoints is not None
            else [[384, 768], [768, 384], [768, 768], [1152, 384], [384, 1152]]
        )

        self.do_resize = do_resize
        self.size = size
        self.image_grid_pinpoints = image_grid_pinpoints
        self.resample = resample
        self.do_rescale = do_rescale
        self.rescale_factor = rescale_factor
        self.do_normalize = do_normalize
        self.image_mean = image_mean if image_mean is not None else OPENAI_CLIP_MEAN
        self.image_std = image_std if image_std is not None else OPENAI_CLIP_STD
        self.do_pad = do_pad
        self.do_convert_rgb = do_convert_rgb
    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.pad
    def pad(
        self,
        image: np.ndarray,
        padding: Union[int, tuple[int, int], Iterable[tuple[int, int]]],
        mode: PaddingMode = PaddingMode.CONSTANT,
        constant_values: Union[float, Iterable[float]] = 0.0,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.ndarray:
        # call the general `pad` if padding on `height/width`, otherwise it's the `num_patched` dim
        if isinstance(padding, int) or len(padding) != 4:
            return pad(image, padding, mode, constant_values, data_format, input_data_format)

        if input_data_format is None:
            input_data_format = infer_channel_dimension_format(image)
        if mode == PaddingMode.CONSTANT:
            image = np.pad(image, padding, mode="constant", constant_values=constant_values)
        elif mode == PaddingMode.REFLECT:
            image = np.pad(image, padding, mode="reflect")
        elif mode == PaddingMode.REPLICATE:
            image = np.pad(image, padding, mode="edge")
        elif mode == PaddingMode.SYMMETRIC:
            image = np.pad(image, padding, mode="symmetric")
        else:
            raise ValueError(f"Invalid padding mode: {mode}")
        image = (
            to_channel_dimension_format(image, data_format, input_data_format) if data_format is not None else image
        )
        return image

    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor._resize_for_patching
    def _resize_for_patching(
        self, image: np.array, target_resolution: tuple, resample, input_data_format: ChannelDimension
    ) -> np.array:
        new_height, new_width = get_patch_output_size(image, target_resolution, input_data_format)

        # Resize the image
        resized_image = resize(image, (new_height, new_width), resample=resample, input_data_format=input_data_format)

        return resized_image

    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor._get_padding_size
    def _get_padding_size(self, original_resolution: tuple, target_resolution: tuple):
        original_height, original_width = original_resolution
        target_height, target_width = target_resolution
        paste_x, r_x = divmod(target_width - original_width, 2)
        paste_y, r_y = divmod(target_height - original_height, 2)
        return (paste_y, paste_y + r_y), (paste_x, paste_x + r_x)

    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor._pad_for_patching
    def _pad_for_patching(
        self, image: np.array, target_resolution: tuple, input_data_format: ChannelDimension
    ) -> np.array:
        """
        Pad an image to a target resolution while maintaining aspect ratio.
        """
        new_resolution = get_patch_output_size(image, target_resolution, input_data_format)
        padding = self._get_padding_size(new_resolution, target_resolution)

        padded_image = self.pad(image, padding=padding)

        return padded_image
    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor.get_image_patches
    def get_image_patches(
        self,
        image: np.array,
        grid_pinpoints,
        size: tuple,
        patch_size: int,
        resample: PILImageResampling,
        data_format: ChannelDimension,
        input_data_format: ChannelDimension,
    ) -> list[np.array]:
        if not isinstance(grid_pinpoints, list):
            raise TypeError("grid_pinpoints must be a list of possible resolutions.")

        possible_resolutions = grid_pinpoints

        image_size = get_image_size(image, channel_dim=input_data_format)
        best_resolution = select_best_resolution(image_size, possible_resolutions)
        resized_image = self._resize_for_patching(
            image, best_resolution, resample=resample, input_data_format=input_data_format
        )
        padded_image = self._pad_for_patching(resized_image, best_resolution, input_data_format=input_data_format)

        patches = divide_to_patches(padded_image, patch_size=patch_size, input_data_format=input_data_format)

        # make sure that all patches are in the input data format
        patches = [
            to_channel_dimension_format(patch, channel_dim=data_format, input_channel_dim=input_data_format)
            for patch in patches
        ]

        resized_original_image = resize(
            image,
            size=size,
            resample=resample,
            data_format=data_format,
            input_data_format=input_data_format,
        )

        image_patches = [resized_original_image] + patches

        return image_patches

    # Copied from transformers.models.llava_next.image_processing_llava_next.LlavaNextImageProcessor._pad_for_batching
    def _pad_for_batching(
        self,
        pixel_values: list[np.ndarray],
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ):
        max_patch = max(len(x) for x in pixel_values)
        pixel_values = [
            self.pad(
                image,
                padding=((0, max_patch - image.shape[0]), (0, 0), (0, 0), (0, 0)),
                data_format=data_format,
                input_data_format=input_data_format,
            )
            for image in pixel_values
        ]

        return pixel_values
    # Copied from transformers.models.llava.image_processing_llava.LlavaImageProcessor.pad_to_square
    def pad_to_square(
        self,
        image: np.ndarray,
        background_color: Union[int, tuple[int, int, int]] = 0,
        data_format: Optional[Union[str, ChannelDimension]] = None,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> np.array:
        height, width = get_image_size(image, input_data_format)
        num_channels = image.shape[0] if input_data_format == ChannelDimension.FIRST else image.shape[-1]

        if height == width:
            image = (
                to_channel_dimension_format(image, data_format, input_data_format)
                if data_format is not None
                else image
            )
            return image

        max_dim = max(height, width)

        # Ensure background_color is the correct shape
        if isinstance(background_color, int):
            background_color = [background_color]
        elif len(background_color) != num_channels:
            raise ValueError(
                f"background_color must have no more than {num_channels} elements to match the number of channels"
            )

        if input_data_format == ChannelDimension.FIRST:
            result = np.zeros((num_channels, max_dim, max_dim), dtype=image.dtype)
            for i, color in enumerate(background_color):
                result[i, :, :] = color
            if width > height:
                start = (max_dim - height) // 2
                result[:, start : start + height, :] = image
            else:
                start = (max_dim - width) // 2
                result[:, :, start : start + width] = image
        else:
            result = np.zeros((max_dim, max_dim, num_channels), dtype=image.dtype)
            for i, color in enumerate(background_color):
                result[:, :, i] = color
            if width > height:
                start = (max_dim - height) // 2
                result[start : start + height, :, :] = image
            else:
                start = (max_dim - width) // 2
                result[:, start : start + width, :] = image

        image = (
            to_channel_dimension_format(result, data_format, input_data_format) if data_format is not None else result
        )
        return image
    def _preprocess(
        self,
        images: ImageInput,
        do_resize: Optional[bool] = None,
        size: Optional[dict[str, int]] = None,
        resample: PILImageResampling = None,
        do_rescale: Optional[bool] = None,
        rescale_factor: Optional[float] = None,
        do_normalize: Optional[bool] = None,
        image_mean: Optional[Union[float, list[float]]] = None,
        image_std: Optional[Union[float, list[float]]] = None,
        do_convert_rgb: Optional[bool] = None,
        data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
        input_data_format: Optional[Union[str, ChannelDimension]] = None,
    ) -> Image.Image:
        if do_resize:
            images = [
                resize(image=image, size=size, resample=resample, input_data_format=input_data_format)
                for image in images
            ]

        if do_rescale:
            images = [
                self.rescale(image=image, scale=rescale_factor, input_data_format=input_data_format)
                for image in images
            ]

        if do_normalize:
            images = [
                self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
                for image in images
            ]

        images = [
            to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
        ]

        return images
def preprocess(
self,
images: ImageInput,
do_resize: Optional[bool] = None,
size: Optional[dict[str, int]] = None,
image_grid_pinpoints: Optional[list] = None,
resample: PILImageResampling = None,
do_rescale: Optional[bool] = None,
rescale_factor: Optional[float] = None,
do_normalize: Optional[bool] = None,
image_mean: Optional[Union[float, list[float]]] = None,
image_std: Optional[Union[float, list[float]]] = None,
do_pad: Optional[bool] = None,
do_convert_rgb: Optional[bool] = None,
return_tensors: Optional[Union[str, TensorType]] = None,
data_format: Optional[ChannelDimension] = ChannelDimension.FIRST,
input_data_format: Optional[Union[str, ChannelDimension]] = None,
):
do_resize = do_resize if do_resize is not None else self.do_resize
size = size if size is not None else self.size
size = get_size_dict(size, default_to_square=False)
image_grid_pinpoints = image_grid_pinpoints if image_grid_pinpoints is not None else self.image_grid_pinpoints
resample = resample if resample is not None else self.resample
do_rescale = do_rescale if do_rescale is not None else self.do_rescale
rescale_factor = rescale_factor if rescale_factor is not None else self.rescale_factor
do_normalize = do_normalize if do_normalize is not None else self.do_normalize
image_mean = image_mean if image_mean is not None else self.image_mean
image_std = image_std if image_std is not None else self.image_std
do_pad = do_pad if do_pad is not None else self.do_pad
do_convert_rgb = do_convert_rgb if do_convert_rgb is not None else self.do_convert_rgb
if isinstance(images, (tuple, list)) and isinstance(images[0], (tuple, list)):
# if the first element is a list, we assume that all elements are lists
batch_num_images = [len(x) for x in images]
elif isinstance(images, (tuple, list)):
# treat this as a single-image case for backward compatibility
batch_num_images = [1] * len(images)
else:
batch_num_images = [1]
# only single image patching is supported
need_patching = [n == 1 for n in batch_num_images for _ in range(n)]
images = make_flat_list_of_images(images)
if not valid_images(images):
raise ValueError(
"Invalid image type. Must be of type PIL.Image.Image, numpy.ndarray, "
"torch.Tensor, tf.Tensor or jax.ndarray."
)
validate_preprocess_arguments(
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
do_resize=do_resize,
size=size,
resample=resample,
)
if do_convert_rgb:
images = [convert_to_rgb(image) for image in images]
# All transformations expect numpy arrays.
images = [to_numpy_array(image) for image in images]
if do_rescale and is_scaled_image(images[0]):
logger.warning_once(
"It looks like you are trying to rescale already rescaled images. If the input"
" images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again."
)
if input_data_format is None:
# We assume that all images have the same channel dimension format.
input_data_format = infer_channel_dimension_format(images[0])
size_tuple = (
(size["height"], size["width"])
if "height" in size and "width" in size
else (size["shortest_edge"], size["shortest_edge"])
)
new_images = []
image_sizes = [get_image_size(image, channel_dim=input_data_format) for image in images]
for i, image in enumerate(images):
if need_patching[i]:
# convert image into a list of patches
# we intentionally use the same data format as the input data format
image_patches = self.get_image_patches(
image,
image_grid_pinpoints,
size=size_tuple,
patch_size=size_tuple[0],
resample=resample,
data_format=input_data_format,
input_data_format=input_data_format,
)
else:
padded_image = self.pad_to_square(
image=image,
background_color=tuple(int(x * 255) for x in self.image_mean),
input_data_format=input_data_format,
)
image_patches = [padded_image]
# preprocess patches
pixel_values = self._preprocess(
image_patches,
do_resize=do_resize,
size=size_tuple,
resample=resample,
do_rescale=do_rescale,
rescale_factor=rescale_factor,
do_normalize=do_normalize,
image_mean=image_mean,
image_std=image_std,
data_format=data_format,
input_data_format=input_data_format,
)
pixel_values = np.array(pixel_values)
new_images.append(pixel_values)
if do_pad:
processed_images = self._pad_for_batching(new_images)
else:
# without padding, return the per-image patch arrays unchanged
processed_images = new_images
return BatchFeature(
data={"pixel_values": processed_images, "image_sizes": image_sizes, "batch_num_images": batch_num_images},
tensor_type=return_tensors,
)
__all__ = ["RImageProcessor"]
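
The input-shape branching in `preprocess` above decides how many images belong to each sample, and from that which images are split into patches. The following is a minimal standalone sketch of that logic, not the processor's own API: the helper names `count_images_per_sample` and `patching_mask` are hypothetical, and plain strings stand in for images.

```python
from typing import Any, List, Sequence


def count_images_per_sample(images: Any) -> List[int]:
    """Return the number of images in each sample, mirroring the branching in `preprocess`."""
    if isinstance(images, (tuple, list)) and isinstance(images[0], (tuple, list)):
        # Nested input: every element is treated as one sample holding several images.
        return [len(sample) for sample in images]
    if isinstance(images, (tuple, list)):
        # Flat list: every element is treated as its own single-image sample.
        return [1] * len(images)
    # A bare image is one sample with one image.
    return [1]


def patching_mask(batch_num_images: Sequence[int]) -> List[bool]:
    """One flag per flattened image: True only when its sample holds exactly one image."""
    return [n == 1 for n in batch_num_images for _ in range(n)]


counts = count_images_per_sample([["img_a"], ["img_b", "img_c"]])
mask = patching_mask(counts)
```

Here `counts` is `[1, 2]` and `mask` is `[True, False, False]`: only the image that is alone in its sample is patched, while images from multi-image samples are padded to a square instead.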

324
image_processing_r_fast.py Normal file

@@ -0,0 +1,324 @@
# coding=utf-8
# Copyright 2024 HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Union
import torch
from transformers.image_processing_utils import BatchFeature, get_patch_output_size, select_best_resolution
from transformers.image_processing_utils_fast import (
BaseImageProcessorFast,
DefaultFastImageProcessorKwargs,
divide_to_patches,
group_images_by_shape,
reorder_images,
)
from transformers.image_utils import (
OPENAI_CLIP_MEAN,
OPENAI_CLIP_STD,
ChannelDimension,
ImageInput,
PILImageResampling,
SizeDict,
get_image_size,
make_flat_list_of_images,
)
from transformers.processing_utils import Unpack
from transformers.utils import TensorType, auto_docstring, is_torchvision_v2_available
if is_torchvision_v2_available():
from torchvision.transforms.v2 import functional as F
else:
from torchvision.transforms import functional as F
class RFastImageProcessorKwargs(DefaultFastImageProcessorKwargs):
image_grid_pinpoints: Optional[list[list[int]]]
do_pad: Optional[bool]
@auto_docstring
class RImageProcessorFast(BaseImageProcessorFast):
resample = PILImageResampling.BICUBIC
image_mean = OPENAI_CLIP_MEAN
image_std = OPENAI_CLIP_STD
size = {"height": 384, "width": 384}
default_to_square = False
crop_size = None
do_resize = True
do_center_crop = None
do_rescale = True
do_normalize = True
do_convert_rgb = True
do_pad = True
image_grid_pinpoints = [[384, 768], [768, 384], [768, 768], [1152, 384], [384, 1152]]
valid_kwargs = RFastImageProcessorKwargs
model_input_names = ["pixel_values", "image_sizes", "batch_num_images"]
def __init__(self, **kwargs: Unpack[RFastImageProcessorKwargs]):
super().__init__(**kwargs)
@auto_docstring
def preprocess(
self, images: ImageInput, **kwargs: Unpack[RFastImageProcessorKwargs]
) -> BatchFeature:
if isinstance(images, (tuple, list)) and isinstance(images[0], (tuple, list)):
# if the first element is a list, we assume that all elements are lists
batch_num_images = [len(x) for x in images]
elif isinstance(images, (tuple, list)):
# treat this as a single-image case for backward compatibility
batch_num_images = [1] * len(images)
else:
batch_num_images = [1]
kwargs["batch_num_images"] = batch_num_images
return super().preprocess(images, **kwargs)
def _prepare_images_structure(
self,
images: ImageInput,
) -> ImageInput:
return make_flat_list_of_images(images)
def _resize_for_patching(
self,
image: "torch.Tensor",
target_resolution: tuple,
interpolation: "F.InterpolationMode",
input_data_format: ChannelDimension,
) -> "torch.Tensor":
new_height, new_width = get_patch_output_size(image, target_resolution, input_data_format)
# Resize the image
resized_image = F.resize(image, (new_height, new_width), interpolation=interpolation)
return resized_image
def _get_padding_size(self, original_resolution: tuple, target_resolution: tuple):
original_height, original_width = original_resolution
target_height, target_width = target_resolution
paste_x, r_x = divmod(target_width - original_width, 2)
paste_y, r_y = divmod(target_height - original_height, 2)
return [paste_x, paste_y, paste_x + r_x, paste_y + r_y]
def _pad_for_patching(
self, image: "torch.Tensor", target_resolution: tuple, input_data_format: ChannelDimension
) -> "torch.Tensor":
"""
Pad an image to a target resolution while maintaining aspect ratio.
"""
new_resolution = get_patch_output_size(image, target_resolution, input_data_format)
padding = self._get_padding_size(new_resolution, target_resolution)
padded_image = F.pad(image, padding=padding)
return padded_image
def _get_image_patches(
self,
image: "torch.Tensor",
grid_pinpoints,
size: tuple,
patch_size: int,
interpolation: "F.InterpolationMode",
) -> list["torch.Tensor"]:
"""
Process an image with variable resolutions by dividing it into patches.
Args:
image ("torch.Tensor"):
The input image to be processed.
grid_pinpoints (`list`):
A list of possible resolutions.
size (`tuple`):
Size to resize the original image to.
patch_size (`int`):
Size of the patches to divide the image into.
interpolation (`"InterpolationMode"`):
Resampling filter to use if resizing the image.
Returns:
list["torch.Tensor"]: A list of tensors containing the processed image patches.
"""
if not isinstance(grid_pinpoints, list):
raise TypeError("grid_pinpoints must be a list of possible resolutions.")
possible_resolutions = grid_pinpoints
image_size = get_image_size(image, channel_dim=ChannelDimension.FIRST)
best_resolution = select_best_resolution(image_size, possible_resolutions)
resized_image = self._resize_for_patching(
image, best_resolution, interpolation=interpolation, input_data_format=ChannelDimension.FIRST
)
padded_image = self._pad_for_patching(resized_image, best_resolution, input_data_format=ChannelDimension.FIRST)
patches = divide_to_patches(padded_image, patch_size=patch_size)
resized_original_image = F.resize(image, size=size, interpolation=interpolation)
image_patches = [resized_original_image] + patches
return image_patches
def _pad_for_batching(
self,
pixel_values: list["torch.Tensor"],
) -> list["torch.Tensor"]:
"""
Pads images with zeros along the `num_patches` dimension so that every image in the batch has the same number of patches.
Args:
pixel_values (`list[torch.Tensor]`):
A list of pixel values, one tensor per image, each of shape (`num_patches`, `num_channels`, `height`, `width`).
Returns:
list[`torch.Tensor`]: The padded images.
"""
max_patch = max(len(x) for x in pixel_values)
pixel_values = [
torch.nn.functional.pad(image, pad=[0, 0, 0, 0, 0, 0, 0, max_patch - image.shape[0]])
for image in pixel_values
]
return pixel_values
def _preprocess(
self,
images: list["torch.Tensor"],
do_resize: bool,
size: SizeDict,
image_grid_pinpoints: list[list[int]],
interpolation: Optional["F.InterpolationMode"],
do_center_crop: bool,
crop_size: SizeDict,
do_rescale: bool,
rescale_factor: float,
do_normalize: bool,
image_mean: Optional[Union[float, list[float]]],
image_std: Optional[Union[float, list[float]]],
do_pad: bool,
batch_num_images: list[int],
return_tensors: Optional[Union[str, TensorType]],
) -> BatchFeature:
processed_images = []
image_sizes = []
# only single image patching is supported
need_patching = [n == 1 for n in batch_num_images for _ in range(n)]
# Determine the size tuple
if size and size.height and size.width:
size_tuple = (size.height, size.width)
else:
size_tuple = (size.shortest_edge, size.shortest_edge)
# Determine the patch size
if crop_size and crop_size.height:
patch_size = crop_size.height
elif size and size.height:
patch_size = size.height
else:
patch_size = size.shortest_edge
for i, image in enumerate(images):
if need_patching[i]:
image_patches = self._get_image_patches(
image,
image_grid_pinpoints,
size=size_tuple,
patch_size=patch_size,
interpolation=interpolation,
)
else:
padded_image = self.pad_to_square(
images=image, background_color=tuple(int(x * 255) for x in self.image_mean)
)
image_patches = [padded_image]
# Group images by size for batched processing
processed_image_patches_grouped = {}
grouped_image_patches, grouped_image_patches_index = group_images_by_shape(image_patches)
for shape, stacked_image_patches in grouped_image_patches.items():
if do_resize:
stacked_image_patches = self.resize(
image=stacked_image_patches,
size=size,
interpolation=interpolation,
)
if do_center_crop:
stacked_image_patches = self.center_crop(stacked_image_patches, crop_size)
# Fused rescale and normalize
stacked_image_patches = self.rescale_and_normalize(
stacked_image_patches, do_rescale, rescale_factor, do_normalize, image_mean, image_std
)
processed_image_patches_grouped[shape] = stacked_image_patches
processed_image_patches = reorder_images(processed_image_patches_grouped, grouped_image_patches_index)
processed_image_patches = (
torch.stack(processed_image_patches, dim=0) if return_tensors else processed_image_patches
)
processed_images.append(processed_image_patches)
image_sizes.append(get_image_size(image, ChannelDimension.FIRST))
if do_pad:
processed_images = self._pad_for_batching(processed_images)
processed_images = torch.stack(processed_images, dim=0) if return_tensors else processed_images
return BatchFeature(
data={"pixel_values": processed_images, "image_sizes": image_sizes, "batch_num_images": batch_num_images},
tensor_type=return_tensors,
)
# Copied from transformers.models.llava.image_processing_llava_fast.LlavaImageProcessorFast.pad_to_square
def pad_to_square(
self,
images: "torch.Tensor",
background_color: Union[int, tuple[int, int, int]] = 0,
) -> "torch.Tensor":
"""
Pads an image to a square based on the longest edge.
Args:
images (`torch.Tensor`):
The images to pad.
background_color (`int` or `tuple[int, int, int]`, *optional*, defaults to 0):
The color to use for the padding. Can be an integer for single-channel images or a
tuple of integers for multi-channel images. If passed as an integer
in multi-channel mode, it will default to `0` in subsequent channels.
Returns:
`torch.Tensor`: The padded images.
"""
height, width = get_image_size(images, ChannelDimension.FIRST)
if height == width:
return images
num_channels = images.shape[1] if len(images.shape) == 4 else images.shape[0]
if isinstance(background_color, int):
background_color = [background_color] + [0] * (num_channels - 1)
elif len(background_color) != num_channels:
raise ValueError(
f"background_color must have exactly {num_channels} elements to match the number of channels"
)
max_dim = max(height, width)
paste_x_left = (max_dim - width) // 2
paste_y_left = (max_dim - height) // 2
paste_x_right = max_dim - width - paste_x_left
paste_y_right = max_dim - height - paste_y_left
padded_images = F.pad(
images, padding=[paste_x_left, paste_y_left, paste_x_right, paste_y_right], fill=background_color
)
return padded_images
__all__ = ["RImageProcessorFast"]
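
The padding arithmetic in `_get_padding_size` and the patch counts that `_pad_for_batching` later equalizes can be checked without torch. The following is a minimal sketch under stated assumptions: the helper names `centered_padding` and `num_patches` are hypothetical, and the target resolution is assumed to be an exact multiple of the patch size, as is the case for the default `image_grid_pinpoints` with 384-pixel patches.

```python
from typing import List, Tuple


def centered_padding(original_hw: Tuple[int, int], target_hw: Tuple[int, int]) -> List[int]:
    """Return [left, top, right, bottom] padding that centers the image in the target.

    Same divmod arithmetic as `_get_padding_size`: when the gap is odd,
    the extra pixel goes to the right or bottom edge.
    """
    original_height, original_width = original_hw
    target_height, target_width = target_hw
    left, extra_x = divmod(target_width - original_width, 2)
    top, extra_y = divmod(target_height - original_height, 2)
    return [left, top, left + extra_x, top + extra_y]


def num_patches(resolution_hw: Tuple[int, int], patch_size: int) -> int:
    """Count the tiles cut from the padded image, plus one for the resized original."""
    height, width = resolution_hw
    return (height // patch_size) * (width // patch_size) + 1


padding = centered_padding((384, 575), (384, 768))
patch_counts = [num_patches((768, 768), 384), num_patches((1152, 384), 384)]
```

In this example the horizontal gap is 193 pixels, so `padding` is `[96, 0, 97, 0]`. The two resolutions yield 5 and 4 patches respectively, which is why images in a batch need zero-padding along the patch dimension before they can be stacked.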

BIN
merges.txt (Stored with Git LFS) Normal file

Binary file not shown.


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0cbe4838c6bf13407ca264c3f05a7b5773a54e190d819612c9c9082040dcec89
size 4588680176


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e6e36a8076b61866ef7fa29d5bf7418fc81705eaa59ef18982bdba846a96bae1
size 4984489744


@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bdb79aaec58bb87ba821cc6fedfe6affb9158be17303ad6d1254d2535f6fe82f
size 64968216
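
Each three-line block above is a Git LFS pointer: the repository stores the pointer, while the large file itself lives in LFS storage. A minimal sketch of reading one follows, assuming the standard `version` / `oid` / `size` layout shown above; the function name `parse_lfs_pointer` and the keys of the returned dictionary are illustrative choices, not part of any library.

```python
from typing import Dict, Union


def parse_lfs_pointer(text: str) -> Dict[str, Union[str, int]]:
    """Parse a Git LFS pointer file into its version, hash algorithm, digest, and size."""
    fields = {}
    for line in text.strip().splitlines():
        # Each line has the form "<key> <value>", separated by the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid has the form "<algorithm>:<hex digest>".
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:bdb79aaec58bb87ba821cc6fedfe6affb9158be17303ad6d1254d2535f6fe82f
size 64968216
"""
pointer = parse_lfs_pointer(pointer_text)
```

The `size_bytes` field gives the size of the real file, which can be compared against the downloaded file to confirm that the download produced the actual file rather than the pointer.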


@@ -0,0 +1,834 @@
{
"metadata": {
"total_size": 9638024768
},
"weight_map": {
"lm_head.weight": "model-00002-of-00003.safetensors",
"model.image_newline": "model-00002-of-00003.safetensors",
"model.language_model.embed_tokens.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.0.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.0.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.0.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.0.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.0.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.1.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.1.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.1.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.10.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.10.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.10.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.10.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.10.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.11.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.11.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.11.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.12.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.12.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.12.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.12.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.13.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.13.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.13.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.13.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.14.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.14.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.14.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.14.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.14.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.14.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.15.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.15.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.15.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.16.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.16.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.16.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.16.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.16.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.17.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.17.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.17.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.17.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.18.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.18.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.18.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.18.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.19.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.19.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.19.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.19.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.2.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.2.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.2.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.2.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.2.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.20.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.20.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.20.self_attn.q_norm.weight": "model-00003-of-00003.safetensors",
"model.language_model.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.20.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.21.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.21.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.22.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.22.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.22.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.22.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.22.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.23.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.23.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.23.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.23.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.24.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.24.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.24.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.24.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.24.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.25.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.mlp.gate_proj.weight": "model-00003-of-00003.safetensors",
"model.language_model.layers.25.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.25.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.25.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.25.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.26.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.26.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.26.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.26.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.26.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.27.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.27.self_attn.k_proj.weight": "model-00003-of-00003.safetensors",
"model.language_model.layers.27.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.27.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.28.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.28.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.28.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.28.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.28.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.29.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.29.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.29.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.3.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.3.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.3.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.3.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.3.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.3.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.30.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.30.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.30.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.30.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.30.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.31.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.31.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.31.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.31.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.31.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.31.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.32.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.32.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.32.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.32.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.32.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.33.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.33.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.33.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.33.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.34.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.34.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.34.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.34.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.34.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.34.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.34.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.34.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.34.self_attn.q_norm.weight": "model-00003-of-00003.safetensors",
"model.language_model.layers.34.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.34.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.35.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.35.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.35.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.35.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.35.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.35.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.4.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.4.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.4.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.4.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.4.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.4.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.5.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.5.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.5.self_attn.k_norm.weight": "model-00003-of-00003.safetensors",
"model.language_model.layers.5.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.5.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.6.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.6.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.mlp.gate_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.6.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.post_attention_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.6.self_attn.o_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.6.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.6.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.input_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.7.mlp.down_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.7.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.7.self_attn.k_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.7.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.7.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.8.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.8.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.mlp.up_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.8.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.8.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.self_attn.q_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.8.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.input_layernorm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.mlp.down_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.9.mlp.gate_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.mlp.up_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.9.post_attention_layernorm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.9.self_attn.k_norm.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.9.self_attn.o_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.self_attn.q_norm.weight": "model-00001-of-00003.safetensors",
"model.language_model.layers.9.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.language_model.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.language_model.norm.weight": "model-00002-of-00003.safetensors",
"model.multi_modal_projector.linear_1.bias": "model-00001-of-00003.safetensors",
"model.multi_modal_projector.linear_1.weight": "model-00002-of-00003.safetensors",
"model.multi_modal_projector.linear_2.bias": "model-00002-of-00003.safetensors",
"model.multi_modal_projector.linear_2.weight": "model-00002-of-00003.safetensors",
"model.multi_modal_projector.pre_norm.bias": "model-00001-of-00003.safetensors",
"model.multi_modal_projector.pre_norm.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.embeddings.patch_embedding.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.embeddings.patch_embedding.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.embeddings.position_embedding.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.0.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.1.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.10.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.11.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.12.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.13.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.14.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.15.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.16.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.17.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.18.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.19.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.2.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.20.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.21.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.mlp.fc1.weight": "model-00003-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.22.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.23.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.24.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.25.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.3.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.4.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.out_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.5.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.layer_norm1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.mlp.fc1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.6.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.layer_norm1.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.layer_norm2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.layer_norm2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.mlp.fc1.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.7.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.mlp.fc2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.mlp.fc2.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.k_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.q_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.8.self_attn.v_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.layer_norm1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.layer_norm1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.layer_norm2.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.layer_norm2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.mlp.fc1.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.mlp.fc1.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.mlp.fc2.bias": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.mlp.fc2.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.k_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.out_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.q_proj.weight": "model-00002-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.encoder.layers.9.self_attn.v_proj.weight": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.post_layernorm.bias": "model-00001-of-00003.safetensors",
"model.vision_tower.vision_model.post_layernorm.weight": "model-00001-of-00003.safetensors"
}
}
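
The map above pairs each tensor name with the shard file that stores it. As a minimal sketch, assuming the index follows the usual sharded-checkpoint layout with a top-level `weight_map` key (that key name is an assumption; it is not visible in this part of the diff), the mapping could be inverted to see which tensors a given shard holds. The helper name `group_by_shard` and the `sample_index` excerpt are hypothetical and exist only for illustration.

```python
import json
from collections import defaultdict


def group_by_shard(index: dict) -> dict[str, list[str]]:
    """Invert an index's weight map into {shard file: [tensor names]}.

    Assumes the index dict has a top-level "weight_map" key (hypothetical
    here, since the key is not shown in this excerpt).
    """
    shards = defaultdict(list)
    for tensor_name, shard_file in index["weight_map"].items():
        shards[shard_file].append(tensor_name)
    # Sort names so the result is stable regardless of input order.
    return {shard: sorted(names) for shard, names in shards.items()}


# Two entries copied from the end of the map above; a real index has many more.
sample_index = json.loads("""
{
  "weight_map": {
    "model.vision_tower.vision_model.post_layernorm.bias": "model-00001-of-00003.safetensors",
    "model.vision_tower.vision_model.post_layernorm.weight": "model-00001-of-00003.safetensors"
  }
}
""")

by_shard = group_by_shard(sample_index)
for shard, names in by_shard.items():
    print(shard, len(names))
```

Grouping by shard rather than looking up tensors one at a time means each shard file would only need to be opened once when loading a subset of the weights.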

688
modeling_r.py Normal file

@@ -0,0 +1,688 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from dataclasses import dataclass
from typing import Optional, Union
import numpy as np
import torch
from torch import nn
from transformers.activations import GELUActivation
from transformers.generation import GenerationMixin
from transformers.image_processing_utils import select_best_resolution
from transformers.modeling_flash_attention_utils import FlashAttentionKwargs
from transformers.modeling_outputs import BaseModelOutputWithPast, ModelOutput
from transformers.modeling_utils import PreTrainedModel
from transformers.models.auto import AutoModel
from transformers.processing_utils import Unpack
from transformers.utils import (
can_return_tuple,
is_torchdynamo_compiling,
logging,
)
from .configuration_r import RConfig
logger = logging.get_logger(__name__)
@dataclass
class RModelOutputWithPast(BaseModelOutputWithPast):
image_hidden_states: Optional[torch.FloatTensor] = None
@dataclass
class RCausalLMOutputWithPast(ModelOutput):
loss: Optional[torch.FloatTensor] = None
logits: Optional[torch.FloatTensor] = None
past_key_values: Optional[list[torch.FloatTensor]] = None
hidden_states: Optional[tuple[torch.FloatTensor]] = None
attentions: Optional[tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[torch.FloatTensor] = None
class RPooler(nn.Module):
def __init__(self, config):
super().__init__()
mode = config.spatial_pool_mode
stride = config.spatial_pool_stride
out_channels = getattr(config, "spatial_pool_out_channels", config.vision_config.hidden_size)
self.image_size = (config.vision_config.image_size // config.vision_config.patch_size) ** 2
if mode == "average":
self.pool = nn.AvgPool2d(kernel_size=stride, stride=stride)
elif mode == "max":
self.pool = nn.MaxPool2d(kernel_size=stride, stride=stride)
elif mode == "conv":
self.pool = nn.Conv2d(
in_channels=config.vision_config.hidden_size,
out_channels=out_channels,
kernel_size=stride,
stride=stride,
)
else:
raise ValueError(f"Unknown pooling mode: {mode}. Has to be one of [`average`, `max`, `conv`]")
def forward(self, image_features):
ori_width = int(math.sqrt(image_features.shape[1] * self.image_size // self.image_size))
ori_height = int(ori_width * self.image_size // self.image_size)
batch_size, _, dim = image_features.shape
image_features_spatial = image_features.view(batch_size, ori_height, ori_height, dim).permute(0, 3, 1, 2)
image_features_spatial_pool = self.pool(image_features_spatial)
return image_features_spatial_pool.flatten(2).transpose(1, 2).contiguous()
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    if not isinstance(grid_pinpoints, list):
        raise TypeError("grid_pinpoints should be a list of tuples or lists")
    # ! VERY IMPORTANT: if image_size is a tensor or array, it must be converted to a list first; otherwise the calculation will be wrong
    if not isinstance(image_size, (list, tuple)):
        if not isinstance(image_size, (torch.Tensor, np.ndarray)):
            raise TypeError(
                f"image_size invalid type: {type(image_size)} not valid, should be either list, tuple, np.ndarray or tensor"
            )
        image_size = image_size.tolist()
    height, width = select_best_resolution(image_size, grid_pinpoints)
    return height // patch_size, width // patch_size
def image_size_to_num_patches(image_size, grid_pinpoints, patch_size: int):
    if not isinstance(grid_pinpoints, list):
        raise TypeError("grid_pinpoints should be a list of tuples or lists")
    # ! VERY IMPORTANT: if image_size is a tensor or array, it must be converted to a list first; otherwise the calculation will be wrong
    if not isinstance(image_size, (list, tuple)):
        if not isinstance(image_size, (torch.Tensor, np.ndarray)):
            raise TypeError(f"image_size invalid type {type(image_size)} with value {image_size}")
        image_size = image_size.tolist()
    best_resolution = select_best_resolution(image_size, grid_pinpoints)
    height, width = best_resolution
    num_patches = 0
    # consider changing this to ceil(height / patch_size) * ceil(width / patch_size) + 1
    for i in range(0, height, patch_size):
        for j in range(0, width, patch_size):
            num_patches += 1
    # add the base patch
    num_patches += 1
    return num_patches
def unpad_image(tensor, original_size):
    if not isinstance(original_size, (list, tuple)):
        if not isinstance(original_size, (torch.Tensor, np.ndarray)):
            raise TypeError(
                f"original_size invalid type: {type(original_size)} not valid, should be either list, tuple, np.ndarray or tensor"
            )
        original_size = original_size.tolist()
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1:]
    original_aspect_ratio = original_width / original_height
    current_aspect_ratio = current_width / current_height
    if original_aspect_ratio > current_aspect_ratio:
        # The original is relatively wider, so padding was added along the height: crop rows symmetrically.
        scale_factor = current_width / original_width
        new_height = int(round(original_height * scale_factor, 7))
        padding = (current_height - new_height) // 2
        unpadded_tensor = tensor[:, padding : current_height - padding, :]
    else:
        # Otherwise padding was added along the width: crop columns symmetrically.
        scale_factor = current_height / original_height
        new_width = int(round(original_width * scale_factor, 7))
        padding = (current_width - new_width) // 2
        unpadded_tensor = tensor[:, :, padding : current_width - padding]
    return unpadded_tensor
class RPreTrainedModel(PreTrainedModel):
config_class = RConfig
base_model_prefix = ""
supports_gradient_checkpointing = True
# _no_split_modules = ["LlamaDecoderLayer"]
_no_split_modules = ["SiglipEncoderLayer", "Qwen3DecoderLayer", ]
_skip_keys_device_placement = "past_key_values"
_supports_cache_class = True
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_quantized_cache = True
_supports_static_cache = True
_supports_flex_attn = True
_supports_attention_backend = True
def _init_weights(self, module):
std = getattr(self.config, "initializer_range", self.config.get_text_config().initializer_range)
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, RModel):
embed_std = 1 / math.sqrt(self.config.text_config.hidden_size)
module.image_newline.data.normal_(mean=0.0, std=embed_std)
class RMultiModalProjector(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Report through the module logger instead of printing to stdout.
        logger.info("Using MultiModalProjector_withLayerNorm")
        self.pre_norm = torch.nn.LayerNorm(config.vision_config.hidden_size, eps=1e-06)
        self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.act = GELUActivation()
        self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)

    def forward(self, image_feature: torch.Tensor) -> torch.Tensor:
        image_feature = self.pre_norm(image_feature)
        hidden_states = self.linear_1(image_feature)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_2(hidden_states)
        return hidden_states
class RModel(RPreTrainedModel):
_checkpoint_conversion_mapping = {"language_model.model": "language_model"}
def __init__(self, config):
super().__init__(config)
self.vision_tower = AutoModel.from_config(config.vision_config)
self.multi_modal_projector = RMultiModalProjector(config)
embed_std = 1 / math.sqrt(config.text_config.hidden_size)
self.image_newline = nn.Parameter(torch.randn(config.text_config.hidden_size, dtype=self.dtype) * embed_std)
self.vocab_size = config.text_config.vocab_size
self.language_model = AutoModel.from_config(config.text_config)
self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def pack_image_features(self, image_features, image_sizes, image_newline=None, vision_aspect_ratio="anyres"):
new_image_features = []
feature_lens = []
for image_idx, image_feature in enumerate(image_features):
if image_feature.shape[0] > 1:
base_image_feature = image_feature[0]
image_feature = image_feature[1:]
height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size
if height * width != base_image_feature.shape[0]:
raise ValueError("The number of patches is not consistent with the image size.")
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
image_sizes[image_idx],
self.config.image_grid_pinpoints,
self.config.vision_config.image_size,
)
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
image_feature = image_feature.flatten(1, 2).flatten(2, 3)
image_feature = unpad_image(image_feature, image_sizes[image_idx])
try:
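# `str.strip` removes a set of characters, not a prefix; `removeprefix` states the intent directly.
max_num_patches = int(vision_aspect_ratio.removeprefix("anyres_max_"))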
channels, curr_height, curr_width = image_feature.shape
ratio = math.sqrt(curr_height * curr_width / (max_num_patches * height**2))
if ratio > 1.1:
image_feature = image_feature[None]
image_feature = nn.functional.interpolate(
image_feature, [int(curr_height // ratio), int(curr_width // ratio)], mode="bilinear"
)[0]
except Exception:
pass
if image_newline is not None:
image_feature = torch.cat(
(
image_feature,
image_newline[:, None, None]
.expand(*image_feature.shape[:-1], 1)
.to(image_feature.device, image_feature.dtype),
),
dim=-1,
)
image_feature = image_feature.flatten(1, 2).transpose(0, 1)
image_feature = torch.cat((base_image_feature, image_feature), dim=0)
else:
image_feature = image_feature[0]
if image_newline is not None:
image_feature = torch.cat((image_feature, image_newline[None].to(image_feature)), dim=0)
image_feature = image_feature.flatten(0, 1)
new_image_features.append(image_feature)
feature_lens.append(image_feature.size(0))
feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=image_features[0].device)
return new_image_features, feature_lens
def get_image_features(
self,
pixel_values: torch.FloatTensor,
image_sizes: torch.Tensor,
vision_feature_layer: Optional[Union[int, list[int]]] = None,
vision_feature_select_strategy: Optional[str] = None,
vision_aspect_ratio: Optional[str] = None,
batch_num_images: Optional[torch.LongTensor] = None,
):
vision_feature_layer = (
vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
)
vision_feature_select_strategy = (
vision_feature_select_strategy
if vision_feature_select_strategy is not None
else self.config.vision_feature_select_strategy
)
vision_aspect_ratio = (
vision_aspect_ratio if vision_aspect_ratio is not None else self.config.vision_aspect_ratio
)
if batch_num_images is None:
# treat this as a single-image case for backward compatibility
need_patching = [True] * len(image_sizes)
else:
need_patching = [n == 1 for n in batch_num_images for _ in range(n)]
image_num_patches = [
image_size_to_num_patches(
image_size=imsize,
grid_pinpoints=self.config.image_grid_pinpoints,
patch_size=self.config.vision_config.image_size,
)
if should_patch
else 1
for imsize, should_patch in zip(image_sizes, need_patching)
]
if isinstance(pixel_values, torch.Tensor):
if pixel_values.dim() == 5:
# stacked if input is (batch_size, num_patches, num_channels, height, width)
_pixel_values_list = [pix_val[:num_patch] for pix_val, num_patch in zip(pixel_values, image_num_patches)]
pixel_values = torch.cat(_pixel_values_list, dim=0)
elif pixel_values.dim() != 4:
# otherwise has to be stacked from list of (num_patches, num_channels, height, width)
raise ValueError(f"pixel_values of shape {pixel_values.shape}, expect to be of 4 or 5 dimensions")
elif isinstance(pixel_values, list):
# list of [(batch_size, num_patches, num_channels, height, width)]
assert len(pixel_values) == len(image_num_patches), (
f"pixel_values is a list of {len(pixel_values)} tensors, but image_num_patches is of length {len(image_num_patches)}"
)
_pixel_values_list = [pix_val.squeeze(0)[:num_patch] for pix_val, num_patch in zip(pixel_values, image_num_patches)]
pixel_values = torch.cat(_pixel_values_list, dim=0)
image_features = self.vision_tower(pixel_values, output_hidden_states=True)
# If we have one vision feature layer, return the corresponding hidden states,
# otherwise, select the hidden states of each feature layer and concatenate them
if isinstance(vision_feature_layer, int):
selected_image_feature = image_features.hidden_states[vision_feature_layer]
else:
hs_pool = [image_features.hidden_states[layer_idx] for layer_idx in vision_feature_layer]
selected_image_feature = torch.cat(hs_pool, dim=-1)
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
selected_image_feature = selected_image_feature
image_features = self.multi_modal_projector(selected_image_feature)
image_features = torch.split(image_features, image_num_patches, dim=0)
image_features, feature_lens = self.pack_image_features(
image_features,
image_sizes,
image_newline=self.image_newline,
vision_aspect_ratio=vision_aspect_ratio,
)
return image_features
@can_return_tuple
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
image_sizes: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[list[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
vision_feature_layer: Optional[Union[int, list[int]]] = None,
vision_feature_select_strategy: Optional[str] = None,
vision_aspect_ratio: Optional[str] = None,
batch_num_images: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
**kwargs: Unpack[FlashAttentionKwargs],
) -> Union[tuple, RModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
vision_feature_layer = (
vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
)
vision_feature_select_strategy = (
vision_feature_select_strategy
if vision_feature_select_strategy is not None
else self.config.vision_feature_select_strategy
)
vision_aspect_ratio = (
vision_aspect_ratio if vision_aspect_ratio is not None else self.config.vision_aspect_ratio
)
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
if pixel_values is not None and inputs_embeds is not None:
raise ValueError(
"You cannot specify both `pixel_values` and `inputs_embeds` at the same time, "
"and must specify either one"
)
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
# Images are processed with Anyres
if pixel_values is not None:
image_features = self.get_image_features(
pixel_values,
image_sizes,
vision_feature_layer=vision_feature_layer,
vision_feature_select_strategy=vision_feature_select_strategy,
vision_aspect_ratio=vision_aspect_ratio,
batch_num_images=batch_num_images,
)
image_features = torch.cat(image_features, dim=0)
special_image_mask = (input_ids == self.config.image_token_id).unsqueeze(-1)
special_image_mask = special_image_mask.expand_as(inputs_embeds).to(inputs_embeds.device)
if not is_torchdynamo_compiling() and inputs_embeds[special_image_mask].numel() != image_features.numel():
n_image_tokens = (input_ids == self.config.image_token_id).sum()
n_image_features = image_features.shape[0]
raise ValueError(
f"Image features and image tokens do not match: tokens: {n_image_tokens}, features {n_image_features}"
)
image_features = image_features.to(inputs_embeds.device, inputs_embeds.dtype)
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
outputs = self.language_model(
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=True,
cache_position=cache_position,
**kwargs,
)
return RModelOutputWithPast(
last_hidden_state=outputs.last_hidden_state,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
image_hidden_states=image_features if pixel_values is not None else None,
)
def apply_pooling(self, image_features):
height = width = self.config.vision_config.image_size // self.config.vision_config.patch_size
batch_frames, seq_len, dim = image_features.shape
image_features = image_features.view(batch_frames, height, width, -1)
image_features = image_features.permute(0, 3, 1, 2).contiguous()
height, width = image_features.shape[2:]
scaled_shape = [math.ceil(height / 2), math.ceil(width / 2)]
image_features = nn.functional.interpolate(image_features, size=scaled_shape, mode="bilinear")
image_features = image_features.permute(0, 2, 3, 1)
image_features = image_features.view(batch_frames, -1, dim)
return image_features
class RForConditionalGeneration(RPreTrainedModel, GenerationMixin):
_checkpoint_conversion_mapping = {
"^language_model.model": "model.language_model",
"^vision_tower": "model.vision_tower",
"^multi_modal_projector": "model.multi_modal_projector",
"^image_newline": "model.image_newline",
"^language_model.lm_head": "lm_head",
}
_tied_weights_keys = ["lm_head.weight"]
def __init__(self, config: RConfig):
super().__init__(config)
self.model = RModel(config)
self.lm_head = nn.Linear(config.text_config.hidden_size, config.text_config.vocab_size, bias=False)
self.post_init()
def get_input_embeddings(self):
return self.model.get_input_embeddings()
def set_input_embeddings(self, value):
self.model.set_input_embeddings(value)
def get_output_embeddings(self) -> nn.Module:
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
def pack_image_features(self, image_features, image_sizes, vision_feature_select_strategy, image_newline=None):
# `vision_feature_select_strategy` is accepted for backward compatibility only:
# `RModel.pack_image_features` has no such parameter, so it is not forwarded.
return self.model.pack_image_features(
image_features=image_features,
image_sizes=image_sizes,
image_newline=image_newline,
)
def get_image_features(
self,
pixel_values: torch.FloatTensor,
image_sizes: torch.Tensor,
vision_feature_layer: Optional[Union[int, list[int]]] = None,
vision_feature_select_strategy: Optional[str] = None,
):
return self.model.get_image_features(
pixel_values=pixel_values,
image_sizes=image_sizes,
vision_feature_layer=vision_feature_layer,
vision_feature_select_strategy=vision_feature_select_strategy,
)
# Make modules available through the conditional class for BC
@property
def language_model(self):
return self.model.language_model
@property
def vision_tower(self):
return self.model.vision_tower
@property
def multi_modal_projector(self):
return self.model.multi_modal_projector
@can_return_tuple
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
image_sizes: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[list[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
vision_feature_layer: Optional[Union[int, list[int]]] = None,
vision_feature_select_strategy: Optional[str] = None,
vision_aspect_ratio: Optional[str] = None,
batch_num_images: Optional[torch.LongTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
logits_to_keep: Union[int, torch.Tensor] = 0,
**kwargs,
) -> Union[tuple, RCausalLMOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
vision_feature_layer = (
vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
)
vision_feature_select_strategy = (
vision_feature_select_strategy
if vision_feature_select_strategy is not None
else self.config.vision_feature_select_strategy
)
vision_aspect_ratio = (
vision_aspect_ratio if vision_aspect_ratio is not None else self.config.vision_aspect_ratio
)
outputs = self.model(
input_ids=input_ids,
pixel_values=pixel_values,
image_sizes=image_sizes,
vision_aspect_ratio=vision_aspect_ratio,
vision_feature_layer=vision_feature_layer,
vision_feature_select_strategy=vision_feature_select_strategy,
batch_num_images=batch_num_images,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=True,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
hidden_states = outputs[0]
# Only compute necessary logits, and do not upcast them to float if we are not computing the loss
slice_indices = slice(-logits_to_keep, None) if isinstance(logits_to_keep, int) else logits_to_keep
logits = self.lm_head(hidden_states[:, slice_indices, :])
loss = None
if labels is not None:
loss = self.loss_function(
logits=logits, labels=labels, vocab_size=self.config.text_config.vocab_size, **kwargs
)
return RCausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
image_hidden_states=outputs.image_hidden_states,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
inputs_embeds=None,
pixel_values=None,
image_sizes=None,
attention_mask=None,
cache_position=None,
logits_to_keep=None,
**kwargs,
):
# Overwritten -- in specific circumstances we don't want to forward image inputs to the model
model_inputs = super().prepare_inputs_for_generation(
input_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
cache_position=cache_position,
logits_to_keep=logits_to_keep,
**kwargs,
)
if cache_position[0] == 0:
# If we're in cached decoding stage, pixel values should be None because input ids do not contain special image token anymore
# Otherwise we need pixel values to be passed to model
model_inputs["pixel_values"] = pixel_values
model_inputs["image_sizes"] = image_sizes
return model_inputs
@staticmethod
def _prepare_4d_causal_attention_mask_with_cache_position(
attention_mask: torch.Tensor,
sequence_length: int,
target_length: int,
dtype: torch.dtype,
cache_position: torch.Tensor,
batch_size: int,
**kwargs,
):
if attention_mask is not None and attention_mask.dim() == 4:
# In this case we assume that the mask comes already in inverted form and requires no inversion or slicing.
causal_mask = attention_mask
else:
min_dtype = torch.finfo(dtype).min
causal_mask = torch.full(
(sequence_length, target_length), fill_value=min_dtype, dtype=dtype, device=cache_position.device
)
if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(target_length, device=cache_position.device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(batch_size, 1, -1, -1)
if attention_mask is not None:
causal_mask = causal_mask.clone() # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :mask_length] + attention_mask[:, None, None, :].to(
causal_mask.device
)
padding_mask = padding_mask == 0
causal_mask[:, :, :, :mask_length] = causal_mask[:, :, :, :mask_length].masked_fill(
padding_mask, min_dtype
)
return causal_mask
__all__ = ["RModel", "RForConditionalGeneration", "RPreTrainedModel"]
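
The comment in `image_size_to_num_patches` suggests replacing the nested loop with a closed form. A minimal sketch of that equivalence follows, in plain Python with no dependency on `torch` or `transformers`. The names `num_patches_loop` and `num_patches_closed_form` are illustrative and are not part of `modeling_r.py`; both take the already-selected best resolution rather than calling `select_best_resolution`.

```python
import math


def num_patches_loop(height: int, width: int, patch_size: int) -> int:
    """Count tiles the way the nested loop does, then add the base patch."""
    count = 0
    for _ in range(0, height, patch_size):
        for _ in range(0, width, patch_size):
            count += 1
    return count + 1  # +1 for the base (whole-image) patch


def num_patches_closed_form(height: int, width: int, patch_size: int) -> int:
    """Closed form proposed in the source comment."""
    return math.ceil(height / patch_size) * math.ceil(width / patch_size) + 1
```

Assuming a tile size of 384 (the `size` given in `preprocessor_config.json`), both functions return the same count for every entry in `image_grid_pinpoints`; a 768 by 768 target, for example, yields 2 * 2 tiles plus the base patch.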

51
preprocessor_config.json Normal file

@@ -0,0 +1,51 @@
{
"do_convert_rgb": null,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"image_grid_pinpoints": [
[
384,
768
],
[
768,
384
],
[
768,
768
],
[
1152,
384
],
[
384,
1152
]
],
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "RImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"processor_class": "RProcessor",
"auto_map": {
"AutoProcessor": "processing_r.RProcessor",
"AutoImageProcessor": "image_processing_r.RImageProcessor"
},
"resample": 2,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 384,
"width": 384
}
}
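
The `rescale_factor`, `image_mean`, and `image_std` values above imply a simple per-pixel transform. A minimal sketch, assuming the image processor rescales first and then normalizes (the usual order in `transformers` image processors; `image_processing_r.py` is not shown here, so this order is an assumption). The helper `normalize_pixel` is hypothetical and exists only for illustration.

```python
# Values copied from preprocessor_config.json.
RESCALE_FACTOR = 0.00392156862745098  # approximately 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value: float) -> float:
    """Map one 8-bit channel value to the normalized range (assumed order: rescale, then normalize)."""
    rescaled = value * RESCALE_FACTOR  # 0..255 -> roughly 0..1
    return (rescaled - IMAGE_MEAN) / IMAGE_STD  # roughly 0..1 -> roughly -1..1
```

Under these assumptions, an 8-bit value of 0 maps to about -1.0 and 255 maps to about 1.0, so each channel ends up roughly in the interval [-1, 1].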

259
processing_r.py Normal file

@@ -0,0 +1,259 @@
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from collections.abc import Iterable
from typing import Union
import numpy as np
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_processing_utils import select_best_resolution
from transformers.image_utils import ImageInput, get_image_size, to_numpy_array
from transformers.processing_utils import ProcessingKwargs, ProcessorMixin, Unpack, MultiModalData
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging
logger = logging.get_logger(__name__)
class RProcessorKwargs(ProcessingKwargs, total=False):
# see processing_utils.ProcessingKwargs documentation for usage.
_defaults = {
"text_kwargs": {
"padding": False,
},
"image_kwargs": {},
}
class RProcessor(ProcessorMixin):
attributes = ["image_processor", "tokenizer"]
valid_kwargs = [
"chat_template",
"num_image_tokens",
"image_processor_type",
"vision_feature_select_strategy",
"image_token",
"vision_aspect_ratio",
]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(
self,
image_processor=None,
tokenizer=None,
num_image_tokens=None,
vision_feature_select_strategy=None,
chat_template=None,
image_token="<image>",
vision_aspect_ratio="anyres",
**kwargs,
):
self.num_image_tokens = num_image_tokens
self.vision_feature_select_strategy = vision_feature_select_strategy
self.image_token = tokenizer.image_token if hasattr(tokenizer, "image_token") else image_token
self.image_token_id = (
tokenizer.image_token_id
if getattr(tokenizer, "image_token_id", None)
else tokenizer.convert_tokens_to_ids(self.image_token)
)
self.vision_aspect_ratio = vision_aspect_ratio
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
images: ImageInput = None,
text: Union[TextInput, PreTokenizedInput, list[TextInput], list[PreTokenizedInput]] = None,
audio=None,
**kwargs: Unpack[RProcessorKwargs],
) -> BatchFeature:
output_kwargs = self._merge_kwargs(
RProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if isinstance(text, str):
text = [text]
elif not isinstance(text, list) and not isinstance(text[0], str):
raise ValueError("Invalid input text. Please provide a string, or a list of strings")
image_inputs = {}
if images is not None:
image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
batch_num_images = iter(image_inputs["batch_num_images"])
image_sizes = iter(image_inputs["image_sizes"])
height, width = get_image_size(
to_numpy_array(image_inputs["pixel_values"][0][0]),
channel_dim=output_kwargs["images_kwargs"].get("data_format"),
)
text, num_image_tokens = self._expand_image_tokens(
text, image_sizes, height, width, self.image_token, batch_num_images
)
return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
self._check_special_mm_tokens(text, text_inputs, modalities=["image"])
return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)
def _expand_image_tokens(
self,
text: list[TextInput],
image_sizes: Iterable[Union[list[int], int]],
height: int,
width: int,
special_token: str,
batch_num_images: Iterable[int],
):
prompt_strings = []
max_num_vision_tokens = 0
for sample in text:
if special_token in sample:
is_multi_image = next(batch_num_images) != 1
else:
is_multi_image = False
while special_token in sample:
if is_multi_image:
num_image_tokens = self.num_image_tokens + 1 # one for image_newline
else:
original_size = next(image_sizes)
if not isinstance(original_size, (list, tuple)):
# cast to list to avoid numerical precision errors when calculating unpadding
original_size = original_size.tolist()
orig_height, orig_width = original_size
num_image_tokens = self._get_number_of_features(orig_height, orig_width, height, width)
max_num_vision_tokens = max(max_num_vision_tokens, num_image_tokens)
if self.vision_feature_select_strategy == "default":
num_image_tokens -= 1
sample = sample.replace(special_token, "<placeholder>" * num_image_tokens, 1)
prompt_strings.append(sample)
text = [sample.replace("<placeholder>", special_token) for sample in prompt_strings]
return text, max_num_vision_tokens
def _get_number_of_features(self, orig_height: int, orig_width: int, height: int, width: int) -> int:
image_grid_pinpoints = self.image_processor.image_grid_pinpoints
height_best_resolution, width_best_resolution = select_best_resolution(
[orig_height, orig_width], image_grid_pinpoints
)
scale_height, scale_width = height_best_resolution // height, width_best_resolution // width
patches_height = patches_width = int(math.sqrt(self.num_image_tokens))
unpadded_features, newline_features = self._get_unpadded_features(
orig_height, orig_width, patches_height, patches_width, scale_height, scale_width
)
# The base patch covers the entire image (no CLS for SigLIP)
base_features = self.num_image_tokens
num_image_tokens = unpadded_features + newline_features + base_features
return num_image_tokens
# Adapted from transformers.models.llava_next.processing_llava_next.LlavaNextProcessor._get_unpadded_features
def _get_unpadded_features(self, height, width, patches_height, patches_width, scale_height, scale_width):
current_height = patches_height * scale_height
current_width = patches_width * scale_width
original_aspect_ratio = width / height
current_aspect_ratio = current_width / current_height
if original_aspect_ratio > current_aspect_ratio:
new_height = int(round(height * (current_width / width), 7))
padding = (current_height - new_height) // 2
current_height -= padding * 2
else:
new_width = int(round(width * (current_height / height), 7))
padding = (current_width - new_width) // 2
current_width -= padding * 2
unpadded_features = current_height * current_width
newline_features = current_height
return (unpadded_features, newline_features)
def _get_num_multimodal_tokens(self, image_sizes=None, video_sizes=None, **kwargs):
"""
Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.
Args:
image_sizes (list[list[int]], *optional*):
The input sizes formatted as (height, width) per each image.
video_sizes (list[list[int]], *optional*):
The input sizes formatted as (num_frames, height, width) per each video.
audio_lengths (list[int], *optional*):
The input length formatted as per each audio.
Returns:
dict[str, list[int]]: A dictionary mapping each modality ("image", "video", "audio")
to a list containing the number of placeholder tokens required. If the model doesn't accept
a certain modality or no input sizes are provided, the dict value is set to an empty list.
"""
vision_data = {}
if image_sizes is not None:
images_kwargs = RProcessorKwargs._defaults.get("images_kwargs", {})
images_kwargs.update(kwargs)
size = images_kwargs.get("size", None) or self.image_processor.size
size = (
(size["shortest_edge"], size["shortest_edge"])
if "shortest_edge" in size
else (min(size["height"], size["width"]), min(size["height"], size["width"]))
)
processed_height, processed_width = size
batch_num_image_tokens = []
num_image_patches = [1] * len(image_sizes) # llava-ov doesn't batch pixels as Idefics, thus `1` patch`
for image_size in image_sizes:
orig_height, orig_width = image_size
num_image_tokens = self._get_number_of_features(
orig_height, orig_width, processed_height, processed_width
)
if self.vision_feature_select_strategy == "default":
num_image_tokens -= 1
batch_num_image_tokens.append(num_image_tokens)
vision_data.update({"num_image_tokens": batch_num_image_tokens, "num_image_patches": num_image_patches})
return MultiModalData(**vision_data)
    # Adapted from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode
    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to the tokenizer's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of that method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    # Adapted from transformers.models.clip.processing_clip.CLIPProcessor.decode
    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to the tokenizer's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of that method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    # Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))


__all__ = ["RProcessor"]
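
As a rough illustration of the token arithmetic in `_get_number_of_features` and `_get_unpadded_features`, the sketch below re-implements the two steps as a single free function. The name `count_image_tokens` and the grid scales passed to it are hypothetical; in the processor the scales come from `select_best_resolution` and `image_grid_pinpoints`, which are not reproduced here.

```python
def count_image_tokens(height, width, patches_per_side, scale_height, scale_width, base_tokens):
    """Hypothetical standalone version of the anyres token count shown above."""
    current_height = patches_per_side * scale_height
    current_width = patches_per_side * scale_width

    # Drop the padded rows or columns so the grid matches the original aspect ratio.
    if width / height > current_width / current_height:
        new_height = int(round(height * (current_width / width), 7))
        current_height -= ((current_height - new_height) // 2) * 2
    else:
        new_width = int(round(width * (current_height / height), 7))
        current_width -= ((current_width - new_width) // 2) * 2

    unpadded = current_height * current_width  # tokens from the high-resolution grid
    newline = current_height                   # one newline token per grid row
    return unpadded + newline + base_tokens    # plus the base (whole-image) tokens


# Assumed inputs: 27 patches per side (sqrt of 729) and a 2x2 grid for a 480x640 image.
landscape_tokens = count_image_tokens(480, 640, 27, 2, 2, base_tokens=729)
# A square 384x384 image on an assumed 1x1 grid.
square_tokens = count_image_tokens(384, 384, 27, 1, 1, base_tokens=729)
print(landscape_tokens, square_tokens)
```

With these assumed inputs the landscape image needs 2929 placeholder tokens and the square image 1485.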

12
processor_config.json Normal file

@@ -0,0 +1,12 @@
{
"image_token": "<image>",
"num_image_tokens": 729,
"processor_class": "RProcessor",
"auto_map": {
"AutoProcessor": "processing_r.RProcessor",
"AutoImageProcessor": "image_processing_r.RImageProcessor"
},
"video_token": "<video>",
"vision_aspect_ratio": "anyres",
"vision_feature_select_strategy": "full"
}
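
The processor code derives the per-side patch count from `num_image_tokens` with `int(math.sqrt(...))`. The sketch below shows that relationship for the values in this config, assuming the processor attributes are populated from this file; the variable names are illustrative only.

```python
import json
import math

# Excerpt of processor_config.json as shown above.
processor_config = json.loads(
    '{"num_image_tokens": 729, "vision_aspect_ratio": "anyres", "vision_feature_select_strategy": "full"}'
)

patches_per_side = int(math.sqrt(processor_config["num_image_tokens"]))
is_square_grid = patches_per_side**2 == processor_config["num_image_tokens"]

# In `_get_num_multimodal_tokens` one token is subtracted only for the "default" strategy.
drops_one_token = processor_config["vision_feature_select_strategy"] == "default"
print(patches_per_side, is_square_grid, drops_one_token)
```

Here 729 tokens correspond to a 27 x 27 grid, and with the "full" strategy no token is subtracted.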

31
special_tokens_map.json Normal file

@@ -0,0 +1,31 @@
{
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

3
tokenizer.json Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:c6a4e9901c580f8acc48cdbd2618c3b0ec673dcb91d44b555171844c707f28d2
size 11423022

256
tokenizer_config.json Normal file

@@ -0,0 +1,256 @@
{
"add_bos_token": false,
"add_prefix_space": false,
"added_tokens_decoder": {
"151643": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151644": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151645": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151646": {
"content": "<|object_ref_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151647": {
"content": "<|object_ref_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151648": {
"content": "<|box_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151649": {
"content": "<|box_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151650": {
"content": "<|quad_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151651": {
"content": "<|quad_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151652": {
"content": "<|vision_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151653": {
"content": "<|vision_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151654": {
"content": "<|vision_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151655": {
"content": "<|image_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151656": {
"content": "<|video_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151657": {
"content": "<tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151658": {
"content": "</tool_call>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151659": {
"content": "<|fim_prefix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151660": {
"content": "<|fim_middle|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151661": {
"content": "<|fim_suffix|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151662": {
"content": "<|fim_pad|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151663": {
"content": "<|repo_name|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151664": {
"content": "<|file_sep|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151665": {
"content": "<tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151666": {
"content": "</tool_response>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151667": {
"content": "<think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151668": {
"content": "</think>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": false
},
"151669": {
"content": "<image>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"151670": {
"content": "<video>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<|im_start|>",
"<|im_end|>",
"<|object_ref_start|>",
"<|object_ref_end|>",
"<|box_start|>",
"<|box_end|>",
"<|quad_start|>",
"<|quad_end|>",
"<|vision_start|>",
"<|vision_end|>",
"<|vision_pad|>",
"<|image_pad|>",
"<|video_pad|>"
],
"bos_token": null,
"clean_up_tokenization_spaces": false,
"eos_token": "<|im_end|>",
"errors": "replace",
"extra_special_tokens": {},
"model_max_length": 131072,
"pad_token": "<|endoftext|>",
"processor_class": "processing_r.RProcessor",
"split_special_tokens": false,
"tokenizer_class": "Qwen2Tokenizer",
"unk_token": null
}
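
To make the id mapping in `added_tokens_decoder` easier to read, here is a minimal sketch that inverts a hand-copied excerpt of the file and looks up the ids of the end-of-sequence and padding tokens. Only four of the entries are included, and the variable names are hypothetical.

```python
import json

# Hand-copied excerpt of tokenizer_config.json; the real file lists ids 151643 to 151670.
excerpt = json.loads("""
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151645": {"content": "<|im_end|>", "special": true},
    "151667": {"content": "<think>", "special": false},
    "151669": {"content": "<image>", "special": true}
  },
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>"
}
""")

# Invert the mapping: token text -> integer id.
token_to_id = {entry["content"]: int(token_id) for token_id, entry in excerpt["added_tokens_decoder"].items()}
eos_id = token_to_id[excerpt["eos_token"]]
pad_id = token_to_id[excerpt["pad_token"]]

# Tokens flagged "special": false are added to the vocabulary but not marked as special.
non_special = [entry["content"] for entry in excerpt["added_tokens_decoder"].values() if not entry["special"]]
print(eos_id, pad_id, non_special)
```

In this excerpt `<|im_end|>` resolves to 151645 and `<|endoftext|>` to 151643, while `<think>` is an added token that is not flagged as special.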


@@ -0,0 +1,26 @@
{
"do_convert_rgb": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"video_processor_type": "LlavaOnevisionVideoProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"processor_class": "LlavaOnevisionProcessor",
"resample": 3,
"rescale_factor": 0.00392156862745098,
"size": {
"height": 384,
"width": 384
}
}
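
The `rescale_factor`, `image_mean` and `image_std` values above can be read as a per-channel mapping from 0-255 pixel values to roughly the range [-1, 1]. The sketch below assumes the usual rescale-then-normalize order, `(x * rescale_factor - mean) / std`; the processor implementation that consumes this config is not shown in this section, and `normalize_pixel` is a hypothetical helper.

```python
# Values copied from the config above.
rescale_factor = 0.00392156862745098  # approximately 1 / 255
image_mean = [0.5, 0.5, 0.5]
image_std = [0.5, 0.5, 0.5]


def normalize_pixel(rgb):
    """Rescale 0-255 channel values, then apply (x - mean) / std per channel."""
    return [(value * rescale_factor - mean) / std for value, mean, std in zip(rgb, image_mean, image_std)]


black = normalize_pixel([0, 0, 0])
white = normalize_pixel([255, 255, 255])
print(black, white)
```

Under that assumption a black pixel maps to about -1.0 per channel and a white pixel to about 1.0.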

BIN
vocab.json (Stored with Git LFS) Normal file

Binary file not shown.