Initialize project; model provided by the ModelHub XC community

Model: OpenGVLab/InternVL3-2B-hf Source: Original Platform
2026-06-06 08:50:13 +08:00
commit e48a8c4f13
15 changed files with 152315 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,49 @@
 *.7z filter=lfs diff=lfs merge=lfs -text
 *.arrow filter=lfs diff=lfs merge=lfs -text
 *.bin filter=lfs diff=lfs merge=lfs -text
 *.bin.* filter=lfs diff=lfs merge=lfs -text
 *.bz2 filter=lfs diff=lfs merge=lfs -text
 *.ftz filter=lfs diff=lfs merge=lfs -text
 *.gz filter=lfs diff=lfs merge=lfs -text
 *.h5 filter=lfs diff=lfs merge=lfs -text
 *.joblib filter=lfs diff=lfs merge=lfs -text
 *.lfs.* filter=lfs diff=lfs merge=lfs -text
 *.model filter=lfs diff=lfs merge=lfs -text
 *.msgpack filter=lfs diff=lfs merge=lfs -text
 *.onnx filter=lfs diff=lfs merge=lfs -text
 *.ot filter=lfs diff=lfs merge=lfs -text
 *.parquet filter=lfs diff=lfs merge=lfs -text
 *.pb filter=lfs diff=lfs merge=lfs -text
 *.pt filter=lfs diff=lfs merge=lfs -text
 *.pth filter=lfs diff=lfs merge=lfs -text
 *.rar filter=lfs diff=lfs merge=lfs -text
 saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.tar.* filter=lfs diff=lfs merge=lfs -text
 *.tflite filter=lfs diff=lfs merge=lfs -text
 *.tgz filter=lfs diff=lfs merge=lfs -text
 *.xz filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zstandard filter=lfs diff=lfs merge=lfs -text
 *.tfevents* filter=lfs diff=lfs merge=lfs -text
 *.db* filter=lfs diff=lfs merge=lfs -text
 *.ark* filter=lfs diff=lfs merge=lfs -text
 **/*ckpt*data* filter=lfs diff=lfs merge=lfs -text
 **/*ckpt*.meta filter=lfs diff=lfs merge=lfs -text
 **/*ckpt*.index filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
 *.ckpt filter=lfs diff=lfs merge=lfs -text
 *.gguf* filter=lfs diff=lfs merge=lfs -text
 *.ggml filter=lfs diff=lfs merge=lfs -text
 *.llamafile* filter=lfs diff=lfs merge=lfs -text
 *.pt2 filter=lfs diff=lfs merge=lfs -text
 *.mlmodel filter=lfs diff=lfs merge=lfs -text
 *.npy filter=lfs diff=lfs merge=lfs -text
 *.npz filter=lfs diff=lfs merge=lfs -text
 *.pickle filter=lfs diff=lfs merge=lfs -text
 *.pkl filter=lfs diff=lfs merge=lfs -text
 *.tar filter=lfs diff=lfs merge=lfs -text
 *.wasm filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 tokenizer.json filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,357 @@
 ---
 license: other
 license_name: qwen
 license_link: https://huggingface.co/Qwen/Qwen2.5-72B-Instruct/blob/main/LICENSE
 pipeline_tag: image-text-to-text
 library_name: transformers
 base_model:
 - OpenGVLab/InternVL3-2B-Instruct
 base_model_relation: finetune
 datasets:
 - OpenGVLab/MMPR-v1.2
 language:
 - multilingual
 tags:
 - internvl
 ---
 # InternVL3-2B Transformers 🤗 Implementation
 [\[📜 InternVL 1.0\]](https://huggingface.co/papers/2312.14238)  [\[📜 InternVL 1.5\]](https://huggingface.co/papers/2404.16821)  [\[📜 InternVL 2.5\]](https://huggingface.co/papers/2412.05271)  [\[📜 InternVL2.5-MPO\]](https://huggingface.co/papers/2411.10442)  [\[📜 InternVL3\]](https://huggingface.co/papers/2504.10479)
 [\[🆕 Blog\]](https://internvl.github.io/blog/)  [\[🗨️ Chat Demo\]](https://internvl.opengvlab.com/)  [\[🤗 HF Demo\]](https://huggingface.co/spaces/OpenGVLab/InternVL)  [\[🚀 Quick Start\]](#quick-start)  [\[📖 Documents\]](https://internvl.readthedocs.io/en/latest/)
 <div align="center">
  <img width="500" alt="image" src="https://cdn-uploads.huggingface.co/production/uploads/64006c09330a45b03605bba3/zJsd2hqd3EevgXo6fNgC-.png">
 </div>
 > [!IMPORTANT]
 > This repository contains the Hugging Face 🤗 Transformers implementation for the [OpenGVLab/InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B) model.
 > It is intended to be functionally equivalent to the original OpenGVLab release.
 > As a native Transformers model, it supports core library features such as multiple attention implementations (eager, SDPA, and FA2) and enables efficient batched inference with interleaved image, video, and text inputs.
 ## Introduction
 We introduce InternVL3, an advanced multimodal large language model (MLLM) series that demonstrates superior overall performance.
 Compared to InternVL 2.5, InternVL3 exhibits superior multimodal perception and reasoning capabilities, while further extending its multimodal capabilities to encompass tool usage, GUI agents, industrial image analysis, 3D vision perception, and more.
 Additionally, we compare InternVL3 with Qwen2.5 Chat models, whose corresponding pre-trained base models are employed as the initialization of the language component in InternVL3. Benefitting from Native Multimodal Pre-Training, the InternVL3 series achieves even better overall text performance than the Qwen2.5 series.
 ![image/png](https://huggingface.co/datasets/Weiyun1025/InternVL-Performance/resolve/main/internvl3/overall.png)
 You can find more information on the InternVL3 family in the original checkpoint, [OpenGVLab/InternVL3-2B](https://huggingface.co/OpenGVLab/InternVL3-2B).
 ## Usage example
 ### Inference with Pipeline
 Here is how you can use the `image-text-to-text` pipeline to perform inference with the `InternVL3` models in just a few lines of code:
 ```python
 >>> from transformers import pipeline
 >>> messages = [
 ...     {
 ...         "role": "user",
 ...         "content": [
 ...             {
 ...                 "type": "image",
 ...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
 ...             },
 ...             {"type": "text", "text": "Describe this image."},
 ...         ],
 ...     },
 ... ]
 >>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-2B-hf")
 >>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
 >>> outputs[0]["generated_text"]
 'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n   - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
 ```
 ### Inference on a single image
 This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.
 > [!NOTE]
 > The model has been trained with a specific prompt format for chatting. Use `processor.apply_chat_template(my_conversation_dict)` to correctly format your prompts.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText
 >>> import torch
 >>> torch_device = "cuda"
 >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
 >>> messages = [
 ...     {
 ...         "role": "user",
 ...         "content": [
 ...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
 ...             {"type": "text", "text": "Please describe the image explicitly."},
 ...         ],
 ...     }
 ... ]
 >>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
 >>> generate_ids = model.generate(**inputs, max_new_tokens=50)
 >>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
 >>> decoded_output
 'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
 ```
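To see what the chat template is formatting, the Jinja template shipped in this repository's `chat_template.jinja` can be mirrored in plain Python. The sketch below is an illustration only: `render_chat` is a hypothetical helper, not part of Transformers, and it covers only the text-rendering step. Loading the media and tokenizing the result are still the processor's job.

```python
def render_chat(messages, add_generation_prompt=False):
    """Plain-Python mirror of chat_template.jinja (hypothetical helper)."""
    parts = []
    for message in messages:
        # Every turn opens with <|im_start|> followed by the role name.
        parts.append("<|im_start|>" + message["role"] + "\n")
        content = message["content"]
        if isinstance(content, str):
            parts.append(content)
        else:
            for item in content:
                if item["type"] == "image":
                    parts.append("<IMG_CONTEXT>\n")  # image placeholder
                elif item["type"] == "video":
                    parts.append("<video>\n")  # video placeholder
                elif item["type"] == "text":
                    parts.append(item["text"])
        # Every turn closes with <|im_end|>.
        parts.append("<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


example_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Please describe the image explicitly."},
        ],
    }
]
prompt = render_chat(example_messages, add_generation_prompt=True)
print(prompt)
```

Note that the URL never appears in the rendered prompt: the template only emits a placeholder for each image or video entry.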
 ### Text-only generation
 This example shows how to generate text using the InternVL model without providing any image input.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText
 >>> import torch
 >>> torch_device = "cuda"
 >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
 >>> messages = [
 ...     {
 ...         "role": "user",
 ...         "content": [
 ...             {"type": "text", "text": "Write a haiku"},
 ...         ],
 ...     }
 ... ]
 >>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
 >>> generate_ids = model.generate(**inputs, max_new_tokens=50)
 >>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
 >>> decoded_output
 "Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
 ```
 ### Batched image and text inputs
 InternVL models also support batched image and text inputs.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText
 >>> import torch
 >>> torch_device = "cuda"
 >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
 >>> messages = [
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
 ...                 {"type": "text", "text": "Write a haiku for this image"},
 ...             ],
 ...         },
 ...     ],
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
 ...                 {"type": "text", "text": "Describe this image"},
 ...             ],
 ...         },
 ...     ],
 ... ]
 >>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
 >>> output = model.generate(**inputs, max_new_tokens=25)
 >>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
 >>> decoded_outputs
 ["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
 ```
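Because `batch_decode` is applied to the full output sequences, each decoded string above still contains the prompt (`user ... assistant`). One way to keep only the reply is to split on the last `assistant` marker. This is a minimal sketch: `extract_reply` is a hypothetical helper, and it assumes the reply itself never contains a line consisting solely of `assistant`.

```python
def extract_reply(decoded: str, marker: str = "\nassistant\n") -> str:
    """Return the text after the last assistant marker (hypothetical helper)."""
    _, found, reply = decoded.rpartition(marker)
    # If the marker is missing, hand the string back unchanged.
    return reply if found else decoded


decoded_outputs = [
    "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
    "user\n\nDescribe this image\nassistant\nThe image shows a street scene",
]
replies = [extract_reply(text) for text in decoded_outputs]
print(replies)
```

An alternative is to slice off the prompt tokens before decoding, as the single-image example does with `generate_ids[0, inputs["input_ids"].shape[1] :]`.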
 ### Batched multi-image input
 This implementation of the InternVL models supports batched text-image inputs with a different number of images for each text.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText
 >>> import torch
 >>> torch_device = "cuda"
 >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
 >>> messages = [
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
 ...                 {"type": "text", "text": "Write a haiku for this image"},
 ...             ],
 ...         },
 ...     ],
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
 ...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
 ...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
 ...             ],
 ...         },
 ...     ],
 ... ]
 >>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
 >>> output = model.generate(**inputs, max_new_tokens=25)
 >>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
 >>> decoded_outputs
 ["user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace.",
 'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
 ```
 ### Video input
 InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
 >>> import torch
 >>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
 >>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
 >>> messages = [
 ...     {
 ...         "role": "user",
 ...         "content": [
 ...             {
 ...                 "type": "video",
 ...                 "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
 ...             },
 ...             {"type": "text", "text": "What type of shot is the man performing?"},
 ...         ],
 ...     }
 ... ]
 >>> inputs = processor.apply_chat_template(
 ...     messages,
 ...     return_tensors="pt",
 ...     add_generation_prompt=True,
 ...     tokenize=True,
 ...     return_dict=True,
 ... ).to(model.device, dtype=torch.float16)
 >>> output = model.generate(**inputs, max_new_tokens=25)
 >>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
 >>> decoded_output
 'The man is performing a forehand shot.'
 ```
 ### Interleaved image and video inputs
 This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat templates.
 ```python
 >>> from transformers import AutoProcessor, AutoModelForImageTextToText
 >>> import torch
 >>> torch_device = "cuda"
 >>> model_checkpoint = "OpenGVLab/InternVL3-2B-hf"
 >>> processor = AutoProcessor.from_pretrained(model_checkpoint)
 >>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
 >>> messages = [
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
 ...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
 ...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
 ...             ],
 ...         },
 ...     ],
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
 ...                 {"type": "text", "text": "What type of shot is the man performing?"},
 ...             ],
 ...         },
 ...     ],
 ...     [
 ...         {
 ...             "role": "user",
 ...             "content": [
 ...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
 ...                 {"type": "text", "text": "Write a haiku for this image"},
 ...             ],
 ...         },
 ...     ],
 ... ]
 >>> inputs = processor.apply_chat_template(
 ...     messages,
 ...     padding=True,
 ...     add_generation_prompt=True,
 ...     tokenize=True,
 ...     return_dict=True,
 ...     return_tensors="pt",
 ... ).to(model.device, dtype=torch.bfloat16)
 >>> outputs = model.generate(**inputs, max_new_tokens=25)
 >>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
 >>> decoded_outputs
 ['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
 'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
 "user\n\nWrite a haiku for this image\nassistant\nSilky lake,  \nWooden pier,  \nNature's peace."]
 ```
 ## License
 This project is released under the MIT License. This project uses the pre-trained Qwen2.5 as a component, which is licensed under the Qwen License.
 ## Citation
 If you find this project useful in your research, please consider citing:
 ```BibTeX
@article{chen2024expanding,
  title={Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling},
  author={Chen, Zhe and Wang, Weiyun and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Cui, Erfei and Zhu, Jinguo and Ye, Shenglong and Tian, Hao and Liu, Zhaoyang and others},
  journal={arXiv preprint arXiv:2412.05271},
  year={2024}
 }
@article{wang2024mpo,
  title={Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization},
  author={Wang, Weiyun and Chen, Zhe and Wang, Wenhai and Cao, Yue and Liu, Yangzhou and Gao, Zhangwei and Zhu, Jinguo and Zhu, Xizhou and Lu, Lewei and Qiao, Yu and Dai, Jifeng},
  journal={arXiv preprint arXiv:2411.10442},
  year={2024}
 }
@article{chen2024far,
  title={How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites},
  author={Chen, Zhe and Wang, Weiyun and Tian, Hao and Ye, Shenglong and Gao, Zhangwei and Cui, Erfei and Tong, Wenwen and Hu, Kongzhi and Luo, Jiapeng and Ma, Zheng and others},
  journal={arXiv preprint arXiv:2404.16821},
  year={2024}
 }
@inproceedings{chen2024internvl,
  title={Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks},
  author={Chen, Zhe and Wu, Jiannan and Wang, Wenhai and Su, Weijie and Chen, Guo and Xing, Sen and Zhong, Muyan and Zhang, Qinglong and Zhu, Xizhou and Lu, Lewei and others},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={24185--24198},
  year={2024}
 }
 ```
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,34 @@
 {
  "</box>": 151673,
  "</img>": 151666,
  "</quad>": 151669,
  "</ref>": 151671,
  "</tool_call>": 151658,
  "<IMG_CONTEXT>": 151667,
  "<box>": 151672,
  "<img>": 151665,
  "<quad>": 151668,
  "<ref>": 151670,
  "<tool_call>": 151657,
  "<video>": 151674,
  "<|box_end|>": 151649,
  "<|box_start|>": 151648,
  "<|endoftext|>": 151643,
  "<|file_sep|>": 151664,
  "<|fim_middle|>": 151660,
  "<|fim_pad|>": 151662,
  "<|fim_prefix|>": 151659,
  "<|fim_suffix|>": 151661,
  "<|im_end|>": 151645,
  "<|im_start|>": 151644,
  "<|image_pad|>": 151655,
  "<|object_ref_end|>": 151647,
  "<|object_ref_start|>": 151646,
  "<|quad_end|>": 151651,
  "<|quad_start|>": 151650,
  "<|repo_name|>": 151663,
  "<|video_pad|>": 151656,
  "<|vision_end|>": 151653,
  "<|vision_pad|>": 151654,
  "<|vision_start|>": 151652
 }
--- a/chat_template.jinja
+++ b/chat_template.jinja
@@ -0,0 +1,6 @@
 {% for message in messages %}{{'<|im_start|>' + message['role'] + '
 '}}{% if message['content'] is string %}{{ message['content'] }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' %}{{ '<IMG_CONTEXT>
 ' }}{% elif content['type'] == 'video' %}{{ '<video>
 ' }}{% elif content['type'] == 'text' %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}{{'<|im_end|>
 '}}{% endfor %}{% if add_generation_prompt %}{{'<|im_start|>assistant
 ' }}{% endif %}
--- a/config.json
+++ b/config.json
@@ -0,0 +1,79 @@
 {
  "architectures": [
    "InternVLForConditionalGeneration"
  ],
  "downsample_ratio": 0.5,
  "image_seq_length": 256,
  "image_token_id": 151667,
  "model_type": "internvl",
  "projector_hidden_act": "gelu",
  "text_config": {
    "architectures": [
      "Qwen2ForCausalLM"
    ],
    "attention_dropout": 0.0,
    "bos_token_id": 151643,
    "eos_token_id": 151645,
    "hidden_act": "silu",
    "hidden_size": 1536,
    "initializer_range": 0.02,
    "intermediate_size": 8960,
    "max_position_embeddings": 32768,
    "max_window_layers": 70,
    "model_type": "qwen2",
    "num_attention_heads": 12,
    "num_hidden_layers": 28,
    "num_key_value_heads": 2,
    "rms_norm_eps": 1e-06,
    "rope_scaling": {
      "factor": 2.0,
      "rope_type": "dynamic",
      "type": "dynamic"
    },
    "rope_theta": 1000000.0,
    "sliding_window": null,
    "torch_dtype": "bfloat16",
    "use_cache": true,
    "use_sliding_window": false,
    "vocab_size": 151674
  },
  "torch_dtype": "bfloat16",
  "transformers_version": "4.52.0.dev0",
  "vision_config": {
    "architectures": [
      "InternVisionModel"
    ],
    "attention_bias": true,
    "attention_dropout": 0.0,
    "dropout": 0.0,
    "hidden_act": "gelu",
    "hidden_dropout_prob": 0.0,
    "hidden_size": 1024,
    "image_size": [
      448,
      448
    ],
    "initializer_factor": 0.1,
    "initializer_range": 1e-10,
    "intermediate_size": 4096,
    "layer_norm_eps": 1e-06,
    "layer_scale_init_value": 0.1,
    "model_type": "internvl_vision",
    "norm_type": "layer_norm",
    "num_attention_heads": 16,
    "num_channels": 3,
    "num_hidden_layers": 24,
    "patch_size": [
      14,
      14
    ],
    "projection_dropout": 0.0,
    "torch_dtype": "bfloat16",
    "use_absolute_position_embeddings": true,
    "use_mask_token": false,
    "use_mean_pooling": true,
    "use_qk_norm": false
  },
  "vision_feature_layer": -1,
  "vision_feature_select_strategy": "default"
 }
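The `image_seq_length` of 256 in this config is consistent with the vision settings: a 448x448 tile split into 14x14 patches gives a 32x32 grid, and a `downsample_ratio` of 0.5 applied to each spatial side leaves a 16x16 grid. The arithmetic can be checked with a short sketch; `image_tokens_per_tile` is a hypothetical helper, and applying the ratio per side is an assumption about how the downsampling is defined.

```python
def image_tokens_per_tile(image_size: int, patch_size: int, downsample_ratio: float) -> int:
    """Tokens the language model sees for one square tile (hypothetical helper)."""
    patches_per_side = image_size // patch_size  # 448 // 14 = 32
    # Assumption: the ratio scales each spatial side of the patch grid.
    tokens_per_side = int(patches_per_side * downsample_ratio)  # 32 * 0.5 = 16
    return tokens_per_side * tokens_per_side


tokens = image_tokens_per_tile(image_size=448, patch_size=14, downsample_ratio=0.5)
print(tokens)  # 256, matching "image_seq_length"
```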
--- a/configuration.json
+++ b/configuration.json
@@ -0,0 +1 @@
 {"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,6 @@
 {
  "_from_model_config": true,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "transformers_version": "4.52.0.dev0"
 }
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:8e2c302719a13916a4d276e60fad4fb1c92a57b39dd2ad51af5a63dcf7a16f4a
 size 4178013768
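The three lines above are a Git LFS pointer, not the weights themselves. As a rough sanity check, the recorded size can be turned into a parameter estimate. The sketch below uses a hypothetical `parse_lfs_pointer` helper and assumes every weight is stored as 2-byte bfloat16 (the `torch_dtype` in `config.json`), ignoring the safetensors header.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer (hypothetical helper)."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file in bytes
    }


POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8e2c302719a13916a4d276e60fad4fb1c92a57b39dd2ad51af5a63dcf7a16f4a
size 4178013768
"""

info = parse_lfs_pointer(POINTER)
approx_params = info["size"] / 2  # assumption: 2 bytes per parameter (bfloat16)
print(f"{info['size'] / 1e9:.2f} GB, ~{approx_params / 1e9:.2f}B parameters")
# prints "4.18 GB, ~2.09B parameters"
```

The estimate of roughly 2.09B parameters is in line with the "2B" in the model name.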
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,34 @@
 {
  "crop_size": null,
  "crop_to_patches": false,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.485,
    0.456,
    0.406
  ],
  "image_processor_type": "GotOcr2ImageProcessorFast",
  "image_std": [
    0.229,
    0.224,
    0.225
  ],
  "input_data_format": null,
  "max_patches": 12,
  "min_patches": 1,
  "processor_class": "InternVLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_tensors": null,
  "size": {
    "height": 448,
    "width": 448
  }
 }
--- a/processor_config.json
+++ b/processor_config.json
@@ -0,0 +1,4 @@
 {
  "image_seq_length": 256,
  "processor_class": "InternVLProcessor"
 }
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,44 @@
 {
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>",
    "<img>",
    "</img>",
    "<IMG_CONTEXT>",
    "<quad>",
    "</quad>",
    "<ref>",
    "</ref>",
    "<box>",
    "</box>"
  ],
  "context_image_token": "<IMG_CONTEXT>",
  "end_image_token": "</img>",
  "eos_token": {
    "content": "<|im_end|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<|endoftext|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "start_image_token": "<img>",
  "video_token": "<video>"
 }
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
 version https://git-lfs.github.com/spec/v1
 oid sha256:7cc80b7e20adf8bf6f6ca442bf1abfac8056bb3b7d3e0b11c9d497d3e79398c9
 size 11423732
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
@@ -0,0 +1,306 @@
 {
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "151643": {
      "content": "<|endoftext|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151644": {
      "content": "<|im_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151645": {
      "content": "<|im_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151646": {
      "content": "<|object_ref_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151647": {
      "content": "<|object_ref_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151648": {
      "content": "<|box_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151649": {
      "content": "<|box_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151650": {
      "content": "<|quad_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151651": {
      "content": "<|quad_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151652": {
      "content": "<|vision_start|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151653": {
      "content": "<|vision_end|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151654": {
      "content": "<|vision_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151655": {
      "content": "<|image_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151656": {
      "content": "<|video_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151657": {
      "content": "<tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151658": {
      "content": "</tool_call>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151659": {
      "content": "<|fim_prefix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151660": {
      "content": "<|fim_middle|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151661": {
      "content": "<|fim_suffix|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151662": {
      "content": "<|fim_pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151663": {
      "content": "<|repo_name|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151664": {
      "content": "<|file_sep|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": false
    },
    "151665": {
      "content": "<img>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151666": {
      "content": "</img>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151667": {
      "content": "<IMG_CONTEXT>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151668": {
      "content": "<quad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151669": {
      "content": "</quad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151670": {
      "content": "<ref>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151671": {
      "content": "</ref>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151672": {
      "content": "<box>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151673": {
      "content": "</box>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "151674": {
      "content": "<video>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<|im_start|>",
    "<|im_end|>",
    "<|object_ref_start|>",
    "<|object_ref_end|>",
    "<|box_start|>",
    "<|box_end|>",
    "<|quad_start|>",
    "<|quad_end|>",
    "<|vision_start|>",
    "<|vision_end|>",
    "<|vision_pad|>",
    "<|image_pad|>",
    "<|video_pad|>",
    "<img>",
    "</img>",
    "<IMG_CONTEXT>",
    "<quad>",
    "</quad>",
    "<ref>",
    "</ref>",
    "<box>",
    "</box>"
  ],
  "bos_token": null,
  "clean_up_tokenization_spaces": false,
  "context_image_token": "<IMG_CONTEXT>",
  "end_image_token": "</img>",
  "eos_token": "<|im_end|>",
  "errors": "replace",
  "extra_special_tokens": {
    "context_image_token": "<IMG_CONTEXT>",
    "end_image_token": "</img>",
    "start_image_token": "<img>",
    "video_token": "<video>"
  },
  "model_max_length": 8192,
  "pad_token": "<|endoftext|>",
  "return_token_type_ids": false,
  "split_special_tokens": false,
  "start_image_token": "<img>",
  "tokenizer_class": "Qwen2Tokenizer",
  "unk_token": null,
  "video_token": "<video>"
 }
--- a/vocab.json
+++ b/vocab.json