Initialize project; model provided by the ModelHub XC community

Model: HuggingFaceTB/SmolVLM-Instruct
Source: Original Platform
This commit is contained in:
ModelHub XC
2026-06-05 00:54:13 +08:00
commit 74f83eb12c
44 changed files with 294746 additions and 0 deletions

61
.gitattributes vendored Normal file

@@ -0,0 +1,61 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged.onnx_data filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_fp16.onnx_data filter=lfs diff=lfs merge=lfs -text
model.safetensors filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_bnb4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_fp16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_int8.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_q4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_q4f16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_quantized.onnx filter=lfs diff=lfs merge=lfs -text
onnx/decoder_model_merged_uint8.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_bnb4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_fp16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_int8.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_q4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_q4f16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_quantized.onnx filter=lfs diff=lfs merge=lfs -text
onnx/embed_tokens_uint8.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_bnb4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_fp16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_int8.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_q4.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_q4f16.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_quantized.onnx filter=lfs diff=lfs merge=lfs -text
onnx/vision_encoder_uint8.onnx filter=lfs diff=lfs merge=lfs -text

180
README.md Normal file

@@ -0,0 +1,180 @@
---
library_name: transformers
license: apache-2.0
datasets:
- HuggingFaceM4/the_cauldron
- HuggingFaceM4/Docmatix
pipeline_tag: image-text-to-text
language:
- en
base_model:
- HuggingFaceTB/SmolLM2-1.7B-Instruct
- google/siglip-so400m-patch14-384
---
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/SmolVLM.png" width="800" height="auto" alt="Image description">
# SmolVLM
SmolVLM is a compact open multimodal model that accepts arbitrary sequences of image and text inputs to produce text outputs. Designed for efficiency, SmolVLM can answer questions about images, describe visual content, create stories grounded on multiple images, or function as a pure language model without visual inputs. Its lightweight architecture makes it suitable for on-device applications while maintaining strong performance on multimodal tasks.
## Model Summary
- **Developed by:** Hugging Face 🤗
- **Model type:** Multi-modal model (image+text)
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Architecture:** Based on [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3) (see technical summary)
## Resources
- **Demo:** [SmolVLM Demo](https://huggingface.co/spaces/HuggingFaceTB/SmolVLM)
- **Blog:** [Blog post](https://huggingface.co/blog/smolvlm)
## Uses
SmolVLM can be used for inference on multimodal (image + text) tasks where the input comprises text queries along with one or more images. Text and images can be interleaved arbitrarily, enabling tasks like image captioning, visual question answering, and storytelling based on visual content. The model does not support image generation.
To fine-tune SmolVLM on a specific task, you can follow the fine-tuning tutorial.
<!-- todo: add link to fine-tuning tutorial -->
### Technical Summary
SmolVLM leverages the lightweight SmolLM2 language model to provide a compact yet powerful multimodal experience. It introduces several changes compared to previous Idefics models:
- **Image compression:** SmolVLM compresses images more aggressively than Idefics3, which lets the model run inference faster and use less RAM.
- **Visual Token Encoding:** SmolVLM uses 81 visual tokens to encode image patches of size 384×384. Larger images are divided into patches, each encoded separately, enhancing efficiency without compromising performance.
More details about the training and architecture are available in our technical report.
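The figure of 81 visual tokens can be checked against the values published in `config.json` (`image_size` 384, `patch_size` 14, `scale_factor` 3, `image_seq_len` 81). The sketch below is only arithmetic: it assumes the vision encoder emits one embedding per whole 14×14 pixel patch and that each `scale_factor × scale_factor` group of embeddings is then merged into a single token. The function name is illustrative and not part of transformers.

```python
def visual_tokens_per_tile(image_size: int = 384,
                           patch_size: int = 14,
                           scale_factor: int = 3) -> int:
    """Estimate how many tokens represent one 384x384 tile.

    Assumption: only whole patches are counted, and the connector merges
    scale_factor**2 neighbouring patch embeddings into one token.
    """
    patches_per_side = image_size // patch_size   # 384 // 14 = 27
    encoder_outputs = patches_per_side ** 2       # 27 * 27 = 729
    return encoder_outputs // (scale_factor ** 2) # 729 // 9 = 81


print(visual_tokens_per_tile())
```

With the default values this reproduces the `image_seq_len` of 81 stored in the configuration.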
### How to get started
You can use transformers to load SmolVLM, run inference, and fine-tune it.
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM-Instruct")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
)

print(generated_texts[0])
"""
Assistant: The first image shows a green statue of the Statue of Liberty standing on a stone pedestal in front of a body of water.
The statue is holding a torch in its right hand and a tablet in its left hand. The water is calm and there are no boats or other objects visible.
The sky is clear and there are no clouds. The second image shows a bee on a pink flower.
The bee is black and yellow and is collecting pollen from the flower. The flower is surrounded by green leaves.
"""
```
### Model optimizations
**Precision**: For better performance, load and run the model in half-precision (`torch.float16` or `torch.bfloat16`) if your hardware supports it.
```python
from transformers import AutoModelForVision2Seq
import torch

model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    torch_dtype=torch.bfloat16,
).to("cuda")
```
You can also load SmolVLM with 4/8-bit quantization using bitsandbytes, torchao or Quanto. Refer to [this page](https://huggingface.co/docs/transformers/en/main_classes/quantization) for other options.
```python
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig
import torch

quantization_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceTB/SmolVLM-Instruct",
    quantization_config=quantization_config,
)
```
**Vision Encoder Efficiency**: Adjust the image resolution by setting `size={"longest_edge": N*384}` when initializing the processor, where N is your desired value. The default `N=4` works well and results in input images of size 1536×1536. For documents, `N=5` might be beneficial. Decreasing N can save GPU memory and is appropriate for lower-resolution images. This is also useful if you want to fine-tune on videos.
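As a minimal sketch of the `size` setting described above, the helpers below build the argument for a chosen N and give a rough idea of how the visual token count grows with it. The helper names are hypothetical. The token estimate rests on assumptions that are not stated in this README: a square input image, one extra downscaled global view whenever the image is split into more than one tile (as in Idefics3-style image splitting), and no separator tokens counted.

```python
TILE_EDGE = 384        # max_image_size.longest_edge in preprocessor_config.json
TOKENS_PER_TILE = 81   # image_seq_len in processor_config.json


def processor_size(n: int) -> dict:
    """Build the `size` argument for a given N (hypothetical helper)."""
    if n < 1:
        raise ValueError("N must be a positive integer")
    return {"longest_edge": n * TILE_EDGE}


def estimated_visual_tokens(n: int) -> int:
    """Rough token estimate for a square image (see assumptions above)."""
    tiles = n * n
    # Assumption: a global view is added only when the image is actually split.
    views = tiles + 1 if tiles > 1 else 1
    return views * TOKENS_PER_TILE


def load_processor(n: int):
    """Load the processor with a custom resolution (downloads files when called)."""
    from transformers import AutoProcessor  # imported lazily on purpose

    return AutoProcessor.from_pretrained(
        "HuggingFaceTB/SmolVLM-Instruct",
        size=processor_size(n),
    )


print(processor_size(4), estimated_visual_tokens(4))
```

Under these assumptions, lowering N from 4 to 2 cuts the estimate from 17 views to 5 views per image, which is where the memory saving would come from.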
## Misuse and Out-of-scope Use
SmolVLM is not intended for high-stakes scenarios or critical decision-making processes that affect an individual's well-being or livelihood. The model may produce content that appears factual but may not be accurate. Misuse includes, but is not limited to:
- Prohibited Uses:
  - Evaluating or scoring individuals (e.g., in employment, education, credit)
  - Critical automated decision-making
  - Generating unreliable factual content
- Malicious Activities:
  - Spam generation
  - Disinformation campaigns
  - Harassment or abuse
  - Unauthorized surveillance
### License
SmolVLM uses [the shape-optimized SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) as its image encoder and [SmolLM2](https://huggingface.co/HuggingFaceTB/SmolLM2-1.7B-Instruct) as its text decoder.
We release the SmolVLM checkpoints under the Apache 2.0 license.
## Training Details
### Training Data
The training data comes from [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron) and [Docmatix](https://huggingface.co/datasets/HuggingFaceM4/Docmatix) datasets, with emphasis on document understanding (25%) and image captioning (18%), while maintaining balanced coverage across other crucial capabilities like visual reasoning, chart comprehension, and general instruction following.
<img src="https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct/resolve/main/mixture_the_cauldron.png" alt="Example Image" style="width:90%;" />
## Evaluation
| Model | MMMU (val) | MathVista (testmini) | MMStar (val) | DocVQA (test) | TextVQA (val) | Min GPU RAM required (GB) |
|-------------------|------------|----------------------|--------------|---------------|---------------|---------------------------|
| SmolVLM | 38.8 | 44.6 | 42.1 | 81.6 | 72.7 | 5.02 |
| Qwen-VL 2B | 41.1 | 47.8 | 47.5 | 90.1 | 79.7 | 13.70 |
| InternVL2 2B | 34.3 | 46.3 | 49.8 | 86.9 | 73.4 | 10.52 |
| PaliGemma 3B 448px| 34.9 | 28.7 | 48.3 | 32.2 | 56.0 | 6.72 |
| moondream2 | 32.4 | 24.3 | 40.3 | 70.5 | 65.2 | 3.87 |
| MiniCPM-V-2 | 38.2 | 39.8 | 39.1 | 71.9 | 74.1 | 7.88 |
| MM1.5 1B | 35.8 | 37.2 | 0.0 | 81.0 | 72.5 | NaN |

BIN
SmolVLM.png Normal file

Binary file not shown.

Size: 142 KiB

5
added_tokens.json Normal file

@@ -0,0 +1,5 @@
{
"<end_of_utterance>": 49154,
"<fake_token_around_image>": 49152,
"<image>": 49153
}
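The three added tokens account for the difference between the two vocabulary sizes that appear in this commit: `tokenizer_config.json` lists `"vocab_size": 49152`, while `config.json` lists `"vocab_size": 49155`. A small sketch of that consistency check, with an illustrative function name:

```python
BASE_VOCAB_SIZE = 49152  # "vocab_size" in tokenizer_config.json

ADDED_TOKENS = {
    "<end_of_utterance>": 49154,
    "<fake_token_around_image>": 49152,
    "<image>": 49153,
}


def total_vocab_size(base_size: int, added_tokens: dict) -> int:
    """Return the vocabulary size after appending the added tokens.

    Raises ValueError if the added IDs do not directly follow the base
    vocabulary without gaps or overlaps.
    """
    expected_ids = set(range(base_size, base_size + len(added_tokens)))
    if set(added_tokens.values()) != expected_ids:
        raise ValueError("added token IDs are not contiguous with the base vocabulary")
    return base_size + len(added_tokens)


print(total_vocab_size(BASE_VOCAB_SIZE, ADDED_TOKENS))
```

The result matches the model-side `vocab_size`, so the embedding table has a row for every added token.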

3
chat_template.json Normal file

@@ -0,0 +1,3 @@
{
"chat_template": "<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}"
}
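To make the template easier to read, here is a plain-Python re-implementation of the same logic. It is a sketch for illustration only, not the code transformers runs (the library renders the Jinja template itself), and `render_chat` is a hypothetical name. It follows the template as written: roles are capitalized, a message that starts with an image gets `:` with no trailing space, and every message ends with `<end_of_utterance>` and a newline.

```python
def render_chat(messages: list, add_generation_prompt: bool = False) -> str:
    """Render messages the way the chat template above does (illustrative)."""
    parts = ["<|im_start|>"]
    for message in messages:
        content = message["content"]
        parts.append(message["role"].capitalize())
        # No space after the colon when the message starts with an image.
        parts.append(":" if content[0]["type"] == "image" else ": ")
        for item in content:
            if item["type"] == "text":
                parts.append(item["text"])
            elif item["type"] == "image":
                parts.append("<image>")
        parts.append("<end_of_utterance>\n")
    if add_generation_prompt:
        parts.append("Assistant:")
    return "".join(parts)


messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Can you describe the two images?"},
        ],
    },
]
print(render_chat(messages, add_generation_prompt=True))
```

Each `<image>` in the rendered string is only a placeholder at this stage; the images themselves are passed to the processor separately.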

263
config.json Normal file

@@ -0,0 +1,263 @@
{
"architectures": [
"Idefics3ForConditionalGeneration"
],
"image_seq_len": 81,
"image_token_id": 49153,
"model_type": "idefics3",
"scale_factor": 3,
"text_config": {
"_attn_implementation_autoset": false,
"_flash_attn_2_enabled": true,
"_name_or_path": "/fsx/m4/experiments/local_experiment_dir/s3_async_temporary_checkpoint_folder/tr_324_opt_400/unwrapped_model",
"add_cross_attention": false,
"architectures": [
"VLlama3ForCausalLM"
],
"attention_bias": false,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": 0,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": 0,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"head_dim": 64,
"hidden_act": "silu",
"hidden_size": 2048,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 8192,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"length_penalty": 1.0,
"max_length": 20,
"max_position_embeddings": 16384,
"min_length": 0,
"mlp_bias": false,
"model_type": "llama",
"neftune_noise_alpha": 0.0,
"no_repeat_ngram_size": 0,
"num_attention_heads": 32,
"num_beam_groups": 1,
"num_beams": 1,
"num_hidden_layers": 24,
"num_key_value_heads": 32,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": 2,
"perceiver_config": {
"_attn_implementation_autoset": false,
"_name_or_path": "",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "silu",
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "vllama3",
"no_repeat_ngram_size": 0,
"num_beam_groups": 1,
"num_beams": 1,
"num_key_value_heads": 1,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"qk_layer_norms_perceiver": false,
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"resampler_depth": 6,
"resampler_head_dim": 96,
"resampler_n_heads": 16,
"resampler_n_latents": 64,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": true,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"transformers_version": "4.46.0",
"typical_p": 1.0,
"use_bfloat16": false
},
"prefix": null,
"pretraining_tp": 1,
"problem_type": null,
"pruned_heads": {},
"qk_layer_norms": false,
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"rms_norm_eps": 1e-05,
"rope_scaling": null,
"rope_theta": 273768.0,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": "bfloat16",
"torchscript": false,
"typical_p": 1.0,
"use_bfloat16": false,
"use_cache": true,
"use_resampler": false,
"vocab_size": 49155
},
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.46.0",
"transformers.js_config": {
"kv_cache_dtype": {
"q4f16": "float16",
"fp16": "float16"
},
"dtype": {
"embed_tokens": "auto",
"vision_encoder": "auto",
"decoder_model_merged": "q4"
}
},
"use_cache": true,
"vision_config": {
"size": {"longest_edge": 1920},
"max_image_size": {"longest_edge": 384},
"_attn_implementation_autoset": false,
"_name_or_path": "",
"add_cross_attention": false,
"architectures": null,
"attention_dropout": 0.0,
"bad_words_ids": null,
"begin_suppress_tokens": null,
"bos_token_id": null,
"chunk_size_feed_forward": 0,
"cross_attention_hidden_size": null,
"decoder_start_token_id": null,
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": false,
"encoder_no_repeat_ngram_size": 0,
"eos_token_id": null,
"exponential_decay_length_penalty": null,
"finetuning_task": null,
"forced_bos_token_id": null,
"forced_eos_token_id": null,
"hidden_act": "gelu_pytorch_tanh",
"hidden_size": 1152,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"image_size": 384,
"initializer_range": 0.02,
"intermediate_size": 4304,
"is_decoder": false,
"is_encoder_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-06,
"length_penalty": 1.0,
"max_length": 20,
"min_length": 0,
"model_type": "idefics3",
"no_repeat_ngram_size": 0,
"num_attention_heads": 16,
"num_beam_groups": 1,
"num_beams": 1,
"num_channels": 3,
"num_hidden_layers": 27,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_scores": false,
"pad_token_id": null,
"patch_size": 14,
"prefix": null,
"problem_type": null,
"pruned_heads": {},
"remove_invalid_values": false,
"repetition_penalty": 1.0,
"return_dict": true,
"return_dict_in_generate": false,
"sep_token_id": null,
"suppress_tokens": null,
"task_specific_params": null,
"temperature": 1.0,
"tf_legacy_loss": false,
"tie_encoder_decoder": false,
"tie_word_embeddings": false,
"tokenizer_class": null,
"top_k": 50,
"top_p": 1.0,
"torch_dtype": null,
"torchscript": false,
"typical_p": 1.0,
"use_bfloat16": false
},
"vocab_size": 49155
}
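The `text_config` values above are enough for a back-of-the-envelope estimate of the key/value cache size during generation. The sketch below assumes a standard decoder cache that stores one key and one value vector per layer, per KV head, per token, in a 2-byte dtype such as bfloat16. It ignores weights, activations, and framework overhead, so it is not a measurement of total memory use.

```python
def kv_cache_bytes(num_tokens: int,
                   num_layers: int = 24,      # num_hidden_layers
                   num_kv_heads: int = 32,    # num_key_value_heads
                   head_dim: int = 64,        # head_dim
                   bytes_per_value: int = 2   # bfloat16
                   ) -> int:
    """Estimate KV-cache size in bytes for a single sequence."""
    # Factor 2: one key tensor and one value tensor per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return num_tokens * per_token


print(kv_cache_bytes(1))                      # bytes per cached token
print(kv_cache_bytes(16384) / 1024 ** 3)      # GiB at max_position_embeddings
```

Under these assumptions each cached token costs 192 KiB, so a full 16384-token context would need about 3 GiB for the cache alone.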

1
configuration.json Normal file

@@ -0,0 +1 @@
{"framework": "pytorch", "task": "image-text-to-text", "allow_remote": true}

7
generation_config.json Normal file

@@ -0,0 +1,7 @@
{
"_from_model_config": true,
"bos_token_id": 0,
"eos_token_id": 49154,
"pad_token_id": 2,
"transformers_version": "4.46.0"
}

48901
merges.txt Normal file

File diff suppressed because it is too large

BIN
mixture_the_cauldron.png Normal file

Binary file not shown.

Size: 915 KiB

3
model.safetensors Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8a4f76cb64f6f2e4e74716d8fc1cfc6a70bbb3eeea69d424c3ec9902655065eb
size 4492630912
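The three lines above are a Git LFS pointer, not the weights themselves: the repository stores only the spec version, the SHA-256 of the real file, and its size in bytes. A minimal sketch of reading such a pointer follows; the function name and the returned dictionary layout are illustrative choices.

```python
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:8a4f76cb64f6f2e4e74716d8fc1cfc6a70bbb3eeea69d424c3ec9902655065eb
size 4492630912
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into its fields (illustrative helper)."""
    # Each line is "<key> <value>"; split on the first space only.
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algorithm, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


info = parse_lfs_pointer(POINTER)
print(info["hash_algorithm"], info["size_bytes"] / 1024 ** 3)
```

For `model.safetensors` the recorded size works out to roughly 4.18 GiB, which is what an LFS-enabled clone would download for this one file.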

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:a27ef6fe177d3109e0913c63da7a4b0f2791fab95da3e5f91b31ba6e03115385
size 126930

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d530c318000311b2697d0b891ef46c69f9e9c89688761e043654d08a3cca376c
size 6849724416

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:39974ffc8a05f4de601005dc555d326f3dd2744ffd544e2892ad065fe25b2b8a
size 967330291

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e6ec36a518b896dfc6448c4343898d0ea5109702677d34cea3919fba074044d1
size 1342510427

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:4249b577dcd1cd146c6db45bea29108dd1e3831f1c6a5d6a226d01ac92ab411d
size 2082471936

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2a2369559862dd3e40361a2b63ac0e7be18c07c72845393ee89df0e79713f6c7
size 1716139218

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:612e5c30793bc2045f9262b597013a25bcca44b4f76a7db196938a57a77e1f79
size 1074284508

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2d74ec46083829ddb18f58fcceb358d2ba58d2a1320bdab431c32e4d2896981d
size 965031477

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bddb1dcd933e681eb2a542186c081dd1e6cf4b67161d905ef9da31cabbd3474d
size 1716139269

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bddb1dcd933e681eb2a542186c081dd1e6cf4b67161d905ef9da31cabbd3474d
size 1716139269

3
onnx/embed_tokens.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:8ec8537866d20b78e618e15aea8f91a558266cd77fe783e513f095fc1de1c8c4
size 402678062

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eca0a3199567ba01a76dc6b923fd14bce39d6eb51d26686654bb7a98acfad280
size 402678081

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:377adedd6ac1975e3afc3fb4c24dd6032a973626da71a5e0648dec3735a56527
size 201339266

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
size 100669984

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:eca0a3199567ba01a76dc6b923fd14bce39d6eb51d26686654bb7a98acfad280
size 402678081

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:d835e5524c9a8b349fe55a7f589ab21780417c2f1e67f52062cf7787dcbefc3b
size 201339285

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
size 100669984

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6666a926ca2a65f89016ea19ec7c5b8afd01c58e5aca1f33733f2d936f31c71d
size 100669984

3
onnx/vision_encoder.onnx Normal file

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:65bb9b57b64763897cc6dc397450449fce5607138843566a885e2f0a250343c8
size 1737427560

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:6a8256b74fd9465f859fab31c1840ed073aa0edd7b75d61127eefe1ce1fcf560
size 251407732

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:ab171611906fa056c91a28fde0ef1fda897b44bbf1ca0d9ae692cfaff90947b1
size 868985807

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:352fe86ad7d8358f39fb896de9b2efd0d8a6cf2b6239565841bab5146a735d2f
size 436180765

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:e7d20e0f8a6201e4944759f2fbcab4fa035bbb1fb34e14700f25f1f00e678992
size 278736452

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:0b47406ea04d0754ccdd5bd0d68e827a72979f962886cf9bdeae926342234298
size 247852840

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:38e7292275057cec773aad0218310041e325289d18b708f89deae541925f4274
size 436180848

View File

@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:38e7292275057cec773aad0218310041e325289d18b708f89deae541925f4274
size 436180848

28
preprocessor_config.json Normal file

@@ -0,0 +1,28 @@
{
"do_convert_rgb": true,
"do_image_splitting": true,
"do_normalize": true,
"do_pad": true,
"do_rescale": true,
"do_resize": true,
"image_mean": [
0.5,
0.5,
0.5
],
"image_processor_type": "Idefics3ImageProcessor",
"image_std": [
0.5,
0.5,
0.5
],
"max_image_size": {
"longest_edge": 384
},
"processor_class": "Idefics3Processor",
"resample": 1,
"rescale_factor": 0.00392156862745098,
"size": {
"longest_edge": 1536
}
}
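The rescale and normalize settings above combine into a simple per-channel mapping. As a sketch, assuming rescaling is applied before normalization and using the values from this file (`rescale_factor` ≈ 1/255, mean 0.5, std 0.5 on every channel), 8-bit pixel values land in the range [-1, 1]. The function name is illustrative.

```python
RESCALE_FACTOR = 0.00392156862745098  # approximately 1 / 255
IMAGE_MEAN = 0.5
IMAGE_STD = 0.5


def normalize_pixel(value: int) -> float:
    """Map one 8-bit channel value to the normalized model input range."""
    rescaled = value * RESCALE_FACTOR            # [0, 255] -> [0, 1]
    return (rescaled - IMAGE_MEAN) / IMAGE_STD   # [0, 1]   -> [-1, 1]


for value in (0, 128, 255):
    print(value, normalize_pixel(value))
```

Black maps to -1, white to approximately 1, and mid-gray to a value close to 0.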

4
processor_config.json Normal file

@@ -0,0 +1,4 @@
{
"processor_class": "Idefics3Processor",
"image_seq_len": 81
}

BIN
smolvlm-data.pdf Normal file

Binary file not shown.

53
special_tokens_map.json Normal file

@@ -0,0 +1,53 @@
{
"additional_special_tokens": [
{
"content": "<fake_token_around_image>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<image>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
{
"content": "<end_of_utterance>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
],
"bos_token": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"eos_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"pad_token": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
},
"unk_token": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false
}
}

244976
tokenizer.json Normal file

File diff suppressed because it is too large

182
tokenizer_config.json Normal file

@@ -0,0 +1,182 @@
{
"add_prefix_space": false,
"added_tokens_decoder": {
"0": {
"content": "<|endoftext|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"1": {
"content": "<|im_start|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"2": {
"content": "<|im_end|>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"3": {
"content": "<repo_name>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"4": {
"content": "<reponame>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"5": {
"content": "<file_sep>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"6": {
"content": "<filename>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"7": {
"content": "<gh_stars>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"8": {
"content": "<issue_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"9": {
"content": "<issue_comment>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"10": {
"content": "<issue_closed>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"11": {
"content": "<jupyter_start>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"12": {
"content": "<jupyter_text>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"13": {
"content": "<jupyter_code>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"14": {
"content": "<jupyter_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"15": {
"content": "<jupyter_script>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"16": {
"content": "<empty_output>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"49152": {
"content": "<fake_token_around_image>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"49153": {
"content": "<image>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
},
"49154": {
"content": "<end_of_utterance>",
"lstrip": false,
"normalized": false,
"rstrip": false,
"single_word": false,
"special": true
}
},
"additional_special_tokens": [
"<fake_token_around_image>",
"<image>",
"<end_of_utterance>"
],
"bos_token": "<|im_start|>",
"clean_up_tokenization_spaces": false,
"eos_token": "<end_of_utterance>",
"legacy": false,
"model_max_length": 16384,
"pad_token": "<|im_end|>",
"processor_class": "Idefics3Processor",
"tokenizer_class": "GPT2Tokenizer",
"truncation_side": "left",
"chat_template": "<|im_start|>{% for message in messages %}{{message['role'] | capitalize}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}",
"unk_token": "<|endoftext|>",
"vocab_size": 49152
}

1
vocab.json Normal file

File diff suppressed because one or more lines are too long