---
license: apache-2.0
language:
- en
- zh
base_model:
- Qwen/Qwen2-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- caption
- text-generation-inference
- flux
---
![9.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/esgwb8sdL5LbyDuQFLWnT.png)
# **JSONify-Flux-Large**

**JSONify-Flux-Large** is a fine-tuned version of **Qwen2-VL-7B-Instruct**, trained on **Flux-generated images** and their **corresponding captions**. Fine-tuning used **30M trainable parameters**. The model is designed to respond in structured **JSON format** while retaining the base model's strengths in **Optical Character Recognition (OCR)**, **image-to-text conversion**, and **math problem-solving with LaTeX formatting**.

### Key Enhancements:

* **Optimized for Flux-Generated Image Captioning**: JSONify-Flux-Large has been trained to understand and describe images created using Flux-based generation techniques.

* **State-of-the-Art Image Understanding**: JSONify-Flux-Large is built on the Qwen2-VL architecture, which performs strongly on visual understanding benchmarks such as DocVQA, RealWorldQA, and MTVQA.

* **Formatted JSON Output**: Responses are structured as JSON, making them well suited to automation, database storage, and further processing.

* **Multilingual Support**: Recognizes and extracts text from images in multiple languages, including English, Chinese, Japanese, Arabic, and various European languages.

* **Supports Multi-Turn Interactions**: Maintains context in conversations and can provide extended reasoning over multiple inputs.

### How to Use

```python
import torch  # only needed for the optional flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/JSONify-Flux-Large", torch_dtype="auto", device_map="auto"
)

# Enable flash_attention_2 for better acceleration and memory efficiency
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/JSONify-Flux-Large",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/JSONify-Flux-Large")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in JSON format."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the same device as the model instead of hard-coding "cuda"
inputs = inputs.to(model.device)

# Inference: Generate JSON-formatted output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text)  # JSON-formatted response
```
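
`batch_decode` returns a list of plain strings, so the JSON still has to be parsed before it can be used downstream. The following is a minimal sketch and not part of the original model card. `extract_json` is a hypothetical helper, and it assumes the model may wrap its answer in a Markdown code fence or add surrounding text, which should be checked against real outputs. A response cut off by `max_new_tokens` will not parse and yields `None`.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence delimiter


def extract_json(raw: str):
    """Best-effort parse of a model response into a Python object.

    Hypothetical helper: returns None when no valid JSON is found.
    """
    text = raw.strip()
    # Assumption: the response may be wrapped in a Markdown code fence.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, text, flags=re.DOTALL)
    if fenced:
        text = fenced.group(1).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, if any.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            return json.loads(text[start : end + 1])
        except json.JSONDecodeError:
            return None
    return None


# Illustrative input only; not an actual model output.
sample = FENCE + 'json\n{"subject": "dog", "setting": "beach"}\n' + FENCE
parsed = extract_json(sample)
```

With the inference example above, this would be called as `extract_json(output_text[0])`.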

### JSON Buffer Handling
```python
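def stream_response(streamer):
    """Yield the accumulated text so far, with the end-of-turn token removed."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Strip the chat end-of-turn marker from the accumulated text
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer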
```
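
The loop above assumes a `streamer` that yields decoded text chunks, for instance a `transformers.TextIteratorStreamer` fed by `model.generate` in a background thread. Intermediate buffers are usually incomplete JSON, so parsing only succeeds once the object is closed. The sketch below illustrates this with a plain list standing in for the streamer. The chunks are made up, and `accumulate` and `try_parse` are hypothetical names.

```python
import json


def accumulate(streamer):
    """Same accumulation loop as above, repeated so this sketch runs on its own."""
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer


def try_parse(buffer):
    """Return the parsed object if the buffer is complete JSON, else None."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


# Stand-in for a real streamer: made-up chunks, not actual model output.
fake_streamer = ['{"subject": ', '"dog", "sett', 'ing": "beach"}', "<|im_end|>"]

results = [try_parse(buf) for buf in accumulate(fake_streamer)]
final = results[-1]  # parsed object once the stream has finished
```

Because the replacement runs on the accumulated buffer rather than on each chunk, an end-of-turn marker that arrives split across two chunks is still removed.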

### **Key Features**

1. **Flux-Based Vision-Language Model**:
   - Specifically trained on **Flux-generated images and captions** for precise image-to-text conversion.

2. **Optical Character Recognition (OCR)**:
   - Extracts and processes text from images with high accuracy.

3. **Math and LaTeX Support**:
   - Solves math problems and outputs equations in **LaTeX format**.

4. **Structured JSON Output**:
   - Trained to format outputs as JSON, making them suitable for API responses and automation tasks.

5. **Multi-Image and Video Understanding**:
   - Supports analyzing multiple images and video content up to **20 minutes long**.

6. **Secure Weight Format**:
   - Uses **Safetensors** for enhanced security and faster model loading.
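
For the multi-image case in item 5, the base Qwen2-VL chat format accepts several `image` entries in a single user turn. The following is a minimal sketch, assuming this fine-tune keeps that format. `build_multi_image_message` is a hypothetical helper and the file paths are placeholders.

```python
def build_multi_image_message(image_refs, prompt):
    """Build a single-turn `messages` list with several images and one text prompt.

    Hypothetical helper; mirrors the message layout used in "How to Use".
    """
    content = [{"type": "image", "image": ref} for ref in image_refs]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


# Placeholder paths, not real files.
messages = build_multi_image_message(
    ["file:///path/to/image1.jpg", "file:///path/to/image2.jpg"],
    "Describe the differences between these images in JSON format.",
)
```

The resulting `messages` list can then go through `processor.apply_chat_template` and `process_vision_info` in the same way as the single-image example.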