---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-2B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- latex
- vLM
- Vision
- Codec
---

![qwenVL.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/g8zYbOSBt4NSqhSIypaX3.png)

--------------

# **LatexMind-2B-Codec**

The **LatexMind-2B-Codec** model is a fine-tuned version of Qwen2-VL-2B-Instruct, optimized for Optical Character Recognition (OCR), **image-to-text conversion**, and **mathematical expression extraction with LaTeX formatting**. This model integrates a conversational approach with visual and textual understanding to handle multi-modal tasks effectively.  

# Key Enhancements:  

* **SoTA understanding of images with various resolutions & aspect ratios**: LatexMind-2B-Codec achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.  

* **Advanced LaTeX extraction**: The model specializes in extracting structured mathematical expressions from images and documents, converting them into LaTeX format for precise rendering and further computation.  

* **Understanding long-duration videos (20min+)**: LatexMind-2B-Codec can process videos over 20 minutes long, enabling high-quality video-based question answering, mathematical solution explanation, and educational content creation.  

* **Agent capabilities for automated operations**: With complex reasoning and decision-making abilities, the model can be integrated with mobile devices, robots, and assistive technologies to automate tasks based on visual and textual inputs.  

* **Multilingual Support**: To serve global users, in addition to English and Chinese, the model supports text recognition inside images across multiple languages, including European languages, Japanese, Korean, Arabic, Vietnamese, etc.  

This model is particularly effective in **retrieving mathematical notations and equations** from scanned documents, whiteboard images, and handwritten notes, ensuring accurate conversion to LaTeX code for further academic and computational applications.  
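
Once the model has returned text, the LaTeX usually still has to be separated from the surrounding prose before it can be rendered or passed to other tools. The following is a minimal sketch of that post-processing step. The function name `extract_latex` and the set of delimiters it looks for (`$$...$$`, `\[...\]`, `\(...\)`, `$...$`) are assumptions; the model's actual output format may differ, so adjust the pattern to what you observe.

```python
import re

# Assumed math delimiters; adjust to the output format you actually see.
_MATH_PATTERN = re.compile(
    r"\$\$(.+?)\$\$"                          # display math: $$ ... $$
    r"|\\\[(.+?)\\\]"                         # display math: \[ ... \]
    r"|\\\((.+?)\\\)"                         # inline math:  \( ... \)
    r"|(?<!\$)\$(?!\$)(.+?)(?<!\$)\$(?!\$)",  # inline math:  $ ... $
    re.DOTALL,
)


def extract_latex(text: str) -> list[str]:
    """Return the math expressions found in `text`, without their delimiters."""
    expressions = []
    for match in _MATH_PATTERN.finditer(text):
        # Exactly one capture group is set per match, depending on the delimiter.
        body = next(g for g in match.groups() if g is not None)
        expressions.append(body.strip())
    return expressions


sample = r"Area: $$A = \pi r^2$$ with \(r > 0\) and $x+1$."
print(extract_latex(sample))
```

A regular expression like this does not handle escaped dollar signs or nested delimiters, so treat it as a starting point rather than a full LaTeX parser.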

# Sample Inference with Doc 

![latexqwen.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/-h5z3giEudPrdM9qRMMTe.png)

Demo: https://huggingface.co/prithivMLmods/LatexMind-2B-Codec/blob/main/latexmind/latexmind-codec.ipynb

# Use it with Transformers


```python
import torch  # needed only if you enable the bfloat16 / flash_attention_2 variant below
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/LatexMind-2B-Codec", torch_dtype="auto", device_map="auto"
)

# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "prithivMLmods/LatexMind-2B-Codec",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# default processor
processor = AutoProcessor.from_pretrained("prithivMLmods/LatexMind-2B-Codec")

# The default range for the number of visual tokens per image in the model is 4-16384. You can set min_pixels and max_pixels according to your needs, such as a token count range of 256-1280, to balance speed and memory usage.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move the inputs to wherever device_map="auto" placed the model
inputs = inputs.to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
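
The `min_pixels` and `max_pixels` comment in the snippet above relies on one visual token covering roughly a 28x28 pixel area. The sketch below makes that arithmetic explicit. Both helper names are hypothetical, and `approx_visual_tokens` is a simplification: it only clamps the pixel count to the budget and ignores the rounding of image sides to multiples of 28 that the real preprocessor performs, so its result is an estimate.

```python
PIXELS_PER_TOKEN = 28 * 28  # one visual token covers roughly a 28x28 pixel area


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple[int, int]:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def approx_visual_tokens(width: int, height: int, min_pixels: int, max_pixels: int) -> int:
    """Rough token estimate: clamp the pixel count to the budget, then divide."""
    pixels = min(max(width * height, min_pixels), max_pixels)
    return pixels // PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
print(approx_visual_tokens(1920, 1080, min_pixels, max_pixels))  # 1280 (capped by max_pixels)
```

A narrower token range lowers memory use and latency, at the cost of fine detail in dense documents or small handwritten symbols.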

# Buffer Handling for Streamed Output
```python
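from threading import Thread
from transformers import TextIteratorStreamer
def stream_output(model, processor, inputs, max_new_tokens=128):
    # Assumed setup: generate() runs in a background thread and feeds the streamer.
    streamer = TextIteratorStreamer(processor.tokenizer, skip_prompt=True, skip_special_tokens=True)
    Thread(target=model.generate, kwargs=dict(inputs, streamer=streamer, max_new_tokens=max_new_tokens)).start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        # Remove <|im_end|> or similar tokens from the output
        buffer = buffer.replace("<|im_end|>", "")
        yield buffer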
```
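
The loop above can be tried without loading the model by feeding it any iterable of strings. The sketch below does that with a plain list standing in for the streamer. The name `accumulate_stream` and the `stop_tokens` parameter are hypothetical and not part of any library.

```python
from typing import Iterable, Iterator


def accumulate_stream(chunks: Iterable[str], stop_tokens=("<|im_end|>",)) -> Iterator[str]:
    """Yield the growing output after each chunk, with stop tokens stripped."""
    buffer = ""
    for new_text in chunks:
        buffer += new_text
        for token in stop_tokens:
            buffer = buffer.replace(token, "")
        yield buffer


# A plain list stands in for a real streamer here.
fake_streamer = ["E = ", "mc^2", "<|im_end|>"]
for snapshot in accumulate_stream(fake_streamer):
    print(snapshot)
```

Yielding the whole buffer rather than only the new chunk suits UI frameworks that redraw the full message on every update.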

# Intended Use 

**LatexMind-2B-Codec** is designed for tasks that require **image-based text recognition**, **math equation extraction**, and **multi-modal understanding**. It is particularly useful in the following scenarios:  

* **Optical Character Recognition (OCR)**: Extracting printed and handwritten text from images, documents, and scanned pages.
* **Math Expression Recognition**: Converting mathematical notations into structured **LaTeX format** for further computation and documentation.
* **Image-to-Text Conversion**: Generating accurate descriptions for text-rich and math-heavy images.
* **Document and Academic Processing**: Assisting researchers, students, and professionals in digitizing handwritten notes and extracting structured content from books, PDFs, and whiteboards.
* **Automated Educational Support**: Enabling AI-powered tutors, content summarization, and interactive learning for subjects involving complex equations.
* **Multi-Language OCR**: Recognizing text inside images across multiple languages, including English, Chinese, Japanese, Korean, Arabic, and various European languages.
* **Video-Based Question Answering**: Understanding long-duration videos for content summarization, question answering, and structured data extraction.
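
Most of the scenarios above differ only in the `messages` payload passed to `processor.apply_chat_template`. The helpers below are a hypothetical sketch of how such payloads could be built. The default instruction is illustrative and not a prompt the model is known to have been trained on, and the `video` and `fps` keys are assumed to follow the same convention as the upstream Qwen2-VL examples.

```python
def build_image_messages(
    image: str,
    instruction: str = "Extract the mathematical expressions in this image as LaTeX.",
) -> list[dict]:
    """Single-turn request pairing one image (URL or local path) with an instruction."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": instruction},
            ],
        }
    ]


def build_video_messages(video: str, question: str, fps: float = 1.0) -> list[dict]:
    """Single-turn request pairing one video with a question (assumed key names)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video, "fps": fps},
                {"type": "text", "text": question},
            ],
        }
    ]


messages = build_image_messages("file:///path/to/whiteboard.png")
print(messages[0]["content"][1]["text"])
```

The returned list can be used in place of the hard-coded `messages` in the inference example.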

# Limitations 

Despite its capabilities, **LatexMind-2B-Codec** has some inherent limitations:  

* **Handwritten Text Accuracy**: While it can recognize handwritten equations, performance may degrade with highly unstructured or messy handwriting.
* **Complex LaTeX Formatting**: The model may struggle with deeply nested or ambiguous LaTeX expressions, requiring manual corrections for precise formatting.
* **Low-Resolution Images**: Extracting accurate text from blurry or low-resolution images can lead to misinterpretations or OCR errors.
* **Contextual Understanding in Multi-Step Equations**: While it recognizes math expressions, solving multi-step problems autonomously may be limited.
* **Limited Support for Rare Mathematical Notations**: Some specialized or domain-specific symbols may not be recognized with high accuracy.
* **Processing Speed for Large Documents**: Performance may slow down when handling extremely large documents or dense mathematical content in real-time applications.
* **Language-Specific OCR Variability**: While it supports multiple languages, OCR accuracy may vary depending on the script complexity and font style.
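
Because deeply nested expressions may need manual correction, a cheap automated check can help decide which outputs to review first. The sketch below flags unmatched curly braces, one common symptom of a truncated or malformed expression. The function name `unbalanced_braces` is hypothetical, and the check is deliberately shallow: it does not validate commands, environments, or `\left`/`\right` pairs.

```python
def unbalanced_braces(latex: str) -> list[int]:
    """Return the positions of `{` or `}` that have no partner (empty if balanced)."""
    open_positions = []  # indices of `{` still waiting for a closing brace
    stray_closers = []   # indices of `}` that closed nothing
    i = 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            i += 2  # skip the escaped character, e.g. \{ or \}
            continue
        if ch == "{":
            open_positions.append(i)
        elif ch == "}":
            if open_positions:
                open_positions.pop()
            else:
                stray_closers.append(i)
        i += 1
    return sorted(stray_closers + open_positions)


print(unbalanced_braces(r"\frac{a}{b}"))  # []
print(unbalanced_braces(r"\frac{a}{b"))   # [8]
```

An empty result does not guarantee that the LaTeX compiles; it only rules out one class of error.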