---
license: apache-2.0
datasets:
- allenai/olmOCR-mix-0225
language:
- yo
- sw
tags:
- Text-Generation
- Optical-Character-Recognition
- Low-Resource-Languages
---

# KarantaOCR: Efficient Document Processing for African Languages

## Model Description

**KarantaOCR** is an open-source document OCR and processing model designed for **high-accuracy text extraction in African languages**.
The model focuses on preserving language-specific characters and diacritics that are often lost, normalized, or mis-transcribed by existing OCR systems.

KarantaOCR is fine-tuned from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), a vision-language model that pairs a strong vision encoder with a large language model.
Through targeted curriculum fine-tuning, KarantaOCR extends these capabilities to robust document understanding across diverse PDF formats and multilingual settings.

## Training Data

KarantaOCR was trained using a **two-stage curriculum fine-tuning strategy**.

### Stage 1: General OCR Training

* **100,000 documents** sampled from [allenai/olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
* Purpose: learn general OCR skills across layouts, fonts, tables, and document structures

### Stage 2: African Language Fine-Tuning

* **50,000 PDFs** containing text in **10 African languages**, crawled from the web
* Domains include:

  * Religious texts
  * Legal documents
  * Dictionaries
  * Novels
  * Other long-form and structured documents

This stage emphasizes accurate transcription of **diacritics, special characters, and region-specific typography**.
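Diacritics need this special care partly because the same grapheme can be serialized as different Unicode codepoint sequences, so OCR output that looks correct can still mismatch a reference transcript character-for-character. A minimal illustration with a Yoruba character, using only the standard library (this is standard Unicode handling, not part of KarantaOCR's training code):

```python
import unicodedata

# "ọ̀" (o with dot below plus a grave tone mark) has two common encodings:
decomposed = "o\u0323\u0300"  # o + combining dot below + combining grave accent
composed = unicodedata.normalize("NFC", decomposed)  # U+1ECD (ọ) + combining grave

print(len(decomposed), len(composed))  # 3 2
print(decomposed == composed)          # False, despite rendering identically

# Normalizing both sides before comparison avoids spurious character errors:
print(unicodedata.normalize("NFC", decomposed) == unicodedata.normalize("NFC", composed))  # True
```

Normalizing predictions and references to the same form (e.g. NFC) before scoring is the usual safeguard against this class of false mismatch.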
---

## Capabilities

KarantaOCR supports:

* High-accuracy **text extraction** from PDFs
* **Table extraction** and structured document understanding
* Robust handling of:

  * Multi-column layouts
  * Headers and footers
  * Mixed scanned and digital PDFs

While improved performance on African languages was our priority, KarantaOCR **maintains strong performance on English and other high-resource languages**, making it suitable for mixed-language document collections.

## Evaluation

KarantaOCR is evaluated on the olmOCR benchmark using pass-rate accuracy. Scores are reported as averages across the benchmark's JSONL files, with 95% confidence intervals; the best score in each row is shown in **bold**.

| Model          | Avg Score ↑ | 95% CI |
| -------------- | ----------- | ------ |
| **KarantaOCR** | 74.1%       | ± 1.1  |
| RoLMOCR        | **74.4%**   | ± 1.0  |
| NanoNetsOCR-2  | 68.8%       | ± 1.1  |
| OLMOCR         | 65.8%       | ± 0.9  |

### Results by Document Type (%)

| JSONL File      | KarantaOCR | RoLMOCR  | NanoNetsOCR-2 | OLMOCR   |
| --------------- | ---------- | -------- | ------------- | -------- |
| arxiv_math      | 74.2       | **76.8** | 73.7          | 68.9     |
| baseline        | 99.4       | 97.9     | **99.5**      | 85.0     |
| headers_footers | 95.3       | 94.1     | 32.8          | **96.4** |
| long_tiny_text  | 72.2       | 61.3     | **92.1**      | 81.9     |
| multi_column    | 75.6       | 70.0     | 82.5          | **84.0** |
| old_scans       | 41.3       | **42.4** | 41.4          | 42.0     |
| old_scans_math  | 70.3       | **80.1** | 44.1          | 0.0      |
| table_tests     | 64.3       | 72.2     | **84.2**      | 68.3     |

## How to Use

KarantaOCR processes PDF documents by rendering each page to an image and combining it with a structured prompt for inference.
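Before the code below, a brief aside on how the benchmark averages above are aggregated. Assuming the reported score is the unweighted mean of the eight per-file pass rates (which reproduces KarantaOCR's 74.1; the published ± intervals are presumably computed over individual documents, which per-file scores alone cannot reproduce), the aggregation can be sketched as:

```python
import statistics

def mean_with_ci95(pass_rates):
    """Unweighted mean plus a normal-approximation 95% confidence interval."""
    mean = statistics.mean(pass_rates)
    sem = statistics.stdev(pass_rates) / len(pass_rates) ** 0.5
    return mean, 1.96 * sem

# KarantaOCR's per-file scores from the table above
karanta = [74.2, 99.4, 95.3, 72.2, 75.6, 41.3, 70.3, 64.3]
mean, ci = mean_with_ci95(karanta)
print(round(mean, 1))  # 74.1
```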
### Load the Model and Processor

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

def load_model(model_path: str, device_map: str = "auto", dtype: str = "auto"):
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=getattr(torch, dtype) if dtype != "auto" else "auto",
        device_map=device_map,
    )
    return model

def load_processor(processor_name: str, min_pixels=None, max_pixels=None):
    # min_pixels / max_pixels bound how many image tokens the processor produces
    if min_pixels and max_pixels:
        return AutoProcessor.from_pretrained(
            processor_name, min_pixels=min_pixels, max_pixels=max_pixels
        )
    return AutoProcessor.from_pretrained(processor_name)
```

### Prepare a PDF Page for Inference

```python
import base64
import subprocess

# Helper from the olmOCR toolkit (pip install olmocr); returns the page's
# media box dimensions in PDF points.
from olmocr.data.renderpdf import get_pdf_media_box_width_height

TARGET_IMAGE_DIM = 2048  # target size of the rendered image's longest side, in pixels

def render_pdf_to_base64png(
    local_pdf_path: str, page_num: int, target_longest_image_dim: int = 2048
) -> str:
    longest_dim = max(get_pdf_media_box_width_height(local_pdf_path, page_num))

    # Convert the PDF page to PNG using pdftoppm. PDF dimensions are in points
    # (72 points per inch), so the -r (DPI) value below renders the longest
    # side of the page at roughly target_longest_image_dim pixels.
    pdftoppm_result = subprocess.run(
        [
            "pdftoppm",
            "-png",
            "-f", str(page_num),
            "-l", str(page_num),
            "-r", str(target_longest_image_dim * 72 / longest_dim),
            local_pdf_path,
        ],
        timeout=120,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    assert pdftoppm_result.returncode == 0, pdftoppm_result.stderr
    return base64.b64encode(pdftoppm_result.stdout).decode("utf-8")

def build_message(pdf_path: str, system_prompt: str, page: int = 0):
    image_base64 = render_pdf_to_base64png(pdf_path, page, TARGET_IMAGE_DIM)

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt,
                },
                {
                    "type": "image",
                    "image": f"data:image/png;base64,{image_base64}",
                },
            ],
        }
    ]
    return messages
```

### Run OCR Inference

```python
from qwen_vl_utils import process_vision_info

def run_inference(model, processor, messages, max_new_tokens=128, device="cuda"):
    # Note: 128 tokens only covers short snippets; raise max_new_tokens
    # substantially (e.g. a few thousand) to transcribe a full page.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=False,
        return_tensors="pt",
    ).to(device)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Strip the prompt tokens so only newly generated text is decoded.
    trimmed_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    outputs = processor.batch_decode(
        trimmed_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return outputs[0]
```
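As a sanity check on the rendering step above: pdftoppm's `-r` flag takes a resolution in DPI, and PDF media boxes are measured in points (72 points per inch), so `target_longest_image_dim * 72 / longest_dim` is exactly the DPI that maps the page's longest side to the target pixel size. A quick illustration, assuming a US Letter page (612 × 792 pt):

```python
def rendering_dpi(target_pixels: int, longest_side_points: float) -> float:
    """DPI at which pdftoppm renders the page's longest side at ~target_pixels."""
    return target_pixels * 72 / longest_side_points

# US Letter portrait: 612 x 792 points, longest side 792 pt (11 inches)
dpi = rendering_dpi(2048, 792)
print(round(dpi, 1))  # 186.2

# Rendering 11 inches at that DPI recovers the 2048-pixel target
print(round(792 / 72 * dpi))  # 2048
```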