---
license: apache-2.0
datasets:
- allenai/olmo-mix-1124
- allenai/olmOCR-mix-0225
language:
- yo
- sw
tags:
- Optical-Character-Recognition
- Low-Resource-Languages
---
# KarantaOCR: Efficient Document Processing for African Languages
## Model Description
**KarantaOCR** is an open-source document OCR and processing model designed for **high-accuracy text extraction in African languages**.
The model focuses on preserving language-specific characters and diacritics that are often lost, normalized, or mis-transcribed by existing OCR systems.
KarantaOCR is fine-tuned from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), a vision-language model that combines a strong vision encoder with a large language model.
Through targeted curriculum fine-tuning, KarantaOCR extends these capabilities to robust document understanding across diverse PDF formats and multilingual settings.
## Training Data
KarantaOCR was trained using a **two-stage curriculum fine-tuning strategy**.
### Stage 1: General OCR Training
* **100,000 documents** sampled from [allenai/olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
* Purpose: learn general OCR skills across layouts, fonts, tables, and document structures
### Stage 2: African Language Fine-Tuning
* **50,000 PDFs** containing text in **10 African languages**, crawled from the web
* Domains include:
* Religious texts
* Legal documents
* Dictionaries
* Novels
* Other long-form and structured documents
This stage emphasizes accurate transcription of **diacritics, special characters, and region-specific typography**.
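The two-stage curriculum can be summarized as a config sketch. This is purely illustrative; the stage names and fields below are hypothetical, not the actual training configuration:

```yaml
# Hypothetical sketch of the two-stage curriculum (illustrative only).
stages:
  - name: general_ocr
    dataset: allenai/olmOCR-mix-0225
    num_documents: 100000
    objective: general OCR across layouts, fonts, tables, and document structures
  - name: african_language_finetune
    dataset: web-crawled PDFs in 10 African languages
    num_documents: 50000
    objective: diacritics, special characters, and region-specific typography
```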
---
## Capabilities
KarantaOCR supports:
* High-accuracy **text extraction** from PDFs
* **Table extraction** and structured document understanding
* Robust handling of:
* Multi-column layouts
* Headers and footers
* Mixed scanned and digital PDFs
While improved performance on African languages was our priority, KarantaOCR **maintains strong performance on English and other high-resource languages**, making it suitable for mixed-language document collections.
## Evaluation
KarantaOCR is evaluated on the olmOCR benchmark using pass-rate accuracy. Scores are reported as averages across the benchmark's JSONL test files, with 95% confidence intervals.
| Model | Avg Score ↑ | 95% CI |
| --------------- | ----------- | ------ |
| **KarantaOCR** | **74.1%** | ± 1.1 |
| RoLMOCR | 74.4% | ± 1.0 |
| NanoNetsOCR-2 | 68.8% | ± 1.1 |
| OLMOCR | 65.8% | ± 0.9 |
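The aggregation above can be sketched as follows. This is a hypothetical reconstruction (the benchmark's actual harness may differ), assuming each JSONL line carries a boolean `pass` field; the 95% confidence interval uses the normal approximation for a proportion:

```python
import json
import math


def pass_rate_with_ci(jsonl_lines):
    """Mean pass rate and 95% CI half-width (normal approximation)."""
    results = [json.loads(line)["pass"] for line in jsonl_lines]
    n = len(results)
    p = sum(results) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, half_width


# Toy example: 3 passes out of 4 test cases
lines = ['{"pass": true}', '{"pass": true}', '{"pass": false}', '{"pass": true}']
rate, ci = pass_rate_with_ci(lines)
print(f"{rate:.1%} ± {ci:.1%}")  # → 75.0% ± 42.4%
```

With thousands of cases per file, the interval narrows to roughly the ±1-point range reported in the table.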
### Results by Document Type (%)
| JSONL File | KarantaOCR | RoLMOCR | NanoNetsOCR-2 | OLMOCR |
| --------------- | ---------- | -------- | ------------- | -------- |
| arxiv_math | 74.2 | **76.8** | 73.7 | 68.9 |
| baseline | 99.4 | 97.9 | **99.5** | 85.0 |
| headers_footers | 95.3 | 94.1 | 32.8 | **96.4** |
| long_tiny_text | 72.2 | 61.3 | **92.1** | 81.9 |
| multi_column | 75.6 | 70.0 | 82.5 | **84.0** |
| old_scans | 41.3 | **42.4** | 41.4 | 42.0 |
| old_scans_math | 70.3 | **80.1** | 44.1 | 0.0 |
| table_tests | 64.3 | 72.2 | **84.2** | 68.3 |
## How to Use
KarantaOCR processes PDF documents by rendering pages into images and combining them with structured prompts for inference.
### Load the Model and Processor
```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor


def load_model(model_path: str, device_map: str = "auto", dtype: str = "auto"):
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=getattr(torch, dtype) if dtype != "auto" else "auto",
        device_map=device_map,
    )
    return model


def load_processor(processor_name: str, min_pixels=None, max_pixels=None):
    # Optionally constrain the resolution range the vision encoder sees
    if min_pixels and max_pixels:
        return AutoProcessor.from_pretrained(
            processor_name, min_pixels=min_pixels, max_pixels=max_pixels
        )
    return AutoProcessor.from_pretrained(processor_name)
```
### Prepare a PDF Page for Inference
```python
import base64
import subprocess

# `get_pdf_media_box_width_height` must return the page's media-box
# (width, height) in PDF points; olmOCR's renderpdf utilities provide
# an equivalent helper.
TARGET_IMAGE_DIM = 2048


def render_pdf_to_base64png(
    local_pdf_path: str, page_num: int, target_longest_image_dim: int = 2048
) -> str:
    longest_dim = max(get_pdf_media_box_width_height(local_pdf_path, page_num))

    # Convert the PDF page to PNG using pdftoppm; -r sets the rendering DPI.
    # Media-box sizes are in points (72 points per inch), so
    # dpi = target_pixels * 72 / points renders the longest side to
    # approximately target_pixels.
    pdftoppm_result = subprocess.run(
        [
            "pdftoppm",
            "-png",
            "-f", str(page_num),
            "-l", str(page_num),
            "-r", str(target_longest_image_dim * 72 / longest_dim),
            local_pdf_path,
        ],
        timeout=120,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    assert pdftoppm_result.returncode == 0, pdftoppm_result.stderr
    return base64.b64encode(pdftoppm_result.stdout).decode("utf-8")


def build_message(image_url: str, system_prompt: str, page: int = 0):
    # Embed the rendered page as a base64 data URL in a chat-style prompt
    image_base64 = render_pdf_to_base64png(image_url, page, TARGET_IMAGE_DIM)
    prompt = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": system_prompt},
                {
                    "type": "image",
                    "image": f"data:image/png;base64,{image_base64}",
                },
            ],
        }
    ]
    return prompt
```
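As a sanity check on the resolution math: media-box dimensions are in points and 72 points make an inch, so the `-r` (DPI) value that renders a page's longest side to roughly `target` pixels is `target * 72 / longest_dim_points`. A minimal sketch (the function name is ours, for illustration):

```python
def target_dpi(longest_dim_points: float, target_longest_image_dim: int = 2048) -> float:
    # 72 points = 1 inch, so this DPI renders the longest side
    # of the page to approximately target_longest_image_dim pixels.
    return target_longest_image_dim * 72 / longest_dim_points


# US-letter page: 612 x 792 points, longest side 792
dpi = target_dpi(max(612, 792))
print(round(dpi, 1))  # → 186.2
```

At that DPI the 792-point side comes out to 186.2 × 792 / 72 ≈ 2048 pixels, as intended.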
### Run OCR Inference
```python
from qwen_vl_utils import process_vision_info


def run_inference(model, processor, messages, max_new_tokens=128, device="cuda"):
    # Render the chat template into the model's prompt format
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    # Extract the page image(s) embedded in the message list
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=False,
        return_tensors="pt",
    ).to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Strip the prompt tokens so only newly generated text is decoded
    trimmed_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    outputs = processor.batch_decode(
        trimmed_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    return outputs[0]
```