Update README.md
README.md
@@ -1,7 +1,7 @@
---
license: apache-2.0
datasets:
-- allenai/olmo-mix-1124
+- allenai/olmOCR-mix-0225
language:
- yo
- sw
@@ -19,4 +19,186 @@ tags:
- Text-Generation
- Optical-Character-Recognition
- Low-Resource-Languages
---

# KarantaOCR: Efficient Document Processing for African Languages

## Model Description

**KarantaOCR** is an open-source document OCR and processing model designed for **high-accuracy text extraction in African languages**.
The model focuses on preserving language-specific characters and diacritics that are often lost, normalized, or mis-transcribed by existing OCR systems.

KarantaOCR is fine-tuned from [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), a vision-language model that combines a strong vision encoder with a large language model.
Through targeted curriculum fine-tuning, KarantaOCR extends these capabilities to robust document understanding across diverse PDF formats and multilingual settings.

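To make the diacritic problem concrete, here is a minimal sketch that is not part of KarantaOCR and uses only Python's standard `unicodedata` module. It shows how an accent-stripping step, of the kind a lossy OCR post-processor might apply, collapses distinct Yoruba words into one string. The helper names `strip_diacritics` and `same_text` and the sample words are illustrative assumptions.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (a hypothetical lossy normalization step)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def same_text(a: str, b: str) -> bool:
    """Compare after NFC normalization so only real differences count."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Three distinct Yoruba words, written with escapes so the code points are explicit.
words = {
    "husband": "\u1ecdk\u1ecd",        # o-dot-below, k, o-dot-below
    "vehicle": "\u1ecdk\u1ecd\u0300",  # same letters plus a combining grave accent
    "hoe": "\u1ecdk\u1ecd\u0301",      # same letters plus a combining acute accent
}

stripped = {gloss: strip_diacritics(word) for gloss, word in words.items()}
print(stripped)  # every value collapses to the same ASCII string
```

Comparing OCR output against ground truth with `same_text` rather than `strip_diacritics` keeps tone marks and under-dots significant while ignoring harmless differences between composed and decomposed Unicode forms.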
## Training Data

KarantaOCR was trained using a **two-stage curriculum fine-tuning strategy**.

### Stage 1: General OCR Training

* **100,000 documents** sampled from [AllenAI olmOCR-mix-0225](https://huggingface.co/datasets/allenai/olmOCR-mix-0225)
* Purpose: learn general OCR skills across layouts, fonts, tables, and document structures

### Stage 2: African Language Fine-Tuning

* **50,000 PDFs** containing text in **10 African languages**, crawled from the web
* Domains include:

  * Religious texts
  * Legal documents
  * Dictionaries
  * Novels
  * Other long-form and structured documents

This stage emphasizes accurate transcription of **diacritics, special characters, and region-specific typography**.

---

## Capabilities

KarantaOCR supports:

* High-accuracy **text extraction** from PDFs
* **Table extraction** and structured document understanding
* Robust handling of:

  * Multi-column layouts
  * Headers and footers
  * Mixed scanned and digital PDFs

While improved performance on African languages was our priority, KarantaOCR **maintains strong performance on English and other high-resource languages**, making it suitable for mixed-language document collections.

## Evaluation

KarantaOCR is evaluated on the olmOCR benchmark using pass-rate accuracy. Each score is the pass rate averaged across the benchmark's JSONL files, reported with a 95% confidence interval.

| Model           | Avg Score ↑ | 95% CI |
| --------------- | ----------- | ------ |
| **KarantaOCR**  | **74.1%**   | ± 1.1  |
| RoLMOCR         | 74.4%       | ± 1.0  |
| NanoNetsOCR-2   | 68.8%       | ± 1.1  |
| OLMOCR          | 65.8%       | ± 0.9  |

### Results by Document Type (%)

Bold marks the best score in each row.

| JSONL File      | KarantaOCR | RoLMOCR  | NanoNetsOCR-2 | OLMOCR   |
| --------------- | ---------- | -------- | ------------- | -------- |
| arxiv_math      | 74.2       | **76.8** | 73.7          | 68.9     |
| baseline        | 99.4       | 97.9     | **99.5**      | 85.0     |
| headers_footers | 95.3       | 94.1     | 32.8          | **96.4** |
| long_tiny_text  | 72.2       | 61.3     | **92.1**      | 81.9     |
| multi_column    | 75.6       | 70.0     | 82.5          | **84.0** |
| old_scans       | 41.3       | **42.4** | 41.4          | 42.0     |
| old_scans_math  | 70.3       | **80.1** | 44.1          | 0.0      |
| table_tests     | 64.3       | 72.2     | **84.2**      | 68.3     |

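As a quick consistency check, the average scores can be recomputed from the per-file results in the table above. The sketch below assumes the average is an unweighted mean over the eight JSONL files, rounded half-up to one decimal place; both are assumptions about how the reported numbers were produced. `Decimal` is used so that a value such as 74.35 rounds predictably.

```python
from decimal import ROUND_HALF_UP, Decimal

# Per-file pass rates (%) copied from the table above, one column per model.
MODELS = ["KarantaOCR", "RoLMOCR", "NanoNetsOCR-2", "OLMOCR"]
RESULTS = {
    "arxiv_math": ["74.2", "76.8", "73.7", "68.9"],
    "baseline": ["99.4", "97.9", "99.5", "85.0"],
    "headers_footers": ["95.3", "94.1", "32.8", "96.4"],
    "long_tiny_text": ["72.2", "61.3", "92.1", "81.9"],
    "multi_column": ["75.6", "70.0", "82.5", "84.0"],
    "old_scans": ["41.3", "42.4", "41.4", "42.0"],
    "old_scans_math": ["70.3", "80.1", "44.1", "0.0"],
    "table_tests": ["64.3", "72.2", "84.2", "68.3"],
}


def average_scores(results, models):
    """Return the unweighted mean per model, rounded half-up to 0.1."""
    averages = {}
    for col, model in enumerate(models):
        scores = [Decimal(row[col]) for row in results.values()]
        mean = sum(scores) / len(scores)
        averages[model] = mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)
    return averages


averages = average_scores(RESULTS, MODELS)
for model, score in averages.items():
    print(f"{model}: {score}%")
```

Under these assumptions the recomputed means match the reported averages of 74.1, 74.4, 68.8, and 65.8.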
## How to Use

KarantaOCR processes a PDF one page at a time: each page is rendered to a PNG image, and the image is passed to the model together with a text prompt.

### Load the Model and Processor

```python
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration


def load_model(model_path: str, device_map: str = "auto", dtype: str = "auto"):
    """Load the model weights from a local path or a Hub repository id."""
    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_path,
        # "auto" keeps the dtype stored with the checkpoint; otherwise pass
        # the name of a torch dtype such as "bfloat16" or "float16".
        torch_dtype=getattr(torch, dtype) if dtype != "auto" else "auto",
        device_map=device_map,
    )
    return model


def load_processor(processor_name: str, min_pixels=None, max_pixels=None):
    """Load the processor, optionally bounding the image resolution."""
    # The pixel limits are only forwarded when both are given.
    if min_pixels and max_pixels:
        return AutoProcessor.from_pretrained(
            processor_name, min_pixels=min_pixels, max_pixels=max_pixels
        )
    return AutoProcessor.from_pretrained(processor_name)
```

### Prepare a PDF Page for Inference

```python
import base64
import subprocess

# Longest side of the rendered page image, in pixels. build_message uses this
# constant but the original snippet does not define it; 2048 mirrors the
# default of render_pdf_to_base64png and is an assumption.
TARGET_IMAGE_DIM = 2048


def render_pdf_to_base64png(
    local_pdf_path: str, page_num: int, target_longest_image_dim: int = 2048
) -> str:
    # get_pdf_media_box_width_height is not defined in this README; it is
    # expected to return the page's (width, height) in PDF points.
    longest_dim = max(get_pdf_media_box_width_height(local_pdf_path, page_num))

    # Convert the PDF page to PNG using pdftoppm (page numbers are 1-indexed)
    pdftoppm_result = subprocess.run(
        [
            "pdftoppm",
            "-png",
            "-f",
            str(page_num),
            "-l",
            str(page_num),
            "-r",
            # PDF sizes are in points (72 per inch), so this DPI renders the
            # longest side at about target_longest_image_dim pixels
            str(target_longest_image_dim * 72 / longest_dim),
            local_pdf_path,
        ],
        timeout=120,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    assert pdftoppm_result.returncode == 0, pdftoppm_result.stderr
    return base64.b64encode(pdftoppm_result.stdout).decode("utf-8")


def build_message(image_url: str, system_prompt: str, page: int = 1):
    # Despite its name, image_url is the local path of the PDF to render.
    image_base64 = render_pdf_to_base64png(image_url, page, TARGET_IMAGE_DIM)

    # A single user turn carrying the instruction text and the page image
    prompt = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": system_prompt
                },
                {
                    "type": "image",
                    "image": f"data:image/png;base64,{image_base64}",
                },
            ],
        }
    ]
    return prompt
```

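The rendering code above calls `get_pdf_media_box_width_height`, which this README does not define. The following is one possible sketch, assuming Poppler's `pdfinfo` is installed alongside `pdftoppm` and that its `-box` output contains a `MediaBox` line with four numbers. The split into a separate `parse_media_box` helper and the sample output line are illustrative assumptions, not the authors' implementation.

```python
import subprocess


def parse_media_box(pdfinfo_output: str) -> tuple[float, float]:
    """Extract (width, height) in PDF points from `pdfinfo -box` output."""
    for line in pdfinfo_output.splitlines():
        if "MediaBox" in line:
            # Expected shape: "Page    1 MediaBox:  0.00  0.00  612.00  792.00"
            x0, y0, x1, y1 = (float(v) for v in line.split(":", 1)[1].split())
            return abs(x1 - x0), abs(y1 - y0)
    raise ValueError("MediaBox not found in pdfinfo output")


def get_pdf_media_box_width_height(local_pdf_path: str, page_num: int) -> tuple[float, float]:
    """Query pdfinfo for a single page and return its MediaBox size."""
    result = subprocess.run(
        ["pdfinfo", "-f", str(page_num), "-l", str(page_num), "-box", local_pdf_path],
        capture_output=True,
        text=True,
        timeout=60,
    )
    if result.returncode != 0:
        raise ValueError(f"pdfinfo failed: {result.stderr}")
    return parse_media_box(result.stdout)


# Worked example on a US Letter page (612 x 792 points), no PDF file needed.
sample = "Page    1 MediaBox:     0.00     0.00   612.00   792.00"
width, height = parse_media_box(sample)
dpi = 2048 * 72 / max(width, height)  # same formula as the -r argument above
print(width, height, round(dpi, 1))
```

For this page the longest side is 792 points, or 11 inches, so rendering at roughly 186 DPI yields an image whose longest side is about 2048 pixels.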
### Run OCR Inference

```python
from qwen_vl_utils import process_vision_info


def run_inference(model, processor, messages, max_new_tokens=128, device="cuda"):
    # Render the chat messages into the model's prompt format
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    # Collect the images referenced in the messages (video inputs are unused)
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=False,
        return_tensors="pt",
    ).to(device)

    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated tokens are decoded
    trimmed_ids = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    outputs = processor.batch_decode(
        trimmed_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False,
    )
    # A single page was submitted, so return its transcription
    return outputs[0]
```

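The trimming step in `run_inference` exists because `generate` returns the prompt tokens followed by the newly generated ones. The sketch below illustrates the same slicing on plain Python lists; the helper name `trim_prompt_tokens` and the token ids are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Keep only the tokens that follow each prompt in the generated output."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# One sequence in the batch: a 4-token prompt followed by 3 new tokens.
prompt_ids = [[101, 7, 8, 9]]
full_output = [[101, 7, 8, 9, 42, 43, 2]]

new_tokens = trim_prompt_tokens(prompt_ids, full_output)
print(new_tokens)  # [[42, 43, 2]]
```

Without this step the decoded text would begin with the prompt itself rather than with the page transcription.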