Initialize project; model provided by the ModelHub XC community

Model: numind/NuExtract-2.0-2B Source: Original Platform
2026-06-03 20:22:52 +08:00
commit 295d39b734
15 changed files with 152483 additions and 0 deletions
--- a/.gitattributes
+++ b/.gitattributes
@@ -0,0 +1,37 @@
+*.7z filter=lfs diff=lfs merge=lfs -text
+*.arrow filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
+*.bz2 filter=lfs diff=lfs merge=lfs -text
+*.ckpt filter=lfs diff=lfs merge=lfs -text
+*.ftz filter=lfs diff=lfs merge=lfs -text
+*.gz filter=lfs diff=lfs merge=lfs -text
+*.h5 filter=lfs diff=lfs merge=lfs -text
+*.joblib filter=lfs diff=lfs merge=lfs -text
+*.lfs.* filter=lfs diff=lfs merge=lfs -text
+*.mlmodel filter=lfs diff=lfs merge=lfs -text
+*.model filter=lfs diff=lfs merge=lfs -text
+*.msgpack filter=lfs diff=lfs merge=lfs -text
+*.npy filter=lfs diff=lfs merge=lfs -text
+*.npz filter=lfs diff=lfs merge=lfs -text
+*.onnx filter=lfs diff=lfs merge=lfs -text
+*.ot filter=lfs diff=lfs merge=lfs -text
+*.parquet filter=lfs diff=lfs merge=lfs -text
+*.pb filter=lfs diff=lfs merge=lfs -text
+*.pickle filter=lfs diff=lfs merge=lfs -text
+*.pkl filter=lfs diff=lfs merge=lfs -text
+*.pt filter=lfs diff=lfs merge=lfs -text
+*.pth filter=lfs diff=lfs merge=lfs -text
+*.rar filter=lfs diff=lfs merge=lfs -text
+*.safetensors filter=lfs diff=lfs merge=lfs -text
+saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+*.tar.* filter=lfs diff=lfs merge=lfs -text
+*.tar filter=lfs diff=lfs merge=lfs -text
+*.tflite filter=lfs diff=lfs merge=lfs -text
+*.tgz filter=lfs diff=lfs merge=lfs -text
+*.wasm filter=lfs diff=lfs merge=lfs -text
+*.xz filter=lfs diff=lfs merge=lfs -text
+*.zip filter=lfs diff=lfs merge=lfs -text
+*.zst filter=lfs diff=lfs merge=lfs -text
+*tfevents* filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
+nuextract2_bench.png filter=lfs diff=lfs merge=lfs -text
--- a/README.md
+++ b/README.md
@@ -0,0 +1,592 @@
+---
+library_name: transformers
+license: mit
+base_model:
+- Qwen/Qwen2-VL-2B-Instruct
+new_version: numind/NuExtract3
+pipeline_tag: image-text-to-text
+---
+
+<p align="center">
+    <a href="https://nuextract.ai/">
+        <img src="logo_nuextract.svg" width="200"/>
+    </a>
+</p>
+<p align="center">
+        🖥️ <a href="https://nuextract.ai/">API / Platform</a>&nbsp&nbsp | &nbsp&nbsp📑 <a href="https://numind.ai/blog">Blog</a>&nbsp&nbsp | &nbsp&nbsp🗣️ <a href="https://discord.gg/3tsEtJNCDe">Discord</a>&nbsp&nbsp | &nbsp&nbsp🔗 <a href="https://github.com/numindai/nuextract">GitHub</a>
+</p>
+
+# NuExtract 2.0 2B by NuMind 📈📈📈
+
+NuExtract 2.0 is a family of models trained specifically for structured information extraction tasks. It supports multimodal inputs (text and images) and is multilingual.
+
+We provide several versions of different sizes, all based on pre-trained models from the QwenVL family.
+| Model Size | Model Name | Base Model | License | Huggingface Link |
+|------------|------------|------------|---------|------------------|
+| 2B | NuExtract-2.0-2B | [Qwen2-VL-2B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct) | MIT | 🤗 [NuExtract-2.0-2B](https://huggingface.co/numind/NuExtract-2.0-2B) |
+| 4B | NuExtract-2.0-4B | [Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct) | Qwen Research License | 🤗 [NuExtract-2.0-4B](https://huggingface.co/numind/NuExtract-2.0-4B) |
+| 8B | NuExtract-2.0-8B | [Qwen2.5-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct) | MIT | 🤗 [NuExtract-2.0-8B](https://huggingface.co/numind/NuExtract-2.0-8B) |
+
+❗️Note: `NuExtract-2.0-2B` is based on Qwen2-VL rather than Qwen2.5-VL because the smallest Qwen2.5-VL model (3B) has a more restrictive, non-commercial license. We therefore include `NuExtract-2.0-2B` as a small model option that can be used commercially.
+
+## Benchmark
+Performance on a collection of ~1,000 diverse extraction examples containing both text and image inputs.
+<a href="https://nuextract.ai/">
+    <img src="nuextract2_bench.png" width="500"/>
+</a>
+
+## Overview
+
+To use the model, provide an input text/image and a JSON template describing the information you need to extract. The template should be a JSON object, specifying field names and their expected type.
+
+Supported types include:
+* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
+* `string` - a generic string field that can incorporate paraphrasing/abstraction.
+* `integer` - a whole number.
+* `number` - a whole or decimal number.
+* `date-time` - ISO formatted date.
+* Array of any of the above types (e.g. `["string"]`)
+* `enum` - a choice from a set of possible answers (represented in template as an array of options, e.g. `["yes", "no", "maybe"]`).
+* `multi-label` - an enum that can have multiple possible answers (represented in template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).
+
+If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).
+
+The following is an example template:
+```json
+{
+  "first_name": "verbatim-string",
+  "last_name": "verbatim-string",
+  "description": "string",
+  "age": "integer",
+  "gpa": "number",
+  "birth_date": "date-time",
+  "nationality": ["France", "England", "Japan", "USA", "China"],
+  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
+}
+```
+An example output:
+```json
+{
+  "first_name": "Susan",
+  "last_name": "Smith",
+  "description": "A student studying computer science.",
+  "age": 20,
+  "gpa": 3.7,
+  "birth_date": "2005-03-01",
+  "nationality": "England",
+  "languages_spoken": ["English", "French"]
+}
+```
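The type conventions above can be checked mechanically before a template is sent to the model. The following is a minimal sketch, not part of NuExtract: `field_kind` and `SCALAR_TYPES` are hypothetical names, the `tags` field is added only for illustration, and a one-element list holding a type name (e.g. `["string"]`) is assumed to mean an array rather than a single-option enum.

```python
import json

# Scalar type names taken from the list of supported types above.
SCALAR_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def field_kind(spec):
    """Classify one template value. Hypothetical helper, not part of NuExtract."""
    if isinstance(spec, str):
        if spec not in SCALAR_TYPES:
            raise ValueError(f"unknown type name: {spec!r}")
        return spec
    if isinstance(spec, dict):
        # Nested object: classify each field recursively.
        return {name: field_kind(value) for name, value in spec.items()}
    if isinstance(spec, list) and spec:
        if len(spec) == 1:
            inner = spec[0]
            if isinstance(inner, list) and inner and all(isinstance(o, str) for o in inner):
                return "multi-label"  # double-wrapped list of options
            if isinstance(inner, str) and inner in SCALAR_TYPES:
                return f"array of {inner}"  # e.g. ["string"]
            if isinstance(inner, dict):
                return [field_kind(inner)]  # array of objects
        if all(isinstance(option, str) for option in spec):
            return "enum"  # flat list of options
    raise ValueError(f"unsupported template value: {spec!r}")


# Subset of the example template above; "tags" is an illustrative extra field.
template = json.loads("""{
  "first_name": "verbatim-string",
  "age": "integer",
  "nationality": ["France", "England", "Japan", "USA", "China"],
  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]],
  "tags": ["string"]
}""")
kinds = field_kind(template)
```

Running a check like this first surfaces a misspelled type name (for example `"int"`) as a `ValueError` instead of as an unexpected extraction result.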
+
+⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to many extraction tasks.
+
+## Using NuExtract with 🤗 Transformers
+
+```python
+import torch
+from transformers import AutoProcessor, AutoModelForVision2Seq
+
+model_name = "numind/NuExtract-2.0-2B"
+# model_name = "numind/NuExtract-2.0-8B"
+
+model = AutoModelForVision2Seq.from_pretrained(model_name, 
+                                               trust_remote_code=True, 
+                                               torch_dtype=torch.bfloat16,
+                                               attn_implementation="flash_attention_2",
+                                               device_map="auto")
+processor = AutoProcessor.from_pretrained(model_name, 
+                                          trust_remote_code=True, 
+                                          padding_side='left',
+                                          use_fast=True)
+
+# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
+# min_pixels = 256*28*28
+# max_pixels = 1280*28*28
+# processor = AutoProcessor.from_pretrained(model_name, min_pixels=min_pixels, max_pixels=max_pixels)
+```
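The commented `min_pixels` / `max_pixels` lines above express an image token budget in pixels. A minimal sketch of that arithmetic follows, assuming each visual token covers a 28x28 pixel area (consistent with `patch_size: 14` and `merge_size: 2` in this repository's `preprocessor_config.json`). The helper names are hypothetical, and the token estimate ignores the resizing and rounding the image processor applies, so treat it as approximate.

```python
# Assumption: one visual token corresponds to a 28x28 pixel area
# (patch_size 14 multiplied by merge_size 2).
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into (min_pixels, max_pixels). Hypothetical helper."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


def approx_image_tokens(width, height):
    """Rough token estimate for an image, ignoring processor resizing."""
    return (width * height) // PIXELS_PER_TOKEN


min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)  # 200704 1003520
```

The resulting values can be passed as `min_pixels` and `max_pixels` to `AutoProcessor.from_pretrained`, as in the commented lines above.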
+
+You will need the following function to handle loading of image input data:
+```python
+def process_all_vision_info(messages, examples=None):
+    """
+    Process vision information from both messages and in-context examples, supporting batch processing.
+    
+    Args:
+        messages: List of message dictionaries (single input) OR list of message lists (batch input)
+        examples: Optional list of example dictionaries (single input) OR list of example lists (batch)
+    
+    Returns:
+        A flat list of all images in the correct order:
+        - For single input: example images followed by message images
+        - For batch input: interleaved as (item1 examples, item1 input, item2 examples, item2 input, etc.)
+        - Returns None if no images were found
+    """
+    from qwen_vl_utils import process_vision_info, fetch_image
+    
+    # Helper function to extract images from examples
+    def extract_example_images(example_item):
+        if not example_item:
+            return []
+            
+        # Handle both list of examples and single example
+        examples_to_process = example_item if isinstance(example_item, list) else [example_item]
+        images = []
+        
+        for example in examples_to_process:
+            if isinstance(example.get('input'), dict) and example['input'].get('type') == 'image':
+                images.append(fetch_image(example['input']))
+                
+        return images
+    
+    # Normalize inputs to always be batched format
+    is_batch = messages and isinstance(messages[0], list)
+    messages_batch = messages if is_batch else [messages]
+    is_batch_examples = examples and isinstance(examples, list) and (isinstance(examples[0], list) or examples[0] is None)
+    examples_batch = examples if is_batch_examples else ([examples] if examples is not None else None)
+    
+    # Ensure examples batch matches messages batch if provided
+    if examples and len(examples_batch) != len(messages_batch):
+        if not is_batch and len(examples_batch) == 1:
+            # Single example set for a single input is fine
+            pass
+        else:
+            raise ValueError("Examples batch length must match messages batch length")
+    
+    # Process all inputs, maintaining correct order
+    all_images = []
+    for i, message_group in enumerate(messages_batch):
+        # Get example images for this input
+        if examples and i < len(examples_batch):
+            input_example_images = extract_example_images(examples_batch[i])
+            all_images.extend(input_example_images)
+        
+        # Get message images for this input
+        input_message_images = process_vision_info(message_group)[0] or []
+        all_images.extend(input_message_images)
+    
+    return all_images if all_images else None
+```
+
+For example, to perform a basic extraction of names from a text document:
+```python
+template = """{"names": ["string"]}"""
+document = "John went to the restaurant with Mary. James went to the cinema."
+
+# prepare the user message content
+messages = [{"role": "user", "content": document}]
+text = processor.tokenizer.apply_chat_template(
+    messages,
+    template=template, # template is specified here
+    tokenize=False,
+    add_generation_prompt=True,
+)
+
+print(text)
+"""<|im_start|>user
+# Template:
+{"names": ["string"]}
+# Context:
+John went to the restaurant with Mary. James went to the cinema.<|im_end|> 
+<|im_start|>assistant"""
+
+image_inputs = process_all_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+# we choose greedy sampling here, which works well for most information extraction tasks
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+# Inference: Generation of the output
+generated_ids = model.generate(
+    **inputs,
+    **generation_config
+)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+
+print(output_text)
+# ['{"names": ["John", "Mary", "James"]}']
+```
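The decoded output is a list of JSON strings, so downstream code usually parses it before use. The sketch below shows one way to do that defensively. `parse_extraction` is a hypothetical helper, and the `output_text` value is a stand-in copied from the example output above rather than a fresh model result.

```python
import json


def parse_extraction(raw_text):
    """Decode one generated string; return None if it is not valid JSON.

    Hypothetical helper, not part of NuExtract or transformers.
    """
    try:
        return json.loads(raw_text)
    except json.JSONDecodeError:
        return None


# Stand-in for the `output_text` list produced by the example above.
output_text = ['{"names": ["John", "Mary", "James"]}']

result = parse_extraction(output_text[0])
# Missing fields come back as null or [], so fall back to an empty list.
names = (result or {}).get("names") or []
print(names)  # ['John', 'Mary', 'James']
```

Returning `None` on a decode error keeps truncated generations (for instance when `max_new_tokens` is reached) from raising in the middle of a batch.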
+
+<details>
+<summary>In-Context Examples</summary>
+Sometimes the model might not perform as well as we want because our task is challenging or involves some degree of ambiguity. Alternatively, we may want the model to follow some specific formatting, or just give it a bit more help. In cases like this it can be valuable to provide "in-context examples" to help NuExtract better understand the task.
+
+To do so, we can provide a list of examples (dictionaries of input/output pairs). In the example below, we show the model that we want the extracted names to be in capital letters with `-` on either side (for the sake of illustration). Providing multiple examples usually leads to better results.
+```python
+template = """{"names": ["string"]}"""
+document = "John went to the restaurant with Mary. James went to the cinema."
+examples = [
+    {
+        "input": "Stephen is the manager at Susan's store.",
+        "output": """{"names": ["-STEPHEN-", "-SUSAN-"]}"""
+    }
+]
+
+messages = [{"role": "user", "content": document}]
+text = processor.tokenizer.apply_chat_template(
+    messages,
+    template=template,
+    examples=examples, # examples provided here
+    tokenize=False,
+    add_generation_prompt=True,
+)
+
+image_inputs = process_all_vision_info(messages, examples)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+# we choose greedy sampling here, which works well for most information extraction tasks
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+# Inference: Generation of the output
+generated_ids = model.generate(
+    **inputs,
+    **generation_config
+)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+# ['{"names": ["-JOHN-", "-MARY-", "-JAMES-"]}']
+```
+</details>
+
+<details>
+<summary>Image Inputs</summary>
+To give NuExtract an image input instead of text, provide a dictionary specifying the image file as the message content instead of a string (e.g. `{"type": "image", "image": "file://image.jpg"}`).
+
+You can also specify an image URL (e.g. `{"type": "image", "image": "http://path/to/your/image.jpg"}`) or base64 encoding (e.g. `{"type": "image", "image": "data:image;base64,/9j/..."}`).
+```python
+template = """{"store": "verbatim-string"}"""
+document = {"type": "image", "image": "file://1.jpg"}
+
+messages = [{"role": "user", "content": [document]}]
+text = processor.tokenizer.apply_chat_template(
+    messages,
+    template=template,
+    tokenize=False,
+    add_generation_prompt=True,
+)
+
+image_inputs = process_all_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+# Inference: Generation of the output
+generated_ids = model.generate(
+    **inputs,
+    **generation_config
+)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+# ['{"store": "Trader Joe\'s"}']
+```
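When an image is already in memory (for example, downloaded or read from a database), the base64 form mentioned above avoids writing a temporary file. This is a minimal sketch of building that message content. `image_bytes_to_content` is a hypothetical helper, and the three bytes used here are only the start of a JPEG header, not a complete image.

```python
import base64


def image_bytes_to_content(image_bytes):
    """Wrap raw image bytes as a base64 image entry. Hypothetical helper."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image", "image": f"data:image;base64,{encoded}"}


# Illustrative placeholder: the first three bytes of a JPEG file.
content = image_bytes_to_content(b"\xff\xd8\xff")
print(content["image"])  # data:image;base64,/9j/
```

The returned dictionary can take the place of `document` in the example above, i.e. `messages = [{"role": "user", "content": [content]}]`.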
+</details>
+
+<details>
+<summary>Batch Inference</summary>
+
+```python
+inputs = [
+    # image input with no ICL examples
+    {
+        "document": {"type": "image", "image": "file://0.jpg"},
+        "template": """{"store_name": "verbatim-string"}""",
+    },
+    # image input with 1 ICL example
+    {
+        "document": {"type": "image", "image": "file://0.jpg"},
+        "template": """{"store_name": "verbatim-string"}""",
+        "examples": [
+            {
+                "input": {"type": "image", "image": "file://1.jpg"},
+                "output": """{"store_name": "Trader Joe's"}""",
+            }
+        ],
+    },
+    # text input with no ICL examples
+    {
+        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
+        "template": """{"names": ["string"]}""",
+    },
+    # text input with ICL example
+    {
+        "document": {"type": "text", "text": "John went to the restaurant with Mary. James went to the cinema."},
+        "template": """{"names": ["string"]}""",
+        "examples": [
+            {
+                "input": "Stephen is the manager at Susan's store.",
+                "output": """{"names": ["STEPHEN", "SUSAN"]}"""
+            }
+        ],
+    },
+]
+
+# messages should be a list of lists for batch processing
+messages = [
+    [
+        {
+            "role": "user",
+            "content": [x['document']],
+        }
+    ]
+    for x in inputs
+]
+
+# apply chat template to each example individually
+texts = [
+    processor.tokenizer.apply_chat_template(
+        messages[i],  # Now this is a list containing one message
+        template=x['template'],
+        examples=x.get('examples', None),
+        tokenize=False, 
+        add_generation_prompt=True)
+    for i, x in enumerate(inputs)
+]
+
+image_inputs = process_all_vision_info(messages, [x.get('examples') for x in inputs])
+inputs = processor(
+    text=texts,
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+# Batch Inference
+generated_ids = model.generate(**inputs, **generation_config)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_texts = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+for y in output_texts:
+    print(y)
+# {"store_name": "WAL-MART"}
+# {"store_name": "Walmart"}
+# {"names": ["John", "Mary", "James"]}
+# {"names": ["JOHN", "MARY", "JAMES"]}
+```
+</details>
+
+<details>
+<summary>Template Generation</summary>
+If you want to convert existing schema files from other formats (e.g. XML, YAML) or start from an example, NuExtract 2.0 models can automatically generate a template for you.
+
+For example, to convert XML into a NuExtract template:
+```python
+xml_template = """<SportResult>
+    <Date></Date>
+    <Sport></Sport>
+    <Venue></Venue>
+    <HomeTeam></HomeTeam>
+    <AwayTeam></AwayTeam>
+    <HomeScore></HomeScore>
+    <AwayScore></AwayScore>
+    <TopScorer></TopScorer>
+</SportResult>"""
+
+messages = [
+        {
+            "role": "user",
+            "content": [{"type": "text", "text": xml_template}],
+        }
+    ]
+
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True,
+)
+
+image_inputs = process_all_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+generated_ids = model.generate(
+    **inputs,
+    **generation_config
+)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+
+print(output_text[0])
+# {
+#     "Date": "date-time",
+#     "Sport": "verbatim-string",
+#     "Venue": "verbatim-string",
+#     "HomeTeam": "verbatim-string",
+#     "AwayTeam": "verbatim-string",
+#     "HomeScore": "integer",
+#     "AwayScore": "integer",
+#     "TopScorer": "verbatim-string"
+# }
+```
+
+For example, to generate a template from a natural language description:
+```python
+description = "I would like to extract important details from the contract."
+
+messages = [
+        {
+            "role": "user",
+            "content": [{"type": "text", "text": description}],
+        }
+    ]
+
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True,
+)
+
+image_inputs = process_all_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    padding=True,
+    return_tensors="pt",
+).to("cuda")
+
+generated_ids = model.generate(
+    **inputs,
+    **generation_config
+)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+
+print(output_text[0])
+# {
+#     "Contract": {
+#         "Title": "verbatim-string",
+#         "Description": "verbatim-string",
+#         "Terms": [
+#             {
+#                 "Term": "verbatim-string",
+#                 "Description": "verbatim-string"
+#             }
+#         ],
+#         "Date": "date-time",
+#         "Signatory": "verbatim-string"
+#     }
+# }
+```
+</details>
+
+## Fine-Tuning
+You can find a fine-tuning tutorial notebook in the [cookbooks](https://github.com/numindai/nuextract/tree/main/cookbooks) folder of the [GitHub repo](https://github.com/numindai/nuextract/tree/main).
+
+## vLLM Deployment
+Run the command below to serve an OpenAI-compatible API:
+```bash
+vllm serve numind/NuExtract-2.0-8B --trust_remote_code --limit-mm-per-prompt image=6 --chat-template-content-format openai
+```
+If you encounter memory issues, set `--max-model-len` accordingly.
+
+Send requests to the model as follows:
+```python
+import json
+from openai import OpenAI
+
+openai_api_key = "EMPTY"
+openai_api_base = "http://localhost:8000/v1"
+
+client = OpenAI(
+    api_key=openai_api_key,
+    base_url=openai_api_base,
+)
+
+chat_response = client.chat.completions.create(
+    model="numind/NuExtract-2.0-8B",
+    temperature=0,
+    messages=[
+        {
+            "role": "user", 
+            "content": [{"type": "text", "text": "Yesterday I went shopping at Bunnings"}],
+        },
+    ],
+    extra_body={
+        "chat_template_kwargs": {
+            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4)
+        },
+    }
+)
+print("Chat response:", chat_response)
+```
+For image inputs, structure requests as shown below. Make sure to order the images in `"content"` as they appear in the prompt (i.e. any in-context examples before the main input).
+```python
+import base64
+
+def encode_image(image_path):
+    """
+    Encode the image file to base64 string
+    """
+    with open(image_path, "rb") as image_file:
+        return base64.b64encode(image_file.read()).decode('utf-8')
+
+base64_image = encode_image("0.jpg")
+base64_image2 = encode_image("1.jpg")
+
+chat_response = client.chat.completions.create(
+    model="numind/NuExtract-2.0-8B",
+    temperature=0,
+    messages=[
+        {
+            "role": "user", 
+            "content": [
+                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}}, # first ICL example image
+                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image2}"}}, # real input image
+            ],
+        },
+    ],
+    extra_body={
+        "chat_template_kwargs": {
+            "template": json.dumps(json.loads("""{\"store\": \"verbatim-string\"}"""), indent=4),
+            "examples": [
+                {
+                    "input": "<image>",
+                    "output": """{\"store\": \"Walmart\"}"""
+                }
+            ]
+        },
+    }
+)
+print("Chat response:", chat_response)
+```
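Both requests above build `extra_body` by hand, with the template serialized using `indent=4`. A small helper can keep that formatting consistent across calls. The sketch below assumes the same `chat_template_kwargs` layout as the requests above. `build_extra_body` is a hypothetical name and is not part of vLLM or the OpenAI client.

```python
import json


def build_extra_body(template, examples=None):
    """Build the `extra_body` dict for a chat completion request.

    Hypothetical helper: `template` may be a dict or a JSON string.
    """
    if isinstance(template, str):
        template = json.loads(template)  # also rejects malformed JSON early
    kwargs = {"template": json.dumps(template, indent=4)}
    if examples:
        kwargs["examples"] = examples
    return {"chat_template_kwargs": kwargs}


extra_body = build_extra_body(
    '{"store": "verbatim-string"}',
    examples=[{"input": "<image>", "output": '{"store": "Walmart"}'}],
)
```

The result can be passed directly as `extra_body=extra_body` to `client.chat.completions.create`.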
--- a/added_tokens.json
+++ b/added_tokens.json
@@ -0,0 +1,24 @@
+{
+  "</tool_call>": 151658,
+  "<tool_call>": 151657,
+  "<|box_end|>": 151649,
+  "<|box_start|>": 151648,
+  "<|endoftext|>": 151643,
+  "<|file_sep|>": 151664,
+  "<|fim_middle|>": 151660,
+  "<|fim_pad|>": 151662,
+  "<|fim_prefix|>": 151659,
+  "<|fim_suffix|>": 151661,
+  "<|im_end|>": 151645,
+  "<|im_start|>": 151644,
+  "<|image_pad|>": 151655,
+  "<|object_ref_end|>": 151647,
+  "<|object_ref_start|>": 151646,
+  "<|quad_end|>": 151651,
+  "<|quad_start|>": 151650,
+  "<|repo_name|>": 151663,
+  "<|video_pad|>": 151656,
+  "<|vision_end|>": 151653,
+  "<|vision_pad|>": 151654,
+  "<|vision_start|>": 151652
+}
--- a/chat_template.json
+++ b/chat_template.json
--- a/config.json
+++ b/config.json
@@ -0,0 +1,49 @@
+{
+  "_name_or_path": "experiments/Qwen2_2B-final-stp/checkpoint-70584",
+  "architectures": [
+    "Qwen2VLForConditionalGeneration"
+  ],
+  "attention_dropout": 0.0,
+  "bos_token_id": 151643,
+  "eos_token_id": 151645,
+  "hidden_act": "silu",
+  "hidden_size": 1536,
+  "image_token_id": 151655,
+  "initializer_range": 0.02,
+  "intermediate_size": 8960,
+  "max_position_embeddings": 32768,
+  "max_window_layers": 28,
+  "model_type": "qwen2_vl",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 28,
+  "num_key_value_heads": 2,
+  "rms_norm_eps": 1e-06,
+  "rope_scaling": {
+    "mrope_section": [
+      16,
+      24,
+      24
+    ],
+    "rope_type": "default",
+    "type": "default"
+  },
+  "rope_theta": 1000000.0,
+  "sliding_window": 32768,
+  "tie_word_embeddings": true,
+  "torch_dtype": "bfloat16",
+  "transformers_version": "4.49.0",
+  "use_cache": true,
+  "use_sliding_window": false,
+  "video_token_id": 151656,
+  "vision_config": {
+    "hidden_size": 1536,
+    "in_chans": 3,
+    "model_type": "qwen2_vl",
+    "spatial_patch_size": 14,
+    "torch_dtype": "bfloat16"
+  },
+  "vision_end_token_id": 151653,
+  "vision_start_token_id": 151652,
+  "vision_token_id": 151654,
+  "vocab_size": 151936
+}
--- a/generation_config.json
+++ b/generation_config.json
@@ -0,0 +1,14 @@
+{
+  "attn_implementation": "flash_attention_2",
+  "bos_token_id": 151643,
+  "do_sample": true,
+  "eos_token_id": [
+    151645,
+    151643
+  ],
+  "pad_token_id": 151643,
+  "temperature": 0.01,
+  "top_k": 1,
+  "top_p": 0.001,
+  "transformers_version": "4.49.0"
+}
--- a/logo_nuextract.svg
+++ b/logo_nuextract.svg
@@ -0,0 +1,90 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!-- Created with Inkscape (http://www.inkscape.org/) -->
+
+<svg
+   version="1.1"
+   id="svg1"
+   width="1520.4797"
+   height="292.58524"
+   viewBox="0 0 1520.4798 292.58524"
+   xmlns="http://www.w3.org/2000/svg"
+   xmlns:svg="http://www.w3.org/2000/svg">
+  <defs
+     id="defs1">
+    <clipPath
+       clipPathUnits="userSpaceOnUse"
+       id="clipPath13">
+      <path
+         d="M 0,341.62 H 1100.11 V 0 H 0 Z"
+         transform="translate(-296.42451,-172.17601)"
+         id="path13" />
+    </clipPath>
+    <clipPath
+       clipPathUnits="userSpaceOnUse"
+       id="clipPath15">
+      <path
+         d="M 0,341.62 H 1100.11 V 0 H 0 Z"
+         transform="translate(-232.38961,-267.80771)"
+         id="path15" />
+    </clipPath>
+    <clipPath
+       clipPathUnits="userSpaceOnUse"
+       id="clipPath17">
+      <path
+         d="M 0,341.62 H 1100.11 V 0 H 0 Z"
+         transform="translate(-122.20671,-236.63431)"
+         id="path17" />
+    </clipPath>
+    <clipPath
+       clipPathUnits="userSpaceOnUse"
+       id="clipPath19">
+      <path
+         d="M 0,341.62 H 1100.11 V 0 H 0 Z"
+         transform="translate(-117.495,-121.51351)"
+         id="path19" />
+    </clipPath>
+    <clipPath
+       clipPathUnits="userSpaceOnUse"
+       id="clipPath21">
+      <path
+         d="M 0,341.62 H 1100.11 V 0 H 0 Z"
+         transform="translate(-225.1954,-81.860705)"
+         id="path21" />
+    </clipPath>
+  </defs>
+  <path
+     id="path12"
+     d="m 0,0 v 0 c 1.104,3.005 2.142,6.323 3.014,9.839 v 0 C 15.425,-2.193 20.455,-12.66 13.315,-25.357 5.134,-41.848 -32.768,-69.221 -50.814,-72.11 c 2.78,6.99 39.967,81.895 -12.419,126.995 28.42,-10.17 49.405,-30.511 52.717,-33.265 0.018,-25.369 -7.695,-45.257 -11.521,-56.247 C 0.177,-26.332 9.806,-11.907 0,0"
+     style="fill:#e4843a;fill-opacity:1;fill-rule:nonzero;stroke:none"
+     transform="matrix(1.3333333,0,0,-1.3333333,273.36248,144.06706)"
+     clip-path="url(#clipPath13)" />
+  <path
+     id="path14"
+     d="m 0,0 v 0 c -2.517,1.979 -5.352,3.991 -8.426,5.907 v 0 C 6.852,13.993 18.361,15.542 28.231,4.828 41.386,-8.049 55.707,-52.555 52.878,-70.61 47.09,-65.806 -12.658,-7.292 -71.739,-43.178 -53.284,-19.292 -27.454,-5.619 -23.811,-3.32 0.322,-11.143 16.853,-24.624 26.122,-31.659 25.098,-7.969 14.354,5.646 0,0"
+     style="fill:#a37fb8;fill-opacity:1;fill-rule:nonzero;stroke:none"
+     transform="matrix(1.3333333,0,0,-1.3333333,187.98261,16.558133)"
+     clip-path="url(#clipPath15)" />
+  <path
+     id="path16"
+     d="m 0,0 v 0 c -2.66,-1.782 -5.45,-3.856 -8.221,-6.189 v 0 c -2.97,17.03 -0.886,28.454 12.354,34.53 16.311,8.532 63.064,8.399 79.361,0.129 C 77.137,24.45 3.024,-14.292 18.896,-81.57 1.882,-56.638 -3.14,-27.847 -4.2,-23.672 10.697,-3.138 28.627,8.418 38.181,15.06 15.334,21.407 -0.934,15.396 0,0"
+     style="fill:#87b7e0;fill-opacity:1;fill-rule:nonzero;stroke:none"
+     transform="matrix(1.3333333,0,0,-1.3333333,41.072076,58.122663)"
+     clip-path="url(#clipPath17)" />
+  <path
+     id="path18"
+     d="m 0,0 v 0 c 0.873,-3.08 1.984,-6.375 3.345,-9.731 v 0 c -17.113,2.438 -27.335,7.95 -29.022,22.419 -3.074,18.15 11.5,62.573 24.401,75.518 C 0.583,80.917 14.527,-1.541 83.417,-7.236 54.447,-15.712 25.513,-11.591 21.215,-11.31 6.29,9.204 0.84,29.827 -2.525,40.967 -15.621,21.199 -14.931,3.869 0,0"
+     style="fill:#99b535;fill-opacity:1;fill-rule:nonzero;stroke:none"
+     transform="matrix(1.3333333,0,0,-1.3333333,34.789806,211.61706)"
+     clip-path="url(#clipPath19)" />
+  <path
+     id="path20"
+     d="m 0,0 v 0 c 3.199,-0.121 6.676,-0.083 10.289,0.174 v 0 C 2.681,-15.348 -5.719,-23.366 -20.002,-20.499 -38.213,-17.814 -75.959,9.774 -84.283,26.044 -76.776,25.56 5.954,13.34 32.659,77.099 31.768,46.927 18.908,20.683 17.312,16.682 -6.81,8.826 -28.108,10.016 -39.742,10.258 -24.988,-8.305 -8.294,-13.005 0,0"
+     style="fill:#f0cf35;fill-opacity:1;fill-rule:nonzero;stroke:none"
+     transform="matrix(1.3333333,0,0,-1.3333333,178.39034,264.48746)"
+     clip-path="url(#clipPath21)" />
+  <path
+     style="font-weight:500;font-size:266.667px;font-family:Avenir;-inkscape-font-specification:'Avenir Medium';white-space:pre;fill:#535353;stroke-width:0;stroke-linejoin:round;stroke-miterlimit:5.4;stroke-opacity:0"
+     d="m 331.94501,49.293372 h 33.60005 L 469.54518,204.49356 h 0.53334 V 49.293372 h 25.60003 V 238.09361 H 463.14517 L 358.07838,82.893417 h -0.53333 V 238.09361 H 331.94501 Z M 650.07859,238.09361 h -24.00001 v -19.46669 h -0.53333 q -4.53334,10.13334 -15.73336,16.53335 -11.20001,6.13334 -25.86669,6.13334 -9.33335,0 -17.60003,-2.93334 -8.26667,-2.66667 -14.66668,-8.53334 -6.13334,-5.86667 -9.86668,-14.93335 -3.73334,-9.33335 -3.73334,-21.8667 v -81.33343 h 24.00003 v 74.66676 q 0,8.80001 2.40001,15.20002 2.4,6.13334 6.4,10.13334 4.00001,3.73334 9.06668,5.60001 5.33334,1.6 10.93335,1.6 7.46667,0 13.86668,-2.4 6.40001,-2.4 11.20002,-7.46668 4.8,-5.33334 7.46667,-13.33335 2.66667,-8.00001 2.66667,-18.93335 v -65.06675 h 24.00001 z m 42.4,-188.800238 h 121.8668 v 24.00003 h -96.2668 v 56.266738 h 89.6001 v 24.00003 h -89.6001 v 60.53341 h 101.0668 v 24.00003 h -126.6668 z m 191.467,121.333488 -44.8001,-58.93341 h 30.9334 l 30.6667,44.80006 30.1333,-44.80006 h 29.0667 l -43.2,58.93341 51.2001,67.46675 h -30.9334 l -37.3334,-53.06674 -37.3334,53.06674 h -29.6 z m 171.46661,-38.13338 h -34.4001 v 57.3334 q 0,5.33334 0.2667,10.66668 0.2667,5.06667 1.8667,9.33334 1.8666,4.00001 5.3333,6.66668 3.7333,2.4 10.6667,2.4 4.2667,0 8.8,-0.8 4.5333,-0.8 8.2667,-2.93334 v 21.8667 q -4.2667,2.4 -11.2,3.2 -6.6667,1.06667 -10.4001,1.06667 -13.8666,0 -21.6,-3.73334 -7.4667,-4 -11.2,-10.13334 -3.46671,-6.13334 -4.26671,-13.60002 -0.5333,-7.73334 -0.5333,-15.46669 v -65.86674 h -27.7334 v -20.80003 h 27.7334 V 76.226737 h 24.00001 v 35.466713 h 34.4001 z m 30.9335,-20.80003 h 24.0001 v 19.46669 h 0.5333 q 2.4,-5.06667 6.4,-9.06668 4,-4.26667 8.8,-7.2 5.0667,-2.93334 10.9334,-4.53334 5.8666,-1.86667 11.7333,-1.86667 5.8667,0 10.6667,1.6 l -1.0667,25.8667 q -2.9333,-0.8 -5.8666,-1.33334 -2.9334,-0.53333 -5.8667,-0.53333 -17.6,0 -26.9334,9.86668 -9.3333,9.86668 -9.3333,30.6667 v 63.46675 h -24.0001 z m 99.2004,15.46669 q 10.1333,-9.33335 23.4667,-13.86669 13.3333,-4.8 26.6667,-4.8 
13.8666,0 23.7333,3.46667 10.1334,3.46667 16.5334,9.33334 6.4,5.86668 9.3333,13.60002 3.2,7.46668 3.2,15.73335 v 64.53341 q 0,6.66668 0.2667,12.26669 0.2667,5.6 0.8,10.66668 h -21.3334 q -0.8,-9.60002 -0.8,-19.20003 h -0.5333 q -8,12.26668 -18.9334,17.33336 -10.9333,5.06667 -25.3333,5.06667 -8.8,0 -16.8,-2.4 -8.0001,-2.40001 -14.1334,-7.20001 -5.8667,-4.80001 -9.3333,-11.73335 -3.4667,-7.20001 -3.4667,-16.53335 0,-12.26668 5.3333,-20.53336 5.6,-8.26668 14.9334,-13.33335 9.6,-5.33334 22.1333,-7.46668 12.8001,-2.4 27.2001,-2.4 h 17.6 v -5.33334 q 0,-4.80001 -1.8667,-9.60001 -1.8666,-4.80001 -5.6,-8.53335 -3.7333,-4 -9.3333,-6.13334 -5.6,-2.4 -13.3334,-2.4 -6.9333,0 -12.2667,1.33334 -5.0666,1.33333 -9.3333,3.46667 -4.2667,1.86667 -7.7333,4.53334 -3.4667,2.66667 -6.6667,5.06667 z m 67.7334,50.13339 q -8.5334,0 -17.6,1.06667 -8.8,0.8 -16.2667,3.46667 -7.2,2.66667 -12,7.46668 -4.5334,4.8 -4.5334,12.26668 0,10.93334 7.2,15.73335 7.4667,4.80001 20.0001,4.80001 9.8666,0 16.8,-3.20001 6.9333,-3.46667 11.2,-8.80001 4.2667,-5.33334 6.1333,-11.73335 1.8667,-6.66667 1.8667,-13.06668 v -8.00001 z m 160.5333,-32.00004 q -6.6667,-6.93334 -14.1333,-10.40001 -7.2001,-3.73334 -17.3334,-3.73334 -9.8667,0 -17.3334,3.73334 -7.2,3.46667 -12.2666,9.86668 -4.8,6.13334 -7.4667,14.40002 -2.4,8.00001 -2.4,16.80002 0,8.80001 2.9333,16.80002 2.9334,7.73334 8.2667,13.60001 5.3333,5.86668 12.8,9.33335 7.4667,3.2 16.8,3.2 10.1334,0 17.3334,-3.46667 7.2,-3.73334 13.3333,-10.66668 l 17.0667,17.06669 q -9.3333,10.40001 -21.8667,14.93335 -12.2666,4.53334 -26.1333,4.53334 -14.6667,0 -26.9334,-4.80001 -12,-4.8 -20.8,-13.33335 -8.8,-8.80001 -13.6,-20.80002 -4.8,-12.26668 -4.8,-26.93337 0,-14.66668 4.8,-26.93336 4.8,-12.26669 13.3333,-21.0667 8.8,-8.80001 20.8,-13.60001 12.2667,-5.06668 27.2001,-5.06668 13.8667,0 26.4,5.06668 12.8,4.8 22.1334,14.93335 z m 105.8669,-12.80001 h -34.4 v 57.3334 q 0,5.33334 0.2667,10.66668 0.2666,5.06667 1.8666,9.33334 1.8667,4.00001 5.3334,6.66668 3.7333,2.4 10.6667,2.4 
4.2666,0 8.8,-0.8 4.5333,-0.8 8.2666,-2.93334 v 21.8667 q -4.2666,2.4 -11.2,3.2 -6.6666,1.06667 -10.4,1.06667 -13.8667,0 -21.6,-3.73334 -7.4667,-4 -11.2,-10.13334 -3.4667,-6.13334 -4.2667,-13.60002 -0.5333,-7.73334 -0.5333,-15.46669 v -65.86674 h -27.7334 v -20.80003 h 27.7334 V 76.226737 h 24 v 35.466713 h 34.4 z"
+     id="text21"
+     aria-label="NuExtract" />
+</svg>
--- a/merges.txt
+++ b/merges.txt
--- a/model.safetensors
+++ b/model.safetensors
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:5da927d09ed271c51e0c497556aa1ab37f10cb3464863173bb474e41a2148d5d
+size 4418050848
--- a/nuextract2_bench.png
+++ b/nuextract2_bench.png
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:b2cdf1eec686510aaa05e91d098ddda56f4674e7448a3e4b66e50a915240b545
+size 106243
--- a/preprocessor_config.json
+++ b/preprocessor_config.json
@@ -0,0 +1,29 @@
+{
+  "do_convert_rgb": true,
+  "do_normalize": true,
+  "do_rescale": true,
+  "do_resize": true,
+  "image_mean": [
+    0.48145466,
+    0.4578275,
+    0.40821073
+  ],
+  "image_processor_type": "Qwen2VLImageProcessor",
+  "image_std": [
+    0.26862954,
+    0.26130258,
+    0.27577711
+  ],
+  "max_pixels": 23000000,
+  "merge_size": 2,
+  "min_pixels": 200704,
+  "patch_size": 14,
+  "processor_class": "Qwen2_5_VLProcessor",
+  "resample": 3,
+  "rescale_factor": 0.00392156862745098,
+  "size": {
+    "longest_edge": 23000000,
+    "shortest_edge": 200704
+  },
+  "temporal_patch_size": 2
+}
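Note on the values in preprocessor_config.json above: the sketch below works out roughly what `min_pixels`, `max_pixels`, `patch_size` and `merge_size` imply for the number of visual tokens per image. It assumes one visual token covers a block of `patch_size * merge_size` pixels per side, which is how Qwen2-VL-style processors are generally described. The helper names (`token_budget`, `estimate_resized_size`, `estimate_visual_tokens`) are hypothetical and are not part of this repository or of transformers. The resize logic is a simplified approximation, not the library's own implementation.

```python
import math

# Values copied from preprocessor_config.json in this commit.
PATCH_SIZE = 14
MERGE_SIZE = 2
MIN_PIXELS = 200704
MAX_PIXELS = 23000000
RESCALE_FACTOR = 0.00392156862745098  # matches 1/255

# Assumption: each visual token covers a square block of
# (patch_size * merge_size) pixels per side, i.e. 28 x 28 = 784 pixels.
FACTOR = PATCH_SIZE * MERGE_SIZE
PIXELS_PER_TOKEN = FACTOR ** 2


def token_budget(pixels: int) -> int:
    """Whole visual tokens that fit in a given pixel count."""
    return pixels // PIXELS_PER_TOKEN


def estimate_resized_size(height: int, width: int) -> tuple[int, int]:
    """Hypothetical helper: snap both sides to multiples of FACTOR and
    scale the image so its area lands inside [MIN_PIXELS, MAX_PIXELS]."""
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > MAX_PIXELS:
        # Too large: shrink, rounding down so the area stays under the cap.
        beta = math.sqrt((height * width) / MAX_PIXELS)
        h = math.floor(height / beta / FACTOR) * FACTOR
        w = math.floor(width / beta / FACTOR) * FACTOR
    elif h * w < MIN_PIXELS:
        # Too small: enlarge, rounding up so the area reaches the floor.
        beta = math.sqrt(MIN_PIXELS / (height * width))
        h = math.ceil(height * beta / FACTOR) * FACTOR
        w = math.ceil(width * beta / FACTOR) * FACTOR
    return h, w


def estimate_visual_tokens(height: int, width: int) -> int:
    """Approximate visual token count for an image of the given size."""
    h, w = estimate_resized_size(height, width)
    return (h // FACTOR) * (w // FACTOR)


if __name__ == "__main__":
    print("min token budget:", token_budget(MIN_PIXELS))
    print("max token budget:", token_budget(MAX_PIXELS))
    print("tokens for 448x448:", estimate_visual_tokens(448, 448))
```

Under that assumption, `min_pixels` corresponds to a 448 x 448 image, and the large `max_pixels` value allows very high-resolution inputs, so memory use on big images may be worth checking before relying on the default.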
--- a/special_tokens_map.json
+++ b/special_tokens_map.json
@@ -0,0 +1,31 @@
+{
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
+  "eos_token": {
+    "content": "<|im_end|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<|endoftext|>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
--- a/tokenizer.json
+++ b/tokenizer.json
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:ba0c439f7be467bf47d12a7e6f9adc6116201056fc60c67f431c679b7c16afc8
+size 11422064
--- a/tokenizer_config.json
+++ b/tokenizer_config.json
--- a/vocab.json
+++ b/vocab.json