Initialize project; model provided by the ModelHub XC community

Model: numind/NuExtract-2-2B-experimental Source: Original Platform
2026-05-04 22:56:44 +08:00
commit e31fbf3385
18 changed files with 4306 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,617 @@
+---
+license: mit
+language:
+- multilingual
+tags:
+- nlp
+base_model: OpenGVLab/InternVL2_5-2B
+pipeline_tag: text-generation
+inference: true
+---
+
+# NuExtract-2-2B [experimental version] by NuMind 🔥
+
+NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. The models accept both text and image inputs and are multilingual.
+
+NB: This is an experimental version that will be superseded by NuExtract 2.0.
+
+We provide several versions of different sizes, all based on the InternVL2.5 family.
+
+| Model Size | Model Name | Base Model | Hugging Face Link |
+|------------|------------|------------|------------------|
+| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
+| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
+| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |
+
+## Overview
+
+To use the model, provide an input text or image and a JSON template describing the information you need to extract. The template should be a JSON object that specifies the field names and their expected types.
+
+Supported types include:
+* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
+* `string` - a generic string field that can incorporate paraphrasing/abstraction.
+* `integer` - a whole number.
+* `number` - a whole or decimal number.
+* `date-time` - an ISO-formatted date.
+* Array of any of the above types (e.g. `["string"]`)
+* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
+* `multi-label` - an enum that can have multiple possible answers (represented in template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).
+
+If the model does not identify relevant information for a field, it will return `null` or `[]` (for arrays and multi-labels).
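
The type rules above can be checked mechanically before a template is sent to the model. The following is a minimal sketch and not part of NuExtract: `SUPPORTED_TYPES`, `check_template`, and the sample template are hypothetical, and treating a single-element list holding an object as an "array of objects" is an assumption based on the nested templates shown later in this README.

```python
import json

# Leaf type names, copied from the list of supported types.
SUPPORTED_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def check_template(node, path="$"):
    """Return a list of problems found in a parsed template (empty if it looks valid)."""
    problems = []
    if isinstance(node, dict):
        for key, value in node.items():
            problems += check_template(value, f"{path}.{key}")
    elif isinstance(node, str):
        if node not in SUPPORTED_TYPES:
            problems.append(f"{path}: unknown type {node!r}")
    elif isinstance(node, list) and len(node) == 1 and isinstance(node[0], list):
        # Double-wrapped array: multi-label, options must be strings.
        if not node[0] or not all(isinstance(opt, str) for opt in node[0]):
            problems.append(f"{path}: multi-label needs a non-empty list of string options")
    elif isinstance(node, list) and len(node) == 1 and isinstance(node[0], dict):
        # Array of objects: check the object that describes each element.
        problems += check_template(node[0], f"{path}[]")
    elif isinstance(node, list) and node and all(isinstance(opt, str) for opt in node):
        # Typed array such as ["string"], or an enum such as ["yes", "no"].
        pass
    else:
        problems.append(f"{path}: unsupported value {node!r}")
    return problems


# Hypothetical template used only for illustration.
template = json.loads(
    '{"name": "verbatim-string", "tags": ["string"], '
    '"status": ["open", "closed"], "labels": [["A", "B"]]}'
)
print(check_template(template))         # []
print(check_template({"age": "int"}))   # ["$.age: unknown type 'int'"]
```

A single-element list such as `["string"]` is ambiguous (typed array or one-option enum); the sketch accepts it either way rather than guessing.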
+
+The following is an example template:
+```json
+{
+  "first_name": "verbatim-string",
+  "last_name": "verbatim-string",
+  "description": "string",
+  "age": "integer",
+  "gpa": "number",
+  "birth_date": "date-time",
+  "nationality": ["France", "England", "Japan", "USA", "China"],
+  "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
+}
+```
+An example output:
+```json
+{
+  "first_name": "Susan",
+  "last_name": "Smith",
+  "description": "A student studying computer science.",
+  "age": 20,
+  "gpa": 3.7,
+  "birth_date": "2005-03-01",
+  "nationality": "England",
+  "languages_spoken": ["English", "French"]
+}
+```
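
Because unfilled fields come back as `null` or `[]`, it can be useful to list them after parsing the output. A minimal sketch, assuming the model returned valid JSON; `missing_fields` and the `raw` string are hypothetical and only illustrate the idea.

```python
import json


def missing_fields(output, path=""):
    """Yield dotted paths of fields that were returned as null or []."""
    for key, value in output.items():
        here = f"{path}.{key}" if path else key
        if value is None or value == []:
            yield here
        elif isinstance(value, dict):
            # Descend into nested objects.
            yield from missing_fields(value, here)


# Hypothetical model output, not produced by an actual run.
raw = '{"first_name": "Susan", "last_name": null, "languages_spoken": [], "age": 20}'
parsed = json.loads(raw)
print(list(missing_fields(parsed)))
# ['last_name', 'languages_spoken']
```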
+
+⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7 which is not well suited to many extraction tasks.
+
+## Inference
+
+Use the following code to handle loading and preprocessing of input data:
+
+```python
+import torch
+import torchvision.transforms as T
+from PIL import Image
+from torchvision.transforms.functional import InterpolationMode
+
+IMAGENET_MEAN = (0.485, 0.456, 0.406)
+IMAGENET_STD = (0.229, 0.224, 0.225)
+
+def build_transform(input_size):
+    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
+    transform = T.Compose([
+        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
+        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
+        T.ToTensor(),
+        T.Normalize(mean=MEAN, std=STD)
+    ])
+    return transform
+
+def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
+    best_ratio_diff = float('inf')
+    best_ratio = (1, 1)
+    area = width * height
+    for ratio in target_ratios:
+        target_aspect_ratio = ratio[0] / ratio[1]
+        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
+        if ratio_diff < best_ratio_diff:
+            best_ratio_diff = ratio_diff
+            best_ratio = ratio
+        elif ratio_diff == best_ratio_diff:
+            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
+                best_ratio = ratio
+    return best_ratio
+
+def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
+    orig_width, orig_height = image.size
+    aspect_ratio = orig_width / orig_height
+
+    # enumerate candidate tile grids (columns, rows) with min_num <= columns * rows <= max_num
+    target_ratios = set(
+        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
+        i * j <= max_num and i * j >= min_num)
+    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])
+
+    # find the closest aspect ratio to the target
+    target_aspect_ratio = find_closest_aspect_ratio(
+        aspect_ratio, target_ratios, orig_width, orig_height, image_size)
+
+    # calculate the target width and height
+    target_width = image_size * target_aspect_ratio[0]
+    target_height = image_size * target_aspect_ratio[1]
+    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]
+
+    # resize the image
+    resized_img = image.resize((target_width, target_height))
+    processed_images = []
+    for i in range(blocks):
+        box = (
+            (i % (target_width // image_size)) * image_size,
+            (i // (target_width // image_size)) * image_size,
+            ((i % (target_width // image_size)) + 1) * image_size,
+            ((i // (target_width // image_size)) + 1) * image_size
+        )
+        # split the image
+        split_img = resized_img.crop(box)
+        processed_images.append(split_img)
+    assert len(processed_images) == blocks
+    if use_thumbnail and len(processed_images) != 1:
+        thumbnail_img = image.resize((image_size, image_size))
+        processed_images.append(thumbnail_img)
+    return processed_images
+
+def load_image(image_file, input_size=448, max_num=12):
+    image = Image.open(image_file).convert('RGB')
+    transform = build_transform(input_size=input_size)
+    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
+    pixel_values = [transform(image) for image in images]
+    pixel_values = torch.stack(pixel_values)
+    return pixel_values
+
+def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
+    """
+    Prepares multi-modal input components (supports multiple images per prompt).
+    
+    Args:
+        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
+        image_paths: List where each element is either None (for text-only) or a list of image paths
+        tokenizer: The tokenizer to use for applying chat templates
+        device: Device to place tensors on ('cuda', 'cpu', etc.)
+        dtype: Data type for image tensors (default: torch.bfloat16)
+    
+    Returns:
+        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
+    """
+    # Make sure image_paths list is at least as long as messages
+    if len(image_paths) < len(messages):
+        # Pad with None for text-only messages
+        image_paths = image_paths + [None] * (len(messages) - len(image_paths))
+    
+    # Process images and collect patch information
+    loaded_images = []
+    num_patches_list = []
+    for paths in image_paths:
+        if paths and isinstance(paths, list) and len(paths) > 0:
+            # Load each image in this prompt
+            prompt_images = []
+            prompt_patches = []
+            
+            for path in paths:
+                # Load the image
+                img = load_image(path).to(dtype=dtype, device=device)
+                
+                # Ensure img has correct shape [patches, C, H, W]
+                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
+                    img = img.unsqueeze(0)
+                    
+                prompt_images.append(img)
+                # Record the number of patches for this image
+                prompt_patches.append(img.shape[0])
+            
+            loaded_images.append(prompt_images)
+            num_patches_list.append(prompt_patches)
+        else:
+            # Text-only prompt
+            loaded_images.append(None)
+            num_patches_list.append([])
+    
+    # Create the concatenated pixel_values_list
+    pixel_values_list = []
+    for prompt_images in loaded_images:
+        if prompt_images:
+            # Concatenate all images for this prompt
+            pixel_values_list.append(torch.cat(prompt_images, dim=0))
+        else:
+            # Text-only prompt
+            pixel_values_list.append(None)
+    
+    # Format messages for the model
+    if all(isinstance(m, str) for m in messages):
+        # Simple string messages: convert to chat format
+        batch_messages = [
+            [{"role": "user", "content": message}] 
+            for message in messages
+        ]
+    else:
+        # Assume messages are already in the right format
+        batch_messages = messages
+    
+    # Apply chat template
+    prompts = tokenizer.apply_chat_template(
+        batch_messages,
+        tokenize=False,
+        add_generation_prompt=True
+    )
+    
+    return {
+        'prompts': prompts,
+        'pixel_values_list': pixel_values_list,
+        'num_patches_list': num_patches_list
+    }
+
+def construct_message(text, template, examples=None):
+    """
+    Construct the individual NuExtract message texts, prior to chat template formatting.
+    """
+    # add few-shot examples if needed
+    if examples is not None and len(examples) > 0:
+        icl = "# Examples:\n"
+        for row in examples:
+            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
+    else:
+        icl = ""
+        
+    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
+```
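
To see how many patches an image produces, the grid selection from `find_closest_aspect_ratio` and `dynamic_preprocess` can be replayed on plain numbers. This is a sketch that re-implements only the arithmetic (no PIL or torch); `candidate_grids`, `pick_grid`, and `patch_count` are hypothetical names, and the extra patch reflects `use_thumbnail=True` as passed by `load_image`.

```python
def candidate_grids(min_num=1, max_num=12):
    """All (columns, rows) grids whose tile count lies in [min_num, max_num]."""
    grids = {
        (i, j)
        for n in range(min_num, max_num + 1)
        for i in range(1, n + 1)
        for j in range(1, n + 1)
        if min_num <= i * j <= max_num
    }
    return sorted(grids, key=lambda g: g[0] * g[1])


def pick_grid(width, height, image_size=448, min_num=1, max_num=12):
    """Pick the grid whose aspect ratio is closest to the image's."""
    aspect = width / height
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidate_grids(min_num, max_num):
        diff = abs(aspect - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # On a tie, a larger grid wins only if the image is big enough.
            best = (cols, rows)
    return best


def patch_count(width, height):
    """Tiles plus one thumbnail, unless the image fits in a single tile."""
    cols, rows = pick_grid(width, height)
    tiles = cols * rows
    return tiles + 1 if tiles != 1 else tiles


for w, h in [(448, 448), (1920, 1080), (800, 600)]:
    cols, rows = pick_grid(w, h)
    print(f"{w}x{h}: grid {cols}x{rows}, {patch_count(w, h)} patches")
# 448x448: grid 1x1, 1 patches
# 1920x1080: grid 4x2, 9 patches
# 800x600: grid 4x3, 13 patches
```

The patch count matters because each patch is later expanded into a fixed number of image tokens in the prompt.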
+
+To handle inference:
+
+```python
+IMG_START_TOKEN='<img>'
+IMG_END_TOKEN='</img>'
+IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'
+
+def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
+    """
+    Generate responses for a batch of NuExtract inputs.
+    Support for multiple and varying numbers of images per prompt.
+    
+    Args:
+        model: The vision-language model
+        tokenizer: The tokenizer for the model
+        pixel_values_list: List of tensor batches, one per prompt
+                          Each batch has shape [num_images, channels, height, width] or None for text-only prompts
+        prompts: List of text prompts
+        generation_config: Configuration for text generation
+        num_patches_list: List of lists, each containing patch counts for images in a prompt
+        
+    Returns:
+        List of generated responses
+    """
+    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
+    model.img_context_token_id = img_context_token_id
+    
+    # Replace all image placeholders with appropriate tokens
+    modified_prompts = []
+    total_image_files = 0
+    total_patches = 0
+    image_containing_prompts = []
+    for idx, prompt in enumerate(prompts):
+        # check if this prompt has images
+        has_images = (pixel_values_list and
+                      idx < len(pixel_values_list) and 
+                      pixel_values_list[idx] is not None and 
+                      isinstance(pixel_values_list[idx], torch.Tensor) and
+                      pixel_values_list[idx].shape[0] > 0)
+        
+        if has_images:
+            # prompt with image placeholders
+            image_containing_prompts.append(idx)
+            modified_prompt = prompt
+            
+            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
+            num_images = len(patches)
+            total_image_files += num_images
+            total_patches += sum(patches)
+            
+            # replace each <image> placeholder with image tokens
+            for i, num_patches in enumerate(patches):
+                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
+                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
+        else:
+            # text-only prompt
+            modified_prompt = prompt
+        
+        modified_prompts.append(modified_prompt)
+    
+    # process all prompts in a single batch
+    tokenizer.padding_side = 'left'
+    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
+    input_ids = model_inputs['input_ids'].to(model.device)
+    attention_mask = model_inputs['attention_mask'].to(model.device)
+    
+    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
+    # copy the config so the caller's dict is not modified
+    generation_config = {**generation_config, 'eos_token_id': eos_token_id}
+    
+    # prepare pixel values
+    flattened_pixel_values = None
+    if image_containing_prompts:
+        # collect and concatenate all image tensors
+        all_pixel_values = []
+        for idx in image_containing_prompts:
+            all_pixel_values.append(pixel_values_list[idx])
+        
+        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
+        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
+    else:
+        print(f"Processing text-only batch with {len(prompts)} prompts")
+    
+    # generate outputs
+    outputs = model.generate(
+        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
+        input_ids=input_ids,
+        attention_mask=attention_mask,
+        **generation_config
+    )
+    
+    # Decode responses
+    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+    
+    return responses
+```
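
The placeholder replacement inside `nuextract_generate` can be tried in isolation. The sketch below repeats only the string manipulation; `expand_image_placeholders` is a hypothetical helper, and `num_image_token=2` is a toy value chosen to keep the output short (the real value comes from `model.num_image_token`).

```python
IMG_START_TOKEN = "<img>"
IMG_END_TOKEN = "</img>"
IMG_CONTEXT_TOKEN = "<IMG_CONTEXT>"


def expand_image_placeholders(prompt, patches, num_image_token):
    """Replace each <image> placeholder, in order, with its image tokens."""
    for num_patches in patches:
        image_tokens = (
            IMG_START_TOKEN
            + IMG_CONTEXT_TOKEN * num_image_token * num_patches
            + IMG_END_TOKEN
        )
        # Replace only the first remaining placeholder.
        prompt = prompt.replace("<image>", image_tokens, 1)
    return prompt


prompt = "## Input:\n<image>\n# Context:\n<image>"
# The first image has 1 patch, the second has 2 patches.
expanded = expand_image_placeholders(prompt, patches=[1, 2], num_image_token=2)
print(expanded.count(IMG_CONTEXT_TOKEN))  # 6
print("<image>" in expanded)              # False
```

The order of `patches` has to match the order of the placeholders, which is why image paths are given in order of appearance in the prompt.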
+
+To load the model:
+
+```python
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+model_name = ""  # set to the model's repository ID or a local path
+
+tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
+model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, 
+                                             torch_dtype=torch.bfloat16,
+                                             attn_implementation="flash_attention_2" # we recommend using flash attention
+                                            ).to("cuda")
+```
+
+Simple 0-shot text-only example:
+```python
+template = """{"names": ["verbatim-string"]}"""
+text = "John went to the restaurant with Mary. James went to the cinema."
+
+input_messages = [construct_message(text, template)]
+
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=[],
+    tokenizer=tokenizer,
+)
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"names": ["John", "Mary", "James"]}
+```
+
+Text-only input with an in-context example:
+```python
+template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
+text = "John went to the restaurant with Mary. James went to the cinema."
+examples = [
+    {
+        "input": "Stephen is the manager at Susan's store.",
+        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
+    }
+]
+
+input_messages = [construct_message(text, template, examples)]
+
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=[],
+    tokenizer=tokenizer,
+)
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
+```
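
To see what the model actually receives, it helps to print the text that `construct_message` builds before the chat template is applied. The function body below is copied from the helper defined earlier so that the block runs on its own; the shortened template and input text are illustrative.

```python
def construct_message(text, template, examples=None):
    """Copy of the helper above: build the NuExtract message text."""
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""
    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""


message = construct_message(
    text="John went to the restaurant with Mary.",
    template='{"names": ["verbatim-string"]}',
    examples=[
        {
            "input": "Stephen is the manager at Susan's store.",
            "output": '{"names": ["STEPHEN", "SUSAN"]}',
        }
    ],
)
print(message)
# # Template:
# {"names": ["verbatim-string"]}
# # Examples:
# ## Input:
# Stephen is the manager at Susan's store.
# ## Output:
# {"names": ["STEPHEN", "SUSAN"]}
# # Context:
# John went to the restaurant with Mary.
```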
+
+Example with image input and an in-context example. Each image in the prompt is represented by an `<image>` placeholder, and the image paths are passed as a list in the order the placeholders appear (in this example, `0.jpg` belongs to the in-context example and `1.jpg` to the actual input).
+```python
+template = """{"store": "verbatim-string"}"""
+text = "<image>"
+examples = [
+    {
+        "input": "<image>",
+        "output": """{"store": "Walmart"}"""
+    }
+]
+
+input_messages = [construct_message(text, template, examples)]
+
+images = [
+    ["0.jpg", "1.jpg"]
+]
+
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=images,
+    tokenizer=tokenizer,
+)
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"store": "Trader Joe's"}
+```
+
+Multi-modal batched input:
+```python
+inputs = [
+    # image input with no ICL examples
+    {
+        "text": "<image>",
+        "template": """{"store_name": "verbatim-string"}""",
+        "examples": None,
+    },
+    # image input with 1 ICL example
+    {
+        "text": "<image>",
+        "template": """{"store_name": "verbatim-string"}""",
+        "examples": [
+            {
+                "input": "<image>",
+                "output": """{"store_name": "Walmart"}""",
+            }
+        ],
+    },
+    # text input with no ICL examples
+    {
+        "text": "John went to the restaurant with Mary. James went to the cinema.",
+        "template": """{"names": ["verbatim-string"]}""",
+        "examples": None,
+    },
+    # text input with ICL example
+    {
+        "text": "John went to the restaurant with Mary. James went to the cinema.",
+        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
+        "examples": [
+            {
+                "input": "Stephen is the manager at Susan's store.",
+                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
+            }
+        ],
+    },
+]
+
+input_messages = [
+    construct_message(
+        x["text"], 
+        x["template"], 
+        x["examples"]
+    ) for x in inputs
+]
+
+images = [
+    ["0.jpg"],
+    ["0.jpg", "1.jpg"],
+    None,
+    None
+]
+
+input_content = prepare_inputs(
+    messages=input_messages,
+    image_paths=images,
+    tokenizer=tokenizer,
+)
+
+generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}
+
+with torch.no_grad():
+    result = nuextract_generate(
+        model=model,
+        tokenizer=tokenizer,
+        prompts=input_content['prompts'],
+        pixel_values_list=input_content['pixel_values_list'],
+        num_patches_list=input_content['num_patches_list'],
+        generation_config=generation_config
+    )
+for y in result:
+    print(y)
+# {"store_name": "WAL*MART"}
+# {"store_name": "Trader Joe's"}
+# {"names": ["John", "Mary", "James"]}
+# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
+```
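
In a mixed batch it is easy to pass the wrong number of image paths for a prompt. The following sketch is a sanity check that could be run before `prepare_inputs`; `check_image_alignment` is a hypothetical helper, and it assumes the messages are plain strings as produced by `construct_message`.

```python
def check_image_alignment(messages, image_paths):
    """Return indices of prompts whose <image> count differs from their number of paths."""
    mismatched = []
    for idx, message in enumerate(messages):
        paths = image_paths[idx] if idx < len(image_paths) else None
        expected = len(paths) if paths else 0
        if message.count("<image>") != expected:
            mismatched.append(idx)
    return mismatched


# Illustrative messages: one image, one image plus an image example, text only.
messages = [
    "# Template:\n{}\n# Context:\n<image>",
    "# Template:\n{}\n# Examples:\n## Input:\n<image>\n## Output:\n{}\n# Context:\n<image>",
    "# Template:\n{}\n# Context:\nJohn went to the cinema.",
]
print(check_image_alignment(messages, [["0.jpg"], ["0.jpg", "1.jpg"], None]))  # []
print(check_image_alignment(messages, [["0.jpg"], ["0.jpg"], None]))           # [1]
```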
+
+## Template Generation
+If you want to convert existing schema files from other formats (e.g. XML, YAML) or start from an example, NuExtract 2 models can generate the template for you.
+
+E.g. convert XML into a NuExtract template:
+```python
+def generate_template(description):
+    input_messages = [description]
+    input_content = prepare_inputs(
+        messages=input_messages,
+        image_paths=[],
+        tokenizer=tokenizer,
+    )
+    generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256}
+    with torch.no_grad():
+        result = nuextract_generate(
+            model=model,
+            tokenizer=tokenizer,
+            prompts=input_content['prompts'],
+            pixel_values_list=input_content['pixel_values_list'],
+            num_patches_list=input_content['num_patches_list'],
+            generation_config=generation_config
+        )
+    return result[0]
+
+
+xml_template = """<SportResult>
+    <Date></Date>
+    <Sport></Sport>
+    <Venue></Venue>
+    <HomeTeam></HomeTeam>
+    <AwayTeam></AwayTeam>
+    <HomeScore></HomeScore>
+    <AwayScore></AwayScore>
+    <TopScorer></TopScorer>
+</SportResult>"""
+result = generate_template(xml_template)
+    
+print(result)
+# { 
+#     "SportResult": {
+#         "Date": "date-time",
+#         "Sport": "verbatim-string",
+#         "Venue": "verbatim-string",
+#         "HomeTeam": "verbatim-string",
+#         "AwayTeam": "verbatim-string",
+#         "HomeScore": "integer",
+#         "AwayScore": "integer",
+#         "TopScorer": "verbatim-string"
+#     }
+# }
+```
+
+E.g. generate a template from a natural language description:
+```python
+text = """Give me relevant info about startup companies mentioned."""
+result = generate_template(text)
+    
+print(result)
+# {
+#     "Startup_Companies": [
+#         {
+#             "Name": "verbatim-string",
+#             "Products": [
+#                 "string"
+#             ],
+#             "Location": "verbatim-string",
+#             "Company_Type": [
+#                 "Technology",
+#                 "Finance",
+#                 "Health",
+#                 "Education",
+#                 "Other"
+#             ]
+#         }
+#     ]
+# }
+```