Initialize project; model provided by the ModelHub XC community
Model: numind/NuExtract-2-2B-experimental Source: Original Platform
---
license: mit
language:
- multilingual
tags:
- nlp
base_model: OpenGVLab/InternVL2_5-2B
pipeline_tag: text-generation
inference: true
---

# NuExtract-2-2B [experimental version] by NuMind 🔥

NuExtract 2.0 experimental is a family of models trained specifically for structured information extraction tasks. It accepts multimodal inputs (text and images) and is multilingual.

NB: This is an experimental version that will be superseded by NuExtract 2.0.

We provide models in several sizes, all based on the InternVL2.5 family.

| Model Size | Model Name | Base Model | Huggingface Link |
|------------|------------|------------|------------------|
| 2B | NuExtract-2.0-2B | [InternVL2_5-2B](https://huggingface.co/OpenGVLab/InternVL2_5-2B) | [NuExtract-2-2B](https://huggingface.co/numind/NuExtract-2-2B) |
| 4B | NuExtract-2.0-4B | [InternVL2_5-4B](https://huggingface.co/OpenGVLab/InternVL2_5-4B) | [NuExtract-2-4B](https://huggingface.co/numind/NuExtract-2-4B) |
| 8B | NuExtract-2.0-8B | [InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) | [NuExtract-2-8B](https://huggingface.co/numind/NuExtract-2-8B) |

## Overview

To use the model, provide an input text or image and a JSON template describing the information you need to extract. The template should be a JSON object that specifies field names and their expected types.

Supported types include:
* `verbatim-string` - instructs the model to extract text that is present verbatim in the input.
* `string` - a generic string field that can incorporate paraphrasing/abstraction.
* `integer` - a whole number.
* `number` - a whole or decimal number.
* `date-time` - an ISO-formatted date.
* Array of any of the above types (e.g. `["string"]`)
* `enum` - a choice from a set of possible answers (represented in the template as an array of options, e.g. `["yes", "no", "maybe"]`).
* `multi-label` - an enum that can have multiple possible answers (represented in the template as a double-wrapped array, e.g. `[["A", "B", "C"]]`).

If the model does not identify relevant information for a field, it will return `null`, or `[]` for arrays and multi-labels.

The following is an example template:
```json
{
    "first_name": "verbatim-string",
    "last_name": "verbatim-string",
    "description": "string",
    "age": "integer",
    "gpa": "number",
    "birth_date": "date-time",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}
```
An example output:
```json
{
    "first_name": "Susan",
    "last_name": "Smith",
    "description": "A student studying computer science.",
    "age": 20,
    "gpa": 3.7,
    "birth_date": "2005-03-01",
    "nationality": "England",
    "languages_spoken": ["English", "French"]
}
```
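
As a rough illustration of how these types map onto JSON values, the sketch below checks an extracted output against its template. The helper `conforms` and its rules are assumptions made for this example, inferred from the type list above; they are not part of NuExtract or of any library.

```python
import json

SCALAR_TYPES = {"verbatim-string", "string", "integer", "number", "date-time"}


def conforms(template, value):
    """Return True if `value` is a plausible output for `template` (hypothetical helper)."""
    if isinstance(template, dict):
        # Nested object: every template field must be present and conform.
        return (isinstance(value, dict)
                and all(k in value and conforms(t, value[k]) for k, t in template.items()))
    if isinstance(template, list):
        if len(template) == 1 and isinstance(template[0], list):
            # Multi-label: a (possibly empty) list drawn from the allowed options.
            return isinstance(value, list) and all(v in template[0] for v in value)
        if len(template) == 1 and (isinstance(template[0], dict) or template[0] in SCALAR_TYPES):
            # Array of objects or array of a single scalar type.
            return isinstance(value, list) and all(conforms(template[0], v) for v in value)
        # Enum: one of the listed options, or null when nothing was found.
        return value is None or value in template
    if value is None:
        return True  # scalar fields may be null
    if template == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if template == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return isinstance(value, str)  # string, verbatim-string, date-time


template = json.loads("""{
    "first_name": "verbatim-string",
    "age": "integer",
    "gpa": "number",
    "nationality": ["France", "England", "Japan", "USA", "China"],
    "languages_spoken": [["English", "French", "Japanese", "Mandarin", "Spanish"]]
}""")
output = json.loads("""{
    "first_name": "Susan",
    "age": 20,
    "gpa": 3.7,
    "nationality": "England",
    "languages_spoken": ["English", "French"]
}""")
print(conforms(template, output))  # True
```

Note that a one-element array such as `["string"]` is ambiguous between "array of strings" and "enum with one option"; this sketch treats it as an array whenever the element is a known type name.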

⚠️ We recommend using NuExtract with a temperature at or very close to 0. Some inference frameworks, such as Ollama, use a default of 0.7, which is not well suited to many extraction tasks.

## Inference

Use the following code to handle loading and preprocessing of input data:

```python
import torch
import torchvision.transforms as T
from PIL import Image
from torchvision.transforms.functional import InterpolationMode

IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def build_transform(input_size):
    MEAN, STD = IMAGENET_MEAN, IMAGENET_STD
    transform = T.Compose([
        T.Lambda(lambda img: img.convert('RGB') if img.mode != 'RGB' else img),
        T.Resize((input_size, input_size), interpolation=InterpolationMode.BICUBIC),
        T.ToTensor(),
        T.Normalize(mean=MEAN, std=STD)
    ])
    return transform

def find_closest_aspect_ratio(aspect_ratio, target_ratios, width, height, image_size):
    best_ratio_diff = float('inf')
    best_ratio = (1, 1)
    area = width * height
    for ratio in target_ratios:
        target_aspect_ratio = ratio[0] / ratio[1]
        ratio_diff = abs(aspect_ratio - target_aspect_ratio)
        if ratio_diff < best_ratio_diff:
            best_ratio_diff = ratio_diff
            best_ratio = ratio
        elif ratio_diff == best_ratio_diff:
            if area > 0.5 * image_size * image_size * ratio[0] * ratio[1]:
                best_ratio = ratio
    return best_ratio

def dynamic_preprocess(image, min_num=1, max_num=12, image_size=448, use_thumbnail=False):
    orig_width, orig_height = image.size
    aspect_ratio = orig_width / orig_height

    # calculate the existing image aspect ratio
    target_ratios = set(
        (i, j) for n in range(min_num, max_num + 1) for i in range(1, n + 1) for j in range(1, n + 1) if
        i * j <= max_num and i * j >= min_num)
    target_ratios = sorted(target_ratios, key=lambda x: x[0] * x[1])

    # find the closest aspect ratio to the target
    target_aspect_ratio = find_closest_aspect_ratio(
        aspect_ratio, target_ratios, orig_width, orig_height, image_size)

    # calculate the target width and height
    target_width = image_size * target_aspect_ratio[0]
    target_height = image_size * target_aspect_ratio[1]
    blocks = target_aspect_ratio[0] * target_aspect_ratio[1]

    # resize the image
    resized_img = image.resize((target_width, target_height))
    processed_images = []
    for i in range(blocks):
        box = (
            (i % (target_width // image_size)) * image_size,
            (i // (target_width // image_size)) * image_size,
            ((i % (target_width // image_size)) + 1) * image_size,
            ((i // (target_width // image_size)) + 1) * image_size
        )
        # split the image
        split_img = resized_img.crop(box)
        processed_images.append(split_img)
    assert len(processed_images) == blocks
    if use_thumbnail and len(processed_images) != 1:
        thumbnail_img = image.resize((image_size, image_size))
        processed_images.append(thumbnail_img)
    return processed_images

def load_image(image_file, input_size=448, max_num=12):
    image = Image.open(image_file).convert('RGB')
    transform = build_transform(input_size=input_size)
    images = dynamic_preprocess(image, image_size=input_size, use_thumbnail=True, max_num=max_num)
    pixel_values = [transform(image) for image in images]
    pixel_values = torch.stack(pixel_values)
    return pixel_values

def prepare_inputs(messages, image_paths, tokenizer, device='cuda', dtype=torch.bfloat16):
    """
    Prepares multi-modal input components (supports multiple images per prompt).

    Args:
        messages: List of input messages/prompts (strings or dicts with 'role' and 'content')
        image_paths: List where each element is either None (for text-only) or a list of image paths
        tokenizer: The tokenizer to use for applying chat templates
        device: Device to place tensors on ('cuda', 'cpu', etc.)
        dtype: Data type for image tensors (default: torch.bfloat16)

    Returns:
        dict: Contains 'prompts', 'pixel_values_list', and 'num_patches_list' ready for the model
    """
    # Make sure image_paths list is at least as long as messages
    if len(image_paths) < len(messages):
        # Pad with None for text-only messages
        image_paths = image_paths + [None] * (len(messages) - len(image_paths))

    # Process images and collect patch information
    loaded_images = []
    num_patches_list = []
    for paths in image_paths:
        if paths and isinstance(paths, list) and len(paths) > 0:
            # Load each image in this prompt
            prompt_images = []
            prompt_patches = []

            for path in paths:
                # Load the image
                img = load_image(path).to(dtype=dtype, device=device)

                # Ensure img has correct shape [patches, C, H, W]
                if len(img.shape) == 3:  # [C, H, W] -> [1, C, H, W]
                    img = img.unsqueeze(0)

                prompt_images.append(img)
                # Record the number of patches for this image
                prompt_patches.append(img.shape[0])

            loaded_images.append(prompt_images)
            num_patches_list.append(prompt_patches)
        else:
            # Text-only prompt
            loaded_images.append(None)
            num_patches_list.append([])

    # Create the concatenated pixel_values_list
    pixel_values_list = []
    for prompt_images in loaded_images:
        if prompt_images:
            # Concatenate all images for this prompt
            pixel_values_list.append(torch.cat(prompt_images, dim=0))
        else:
            # Text-only prompt
            pixel_values_list.append(None)

    # Format messages for the model
    if all(isinstance(m, str) for m in messages):
        # Simple string messages: convert to chat format
        batch_messages = [
            [{"role": "user", "content": message}]
            for message in messages
        ]
    else:
        # Assume messages are already in the right format
        batch_messages = messages

    # Apply chat template
    prompts = tokenizer.apply_chat_template(
        batch_messages,
        tokenize=False,
        add_generation_prompt=True
    )

    return {
        'prompts': prompts,
        'pixel_values_list': pixel_values_list,
        'num_patches_list': num_patches_list
    }

def construct_message(text, template, examples=None):
    """
    Construct the individual NuExtract message texts, prior to chat template formatting.
    """
    # add few-shot examples if needed
    if examples is not None and len(examples) > 0:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""

    return f"""# Template:\n{template}\n{icl}# Context:\n{text}"""
```
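
To make the tiling logic in `dynamic_preprocess` easier to follow, here is a dependency-free sketch that reproduces only the grid selection, without touching any image data. The names `pick_grid` and `patch_count` are introduced for this illustration; the selection rule mirrors `find_closest_aspect_ratio`, and the thumbnail rule mirrors `load_image` with `use_thumbnail=True`.

```python
def pick_grid(width, height, min_num=1, max_num=12, image_size=448):
    """Return the (columns, rows) tile grid chosen for an image of the given size."""
    aspect_ratio = width / height
    # Every (columns, rows) grid whose tile count lies in [min_num, max_num],
    # ordered by tile count.
    candidates = sorted(
        {(i, j)
         for i in range(1, max_num + 1)
         for j in range(1, max_num + 1)
         if min_num <= i * j <= max_num},
        key=lambda grid: grid[0] * grid[1],
    )
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
        elif diff == best_diff and width * height > 0.5 * image_size * image_size * cols * rows:
            # Tie on distance: switch to this grid only if the image area
            # exceeds half of the grid's pixel area.
            best = (cols, rows)
    return best


def patch_count(width, height, max_num=12, image_size=448):
    """Number of patches produced for one image: tiles plus a thumbnail if tiled."""
    cols, rows = pick_grid(width, height, max_num=max_num, image_size=image_size)
    tiles = cols * rows
    return tiles + 1 if tiles != 1 else tiles


for size in [(448, 448), (896, 896), (1792, 448)]:
    print(size, pick_grid(*size), patch_count(*size))
# (448, 448) (1, 1) 1
# (896, 896) (2, 2) 5
# (1792, 448) (4, 1) 5
```

The patch count matters downstream because every patch is expanded into a run of image context tokens in the prompt, so large or elongated images lengthen the input considerably.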

To handle inference:

```python
IMG_START_TOKEN='<img>'
IMG_END_TOKEN='</img>'
IMG_CONTEXT_TOKEN='<IMG_CONTEXT>'

def nuextract_generate(model, tokenizer, prompts, generation_config, pixel_values_list=None, num_patches_list=None):
    """
    Generate responses for a batch of NuExtract inputs.
    Support for multiple and varying numbers of images per prompt.

    Args:
        model: The vision-language model
        tokenizer: The tokenizer for the model
        pixel_values_list: List of tensor batches, one per prompt
                           Each batch has shape [num_images, channels, height, width] or None for text-only prompts
        prompts: List of text prompts
        generation_config: Configuration for text generation
        num_patches_list: List of lists, each containing patch counts for images in a prompt

    Returns:
        List of generated responses
    """
    img_context_token_id = tokenizer.convert_tokens_to_ids(IMG_CONTEXT_TOKEN)
    model.img_context_token_id = img_context_token_id

    # Replace all image placeholders with appropriate tokens
    modified_prompts = []
    total_image_files = 0
    total_patches = 0
    image_containing_prompts = []
    for idx, prompt in enumerate(prompts):
        # check if this prompt has images
        has_images = (pixel_values_list and
                      idx < len(pixel_values_list) and
                      pixel_values_list[idx] is not None and
                      isinstance(pixel_values_list[idx], torch.Tensor) and
                      pixel_values_list[idx].shape[0] > 0)

        if has_images:
            # prompt with image placeholders
            image_containing_prompts.append(idx)
            modified_prompt = prompt

            patches = num_patches_list[idx] if (num_patches_list and idx < len(num_patches_list)) else []
            num_images = len(patches)
            total_image_files += num_images
            total_patches += sum(patches)

            # replace each <image> placeholder with image tokens
            for i, num_patches in enumerate(patches):
                image_tokens = IMG_START_TOKEN + IMG_CONTEXT_TOKEN * model.num_image_token * num_patches + IMG_END_TOKEN
                modified_prompt = modified_prompt.replace('<image>', image_tokens, 1)
        else:
            # text-only prompt
            modified_prompt = prompt

        modified_prompts.append(modified_prompt)

    # process all prompts in a single batch
    tokenizer.padding_side = 'left'
    model_inputs = tokenizer(modified_prompts, return_tensors='pt', padding=True)
    input_ids = model_inputs['input_ids'].to(model.device)
    attention_mask = model_inputs['attention_mask'].to(model.device)

    eos_token_id = tokenizer.convert_tokens_to_ids("<|im_end|>\n".strip())
    generation_config['eos_token_id'] = eos_token_id

    # prepare pixel values
    flattened_pixel_values = None
    if image_containing_prompts:
        # collect and concatenate all image tensors
        all_pixel_values = []
        for idx in image_containing_prompts:
            all_pixel_values.append(pixel_values_list[idx])

        flattened_pixel_values = torch.cat(all_pixel_values, dim=0)
        print(f"Processing batch with {len(prompts)} prompts, {total_image_files} actual images, and {total_patches} total patches")
    else:
        print(f"Processing text-only batch with {len(prompts)} prompts")

    # generate outputs
    outputs = model.generate(
        pixel_values=flattened_pixel_values,  # will be None for text-only prompts
        input_ids=input_ids,
        attention_mask=attention_mask,
        **generation_config
    )

    # Decode responses
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True)

    return responses
```
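
The placeholder substitution inside `nuextract_generate` can be examined in isolation. The sketch below expands each `<image>` placeholder in the same way using plain strings. `expand_image_placeholders` is a name introduced here, and the default of 256 tokens per patch is an assumed stand-in for `model.num_image_token`, which should be read from the loaded model.

```python
IMG_START_TOKEN = '<img>'
IMG_END_TOKEN = '</img>'
IMG_CONTEXT_TOKEN = '<IMG_CONTEXT>'


def expand_image_placeholders(prompt, patches, num_image_token=256):
    """Replace successive '<image>' placeholders with runs of image context tokens.

    patches: patch counts, one per image, in order of appearance in the prompt.
    num_image_token: context tokens per patch (assumed value, for illustration only).
    """
    for num_patches in patches:
        image_tokens = (IMG_START_TOKEN
                        + IMG_CONTEXT_TOKEN * (num_image_token * num_patches)
                        + IMG_END_TOKEN)
        # count=1 so that each image fills only the next unreplaced placeholder
        prompt = prompt.replace('<image>', image_tokens, 1)
    return prompt


# Two images: one with a single patch, one with five patches.
# A tiny num_image_token keeps the expanded prompt readable.
expanded = expand_image_placeholders(
    "## Input:\n<image>\n# Context:\n<image>", patches=[1, 5], num_image_token=2
)
print(expanded.count(IMG_CONTEXT_TOKEN))  # 2 * 1 + 2 * 5 = 12
```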

To load the model:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = ""  # set this to the model repository ID or a local path

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side='left')
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True,
                                             torch_dtype=torch.bfloat16,
                                             attn_implementation="flash_attention_2"  # we recommend using flash attention
                                            ).to("cuda")
```

Simple 0-shot text-only example:
```python
template = """{"names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."

input_messages = [construct_message(text, template)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["John", "Mary", "James"]}
```
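
Each response is returned as a string, so downstream code will usually want to parse it. The following is a minimal sketch of defensive parsing. `parse_extraction` is a hypothetical helper, and it assumes the response consists of a single JSON object, as in the expected output shown above.

```python
import json


def parse_extraction(raw):
    """Parse a model response into a dict, or return None if it is not a JSON object."""
    try:
        parsed = json.loads(raw.strip())
    except json.JSONDecodeError:
        # Truncated or otherwise malformed output (e.g. max_new_tokens was reached).
        return None
    return parsed if isinstance(parsed, dict) else None


print(parse_extraction('{"names": ["John", "Mary", "James"]}'))
# {'names': ['John', 'Mary', 'James']}
print(parse_extraction('{"names": ["John", '))
# None
```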

Text-only input with an in-context example:
```python
template = """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}"""
text = "John went to the restaurant with Mary. James went to the cinema."
examples = [
    {
        "input": "Stephen is the manager at Susan's store.",
        "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
    }
]

input_messages = [construct_message(text, template, examples)]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=[],
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```
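
For reference, the prompt text that `construct_message` builds for a call like the one above can be inspected without loading the model. The snippet repeats the function, with the same logic in slightly condensed form, so that it runs on its own.

```python
def construct_message(text, template, examples=None):
    """Same logic as the construct_message helper defined earlier in this README."""
    if examples:
        icl = "# Examples:\n"
        for row in examples:
            icl += f"## Input:\n{row['input']}\n## Output:\n{row['output']}\n"
    else:
        icl = ""
    return f"# Template:\n{template}\n{icl}# Context:\n{text}"


message = construct_message(
    text="John went to the restaurant with Mary. James went to the cinema.",
    template='{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}',
    examples=[{
        "input": "Stephen is the manager at Susan's store.",
        "output": '{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}',
    }],
)
print(message)
# # Template:
# {"names": ["verbatim-string"], "female_names": ["verbatim-string"]}
# # Examples:
# ## Input:
# Stephen is the manager at Susan's store.
# ## Output:
# {"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}
# # Context:
# John went to the restaurant with Mary. James went to the cinema.
```

The in-context example sits between the template and the actual input, which is why example outputs should follow the same template as the query.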

Example with image input and an in-context example. For image inputs, use the `<image>` placeholder in place of text, and provide the image paths as a list in the order in which the placeholders appear in the prompt (in this example, `0.jpg` belongs to the in-context example and `1.jpg` is the true input).
```python
template = """{"store": "verbatim-string"}"""
text = "<image>"
examples = [
    {
        "input": "<image>",
        "output": """{"store": "Walmart"}"""
    }
]

input_messages = [construct_message(text, template, examples)]

images = [
    ["0.jpg", "1.jpg"]
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store": "Trader Joe's"}
```
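
Because images are matched to `<image>` placeholders purely by position, a mismatch between the two is easy to introduce. A small pre-flight check along the following lines may help. `check_image_alignment` is a hypothetical helper, not part of the code above; it follows the same `image_paths` convention as `prepare_inputs`.

```python
def check_image_alignment(messages, image_paths):
    """Raise ValueError if a message's '<image>' count differs from its number of image paths.

    messages: list of prompt strings, as produced by construct_message.
    image_paths: list where each element is None or a list of paths.
    """
    for idx, message in enumerate(messages):
        # Missing trailing entries are treated as text-only, like prepare_inputs does.
        paths = image_paths[idx] if idx < len(image_paths) else None
        expected = message.count("<image>")
        provided = len(paths) if paths else 0
        if expected != provided:
            raise ValueError(
                f"message {idx}: {expected} '<image>' placeholder(s) but {provided} image path(s)"
            )


# One in-context example image plus the true input image: two placeholders, two paths.
check_image_alignment(
    ['# Template:\n{"store": "verbatim-string"}\n# Examples:\n## Input:\n<image>\n'
     '## Output:\n{"store": "Walmart"}\n# Context:\n<image>'],
    [["0.jpg", "1.jpg"]],
)
```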

Multi-modal batched input:
```python
inputs = [
    # image input with no ICL examples
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": None,
    },
    # image input with 1 ICL example
    {
        "text": "<image>",
        "template": """{"store_name": "verbatim-string"}""",
        "examples": [
            {
                "input": "<image>",
                "output": """{"store_name": "Walmart"}""",
            }
        ],
    },
    # text input with no ICL examples
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"]}""",
        "examples": None,
    },
    # text input with ICL example
    {
        "text": "John went to the restaurant with Mary. James went to the cinema.",
        "template": """{"names": ["verbatim-string"], "female_names": ["verbatim-string"]}""",
        "examples": [
            {
                "input": "Stephen is the manager at Susan's store.",
                "output": """{"names": ["STEPHEN", "SUSAN"], "female_names": ["SUSAN"]}"""
            }
        ],
    },
]

input_messages = [
    construct_message(
        x["text"],
        x["template"],
        x["examples"]
    ) for x in inputs
]

images = [
    ["0.jpg"],
    ["0.jpg", "1.jpg"],
    None,
    None
]

input_content = prepare_inputs(
    messages=input_messages,
    image_paths=images,
    tokenizer=tokenizer,
)

generation_config = {"do_sample": False, "num_beams": 1, "max_new_tokens": 2048}

with torch.no_grad():
    result = nuextract_generate(
        model=model,
        tokenizer=tokenizer,
        prompts=input_content['prompts'],
        pixel_values_list=input_content['pixel_values_list'],
        num_patches_list=input_content['num_patches_list'],
        generation_config=generation_config
    )
for y in result:
    print(y)
# {"store_name": "WAL*MART"}
# {"store_name": "Trader Joe's"}
# {"names": ["John", "Mary", "James"]}
# {"names": ["JOHN", "MARY", "JAMES"], "female_names": ["MARY"]}
```

## Template Generation

If you want to convert existing schema files from other formats (e.g. XML, YAML) or start from an example, NuExtract 2 models can generate a template for you automatically.

E.g. convert XML into a NuExtract template:
```python
def generate_template(description):
    input_messages = [description]
    input_content = prepare_inputs(
        messages=input_messages,
        image_paths=[],
        tokenizer=tokenizer,
    )
    generation_config = {"do_sample": True, "temperature": 0.4, "max_new_tokens": 256}
    with torch.no_grad():
        result = nuextract_generate(
            model=model,
            tokenizer=tokenizer,
            prompts=input_content['prompts'],
            pixel_values_list=input_content['pixel_values_list'],
            num_patches_list=input_content['num_patches_list'],
            generation_config=generation_config
        )
    return result[0]

xml_template = """<SportResult>
    <Date></Date>
    <Sport></Sport>
    <Venue></Venue>
    <HomeTeam></HomeTeam>
    <AwayTeam></AwayTeam>
    <HomeScore></HomeScore>
    <AwayScore></AwayScore>
    <TopScorer></TopScorer>
</SportResult>"""
result = generate_template(xml_template)

print(result)
# {
#     "SportResult": {
#         "Date": "date-time",
#         "Sport": "verbatim-string",
#         "Venue": "verbatim-string",
#         "HomeTeam": "verbatim-string",
#         "AwayTeam": "verbatim-string",
#         "HomeScore": "integer",
#         "AwayScore": "integer",
#         "TopScorer": "verbatim-string"
#     }
# }
```

E.g. generate a template from a natural language description:
```python
text = """Give me relevant info about startup companies mentioned."""
result = generate_template(text)

print(result)
# {
#     "Startup_Companies": [
#         {
#             "Name": "verbatim-string",
#             "Products": [
#                 "string"
#             ],
#             "Location": "verbatim-string",
#             "Company_Type": [
#                 "Technology",
#                 "Finance",
#                 "Health",
#                 "Education",
#                 "Other"
#             ]
#         }
#     ]
# }
```
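
Since template generation here uses sampling, the returned text is not guaranteed to be a well-formed template. The sketch below is one way to sanity-check it before use. `is_valid_template` and its rules are assumptions for illustration, derived from the type list in the Overview; they are not an official validator.

```python
import json

TYPE_NAMES = {"verbatim-string", "string", "integer", "number", "date-time"}


def is_valid_template(node):
    """Return True if `node` uses only the template constructs described in the Overview."""
    if isinstance(node, dict):
        return all(is_valid_template(v) for v in node.values())
    if isinstance(node, list):
        if not node:
            return False
        if len(node) == 1 and isinstance(node[0], dict):
            return is_valid_template(node[0])  # array of objects
        if len(node) == 1 and isinstance(node[0], list):
            # multi-label: a non-empty list of string options
            return bool(node[0]) and all(isinstance(o, str) for o in node[0])
        # array of a type, or enum options (options are arbitrary strings)
        return all(isinstance(o, str) for o in node)
    return node in TYPE_NAMES  # scalar leaf must be a known type name


raw = ('{"Startup_Companies": [{"Name": "verbatim-string", "Products": ["string"], '
       '"Company_Type": ["Technology", "Finance", "Other"]}]}')
print(is_valid_template(json.loads(raw)))       # True
print(is_valid_template({"Date": "datetime"}))  # False: unknown type name
```

Because enum options can be any strings, a misspelled type inside an array (e.g. `["strng"]`) is indistinguishable from a one-option enum and will pass this check.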