prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX

ModelHub XC 0a464b6fdd Initialize project; model provided by the ModelHub XC community

Model: prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX
Source: Original Platform

2026-06-04 00:30:12 +08:00

3.9 KiB


Qwen2.5-VL-3B-Instruct-Unredacted-MAX

Qwen2.5-VL-3B-Instruct-Unredacted-MAX is an optimized release built on top of huihui-ai/Qwen2.5-VL-3B-Instruct-abliterated. This version focuses on improved model packaging, updated compatibility with modern Transformers pipelines, and stable multimodal inference behavior, while preserving the core vision-language reasoning capabilities of the original architecture. The result is a compact 3B vision-language model designed for efficient deployment, research experimentation, and multimodal application development.

Key Highlights

Optimized Release Packaging: Streamlined repository structure for smoother loading, inference, and deployment workflows.
Modern Transformers Compatibility: Updated to ensure stable integration with recent Hugging Face Transformers versions.
3B Vision-Language Architecture: Built on Qwen2.5-VL-3B-Instruct, balancing multimodal capability with lightweight deployment requirements.
Stable Multimodal Inference: Designed for consistent performance across image-text reasoning tasks.
Efficient Caption Generation: Produces structured, descriptive outputs suitable for annotation and dataset building.
Dynamic Resolution Support: Retains native handling of varying image resolutions and aspect ratios.
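
To make the dynamic-resolution point concrete, the sketch below estimates how many visual tokens one image would occupy after resizing. It assumes the 28-pixel grid (14-pixel patches merged 2x2) and the rounding scheme used by the upstream Qwen2.5-VL preprocessing. The function name `estimate_visual_tokens` and the `min_pixels`/`max_pixels` defaults are illustrative assumptions, not values read from this repository.

```python
import math

FACTOR = 28  # assumed: 14-pixel patches merged 2x2 into one visual token


def estimate_visual_tokens(height, width,
                           min_pixels=256 * FACTOR * FACTOR,
                           max_pixels=1280 * FACTOR * FACTOR):
    """Return (resized_height, resized_width, token_count) for one image."""
    # Snap each side to the nearest multiple of FACTOR (at least one cell).
    h = max(FACTOR, round(height / FACTOR) * FACTOR)
    w = max(FACTOR, round(width / FACTOR) * FACTOR)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(FACTOR, math.floor(height / scale / FACTOR) * FACTOR)
        w = max(FACTOR, math.floor(width / scale / FACTOR) * FACTOR)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / FACTOR) * FACTOR
        w = math.ceil(width * scale / FACTOR) * FACTOR
    return h, w, (h // FACTOR) * (w // FACTOR)


if __name__ == "__main__":
    print(estimate_visual_tokens(720, 1280))  # (728, 1288, 1196)
```

Under these assumptions, lowering `max_pixels` trades image detail for fewer visual tokens, which is one way to fit larger images into a constrained memory budget.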

Base Model Signatures:

This model has been re-sharded and optimized for the latest Transformers version from the base model: https://huggingface.co/huihui-ai/Qwen2.5-VL-3B-Instruct-abliterated

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen2.5-VL-3B-Instruct-Unredacted-MAX"
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption for this image."},
        ],
    }
]

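# Render the chat messages into a prompt string; tokenization happens in the processor call below.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Load and preprocess the images and videos referenced in the messages.
image_inputs, video_inputs = process_vision_info(messages)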

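# Move the inputs to the device the model was placed on by device_map="auto",
# instead of hard-coding "cuda".
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)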

# Generate the caption tokens.
generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens from each sequence, then decode only the new tokens.
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(output_text)
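
The list comprehension in the decode step slices off the prompt, because `generate` returns each prompt followed by its newly generated tokens. A minimal sketch of that trimming with plain Python lists follows; the token IDs and the helper name `trim_prompt_tokens` are made up for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token IDs: two prompts and their full generated sequences.
prompts = [[101, 7, 8], [101, 9, 4]]
full_outputs = [[101, 7, 8, 42, 43], [101, 9, 4, 55, 56]]

new_tokens = trim_prompt_tokens(prompts, full_outputs)
print(new_tokens)  # [[42, 43], [55, 56]]
```

Without this step, the decoded string would repeat the chat template and user prompt before the caption.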

Intended Use

Multimodal AI research and evaluation
Image captioning and dataset generation pipelines
Vision-language prototyping and experimentation
Lightweight deployment in constrained environments
Development of multimodal applications and tools
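
For a captioning or dataset-generation pipeline, one simple approach is to build one chat message list per image, in the same structure as the Quick Start example, and run each through the processor and model. The helper below is a hypothetical sketch: `build_caption_messages` and the example file paths are illustrative names, not part of this repository or of Transformers.

```python
def build_caption_messages(image_ref, prompt="Provide a detailed caption for this image."):
    """Return a single-turn chat message list for one image (URL or local path)."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image_ref},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Hypothetical image locations; replace with real paths or URLs.
image_refs = ["images/cat.jpg", "images/street.jpg"]
batch = [build_caption_messages(ref) for ref in image_refs]
print(len(batch), batch[0][0]["content"][0]["image"])
```

Each element of `batch` can then be passed to `processor.apply_chat_template` and `process_vision_info` exactly as `messages` is in the Quick Start.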

Limitations & Risks

Important Note: This model inherits behavior and constraints from its base architecture.

Performance depends on image quality, resolution, and prompt design
May produce incomplete or inaccurate interpretations in complex scenes
Requires adequate GPU resources for stable inference
Output consistency varies with decoding settings and runtime optimization
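
Because output varies with decoding settings, pinning them explicitly can make runs easier to compare. The sketch below collects two presets as keyword arguments for `model.generate`. `do_sample`, `temperature`, `top_p`, and `max_new_tokens` are standard Transformers generation parameters; the preset names and the specific values are assumptions chosen for illustration, not recommendations from the model authors.

```python
# Illustrative presets; the values are assumptions, not tuned recommendations.
GENERATION_PRESETS = {
    # Greedy decoding: aims for repeatable output on a fixed model, input, and runtime.
    "deterministic": {"do_sample": False, "max_new_tokens": 256},
    # Sampling: more varied captions, at the cost of run-to-run variation.
    "varied": {"do_sample": True, "temperature": 0.7, "top_p": 0.9, "max_new_tokens": 256},
}


def generation_kwargs(preset="deterministic", **overrides):
    """Return a copy of a preset with optional per-call overrides applied."""
    kwargs = dict(GENERATION_PRESETS[preset])
    kwargs.update(overrides)
    return kwargs


# Usage with the Quick Start objects: model.generate(**inputs, **generation_kwargs())
print(generation_kwargs("varied", max_new_tokens=128))
```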
