Files
2025-10-09 16:47:16 +08:00

4.4 KiB
Raw Permalink Blame History

This model was released on 2024-07-24 and added to Hugging Face Transformers on 2025-04-15.

MLCD

PyTorch SDPA

Overview

The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.

🔥MLCD-ViT-bigG🔥 series is the state-of-the-art vision transformer model enhanced with 2D Rotary Position Embedding (RoPE2D), achieving superior performance on document understanding and visual question answering tasks. Developed by DeepGlint AI, this model demonstrates exceptional capabilities in processing complex visual-language interactions.

Tips:

Result:

Vision Tower RoPE2D ChartQA DocVQA InfoVQA OCRBench MMMU
CLIP (ViT-L-14-336px) × 66.52 75.21 38.88 525.00 44.20
SigLIP (ViT-SO400M-384px) × 69.28 76.71 41.38 554.00 46.78
DFN5B (ViT-H-14-378px) × 64.36 70.87 38.59 473.00 48.00
MLCD (ViT-L-14-336px) × 67.84 76.46 43.48 531.00 44.30
MLCD (ViT-bigG-14-336px) 71.07 79.63 44.38 572.00 46.78
MLCD (ViT-bigG-14-448px) 73.80 83.34 46.59 582.00 46.00

Usage

import requests
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel

# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")

# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")

# Generate outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get visual features
features = outputs.last_hidden_state

print(f"Extracted features shape: {features.shape}")

MLCDVisionConfig

autodoc MLCDVisionConfig

MLCDVisionModel

autodoc MLCDVisionModel - forward