*This model was released on 2025-02-19 and added to Hugging Face Transformers on 2025-09-15.*
PyTorch FlashAttention SDPA
# Qwen3-VL

[Qwen3-VL](https://huggingface.co/papers/2502.13923) is a multimodal vision-language model series that includes both dense and MoE variants, as well as Instruct and Thinking versions. Building on its predecessors, Qwen3-VL improves visual understanding while maintaining strong text-only capabilities. Key architectural advancements include:

- Enhanced MRoPE with an interleaved layout for better spatial-temporal modeling.
- DeepStack integration to leverage multi-level features from the Vision Transformer (ViT).
- Improved video understanding through text-based time alignment, moving from T-RoPE to text timestamp alignment for more precise temporal grounding.

Together, these changes are intended to improve performance on complex multimodal tasks.

Model usage

```py
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-VL",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
            },
            {
                "type": "text",
                "text": "Describe this image."
            }
        ]
    }
]

# Apply the chat template, tokenize the text, and preprocess the image in one call,
# then move the resulting tensors to the same device as the model.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# generate() returns the prompt followed by the new tokens; keep only the new tokens.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

## Qwen3VLConfig

[[autodoc]] Qwen3VLConfig

## Qwen3VLTextConfig

[[autodoc]] Qwen3VLTextConfig

## Qwen3VLProcessor

[[autodoc]] Qwen3VLProcessor

## Qwen3VLVideoProcessor

[[autodoc]] Qwen3VLVideoProcessor

## Qwen3VLVisionModel

[[autodoc]] Qwen3VLVisionModel
    - forward

## Qwen3VLTextModel

[[autodoc]] Qwen3VLTextModel
    - forward

## Qwen3VLModel

[[autodoc]] Qwen3VLModel
    - forward

## Qwen3VLForConditionalGeneration

[[autodoc]] Qwen3VLForConditionalGeneration
    - forward
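
## Notes

The usage example at the top of this page slices the prompt off each generated sequence before decoding, because `generate` returns the prompt tokens followed by the newly generated ones. The sketch below isolates that step with plain Python lists in place of tensors, so it runs without downloading a model. The helper name `trim_prompt_tokens`, the token IDs, and the use of `0` as a padding ID are all hypothetical and chosen only for illustration.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Return only the newly generated tokens for each sequence in a batch.

    Hypothetical helper: it mirrors the list comprehension in the usage
    example, using plain lists instead of tensors.
    """
    return [out_ids[len(in_ids):] for in_ids, out_ids in zip(input_ids, generated_ids)]


# Toy batch. Prompts are left-padded to the same length (0 is an assumed
# padding ID), as they would be in a rectangular input tensor.
prompt_batch = [
    [101, 7, 8],
    [0, 101, 9],
]

# Each generated row starts with its (padded) prompt, followed by new tokens.
generated_batch = [
    [101, 7, 8, 42, 43],
    [0, 101, 9, 55, 2],
]

new_tokens = trim_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43], [55, 2]]
```

Because each row is sliced at the padded prompt length, the same expression works for every row in the batch, and the decoded text does not repeat the prompt.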