---
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
tags:
- multimodal
pipeline_tag: video-text-to-text
base_model: Qwen/Qwen2.5-VL-7B-Instruct
---


# 💡 VideoChat-R1-thinking_7B

[\[📂 GitHub\]](https://github.com/OpenGVLab/VideoChat-R1)  
[\[📜 Tech Report\]](https://arxiv.org/pdf/2504.06958) 


## 🚀 How to use the model

Install the required packages:
```
pip install transformers
pip install qwen_vl_utils
```
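
The loading code in this section requests `attn_implementation="flash_attention_2"`, which only works when the separate `flash-attn` package is installed. The following is a minimal sketch, not part of the official instructions, of one way to choose a backend at runtime. The helper name `pick_attn_implementation` is hypothetical, and falling back to `"sdpa"` (PyTorch's built-in scaled dot-product attention) is an assumption about what suits your setup.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa".

    Hypothetical helper for illustration; it only checks that the package
    can be found, not that it matches your GPU or CUDA version.
    """
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print(attn_implementation)
```

The returned string can be passed as the `attn_implementation` argument of `from_pretrained` in place of the hard-coded value.
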
Then load the model and run inference on a video:
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "OpenGVLab/VideoChat-R1-thinking_7B"
# Load the model on the available device(s).
# flash_attention_2 requires the flash-attn package; remove the argument to use the default attention.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto",
    attn_implementation="flash_attention_2"
)

# Default processor
processor = AutoProcessor.from_pretrained(model_path)

video_path = "your_video.mp4"
question = "Where is the final cup containing the object?"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": video_path,
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": f"""{question}
             
             Output your thought process within the <think> </think> tags, including analysis with either specific timestamps (xx.xx) or time ranges (xx.xx to xx.xx) in <timestep> </timestep> tags.

            Then, provide your final answer within the <answer> </answer> tags.
             """},
        ],
    }
]


# Prepare the inputs for inference.
# In Qwen2.5-VL, the frame rate is also passed to the model (via video_kwargs) so that frames align with absolute time.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to(model.device)

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=512)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
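
The prompt above asks the model to wrap its reasoning in `<think>`, its timestamps in `<timestep>`, and its final answer in `<answer>` tags. The following is a minimal sketch, assuming the model follows that format, of how the decoded string could be split into those parts. The helper `parse_response` and the sample string are hypothetical and are not real model output.

```python
import re


def parse_response(text: str) -> dict:
    """Split a response into reasoning, timestamps, and final answer.

    Missing tags yield None (or an empty list for timestamps) instead of
    raising, since the model is not guaranteed to follow the format.
    """
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)

    timesteps = []
    for span in re.findall(r"<timestep>(.*?)</timestep>", text, re.DOTALL):
        # One number is a single timestamp, two numbers are a time range.
        values = tuple(float(v) for v in re.findall(r"\d+(?:\.\d+)?", span))
        if values:
            timesteps.append(values)

    return {
        "think": think.group(1).strip() if think else None,
        "timesteps": timesteps,
        "answer": answer.group(1).strip() if answer else None,
    }


# Made-up string in the requested format, for illustration only.
sample = (
    "<think>The object goes under the middle cup "
    "<timestep>2.00 to 4.50</timestep>, then the cups are swapped at "
    "<timestep>6.00</timestep>.</think>\n"
    "<answer>The left cup</answer>"
)
parsed = parse_response(sample)
print(parsed["answer"])
print(parsed["timesteps"])
```

With the inference code above, the helper would be applied to `output_text[0]`, because `batch_decode` returns one string per input sequence.
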

## ✏️ Citation

```bibtex

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}
```