Qwen2.5-VL-3B-Instruct-Traffic/README.md

# Qwen2.5-VL-3B-Instruct-Traffic

**Qwen2.5-VL-3B-Instruct-Traffic** is a multimodal model fine-tuned on the **MITS (Multimodal Intelligent Traffic Surveillance)** dataset for intelligent traffic surveillance scenarios.

- **Tasks:** recognition, counting, localization, background awareness, reasoning
- **Data:** 170,400 images + ~5M instruction-following VQA pairs from MITS
- **Modality:** Image + Text → Text
- **Domain:** traffic scenes (congestion, accidents, construction, smoke/fireworks, unusual weather, spills, etc.)

## Quick Links
- 📚 Dataset: [`zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance`](https://www.modelscope.cn/datasets/zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance)
- 💻 Usage & examples: please refer to the GitHub repo  
  **https://github.com/LifeIsSoSolong/Multimodal-Intelligent-Traffic-Surveillance-Dataset-Models**

## Intended Use
- Urban traffic monitoring, incident analysis, visual question answering for transportation management
- Research on ITS-specific multimodal reasoning and instruction following

## Model Inputs/Outputs
- **Input:** an image (traffic scene) + a natural language instruction/question
- **Output:** a natural language response (e.g., description, count, event reasoning)

## Training Summary
- Objective: instruction tuning on MITS traffic QA
- Backbone family: Qwen2.5-VL 3B Instruct
- Notes: align vision-language features to traffic-centric concepts and events

## Limitations & Notes
- The model may make mistakes on rare objects or extreme weather/night scenes not well represented in training.
- Not a safety-critical system; human verification is required for real-world decisions.

## License
- Follow the licenses of this model and the MITS dataset as stated on their ModelScope pages.

## Citation
If you use this model or dataset, please cite:
```bibtex
@article{zhao2025mits,
  title   = {MITS: A large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance},
  author  = {Zhao, Kaikai and Liu, Zhaoxiang and Wang, Peng and Wang, Xin and Ma, Zhicheng and Xu, Yajun and Zhang, Wenjing and Nan, Yibing and Wang, Kai and Lian, Shiguo},
  journal = {Image and Vision Computing},
  pages   = {105736},
  year    = {2025},
  publisher = {Elsevier}
}
```

## Contact
Unicom AI
初始化项目，由ModelHub XC社区提供模型 Model: zhaokaikai/Qwen2.5-VL-3B-Instruct-Traffic Source: Original Platform 2026-05-20 13:47:40 +08:00			`# Qwen2.5-VL-3B-Instruct-Traffic`

			`Qwen2.5-VL-3B-Instruct-Traffic is a multimodal model fine-tuned on the MITS (Multimodal Intelligent Traffic Surveillance) dataset for intelligent traffic surveillance scenarios.`

			`- Tasks: recognition, counting, localization, background awareness, reasoning`
			`- Data: 170,400 images + ~5M instruction-following VQA pairs from MITS`
			`- Modality: Image + Text → Text`
			`- Domain: traffic scenes (congestion, accidents, construction, smoke/fireworks, unusual weather, spills, etc.)`

			`## Quick Links`
			- 📚 Dataset: [`zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance`](https://www.modelscope.cn/datasets/zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance)
			`- 💻 Usage & examples: please refer to the GitHub repo`
			`https://github.com/LifeIsSoSolong/Multimodal-Intelligent-Traffic-Surveillance-Dataset-Models`

			`## Intended Use`
			`- Urban traffic monitoring, incident analysis, visual question answering for transportation management`
			`- Research on ITS-specific multimodal reasoning and instruction following`

			`## Model Inputs/Outputs`
			`- Input: an image (traffic scene) + a natural language instruction/question`
			`- Output: a natural language response (e.g., description, count, event reasoning)`

			`## Training Summary`
			`- Objective: instruction tuning on MITS traffic QA`
			`- Backbone family: Qwen2.5-VL 3B Instruct`
			`- Notes: align vision-language features to traffic-centric concepts and events`

			`## Limitations & Notes`
			`- The model may make mistakes on rare objects or extreme weather/night scenes not well represented in training.`
			`- Not a safety-critical system; human verification is required for real-world decisions.`

			`## License`
			`- Follow the licenses of this model and the MITS dataset as stated on their ModelScope pages.`

			`## Citation`
			`If you use this model or dataset, please cite:`
			```bibtex
			`@article{zhao2025mits,`
			`title = {MITS: A large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance},`
			`author = {Zhao, Kaikai and Liu, Zhaoxiang and Wang, Peng and Wang, Xin and Ma, Zhicheng and Xu, Yajun and Zhang, Wenjing and Nan, Yibing and Wang, Kai and Lian, Shiguo},`
			`journal = {Image and Vision Computing},`
			`pages = {105736},`
			`year = {2025},`
			`publisher = {Elsevier}`
			`}`
			```

			`## Contact`
			`Unicom AI`