49 lines
2.2 KiB
Markdown
49 lines
2.2 KiB
Markdown
|
|
# Qwen2.5-VL-3B-Instruct-Traffic
|
||
|
|
|
||
|
|
**Qwen2.5-VL-3B-Instruct-Traffic** is a multimodal model fine-tuned on the **MITS (Multimodal Intelligent Traffic Surveillance)** dataset for intelligent traffic surveillance scenarios.
|
||
|
|
|
||
|
|
- **Tasks:** recognition, counting, localization, background awareness, reasoning
|
||
|
|
- **Data:** 170,400 images + ~5M instruction-following VQA pairs from MITS
|
||
|
|
- **Modality:** Image + Text → Text
|
||
|
|
- **Domain:** traffic scenes (congestion, accidents, construction, smoke/fireworks, unusual weather, spills, etc.)
|
||
|
|
|
||
|
|
## Quick Links
|
||
|
|
- 📚 Dataset: [`zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance`](https://www.modelscope.cn/datasets/zhaokaikai/Multimodal_Intelligent_Traffic_Surveillance)
|
||
|
|
- 💻 Usage & examples: please refer to the GitHub repo
|
||
|
|
**https://github.com/LifeIsSoSolong/Multimodal-Intelligent-Traffic-Surveillance-Dataset-Models**
|
||
|
|
|
||
|
|
## Intended Use
|
||
|
|
- Urban traffic monitoring, incident analysis, visual question answering for transportation management
|
||
|
|
- Research on ITS-specific multimodal reasoning and instruction following
|
||
|
|
|
||
|
|
## Model Inputs/Outputs
|
||
|
|
- **Input:** an image (traffic scene) + a natural language instruction/question
|
||
|
|
- **Output:** a natural language response (e.g., description, count, event reasoning)
|
||
|
|
|
||
|
|
## Training Summary
|
||
|
|
- Objective: instruction tuning on MITS traffic QA
|
||
|
|
- Backbone family: Qwen2.5-VL 3B Instruct
|
||
|
|
- Notes: align vision-language features to traffic-centric concepts and events
|
||
|
|
|
||
|
|
## Limitations & Notes
|
||
|
|
- The model may make mistakes on rare objects or extreme weather/night scenes not well represented in training.
|
||
|
|
- Not a safety-critical system; human verification is required for real-world decisions.
|
||
|
|
|
||
|
|
## License
|
||
|
|
- Follow the licenses of this model and the MITS dataset as stated on their ModelScope pages.
|
||
|
|
|
||
|
|
## Citation
|
||
|
|
If you use this model or dataset, please cite:
|
||
|
|
```bibtex
|
||
|
|
@article{zhao2025mits,
|
||
|
|
title = {MITS: A large-scale multimodal benchmark dataset for Intelligent Traffic Surveillance},
|
||
|
|
author = {Zhao, Kaikai and Liu, Zhaoxiang and Wang, Peng and Wang, Xin and Ma, Zhicheng and Xu, Yajun and Zhang, Wenjing and Nan, Yibing and Wang, Kai and Lian, Shiguo},
|
||
|
|
journal = {Image and Vision Computing},
|
||
|
|
pages = {105736},
|
||
|
|
year = {2025},
|
||
|
|
publisher = {Elsevier}
|
||
|
|
}
|
||
|
|
```
|
||
|
|
|
||
|
|
## Contact
|
||
|
|
Unicom AI
|