---
license: apache-2.0
language:
- en
tags:
- long-video
- video-understanding
- video-qa
- agent
- qwen3
- transformers
- longtvqa
base_model: Qwen/Qwen3-4B-Thinking-2507
library_name: transformers
---

# LongVideoAgent Qwen3-4B

This repository hosts the released LLM checkpoint for **LongVideoAgent**, a multi-agent framework for long-video question answering. The checkpoint is based on **Qwen3-4B** and serves as the reasoning model in the LongVideoAgent project.

## Overview

This model was trained using the official repository: [longvideoagent/LongVideoAgent](https://github.com/longvideoagent/LongVideoAgent).

LongVideoAgent uses a multi-agent collaboration framework to decompose complex long-video reasoning into specialized roles. For the detailed methodology and agent architecture, please refer to our paper on arXiv: [https://arxiv.org/abs/2512.20618](https://arxiv.org/abs/2512.20618).

This checkpoint is intended for use with the official LongVideoAgent codebase and evaluation pipeline.

## Performance

On the **LongTVQA+** test set, this model achieves **72%** accuracy, while `gpt-4o-mini` achieves 74% on the same benchmark.

This shows that the model delivers reasoning performance comparable to advanced closed-source models while using far fewer parameters.

## Intended Use

Use this model for:

- Research on long-video question answering
- Reproducing LongVideoAgent experiments
- Studying agentic reasoning over long videos

This checkpoint is **not** a general-purpose video model by itself. For inference and evaluation, please use the official repository:

- https://github.com/longvideoagent/LongVideoAgent

## Usage

**Note on context length:** This model natively supports a context length of **262,144 tokens**. If you hit out-of-memory (OOM) errors or have limited VRAM during inference, reduce the maximum context length in your vLLM parameters, e.g. `max_model_len=120000`.

Please follow the setup and inference instructions in the official repository and project documentation:

- https://github.com/longvideoagent/LongVideoAgent
- https://longvideoagent.github.io/

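As a minimal sketch, the context-length cap mentioned above can be applied when serving the checkpoint with vLLM's OpenAI-compatible server; the checkpoint path below is a placeholder, and flag defaults may differ across vLLM versions:

```shell
# Serve the checkpoint with vLLM, capping the context window below the
# native 262,144 tokens to avoid OOM on smaller GPUs.
# Replace <path-to-checkpoint> with your local download of this repository.
vllm serve <path-to-checkpoint> --max-model-len 120000
```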
If you use this checkpoint in your work, please also cite the LongVideoAgent paper below.

## Citation

```bibtex
@misc{liu2025longvideoagentmultiagentreasoninglong,
  title={LongVideoAgent: Multi-Agent Reasoning with Long Videos},
  author={Runtao Liu and Ziyi Liu and Jiaqi Tang and Yue Ma and Renjie Pi and Jipeng Zhang and Qifeng Chen},
  year={2025},
  eprint={2512.20618},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2512.20618},
}
```