---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
---
**EN** | [中文](README_CN.md)
# SenseNova-SI: Scaling Spatial Intelligence with Multimodal Foundation Models
## Overview
Despite remarkable progress, multimodal foundation models still exhibit surprising deficiencies in spatial intelligence.
In this work, we explore scaling up multimodal foundation models to cultivate spatial intelligence within the **SenseNova-SI family**,
built upon established multimodal foundations including visual understanding models (i.e., Qwen3-VL and InternVL3) and unified understanding and generation models (i.e., Bagel).
We take a principled approach to constructing high-performing and robust spatial intelligence by systematically curating SenseNova-SI-8M:
eight million diverse data samples under a rigorous taxonomy of spatial capabilities.
SenseNova-SI demonstrates unprecedented performance across a broad range of spatial intelligence benchmarks: 68.7% on VSI-Bench, 43.3% on MMSI, 85.6% on MindCube,
54.6% on ViewSpatial, and 50.1% on SITE, while maintaining strong general multimodal understanding (e.g., 84.9% on MMBench-En).
More importantly, we analyze the impact of data scaling, discuss early signs of emergent generalization capabilities enabled by diverse data training,
analyze the risk of overfitting and language shortcuts, present a preliminary study on spatial chain-of-thought reasoning, and validate the potential downstream application. SenseNova-SI is an ongoing project, and this report will be updated continuously.
All newly trained multimodal foundation models are publicly released to facilitate further research in this direction.
*In the future, SenseNova-SI will be integrated with larger-scale in-house models.*
## Release Information
Currently, we build SenseNova-SI upon popular open-source foundation models to maximize compatibility with existing research pipelines.
In this release, we present
[**SenseNova-SI-1.2-InternVL3-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.2-InternVL3-8B), [**SenseNova-SI-1.1-Qwen2.5-VL-3B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-3B), [**SenseNova-SI-1.1-Qwen2.5-VL-7B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen2.5-VL-7B), and [**SenseNova-SI-1.1-Qwen3-VL-8B**](https://huggingface.co/sensenova/SenseNova-SI-1.1-Qwen3-VL-8B), of which **SenseNova-SI-1.2-InternVL3-8B** achieve state-of-the-art performance among open-source models of comparable size across eight recent spatial intelligence benchmarks:
**VSI**, **MMSI**, **MindCube**, **ViewSpatial**, **SITE**, **BLINK**, **3DSRBench**, **EmbSpatial-Bench**.
| Model | VSI | MMSI | MindCube-Tiny | ViewSpatial | SITE |
|---|---|---|---|---|---|
| Open-source Models (~2B) | |||||
| InternVL3-2B | 32.9 | 26.5 | 37.5 | 32.5 | 30.0 |
| Qwen2.5-VL-3B-Instruct | 27.0 | 28.6 | 37.6 | 31.9 | 33.1 |
| Qwen3-VL-2B-Instruct | 50.3 | 28.9 | 34.5 | 36.9 | 35.6 |
| MindCube-3B-RawQA-SFT | 17.2 | 1.7 | 51.7 | 24.1 | 6.3 |
| SpatialLadder-3B | 44.8 | 27.4 | 43.4 | 39.8 | 27.9 |
| SpatialMLLM-4B | 46.3 | 26.1 | 33.4 | 34.6 | 18.0 |
| VST-3B-SFT | 57.9 | 30.2 | 35.9 | 52.8 | 35.8 |
| Cambrian-S-3B | 57.3 | 25.2 | 32.5 | 39.0 | 28.3 |
| SenseNova-SI-1.1-Qwen2.5-VL-3B | 54.9 | 30.8 | 52.6 | 43.5 | 37.8 |
| Proprietary Models | |||||
| Gemini-2.5-pro-2025-06 | 53.5 | 38.0 | 57.6 | 46.0 | 57.0 |
| Grok-4-2025-07-09 | 47.9 | 37.8 | 63.5 | 43.2 | 47.0 |
| GPT-5-2025-08-07 | 55.0 | 41.8 | 56.3 | 45.5 | 61.8 |