MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations

Abstract

With the emergence of LLMs and their integration with other data modalities, multi-modal 3D perception has attracted growing attention for its connection to the physical world and has made rapid progress. However, limited by existing datasets, previous works mainly focus on understanding object properties or inter-object spatial relationships in a 3D scene. To tackle this problem, this paper builds MMScan, the first and largest multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations. It is constructed with a top-down logic, from region to object level and from a single target to inter-target relationships, covering holistic aspects of spatial and attribute understanding. The overall pipeline uses powerful VLMs with carefully designed prompts to initialize the annotations efficiently, and then involves human correction in the loop to ensure the annotations are natural, correct, and comprehensive. Built upon existing 3D scanning data, the resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks. We evaluate representative baselines on our benchmarks, analyze their capabilities in different aspects, and showcase the key problems to be addressed in the future. Furthermore, we use this high-quality dataset to train state-of-the-art 3D visual grounding models and LLMs and obtain remarkable performance improvements on both existing benchmarks and in-the-wild evaluation.

Dataset Overview

MMScan provides the largest ever multi-modal 3D scene dataset, with 6.9M hierarchical grounded language annotations covering holistic aspects at both the object and region levels.

Annotation Pipeline

The dataset construction starts with collecting meta-annotations that cover holistic aspects of spatial and attribute understanding at different granularities of 3D scenes. In this paper, we design a top-down logic to collect object- and region-level annotations. Specifically, we first select optimal views for each object so that VLMs can initialize the annotations efficiently, and then involve human correction in the loop. The resulting annotation includes both spatial (geometric shape, pose) and attribute (category, appearance, material, state, etc.) descriptions for the object. We annotate regions in a similar way, using a different UI and prompts, while focusing on regions’ inherent properties, object-object and object-region relationships, and advanced QA.
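To make the two-stage flow above concrete, the following is a minimal sketch of how an object-level meta-annotation might be represented and passed from a model draft to human correction. It is an illustration only: the class names, field names, and the `annotate_object`, `fake_vlm`, and `fake_human` functions are hypothetical and do not reflect the actual MMScan schema or tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ObjectAnnotation:
    """Hypothetical record for one object-level meta-annotation."""
    object_id: int
    # Spatial descriptions, e.g. geometric shape and pose.
    spatial: Dict[str, str] = field(default_factory=dict)
    # Attribute descriptions, e.g. category, appearance, material, state.
    attributes: Dict[str, str] = field(default_factory=dict)
    human_verified: bool = False


@dataclass
class RegionAnnotation:
    """Hypothetical record for one region, holding its objects and relations."""
    region_id: int
    properties: Dict[str, str] = field(default_factory=dict)
    objects: List[ObjectAnnotation] = field(default_factory=list)
    # (subject object id, relation text, target object id)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)


def annotate_object(
    object_id: int,
    draft_fn: Callable[[int], ObjectAnnotation],
    correct_fn: Callable[[ObjectAnnotation], ObjectAnnotation],
) -> ObjectAnnotation:
    """Draft an annotation with a model, then pass it through a correction step."""
    draft = draft_fn(object_id)      # stand-in for VLM initialization
    corrected = correct_fn(draft)    # stand-in for human correction in the loop
    corrected.human_verified = True
    return corrected


def fake_vlm(object_id: int) -> ObjectAnnotation:
    # Placeholder draft; a real pipeline would query a VLM on selected views.
    return ObjectAnnotation(
        object_id,
        spatial={"pose": "upright"},
        attributes={"category": "chair", "material": "wood"},
    )


def fake_human(draft: ObjectAnnotation) -> ObjectAnnotation:
    # Placeholder correction; here the annotator fixes the material field.
    fixed = dict(draft.attributes, material="metal")
    return ObjectAnnotation(draft.object_id, dict(draft.spatial), fixed)


chair = annotate_object(7, fake_vlm, fake_human)
region = RegionAnnotation(0, properties={"function": "dining area"}, objects=[chair])
```

Keeping spatial and attribute fields in separate dictionaries mirrors the split described above, and nesting objects under a region reflects the hierarchical organization of the annotations.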

Post-processing for Benchmarks

Given these meta-annotations, we further generate comprehensive samples for visual grounding and question-answering benchmarks.
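As a rough illustration of this step, the sketch below turns one object's attribute dictionary into a visual grounding sample and a question-answering sample per attribute. It assumes simple template-based generation, which is an assumption made here for illustration; the actual MMScan post-processing may work differently, and the function name, templates, and sample fields are all hypothetical.

```python
from typing import Dict, List


def make_samples(object_id: int, attributes: Dict[str, str]) -> List[dict]:
    """Build grounding and QA samples from one object's attributes (hypothetical templates)."""
    category = attributes["category"]
    samples = []
    for key, value in attributes.items():
        if key == "category":
            continue  # the category is used as the noun in the templates
        samples.append({
            "task": "visual_grounding",
            "text": f"Find the {category} whose {key} is {value}.",
            "target_ids": [object_id],
        })
        samples.append({
            "task": "question_answering",
            "question": f"What is the {key} of the {category}?",
            "answer": value,
            "target_ids": [object_id],
        })
    return samples


samples = make_samples(7, {"category": "chair", "material": "metal", "state": "folded"})
```

Each sample keeps the target object id so that the language stays grounded in the scene, which is the property both benchmarks rely on.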

In-the-Wild Test

Trained with MMScan, our model obtains a remarkable performance improvement on both existing benchmarks and in-the-wild evaluation.

BibTeX

@inproceedings{mmscan,
    title={MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations},
    author={Lyu, Ruiyuan and Wang, Tai and Lin, Jingli and Yang, Shuai and Mao, Xiaohan and Chen, Yilun and Xu, Runsen and Huang, Haifeng and Zhu, Chenming and Lin, Dahua and Pang, Jiangmiao},
    year={2024},
    booktitle={arXiv},
}