iGGT Logo

IGGT: Instance-Grounded Geometry Transformer
for Semantic 3D Reconstruction

1NWPU    2S-Lab, NTU    3StepFun, Inc.    4THU    5MMLab, CUHK
Spatial Foundation Model · Instance Spatial Tracking · 2D / 3D Open-Vocab. Segmentation · Scene Grounding

Revolutionary Solution

Humans naturally perceive both geometric structure and semantic content of 3D worlds, but achieving "the best of both worlds" has been a grand challenge for AI. Traditional methods decouple 3D reconstruction (low-level geometry) from spatial understanding (high-level semantics), leading to error accumulation and poor generalization. Meanwhile, newer methods attempt to "lock" 3D models with specific Vision-Language Models (VLMs), which not only limits the model's perception capabilities (e.g., inability to distinguish between different instances of the same class) but also hinders extensibility to stronger downstream tasks.

iGGT presents a solution to this challenge. Proposed by NTU in collaboration with StepFun, iGGT (Instance-Grounded Geometry Transformer) is an innovative end-to-end large unified Transformer that, for the first time, integrates spatial reconstruction with instance-level contextual understanding.

iGGT Teaser

Building upon our curated large-scale dataset InsScene-15K, we propose a novel end-to-end framework that enables geometric reconstruction and contextual understanding in a unified representation.

Key Contributions

🔧 End-to-End Unified Framework

We propose iGGT, a large unified Transformer that unifies knowledge of spatial reconstruction and instance-level contextual understanding through end-to-end training within a single model.

📊 Large-Scale Instance Dataset

We build InsScene-15K, a novel large-scale dataset containing 15K scenes, 200M images, and high-quality, 3D-consistent instance-level masks annotated through a novel data pipeline.

🔌 Instance Decoupling & Plug-and-Play

We pioneer the "Instance-Grounded Scene Understanding" paradigm. iGGT is not bound to any specific VLM, but generates instance masks as a "bridge" for seamless plug-and-play integration with arbitrary VLMs and LMMs.

🎯 Multi-Application Support

This unified representation greatly expands downstream capabilities. iGGT is the first model to simultaneously support spatial tracking, open-vocabulary segmentation, and scene question answering (QA).

InsScene-15K Dataset Construction

Data Curation Pipeline

Overview of InsScene-15K dataset annotation pipeline.

The InsScene-15K dataset is constructed through a novel data curation pipeline driven by SAM2, integrating three different data sources with distinct processing approaches:

🎮 Synthetic Data (e.g., Aria, Infinigen)

This is the most straightforward case. In simulated environments, RGB images, depth maps, camera poses, and object-level segmentation masks are generated simultaneously. Since these simulated masks are "perfectly accurate," they can be used directly without any post-processing.

🎥 Real-World Video Capture (e.g., RE10K)

This pipeline, illustrated in Figure 2(a), is a customized SAM2 video dense prediction pipeline. First, SAM generates dense initial mask proposals on frame 0 of the video. Then, the SAM2 video object segmenter propagates these masks forward in time. To handle newly appearing objects or avoid drift, the pipeline iteratively adds new keyframes: if uncovered regions exceed a threshold, SAM is re-run on new frames to discover new objects. Finally, bi-directional propagation is performed to ensure high temporal consistency throughout the entire video sequence.
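
The control flow below is a minimal sketch of this iterative keyframing scheme, not the released pipeline: sam_dense_proposals and sam2_propagate are hypothetical stand-ins for the actual SAM / SAM2 calls, and the coverage threshold is illustrative.

import numpy as np

# Hypothetical stand-ins for the real SAM / SAM2 calls used by the pipeline.
def sam_dense_proposals(frame: np.ndarray) -> np.ndarray:
    """Return an HxW integer map of dense instance proposals (0 = background)."""
    raise NotImplementedError

def sam2_propagate(frames, keyframe_masks, reverse=False):
    """Propagate the keyframe masks through the video; returns per-frame HxW maps."""
    raise NotImplementedError

def annotate_video(frames, coverage_thresh=0.85):
    """Iterative keyframing: start from dense proposals on frame 0, propagate,
    and add a new keyframe whenever too much of a frame is left uncovered."""
    keyframes = {0: sam_dense_proposals(frames[0])}
    for t in range(1, len(frames)):
        masks = sam2_propagate(frames, keyframes)
        if (masks[t] > 0).mean() < coverage_thresh:
            offset = max(int(m.max()) for m in masks)   # keep new IDs unique
            new = sam_dense_proposals(frames[t])
            new = np.where(new > 0, new + offset, 0)
            # Existing objects keep their IDs; uncovered pixels get new ones.
            keyframes[t] = np.where(masks[t] > 0, masks[t], new)
    # Final bi-directional pass for temporal consistency over the whole video.
    fwd = sam2_propagate(frames, keyframes)
    bwd = sam2_propagate(frames, keyframes, reverse=True)
    return [np.where(f > 0, f, b) for f, b in zip(fwd, bwd)]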

📷 Real-World RGBD Capture (e.g., ScanNet++)

This pipeline, shown in Figure 2(b), is a mask refinement pipeline. The 3D annotations provided by ScanNet++ are coarse. The pipeline first projects these 3D annotations onto 2D images to obtain initial GT masks with consistent IDs. Meanwhile, SAM2 is used to generate fine-grained mask proposals with accurate shapes but without IDs for the same RGB image. The key step is matching and merging: aligning the fine masks generated by SAM2 with the projected coarse GT masks to assign correct, multi-view consistent IDs to the fine masks. Through this approach, the pipeline significantly improves the quality of 2D masks, maintaining both 3D ID consistency and SAM2-level shape accuracy.
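
A minimal sketch of the matching-and-merging step, assuming a simple majority-overlap / IoU criterion; the function name and threshold are illustrative, and the paper's exact matching rule may differ.

import numpy as np

def refine_masks(fine_masks: list[np.ndarray], coarse_id_map: np.ndarray,
                 iou_thresh: float = 0.5) -> np.ndarray:
    """Assign multi-view-consistent IDs from a projected coarse GT map to
    fine SAM2 mask proposals.  `fine_masks` is a list of HxW boolean masks,
    `coarse_id_map` an HxW integer map (0 = background)."""
    refined = np.zeros_like(coarse_id_map)
    for fine in fine_masks:
        ids, counts = np.unique(coarse_id_map[fine], return_counts=True)
        keep = ids > 0                       # drop background
        if not keep.any():
            continue
        # Coarse instance with the largest overlap against this fine proposal.
        best = ids[keep][np.argmax(counts[keep])]
        inter = int(counts[keep].max())
        union = int(np.logical_or(fine, coarse_id_map == best).sum())
        if inter / max(union, 1) >= iou_thresh:
            refined[fine] = best             # fine SAM2 shape, consistent GT ID
    return refined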

Architecture

iGGT Framework

Overview of iGGT architecture. Input images are encoded into unified token representations, which are then processed by the Geometry Head and Instance Head respectively to simultaneously generate high-quality geometric reconstruction and instance-grounded clustering results.

The iGGT architecture consists of three key components:

🏗️ Large Unified Transformer

Following VGGT, the model first uses pretrained DINOv2 to extract patch-level tokens from images. Subsequently, 24 attention modules process the multi-view image tokens through intra-view self-attention and global-view cross-attention, encoding them into powerful unified token representations.
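
A minimal PyTorch sketch of one such alternating attention block; the dimensions, head count, and pre-norm layout are illustrative rather than the released configuration.

import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One intra-view self-attention layer followed by one global-view
    attention layer over patch tokens (VGGT-style)."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, tokens):              # tokens: (B, V, N, C) = batch, views, patches, dim
        B, V, N, C = tokens.shape
        # Intra-view self-attention: each view attends over its own patches.
        x = tokens.reshape(B * V, N, C)
        y = self.norm1(x)
        x = x + self.frame_attn(y, y, y)[0]
        # Global attention: patches from all views attend to each other jointly.
        x = x.reshape(B, V * N, C)
        y = self.norm2(x)
        x = x + self.global_attn(y, y, y)[0]
        return x.reshape(B, V, N, C)

Stacking such blocks (24 attention modules in total, per the description above) yields the unified token representations consumed by both downstream heads.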

🎯 Downstream Heads and Cross-Modal Fusion

The unified tokens are fed into two parallel decoders:

  • Geometry Head: Inherited from VGGT, responsible for predicting camera parameters, depth maps, and point clouds.
  • Instance Head: Adopts a DPT-like architecture to decode instance features.
  • Cross-Modal Fusion Block: To enable the instance head to perceive fine-grained geometric boundaries, we design a cross-modal fusion block. It efficiently embeds spatial structural features from the geometry head into instance representations through sliding window cross attention, significantly enhancing the spatial awareness of instance features.
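
The sliding-window cross attention in the fusion block can be pictured with the sketch below. It is a simplification with non-overlapping windows and illustrative dimensions, assuming the spatial size is divisible by the window size; the released module may differ.

import torch
import torch.nn as nn

class SlidingWindowCrossFusion(nn.Module):
    """Inject geometry-head features into instance features via cross attention
    computed inside local spatial windows."""
    def __init__(self, dim=256, heads=8, window=8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q, self.norm_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, inst_feat, geo_feat):   # both: (B, C, H, W); H, W divisible by window
        B, C, H, W = inst_feat.shape
        w = self.window

        def windows(x):
            # Partition (B, C, H, W) into non-overlapping w x w windows -> (B*nWin, w*w, C).
            x = x.reshape(B, C, H // w, w, W // w, w)
            return x.permute(0, 2, 4, 3, 5, 1).reshape(-1, w * w, C)

        q, kv = windows(inst_feat), windows(geo_feat)
        q_n, kv_n = self.norm_q(q), self.norm_kv(kv)
        fused = q + self.attn(q_n, kv_n, kv_n)[0]   # instance queries, geometry keys/values
        # Reverse the window partition back to (B, C, H, W).
        fused = fused.reshape(B, H // w, W // w, w, w, C)
        return fused.permute(0, 5, 1, 3, 2, 4).reshape(B, C, H, W)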

🎨 3D-Consistent Contrastive Supervision

To enable the model to learn 3D-consistent instance features from only 2D inputs, we design a multi-view contrastive loss. The core idea is to "pull together" pixel features from different views belonging to the same 3D instance while "pushing apart" features from different instances in the feature space.
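
A minimal sketch of such a loss in the style of supervised contrastive learning, assuming pixel features have already been sampled from all views together with their 3D-consistent instance IDs; the paper's exact formulation and weighting may differ.

import torch
import torch.nn.functional as F

def multiview_instance_contrastive(feats, ids, temperature=0.07):
    """`feats`: (P, C) pixel features sampled across all views.
    `ids`: (P,) 3D-consistent instance labels.
    Pixels of the same instance (in any view) are positives; all others negatives.
    Assumes every sampled instance appears at least twice so positives exist."""
    feats = F.normalize(feats, dim=-1)
    logits = feats @ feats.t() / temperature                 # (P, P) similarities
    same = ids.unsqueeze(0) == ids.unsqueeze(1)              # same-instance mask
    eye = torch.eye(len(ids), dtype=torch.bool, device=ids.device)
    pos = same & ~eye                                        # positive pairs, no self-pairs
    # Log-softmax over all other pixels, then average over positive pairs.
    log_prob = logits - torch.logsumexp(logits.masked_fill(eye, -1e9), dim=1, keepdim=True)
    return -(log_prob[pos]).mean()

In practice, the features would be bilinearly sampled from the instance head's output at pixels whose ground-truth instance IDs are available from InsScene-15K.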

Instance-Grounded Scene Understanding

The core idea is to "decouple" the unified representation of the 3D model from downstream language models (VLMs or LMMs). This differs from previous methods that typically "tightly couple" or "forcibly align" 3D models with specific language models (like LSeg), which limits the model's perception capabilities and extensibility.

We first leverage unsupervised clustering (HDBSCAN) to group the 3D-consistent instance features predicted by iGGT, thereby segmenting the scene into different object instances. These clustering results are then reprojected to generate 3D-consistent 2D instance masks, which serve as a "bridge" for seamless plug-and-play integration with various VLMs (such as CLIP, OpenSeg) and LMMs (such as Qwen2.5-VL).
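
As a concrete illustration, clustering per-pixel instance features with the hdbscan package might look like the sketch below; array shapes and the min-cluster-size value are assumptions, and the released pipeline may subsample pixels or tune HDBSCAN differently.

import numpy as np
import hdbscan  # pip install hdbscan

def cluster_instance_features(feat_maps: np.ndarray, min_cluster_size: int = 200):
    """Group per-pixel instance features from V views into object instances.
    `feat_maps`: (V, H, W, C) features predicted by the instance head.
    Returns (V, H, W) integer masks; -1 marks unclustered pixels (HDBSCAN noise)."""
    V, H, W, C = feat_maps.shape
    flat = feat_maps.reshape(-1, C)          # cluster all views jointly
    labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(flat)
    return labels.reshape(V, H, W)

The resulting per-view ID maps are the 3D-consistent 2D instance masks that serve as the bridge to downstream VLMs and LMMs.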

This decoupling paradigm greatly expands the application scope of the model:

  • Instance Spatial Tracking: Using the 3D-consistent masks generated by clustering, we can densely track and segment specific object instances across multiple views, even under significant camera motion, without losing track of the target.
  • Open-Vocabulary Semantic Segmentation: Instance masks can serve as "prompts" fed into any off-the-shelf VLM (such as OpenSeg). The VLM assigns a semantic category to each masked region, enabling open-vocabulary segmentation (see the sketch after this list).
  • QA Scene Grounding: This decoupled instance clustering can interact with LMMs (such as GPT-4 or Qwen2.5-VL). For example, we can highlight masks of the same instance across multiple views and then query the LMM to perform complex object-centric question answering tasks in 3D scenes.
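
For instance, the open-vocabulary step can be approximated by embedding each mask's crop with CLIP and matching it against a text vocabulary. This is a simplified sketch of plugging a VLM into the predicted masks; the paper relies on mask-level VLMs such as OpenSeg, and the model name and prompt template below are just one choice.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def label_instances(image: Image.Image, inst_map: np.ndarray, vocabulary: list[str]):
    """Assign an open-vocabulary label to every instance mask by comparing a
    CLIP embedding of its cropped region against the text vocabulary."""
    text_inp = processor(text=[f"a photo of a {c}" for c in vocabulary],
                         return_tensors="pt", padding=True)
    text_emb = torch.nn.functional.normalize(model.get_text_features(**text_inp), dim=-1)
    labels = {}
    for inst_id in np.unique(inst_map):
        if inst_id < 0:                      # skip HDBSCAN noise
            continue
        ys, xs = np.where(inst_map == inst_id)
        crop = image.crop((int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1))
        img_inp = processor(images=crop, return_tensors="pt")
        img_emb = torch.nn.functional.normalize(model.get_image_features(**img_inp), dim=-1)
        labels[int(inst_id)] = vocabulary[int((img_emb @ text_emb.t()).argmax())]
    return labels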

Experimental Results

Compared to existing methods, iGGT is the only model capable of simultaneously achieving reconstruction, understanding, and tracking, with significant improvements in understanding and tracking metrics.

ScanNet Results

On instance 3D tracking tasks, iGGT achieves a tracking IoU of 70% and a success rate of 90%, making it the only model capable of successfully tracking objects that disappear and reappear.

Spatial Tracking

Our method is compared with SAM2 and SpaTracker+SAM. For clarity, all instances are visualized using different IDs and colors.

We also conduct comprehensive visualization experiments on scenes, demonstrating that iGGT can generate 3D-consistent instance-based features that remain distinctive across multiple views: multiple instances of the same class exhibit similar yet distinguishable colors in PCA space.

PCA Visualization

We visualize 3D-consistent PCA results alongside instance feature-based clustering masks. Similar colors in PCA indicate higher feature similarity between instances. For clustering masks, the same object instance shares the same color across multiple views.

On 2D/3D open-vocabulary segmentation tasks, thanks to the instance-grounded paradigm, we can seamlessly integrate the latest Vision-Language Models to enhance the model's query performance.

2D Open-Vocabulary Segmentation

Qualitative results of 2D open-vocabulary segmentation on ScanNet and ScanNet++.

3D Open-Vocabulary Segmentation

Qualitative results of 3D open-vocabulary segmentation on ScanNet and ScanNet++.

Furthermore, we can leverage instance masks to construct visual prompts and integrate them with Large Multimodal Models (LMMs) such as Qwen2.5-VL to enable more complex object-specific queries and question-answering tasks in scenes. In contrast, even state-of-the-art LMMs still have significant limitations in handling multi-view or 3D scene understanding on their own.
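
One way to build such a visual prompt is simply to tint the chosen instance's mask in every view before sending the images to the LMM. The helper below is an illustrative sketch (color, opacity, and the query wording are arbitrary choices); the actual LMM call is left to whichever chat API is used.

import numpy as np
from PIL import Image

def highlight_instance(image: Image.Image, inst_map: np.ndarray, inst_id: int,
                       color=(255, 0, 0), alpha=0.45) -> Image.Image:
    """Tint one instance's mask in an RGB frame; the same ID is highlighted
    in every view before querying the LMM."""
    rgb = np.asarray(image).astype(np.float32)           # (H, W, 3)
    mask = (inst_map == inst_id)[..., None]               # (H, W, 1) boolean
    tint = np.array(color, dtype=np.float32)
    out = np.where(mask, (1 - alpha) * rgb + alpha * tint, rgb)
    return Image.fromarray(out.astype(np.uint8))

# The highlighted multi-view images, together with a question such as
# "What is the highlighted object, and what is it next to?", are then passed
# to an off-the-shelf LMM (e.g., Qwen2.5-VL) through its own chat interface.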

Scene QA

Application of QA scene understanding compared with vanilla Gemini 2.5 Pro.

Citation

@article{iggt2025,
  title={IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction},
  author={Hao Li and Zhengyu Zou and Fangfu Liu and Xuanyang Zhang and Fangzhou Hong and Yukang Cao and Yushi Lan and Manyuan Zhang and Gang Yu and Dingwen Zhang and Ziwei Liu},
  journal={arXiv preprint arXiv:2510.22706},
  year={2025}
}