Stereo-GS: Multi-View Stereo Vision Model for Generalizable 3D Gaussian Splatting Reconstruction

ACM Multimedia 2025
Xiufeng Huang1,2, Ka Chun Cheung2, Runmin Cong3, Simon See2, Renjie Wan1
1Department of Computer Science, Hong Kong Baptist University
2NVIDIA AI Technology Center
3School of Control Science and Engineering, Shandong University

Abstract

Generalizable 3D Gaussian Splatting reconstruction enables advanced image-to-3D content creation but requires substantial computational resources and large datasets, posing challenges to training models from scratch. Current methods usually entangle the prediction of 3D Gaussian geometry and appearance, relying heavily on data-driven priors and resulting in slow regression. To address this, we propose Stereo-GS, a disentangled framework for efficient 3D Gaussian prediction. Our method extracts features from local image pairs using a stereo vision backbone and fuses them via global attention blocks. Dedicated point and Gaussian prediction heads generate multi-view point-maps for geometry and Gaussian features for appearance, which are combined as GS-maps to represent the 3DGS object. A refinement network then enhances these GS-maps for high-quality reconstruction. Unlike existing methods that depend on camera parameters, our approach achieves pose-free 3D reconstruction, improving robustness and practicality. By reducing resource demands while maintaining high-quality outputs, Stereo-GS provides an efficient, scalable solution for real-world 3D content generation.

Framework

Figure 1

Our proposed Stereo-GS generates multi-view GS-maps in a disentangled manner, predicting 3DGS geometry and appearance separately to enable high-quality 3D Gaussian reconstruction. It first uses a stereo vision model to extract local feature tokens from image pairs, which are fused via multi-view global attention blocks. A point prediction head then estimates geometry as multi-view point-maps, while a Gaussian prediction head generates Gaussian features for appearance. These outputs are combined into GS-maps representing the 3DGS object, refined by a cross-view attention-based network, and rendered into novel views as per-pixel 3D Gaussians during training.
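
To make the data flow concrete, below is a minimal PyTorch sketch of this disentangled pipeline. It is an illustrative approximation, not the authors' implementation: the patch-embedding stand-in for the stereo backbone, the layer sizes, the residual MLP standing in for the cross-view refinement network, and the 14-channel GS-map layout (xyz + opacity + scale + rotation + color) are all assumptions for the sake of the example.

```python
import torch
import torch.nn as nn

class StereoGSSketch(nn.Module):
    """Sketch of disentangled GS-map prediction: geometry (point-maps) and
    appearance (Gaussian features) from stereo image pairs. Illustrative only."""

    def __init__(self, dim=256, patch=8):
        super().__init__()
        # Stereo backbone stand-in: a patch embedding over a channel-stacked
        # image pair (the paper uses a pretrained stereo vision model).
        self.stereo_backbone = nn.Conv2d(2 * 3, dim, kernel_size=patch, stride=patch)
        # Multi-view global attention: fuses local tokens from all pairs.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.global_attn = nn.TransformerEncoder(layer, num_layers=2)
        # Disentangled prediction heads.
        self.point_head = nn.Linear(dim, 3)    # geometry: per-pixel xyz point-map
        self.gauss_head = nn.Linear(dim, 11)   # appearance: opacity(1)+scale(3)+rot(4)+rgb(3)
        # Refinement of the GS-maps (cross-view attention in the paper;
        # a residual MLP keeps this sketch short).
        self.refine = nn.Sequential(nn.Linear(14, dim), nn.GELU(), nn.Linear(dim, 14))

    def forward(self, pairs):
        # pairs: (B, V, 2, 3, H, W) -- V stereo image pairs per object.
        B, V, _, C, H, W = pairs.shape
        x = pairs.reshape(B * V, 2 * C, H, W)              # stack each pair channel-wise
        tokens = self.stereo_backbone(x).flatten(2).transpose(1, 2)  # (B*V, N, dim)
        tokens = tokens.reshape(B, -1, tokens.shape[-1])   # merge views: (B, V*N, dim)
        tokens = self.global_attn(tokens)                  # global cross-view fusion
        gs_maps = torch.cat([self.point_head(tokens),      # geometry branch
                             self.gauss_head(tokens)], dim=-1)  # appearance branch
        return gs_maps + self.refine(gs_maps)              # refined per-pixel GS-maps

# Example: 2 stereo pairs (4 views) of 256x256 images.
model = StereoGSSketch()
gs = model(torch.randn(1, 2, 2, 3, 256, 256))
print(gs.shape)  # torch.Size([1, 2048, 14]) -- 14 channels per pixel-aligned Gaussian
```

In this sketch, each GS-map row parameterizes one pixel-aligned 3D Gaussian; in the actual method these Gaussians are splatted to render novel views, and the rendering loss supervises the whole network during training.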