In coarse alignment, object pixels and 3D model parts are encoded into a shared feature space, where correspondences are established via nearest-neighbor matching and used to estimate the pose by least squares.
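To make this pipeline concrete, the sketch below matches pixel features to model features and fits a rigid pose with a least-squares (Kabsch/Umeyama) solve. It is a minimal illustration rather than our exact implementation: all names are ours, depth is assumed available so that pixels back-project to 3D points in camera space (without depth, a PnP solver would replace the rigid fit), and a robust wrapper such as RANSAC is omitted for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree

def coarse_pose(pixel_feats, pixel_pts_cam, model_feats, model_pts):
    """Nearest-neighbor matching in the shared feature space, followed by a
    least-squares rigid fit (Kabsch/Umeyama).

    pixel_feats:   (N, D) adapted features of object pixels
    pixel_pts_cam: (N, 3) pixel back-projections in camera space
    model_feats:   (M, D) adapted features of sampled 3D model points
    model_pts:     (M, 3) those points in model space
    """
    # For each pixel, find the closest model feature.
    _, nn = cKDTree(model_feats).query(pixel_feats)
    src, dst = model_pts[nn], pixel_pts_cam

    # Closed-form R, t minimizing sum ||R @ src_i + t - dst_i||^2.
    mu_s, mu_d = src.mean(0), dst.mean(0)
    Hm = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(Hm)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T            # reflection-corrected rotation
    t = mu_d - R @ mu_s
    return R, t
```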
The key challenge is designing an effective feature space and encoder. Our solution trains a small feature adapter network that converts foundation features, computed from an image or a 3D model rendering, into custom features. This network enforces multi-view consistency, keeping features of the same part similar across views, while separating features of symmetric parts that the foundation feature space does not distinguish well. Leveraging direct access to CAD models, we formulate these objectives as a self-supervised triplet loss.
This new feature space improves geometric awareness while retaining the useful semantics of the foundation features.
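As a sketch of how such an adapter and loss might look (the layer sizes, margin, and feature normalization below are illustrative choices, not the paper's): the anchor and positive are foundation features of the same model part seen in two different renderings, while the negative comes from a symmetric part whose foundation features lie close to the anchor's. Because the CAD model is available, such triplets can be mined automatically from renderings with known part correspondences.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    """Small MLP mapping frozen foundation features to the custom space.
    (Layer sizes and output normalization are illustrative choices.)"""
    def __init__(self, in_dim=768, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def adapter_triplet_loss(adapter, anchor_f, positive_f, negative_f, margin=0.2):
    """anchor/positive: foundation features of the same model part in two
    renderings (multi-view consistency); negative: a symmetric part whose
    foundation features are close to the anchor's (to be pushed apart)."""
    a, p, n = adapter(anchor_f), adapter(positive_f), adapter(negative_f)
    return F.triplet_margin_loss(a, p, n, margin=margin)
```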
Fine Pose Estimation
In fine alignment, we use dense image-based alignment to optimize the 3D pose by matching the 3D model's rendering to the input image. However, instead of comparing in RGB space, which is impractical because the model's texture rarely matches the real image, we convert both the input and the rendering into normalized object coordinate (NOC) maps for comparison.
These maps assign pixels depicting the same object part the same normalized 3D coordinate, allowing direct per-pixel comparison.
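One way to implement this comparison differentiably is sketched below, under simplifying assumptions (all function and variable names are ours; visibility and out-of-view points are ignored, which a real implementation would mask out). Instead of rasterizing a NOC rendering at each step, the sketch samples the input image's predicted NOC map at the projected locations of model points and penalizes disagreement with the points' own normalized coordinates, then descends on the pose parameters.

```python
import torch
import torch.nn.functional as F

def hat(k):
    """Skew-symmetric matrix of a 3-vector."""
    z = torch.zeros((), dtype=k.dtype, device=k.device)
    return torch.stack([torch.stack([z, -k[2], k[1]]),
                        torch.stack([k[2], z, -k[0]]),
                        torch.stack([-k[1], k[0], z])])

def rodrigues(rvec):
    """Axis-angle vector -> rotation matrix (differentiable)."""
    theta = rvec.norm().clamp_min(1e-8)
    K = hat(rvec / theta)
    I = torch.eye(3, dtype=rvec.dtype, device=rvec.device)
    return I + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

def refine_pose(noc_pred, K_intr, model_pts, model_noc, rvec0, t0,
                iters=100, lr=1e-2):
    """Minimize || noc_pred(project(R X + t)) - noc(X) ||^2 over the pose.

    noc_pred:  (3, H, W) NOC map predicted from the input image
    K_intr:    (3, 3) camera intrinsics
    model_pts: (N, 3) sampled model-space points
    model_noc: (N, 3) their normalized object coordinates
    rvec0, t0: initial pose, e.g. from coarse alignment
    """
    rvec = rvec0.clone().requires_grad_(True)
    t = t0.clone().requires_grad_(True)
    opt = torch.optim.Adam([rvec, t], lr=lr)
    H, W = noc_pred.shape[1:]
    for _ in range(iters):
        R = rodrigues(rvec)
        cam = model_pts @ R.T + t                  # points in camera space
        z = cam[:, 2].clamp_min(1e-6)
        u = K_intr[0, 0] * cam[:, 0] / z + K_intr[0, 2]
        v = K_intr[1, 1] * cam[:, 1] / z + K_intr[1, 2]
        # Normalize pixel coordinates to [-1, 1] for grid_sample.
        grid = torch.stack([2 * u / (W - 1) - 1,
                            2 * v / (H - 1) - 1], dim=-1)
        # Bilinearly sample the predicted NOC map at the projections; a real
        # implementation would also mask occluded and out-of-view points.
        sampled = F.grid_sample(noc_pred[None], grid[None, None],
                                align_corners=True)[0, :, 0].T  # (N, 3)
        loss = ((sampled - model_noc) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return rodrigues(rvec).detach(), t.detach()
```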
To predict NOC maps, we leverage our feature space and perform nearest neighbor matching, as in coarse alignment.
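Concretely, each object pixel can be assigned the normalized coordinate of the model point whose adapted feature is closest to the pixel's. A minimal sketch (names are ours; the object mask and per-pixel features are assumed given):

```python
import numpy as np
from scipy.spatial import cKDTree

def predict_noc_map(pixel_feats, mask, model_feats, model_noc):
    """Assign each object pixel the NOC value of its nearest model feature.

    pixel_feats: (P, D) adapted features of the P pixels inside `mask`,
                 ordered row-major to match `mask.nonzero()`
    mask:        (H, W) boolean object mask
    model_feats: (M, D) adapted features of sampled model points
    model_noc:   (M, 3) normalized object coordinates of those points
    """
    _, nn = cKDTree(model_feats).query(pixel_feats)
    noc = np.zeros(mask.shape + (3,), dtype=np.float32)
    noc[mask] = model_noc[nn]        # pixels outside the mask stay zero
    return noc
```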
Since nearest-neighbor matching is invariant to global scaling and shifting of the feature space, our NOC maps offer improved robustness to domain gaps: we find that they generalize better to real-world images, even outperforming direct NOC regressors trained on the same synthetic renderings.
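The invariance is easy to verify: applying the same scale and offset to every feature rescales all pairwise distances uniformly, leaving each query's nearest neighbor unchanged, as this toy check illustrates (data and dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))          # query (pixel) features
db = rng.normal(size=(20, 8))        # databank (model) features

def nn_idx(a, b):
    """Index of each row of a's nearest neighbor among the rows of b."""
    d2 = ((a[:, None] - b[None]) ** 2).sum(-1)
    return d2.argmin(1)

s, off = 3.7, rng.normal(size=8)     # arbitrary global scale and shift
assert (nn_idx(q, db) == nn_idx(s * q + off, s * db + off)).all()
```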