Zero-shot Inexact CAD Model Alignment from a Single Image

¹VISTEC, Thailand
²Technical University of Munich, Germany
ICCV 2025
Given an input image and user-selected CAD models, we estimate the 9-DoF pose of each model by aligning it to the target object using the Normalized Object Coordinate (NOC) space, without requiring scene-level pose annotations. Despite being trained on only 9 classes, our method generalizes well to unseen categories in real images.

Abstract

One practical approach to infer 3D scene structure from a single image is to retrieve a closely matching 3D model from a database and align it with the object in the image. Existing methods rely on supervised training with images and pose annotations, which limits them to a narrow set of object categories. To address this, we propose a weakly supervised 9-DoF alignment method for inexact 3D models that requires no pose annotations and generalizes to unseen categories. Our approach derives a novel feature space from foundation features; it ensures multi-view consistency and overcomes the symmetry ambiguities inherent in foundation features via a self-supervised triplet loss. Additionally, we introduce a texture-invariant pose refinement technique that performs dense alignment in normalized object coordinates, estimated through the enhanced feature space. We conduct extensive evaluations on the real-world ScanNet25k dataset, where our method outperforms SOTA weakly supervised baselines by +4.3% mean alignment accuracy and is the only weakly supervised approach to surpass the supervised ROCA by +2.7%. To assess generalization, we introduce SUN2CAD, a real-world test set with 20 novel object categories, on which our method achieves SOTA results without prior training.

Proposed Solution

We propose a technique to enhance foundation features and integrate them into a 3D alignment pipeline with a coarse-to-fine estimation scheme. (1) A coarse 9-DoF pose is estimated using a geometry-aware feature space derived from DINOv2, which is more robust to object symmetries. (2) The pose is refined through dense alignment optimization in a texture-invariant space (NOC), using a new NOC estimator that generalizes better than prior work.
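
For intuition, here is a minimal sketch of the least-squares pose solve in step (1), once pixel-to-model correspondences are available. It uses the classic Umeyama closed form, which recovers rotation, translation, and a single uniform scale; the full method estimates a 9-DoF pose (per-axis scale), so treat this as a simplified stand-in, with illustrative names rather than the paper's code.

import numpy as np

def umeyama(src, dst):
    # Closed-form least-squares similarity transform mapping src -> dst.
    # src, dst: (N, 3) matched 3D points, e.g. CAD-model points and their
    # corresponding back-projected image points.
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)                  # cross-covariance matrix
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1                            # guard against reflections
    R = U @ S @ Vt                              # optimal rotation
    var_s = (xs ** 2).sum() / len(src)          # mean squared deviation of src
    s = np.trace(np.diag(D) @ S) / var_s        # uniform scale
    t = mu_d - s * (R @ mu_s)                   # translation
    return s, R, t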

Coarse Pose Estimation

In coarse alignment, object pixels and 3D model parts are encoded into a shared feature space, where correspondences are found via nearest neighbors and used to estimate pose with least squares. The key challenge is designing an effective feature space and encoder. Our solution trains a small feature adapter network that converts foundation features, computed from an image or a 3D model rendering, into custom features. This network enforces multi-view consistency, ensuring features for the same part remain similar across views, while distinguishing features for symmetrical parts that are not well separated in the foundation feature space. Leveraging direct access to CAD models, we formulate these objectives into a self-supervised triplet loss. This new feature space improves geometric awareness while allowing useful semantics in the foundation features to be retained.
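
A minimal PyTorch sketch of this triplet objective, assuming precomputed foundation (e.g., DINOv2) patch features for two rendered views of the same CAD model: the anchor and positive are features of the same model part seen from different views (multi-view consistency), and the negative is a symmetric counterpart part that the raw foundation space fails to separate. The adapter architecture, dimensions, and names such as FeatureAdapter are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    # Small network mapping frozen foundation features to the
    # geometry-aware feature space (architecture is an assumption).
    def __init__(self, in_dim=768, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),
        )

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-norm custom features

def self_supervised_triplet_loss(adapter, anchor, positive, negative, margin=0.2):
    # anchor/positive: foundation features of the same CAD part in two views;
    # negative: features of a symmetric counterpart part in the same view.
    a = adapter(anchor)
    p = adapter(positive)
    n = adapter(negative)
    return F.triplet_margin_loss(a, p, n, margin=margin)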

Fine Pose Estimation

In fine alignment, we use dense image-based alignment to optimize the 3D pose by matching the 3D model's rendering to the input image. However, instead of comparing in RGB space, which is impractical due to mismatched texture, we convert both the input and the rendering into NOC maps for comparison. These maps assign pixels from the same object part to a shared normalized 3D coordinate, allowing direct matching. To predict NOC maps, we leverage our feature space and perform nearest neighbor matching, as in coarse alignment. Since nearest neighbor matching is invariant to global scaling and shifting in the feature space, our NOC maps can offer improved robustness to domain gaps and have been found to generalize better to real-world images, even outperforming direct NOC regressors trained on the same synthetic renderings.
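
To make this concrete, below is a hedged sketch of NOC-map prediction by nearest-neighbor lookup in the adapted feature space, followed by the masked loss one would minimize during dense alignment; the pose parameters enter through a differentiable renderer that produces the rendered NOC map, which is omitted here. All function and variable names are illustrative.

import torch

@torch.no_grad()
def predict_noc_map(pixel_feats, model_feats, model_nocs):
    # pixel_feats: (P, D) adapted features of object pixels (L2-normalized)
    # model_feats: (M, D) adapted features of sampled CAD surface points
    # model_nocs:  (M, 3) normalized object coordinates of those points
    sim = pixel_feats @ model_feats.T       # cosine similarity, (P, M)
    idx = sim.argmax(dim=1)                 # nearest CAD point per pixel
    return model_nocs[idx]                  # (P, 3) predicted NOC per pixel

def noc_alignment_loss(rendered_noc, predicted_noc, mask):
    # L1 difference over object pixels between the NOC map rendered from
    # the current pose estimate and the NOC map predicted from the image.
    diff = (rendered_noc - predicted_noc).abs().sum(dim=-1)
    return (diff * mask).sum() / mask.sum().clamp(min=1)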

Alignment Results

We evaluate our method on ScanNet25k and outperform weakly supervised baselines. Additionally, we introduce the SUN2CAD dataset, an inexact 9-DoF test set with 20 unseen categories, where our approach surpasses both the supervised SOTA (SPARC) and weakly supervised baselines, achieving state-of-the-art generalization by a large margin.

SUN2CAD Dataset

Coming soon!

BibTeX

@inproceedings{Arsomngern2025ZeroCAD,
  author = {Arsomngern, Pattaramanee and Khwanmuang, Sasikarn and Nie{\ss}ner, Matthias and Suwajanakorn, Supasorn},
  title = {Zero-shot Inexact CAD Model Alignment from a Single Image},
  booktitle = {International Conference on Computer Vision},
  year = {2025},
}