Date of Degree

2-2026

Document Type

Doctoral Dissertation

Degree Name

Doctor of Philosophy

Program

Computer Science

Advisor

Ioannis Stamos

Committee Members

Yingli Tian

Zhigang Zhu

Philippos Mordohai

Subject Categories

Artificial Intelligence and Robotics

Abstract

6D object pose estimation is the task of determining an object’s 3D rotation and translation with respect to a camera, and plays a critical role in applications such as robotic manipulation, autonomous navigation, and augmented reality. While recent advances in deep learning have substantially improved performance, many existing methods still face limitations in learning robust and generalizable representations. Factors such as variations in object appearance, occlusion, sensor noise, and domain shifts can degrade model accuracy, highlighting the need for more effective representation learning strategies that capture rich geometric and semantic cues for reliable pose estimation across diverse conditions.
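
For concreteness, a 6D pose is an element of SE(3): a 3x3 rotation matrix R and a translation vector t that together map object-frame coordinates into the camera frame (X_cam = R X_obj + t). The minimal Python sketch below, with purely illustrative names and values not drawn from the dissertation, applies such a pose to a point cloud:

    import numpy as np

    def apply_pose(points_obj: np.ndarray, R: np.ndarray, t: np.ndarray) -> np.ndarray:
        """Map Nx3 object-frame points into the camera frame: X_cam = R @ X_obj + t."""
        return points_obj @ R.T + t

    # Illustrative pose: a 30-degree rotation about the camera z-axis,
    # placed 0.8 m in front of the camera.
    theta = np.deg2rad(30.0)
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    t = np.array([0.1, 0.0, 0.8])

    cube = np.random.default_rng(0).uniform(-0.05, 0.05, size=(100, 3))
    cube_cam = apply_pose(cube, R, t)  # points as the camera would observe them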

This dissertation investigates 6D pose estimation from geometric information, emphasizing the development of robust and generalizable representation learning techniques to address both instance-level and category-level settings. Instance-level pose estimation refers to the task of predicting the 6D pose of specific, known objects for which exact 3D models are available during both training and testing. For instance-level pose estimation, this dissertation presents a depth-only fusion framework that converts depth images into normal vector angle maps to explicitly embed geometric cues, and combines them with point cloud features for accurate 3D keypoint localization and semantic segmentation. This approach achieves state-of-the-art performance on the LineMod and Occlusion-LineMod datasets, and delivers competitive results on YCB-Video without post-processing.
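
The precise angle-map construction is specific to the dissertation and is not reproduced here. As a hedged sketch of the general recipe, the Python fragment below back-projects a depth map using assumed pinhole intrinsics (fx, fy, cx, cy), estimates per-pixel surface normals from finite differences, and encodes each normal by its angles to the three camera axes; all names are illustrative assumptions:

    import numpy as np

    def depth_to_normal_angles(depth, fx, fy, cx, cy):
        """Back-project an HxW depth map, estimate per-pixel normals from
        local tangent vectors, and return an HxWx3 map of the angles
        (radians) between each normal and the camera x/y/z axes. Pixels
        without a valid normal get arccos(0) = pi/2 in every channel."""
        h, w = depth.shape
        u, v = np.meshgrid(np.arange(w), np.arange(h))
        x = (u - cx) * depth / fx              # back-projection to camera coords
        y = (v - cy) * depth / fy
        pts = np.stack([x, y, depth], axis=-1)
        du = np.zeros_like(pts)
        dv = np.zeros_like(pts)
        du[:, 1:-1] = pts[:, 2:] - pts[:, :-2]   # central differences (width)
        dv[1:-1, :] = pts[2:, :] - pts[:-2, :]   # central differences (height)
        n = np.cross(du, dv)                     # unnormalized surface normal
        norm = np.linalg.norm(n, axis=-1, keepdims=True)
        n = np.divide(n, norm, out=np.zeros_like(n), where=norm > 1e-8)
        return np.arccos(np.clip(n, -1.0, 1.0))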

Category-level pose estimation, by contrast, aims to estimate 6D poses for previously unseen object instances that belong to a predefined category. For this setting, the dissertation introduces a contrastive learning framework that learns pose-aware point cloud representations while preserving the intrinsic continuity of 6D poses. Specifically, we present two frameworks: the first is a two-phase approach that combines pose-aware and geometry-aware representations to estimate target object poses; the second is an end-to-end hierarchical ranking contrastive learning architecture that eliminates the need for a separate geometric encoder and strengthens the pose estimation modules. The resulting model achieves state-of-the-art accuracy among depth-only methods on the REAL275 and CAMERA25 datasets while maintaining real-time inference speed.
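
The dissertation's hierarchical ranking contrastive loss is not reproduced here; purely as an illustration of the underlying idea, the PyTorch sketch below asks embedding similarity to decrease monotonically with geodesic rotation distance, penalizing each out-of-order pair with a margin term. The margin value and all function names are hypothetical assumptions.

    import torch
    import torch.nn.functional as F

    def rotation_geodesic(R1, R2):
        """Geodesic distance (radians) between batches of 3x3 rotations."""
        cos = ((R1 @ R2.transpose(-1, -2)).diagonal(dim1=-2, dim2=-1).sum(-1) - 1) / 2
        return torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))

    def ranking_contrastive_loss(anchor, others, R_anchor, R_others, margin=0.05):
        """For one anchor embedding (d,) and K other embeddings (K, d),
        require cosine similarity to rank inversely with rotation distance:
        samples closer in pose must be more similar in feature space."""
        sim = F.cosine_similarity(anchor.unsqueeze(0), others, dim=-1)    # (K,)
        dist = rotation_geodesic(R_anchor.expand_as(R_others), R_others)  # (K,)
        sim_sorted = sim[torch.argsort(dist)]         # nearest pose first
        # For every pair i < j (i closer in pose), want sim_i >= sim_j + margin.
        diff = sim_sorted.unsqueeze(1) - sim_sorted.unsqueeze(0)
        viol = F.relu(margin - diff).triu(diagonal=1)  # keep only i < j pairs
        k = sim_sorted.numel()
        return viol.sum() / max(k * (k - 1) // 2, 1)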

In addition, we conduct an exploratory study on applying diffusion-based generative modeling to category-level pose estimation. The method generates canonical partial-view point clouds from observed depth-based point clouds before estimating poses via the Umeyama algorithm. While preliminary results reveal limitations in generation fidelity and pose consistency, the study highlights key challenges and opportunities for integrating generative models into pose estimation pipelines.
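
The Umeyama algorithm itself is standard: given corresponded point sets, it returns the least-squares similarity transform (scale s, rotation R, translation t) in closed form via SVD. A compact sketch with illustrative variable names:

    import numpy as np

    def umeyama(src, dst):
        """Least-squares similarity transform with dst_i ~ s * R @ src_i + t,
        for Nx3 corresponded point sets (Umeyama, 1991)."""
        mu_s, mu_d = src.mean(0), dst.mean(0)
        xs, xd = src - mu_s, dst - mu_d
        cov = xd.T @ xs / src.shape[0]           # cross-covariance
        U, D, Vt = np.linalg.svd(cov)
        S = np.eye(3)
        if np.linalg.det(U) * np.linalg.det(Vt) < 0:
            S[2, 2] = -1.0                       # avoid reflections
        R = U @ S @ Vt
        var_src = (xs ** 2).sum() / src.shape[0]
        s = np.trace(np.diag(D) @ S) / var_src   # optimal scale
        t = mu_d - s * R @ mu_s
        return s, R, t

In the setting described above, src would be the generated canonical-frame cloud and dst the observed cloud, so the recovered (s, R, t) give the object's scale and 6D pose.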

Overall, this dissertation contributes novel geometric representation learning frameworks for both instance-level and category-level 6D pose estimation, supported by extensive experiments on widely used benchmarks. The findings not only advance geometric-based pose estimation methods but also open pathways toward unified, generative–discriminative approaches for robust object pose estimation in real-world environments. Such capabilities are critical for enabling reliable robotic manipulation in cluttered or unstructured settings, enhancing perception for autonomous navigation in dynamic scenes, and improving interaction in augmented and mixed reality systems. By bridging fundamental representation learning with practical deployment, this work moves closer to making 6D pose estimation an integral component of real-world intelligent systems.
