Dissertations and Theses

Date of Award

2025

Document Type

Thesis

Department

Computer Science

First Advisor

Shigang Li

Second Advisor

Zhigang Zhu

Keywords

Cognitive Maps, Visual-Language Navigation, Multimodal Alignment, Semantic Matching, Dynamic Programming, Urban Navigation, Scene Interpretation Engineering

Abstract

Visual-Language Navigation (VLN) presents significant challenges for autonomous agents, such as robots and virtual assistants, particularly in complex, dynamic environments where the seamless integration of visual perception and natural language understanding is critical. Traditional VLN systems often struggle with effectively aligning language instructions and visual scene understanding, limiting their adaptability and navigation efficiency.

This thesis proposes a novel Cognitive Map-based framework that addresses these challenges by transforming natural language navigation instructions into structured graph representations. The Cognitive Map consists of nodes representing waypoints, landmarks, decision points, and edges encoding spatial relationships and navigational actions. These maps are generated using Large Language Models (LLMs), specifically GPT-4o, to extract spatial information and construct detailed route topologies. Additionally, panoramic images sourced from Google Street View are processed using Large Multimodal Models (LMMs) to generate comprehensive visual descriptions of the environment.
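The graph structure described above can be sketched in a few lines. This is a hypothetical minimal schema, not the thesis's actual implementation: node kinds and field names are illustrative assumptions.

```python
# Hypothetical minimal Cognitive Map structure: nodes are waypoints,
# landmarks, or decision points; directed edges carry the navigational
# action linking them. Field names are illustrative, not the thesis's schema.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    kind: str          # "waypoint", "landmark", or "decision_point"
    description: str   # spatial phrase extracted from the instruction by the LLM

@dataclass
class CognitiveMap:
    nodes: dict = field(default_factory=dict)   # node_id -> Node
    edges: list = field(default_factory=list)   # (src_id, dst_id, action)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, src: str, dst: str, action: str) -> None:
        self.edges.append((src, dst, action))

# Example route: "turn left at the church, then stop at the blue door"
cmap = CognitiveMap()
cmap.add_node(Node("n0", "decision_point", "the church"))
cmap.add_node(Node("n1", "waypoint", "the blue door"))
cmap.add_edge("n0", "n1", "turn left, walk forward")
```

In this sketch the LLM output would be parsed into `Node` records, and each edge's `action` string records the maneuver between consecutive points along the route.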

The system first applies incremental alignment based on instruction order and heading direction to integrate these modalities. A dynamic programming algorithm then refines the alignment for segments with ambiguity or mismatch, leveraging semantic similarity computed via SBERT. This two-stage process ensures temporal coherence and semantic grounding between text-based instructions and real-world visual environments.
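The dynamic-programming refinement can be illustrated as a monotonic alignment between instruction segments and panorama descriptions, maximizing total similarity while preserving route order. This is a sketch under assumptions: the thesis uses SBERT embeddings, which are stubbed here with simple word overlap, and the exact recurrence may differ.

```python
# Sketch of the DP refinement stage: match each instruction segment to one
# panorama description so that matches stay in route order (temporal
# coherence) and the summed similarity is maximized. SBERT cosine
# similarity is stubbed with Jaccard word overlap for self-containment.

def similarity(segment: str, description: str) -> float:
    """Stand-in for SBERT cosine similarity: Jaccard word overlap."""
    a, b = set(segment.lower().split()), set(description.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def align(segments: list[str], descriptions: list[str]) -> list[int]:
    """Return, for each segment i, the index of its matched panorama,
    with indices non-decreasing along the route."""
    m, n = len(segments), len(descriptions)
    NEG = float("-inf")
    # dp[i][j]: best total score with segment i matched to panorama j
    dp = [[NEG] * n for _ in range(m)]
    back = [[-1] * n for _ in range(m)]
    for j in range(n):
        dp[0][j] = similarity(segments[0], descriptions[j])
    for i in range(1, m):
        for j in range(n):
            for k in range(j + 1):  # monotonic: previous match k <= j
                if dp[i - 1][k] > NEG:
                    score = dp[i - 1][k] + similarity(segments[i], descriptions[j])
                    if score > dp[i][j]:
                        dp[i][j], back[i][j] = score, k
    # Backtrack from the best final panorama
    j = max(range(n), key=lambda c: dp[m - 1][c])
    path = [j]
    for i in range(m - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    return list(reversed(path))
```

For instance, `align(["turn left at the red church", "walk past the park", "stop at the blue door"], ["a red church on the corner", "a green park with trees", "a blue door next to a cafe"])` recovers the in-order matching `[0, 1, 2]`.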

Experiments conducted on the Touchdown and Map2Seq datasets demonstrate improved navigation accuracy, robust scene-language alignment, and increased adaptability compared to traditional VLN methods. The proposed Cognitive Map framework bridges the gap between visual and linguistic modalities, offering a scalable solution for real-world applications such as urban navigation, search and rescue, and augmented reality guidance systems.

Comments

This thesis is under a 12-month embargo to support the submission of related research to AAAI-26. The full text will be publicly available upon expiration of the embargo period.

Available for download on Wednesday, May 20, 2026
