Dissertations and Theses
Date of Award
2025
Document Type
Thesis
Department
Computer Science
First Advisor
Shigang Li
Second Advisor
Zhigang Zhu
Keywords
Cognitive Maps, Visual-Language Navigation, Multimodal Alignment, Semantic Matching, Dynamic Programming, Urban Navigation, Scene Interpretation Engineering
Abstract
Visual-Language Navigation (VLN) presents significant challenges for autonomous agents such as robots and virtual assistants, particularly in complex, dynamic environments where the seamless integration of visual perception and natural language understanding is critical. Traditional VLN systems often struggle to align language instructions with visual scene understanding effectively, which limits their adaptability and navigation efficiency.
This thesis proposes a novel Cognitive Map-based framework that addresses these challenges by transforming natural language navigation instructions into structured graph representations. The Cognitive Map consists of nodes representing waypoints, landmarks, and decision points, and edges encoding spatial relationships and navigational actions. These maps are generated using Large Language Models (LLMs), specifically GPT-4o, to extract spatial information and construct detailed route topologies. Additionally, panoramic images sourced from Google Street View are processed using Large Multimodal Models (LMMs) to generate comprehensive visual descriptions of the environment.
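To make the graph representation concrete, the following is a minimal sketch of a Cognitive Map as typed nodes and attributed edges. The class names, node kinds, and edge attributes are illustrative assumptions based on the abstract, not the thesis's actual schema.

```python
# Minimal sketch of a Cognitive Map: nodes for waypoints, landmarks, and
# decision points; edges carrying a spatial relation and a navigational
# action. All names here are hypothetical, chosen for illustration.
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: int
    kind: str          # "waypoint", "landmark", or "decision_point"
    description: str   # text extracted from the instruction by an LLM

@dataclass
class Edge:
    src: int
    dst: int
    relation: str      # spatial relationship, e.g. "past", "toward"
    action: str        # navigational action, e.g. "go straight", "turn left"

@dataclass
class CognitiveMap:
    nodes: dict[int, Node] = field(default_factory=dict)
    edges: list[Edge] = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        self.edges.append(edge)

# Example: "Walk past the coffee shop and turn left at the intersection."
cmap = CognitiveMap()
cmap.add_node(Node(0, "waypoint", "start of the route"))
cmap.add_node(Node(1, "landmark", "coffee shop"))
cmap.add_node(Node(2, "decision_point", "intersection"))
cmap.add_edge(Edge(0, 1, "past", "go straight"))
cmap.add_edge(Edge(1, 2, "toward", "turn left"))
```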
The system first applies incremental alignment based on instruction order and heading direction to integrate these modalities. A dynamic programming algorithm then refines the alignment for segments with ambiguity or mismatch, leveraging semantic similarity computed via SBERT. This two-stage process ensures temporal coherence and semantic grounding between text-based instructions and real-world visual environments.
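The sketch below illustrates one plausible form of the second-stage refinement: a DTW-style monotonic dynamic program over SBERT cosine similarities between instruction segments and panorama descriptions. The model name and the exact recurrence are assumptions for illustration; the thesis's formulation may differ.

```python
# Hypothetical sketch: monotonic DP alignment of instruction segments to
# panorama descriptions using SBERT embeddings (DTW-style recurrence).
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT variant

def align(instr_segments: list[str], pano_descriptions: list[str]) -> list[tuple[int, int]]:
    """Return (segment, panorama) index pairs forming a monotonic alignment."""
    # Cosine similarity between the two modalities (embeddings are normalized).
    a = model.encode(instr_segments, normalize_embeddings=True)
    b = model.encode(pano_descriptions, normalize_embeddings=True)
    cost = 1.0 - a @ b.T            # lower cost = better semantic match
    m, n = cost.shape

    dp = np.full((m + 1, n + 1), np.inf)
    dp[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Match segment i-1 to panorama j-1, extending the cheapest of:
            # advance both, repeat the segment, or repeat the panorama.
            dp[i, j] = cost[i - 1, j - 1] + min(
                dp[i - 1, j - 1], dp[i, j - 1], dp[i - 1, j]
            )

    # Backtrack to recover the alignment path.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([dp[i - 1, j - 1], dp[i, j - 1], dp[i - 1, j]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            j -= 1
        else:
            i -= 1
    return path[::-1]
```

Monotonicity is the key design choice in such a formulation: because instructions describe the route in order, the DP only needs to consider alignments that never move backward along either sequence.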
Experiments conducted on the Touchdown and Map2Seq datasets demonstrate improved navigation accuracy, robust scene-language alignment, and increased adaptability compared to traditional VLN methods. The proposed Cognitive Map framework bridges the gap between visual and linguistic modalities, offering a scalable solution for real-world applications such as urban navigation, search and rescue, and augmented reality guidance systems.
Recommended Citation
Sandoval Mesa, Alexander, "Cognitive Map Generation for Vision and Language Navigation" (2025). CUNY Academic Works.
https://academicworks.cuny.edu/cc_etds_theses/1274
Included in
Artificial Intelligence and Robotics Commons, Computational Engineering Commons, Data Science Commons, Geological Engineering Commons, Other Computer Engineering Commons, Transportation Engineering Commons

Comments
This thesis is under a 12-month embargo to support the submission of related research to AAAI-26. The full text will be publicly available upon expiration of the embargo period.