Dissertations, Theses, and Capstone Projects
Date of Degree
2-2025
Document Type
Capstone Project
Degree Name
M.S.
Program
Data Analysis & Visualization
Advisor
Matthew K. Gold
Keywords
Multi-Document Analysis, Graph Visualization, Text Corpora, Natural Language Processing, Semantic Encoding, Topic Modeling, Named Entity Recognition, Interactive Data Visualization, Force-Directed Networks, Dimensionality Reduction, t-SNE, UMAP, Situated Knowledges, Ethical Data Practices
Abstract
This thesis introduces the Multi-Document Graph Visualizer (MDGV), a tool designed to analyze and visualize relationships within large-scale text corpora, addressing a critical gap in computational text analysis. By synthesizing state-of-the-art natural language processing (NLP) techniques with advanced interactive data visualization, MDGV empowers researchers across disciplines—including the sciences, social sciences, and humanities—to uncover latent connections, relationships, and patterns in vast and complex document collections. Importantly, MDGV requires limited technical expertise from its users, making it accessible to a broader audience of researchers and practitioners. This approach draws on Haraway's concept of situated knowledges, which emphasizes interpretability and engagement in computational tools (Haraway 577).
At its core, MDGV employs a Python-based backend that leverages BERT-based transformers for semantic encoding, BERTopic for dynamic topic modeling, and spaCy for named entity recognition. This backend processes text corpora to extract semantic, thematic, and structural insights. The processed data is visualized through a Vue.js-powered frontend that supports dynamic interaction and exploration. By incorporating complementary visualization methods, including force-directed entity networks, dimensionality reduction plots (t-SNE and UMAP), hierarchical topic models, and custom topic relationship graphs, MDGV enables users to engage with their data through multiple analytical perspectives, enhancing interpretability and insight. The force-directed entity networks build on techniques described by Bostock et al. for the D3.js framework, enabling dynamic interaction and visual clarity (Bostock et al. 2301).
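As a rough illustration of how named entities might feed a force-directed entity network, the sketch below counts how often two entities appear in the same document and emits a nodes/links structure of the kind commonly passed to a D3 force layout. It is not MDGV's actual code: the function name `build_entity_graph`, the `min_weight` parameter, the co-occurrence rule, and the sample data are all assumptions made for illustration, and the entity lists are assumed to have been produced already by an upstream NER step such as spaCy.

```python
from collections import Counter
from itertools import combinations


def build_entity_graph(doc_entities, min_weight=1):
    """Build a co-occurrence graph from per-document entity lists.

    doc_entities: list of lists of entity strings, one inner list per
    document (hypothetical input; assumed to come from an NER step).
    min_weight: drop links seen in fewer than this many documents.
    """
    node_counts = Counter()
    edge_counts = Counter()
    for entities in doc_entities:
        # Count each entity at most once per document.
        unique = sorted(set(entities))
        node_counts.update(unique)
        # Each unordered pair sharing a document is one co-occurrence.
        edge_counts.update(combinations(unique, 2))

    nodes = [{"id": name, "count": n} for name, n in sorted(node_counts.items())]
    links = [
        {"source": a, "target": b, "value": weight}
        for (a, b), weight in sorted(edge_counts.items())
        if weight >= min_weight
    ]
    return {"nodes": nodes, "links": links}


# Made-up sample input: three documents with already-extracted entities.
sample = [
    ["Ada Lovelace", "Charles Babbage", "London"],
    ["Ada Lovelace", "Charles Babbage"],
    ["London", "Charles Babbage"],
]
graph = build_entity_graph(sample, min_weight=2)
```

With this sample, only pairs that share at least two documents survive as links, while every entity remains a node. Serializing `graph` with `json.dumps` would give a frontend a structure it could hand to a force simulation.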
Key technical contributions of this project include a scalable document processing pipeline that supports multiple input formats (PDF and plaintext), robust mechanisms for entity relationship extraction and topic similarity calculations, and an architecture optimized for extensibility. These innovations allow MDGV to handle document collections ranging from hundreds to tens of thousands of files, making it suitable for use in scenarios such as historical analysis, large-scale literature reviews, policy research, and archival studies.
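The abstract does not say how topic similarity is calculated. One common choice is cosine similarity between topic vectors, sketched below under that assumption. The names `cosine_similarity` and `topic_similarity_edges`, the `threshold` parameter, and the toy vectors are hypothetical; in practice the vectors would come from the topic model rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_u * norm_v)


def topic_similarity_edges(topic_vectors, threshold=0.5):
    """Return (topic_a, topic_b, score) for pairs at or above threshold.

    topic_vectors: dict mapping a topic id to its vector (assumed input).
    """
    ids = sorted(topic_vectors)
    edges = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            score = cosine_similarity(topic_vectors[a], topic_vectors[b])
            if score >= threshold:
                edges.append((a, b, round(score, 3)))
    return edges


# Toy vectors: topics 0 and 1 overlap, topic 2 is orthogonal to both.
topics = {
    0: [1.0, 0.0, 0.0],
    1: [1.0, 1.0, 0.0],
    2: [0.0, 0.0, 1.0],
}
edges = topic_similarity_edges(topics, threshold=0.5)
```

The resulting edge list could serve as the links of a topic relationship graph, with the threshold controlling how dense the graph becomes.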
MDGV contributes to the advancement of computational text analysis by providing a flexible, transparent, and accessible solution for researchers seeking to make sense of complex textual data. This work underscores the importance of integrating robust technical solutions with thoughtful design principles to create tools that are both powerful and user-friendly, setting a new standard for interdisciplinary research in the digital age.
Recommended Citation
Barreda, Atilio II, "Multi-Document Graph Visualizer: Bridging Computational Text Analysis and Interactive Visualization for Large-Scale Text Corpora" (2025). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/6196
Code and data file export from GitHub repository