Dissertations, Theses, and Capstone Projects

Multi-Document Graph Visualizer: Bridging Computational Text Analysis and Interactive Visualization for Large-Scale Text Corpora

Atilio Barreda II, The Graduate Center, City University of New York

Date of Degree

2-2025

Document Type

Capstone Project

Degree Name

M.S.

Program

Data Analysis & Visualization

Advisor

Matthew K. Gold

Keywords

Multi-Document Analysis, Graph Visualization, Text Corpora, Natural Language Processing, Semantic Encoding, Topic Modeling, Named Entity Recognition, Interactive Data Visualization, Force-Directed Networks, Dimensionality Reduction, t-SNE, UMAP, Situated Knowledges, Ethical Data Practices

Abstract

This thesis introduces the Multi-Document Graph Visualizer (MDGV), a tool designed to analyze and visualize relationships within large-scale text corpora, addressing a gap in computational text analysis. By combining current natural language processing (NLP) techniques with interactive data visualization, MDGV helps researchers across disciplines, including the sciences, social sciences, and humanities, uncover latent connections, relationships, and patterns in large and complex document collections. MDGV requires limited technical expertise from its users, making it accessible to a broader audience of researchers and practitioners. This approach draws on Haraway's concept of situated knowledges, which emphasizes interpretability and engagement in computational tools (Haraway 577).

At its core, MDGV employs a Python-based backend that uses BERT-based transformers for semantic encoding, BERTopic for dynamic topic modeling, and spaCy for named entity recognition. This backend processes text corpora to extract semantic, thematic, and structural insights. The processed data is visualized through a Vue.js-powered frontend that supports dynamic interaction and exploration. By incorporating complementary visualization methods (force-directed entity networks, dimensionality reduction plots using t-SNE and UMAP, hierarchical topic models, and custom topic relationship graphs), MDGV lets users examine their data from multiple analytical perspectives, improving interpretability and insight. The force-directed entity networks build on techniques described by Bostock et al. in their work on D3, which enable dynamic interaction and visual clarity (Bostock et al. 2301).
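As an illustration of how named entities might become a force-directed network, the sketch below counts how often two entities appear in the same document and emits a weighted edge list of the kind a graph frontend could consume. It is not taken from the MDGV code base: the function name `cooccurrence_edges`, the `min_weight` threshold, the `source`/`target`/`weight` field names, and the sample entity lists are all assumptions made for this example, and document-level co-occurrence is only one plausible way to derive entity relationships.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(doc_entities, min_weight=1):
    """Count how often each pair of entities appears in the same document.

    doc_entities: one list of entity strings per document (hypothetical
    output of a named entity recognition step).
    """
    counts = Counter()
    for entities in doc_entities:
        # Deduplicate within a document and sort so (a, b) and (b, a) collapse.
        for a, b in combinations(sorted(set(entities)), 2):
            counts[(a, b)] += 1
    # Keep only pairs seen in at least `min_weight` documents.
    return [
        {"source": a, "target": b, "weight": w}
        for (a, b), w in sorted(counts.items())
        if w >= min_weight
    ]


# Hypothetical entity lists for three documents.
docs = [
    ["CUNY", "New York", "Haraway"],
    ["CUNY", "New York"],
    ["Haraway", "CUNY"],
]
edges = cooccurrence_edges(docs, min_weight=2)
```

Raising `min_weight` prunes rare pairs, which is one simple way to keep a network readable as the corpus grows.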

Key technical contributions of this project include a scalable document processing pipeline that supports multiple input formats (PDF and plaintext), robust mechanisms for entity relationship extraction and topic similarity calculations, and an architecture optimized for extensibility. These innovations allow MDGV to handle document collections ranging from hundreds to tens of thousands of files, making it suitable for use in scenarios such as historical analysis, large-scale literature reviews, policy research, and archival studies.
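The abstract does not specify how topic similarity is calculated. A common choice, assumed here purely for illustration, is cosine similarity between topic vectors. In the sketch below, the helper names `cosine_similarity` and `topic_similarity_matrix` and the three toy topic vectors are hypothetical and do not come from the project.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Treat an all-zero vector as similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def topic_similarity_matrix(topic_vectors):
    """Pairwise cosine similarity, keyed by (topic, topic) name pairs."""
    names = sorted(topic_vectors)
    return {
        (s, t): cosine_similarity(topic_vectors[s], topic_vectors[t])
        for s in names
        for t in names
    }


# Toy topic vectors (assumed); real ones would come from the topic model.
topics = {
    "archives": [1.0, 0.0, 1.0],
    "policy": [0.0, 1.0, 0.0],
    "history": [1.0, 0.0, 0.0],
}
sims = topic_similarity_matrix(topics)
```

Pairs whose similarity exceeds a chosen threshold could then be drawn as links in a topic relationship graph.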

MDGV contributes to the advancement of computational text analysis by providing a flexible, transparent, and accessible solution for researchers seeking to make sense of complex textual data. This work underscores the importance of integrating robust technical solutions with thoughtful design principles to create tools that are both powerful and user-friendly, setting a new standard for interdisciplinary research in the digital age.

Recommended Citation

Barreda, Atilio II, "Multi-Document Graph Visualizer: Bridging Computational Text Analysis and Interactive Visualization for Large-Scale Text Corpora" (2025). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/6196

barreda-Multi-Document-Graph-Visualizer-main.zip (14619 kB)
Code and data file export from GitHub repository
