Dissertations, Theses, and Capstone Projects
Date of Degree
9-2021
Document Type
Dissertation
Degree Name
Ph.D.
Program
Linguistics
Advisor
William Sakas
Committee Members
Kyle Gorman
Alla Rozovskaya
Subject Categories
Computational Linguistics
Keywords
homograph disambiguation, label imputation, natural language processing, machine learning, deep learning, token classification
Abstract
This dissertation presents the first implementation of label imputation for homograph disambiguation using 1) transcribed audio and 2) parallel (translated) corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both the audio-based and parallel corpus-based label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using 1) hand-labeled training data and 2) hand-labeled training data augmented with label-imputed data. Two kinds of models are developed for homograph disambiguation: regularized multinomial logistic regression, and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers. Models trained on hand-labeled data augmented with labels imputed from parallel corpora improve over models trained on hand-labeled data alone for classes with low prevalence. Four homograph disambiguation data sets generated during the work on the dissertation are made available to the research community. In addition, this dissertation offers a novel typology of homographs with practical implications for both the label imputation process and homograph disambiguation.
Recommended Citation
Seale, Jennifer M., "Label Imputation for Homograph Disambiguation: Theoretical and Practical Approaches" (2021). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/4518