Date of Degree
homograph disambiguation, label imputation, natural language processing, machine learning, deep learning, token classification
This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio, and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both the audio-based and the parallel corpus-based label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance when training on: 1) hand-labeled data, and 2) hand-labeled data augmented with label-imputed data. Two kinds of model are developed for homograph disambiguation: regularized multinomial logistic regression, and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers. Models trained on hand-labeled data augmented with label-imputed data from parallel corpora improve over models trained on hand-labeled data alone for low-prevalence classes. Four homograph disambiguation data sets generated during the work on the dissertation are made available to the research community. In addition, this dissertation offers a novel typology of homographs with practical implications for both the label imputation process and homograph disambiguation.
Seale, Jennifer M., "Label Imputation for Homograph Disambiguation: Theoretical and Practical Approaches" (2021). CUNY Academic Works.