Date of Degree

9-2021

Document Type

Dissertation

Degree Name

Ph.D.

Program

Linguistics

Advisor

William Sakas

Committee Members

Kyle Gorman

Alla Rozovskaya

Subject Categories

Computational Linguistics

Keywords

homograph disambiguation, label imputation, natural language processing, machine learning, deep learning, token classification

Abstract

This dissertation presents the first implementation of label imputation for the task of homograph disambiguation using 1) transcribed audio and 2) parallel, or translated, corpora. For label imputation from parallel corpora, a hypothesis of interlingual alignment between homograph pronunciations and text word forms is developed and formalized. Both the audio-based and the parallel corpus-based label imputation techniques are tested empirically in experiments that compare homograph disambiguation model performance using 1) hand-labeled training data and 2) hand-labeled training data augmented with label-imputed data. Regularized multinomial logistic regression models and pre-trained ALBERT, BERT, and XLNet language models fine-tuned as token classifiers are developed for homograph disambiguation. Models trained on hand-labeled data augmented with data whose labels were imputed from parallel corpora improve over models trained on hand-labeled data alone in low-prevalence classes. Four homograph disambiguation data sets generated during the work on the dissertation are made available to the research community. In addition, this dissertation offers a novel typology of homographs with practical implications for both the label imputation process and homograph disambiguation.
