Dissertations, Theses, and Capstone Projects
Date of Degree
2-2020
Document Type
Dissertation
Degree Name
Ph.D.
Program
Computer Science
Advisor
Jia Xu
Committee Members
Robert Haralick
Lei Xie
Hui Wan
Subject Categories
Artificial Intelligence and Robotics | Computer Sciences
Keywords
Machine Learning, Deep Learning, Natural Language Processing, Neural Machine Translation
Abstract
This thesis aims at general, robust Neural Machine Translation (NMT) that is agnostic to the test domain. NMT has achieved high quality on benchmarks with closed datasets such as WMT and NIST, but it can fail when the translation input contains noise caused by, for example, mismatched domains or spelling errors. The standard solution is to apply domain adaptation or data augmentation to build a domain-dependent system. In real-world use, however, input noise spans a wide range of domains and types that are unknown in the training phase. This thesis introduces five general approaches to improve NMT accuracy and robustness, three of which are invariant to models, test domains, and noise types. First, we describe Lex-Var, a novel unsupervised text normalization framework that reduces lexical variation for NMT. Then, we apply phonetic encoding as auxiliary linguistic information and obtain a substantial improvement (5 BLEU points) in translation quality and robustness. Furthermore, we introduce a random clustering encoding method, which is based on our hypothesis of Semantic Diversity by Phonetics and generalizes to all languages. We also discuss two domain adaptation models for the case where the test domain is known. Finally, we provide a measure of translation robustness based on the consistency of translation accuracy among samples and use it to evaluate our other methods. All of these approaches are verified with extensive experiments across different languages and achieve significant and consistent improvements in translation quality and robustness over state-of-the-art NMT.
Recommended Citation
Khan, Abdul Rafae, "Robust Neural Machine Translation" (2020). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/3532