Date of Degree

2-2015

Document Type

Thesis

Degree Name

M.A.

Program

Linguistics

Advisor(s)

Martin Chodorow

Subject Categories

Jewish Studies | Linguistics

Keywords

information retrieval, noisy channel, orthographic normalization, Yiddish

Abstract

Yiddish is characterized by a multitude of orthographic systems. A number of approaches to automatic normalization of variant orthography have been explored for processing historical texts in languages whose orthography has since been standardized. However, these approaches have not yet been applied to Yiddish.

Using a manually normalized set of 16 Yiddish documents as a training and test corpus, four techniques for automatic normalization were compared: a hand-crafted set of transformation rules, an off-the-shelf spell checker, edit distance minimization with manually set weights, and edit distance minimization with weights learned from a training set.
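As an illustration of the edit-distance techniques listed above, the following is a minimal sketch of normalization by weighted edit distance minimization. It is not the thesis implementation: the function names, the `sub_cost` table, the default costs, and the romanized example words are all assumptions made for this example, and the sketch handles only single-character insertions, deletions, and substitutions.

```python
def weighted_edit_distance(source, target, sub_cost=None, indel_cost=1.0):
    """Weighted Levenshtein distance between two strings.

    sub_cost maps (source_char, target_char) pairs to a substitution cost;
    unlisted pairs cost 1.0. These weights are hypothetical: they could be
    set by hand or estimated from aligned training pairs.
    """
    sub_cost = sub_cost or {}
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dist[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            a, b = source[i - 1], target[j - 1]
            substitution = 0.0 if a == b else sub_cost.get((a, b), 1.0)
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,        # delete a
                dist[i][j - 1] + indel_cost,        # insert b
                dist[i - 1][j - 1] + substitution,  # match or substitute
            )
    return dist[m][n]


def normalize(word, lexicon, sub_cost=None):
    """Return the lexicon entry with the lowest weighted distance to word."""
    return min(lexicon, key=lambda cand: weighted_edit_distance(word, cand, sub_cost))


# Hypothetical romanized lexicon of standard forms, for illustration only.
lexicon = ["yidish", "yid", "shprakh"]
normalized = normalize("idish", lexicon)
```

Lowering the cost of substitutions that reflect common spelling variation makes the corresponding standard forms cheaper to reach, which is the role the manually set or learned weights play in this kind of approach.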

Performance was evaluated by calculating the proportion of correctly normalized words in a test set, and by measuring precision and recall in a test of information retrieval.
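The two kinds of measure named above can be sketched as follows. This is a generic illustration under stated assumptions, not the thesis's evaluation code: the function names and the sample data are hypothetical, accuracy is computed per word token, and precision and recall are computed over sets of retrieved and relevant document identifiers.

```python
def normalization_accuracy(predicted, gold):
    """Proportion of words whose predicted normalization equals the gold form."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four test words, and document ids for one query.
accuracy = normalization_accuracy(["a", "b", "c", "d"], ["a", "b", "x", "d"])
precision, recall = precision_recall([1, 2, 3, 4], [2, 4, 5])
```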

For the given test corpus, normalization by minimization of edit distance with multi-character edit operations and learned weights was found to perform best in all tests.
