Dissertations, Theses, and Capstone Projects

Date of Degree

6-2016

Document Type

Thesis

Degree Name

M.A.

Program

Linguistics

Advisor

William Sakas

Keywords

morphological segmentation, Tagalog, low-resource languages, infixation

Abstract

In this paper, I present a method for coercing a widely used morphological segmentation algorithm, Morfessor (Creutz and Lagus 2005), into accurately segmenting non-concatenative morphological patterns. The two patterns targeted, infixation and partial reduplication, present problems for many segmentation algorithms. Tools that can identify and segment those patterns can improve a number of downstream natural language processing tasks, including keyword search and machine translation.

Included with this project is an implementation of the segmentation method in the form of a Python library called Infixer. The approach adds a preprocessing step that uses regular expressions to restructure non-concatenative patterns into concatenative ones, which allows the algorithm to operate on the input as it would on any other data. The target language for this project is Tagalog, a major language of the Philippines that makes extensive use of non-concatenative morphology. The training data are drawn from the IARPA Babel Program and the Tagalog Wikipedia.
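To illustrate the kind of restructuring described above, the following is a minimal sketch and not the Infixer library's actual code or API. The function names, the regular expressions, and the `<um>`/`<RED>` marker notation are all assumptions made for this example. It moves a word-initial Tagalog infix (-um- or -in-) and a word-initial reduplicated consonant-vowel syllable to the front of the word, so that a concatenative segmenter sees them as prefix-like material.

```python
import re

# Assumed, simplified inventory: single-letter consonants and five vowels.
# Digraphs such as "ng" are not handled in this sketch.
CONSONANT = "[bcdfghjklmnpqrstvwxyz]"
VOWEL = "[aeiou]"

# Hypothetical pattern: initial consonant, then the infix -um- or -in-,
# then a vowel-initial remainder (e.g. s-um-ulat).
INFIX_RE = re.compile(rf"^({CONSONANT})(um|in)({VOWEL}\w*)$")

# Hypothetical pattern: an initial CV syllable immediately repeated
# (e.g. su-sulat).
REDUP_RE = re.compile(rf"^({CONSONANT}{VOWEL})(\1\w*)$")


def externalize_infix(word):
    """Rewrite C + infix + rest as '<infix>' + C + rest; else return word unchanged."""
    return INFIX_RE.sub(r"<\2>\1\3", word)


def externalize_reduplication(word):
    """Replace a copied initial CV syllable with a '<RED>' marker; else return word unchanged."""
    return REDUP_RE.sub(r"<RED>\2", word)


if __name__ == "__main__":
    for w in ["sumulat", "kinain", "susulat", "sulat"]:
        print(w, externalize_infix(w), externalize_reduplication(w))
```

Patterns this naive will also fire on words that merely look infixed or reduplicated, so a real pipeline would need some way to filter false matches before the rewritten forms are passed to the segmenter.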

The results were promising, especially for the more straightforward cases of infixation. On the data tested, Infixer configurations that used affix regular expressions outperformed those that did not, demonstrating an improved ability to segment data containing non-concatenative morphological forms. I hope that this project can lead to tools that work effectively on languages other than Tagalog and on a more diverse array of phenomena.

steven_butler_suppl_files.zip (17 kB)
4 .py files, described in the text of the paper
