Dissertations, Theses, and Capstone Projects
Date of Degree
6-2024
Document Type
Thesis
Degree Name
M.A.
Program
Linguistics
Advisor
Kyle Gorman
Subject Categories
Computational Linguistics | Jewish Studies | Linguistics
Keywords
Hebrew, natural language processing, language annotation, diacritization, vocalization, niqud
Abstract
Written modern Hebrew presents a particular challenge for training computational language-processing models because modern Hebrew text often lacks vocalization. The scarcity of vocalized Hebrew data can lead to ambiguity in training these models and hinders work on natural language processing problems more generally. The goal of this project is to contribute to the collection of vocalized Hebrew text by collecting and preprocessing a large corpus of unvocalized Hebrew text and by building an online annotation tool. The tool lets users upload unvocalized Hebrew text, annotate it by adding Hebrew vocalization, and download the vocalized text as comma-separated values (CSV) files. This project seeks to advance the field both by collecting vocalized Hebrew text and by building the vocalization annotation interface, which allows a vocalized Hebrew text corpus to grow continuously.
Recommended Citation
Bloch, Rachel Shanblatt, "Expanding the Corpus of Vocalized Hebrew Text: Compiling an Unvocalized Text Corpus and Building an Online Interface for Vocalization Annotation" (2024). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/5891