Dissertations, Theses, and Capstone Projects

Date of Degree

6-2024

Document Type

Thesis

Degree Name

M.A.

Program

Linguistics

Advisor

Kyle Gorman

Subject Categories

Computational Linguistics | Jewish Studies | Linguistics

Keywords

Hebrew, natural language processing, language annotation, diacritization, vocalization, niqud

Abstract

Written modern Hebrew poses a particular challenge for training computational language-processing models because modern Hebrew text often lacks vocalization (niqud). The scarcity of vocalized Hebrew data can introduce ambiguity when training these models and generally hinders work on natural language processing problems. The goal of this project is to contribute to the collection of vocalized Hebrew text by collecting and preprocessing a large corpus of unvocalized Hebrew text and building an online annotation tool. The annotation tool allows users to upload unvocalized Hebrew text, annotate it by adding Hebrew vocalization, and download the vocalized text as comma-separated values (CSV) files. This project seeks to advance Hebrew natural language processing both by collecting vocalized Hebrew text and by building the vocalization annotation interface, which allows a vocalized Hebrew text corpus to grow continuously.
