Dissertations, Theses, and Capstone Projects

Date of Degree

6-2024

Document Type

Dissertation

Degree Name

Ph.D.

Program

Linguistics

Advisor

Kyle Gorman

Committee Members

William Haddican

Michael Mandel

Subject Categories

Computational Linguistics

Keywords

Speech recognition, Arabic, Text Normalization, Large Language Models, Arabic numeral noun morphosyntax

Abstract

This dissertation documents the linguistic features of Arabic that pose challenges to speech and language technologies, and advances these technologies by developing state-of-the-art computational tools for automatic speech recognition (ASR), text normalization (TN), and corpus development. TN converts expressions such as numbers, dates, and times (known as semiotic classes) from their written form to their spoken form, for example converting ‘$84.00’ to ‘eighty-four dollars’, while inverse text normalization (ITN) converts verbalized text back to its written form. TN is an essential preprocessing step for text-to-speech (TTS), and ITN an essential post-processing step for ASR. Arabic presents a challenge for TN and ITN because one must determine not only the correct numerical value and verbalization but also the gender agreement and case form of the converted number. Chapter 2 examines the morphosyntactic features of numeral-noun constructions in Modern Standard Arabic (MSA) and compares certain features with two contemporary Saudi dialects, Najdi and Hijazi. The analysis identifies key ways in which Najdi and Hijazi differ from MSA, suggesting some dialectal divergence from MSA in the number system. Given the lack of existing Arabic text normalization data, Chapter 3 describes the development of a structured TN corpus for Arabic, designed to evaluate and improve TN and ITN models and to establish a robust baseline for Arabic TN and ITN systems. The chapter also provides a computational grammar-based library for Arabic TN and ITN covering a wide range of numeral classes. The Arabic TN system achieves a macro accuracy of 88% and a micro accuracy of 98% across all semiotic classes. Chapter 4 benchmarks the performance of GPT-4 on the Arabic TN task, providing a baseline for future research on LLM-based TN systems. Compared to the rule-based system, the LLM-based model shows an overall decrease in accuracy. Specifically, on the cardinal class it achieves 75% accuracy, compared to 80% for the rule-based system. The chapter also introduces a novel approach that integrates morphosyntactic features, such as gender and case, to enhance Arabic text normalization using an automatic parser and morphological analyzer. Lastly, Chapter 5 details the development of a unified model for multi-variant Arabic ASR that covers Classical Arabic, MSA, and Dialectal Arabic using a state-of-the-art conformer architecture. The ASR models are trained on transcribed speech data, both with and without diacritization, and the research compares recognition performance on diacritized versus non-diacritized speech. Findings indicate that diacritization does not degrade overall ASR performance, especially when the model is trained with sizable diacritized transcripts associated with Quranic speech.
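The abstract reports both a macro and a micro accuracy across semiotic classes. As a rough illustration of how the two figures can diverge, the sketch below assumes the usual definitions: micro accuracy pools all test items, while macro accuracy averages the per-class accuracies with equal weight. The function name, the record format, and the toy data are hypothetical and are not taken from the dissertation, which may define the metrics differently.

```python
from collections import defaultdict


def micro_macro_accuracy(records):
    """Compute (micro, macro) accuracy from (semiotic_class, is_correct) pairs.

    Assumed definitions (not confirmed by the source):
      micro = correct items / all items, pooled over every class
      macro = unweighted mean of the per-class accuracies
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for semiotic_class, is_correct in records:
        totals[semiotic_class] += 1
        correct[semiotic_class] += int(is_correct)

    micro = sum(correct.values()) / sum(totals.values())
    macro = sum(correct[c] / totals[c] for c in totals) / len(totals)
    return micro, macro


# Hypothetical toy data: a frequent class that is mostly right and a rare
# class that is right only half of the time.
toy_records = (
    [("cardinal", True)] * 90
    + [("cardinal", False)] * 10
    + [("date", True)] * 1
    + [("date", False)] * 1
)

micro, macro = micro_macro_accuracy(toy_records)
print(f"micro={micro:.3f} macro={macro:.3f}")
```

Under these assumptions, a micro accuracy well above the macro accuracy, as in the reported 98% versus 88%, would be consistent with the most frequent classes being normalized more reliably than the rarer ones.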
