Date of Degree

9-2022

Document Type

Doctoral Dissertation

Degree Name

Doctor of Philosophy

Program

Computer Science

Advisor

Michael I. Mandel

Committee Members

Rebecca Levitan

Alla Rozovskaya

Brian Kingsbury

Subject Categories

Artificial Intelligence and Robotics | Data Science

Keywords

active learning, data selection, automatic speech recognition, speech enhancement, sound event detection, environmental sound classification

Abstract

There is growing recognition of the importance of data-centric methods for building machine learning systems. Data-centric methods assume a fixed model and iterate over the data to improve system performance. This contrasts with traditional model-centric approaches, which assume a fixed dataset and iterate over models to the same end. Data-centric machine learning is driven by the observation that, beyond the size of the training data, model performance depends on factors such as the quality of the annotations and whether the data are representative of the conditions in which models will be deployed. This is of particular interest in the domains of speech and audio, where it is relatively cheap to acquire large amounts of data but expensive and laborious to annotate the training data or assess its quality. In this work, we investigate and develop methods for identifying highly informative subsets of speech or audio data that improve system performance while reducing the amount of data required for training.

First, we investigate submodular data selection in the context of unsupervised active learning for automatic speech recognition (ASR) of low-resource languages. We demonstrate methods for sampling the acoustic feature space and selecting highly informative and diverse examples to build models with very limited data. Second, we present data selection methods that apply domain knowledge for a speech enhancement system. We demonstrate that linguistic and acoustic characteristics of speech data can be used to select examples that enable a deep neural network (DNN) to learn a more generalizable similarity metric. Last, we investigate data valuation for curation of training data by assessing data quality. In the context of speech recognition, we develop a method for estimating Shapley values of speech data for training an end-to-end neural ASR model, which performs a structured prediction task. We also demonstrate how data valuation can be used for environmental sound classification. In the context of ecoacoustic monitoring of large datasets with limited labels, we estimate Shapley values of audio clips with overlapping sounds for a multi-label classifier. We demonstrate that these values identify high-quality subsets of the training data for improving model performance. We also demonstrate methods for assessing data quality within individual sound classes and identifying annotation errors.
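
As a rough illustration of the data valuation idea mentioned above, the following is a minimal sketch of a generic permutation-sampling (Monte Carlo) Shapley estimator. It is not the dissertation's method for ASR or multi-label classification. The function names (`monte_carlo_shapley`, `toy_utility`) and the `QUALITY` scores are hypothetical, and the additive toy utility stands in for what would in practice be the validation performance of a model trained on the chosen subset.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical per-example "quality" scores used only by the toy utility below.
QUALITY = [1.0, 0.5, -0.5, 0.0]


def toy_utility(subset: Sequence[int]) -> float:
    """Stand-in for model performance when trained on `subset` (assumption).

    Here the utility is simply additive in the per-example scores.
    """
    return sum(QUALITY[i] for i in subset)


def monte_carlo_shapley(
    n_points: int,
    utility: Callable[[Sequence[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = [0.0] * n_points
    order = list(range(n_points))
    for _ in range(n_permutations):
        rng.shuffle(order)
        subset: List[int] = []
        previous = utility(subset)
        for i in order:
            subset.append(i)
            current = utility(subset)
            # Marginal contribution of example i given the examples before it.
            totals[i] += current - previous
            previous = current
    return [total / n_permutations for total in totals]


values = monte_carlo_shapley(len(QUALITY), toy_utility)
# Examples with low or negative estimated value are candidates for removal.
ranked = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
```

With an additive utility, each example's estimated value matches its own score, so the ranking simply orders examples by quality. With a real model-based utility, the number of permutations and the cost of each retraining step become the main practical constraints.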

Recommended Citation

Syed, Ali Raza, "Data-Centric Machine Learning for Speech and Audio" (2022). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/5059

Included in

Artificial Intelligence and Robotics Commons, Data Science Commons

Dissertations, Theses, and Capstone Projects

Data-Centric Machine Learning for Speech and Audio
