Date of Degree

9-2022

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Michael I. Mandel

Committee Members

Rebecca Levitan

Alla Rozovskaya

Brian Kingsbury

Subject Categories

Artificial Intelligence and Robotics | Data Science

Keywords

active learning, data selection, automatic speech recognition, speech enhancement, sound event detection, environmental sound classification

Abstract

There is growing recognition of the importance of data-centric methods for building machine learning systems. Data-centric methods assume a fixed model and iterate over the data to improve system performance. This contrasts with traditional model-centric approaches, which assume a fixed dataset and iterate over models to the same end. Data-centric machine learning is driven by the observation that, beyond the size of the training data, model performance depends on factors such as the quality of the annotations and whether the data are representative of the conditions in which models will be deployed. This is of particular interest in the domains of speech and audio, where it is relatively cheap to acquire large amounts of data but expensive and laborious to annotate the training data or assess its quality. In this work, we investigate and develop methods for identifying highly informative subsets of speech or audio data that improve system performance while reducing the amount of data required for training.

First, we investigate submodular data selection in the context of unsupervised active learning for automatic speech recognition (ASR) of low-resource languages. We demonstrate methods for sampling the acoustic feature space and selecting informative and diverse examples to build models with very limited data. Second, we present data selection methods that apply domain knowledge to a speech enhancement system. We demonstrate that linguistic and acoustic characteristics of speech data can be used to select examples that enable a deep neural network (DNN) to learn a more generalizable similarity metric. Finally, we investigate data valuation as a way to curate training data by assessing its quality. In the context of speech recognition, we develop a method for estimating Shapley values of speech data for training an end-to-end neural ASR model, which performs a structured prediction task. We also demonstrate how data valuation can be used for environmental sound classification. In the context of ecoacoustic monitoring, where datasets are large and labels are limited, we estimate Shapley values of audio clips with overlapping sounds for a multi-label classifier. We demonstrate that these values identify high-quality subsets of the training data that improve model performance. We also demonstrate methods for assessing data quality within individual sound classes and for identifying annotation errors.
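As a rough illustration of what estimating Shapley values of training examples involves, the sketch below uses generic permutation sampling: each example is credited with its average marginal change in a utility score over random orderings of the data. This is not the estimator developed in the dissertation, which the abstract does not specify. The function name `monte_carlo_shapley`, the `utility` callback (a stand-in for "train a model on this subset and score it"), and the toy additive utility are all assumptions made for illustration.

```python
import random
from typing import Callable, FrozenSet, List


def monte_carlo_shapley(
    n_items: int,
    utility: Callable[[FrozenSet[int]], float],
    n_permutations: int = 200,
    seed: int = 0,
) -> List[float]:
    """Estimate per-item Shapley values by averaging marginal contributions
    over randomly sampled orderings of the items (hypothetical helper)."""
    rng = random.Random(seed)
    values = [0.0] * n_items
    for _ in range(n_permutations):
        order = list(range(n_items))
        rng.shuffle(order)
        subset: List[int] = []
        prev_score = utility(frozenset())  # score with no training data
        for i in order:
            subset.append(i)
            score = utility(frozenset(subset))
            values[i] += score - prev_score  # marginal contribution of item i
            prev_score = score
    return [v / n_permutations for v in values]


# Toy utility (an assumption): each clip adds a fixed amount to the score,
# and a negative weight mimics a mislabeled clip that hurts performance.
clip_weights = [1.0, 2.0, -0.5]


def toy_utility(subset: FrozenSet[int]) -> float:
    return sum(clip_weights[i] for i in subset)


estimates = monte_carlo_shapley(len(clip_weights), toy_utility)
print(estimates)
```

In a real setting, `utility` would retrain or fine-tune a model on the chosen subset and return a validation metric, which is why practical estimators rely on approximations. Examples with low or negative estimated values are candidates for removal or for a check of their annotations.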

This work is embargoed and will be available for download on Saturday, September 30, 2023.

