Dissertations, Theses, and Capstone Projects

Date of Degree

6-2024

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Jun Li

Committee Members

Ping Ji

Liang Zhao

Deng Pan

Subject Categories

Computer Sciences

Keywords

Machine Learning, Distributed Computing, Fault Tolerance, Coding Theory

Abstract

The advancement of artificial intelligence has facilitated the generation of vast datasets, whose size often exceeds tens of terabytes. Meanwhile, machine learning has become the most important technique for big data analytics, and sophisticated models with thousands or even millions of parameters are designed to leverage big data. However, processing big data and training large models require significant storage and computational resources, which has become a bottleneck in the development of artificial intelligence. Distributed computing frameworks, such as MapReduce, have been proposed to surmount the computational bottlenecks of single-machine analysis.

Although distributed computing harnesses the collective power of numerous less powerful machines, it is challenging in practice due to stragglers: nodes that temporarily underperform for various reasons, such as resource contention or network errors, and thereby significantly delay training progress. Assigning identical tasks to multiple workers is a straightforward way to mitigate the impact of stragglers, but such replication demands prohibitive additional resources. Coded distributed computing, which introduces parity tasks, has been proposed to tolerate the same number of stragglers as replication while using fewer additional resources. Recent literature has explored multiple aspects of distributed machine learning, such as coded matrix multiplication and gradient coding, and proposed numerous coding techniques to alleviate the impact of stragglers. However, since many of these codes are designed purely in theory, numerous challenges remain in practice. In this dissertation, we study such practical challenges and focus on bridging the gap between theoretical solutions and practical constraints to speed up coded distributed machine learning.
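As a toy illustration of the coding idea (not the dissertation's own construction): to compute y = Ax across two workers, replicating every task would require two extra workers to tolerate one straggler, whereas a single parity task tolerates one straggler with only one extra worker. A minimal numpy sketch, assuming the master splits A into two row blocks:

```python
import numpy as np

# Toy (3, 2) MDS-coded matrix-vector multiplication: any 2 of the 3
# worker results suffice, so one straggler is tolerated with a single
# extra task instead of full replication.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

A1, A2 = A[:2], A[2:]           # master splits A into two row blocks
tasks = {
    "w1": A1,
    "w2": A2,
    "w3": A1 + A2,              # parity task
}
results = {w: M @ x for w, M in tasks.items()}

# Suppose w2 straggles: recover A2 @ x from the parity result instead.
y1 = results["w1"]
y2 = results["w3"] - results["w1"]
assert np.allclose(np.concatenate([y1, y2]), A @ x)
```

Any two of the three results reconstruct Ax, so the slowest worker never delays the computation.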

We first explore the re-encoding issue, which has been overlooked yet is significant in practice. As multiple jobs share resources in a distributed infrastructure, it is common for their performance to fluctuate over time. However, existing coding techniques for distributed matrix multiplication do not support flexibly changing the coding scheme or its parameters without receiving additional data. Therefore, we propose a framework of local re-encoding that allows the coding scheme and its parameters for distributed matrix multiplication to be changed without incurring additional traffic.
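As a simplified, hypothetical illustration of the underlying principle (the dissertation's framework is more general): whenever the coded block requested under a new code lies in the linear span of the coded blocks a worker already stores, the worker can produce it by a purely local linear combination, with no additional data transfer.

```python
import numpy as np

rng = np.random.default_rng(1)
A1 = rng.standard_normal((2, 3))
A2 = rng.standard_normal((2, 3))

# Coded blocks this worker already stores under the old code:
# c1 = 1*A1 + 1*A2,  c2 = 1*A1 + 2*A2
G_old = np.array([[1.0, 1.0],
                  [1.0, 2.0]])
c1 = G_old[0, 0] * A1 + G_old[0, 1] * A2
c2 = G_old[1, 0] * A1 + G_old[1, 1] * A2

# The new code asks this worker for p = 3*A1 + 5*A2.
g_new = np.array([3.0, 5.0])

# Solve for local combination coefficients [a, b] with [a, b] @ G_old = g_new.
a, b = np.linalg.solve(G_old.T, g_new)
p_local = a * c1 + b * c2       # re-encode from local blocks only

assert np.allclose(p_local, 3 * A1 + 5 * A2)
```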

We then study coding techniques for distributed stochastic gradient descent, another important application of distributed machine learning. To address the issue of stragglers, existing techniques either use gradient coding to recover the full gradient from a fixed number of workers or use ignore-straggler stochastic gradient descent to recover a partial gradient from an arbitrary number of workers. To enhance the flexibility of gradient coding, we introduce ignore-straggler gradient coding, which tolerates an arbitrary number of stragglers and maximizes gradient recovery given the specified number of stragglers.
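For context, the standard gradient-coding construction from the literature with three workers and one tolerated straggler works as follows: each worker computes the gradients of two data partitions and sends a single linear combination, and the sum g1 + g2 + g3 can be decoded from any two workers. A minimal sketch of this baseline (not the ignore-straggler scheme proposed in the dissertation):

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal((3, 5))    # per-partition gradients g1, g2, g3

# Encoding matrix: worker i computes the gradients of two partitions and
# sends B[i] @ g; any 2 of the 3 coded messages recover g1 + g2 + g3.
B = np.array([[0.5, 1.0, 0.0],     # worker 1 holds partitions 1, 2
              [0.0, 1.0, -1.0],    # worker 2 holds partitions 2, 3
              [0.5, 0.0, 1.0]])    # worker 3 holds partitions 1, 3
messages = B @ g

full = g.sum(axis=0)
for survivors in [(0, 1), (0, 2), (1, 2)]:
    B_sub = B[list(survivors)]
    # Find decoding coefficients a with a @ B_sub = [1, 1, 1].
    a, *_ = np.linalg.lstsq(B_sub.T, np.ones(3), rcond=None)
    assert np.allclose(a @ messages[list(survivors)], full)
```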

Although gradient coding addresses the issue of stragglers by duplicating dataset partitions across workers, it requires each worker to compute gradients over multiple partitions in every step, which can extend computation time despite mitigating stragglers. Therefore, we propose pipelined gradient coding to effectively mitigate the impact of stragglers without increasing the computational workload on each worker. Through simulations and experiments, we demonstrate that pipelined gradient coding maintains the same computational efficiency as distributed stochastic gradient descent and outperforms existing schemes in reducing overall training time and achieving faster convergence.
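The tradeoff that motivates this design can be seen in a small, hypothetical simulation (an illustration of the overhead of conventional gradient coding, not of the pipelined scheme itself): with one tolerated straggler, each worker performs twice the per-step computation of plain distributed stochastic gradient descent, and whether that pays off depends on how often and how badly workers straggle.

```python
import numpy as np

# Simulated per-step times illustrating the overhead of conventional
# gradient coding: it masks one straggler per step but doubles each
# worker's computation (two partitions instead of one).
rng = np.random.default_rng(3)
n, steps = 12, 10_000

base = rng.exponential(1.0, size=(steps, n))   # nominal compute time
straggle = rng.random((steps, n)) < 0.05       # 5% chance a worker straggles
t = base * np.where(straggle, 10.0, 1.0)       # stragglers run 10x slower

sgd = t.max(axis=1)                            # plain SGD waits for every worker
gc = np.sort(2.0 * t, axis=1)[:, n - 2]        # 2x work, wait for fastest n-1

print(f"plain distributed SGD : {sgd.mean():.2f} per step")
print(f"gradient coding (s=1) : {gc.mean():.2f} per step")
```

Pipelined gradient coding aims to keep the straggler tolerance of the coded scheme while holding each worker's per-step workload at the uncoded level.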

This work is embargoed and will be available for download on Sunday, June 01, 2025
