Dissertations, Theses, and Capstone Projects

Author

Di He

Date of Degree

6-2021

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Lei Xie

Committee Members

Liang Zhao

Matluba Khodjaeva

Jia Xu

Subject Categories

Artificial Intelligence and Robotics | Data Science

Keywords

Deep Learning, Machine Learning, Transfer Learning, Representation Learning, Multi-Omics

Abstract

Machine learning has made significant contributions to bioinformatics and computational biology. In particular, supervised learning approaches have been widely used to solve problems such as biomarker identification and drug response prediction. However, because comprehensively labeled and clean data are in limited supply, constructing predictive models in supervised settings is not always desirable or possible, especially with data-hungry, red-hot learning paradigms such as deep learning. Hence, there is an urgent need for new approaches that can leverage more readily available unlabeled data to drive successful machine learning applications in this area.

In my dissertation, I focused on exploring and designing deep learning-based unsupervised representation learning methods. A common scheme across these methods is that they construct a low-dimensional space from unlabeled raw datasets and then leverage the learned low-dimensional embedding, explicitly or implicitly, for diverse downstream supervised tasks. Although progress has been made in recent years, most deep learning applications in biomedical studies are still in their infancy. It remains challenging to fully extract the biologically meaningful information from a biomedical dataset, such as multi-omics data, to support predictive modeling for practical tasks of interest. To improve the biological relevance of learned representations, innovative approaches are needed that better integrate multi-omics data and exploit their specific characteristics and natural "annotations".

Hence, we proposed two approaches, namely, the Cross LEvel Information Transmission (CLEIT) network and the Coherent Cell-line Tissue Deconfounding Autoencoder (CODE-AE). Specifically, CLEIT aims to leverage the hierarchical relationships among omics data at different levels to drive biologically meaningful representation learning, and CODE-AE learns biologically meaningful representations by explicitly de-confounding factors such as data source origin. As the benchmark results showed, these two methods improve knowledge transfer across multi-omics data and between in vitro and in vivo samples, respectively, and significantly boost performance in the drug response prediction task. Thus, they are potentially powerful tools for precision medicine and drug discovery.
