Date of Degree


Document Type


Degree Name



Computer Science


Lie Xie

Committee Members

Xiangdong Li

Louis D'Alotto

Jiangtao Gou

Subject Categories

Artificial Intelligence and Robotics | Biological Engineering | Computational Engineering | Computer Engineering | Data Science | Other Engineering


Computational prediction of the phenotypic response of a biological system to chemical perturbation plays an important role in drug discovery and many other applications. Chemical fingerprints derived from chemical structures are widely used features for building machine learning models. However, fingerprints ignore the biological context and therefore suffer from several problems, such as activity cliffs and the curse of dimensionality. Fundamentally, the chemical modulation of biological activities is a multi-scale process: genome-wide chemical-target interactions modulate chemical phenotypic responses. Thus, the genome-scale chemical-target interaction profile will correlate more directly with in vitro and in vivo activities than the chemical structure does. Nevertheless, direct application of the chemical-target interaction profile is limited by the severe incompleteness, bias, and noisiness of bioassay data. To address these problems, we developed two new chemical and protein representation methods in this thesis. The first is the Latent Target Interaction Profile (LTIP). LTIP embeds chemicals into a low-dimensional continuous latent space that represents genome-scale chemical-target interactions. LTIP can then be used as a feature to build machine learning models. Using the drug sensitivity of cancer cell lines as a benchmark, we show that LTIP robustly outperforms chemical fingerprints regardless of the machine learning algorithm. Moreover, LTIP is complementary to chemical fingerprints: combining LTIP with other fingerprints further improves the performance of bioactivity prediction. We also developed a new protein sequence embedding method, Distilled Sequence Alignment Embedding (DISAE), to represent proteins. We compared CGKronRLS to other machine learning algorithms, including Random Forest and XGBoost, for predicting drug-target interactions.
We show how the resulting deep protein representations can be used to predict novel drug-protein pair interactions, which can improve drug safety and open many avenues for drug repurposing. Our results demonstrate the potential of LTIP in particular, and of multi-scale modeling in general, for predictive modeling of the chemical modulation of biological activities. They also show the predictive power of DISAE, which can be further improved through deep learning models.
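
The idea of embedding chemicals into a low-dimensional latent space of chemical-target interactions, and then combining that embedding with fingerprints, can be illustrated with a small sketch. This is not the LTIP method from the thesis: it substitutes a plain truncated SVD for whatever embedding model the thesis uses, it assumes a fully observed interaction matrix (real bioassay data are sparse and noisy), and the function names `latent_profiles` and `combine_features`, the matrix sizes, and the random toy data are all hypothetical.

```python
import numpy as np


def latent_profiles(interactions: np.ndarray, dim: int) -> np.ndarray:
    """Embed each chemical (row) into `dim` latent dimensions.

    Assumption: truncated SVD stands in for the thesis's embedding model.
    """
    # Center each target column so the embedding captures variation
    # across chemicals rather than the per-target mean.
    centered = interactions - interactions.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Keep the top `dim` components, scaled by their singular values.
    return u[:, :dim] * s[:dim]


def combine_features(latent: np.ndarray, fingerprints: np.ndarray) -> np.ndarray:
    """Concatenate latent profiles with fingerprint bits, one row per chemical."""
    return np.hstack([latent, fingerprints])


# Toy data (hypothetical): 6 chemicals, 20 targets, 16-bit fingerprints.
rng = np.random.default_rng(0)
interactions = rng.random((6, 20))
fingerprints = rng.integers(0, 2, size=(6, 16)).astype(float)

latent = latent_profiles(interactions, dim=3)
features = combine_features(latent, fingerprints)
print(latent.shape, features.shape)
```

The combined `features` matrix could then be passed to any standard regressor or classifier, which mirrors the claim that the latent profile is usable independently of the downstream learning algorithm.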