Dissertations, Theses, and Capstone Projects

Date of Degree


Document Type


Degree Name



Program

Computer Science


Advisor

Lei Xie

Committee Members

Michael I. Mandel

Liang Zhao

Bin Chen

Subject Categories

Artificial Intelligence and Robotics | Biochemistry, Biophysics, and Structural Biology


Keywords

Deep Learning, Out-of-Distribution generalization, dark protein, drug discovery


Abstract

Dark protein illumination is a fundamental challenge in drug discovery: the majority of human proteins are understudied, i.e., their sequence is known but no small-molecule binder has been identified. This is a major roadblock to shifting the drug discovery paradigm from single-targeted, which seeks to identify a single target and design a drug to regulate it, to multi-targeted, as viewed from a Systems Pharmacology perspective. Diseases such as Alzheimer's and Opioid Use Disorder, which plague millions of patients, call for effective multi-targeted approaches that involve dark proteins. Predicting dark protein properties from limited protein data requires deep learning systems capable of Out-of-Distribution (OOD) generalization, a problem that hinders the application and adoption of deep learning in real-world settings. The classic deep learning setting, in contrast, assumes that training and testing data are independent and identically distributed (iid). A well-trained model with a reported 98% accuracy under the iid setting can deteriorate to worse than random guessing when deployed on OOD data that differ significantly from the training data. Numerous techniques have emerged in the research field, but each addresses only a specific OOD scenario rather than the general case. Dark protein illumination is also uniquely complex compared with common deep learning tasks because it has three OOD axes: protein-OOD, compound-OOD, and interaction-OOD. Previous research has focused only on compound-OOD: new compound design algorithms have been developed, but still for only 500 common proteins rather than the 20,000 proteins of the whole human genome, and only for the single-targeted paradigm rather than the multi-targeted one. Focusing on this instrumental problem in drug discovery, this work introduces dark protein function illumination from the OOD perspective. A series of dark protein OOD algorithms is developed to predict dark protein-ligand interactions, adapting multiple instrumental deep learning techniques to the biological context.
By proposing the dark protein illumination problem, highlighting the neglected OOD axes, and demonstrating what is possible, this work offers new hope for treating numerous diseases.
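
To make the contrast between the iid setting and the OOD axes concrete, the following is a minimal sketch of how an evaluation split could be built so that the test set is out-of-distribution along one axis. It is an illustration only, not the dissertation's method: the function name `entity_holdout_split`, the `(protein, compound, label)` record layout, and the toy data are all assumptions. The sketch covers the protein and compound axes; the interaction axis is left out because the abstract does not define it.

```python
import random


def entity_holdout_split(records, axis, test_frac=0.25, seed=0):
    """Split (protein, compound, label) records along one OOD axis.

    Hypothetical helper: every protein (or compound) chosen for the test
    set is removed from training entirely, so the model is evaluated only
    on entities it has never seen. An iid split would instead shuffle
    individual records and let the same entity land on both sides.
    """
    index = {"protein": 0, "compound": 1}[axis]
    entities = sorted({record[index] for record in records})
    rng = random.Random(seed)
    n_test = max(1, int(len(entities) * test_frac))
    held_out = set(rng.sample(entities, n_test))
    train = [r for r in records if r[index] not in held_out]
    test = [r for r in records if r[index] in held_out]
    return train, test


# Toy data (assumed): 4 proteins x 5 compounds with arbitrary binary labels.
records = [(f"P{i}", f"C{j}", (i + j) % 2) for i in range(4) for j in range(5)]

train, test = entity_holdout_split(records, axis="protein")
train_proteins = {protein for protein, _, _ in train}
test_proteins = {protein for protein, _, _ in test}
print(len(train), len(test), train_proteins & test_proteins)
```

With `axis="protein"` the held-out proteins play the role of dark proteins: no training record mentions them, which is what separates this evaluation from the iid setting described above.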