Biochemistry | Biophysics | Structural Biology
Machine Learning, Virtual Screening, Solvation Theory, Drug Discovery
Drug discovery is a notoriously expensive and time-consuming process; hence, developing computational methods to facilitate the discovery process and lower the associated costs is a long-sought goal of computational chemists. Protein-ligand binding, which provides the physical and chemical basis for the mechanism of action of most drugs, occurs in an aqueous environment, and binding affinity is determined not only by atomic interactions between the protein and ligand but also by changes in their interactions with surrounding water molecules that occur upon binding. Thus, a quantitative understanding of the roles water molecules play in the protein-ligand binding process is an essential foundation for developing computational methods and tools to aid the drug discovery process.
Grid inhomogeneous solvation theory (GIST) is a tool that measures the thermodynamic and structural properties of water molecules on protein surfaces. Since its implementation, GIST has been used to study water behavior upon protein-ligand binding and to account for solvent effects in scoring functions used in virtual screening. This thesis comprises two research projects that extend the applications and functionality of GIST. In the first project, we investigated whether the water properties measured by GIST could improve the performance of machine learning models, specifically convolutional neural networks (CNNs), applied to virtual screening (GIST-CNN project). In the second project, we implemented the particle mesh Ewald (PME) algorithm for energy calculation in GIST, making GIST a more accurate and more efficient tool for end-state free energy calculation (PME-GIST project).
The GIST-CNN project arose in response to reports that CNN models could outperform classical scoring functions in virtual screening. We noticed that all the reported machine learning models had been trained only on protein-ligand structures, with water molecules neglected entirely. Given that water molecules play essential roles in protein-ligand binding, we hypothesized that adding water features, measured by GIST, to the training data would further improve the enrichment efficiency of CNN models. Contrary to our hypothesis, we found that adding water features did not further improve the performance of a CNN model trained on protein-ligand structures, which was already very high. However, further investigation revealed that the high performance and reported enrichment efficiency of a CNN model trained on protein-ligand information were solely attributable to biases in the Database of Useful Decoys-Enhanced (DUD-E), which was used to train and test the model. In this project, we also established a suite of methods to investigate what a model learns from its input during training, and we argued that machine learning models should be thoroughly validated before being applied in real drug discovery projects.
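For readers unfamiliar with the metric, the following is a minimal sketch of how enrichment in a virtual screen is commonly quantified; it is not the evaluation code used in the thesis. The function name `enrichment_factor`, the `fraction` parameter, and the toy scores are hypothetical, chosen only for illustration.

```python
def enrichment_factor(scores, labels, fraction=0.01):
    """Enrichment factor at a given top fraction of the ranked library.

    scores: model scores, higher meaning "more likely active".
    labels: 1 for a known active, 0 for a decoy.
    Returns (active rate in the top fraction) / (active rate overall).
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and equal in length")
    n_total = len(scores)
    n_actives = sum(labels)
    if n_actives == 0:
        raise ValueError("at least one active is required")
    # Size of the top-ranked subset; always keep at least one compound.
    n_top = max(1, int(round(n_total * fraction)))
    # Rank compounds from best score to worst.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (n_actives / n_total)


# Toy library (hypothetical): 2 actives among 10 compounds, both ranked first.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
toy_labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
toy_ef = enrichment_factor(toy_scores, toy_labels, fraction=0.2)
```

A high value of this metric says only that actives were ranked ahead of decoys in the test set. It does not reveal whether the ranking came from learned binding physics or from systematic differences between actives and decoys, which is the distinction examined in this project.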
The motivations for the PME-GIST project were twofold. First, although GIST provides a statistical thermodynamic framework for end-state free energy calculation, inconsistencies in energy calculations between the previous GIST implementation (GIST-2016) and modern molecular dynamics engines prevented precise comparison of the GIST end-state method with reference free energy calculation methods such as thermodynamic integration (TI). Second, the O(N^2) nonbonded energy calculation is the most expensive step in the entire GIST calculation process. By implementing the PME algorithm in GIST, we aimed to make GIST energy calculations consistent with those of modern molecular dynamics engines and to reduce the cost of the energy calculation to O(N log N), which is highly desirable when applying GIST to measure water properties across an entire protein surface. In addition to implementing PME, we derived a simple empirical estimator for the high-order entropies, which are truncated in GIST. After incorporating PME-based energy calculation and the high-order entropy estimator, we used PME-GIST to calculate end-state solvation free energies for a wide range of small molecules and achieved results highly consistent with TI (= 0.99, mean unsigned difference = 0.44 kcal/mol). The PME-GIST code we developed in this project was integrated into the open-source molecular dynamics analysis software CPPTRAJ for easy access by others in the drug discovery community.
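To illustrate why a direct nonbonded calculation scales quadratically, here is a minimal sketch of a pairwise Coulomb sum over point charges. It is not the GIST or CPPTRAJ implementation: it assumes an open (non-periodic) system with no cutoff, the function name `direct_coulomb_energy` is hypothetical, and the conversion constant is an approximate value used only for illustration.

```python
import math
from itertools import combinations

# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2).
# Assumption for illustration; simulation packages use slightly different values.
COULOMB_KCAL = 332.0637


def direct_coulomb_energy(positions, charges):
    """Sum q_i * q_j / r_ij over every unique pair of particles.

    positions: list of (x, y, z) tuples in Angstrom.
    charges:   list of partial charges in units of e.
    The loop visits N * (N - 1) / 2 pairs, so the cost grows as O(N^2).
    """
    if len(positions) != len(charges):
        raise ValueError("positions and charges must have the same length")
    energy = 0.0
    for i, j in combinations(range(len(positions)), 2):
        r_ij = math.dist(positions[i], positions[j])
        energy += COULOMB_KCAL * charges[i] * charges[j] / r_ij
    return energy


# Two opposite unit charges 1 Angstrom apart (toy input).
pair_energy = direct_coulomb_energy([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, -1.0])
```

PME avoids this all-pairs loop by splitting the electrostatic sum into a short-range part evaluated in real space within a cutoff and a smooth long-range part evaluated on a grid with fast Fourier transforms, which is the source of the O(N log N) scaling.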
In summary, in this thesis we explored the potential of adding solvation thermodynamics to machine learning-based virtual screening and found that the high performance reported for machine learning models in this application reflected biases in the dataset used to construct and test them rather than successful generalization of the physical principles that govern molecular interactions. We also addressed the inconsistent energy calculation between GIST and modern molecular simulation engines by developing PME-GIST. We hope the research presented in this thesis will further expand and accelerate the application of GIST to drug discovery.
Chen, Lieyang, "Machine Learning and Solvation Theory for Drug Discovery" (2021). CUNY Academic Works.