Dissertations, Theses, and Capstone Projects
Date of Degree
6-2026
Document Type
Master's Capstone Project
Degree Name
Master of Science
Program
Data Analysis & Visualization
Advisor
Howard Everson
Subject Categories
Biochemistry | Data Science | Other Mathematics
Keywords
Machine Learning, Clinical Trial Termination, Trial Protocol Analysis, Clinical Trials, Trial2Vec, Natural Language Processing, Random Forest, Embedding, Explanatory Analysis
Abstract
About one in five clinical trials in medicine ends early, wasting valuable resources and reducing the evidence available for developing life-saving medical treatments. This project uses Trial2Vec, a self-supervised machine-learning method that converts clinical trial documents into dense numerical representations capturing their key design and clinical characteristics, to turn each proposed clinical trial’s written protocol into a compact numerical profile (a process referred to as embedding). These profiles are then paired with predictive machine-learning models to identify the words and phrases in the trial documents that can signal a higher risk of premature termination, potentially explaining which protocol characteristics contribute to a trial ending early.
Drawing on 25 years of clinical trials cataloged in the ClinicalTrials.gov database, the project was designed to do the following: (i) measure how strongly specific wording in core sections of the trial protocols (such as titles, brief summaries, listed conditions, intervention or treatment descriptions, and outcome measures) predicts early termination, focusing on the influence of language and syntax alone; (ii) create clear visual maps revealing hidden clusters of related words and phrases that tend to occur in trial protocols associated with a higher-than-average risk of early termination; and (iii) develop an interactive dashboard that highlights phrases and language associated with a higher risk of premature termination. The work combines natural-language representation learning with AI models to provide a data-driven decision-support tool for clinical researchers, sponsors, and policymakers. The ultimate aim is to strengthen the design of clinical trial protocols so as to reduce early termination of clinical trials in medicine and make better use of limited research resources.
Recommended Citation
Ramnarain, Rohan, "Identifying Textual Predictors of Early Termination in Clinical Trials in Medicine: An Explainable Machine-Learning Study" (2026). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/6761
Code repository exported from GitHub

Comments
Online component: https://rohanramnarain.github.io/Trial2Vec-main-2/