Dissertations, Theses, and Capstone Projects

Date of Degree

6-2026

Document Type

Master's Capstone Project

Degree Name

Master of Science

Program

Data Analysis & Visualization

Advisor

Howard Everson

Subject Categories

Biochemistry | Data Science | Other Mathematics

Keywords

Machine Learning, Clinical Trial Termination, Trial Protocol Analysis, Clinical Trials, Trial2Vec, Natural Language Processing, Random Forest, Embedding, Explanatory Analysis

Abstract

About one in five clinical trials in medicine ends early, wasting valuable resources and reducing the evidence available for developing life-saving medical treatments. This project uses Trial2Vec, a self-supervised machine-learning method that converts clinical trial documents into dense numerical representations capturing their key design and clinical characteristics, to turn each proposed clinical trial’s written protocol into a compact numerical profile (a process referred to as embedding). These profiles are then paired with predictive machine learning models to identify the words and phrases in the trial documents that can signal a higher risk of premature termination, potentially explaining which protocol characteristics contribute to the premature termination of a clinical trial.

Drawing on 25 years of clinical trials cataloged in the ClinicalTrials.gov database, the project was designed to do the following: (i) measure how strongly specific wording in core sections of the trial protocols (titles, brief summaries, listed conditions, intervention or treatment descriptions, and outcome measures) predicts early termination, focusing on the influence of language and syntax alone; (ii) create clear visual maps revealing hidden clusters of related words and phrases that tend to occur in trial protocols associated with a higher-than-average risk of early termination; and (iii) develop an interactive dashboard that highlights phrases and language associated with a higher risk of premature termination. The work combines natural-language representation learning with AI models to provide a data-driven decision-support tool for clinical researchers, sponsors, and policymakers. The ultimate aim is to strengthen the design of clinical trial protocols, reducing early termination of clinical trials in medicine and making better use of limited research resources.

Trial2Vec-main2.zip (352838 kB)
Code repository exported from GitHub
