Date of Degree


Document Type


Degree Name



Social Welfare


Michael A. Lewis

Committee Members

Alexis Kuerbis

Harriet Goodman

Maria Rodriguez

Hal Salzman

Subject Categories

Quantitative, Qualitative, Comparative, and Historical Methodologies | Social Statistics | Social Work


education research, machine-learning, causal inference, causal estimation, STEM education, STEM research


Estimation methods that identify causal relationships between dependent and independent variables are fundamental to social science research. For social workers, these methods provide crucial knowledge about how different factors relate to a particular issue. Such knowledge helps social workers be more effective change agents at the micro, mezzo, and macro levels.

Different causal estimation methods exist, from randomized controlled studies to methods involving observational studies. In observational studies, which are the focus of this dissertation, participants self-select into the intervention or treatment. As a result, there are observed and unobserved differences between participants in the intervention and control groups, which makes causal estimation more challenging. One dominant and well-known method to address this challenge is propensity score matching.

Logistic regression has traditionally been the primary approach to calculating propensity scores. However, other approaches, particularly those using machine learning models, are becoming more prominent. Although this line of work is still in its nascent stage, several studies have applied machine learning to causal estimation using simulated and actual data. Nevertheless, research in this area still needs to be expanded. This study follows a similar approach, comparing two machine learning models to the logistic regression model, and thus aims to add to this knowledge.
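For reference, the propensity score and its traditional logistic regression form can be written as below. This is the standard textbook formulation, not notation taken from the dissertation; the symbols T (intervention indicator), X (observed covariates), and the coefficients β are assumptions introduced here for illustration.

```latex
% Propensity score: probability of assignment to the intervention given observed covariates
e(x) = P(T = 1 \mid X = x)

% Logistic regression model for the propensity score
e(x) = \frac{1}{1 + \exp\!\left(-\left(\beta_0 + \beta^{\top} x\right)\right)}
```

A machine learning model such as a random forest or gradient boosted machine replaces the logistic form with its own estimate of the same conditional probability.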

Using data from the National Center for Education Statistics (NCES) Baccalaureate & Beyond Longitudinal Study 2008/12, this research compared two machine learning models, namely the Random Forest (RF) and the Gradient Boosted Machine (GBM), with logistic regression. All three models were used to calculate the probabilities of assignment to intervention, also known as propensity scores. In this study, the intervention group consists of students graduating with Science, Technology, Engineering, and Mathematics (STEM) majors, and the comparison group consists of students graduating with non-STEM majors. Observed covariates included students' background characteristics, high school performance, scholastic scores, and early college performance. The three models were assessed on how well they predicted assignment to intervention and how well they reduced differences in observed characteristics between the intervention and comparison groups.
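The two assessment criteria described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: the function names, the 0.5 classification threshold, the pooled-standard-deviation form of the standardized mean difference, and all toy values are hypothetical, and the abstract does not state which balance statistic the study used.

```python
from math import sqrt
from statistics import mean, variance


def classification_summary(treated, scores, threshold=0.5):
    """Accuracy and sensitivity when propensity scores are used to predict assignment.

    The 0.5 threshold is an assumption made for this sketch.
    """
    predicted = [1 if s >= threshold else 0 for s in scores]
    true_pos = sum(1 for t, p in zip(treated, predicted) if t == 1 and p == 1)
    true_neg = sum(1 for t, p in zip(treated, predicted) if t == 0 and p == 0)
    accuracy = (true_pos + true_neg) / len(treated)
    sensitivity = true_pos / sum(treated)  # share of intervention cases correctly predicted
    return accuracy, sensitivity


def standardized_mean_difference(x_intervention, x_comparison):
    """Difference in covariate means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(x_intervention) + variance(x_comparison)) / 2)
    return (mean(x_intervention) - mean(x_comparison)) / pooled_sd


# Toy data (hypothetical): 1 = intervention group, 0 = comparison group
treated = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.8, 0.6, 0.3, 0.4, 0.2, 0.7, 0.1, 0.3]
accuracy, sensitivity = classification_summary(treated, scores)

# Toy covariate values (hypothetical), split by group
covariate_intervention = [3.2, 3.6, 3.4]
covariate_comparison = [2.8, 3.0, 3.2, 2.6, 3.4]
smd = standardized_mean_difference(covariate_intervention, covariate_comparison)

print(f"accuracy={accuracy:.2f} sensitivity={sensitivity:.2f} smd={smd:.2f}")
```

In practice the standardized mean difference would be computed for each covariate before and after matching on the propensity score, so that a model's effect on balance can be compared across covariates.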

Results indicated that all three models did well in overall prediction accuracy. However, the logistic regression model had a lower sensitivity score than both machine learning models. Additionally, of the three models, the Random Forest model reduced differences in observed characteristics between the intervention and comparison groups the most, while the logistic regression model did better than the Gradient Boosted Machine (GBM). Furthermore, both machine learning models increased the differences in those observed characteristics that did not initially differ between the intervention and comparison groups.