Dissertations, Theses, and Capstone Projects

Date of Degree


Document Type


Degree Name



Educational Psychology


Jay Verkuilen

Committee Members

Howard Everson

David Rindskopf

Irene Moustaki

Louis Roussos

Subject Categories

Educational Psychology | Quantitative Psychology


Bayesian, Rater Agreement, Structural Equation Modeling, Automated Essay Scoring


Rater comparison analysis is often necessary in the social sciences. Conventional approaches generally focus on calculating agreement statistics, which provide useful but incomplete information about rater agreement. Importantly, one-number agreement statistics give no indication of the nature of disagreements, nor do they distinguish between agreement on the presence versus the absence of a state or trait. Latent variable models can address both problems and can also overcome other well-documented limitations of agreement statistics (e.g., sample dependence, inappropriate population assumptions). Whether raters agree exactly is usually not the question of interest: researchers almost never care whether the difference between raters is exactly 0. Rather, the question is whether the difference is small enough to be unimportant. Bayesian estimation makes answering this question straightforward. The posterior distribution of the rater difference on the appropriate parameters can be divided into ranges corresponding to important versus unimportant rater differences. The area under the curve in each range is the probability that the difference falls within that range. This research demonstrates the use of Bayesian Generalized Structural Equation Models (BGSEM) as a flexible framework for rater comparison. Bayesian estimation also allows more accurate modeling of beliefs when prior information is available and can cope with data limitations (e.g., few raters) that are unfortunately unavoidable in many rater agreement studies. The method is demonstrated with empirical examples of rater agreement data taken from educational contexts and involving both human and automated machine raters.
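The idea of dividing the posterior distribution into ranges can be illustrated with a minimal sketch. It is not the dissertation's BGSEM implementation. It assumes that posterior draws of a rater-difference parameter are already available from some sampler, and that a difference within ±0.25 counts as unimportant. The function name `range_probabilities`, the threshold, and the simulated draws are all hypothetical and serve only as illustration.

```python
import random


def range_probabilities(draws, threshold):
    """Estimate the posterior probability that the rater difference lies
    below, within, or above the band [-threshold, +threshold].

    Each probability is the share of posterior draws in that range,
    which approximates the area under the posterior curve there.
    """
    n = len(draws)
    below = sum(d < -threshold for d in draws)
    above = sum(d > threshold for d in draws)
    inside = n - below - above
    return {
        "below": below / n,
        "unimportant": inside / n,
        "above": above / n,
    }


# Hypothetical stand-in for sampler output: in practice these draws would
# come from the fitted model, not from a normal random number generator.
rng = random.Random(0)
posterior_draws = [rng.gauss(0.05, 0.10) for _ in range(20000)]

# Assumed threshold: differences smaller than 0.25 in absolute value are
# treated as practically unimportant.
probs = range_probabilities(posterior_draws, threshold=0.25)
print(probs)
```

The threshold is a substantive judgment that depends on the measurement scale and the application. Reporting the three probabilities together shows not only whether the raters differ by an important amount but also in which direction.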