Dissertations, Theses, and Capstone Projects
Date of Degree
2-2025
Document Type
Capstone Project
Degree Name
M.S.
Program
Data Analysis & Visualization
Advisor
Michelle McSweeney
Subject Categories
Adult and Continuing Education | Artificial Intelligence and Robotics | Categorical Data Analysis | Curriculum and Social Inquiry | Digital Humanities | Educational Assessment, Evaluation, and Research | Higher Education and Teaching | Inequality and Stratification | Language and Literacy Education | Language Interpretation and Translation | Other Social and Behavioral Sciences | Quantitative, Qualitative, Comparative, and Historical Methodologies | Race and Ethnicity
Keywords
Automated essay scoring, Human-AI comparative analysis, Bias in AI, Educational Equity, Language Proficiency
Abstract
This study evaluates the capabilities and limitations of large language models (LLMs), specifically OpenAI’s ChatGPT-4o, in grading essays from students in the City University of New York’s Language Immersion Program. The program serves English language learners from diverse linguistic and demographic backgrounds, offering intensive language instruction to prepare students for academic success in college. Using a dataset of 30 pre- and post-program essays scored by program instructors and by ChatGPT-4o under three paradigms, this research explores the alignment between human and AI-generated scores across five rubric-based competency areas. Findings reveal that ChatGPT-4o aligns moderately with human grading, with the strongest agreement on essays’ critical response and organization but significant discrepancies in areas such as word choice and grammar, where ChatGPT-4o frequently assigns lower scores. A further demographic analysis, though preliminary and directional only, shows that Black-identifying students consistently receive lower scores than other groups, suggesting algorithmic biases that can perpetuate educational inequities. AI has proven powerful as a supplement to human efforts in educational assessment, but its limitations in interpreting nuance and its impact on equity raise critical concerns. The paper argues that AI tools should be viewed as complementary aids rather than replacements for human graders, and it contributes to the growing discourse on the role of LLMs in educational settings.
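The human–AI alignment described above is the kind of comparison typically quantified with an ordinal agreement statistic such as quadratic-weighted Cohen’s kappa. The sketch below illustrates how such a comparison might be computed in Python with scikit-learn and SciPy; the score arrays are hypothetical, and the paper’s actual data and statistical choices are not reproduced here.

# Minimal sketch: comparing human and AI rubric scores with
# quadratic-weighted Cohen's kappa, a standard agreement metric
# for ordinal essay scores. All scores below are hypothetical.
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr

# Hypothetical 1-6 rubric scores for one competency area
# (e.g., organization) across a set of essays.
human_scores = [4, 5, 3, 4, 2, 5, 3, 4, 4, 3]
ai_scores    = [4, 4, 3, 3, 2, 5, 2, 4, 3, 3]

# Quadratic weighting penalizes large disagreements more heavily
# than off-by-one differences, which suits ordinal rubric scales.
kappa = cohen_kappa_score(human_scores, ai_scores, weights="quadratic")

# Spearman correlation as a complementary rank-based check.
rho, p_value = spearmanr(human_scores, ai_scores)

print(f"Quadratic-weighted kappa: {kappa:.2f}")
print(f"Spearman rho: {rho:.2f} (p = {p_value:.3f})")

Running this per rubric area would surface exactly the pattern the abstract reports: high agreement in some competencies (e.g., organization) alongside systematically lower AI scores in others (e.g., grammar).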
Recommended Citation
Inbar, Benjamin, "ChatGPT Didn’t Write This: Evaluating the Impact of LLMs with a Case Study in Grading CUNY Language Immersion Program Student Essays" (2025). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/6169