Dissertations, Theses, and Capstone Projects
Date of Degree
2-2021
Document Type
Thesis
Degree Name
M.A.
Program
Linguistics
Advisor
Kyle Gorman
Subject Categories
Computational Linguistics | Other Linguistics | Spanish Linguistics
Keywords
translanguaging, computational linguistics, computer mediated communication, Doc2Vec, Word2Vec, KNN
Abstract
Code-switching is the linguistic phenomenon in which a multilingual person alternates between two or more languages in a conversation, whether spoken or written. This thesis studies the automatic detection of code-switching between English and Spanish in two corpora.
Twitter and other social media sites provide an abundance of linguistic data that researchers can use for experiments. Collecting data is fairly easy for a study of monolingual text, but it is harder for a study that requires code-switched data, because APIs accept only one language as a parameter. This thesis focuses on identifying code-switching in both Twitter data and the Miami-Bangor corpus through three experiments. The first experiment uses a logistic regression model to distinguish code-switched data from monolingual data. The second experiment uses a novel Word2Vec average nearest neighbor (WANN) classifier, based on word embeddings, to detect code-switching. The third experiment uses Doc2Vec, where the model uses the mean vector of each document to learn and distinguish between code-switched and monolingual data. Each experiment is performed twice, once with tweets and once with the Miami-Bangor corpus. The results show that the WANN model performs best on Twitter data, while the Doc2Vec model performs best on the Miami-Bangor corpus. Both approaches perform well, however, and their performance is comparable.
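The nearest-neighbor idea behind the second experiment can be illustrated with a minimal sketch. It is not the thesis's implementation: the abstract does not specify how WANN is built, so the sketch assumes that each document is represented by the mean of its word vectors and is labeled by majority vote among its k most cosine-similar training documents. The two-dimensional vectors in `TOY_VECTORS`, the function names, and the training examples are all hypothetical stand-ins for embeddings that would normally come from a trained Word2Vec model.

```python
import math
from collections import Counter

# Hypothetical toy embeddings: axis 0 is roughly "English", axis 1 "Spanish".
# A real setup would load vectors from a trained Word2Vec model instead.
TOY_VECTORS = {
    "the": (1.0, 0.0), "dog": (0.9, 0.1), "runs": (0.8, 0.0), "house": (0.9, 0.0),
    "el": (0.0, 1.0), "perro": (0.1, 0.9), "corre": (0.0, 0.8), "casa": (0.0, 0.9),
}


def mean_vector(tokens, vectors):
    """Average the vectors of all known tokens; None if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return tuple(sum(v[i] for v in known) / len(known) for i in range(dim))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def knn_predict(tokens, train, vectors, k=3):
    """Label a document by majority vote among its k nearest training documents."""
    query = mean_vector(tokens, vectors)
    if query is None:
        return None
    scored = sorted(((cosine(query, vec), label) for vec, label in train), reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled training documents (already tokenized).
TRAIN_DOCS = [
    (["the", "dog", "runs"], "monolingual"),
    (["the", "house"], "monolingual"),
    (["el", "perro", "corre"], "monolingual"),
    (["perro", "corre"], "monolingual"),
    (["the", "perro", "runs"], "code-switched"),
    (["el", "dog", "corre"], "code-switched"),
    (["the", "casa"], "code-switched"),
]
TRAIN = [(mean_vector(tokens, TOY_VECTORS), label) for tokens, label in TRAIN_DOCS]

print(knn_predict(["el", "house"], TRAIN, TOY_VECTORS))  # code-switched
print(knn_predict(["the", "dog"], TRAIN, TOY_VECTORS))   # monolingual
```

In this toy setup, a mixed document such as `el house` lands between the English and Spanish clusters, so its nearest neighbors are the mixed training documents. Real embeddings would not separate languages this cleanly, so the value of k and the choice of similarity measure would need to be tuned on held-out data.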
Recommended Citation
Polanco, Yohamy C., "A Computational Study in the Detection of English–Spanish Code-Switches" (2021). CUNY Academic Works.
https://academicworks.cuny.edu/gc_etds/4195
Included in
Computational Linguistics Commons, Other Linguistics Commons, Spanish Linguistics Commons