Dissertations, Theses, and Capstone Projects

Date of Degree

2-2023

Document Type

Dissertation

Degree Name

Ph.D.

Program

Computer Science

Advisor

Elena Filatova

Committee Members

Sarah Levitan

George Valkanas

Ronak Etemadpour

Subject Categories

Computational Linguistics

Keywords

Social Media, Twitter, Natural Language Processing, Machine Learning, Artificial Intelligence

Abstract

Coronavirus disease 2019 (COVID-19) emerged in Wuhan, China, in late 2019, and after spreading widely across Asian countries, it rapidly reached other countries. The disease caused governments worldwide to declare a public health crisis and to take severe measures to slow its spread. The pandemic affected the lives of millions of people. Many people who lost their loved ones and jobs experienced a wide range of emotions, such as disbelief, shock, concerns about health, fear about food supplies, anxiety, and panic. All of these phenomena led to the spread of racism and hate against Asians in Western countries, especially in the United States. An analysis of official preliminary police data by the Center for the Study of Hate & Extremism at California State University shows that anti-Asian hate crime in 16 of America's largest cities increased by 149% in 2020. In this study, we first chose a baseline dataset of Americans' hate speech on Twitter. We then present an approach to balance the biased dataset and consequently improve the performance of tweet classification. We also downloaded 10 million tweets through the Twitter API v2; in this study, we used a small portion of them for supervised methods and larger portions for a semi-supervised approach. In this work, three thousand tweets from our collected corpus were annotated by four annotators: three Asian annotators and one Asian-American annotator. Using this data, we built predictive models of hate speech using various machine learning and deep learning methods. Our machine learning methods include Random Forest, K-Nearest Neighbors (KNN), Support Vector Machine (SVM), Extreme Gradient Boosting (XGBoost), Logistic Regression, Decision Tree, and Naive Bayes.
Our deep learning models include a basic Long Short-Term Memory (LSTM) network, a bidirectional LSTM, a bidirectional LSTM with dropout, a convolutional model, and Bidirectional Encoder Representations from Transformers (BERT). We also adjusted our dataset by filtering out tweets that were ambiguous to the annotators, identified by low Fleiss' kappa agreement between annotators. Our final results showed that Logistic Regression achieved the best performance among the statistical machine learning methods, with an F1 score of 0.72, while BERT achieved the best performance among the deep learning models, with an F1 score of 0.85. We also analyzed our baseline dataset with annotators from different cultures to investigate inter-annotator agreement and its effect on a system for anti-Asian hate crime detection on Twitter during COVID-19. Our results showed that although the label distribution is more balanced in the Asian-American student annotations, machine learning performance was better on the dataset annotated by the Chinese students. The reason is that the Chinese students were more willing to tag a tweet as hateful than the Asian-American students; because our system performs hate detection, this gave us more hateful tweets for training and therefore better performance.
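
The agreement-based filtering mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not describe how the filter was implemented, so everything here is an assumption made for illustration: the function names, the binary labels, the toy ratings, and the 0.5 threshold are all hypothetical. Because Fleiss' kappa is a statistic over a whole dataset, the sketch assumes that individual tweets are ranked by the per-item agreement term that kappa averages.

```python
from collections import Counter


def per_item_agreement(labels):
    """Fraction of annotator pairs that agree on one item (the P_i term in Fleiss' kappa)."""
    n = len(labels)
    counts = Counter(labels)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of items, each rated by the same number of annotators.

    Undefined (division by zero) when every rating in the dataset is the same label.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    # Mean observed agreement across items.
    p_bar = sum(per_item_agreement(r) for r in ratings) / n_items
    # Expected agreement by chance, from the overall label proportions.
    totals = Counter(label for r in ratings for label in r)
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    return (p_bar - p_e) / (1 - p_e)


def filter_ambiguous(ratings, min_agreement=0.5):
    """Indices of items whose per-item agreement meets the (hypothetical) threshold."""
    return [i for i, r in enumerate(ratings) if per_item_agreement(r) >= min_agreement]


# Toy data (hypothetical): four annotators, 1 = hateful, 0 = not hateful.
ratings = [
    [1, 1, 1, 1],  # unanimous
    [0, 0, 0, 0],  # unanimous
    [1, 1, 0, 0],  # evenly split, most ambiguous
    [1, 1, 1, 0],  # one dissenting annotator
]
kept = filter_ambiguous(ratings, min_agreement=0.5)
kappa_all = fleiss_kappa(ratings)
kappa_kept = fleiss_kappa([ratings[i] for i in kept])
```

In this toy example the evenly split item is dropped, and kappa computed on the remaining items is higher than on the full set, which is the effect such a filter aims for. The threshold trades dataset size against label reliability and would need to be chosen for the actual data.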
