Dissertations, Theses, and Capstone Projects

Date of Degree

6-2024

Document Type

Capstone Project

Degree Name

M.S.

Program

Data Analysis & Visualization

Advisor

Howard T. Everson

Subject Categories

Databases and Information Systems | Data Science | Journalism Studies

Keywords

fake news, news classification, data diversity, algorithmic bias, misinformation analysis, text classification

Abstract

In today's digital world, detecting fake news has emerged as a critical challenge, one with significant effects on democracy and public discourse, both regionally and globally. This research examines how the diversity of news sources in a training dataset affects how well a machine learning model can classify fake versus true news. I used Linear Support Vector Classification (LinearSVC) to create and compare two classification models: one trained on a dataset whose real news came from a single source, Reuters (Dataset 1), and the other trained on a dataset whose real news came from Reuters, The New York Times, and NPR (Dataset 2). Both datasets contained fake news articles from diverse sources. The datasets were prepared by cleaning the data and applying Term Frequency-Inverse Document Frequency (TF-IDF) vectorization. The models were then trained using LinearSVC, tested on a comparison dataset, and evaluated using accuracy, precision, recall, and F1-score. The results show that the model trained on Dataset 2 outperformed the model trained on Dataset 1 on all evaluation metrics. This improvement indicates that including different journalistic points of view in the training data improves what the model learns and how well it performs the classification task. The study adds to existing work on classifying and detecting fake news by showing that a variety of news sources in training datasets makes classification models more accurate. Beyond fake news classification, the study also underscores the broader implications of machine learning for media credibility and information consumption in the digital age.
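
The four evaluation metrics named in the abstract can be illustrated with a minimal sketch. The code below is not taken from the deposited repository; the function name `binary_metrics`, the label strings "fake" and "real", the choice of "fake" as the positive class, and the toy labels are all assumptions made for illustration, and the numbers it produces are not results from the study.

```python
def binary_metrics(y_true, y_pred, positive="fake"):
    """Compute accuracy, precision, recall, and F1 for a binary task.

    `positive` names the class treated as the positive label
    (assumed here to be "fake"; the study may define it differently).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    # Guard against division by zero when a class is never predicted
    # or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels for eight articles (toy data, not from the study).
y_true = ["fake", "fake", "fake", "real", "real", "real", "real", "fake"]
y_pred = ["fake", "fake", "real", "real", "real", "fake", "real", "fake"]

if __name__ == "__main__":
    for name, value in binary_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.2f}")
```

Running the same function on the predictions of each model against a shared comparison set would give the kind of side-by-side comparison the abstract describes.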

islam-capstone-dataset.zip (73082 kB)
datasets

fakenewsclassification-main.zip (73187 kB)
Export of GitHub repo at time of deposit.
