Date of Degree


Document Type


Degree Name





Martin Chodorow

Subject Categories

Computational Linguistics


Computational Linguistics, Natural Language Processing, Authorship Attribution, Twitter


In recent years, Twitter has become a popular testing ground for techniques in authorship attribution. This is due both to the ease of building large corpora and to the challenges posed by the service's character limit and the writing styles that have developed as a result. As both false and genuine claims of hacked Twitter accounts have made international news, there is an increasing need for this type of work. For newer Twitter accounts, however, there is little training data. Thus, this study looks to lay the groundwork for cross-domain authorship attribution: training on one source of writing but testing on another. This work examines three types of feature sets (word n-grams, character n-grams, and stop words) and three machine learning algorithms (Naïve Bayes, Logistic Regression, and Linear Support Vector Classification).
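
As a rough illustration of the setup described in the abstract, the sketch below extracts the three feature types and trains a small multinomial Naïve Bayes classifier on one domain before testing it on another. It is not the study's implementation: the function names, the stop-word list, the add-one smoothing, and the toy texts are all assumptions made for illustration, and Logistic Regression and Linear Support Vector Classification are omitted for brevity.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption, not the study's list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}


def word_ngrams(text, n=2):
    """Return overlapping word n-grams from lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n=3):
    """Return overlapping character n-grams, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def stop_word_features(text):
    """Keep only the tokens that appear in STOP_WORDS, in order."""
    return [t for t in text.lower().split() if t in STOP_WORDS]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over any feature extractor."""

    def __init__(self, extract):
        self.extract = extract

    def fit(self, texts, authors):
        self.feature_counts = {}        # author -> Counter of feature occurrences
        self.doc_counts = Counter(authors)
        self.vocab = set()
        for text, author in zip(texts, authors):
            feats = self.extract(text)
            self.feature_counts.setdefault(author, Counter()).update(feats)
            self.vocab.update(feats)
        return self

    def predict(self, text):
        # Features never seen in training are ignored.
        feats = [f for f in self.extract(text) if f in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_author, best_score = None, -math.inf
        for author, counts in self.feature_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[author] / total_docs)
            for f in feats:
                score += math.log((counts[f] + 1) / denom)
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Invented toy data: longer-form writing for training, tweet-like text for testing.
TRAIN_TEXTS = [
    "honestly the weather is lovely today",
    "honestly i think it is lovely",
    "lol gonna grab coffee now",
    "lol that game was wild",
]
TRAIN_AUTHORS = ["alice", "alice", "bob", "bob"]


def cross_domain_demo():
    """Train on one domain with character 3-grams, then label two short test texts."""
    model = NaiveBayes(char_ngrams).fit(TRAIN_TEXTS, TRAIN_AUTHORS)
    return [model.predict("honestly lovely day"), model.predict("lol coffee")]
```

Passing `word_ngrams` or `stop_word_features` to `NaiveBayes` in place of `char_ngrams` swaps the feature set without changing the classifier, which mirrors how the abstract treats feature sets and algorithms as independent choices.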


