Date of Degree


Document Type

Capstone Project

Degree Name



Data Analysis & Visualization


Howard T. Everson

Subject Categories

Other Computer Engineering


Supervised machine learning


Type II diabetes is a disease that affects how the body regulates and uses sugar (glucose) as fuel. This chronic disease results in too much sugar circulating in the bloodstream, and high blood sugar levels can lead to disorders of the circulatory, nervous, and immune systems. Machine learning (ML) techniques have proven their strength in diabetes diagnosis. In this paper, we aim to contribute to the literature on ML methods by examining the value of several supervised machine learning algorithms (logistic regression, decision tree classifiers, random forest classifiers, and support vector classifiers) for identifying factors and indicators (such as pregnancy and blood pressure) that may lead to more accurate prediction and classification of Type II diabetes in women. Identifying these indicators can enable women to take the necessary actions to prevent the onset of Type II diabetes. To apply these ML techniques, the Pima Indian Women Diabetes dataset was downloaded from the Kaggle website, and several experiments were conducted on it. Each of the four algorithms was trained on unscaled data, using both a balanced and an unbalanced version of the dataset, and again on scaled data, using both versions. This yielded sixteen models (four algorithms, two scaling options, two balancing options), which were used to evaluate the classifiers' performance and select the best model. The results of these analyses are presented, and model-based findings are contrasted.
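The experimental design described in the abstract can be sketched as a small grid of configurations. The following is a minimal illustration, not the project's actual notebook code: the function names `experiment_grid` and `run_experiments` are hypothetical, and the sketch assumes scikit-learn is available, a binary outcome, random oversampling of the minority class as the balancing step, standardization as the scaling step, and accuracy as the comparison metric. The abstract does not specify any of these choices.

```python
from itertools import product

# Classifier families named in the abstract.
ALGORITHMS = ("logistic_regression", "decision_tree", "random_forest", "support_vector")
SCALING = ("unscaled", "scaled")
BALANCE = ("unbalanced", "balanced")


def experiment_grid():
    """Return every (algorithm, scaling, balance) combination: 4 x 2 x 2 = 16."""
    return list(product(ALGORITHMS, SCALING, BALANCE))


def run_experiments(X, y, seed=0):
    """Fit one model per grid cell and return {config: test accuracy}.

    Assumptions (not taken from the source): scikit-learn is installed,
    y is binary, balancing means random oversampling of the minority class
    in the training split, and scaling means standardization.
    """
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.svm import SVC
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    makers = {
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "decision_tree": lambda: DecisionTreeClassifier(random_state=seed),
        "random_forest": lambda: RandomForestClassifier(random_state=seed),
        "support_vector": lambda: SVC(),
    }
    X, y = np.asarray(X), np.asarray(y)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )

    results = {}
    for algo, scaling, balance in experiment_grid():
        Xtr, ytr, Xte = X_train, y_train, X_test
        if balance == "balanced":
            # Oversample the minority class up to the majority class size.
            classes, counts = np.unique(ytr, return_counts=True)
            minority = classes[np.argmin(counts)]
            extra = int(counts.max() - counts.min())
            if extra > 0:
                X_extra, y_extra = resample(
                    Xtr[ytr == minority], ytr[ytr == minority],
                    replace=True, n_samples=extra, random_state=seed,
                )
                Xtr = np.vstack([Xtr, X_extra])
                ytr = np.concatenate([ytr, y_extra])
        if scaling == "scaled":
            # Fit the scaler on training data only, then apply it to both splits.
            scaler = StandardScaler().fit(Xtr)
            Xtr, Xte = scaler.transform(Xtr), scaler.transform(Xte)
        model = makers[algo]().fit(Xtr, ytr)
        results[(algo, scaling, balance)] = model.score(Xte, y_test)
    return results
```

Calling `run_experiments(X, y)` on a feature matrix and binary label vector returns one score for each of the sixteen configurations, which can then be sorted to compare models. Balancing and scaling are applied only to the training split here so that the held-out test data stays untouched. This is a design choice of the sketch, not a detail reported in the abstract.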
GitHub repository containing the Pima-Indians-diabetes Dataset and Jupyter notebook