Project Title: Predict Healthcare Access in the United States using Classification Algorithms

Team Members:

Maurice Ngouen
Prasad Bhoite
Priyanka Linge
Shahid Tufail

Project Overview:

Motivation:

As per the Constitution of the World Health Organization, “The enjoyment of the highest attainable standard of health is one of the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition”. Access to healthcare is an important step toward achieving better health that impacts an individual's overall physical, mental, and social health status and quality of life.
According to the Commonwealth Fund, which regularly ranks the health systems of developed countries, the US is the lowest overall performer even though it spends the most on healthcare among them.

Healthcare system performance compared to spending

For specific indicators of Healthcare quality such as Healthcare access and equity, the US is the poorest performer:

Healthcare system performance compared to spending

Therefore, in this project we identify the determinants of healthcare access in the US and explore inequality in that access using classification techniques (Logistic Regression, Random Forest, and Gradient Boosted Trees) based on the following independent variables: Gender, Residing State, Annual Household Income, Race, Education Attainment, Marital Status, Household Ownership, Veteran Status, and Employment Status.

This project focuses on the following components of access to healthcare (dependent variables):

Health insurance coverage (Do you have any kind of health care coverage, including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?)
Availability of Primary Care Physician (Do you have one person you think of as your personal doctor or health care provider?)
Healthcare access problem related to cost (Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?)
Duration since last annual health checkup (About how long has it been since you last visited a doctor for a routine checkup?)

Aims:

Rate of different components of Healthcare Access by Gender, Residing State, Annual household income, Race, Education attainment, Marital Status, Household ownership, Veteran Status, Employment Status
Top 10 states with the highest and lowest rate of different components of Healthcare Access
Explore the determinants of healthcare access problems related to cost (unable to see a doctor due to cost)
Create classification models for all four aforementioned components of an individual's healthcare access using Logistic Regression, Random Forest, and Gradient Boosted Trees.

Data Set Description:

Our group has primarily used the 2018 Behavioral Risk Factor Surveillance System (BRFSS) data set from the Centers for Disease Control and Prevention (CDC) to classify the healthcare access in the USA. The aforementioned data set has data on 437,436 individuals with 276 features. The BRFSS is a cross-sectional telephone survey that is administered annually by state health departments to assess health-related behaviors, medical conditions, and preventive service use among adult residents in all 50 states, Puerto Rico, Guam, the District of Columbia, and the U.S. Virgin Islands.
We also used data on the mean unemployment rate of each US state during calendar year 2018, downloaded from the Bureau of Labor Statistics of the United States Department of Labor. The mean unemployment rates were used to predict the mean health insurance rates by residing state.

Methodology:

We used data from 2018's Behavioral Risk Factor Surveillance System Survey.

The downloaded SAS formatted dataset was converted into .csv format using R programming language (version 3.6.1).
Exploratory Analysis and Exploratory Visualization using Pandas, Seaborn, and Matplotlib libraries in Python (version 3.6)
Data Preprocessing using Pandas library
Balanced the datasets by undersampling with the imblearn library in Python
Created the Training and Test datasets
Built the classifiers / models for all four dependent variables (each component of Healthcare Access) on the training dataset with 10-fold Stratified CV using the scikit-learn (sklearn) library in Python.
Interpreted the results of the Logistic Regression, Random Forest Classifier, and Gradient Boosted Trees models and chose the best model for each component of Healthcare Access using the computed Mean Accuracy Scores and recommendations in the reviewed literature
Ran only the best algorithm on the whole training dataset and computed the model accuracy
Ran only the best algorithm on the whole test dataset and computed the model accuracy
Presented the findings.
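
The balancing, splitting, cross-validation, and final-fit steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code. It assumes a synthetic dataset from make_classification in place of the preprocessed BRFSS features, and it swaps in a small hand-written undersample helper (a hypothetical name) for imblearn's undersampler so that the block depends only on NumPy and scikit-learn. The hyperparameters (n_estimators, max_iter, the 80/20 split) are illustrative assumptions as well.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_keep = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_keep, replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]


# Synthetic, imbalanced stand-in for the preprocessed survey features (assumption).
X, y = make_classification(n_samples=2000, n_features=9, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Balance first, then split, following the order of the steps listed above.
X_bal, y_bal = undersample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

# Mean accuracy over 10 stratified folds of the training data, per model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Refit one chosen model on the whole training set, then score train and test data.
final_model = models["Gradient Boosted Trees"].fit(X_train, y_train)
train_accuracy = final_model.score(X_train, y_train)
test_accuracy = final_model.score(X_test, y_test)
```

Printing cv_scores gives one mean accuracy per model, which is the kind of figure the model comparison relies on. The numbers from this synthetic data say nothing about the BRFSS results.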

Results:

Top 10 States with Highest Medical Insurance:

Top 10 States by Health Insurance Rate

The mean health insurance rate during 2018 was 91.44%.
The District of Columbia ranked first in the US, with a health insurance rate of 95.88%.

Lowest 10 States by Health Insurance Rates:

Lowest 10 States by Health Insurance Rate

Guam ranked last in the US, with a health insurance rate of 83.15%. Florida ranked fifth from last, with a health insurance rate of 86.05%.
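
A ranking of this kind could be computed along the following lines. The sketch uses only the Python standard library and a handful of made-up records with placeholder state names, so none of the numbers correspond to BRFSS data. It also assumes a simple unweighted share of respondents reporting coverage. Whether the project applied survey weights is not stated here, and the helper name coverage_rate_by_state is hypothetical.

```python
from collections import defaultdict

# Hypothetical toy records: (state, has_coverage). These are not BRFSS data.
records = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", True), ("B", False),
    ("C", True), ("C", True),
]


def coverage_rate_by_state(rows):
    """Return {state: percent of respondents reporting coverage} (unweighted)."""
    covered = defaultdict(int)
    total = defaultdict(int)
    for state, has_coverage in rows:
        total[state] += 1
        covered[state] += int(has_coverage)
    return {state: 100.0 * covered[state] / total[state] for state in total}


rates = coverage_rate_by_state(records)

# Sort states from highest to lowest coverage rate.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
top_10 = ranked[:10]
lowest_10 = ranked[-10:][::-1]  # lowest-rate state first
```

With real data, the same grouping can be done with a pandas groupby on the state column. The plain-Python version is shown only to keep the example dependency-free.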

Model Accuracy Table for each Healthcare component (Dependent Variables):


The accuracy of a classifier/model is the proportion of observations whose class labels it predicts correctly.
For all the healthcare components, the accuracies of the Logistic Regression and Gradient Boosted Trees algorithms were better than those of Random Forest. However, the Gradient Boosted Trees algorithm was chosen as the best algorithm for final model building, as per the recommendations in the reviewed literature. All the Gradient Boosted models classified the class labels with an accuracy of more than 80% during final model building.
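
As a small illustration of the accuracy metric described above, the sketch below computes the share of correctly predicted labels for a made-up pair of label lists. The function name accuracy and the example labels are assumptions for illustration only. scikit-learn's accuracy_score computes the same quantity.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up labels: 4 of the 5 predictions match the true labels.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
score = accuracy(y_true, y_pred)
```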