Project Title: Predict Healthcare Access in the United States using Classification Algorithms
Team Members:
- Maurice Ngouen
- Prasad Bhoite
- Priyanka Linge
- Shahid Tufail
Project Overview:
Motivation:
As per the Constitution of the World Health Organization, “The enjoyment of the highest attainable standard of health is one of
the fundamental rights of every human being without distinction of race, religion, political belief, economic or social condition”.
Access to healthcare is an important step toward achieving better health that impacts an individual's overall physical, mental, and
social health status and quality of life.
According to the Commonwealth Fund, which regularly ranks the health systems of developed countries, the US is the lowest overall performer
even though it spends the most on healthcare among them.
On specific indicators of healthcare quality, such as healthcare access and equity, the US is the poorest performer.
Therefore, in this project we identify the determinants of healthcare access in the US and explore inequality in that access
using classification techniques such as Logistic Regression, Random Forest, and Gradient Boosted Trees,
based on independent variables such as
Gender, Residing State, Annual Household Income, Race, Education Attainment, Marital Status, Household Ownership, Veteran Status, Employment Status.
This project focuses on the following components of access to healthcare (dependent variables):
- Health insurance coverage
(Do you have any kind of health care coverage,
including health insurance, prepaid plans such as HMOs, or government plans such as Medicare, or Indian Health Service?)
- Availability of Primary Care Physician
(Do you have one person you think of as your personal doctor or health care provider?)
- Healthcare access problem related to cost
(Was there a time in the past 12 months when you needed to see a doctor but could not because of cost?)
- Time since last routine health checkup
(About how long has it been since you last visited a doctor for a routine checkup?)
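As an illustration of how yes/no survey answers like these can become binary class labels, here is a minimal sketch. It assumes BRFSS-style response codes (1 = Yes, 2 = No, 7 = Don't know, 9 = Refused), and the column names `HLTHPLN1` and `MEDCOST` are assumptions that should be checked against the 2018 codebook. The rows are made up for illustration and this is not necessarily the recoding the project used.

```python
import numpy as np
import pandas as pd

# Made-up example rows (NOT real BRFSS records); column names are assumed.
raw = pd.DataFrame({
    "HLTHPLN1": [1, 2, 7, 9, 1],      # any health care coverage?
    "MEDCOST": [2, 1, 2, 9, np.nan],  # could not see a doctor because of cost?
})

def to_binary(series):
    """Map 1 (Yes) -> 1 and 2 (No) -> 0; every other code becomes missing."""
    return series.map({1: 1, 2: 0})

coded = raw.apply(to_binary)

# Keep only respondents with a usable answer to every question.
complete = coded.dropna().astype(int)
print(complete)
```

Treating "Don't know" and "Refused" as missing is one common choice; other handling (for example, a separate category) may suit some analyses better.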
Aims:
- Compute the rate of each component of healthcare access by Gender, Residing State, Annual Household Income, Race, Education Attainment,
Marital Status, Household Ownership, Veteran Status, and Employment Status
- Identify the 10 states with the highest and the 10 states with the lowest rate of each component of healthcare access
- Explore the determinants of healthcare access problems related to cost (unable to see a doctor due to cost)
- Create classification models for all four aforementioned components of an individual's healthcare access using
Logistic Regression, Random Forest, and Gradient Boosted Trees.
Dataset Description:
Our group primarily used the 2018 Behavioral Risk Factor Surveillance System (BRFSS) dataset
from the Centers for Disease Control and Prevention (CDC) to classify healthcare access in the USA.
The dataset has data on 437,436 individuals with 276 features.
The BRFSS is a cross-sectional telephone survey that is administered annually by state health departments to assess health-related behaviors,
medical conditions, and preventive service use among adult residents in all 50 states, Puerto Rico, Guam, the District of Columbia,
and the U.S. Virgin Islands.
Additionally, we used data on the mean unemployment rate of each US state during calendar year 2018,
downloaded from the Bureau of Labor Statistics of the United States Department of Labor.
The mean unemployment rates were used to predict the mean health insurance rates by state of residence.
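The text above does not say which model was used for the state-level prediction. As one simple possibility, the sketch below fits an ordinary least-squares line of insurance rate on unemployment rate with `numpy.polyfit`. All numbers are invented placeholders chosen to lie exactly on a line; they are not BLS or BRFSS values.

```python
import numpy as np

# Invented placeholder values (NOT real state data), one entry per "state".
unemployment_rate = np.array([2.5, 3.0, 3.5, 4.0, 4.5, 5.0])   # percent
insurance_rate = np.array([94.0, 93.0, 92.0, 91.0, 90.0, 89.0])  # percent

# Degree-1 polynomial fit = simple linear regression (least squares).
slope, intercept = np.polyfit(unemployment_rate, insurance_rate, deg=1)

# Predict the insurance rate for a hypothetical state with 3.8% unemployment.
predicted = slope * 3.8 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, predicted={predicted:.2f}")
```

With real data, the fit quality (for example, the residuals) would need to be inspected before trusting such predictions.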
Methodology:
We used data from the 2018 Behavioral Risk Factor Surveillance System survey.
- Converted the downloaded SAS-formatted dataset to .csv format using the R programming language (version 3.6.1)
- Performed exploratory analysis and visualization using the Pandas, Seaborn, and Matplotlib libraries in Python (version 3.6)
- Preprocessed the data using the Pandas library
- Balanced the datasets by undersampling using the imblearn library in Python
- Created the Training and Test datasets
- Built the classifiers/models for all four dependent variables (each component of healthcare access) on the training dataset
with 10-fold stratified cross-validation using the scikit-learn (sklearn) library in Python
- Interpreted the results of Logistic Regression, Random Forest Classifier, and Gradient Boosted Trees,
and chose the best model for each component of healthcare access
using the computed mean accuracy scores and recommendations in the reviewed literature
- Ran only the best algorithm on the whole training dataset and computed the model accuracy
- Ran only the best algorithm on the whole test dataset and computed the model accuracy
- Presented the findings.
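The modeling steps above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code. The `undersample` helper is a hypothetical NumPy stand-in for imblearn's undersampling so the example depends only on NumPy and scikit-learn, and the split ratio, seeds, and hyperparameters are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

def undersample(X, y, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.min(), replace=False)
        for c in classes
    ])
    keep.sort()
    return X[keep], y[keep]

# Synthetic imbalanced stand-in data (NOT BRFSS): roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=2000, n_features=9, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)

# Balance first, then split, following the order of the steps listed above.
X_bal, y_bal = undersample(X, y)
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.2, stratify=y_bal, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosted Trees": GradientBoostingClassifier(random_state=0),
}

# Mean accuracy over 10 stratified folds of the training set, per model.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
cv_scores = {
    name: cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy").mean()
    for name, model in models.items()
}

# Refit the chosen algorithm on the whole training set, then score both sets.
best = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
train_acc = best.score(X_train, y_train)
test_acc = best.score(X_test, y_test)
print(cv_scores, train_acc, test_acc)
```

Note that balancing before the split also balances the test set, so the test accuracy reflects a balanced class mix rather than the original class proportions.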
Results:
Top 10 States by Health Insurance Rate:
The mean health insurance rate during 2018 was 91.44%.
The District of Columbia ranked first in the US with a health insurance rate of 95.88%.
Bottom 10 States by Health Insurance Rate:
Guam ranked last in the US with a health insurance rate of 83.15%.
Florida ranked fifth from last in the US with a health insurance rate of 86.05%.
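A state ranking like the one above can be computed with a pandas group-by. The sketch below uses made-up respondents and hypothetical column names (`state`, `insured`), and it takes a plain unweighted mean; it does not apply survey weights, and the source does not say whether the reported rates were weighted.

```python
import pandas as pd

# Made-up respondents (NOT BRFSS data): 1 = has coverage, 0 = no coverage.
df = pd.DataFrame({
    "state": ["A", "A", "A", "A", "B", "B", "B", "B", "C", "C", "C", "C"],
    "insured": [1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0],
})

# Percentage of insured respondents per state, highest first.
rate_by_state = (
    df.groupby("state")["insured"].mean().mul(100).sort_values(ascending=False)
)

top_10 = rate_by_state.head(10)                    # highest rates
bottom_10 = rate_by_state.tail(10).sort_values()   # lowest rates, worst first
print(top_10)
print(bottom_10)
```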
Model Accuracy Table for each Healthcare component (Dependent Variables):
The accuracy of a classifier/model is the proportion of class labels it predicts correctly.
For all the healthcare components, Logistic Regression and Gradient Boosted Trees achieved higher accuracy
than Random Forest.
However, Gradient Boosted Trees was chosen as the best algorithm for final model building, following the recommendations in the reviewed literature.
All the Gradient Boosted models classified the class labels with an accuracy of more than 80% during final model building.
References:
- E.C. Schneider, D.O. Sarnak, D. Squires, A. Shah, and M.M. Doty,
Mirror, Mirror 2017: International Comparison Reflects Flaws and Opportunities for Better U.S. Health Care,
The Commonwealth Fund, July 2017
- CDC 2018 BRFSS Data
- Unemployment Data, Bureau of Labor Statistics,
US Department of Labor