Statistical learning methods were applied to cardiological and demographic data in order to detect the presence of heart disease in medical patients. A variety of learning techniques were explored and validated. Simple methods like logistic regression show promise, but further data collection and analysis are recommended.
Cardiovascular disease (CVD) [1], which is often simply referred to as heart disease, is the leading cause of death in the United States [2]. Risk factors for heart disease include genetics, age, sex, diet, lifestyle, sleep, and environment. While treatment often involves altering diet, sleep, and lifestyle to align with healthy choices that would benefit the entire population, detection is still important, as it allows medical practitioners to guide these interventions.
In an attempt to construct a tool to detect heart disease, statistical learning techniques have been applied to cardiological and demographic data from four hospitals. The goal is a model that performs this detection using only demographic and non-invasive cardiology measurements. The results show potential for using these models to pre-screen patients, but until further data collection and analysis have been performed, existing clinical practice should be continued.
The data originates from the UCI Machine Learning Repository [3]. The data was accessed via the `ucidata` package on GitHub [4]. The data was collected from four sources:

- Cleveland Clinic Foundation
- Hungarian Institute of Cardiology, Budapest
- V.A. Medical Center, Long Beach, California
- University Hospital, Zurich, Switzerland
Each patient in the data was graded on a scale from 0 to 4, indicating the severity of heart disease. A score of 0 indicates no presence of heart disease. Scores 1 through 4 indicate increasing severity of heart disease. In particular, the scores from 1 to 4 indicate the number of vessels found through angiography [5] with greater than 50% diameter narrowing.
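For the binary models considered later, this five-level score is collapsed into a simple presence indicator. A minimal sketch of this collapse, assuming the combined data is loaded as a data frame `heart` with the 0-4 score stored in a numeric column `num` (both names are assumptions for illustration):

    # collapse the 0-4 severity score into a two-level factor:
    # "None" for a score of 0, "Some" for any score from 1 to 4
    heart$num_bin = factor(
      ifelse(heart$num == 0, "None", "Some"),
      levels = c("None", "Some")
    )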
Demographic information including age, sex, and location (hospital) is available for each patient. In addition to this demographic information, several cardiology measurements are available for each patient; a full listing of the variables can be found in the appendix.
Some exploratory data analysis can be found in the appendix.
In order to detect the presence of heart disease, several classification strategies were explored. Both multiclass models, which use the five possible disease states, and binary models, which consider only whether or not an individual has any presence of heart disease, were fit.
Two modeling strategies were considered:

- Random Forests, through the use of the `ranger` package [10]. (The `ranger` package implements random forests, as well as extremely randomized trees [11]. The difference is considered a tuning parameter.)
- Logistic Regression (through base R functionality) for the binary outcome and Multinomial Regression, through the use of the `nnet` package, for the multiclass outcome.

Additional modeling techniques were considered, but did not produce meaningful results.
Models were tuned using 5-fold cross-validation through the use of the `caret` package. Multiclass models were tuned for accuracy, while the binary models were tuned to maximize the area under the ROC curve.
Models were ultimately evaluated based on their ability to simply detect heart disease. For the multiclass models, predictions of levels 1 through 4 were collapsed to simply indicate the presence of heart disease. However, these binary predictions were compared to the true multiclass outcome, which allows the severity of mistakes to be graded. (Predicting no presence of heart disease for a true status of 1 is better than predicting no presence of heart disease for a true status of 4.) Because these comparisons cannot be made during the initial tuning of the binary models, the better-performing binary model was re-cross-validated in order to produce a confusion matrix that allows for them.
    library(caret)

    # 5-fold cross-validation for the multiclass models, tuned for accuracy
    cv_multi = trainControl(method = "cv", number = 5)

    # multiclass random forest via the ranger package
    set.seed(42)
    multi_mod_rf = train(
      form = num ~ . - num_bin,
      data = heart_trn,
      method = "ranger",
      trControl = cv_multi
    )

    # multinomial regression via the nnet package
    set.seed(42)
    multi_mod_logistic = train(
      form = num ~ . - num_bin,
      data = heart_trn,
      method = "multinom",
      trControl = cv_multi,
      trace = FALSE
    )

    # 5-fold cross-validation for the binary models,
    # tuned for area under the ROC curve
    cv_binary = trainControl(
      method = "cv",
      number = 5,
      classProbs = TRUE,
      summaryFunction = twoClassSummary
    )

    # binary random forest via the ranger package
    set.seed(42)
    bin_mod_rf = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "ranger",
      trControl = cv_binary,
      metric = "ROC"
    )

    # binary logistic regression via glm
    set.seed(42)
    bin_mod_glm = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "glm",
      trControl = cv_binary,
      metric = "ROC"
    )
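As described above, the chosen binary model was then re-cross-validated so that its held-out binary predictions could be tabulated against the true five-level outcome. A minimal sketch of this bookkeeping, using `caret`'s `savePredictions` option (the refit object name is an assumption for illustration):

    # refit the chosen binary model, saving the held-out predictions
    cv_refit = trainControl(
      method = "cv",
      number = 5,
      classProbs = TRUE,
      summaryFunction = twoClassSummary,
      savePredictions = "final"
    )

    set.seed(42)
    bin_mod_glm_refit = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "glm",
      trControl = cv_refit,
      metric = "ROC"
    )

    # tabulate held-out binary predictions against the true 0-4 outcome,
    # expressed as percentages of the training observations
    cv_pred = bin_mod_glm_refit$pred
    table(
      predicted = cv_pred$pred,
      actual = heart_trn$num[cv_pred$rowIndex]
    ) / nrow(heart_trn) * 100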
The results below show somewhat similar performance across the three methods considered. In each table, held-out predictions are tabulated against the true multiclass outcome, expressed as percentages of observations. Ultimately, the binary logistic regression is chosen, as it makes severe errors less frequently. Additional intermediate tuning results can be found in the appendix.
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 41.484 | 8.938 | 2.024 | 1.686 | 0.843 |
Predict: Some | 8.094 | 17.875 | 7.926 | 8.938 | 2.192 |
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 43.339 | 10.455 | 2.698 | 2.024 | 0.337 |
Predict: Some | 6.239 | 16.358 | 7.251 | 8.600 | 2.698 |
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 40.809 | 8.600 | 1.180 | 0.843 | 0.337 |
Predict: Some | 8.769 | 18.212 | 8.769 | 9.781 | 2.698 |
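To quantify "severe errors" here: summing the Predict: None cells for true statuses 3 and 4 gives severe-miss rates of roughly 2.5%, 2.4%, and 1.2% for the three tables above, so the third table, which presumably corresponds to the chosen binary logistic regression, misses the most severe cases least often.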
The results show promise, but with the provided data, there are performance and sampling issues that must be addressed before putting a similar model into practice. The table below summarizes the results of the chosen logistic regression model on a held-out test dataset. As these results depend on the particular train-test split, and are based on only 147 observations, we view them as optimistic and defer to the cross-validated results for a better estimate of future performance.
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 34.694 | 9.524 | 0.000 | 0.680 | 0.000 |
Predict: Some | 8.163 | 21.088 | 13.605 | 9.524 | 2.721 |
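A test-set table like the one above can be produced directly from the fitted model. A minimal sketch, assuming the held-out data is stored in a data frame `heart_tst` (the name is an assumption for illustration):

    # predict the binary outcome on the held-out test data and tabulate
    # against the true 0-4 outcome, as percentages of test observations
    pred_tst = predict(bin_mod_glm, newdata = heart_tst)
    table(predicted = pred_tst, actual = heart_tst$num) / nrow(heart_tst) * 100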
Despite the somewhat promising results, we do not recommend putting this model into practice. First, the cross-validated results still indicate a certain amount of failure to detect the most severe cases of heart disease. (These could be detected at the cost of increasing false positives.)
Another issue is the sampling procedure used to collect this data, which is not actually defined in the documentation. Three obvious issues arise. First, there are many more male than female individuals in this dataset, which would cause problems if this model were used to screen the general population. (This distribution might be more reasonable for patients already seeking cardiac care; the documentation hints that this is the case.) A similar issue arises with the ages of the individuals in the dataset. Lastly, the data was collected at four very specific locations, and using the model outside these locations could amount to severe extrapolation.
The worst issue with this dataset is its age. The data was donated to the UCI Machine Learning Repository in 1988. (It is unclear when the data was collected.) This causes two issues. First, significant changes in the population may have occurred over the past 30 years. Second, serious advances in medical technology may have occurred. This could either make the model obsolete, or provide richer data sources to input into a similar model.
Additional analysis based on updated data collection is recommended.
- `age` - age in years
- `sex` - sex of individual
- `cp` - chest pain type
- `trestbps` - resting blood pressure (in mm Hg on admission to the hospital)
- `chol` - serum cholesterol in mg/dl
- `fbs` - fasting blood sugar > 120 mg/dl
- `restecg` - resting electrocardiographic results
- `thalach` - maximum heart rate achieved
- `exang` - exercise induced angina
- `oldpeak` - ST depression induced by exercise relative to rest
- `location` - hospital that treated individual
- `num` - diagnosis of heart disease (0 - 4)
- `num_bin` - diagnosis of heart disease (0 - 1)

See the documentation for the `ucidata` package or the UCI website for additional documentation.
Multiclass random forest (`ranger`) tuning results:

mtry | min.node.size | splitrule | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|---|---|
2 | 1 | gini | 0.590 | 0.328 | 0.025 | 0.044 |
2 | 1 | extratrees | 0.594 | 0.324 | 0.019 | 0.039 |
9 | 1 | gini | 0.584 | 0.338 | 0.034 | 0.054 |
9 | 1 | extratrees | 0.589 | 0.355 | 0.009 | 0.016 |
16 | 1 | gini | 0.594 | 0.361 | 0.038 | 0.065 |
16 | 1 | extratrees | 0.600 | 0.375 | 0.013 | 0.023 |
Multinomial regression (`nnet`) tuning results:

decay | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|
0.0000 | 0.6021 | 0.3749 | 0.0129 | 0.0297 |
0.0001 | 0.6021 | 0.3744 | 0.0152 | 0.0329 |
0.1000 | 0.6087 | 0.3768 | 0.0167 | 0.0294 |
Binary random forest (`ranger`) tuning results:

mtry | min.node.size | splitrule | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|---|---|
2 | 1 | gini | 0.882 | 0.803 | 0.796 | 0.029 | 0.033 | 0.061 |
2 | 1 | extratrees | 0.885 | 0.796 | 0.766 | 0.028 | 0.049 | 0.066 |
9 | 1 | gini | 0.863 | 0.769 | 0.786 | 0.036 | 0.054 | 0.038 |
9 | 1 | extratrees | 0.868 | 0.782 | 0.776 | 0.038 | 0.064 | 0.083 |
16 | 1 | gini | 0.857 | 0.772 | 0.779 | 0.040 | 0.064 | 0.051 |
16 | 1 | extratrees | 0.861 | 0.776 | 0.766 | 0.039 | 0.069 | 0.049 |
Binary logistic regression (`glm`) results (no tuning parameters):

parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
none | 0.882 | 0.803 | 0.776 | 0.037 | 0.016 | 0.063 |