Abstract

Statistical learning methods were applied to cardiological and demographic data in order to detect the presence of heart disease in medical patients. A variety of learning techniques were explored and validated. Simple methods like logistic regression show promise, but further data collection and analysis are recommended.


Introduction

Cardiovascular disease1 (CVD), often simply referred to as heart disease, is the leading cause of death in the United States.2 Risk factors for heart disease include genetics, age, sex, diet, lifestyle, sleep, and environment. While treatment often involves altering diet, sleep, and lifestyle to align with healthy choices that would benefit the entire population, detection is still important because it allows medical practitioners to guide these interventions.

In an attempt to construct a tool to detect heart disease, statistical learning techniques have been applied to cardiological and demographic data from four hospitals. The goal of such a model is to perform this detection using demographic information and non-invasive cardiology metrics. The results show potential for using these models to pre-screen patients, but until further data collection and analysis have been performed, existing clinical practice should be continued.


Methods

Data

The data originates from the UCI Machine Learning Repository3 and was accessed via the ucidata package on GitHub.4 It was collected from four sources:

  1. Hungarian Institute of Cardiology. Budapest (Andras Janosi, M.D.)
  2. University Hospital, Zurich, Switzerland (William Steinbrunn, M.D.)
  3. University Hospital, Basel, Switzerland (Matthias Pfisterer, M.D.)
  4. V.A. Medical Center, Long Beach and Cleveland Clinic Foundation (Robert Detrano, M.D., Ph.D.)

Each patient in the data was graded on a scale from 0 to 4, indicating the severity of heart disease. A score of 0 indicates no presence of heart disease. Scores 1 through 4 indicate increasing severity; specifically, they give the number of vessels found through angiography5 with greater than 50% diameter narrowing.
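As a concrete illustration, the 0 to 4 score can be collapsed into the binary indicator used later in the analysis. A minimal sketch in R with a toy vector of scores (the variable names match the data dictionary; the specific "none"/"some" labels are an assumption):

```r
# toy vector of severity scores: 0 = no disease, 1-4 = number of
# diseased vessels found through angiography
num = c(0, 2, 0, 4, 1, 0, 3)

# collapse to a binary indicator: any presence of disease versus none
num_bin = factor(ifelse(num == 0, "none", "some"), levels = c("none", "some"))

table(num_bin)
```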

Demographic information including age, sex, and location (hospital) is available for each patient. In addition to this demographic information, several cardiology measures are available for each patient, including:

  • Electrocardiography6 results
  • Cardiac Stress Test7 results
  • Blood Test8 results
  • Blood Pressure9

Some exploratory data analysis can be found in the appendix.

Modeling

In order to detect the presence of heart disease, several classification strategies were explored. Both multiclass models, using all five possible disease states, and binary models, indicating only whether an individual has any presence of heart disease, were considered.

Two modeling strategies were considered:

  • Random Forests, through the use of the ranger10 package. (The ranger package implements random forests, as well as extremely randomized trees.11 The difference is considered a tuning parameter.)
  • Logistic Regression (using built-in R functionality) for the binary outcome and Multinomial Regression, through the use of the nnet package, for the multiclass outcome.

Additional modeling techniques were considered, but did not produce meaningful results.

Evaluation

Models were tuned using 5-fold cross-validation through the use of the caret package. Multiclass models were tuned for accuracy, while the binary models were tuned to maximize area under the ROC curve.
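For reference, the area under the ROC curve that caret reports can be computed directly from predicted probabilities. The following is a self-contained sketch of the rank-based (Mann-Whitney) formulation, not the caret internals:

```r
# AUC as the Mann-Whitney statistic: the probability that a randomly chosen
# positive case is scored higher than a randomly chosen negative case
auc = function(scores, labels) {
  r = rank(scores)                       # midranks handle tied scores
  n_pos = sum(labels == 1)
  n_neg = sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# toy example: perfectly separated scores give AUC = 1
auc(scores = c(0.9, 0.8, 0.3, 0.2), labels = c(1, 1, 0, 0))
```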

Models were ultimately evaluated based on their ability to simply detect heart disease. For the multiclass models, predictions of level 1 through 4 were collapsed to simply indicate presence of heart disease. However, these binary predictions were compared to the true multiclass outcome. This allows additional grading of severity of mistakes. (Predicting no presence of heart disease for a true status of 1 is better than predicting no presence of heart disease for a true status of 4.) Because these results cannot be cross-validated during the initial tuning of the binary models, the better performing binary model was re-cross-validated in order to produce a confusion matrix that allows for these comparisons.
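The collapsing described above can be sketched with toy vectors; tabulating the binary predictions against the full multiclass truth keeps the severity of each missed detection visible (the v0 to v4 labels mirror the result tables in this report):

```r
# toy data: true multiclass status (0-4) and already-collapsed binary predictions
truth = factor(c(0, 0, 1, 2, 0, 3, 4, 1),
               levels = 0:4, labels = paste0("v", 0:4))
pred  = c("None", "Some", "None", "Some", "None", "Some", "None", "Some")

# binary predictions tabulated against the multiclass response
conf = table(Predicted = pred, Truth = truth)

# counts as percentages of all observations, as in the result tables
round(100 * prop.table(conf), 3)
```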

Multiclass Classification

# 5-fold cross-validation, tuned for accuracy (the default multiclass metric)
cv_multi = trainControl(method = "cv", number = 5)

set.seed(42)
# random forest via ranger, tuned over mtry and splitrule
multi_mod_rf = train(
  form = num ~ . - num_bin,
  data = heart_trn,
  method = "ranger",
  trControl = cv_multi
)

set.seed(42)
# multinomial regression via nnet, tuned over the weight decay penalty
multi_mod_logistic = train(
  form = num ~ . - num_bin,
  data = heart_trn,
  method = "multinom",
  trControl = cv_multi,
  trace = FALSE   # suppress optimization output from nnet
)

Binary Classification

# 5-fold cross-validation; class probabilities and twoClassSummary
# are required to compute ROC, sensitivity, and specificity
cv_binary = trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE,
  summaryFunction = twoClassSummary
)

set.seed(42)
# random forest via ranger, selected by cross-validated AUC
bin_mod_rf = train(
  form = num_bin ~ . - num,
  data = heart_trn,
  method = "ranger",
  trControl = cv_binary,
  metric = "ROC"
)

set.seed(42)
# logistic regression, selected by cross-validated AUC
bin_mod_glm = train(
  form = num_bin ~ . - num,
  data = heart_trn,
  method = "glm",
  trControl = cv_binary,
  metric = "ROC"
)

Results

The results below show broadly similar performance across the three models presented. Ultimately, the binary logistic regression is chosen because it makes severe errors less frequently. Additional intermediate tuning results can be found in the appendix.

Table: Multiclass Random Forest, Cross-Validated Binary Predictions versus Multiclass Response, Percent

                     True Number of Vessels
                      v0      v1     v2     v3     v4
  Predict: None   41.484   8.938  2.024  1.686  0.843
  Predict: Some    8.094  17.875  7.926  8.938  2.192

Table: Multiclass Multinomial Regression, Cross-Validated Binary Predictions versus Multiclass Response, Percent

                     True Number of Vessels
                      v0      v1     v2     v3     v4
  Predict: None   43.339  10.455  2.698  2.024  0.337
  Predict: Some    6.239  16.358  7.251  8.600  2.698

Table: Binary Logistic Regression, Cross-Validated Binary Predictions versus Multiclass Response, Percent

                     True Number of Vessels
                      v0      v1     v2     v3     v4
  Predict: None   40.809   8.600  1.180  0.843  0.337
  Predict: Some    8.769  18.212  8.769  9.781  2.698

Discussion

The results show promise, but with the provided data, there are performance and sampling issues that must be addressed before putting a similar model into practice. The table below summarizes the results of the chosen logistic regression model on a held-out test dataset. Because these results depend on the particular train-test split, and are based on only 147 observations, we view them as optimistic and defer to the cross-validated results for a better estimate of future performance.

Table: Test Results, Binary Logistic Regression, Percent

                     True Number of Vessels
                      v0      v1      v2     v3     v4
  Predict: None   34.694   9.524   0.000  0.680  0.000
  Predict: Some    8.163  21.088  13.605  9.524  2.721
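Overall detection rates follow directly from a collapsed table like the one above. The sketch below reuses the test-set percentages from the table, treated as plain proportions:

```r
# test-set percentages (rows: predicted None/Some, columns: true status v0-v4)
test_pct = matrix(
  c(34.694,  9.524,  0.000, 0.680, 0.000,
     8.163, 21.088, 13.605, 9.524, 2.721),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("None", "Some"), paste0("v", 0:4))
)

# sensitivity: proportion of true disease (v1-v4) predicted as "Some"
sens = sum(test_pct["Some", 2:5]) / sum(test_pct[, 2:5])

# specificity: proportion of true v0 predicted as "None"
spec = test_pct["None", "v0"] / sum(test_pct[, "v0"])

round(c(sensitivity = sens, specificity = spec), 3)
```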

Despite the somewhat promising results, we do not recommend putting this model into practice. First, the cross-validated results still show some failures to detect the most severe cases of heart disease. (These could be detected at the cost of increasing false positives.)

Another issue is the sampling procedure used to collect this data, which is not actually defined in the documentation. Two obvious issues arise. First, there are many more male than female individuals in this dataset, which would cause problems if this model were used to screen the general population. (This distribution might be more reasonable for patients already seeking cardiac care; the documentation hints that this is the case.) A similar issue arises with the age of the individuals in the dataset. Lastly, the data was collected at four very specific locations, and using the model elsewhere could result in severe extrapolation error.

The worst issue with this dataset is its age. The data was donated to the UCI Machine Learning Repository in 1988. (It is unclear when the data was collected.) This causes two issues. First, significant changes in the population may have occurred over the past 30 years. Second, serious advances in medical technology may have occurred. This could either make the model obsolete, or provide richer data sources to input into a similar model.

Additional analysis based on updated data collection is recommended.


Appendix

Data Dictionary

  • age - age in years
  • sex - sex of individual
  • cp - chest pain type
  • trestbps - resting blood pressure (in mm Hg on admission to the hospital)
  • chol - serum cholesterol in mg/dl
  • fbs - fasting blood sugar > 120 mg/dl
  • restecg - resting electrocardiographic results
  • thalach - maximum heart rate achieved
  • exang - exercise induced angina
  • oldpeak - ST depression induced by exercise relative to rest
  • location - hospital that treated individual
  • num - diagnosis of heart disease (0 - 4)
  • num_bin - diagnosis of heart disease (0 - 1)

See the ucidata package or the UCI website for additional documentation.

EDA

Additional Results

Table: Random Forest Multiclass Classification

  mtry  min.node.size  splitrule   Accuracy  Kappa  AccuracySD  KappaSD
     2              1  gini           0.590  0.328       0.025    0.044
     2              1  extratrees     0.594  0.324       0.019    0.039
     9              1  gini           0.584  0.338       0.034    0.054
     9              1  extratrees     0.589  0.355       0.009    0.016
    16              1  gini           0.594  0.361       0.038    0.065
    16              1  extratrees     0.600  0.375       0.013    0.023

Table: Multinomial Multiclass Classification

  decay  Accuracy   Kappa  AccuracySD  KappaSD
  0e+00    0.6021  0.3749      0.0129   0.0297
  1e-04    0.6021  0.3744      0.0152   0.0329
  1e-01    0.6087  0.3768      0.0167   0.0294

Table: Random Forest Binary Classification

  mtry  min.node.size  splitrule     ROC   Sens   Spec  ROCSD  SensSD  SpecSD
     2              1  gini        0.882  0.803  0.796  0.029   0.033   0.061
     2              1  extratrees  0.885  0.796  0.766  0.028   0.049   0.066
     9              1  gini        0.863  0.769  0.786  0.036   0.054   0.038
     9              1  extratrees  0.868  0.782  0.776  0.038   0.064   0.083
    16              1  gini        0.857  0.772  0.779  0.040   0.064   0.051
    16              1  extratrees  0.861  0.776  0.766  0.039   0.069   0.049

Table: Logistic Regression Binary Classification

  parameter    ROC   Sens   Spec  ROCSD  SensSD  SpecSD
  none       0.882  0.803  0.776  0.037   0.016   0.063

  1. Wikipedia: Cardiovascular Disease

  2. CDC: Leading Causes of Death

  3. UCI: Heart Disease

  4. GitHub: ucidata

  5. Wikipedia: Angiography

  6. Wikipedia: Electrocardiography

  7. Wikipedia: Cardiac Stress Test

  8. Wikipedia: Blood Test

  9. Wikipedia: Blood Pressure

  10. STAT 432: Extremely Randomized Trees, ranger, xgboost

  11. Extremely Randomized Trees