Statistical learning methods were applied to cardiological and demographic data in order to detect the presence of heart disease in medical patients. A variety of learning techniques were explored and validated. Simple methods like logistic regression show promise, but further data collection and analysis are recommended.
Cardiovascular disease (CVD) [1], which is often simply referred to as heart disease, is the leading cause of death in the United States [2]. Risk factors for heart disease include genetics, age, sex, diet, lifestyle, sleep, and environment. While treatment often involves altering diet, sleep, and lifestyle to align with healthy choices that would benefit the entire population, detection is still important, as it allows medical practitioners to guide these interventions.
In an attempt to construct a tool to detect heart disease, statistical learning techniques have been applied to cardiological and demographic data from four hospitals. The goal is a model that performs this detection using only demographic and non-invasive cardiology measurements. The results show potential for using these models to pre-screen patients, but until further data collection and analysis have been performed, existing clinical practice should be continued.
The data originates from the UCI Machine Learning Repository [3]. The data was accessed via the `ucidata` package on GitHub [4]. The data was collected from four sources:

- Cleveland Clinic Foundation
- Hungarian Institute of Cardiology, Budapest
- V.A. Medical Center, Long Beach, California
- University Hospital, Zurich, Switzerland
Each patient in the data was graded on a scale from 0 to 4, indicating the severity of heart disease. A score of 0 indicates no presence of heart disease. Scores 1 through 4 indicate increasing severity of heart disease. In particular, the scores from 1 to 4 indicate the number of vessels found through angiography [5] with greater than 50% diameter narrowing.
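For the binary models considered later, this five-level score is collapsed into a simple presence indicator. A minimal sketch of this collapse, assuming the combined data is loaded as a data frame `heart` with the 0-4 score stored in a numeric column `num` (both names are assumptions for illustration):

    # collapse the 0-4 severity score into a two-level factor:
    # "None" for a score of 0, "Some" for any score from 1 to 4
    heart$num_bin = factor(
      ifelse(heart$num == 0, "None", "Some"),
      levels = c("None", "Some")
    )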
Demographic information including age, sex, and location (hospital) is available for each patient. In addition to this demographic information, several cardiology measurements are available for each patient; a full listing of the variables can be found in the appendix.
Some exploratory data analysis can be found in the appendix.
In order to detect the presence of heart disease, several classification strategies were explored. Both multiclass models, which use the five possible disease states, and binary models, which consider only whether or not an individual has any presence of heart disease, were fit.
Two modeling strategies were considered:

- Random Forests, through the use of the `ranger` package [10]. (The `ranger` package implements random forests, as well as extremely randomized trees [11]. The difference is considered a tuning parameter.)
- Logistic Regression (through base R functionality) for the binary outcome and Multinomial Regression, through the use of the `nnet` package, for the multiclass outcome.

Additional modeling techniques were considered, but did not produce meaningful results.
Models were tuned using 5-fold cross-validation through the use of the `caret` package. Multiclass models were tuned for accuracy, while the binary models were tuned to maximize the area under the ROC curve.
Models were ultimately evaluated based on their ability to simply detect heart disease. For the multiclass models, predictions of levels 1 through 4 were collapsed to simply indicate the presence of heart disease. However, these binary predictions were compared to the true multiclass outcome, which allows the severity of mistakes to be graded. (Predicting no presence of heart disease for a true status of 1 is better than predicting no presence of heart disease for a true status of 4.) Because these comparisons cannot be made during the initial tuning of the binary models, the better-performing binary model was re-cross-validated in order to produce a confusion matrix that allows for them.
    library(caret)

    # 5-fold cross-validation for the multiclass models, tuned for accuracy
    cv_multi = trainControl(method = "cv", number = 5)

    # multiclass random forest via the ranger package
    set.seed(42)
    multi_mod_rf = train(
      form = num ~ . - num_bin,
      data = heart_trn,
      method = "ranger",
      trControl = cv_multi
    )

    # multinomial regression via the nnet package
    set.seed(42)
    multi_mod_logistic = train(
      form = num ~ . - num_bin,
      data = heart_trn,
      method = "multinom",
      trControl = cv_multi,
      trace = FALSE
    )

    # 5-fold cross-validation for the binary models,
    # tuned for area under the ROC curve
    cv_binary = trainControl(
      method = "cv",
      number = 5,
      classProbs = TRUE,
      summaryFunction = twoClassSummary
    )

    # binary random forest via the ranger package
    set.seed(42)
    bin_mod_rf = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "ranger",
      trControl = cv_binary,
      metric = "ROC"
    )

    # binary logistic regression via glm
    set.seed(42)
    bin_mod_glm = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "glm",
      trControl = cv_binary,
      metric = "ROC"
    )
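As described above, the chosen binary model was then re-cross-validated so that its held-out binary predictions could be tabulated against the true five-level outcome. A minimal sketch of this bookkeeping, using `caret`'s `savePredictions` option (the refit object name is an assumption for illustration):

    # refit the chosen binary model, saving the held-out predictions
    cv_refit = trainControl(
      method = "cv",
      number = 5,
      classProbs = TRUE,
      summaryFunction = twoClassSummary,
      savePredictions = "final"
    )

    set.seed(42)
    bin_mod_glm_refit = train(
      form = num_bin ~ . - num,
      data = heart_trn,
      method = "glm",
      trControl = cv_refit,
      metric = "ROC"
    )

    # tabulate held-out binary predictions against the true 0-4 outcome,
    # expressed as percentages of the training observations
    cv_pred = bin_mod_glm_refit$pred
    table(
      predicted = cv_pred$pred,
      actual = heart_trn$num[cv_pred$rowIndex]
    ) / nrow(heart_trn) * 100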
The results below show somewhat similar performance across the three methods considered. In each table, held-out predictions are tabulated against the true multiclass outcome, expressed as percentages of observations. Ultimately, the binary logistic regression is chosen, as it makes severe errors less frequently. Additional intermediate tuning results can be found in the appendix.
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 41.484 | 8.938 | 2.024 | 1.686 | 0.843 |
Predict: Some | 8.094 | 17.875 | 7.926 | 8.938 | 2.192 |
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 43.339 | 10.455 | 2.698 | 2.024 | 0.337 |
Predict: Some | 6.239 | 16.358 | 7.251 | 8.600 | 2.698 |
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 40.809 | 8.600 | 1.180 | 0.843 | 0.337 |
Predict: Some | 8.769 | 18.212 | 8.769 | 9.781 | 2.698 |
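To quantify "severe errors" here: summing the Predict: None cells for true statuses 3 and 4 gives severe-miss rates of roughly 2.5%, 2.4%, and 1.2% for the three tables above, so the third table, which presumably corresponds to the chosen binary logistic regression, misses the most severe cases least often.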
The results show promise, but with the provided data, there are performance and sampling issues that must be addressed before putting a similar model into practice. The table below summarizes the results of the chosen logistic regression model on a held-out test dataset. As these results depend on the particular train-test split, and are based on only 147 observations, we view them as optimistic and defer to the cross-validated results for a better estimate of future performance.
 | True: 0 | True: 1 | True: 2 | True: 3 | True: 4 |
---|---|---|---|---|---|
Predict: None | 34.694 | 9.524 | 0.000 | 0.680 | 0.000 |
Predict: Some | 8.163 | 21.088 | 13.605 | 9.524 | 2.721 |
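A test-set table like the one above can be produced directly from the fitted model. A minimal sketch, assuming the held-out data is stored in a data frame `heart_tst` (the name is an assumption for illustration):

    # predict the binary outcome on the held-out test data and tabulate
    # against the true 0-4 outcome, as percentages of test observations
    pred_tst = predict(bin_mod_glm, newdata = heart_tst)
    table(predicted = pred_tst, actual = heart_tst$num) / nrow(heart_tst) * 100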
Despite the somewhat promising results, we do not recommend putting this model into practice. First, the cross-validated results still indicate a certain amount of failure to detect the most severe cases of heart disease. (These could be detected at the cost of increasing false positives.)
Another issue is the sampling procedure used to collect this data, which is not actually defined in the documentation. Three obvious issues arise. First, there are many more male than female individuals in this dataset, which would cause problems if this model were used to screen the general population. (This distribution might be more reasonable for patients already seeking cardiac care; the documentation hints that this is the case.) A similar issue arises with the ages of the individuals in the dataset. Lastly, the data was collected at four very specific locations, and using the model outside these locations could amount to severe extrapolation.
The worst issue with this dataset is its age. The data was donated to the UCI Machine Learning Repository in 1988. (It is unclear when the data was collected.) This causes two issues. First, significant changes in the population may have occurred over the past 30 years. Second, serious advances in medical technology may have occurred. This could either make the model obsolete, or provide richer data sources to input into a similar model.
Additional analysis based on updated data collection is recommended.
- `age` - age in years
- `sex` - sex of individual
- `cp` - chest pain type
- `trestbps` - resting blood pressure (in mm Hg on admission to the hospital)
- `chol` - serum cholesterol in mg/dl
- `fbs` - fasting blood sugar > 120 mg/dl
- `restecg` - resting electrocardiographic results
- `thalach` - maximum heart rate achieved
- `exang` - exercise induced angina
- `oldpeak` - ST depression induced by exercise relative to rest
- `location` - hospital that treated individual
- `num` - diagnosis of heart disease (0 - 4)
- `num_bin` - diagnosis of heart disease (0 - 1)

See the documentation for the `ucidata` package or the UCI website for additional documentation.
Multiclass random forest (`ranger`) tuning results:

mtry | min.node.size | splitrule | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|---|---|
2 | 1 | gini | 0.590 | 0.328 | 0.025 | 0.044 |
2 | 1 | extratrees | 0.594 | 0.324 | 0.019 | 0.039 |
9 | 1 | gini | 0.584 | 0.338 | 0.034 | 0.054 |
9 | 1 | extratrees | 0.589 | 0.355 | 0.009 | 0.016 |
16 | 1 | gini | 0.594 | 0.361 | 0.038 | 0.065 |
16 | 1 | extratrees | 0.600 | 0.375 | 0.013 | 0.023 |
Multinomial regression (`nnet`) tuning results:

decay | Accuracy | Kappa | AccuracySD | KappaSD |
---|---|---|---|---|
0.0000 | 0.6021 | 0.3749 | 0.0129 | 0.0297 |
0.0001 | 0.6021 | 0.3744 | 0.0152 | 0.0329 |
0.1000 | 0.6087 | 0.3768 | 0.0167 | 0.0294 |
Binary random forest (`ranger`) tuning results:

mtry | min.node.size | splitrule | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|---|---|
2 | 1 | gini | 0.882 | 0.803 | 0.796 | 0.029 | 0.033 | 0.061 |
2 | 1 | extratrees | 0.885 | 0.796 | 0.766 | 0.028 | 0.049 | 0.066 |
9 | 1 | gini | 0.863 | 0.769 | 0.786 | 0.036 | 0.054 | 0.038 |
9 | 1 | extratrees | 0.868 | 0.782 | 0.776 | 0.038 | 0.064 | 0.083 |
16 | 1 | gini | 0.857 | 0.772 | 0.779 | 0.040 | 0.064 | 0.051 |
16 | 1 | extratrees | 0.861 | 0.776 | 0.766 | 0.039 | 0.069 | 0.049 |
Binary logistic regression (`glm`) results (no tuning parameters):

parameter | ROC | Sens | Spec | ROCSD | SensSD | SpecSD |
---|---|---|---|---|---|---|
none | 0.882 | 0.803 | 0.776 | 0.037 | 0.016 | 0.063 |