Park C., Truong D., Zheng E., Hooda J.
The University of British Columbia, Oct 2022
DSCI 100, Section 001
Cardiovascular diseases have been an issue of major concern for years. According to recent reports from the World Health Organization, cardiovascular diseases account for 32% (17.9 million) of global deaths annually (Cardiovascular Diseases (CVDs), 2021). This makes them a leading cause of death, which highlights the importance of their analysis. Cardiovascular diseases come in multiple types, including coronary heart disease, cerebrovascular disease, and peripheral arterial disease. While about 80% of heart disease and strokes are preventable, there have not been significant improvements in the number of successfully treated heart disease patients. This is likely because of the difficulty involved in correctly predicting heart disease in its early stages (McGill et al., 2008; Tsao et al., 2022). So, the aim of our project is to help elucidate some of the predictors and symptomatic patterns surrounding heart disease so that we can understand how we might better predict its occurrence. We focused on using groups of attributes, such as chest pain location, cholesterol, and blood sugar levels, to help us predict the occurrence of heart disease in patients.
In this project, we accessed the UCI Machine Learning Repository’s heart disease dataset, which has been used by machine learning researchers to characterise the presence of heart disease in patients (Heart Disease Data Set, 1988). The dataset rates the presence of heart disease from 0 to 4, with 0 indicating no presence. Past work with the Cleveland database focused on distinguishing the presence values 1 to 4 from the absence value of 0.
We expect to find out how well these groups of attributes lead to accurate prediction of heart disease. Our study may show that factors related to resting blood pressure help predict the presence and severity of heart disease. Alternatively, resting blood pressure trends could be compared to the other indicators' trends to see whether resting blood pressure in tandem with the other indicators is a better predictor.
To build our predictive model, we selected the Cleveland heart disease dataset and excluded the variable “trestbps,” which indicates resting blood pressure. We also modified the “num” column, which indicates heart disease presence and severity, to indicate presence only, so that our model predicts just the presence or absence of heart disease.
Then, we used the K-Nearest Neighbours classification algorithm to train the model on the modified dataset. A subset of the data was held out to assess the accuracy of the model. For detailed information about the variables used, please refer to the documentation on the heart disease dataset's page (Heart Disease Data Set, 1988).
library(tidyverse)
library(tidymodels)
library(GGally)  # ggplot2 is already attached via tidyverse
options(repr.matrix.max.rows = 6, digits = 10)
set.seed(4321)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.3.6 ✔ purrr 0.3.4
✔ tibble 3.1.7 ✔ dplyr 1.0.9
✔ tidyr 1.2.0 ✔ stringr 1.4.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
✔ broom 1.0.0 ✔ rsample 1.0.0
✔ dials 1.0.0 ✔ tune 1.0.0
✔ infer 1.0.2 ✔ workflows 1.0.0
✔ modeldata 1.0.0 ✔ workflowsets 1.0.0
✔ parsnip 1.0.0 ✔ yardstick 1.0.0
✔ recipes 1.0.1
── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
✖ scales::discard() masks purrr::discard()
✖ dplyr::filter() masks stats::filter()
✖ recipes::fixed() masks stringr::fixed()
✖ dplyr::lag() masks stats::lag()
✖ yardstick::spec() masks readr::spec()
✖ recipes::step() masks stats::step()
• Dig deeper into tidy modeling with R at https://www.tmwr.org
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
raw_cleveland_heart_data <- read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", delim = ",", col_names = FALSE, show_col_types = FALSE)
colnames(raw_cleveland_heart_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg", "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
nrow(raw_cleveland_heart_data)
In the wrangling section, our team converted the predictor columns ca and thal to numeric quantities and collapsed the target variable num into a binary factor by taking the minimum of its value and 1, so that any severity rating from 1 to 4 simply indicates the presence of heart disease.
Moreover, our team created a ggpairs visualization of the data in order to determine which attributes are most useful for predicting heart disease in patients.
cleveland_heart_data <- raw_cleveland_heart_data %>%
rowwise() %>%
mutate(num = min(num, 1)) %>%
ungroup() %>%
mutate(ca = as.numeric(ca), thal = as.numeric(thal), num = as.factor(num)) %>%
na.omit()
plt <- ggpairs(cleveland_heart_data)
Warning message in mask$eval_all_mutate(quo):
“NAs introduced by coercion”
Warning message in mask$eval_all_mutate(quo):
“NAs introduced by coercion”
plt <- plt + theme(text = element_text(size = 8))
ggsave("ggpairs.png", plot = plt, width = 10, height = 10, units = "in", dpi = 300)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Note: Please make sure to download the "ggpairs.png" file attached alongside the HTML and .ipynb files in the submission so that the image above renders. If the image does not render in Jupyter Notebook, fully closing the notebook (not refreshing) and opening it again should resolve the issue.
Before building a classification model, it is imperative to analyse the dataset.
This must be done to assess the quality of the training data: to tell whether it is worthwhile to create a classification model at all, and whether an effective one is possible. Consider the graph in the bottom right-hand corner. Across the entire dataset, the number of entries with and without the presence of heart disease is approximately even, with slightly more cases without heart disease. Looking at all the graphs on the bottom row, we can likewise compare each predictor's distribution between the two classes.
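The class balance described here can also be checked directly with a quick count. A minimal sketch, using a small toy data frame with hypothetical values in place of the wrangled dataset:

```r
library(dplyr)

# Toy stand-in for the wrangled data: num is the binary target factor
toy <- tibble(num = factor(c(0, 0, 1, 0, 1)))

# Count observations per class and compute their proportions
class_balance <- toy %>%
  count(num) %>%
  mutate(prop = n / sum(n))

class_balance
```

Running the same pipeline on `cleveland_heart_data` would report the actual per-class counts behind the histogram in the ggpairs plot.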
This must also be done to properly pick predictor variables, such that they can effectively differentiate between the classes of the target variable. Consider the right column of the plot, at the intersection of num and trestbps. Based on the ggpairs plot, we decided to remove trestbps (resting blood pressure) because its distribution among patients with and without heart disease is nearly indistinguishable, suggesting it would add little as a predictor. Consider now the intersection of num and fbs (fasting blood sugar). From a cursory glance, the distribution appears relatively similar regardless of the class. Looking at the bottom row, at the same intersection, the distribution between those with elevated fasting blood sugar and those without is virtually identical. For this reason, this variable was also removed.
cleveland_heart_data <- cleveland_heart_data %>%
select(-trestbps, -fbs)
heart_split <- initial_split(cleveland_heart_data, prop = 0.8)
heart_train <- training(heart_split)
heart_test <- testing(heart_split)
knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
set_engine("kknn") |>
set_mode("classification")
knn_spec
heart_recipe <- recipe(num ~ ., data = heart_train) |>
step_scale(all_predictors()) |>
step_center(all_predictors())
k_vals <- tibble(neighbors = seq(from = 1, to = 130, by = 1))
knn_vfold <- vfold_cv(heart_train, v = 5, strata = num)
knn_results <- workflow() |>
add_recipe(heart_recipe) |>
add_model(knn_spec) |>
tune_grid(resamples = knn_vfold, grid = k_vals)
K-Nearest Neighbor Model Specification (classification)
Main Arguments:
neighbors = tune()
weight_func = rectangular
Computational engine: kknn
result_table <- collect_metrics(knn_results) %>%
filter(.metric == "accuracy")
result_table %>%
filter((neighbors + 1) %% 4 == 0) %>%  # plot every fourth K to reduce clutter
ggplot(aes(x = neighbors, y = mean)) +
geom_vline(xintercept = 47, colour = "red") +
geom_line() +
geom_point()
result_table %>%
filter(neighbors < 70) %>%
ggplot(aes(x = neighbors, y = mean)) +
geom_vline(xintercept = 47, colour = "red") +
geom_line() +
geom_point()
best_ks <- result_table %>%
arrange(desc(mean)) %>%
head(5) %>%
select(neighbors, mean, std_err)
best_ks
| neighbors | mean | std_err |
|---|---|---|
| <dbl> | <dbl> | <dbl> |
| 47 | 0.8356074622 | 0.02000131555 |
| 48 | 0.8356074622 | 0.02000131555 |
| 49 | 0.8356074622 | 0.02000131555 |
| 50 | 0.8356074622 | 0.02000131555 |
| 51 | 0.8356074622 | 0.02000131555 |
for (n in best_ks %>% select(neighbors) %>% pull()) {
knn_n_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = n) |>
set_engine("kknn") |>
set_mode("classification")
knn_n <- workflow() |>
add_recipe(heart_recipe) |>
add_model(knn_n_spec) |>
fit(data = heart_train)
knn_n_err <- predict(knn_n, heart_test) %>%
bind_cols(heart_test) %>%
metrics(truth = num, estimate = .pred_class) %>%
filter(.metric == "accuracy") %>%
pull(.estimate)
cat("K=", n , "Model Accuracy", toString(knn_n_err), "\n")
# counts of predictions
# knn_n_err <- predict(knn_n, heart_test) %>%
# group_by(.pred_class) %>%
# summarize(count = n()) %>%
# print()
# confusion matrix
predict(knn_n, heart_test) %>%
bind_cols(heart_test) %>%
conf_mat(truth = num, estimate = .pred_class) %>%
print()
}
K= 47 Model Accuracy 0.883333333333333
Truth
Prediction 0 1
0 29 5
1 2 24
K= 48 Model Accuracy 0.883333333333333
Truth
Prediction 0 1
0 29 5
1 2 24
K= 49 Model Accuracy 0.883333333333333
Truth
Prediction 0 1
0 29 5
1 2 24
K= 50 Model Accuracy 0.883333333333333
Truth
Prediction 0 1
0 29 5
1 2 24
K= 51 Model Accuracy 0.883333333333333
Truth
Prediction 0 1
0 29 5
1 2 24
Our project aimed to use characteristics of patients’ health as predictors of heart disease. We used the K-Nearest Neighbours classification algorithm to shed light on the symptomatic patterns surrounding heart disease so that we might gain insight into how to better predict its occurrence. During our analysis of the dataset, we found that the distribution of the resting blood pressure variable was very similar in patients with and without heart disease. Because its distribution seemed unaffected by heart disease presence, it was unlikely to be a reliable predictor, so we omitted it from our analysis; we removed fasting blood sugar for the same reason.
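One simple numeric check of this similarity is to compare per-class summary statistics of trestbps. A minimal sketch, using a toy data frame with hypothetical values standing in for the raw dataset:

```r
library(dplyr)

# Toy stand-in: resting blood pressure values chosen to have
# near-identical distributions in both classes, mirroring what
# the ggpairs plot showed for the real data
toy <- tibble(
  num      = factor(rep(c(0, 1), each = 4)),
  trestbps = c(120, 130, 140, 125, 121, 131, 139, 126)
)

# Compare per-class mean and spread of resting blood pressure
bp_summary <- toy %>%
  group_by(num) %>%
  summarize(mean_bp = mean(trestbps), sd_bp = sd(trestbps))

bp_summary
```

Similar per-class means and standard deviations, as in this toy example, support the visual impression that the variable carries little class information.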
When we tested our model against the held-out test set, we found that it had an accuracy of 88% when we used K = 47, our most stable number of neighbours with the best accuracy. About 8% of our predictions were false negatives, and about 3% were false positives. Using 47 neighbours proved the most stable because the model’s mean accuracy decreased by only 1 to 2% at nearby K values. We avoided K values that introduced more false negatives, because the harm to a patient would be greater if they had heart disease but were not notified. Our results lie within expectations because the dataset has been used in other projects to train models with similarly high accuracy (Aha & Kibler, 1988; Alotaibi, 2019). We also expected a relatively high accuracy because we tested our model on data from the same demographic used to train it, so the model performed well against a testing group with the same demographic biases it was trained on. Follow-up studies could test our model’s accuracy against health data from other demographics to give a more generalizable view of which health factors contribute to accurate heart disease prediction.
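The accuracy and error rates quoted here can be read directly off the K = 47 confusion matrix above; a minimal sketch of the arithmetic:

```r
# Confusion matrix counts for K = 47 (from the output above)
tn <- 29  # truth 0, predicted 0 (correctly ruled out)
fn <- 5   # truth 1, predicted 0 (missed disease)
fp <- 2   # truth 0, predicted 1 (false alarm)
tp <- 24  # truth 1, predicted 1 (correctly detected)

total <- tn + fn + fp + tp  # 60 test observations

accuracy <- (tn + tp) / total  # 53/60, roughly 0.883
fn_rate  <- fn / total         # 5/60, roughly 0.083 of all predictions
fp_rate  <- fp / total         # 2/60, roughly 0.033 of all predictions

c(accuracy = accuracy, false_negative = fn_rate, false_positive = fp_rate)
```

These are rates over all predictions, matching the 88%, 8%, and 3% figures quoted in the text.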
Future work may also investigate whether the trends from our results appear in data from other institutions. Combined with those data, our findings could contribute evidence for a specific combination of significant health attributes that may improve the performance of prediction models, which is what other studies in the field are also searching for (Alotaibi, 2019; Amin et al., 2019). A predictive model incorporating findings from follow-up studies could then be built into a commercial tool and tested for accurate repeatability across diverse demographics. In this way, our model could aid the creation of a tool that helps health care providers predict premature heart disease in real-world settings, alleviating the difficulties health care has faced so far (McGill et al., 2008; Tsao et al., 2022).
A contemporary ethical question that our predictive model and any subsequent models will have to contend with is how interpretable the technologies behind them are (Wang et al., 2020). When using a learning model to predict heart disease, it may become extremely difficult to parse the model’s reasoning: why it concludes that a patient’s specific health attributes lead to its prediction. With current methodologies, it is difficult to probe the model’s large training dataset and decipher the tuning that produced its final decision, which makes it a “black-box” technology: it is challenging to see inside the model’s reasoning. A predictive model’s lack of transparent reasoning is concerning for patients and health care providers, who must put their trust in it without fully understanding how it works, because responsibility for a misdiagnosis becomes ambiguous.
Overall, our project followed up on other modelling done with the Cleveland heart disease dataset by building a model with the K-Nearest Neighbours algorithm that predicts heart disease occurrence with up to 88% accuracy. Future work might investigate whether the health attributes we used generalize to predicting heart disease in other demographics. Whether our model could be used in a health care institution remains to be seen, given the ethical issues behind the black-box technologies involved.
Aha, D., & Kibler, D. (1988). Instance-based prediction of heart-disease presence with the Cleveland database. University of California, 3(1), 3–2.
Alotaibi, F. S. (2019). Implementation of Machine Learning Model to Predict Heart Failure Disease. International Journal of Advanced Computer Science and Applications (IJACSA), 10(6), Article 6. https://doi.org/10.14569/IJACSA.2019.0100637
Amin, M. S., Chiam, Y. K., & Varathan, K. D. (2019). Identification of significant features and data mining techniques in predicting heart disease. Telematics and Informatics, 36, 82–93. https://doi.org/10.1016/j.tele.2018.11.007
Cardiovascular diseases (CVDs). (2021, June 11). World Health Organization. https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
Heart Disease Data Set. (1988, July 1). UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Heart+Disease
McGill, H. C., McMahan, C. A., & Gidding, S. S. (2008). Preventing Heart Disease in the 21st Century. Circulation, 117(9), 1216–1227. https://doi.org/10.1161/CIRCULATIONAHA.107.717033
Tsao, C. W., Aday, A. W., Almarzooq, Z. I., Alonso, A., Beaton, A. Z., Bittencourt, M. S., Boehme, A. K., Buxton, A. E., Carson, A. P., Commodore-Mensah, Y., Elkind, M. S. V., Evenson, K. R., Eze-Nliam, C., Ferguson, J. F., Generoso, G., Ho, J. E., Kalani, R., Khan, S. S., Kissela, B. M., … Martin, S. S. (2022). Heart Disease and Stroke Statistics—2022 Update: A Report From the American Heart Association. Circulation, 145(8). https://doi.org/10.1161/CIR.0000000000001052
Wang, F., Kaushal, R., & Khullar, D. (2020). Should Health Care Demand Interpretable Artificial Intelligence or Accept “Black Box” Medicine? Annals of Internal Medicine, 172(1), 59–60. https://doi.org/10.7326/M19-2548