High rates of readmission after hospitalization for heart failure put a tremendous burden on patients and the healthcare system. Predictive models are used to identify patients at high risk of hospital readmission and, by identifying key risk factors, potentially to direct specific interventions toward those who might benefit most. The current ability to predict readmissions in patients with heart failure is modest at best. The inclusion of a richer set of predictor variables encompassing patients’ clinical, social, and demographic domains has improved discrimination in some internally validated studies, but it does not necessarily improve discrimination in general. This richer set of predictors might not contain the predictive domain of variables required, but it does represent a large set of data not routinely collected in other studies. It is unclear whether machine learning (ML) techniques that address higher-dimensional, nonlinear relationships among variables would enhance prediction. The authors seek to compare the effectiveness of several ML algorithms for predicting readmissions.
Data for this study were drawn from Tele-HF, which enrolled 1653 patients within 30 days of their discharge after an index hospitalization for heart failure. In addition to the clinical data from the index admission, Tele-HF used validated instruments to collect data on patients’ socioeconomic, psychosocial, and health status. Of the 1653 enrolled patients, the authors excluded 36 who were readmitted or died before the interview, 574 whose interviews were completed more than 30 days after discharge, and 39 who were missing data on >15 of the 236 baseline features, leaving a study sample of 1004 patients for the 30-day readmission analysis set. In total, 472 variables were used as input. Models were built using both traditional statistical methods and ML methods to predict readmission, and the models were compared on the discrimination and predictive range of the various techniques. A logistic regression (LR) model and a Poisson regression were used as the traditional statistical models. Three ML methods were used for readmission prediction: random forests (RF), boosting, and support vector machines (SVM).
For comparison purposes, the LR model used the variables selected in a recent study from Tele-HF, which the authors took to provide the most accurate representation of an LR model on the Tele-HF data set. This choice reflects the fact that the current analysis is concerned with finding improved analytic algorithms for predicting 30-day readmissions rather than primarily with variable selection. Given the flexibility of nonlinear methods, the complexity of the desired models might overwhelm the available data, resulting in overfitting. Although all the available variables can be used in ML techniques such as RF and boosting, which are robust to this overfitting, some form of feature selection may be required to help prevent overfitting in less robust techniques like SVM.
To overcome the potential for overfitting in LR and SVM, a hierarchical method with RF was developed. Previous hierarchical methods used RF as a feature selection method, because it is well suited to high-dimensional data sets with varied data types, to identify a subset of features to feed into methods such as LR and SVM. RF uses out-of-bag estimates and an internal bootstrap to help select only predictive variables and avoid overfitting, much like AdaBoost.
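The hierarchical idea described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: it assumes scikit-learn is available, uses synthetic data from make_classification in place of Tele-HF, and takes "RF as feature selection" to mean keeping the features whose RF importance is at or above the median. The sample size, feature count, tree count, and selection threshold are all hypothetical choices.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a wide clinical data set (not Tele-HF data).
X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           weights=[0.8, 0.2], random_state=0)

# Equal-sized derivation and validation sets with matching outcome rates.
X_dev, X_val, y_dev, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Stage 1: a random forest ranks features; keep those at or above the
# median importance (an assumed selection rule).
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=200, class_weight="balanced",
                           random_state=0),
    threshold="median")

# Stage 2: the reduced feature set is scaled and passed to an SVM.
model = make_pipeline(selector, StandardScaler(),
                      SVC(kernel="rbf", class_weight="balanced"))
model.fit(X_dev, y_dev)

n_selected = int(model.named_steps["selectfrommodel"].get_support().sum())
auc = roc_auc_score(y_val, model.decision_function(X_val))
print(f"features kept: {n_selected} of {X.shape[1]}, validation C statistic: {auc:.3f}")
```

Putting the selector inside the pipeline means the feature subset is chosen from the derivation set only, so the validation set does not leak into the selection step.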
To construct the derivation and validation data sets, the cohort was split into 2 equally sized groups, ensuring equal percentages of readmitted patients in each group. To account for the large imbalance between readmitted and not-readmitted patients in each group, the ML algorithms were weighted. The weight selected for the readmitted patients was the ratio of not-readmitted patients to readmitted patients in the derivation set. Once the derivation and validation sets were created, a traditional LR model was trained. The generated models were run on the validation set, and the area under the receiver operating characteristic curve (C statistic) was calculated as a measure of model discrimination. The analysis was run 100 times to provide robustness against a potentially poor random split of patients and to generate a mean C statistic with a 95% confidence interval (CI). The probabilities of readmission generated over the 100 iterations were then sorted into deciles. Finally, the observed readmission rate for each decile was calculated to determine the predictive range of the algorithms.
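The evaluation loop described above can be sketched in plain Python. The sketch makes several assumptions that are not from the paper: the cohort is synthetic, a fixed synthetic score stands in for a trained model, and the 95% CI uses a normal approximation over the 100 repeats (the summary does not say how the interval was computed). All function names are hypothetical.

```python
import random
import statistics


def stratified_halves(labels, rng):
    """Split indices into two halves with near-equal readmission rates."""
    derivation, validation = [], []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        derivation += idx[:half]
        validation += idx[half:]
    return derivation, validation


def positive_class_weight(labels):
    """Weight for readmitted patients: not-readmitted count / readmitted count."""
    readmitted = sum(labels)
    return (len(labels) - readmitted) / readmitted


def c_statistic(labels, scores):
    """Chance that a readmitted patient outscores a not-readmitted one (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_with_ci(values, z=1.96):
    """Mean and normal-approximation 95% interval (an assumed CI method)."""
    mean = statistics.mean(values)
    half_width = z * statistics.stdev(values) / len(values) ** 0.5
    return mean, mean - half_width, mean + half_width


rng = random.Random(0)
# Synthetic cohort: roughly 20% readmitted, one weakly informative score each.
labels = [1 if rng.random() < 0.2 else 0 for _ in range(400)]
scores = [rng.gauss(0.5 * y, 1.0) for y in labels]

c_stats = []
for _ in range(100):
    derivation, validation = stratified_halves(labels, rng)
    weight = positive_class_weight([labels[i] for i in derivation])
    # A real run would fit a weighted model on the derivation set here;
    # this sketch scores the validation set with the fixed synthetic score.
    c_stats.append(c_statistic([labels[i] for i in validation],
                               [scores[i] for i in validation]))

mean_c, ci_low, ci_high = mean_with_ci(c_stats)
print(f"weight={weight:.2f}  C statistic={mean_c:.3f} (95% CI, {ci_low:.3f}-{ci_high:.3f})")
```

Computing the C statistic by pairwise comparison is quadratic in cohort size, which is acceptable for roughly a thousand patients; a rank-based formula would be preferable for larger cohorts.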
Thirty-Day All-Cause Model Discrimination
LR had a low C statistic of 0.533 (95% CI, 0.527–0.538). Boosting on the input data had the highest C statistic (0.601; 95% CI, 0.594–0.607) in a 30-day binary outcome with a 30-day training case. Boosting also had the highest C statistic for the 30-day binary outcome with 180-day binary training (0.613; 95% CI, 0.607–0.618). For the 30-day outcomes with 180-day counts training, the RF technique had the highest C statistic (0.628; 95% CI, 0.624–0.633).
One Hundred Eighty–Day All-Cause Model Discrimination
LR again showed a low C statistic (0.574; 95% CI, 0.571–0.578) for the 180-day binary case. The RF into SVM hierarchical method achieved the highest C statistic across all methods, both for the 180-day binary outcome (0.654; 95% CI, 0.650–0.657) and for the 180-day counts case (0.649; 95% CI, 0.648–0.651).
Readmission Because Of Heart Failure Discrimination
For readmissions because of heart failure, the LR model again had a low C statistic for the 30-day binary case (0.543; 95% CI, 0.536–0.550) and the 180-day binary case (0.566; 95% CI, 0.562–0.570). Boosting had the best C statistic for the 30-day binary-only case (0.615; 95% CI, 0.607–0.622) and for the 30-day with 180-day binary case training (0.678; 95% CI, 0.670–0.687). The highest C statistics for the other prediction cases were RF for the 30-day with 180-day counts case training (0.669; 95% CI, 0.661–0.676); RF into SVM for the 180-day binary-only case (0.657; 95% CI, 0.652–0.661); and RF into SVM for the 180-day counts case (0.651; 95% CI, 0.646–0.656).
Readmission Because Of Heart Failure Predictive Range
When the deciles of risk prediction were plotted against the observed readmission rate, RF and boosting each had the biggest differences between the first and tenth deciles of risk (1.8–11.9% and 1.4–12.2%, respectively).
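The predictive range reported above compares observed readmission rates in the lowest and highest deciles of predicted risk. A minimal sketch of that calculation follows; the function name and the toy data are hypothetical, and the data are constructed so that each decile's observed rate is easy to verify by hand.

```python
def observed_rate_by_decile(probabilities, outcomes):
    """Sort patients by predicted risk, cut into 10 groups, return observed rates."""
    ranked = sorted(zip(probabilities, outcomes), key=lambda pair: pair[0])
    n = len(ranked)  # assumes at least 10 patients
    rates = []
    for d in range(10):
        group = ranked[d * n // 10:(d + 1) * n // 10]
        rates.append(sum(y for _, y in group) / len(group))
    return rates


# Toy data: 100 patients with increasing predicted risk; decile d (0-based)
# is built to contain exactly d readmissions out of 10 patients.
probabilities = [i / 100 for i in range(100)]
outcomes = [1 if (i % 10) < (i // 10) else 0 for i in range(100)]

rates = observed_rate_by_decile(probabilities, outcomes)
predictive_range = rates[-1] - rates[0]
print(rates)
print(predictive_range)
```

A wider gap between the first and tenth deciles means the model separates low-risk from high-risk patients more clearly, which is the sense in which RF and boosting showed the largest predictive range.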
Thirty-day all-cause readmission had a PPV of 0.22 (95% CI, 0.21–0.23), a sensitivity of 0.61 (0.59–0.64), and a specificity of 0.61 (0.58–0.63) at a maximal f-score of 0.32 (0.31–0.32); 180-day all-cause readmission had a PPV of 0.51 (0.51–0.52), a sensitivity of 0.92 (0.91–0.93), and a specificity of 0.18 (0.16–0.21) at a maximal f-score of 0.66 (0.65–0.66); 30-day readmission because of heart failure had a PPV of 0.15 (0.13–0.16), a sensitivity of 0.45 (0.41–0.48), and a specificity of 0.79 (0.76–0.82) at a maximal f-score of 0.20 (0.19–0.21); 180-day readmission because of heart failure had a PPV of 0.51 (0.50–0.51), a sensitivity of 0.94 (0.93–0.95), and a specificity of 0.15 (0.13–0.17) at a maximal f-score of 0.66 (0.65–0.66). The 30-day predictions, in general, were better at identifying negative cases, whereas the 180-day predictions were better able to correctly identify positive cases.
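The figures above are reported at the probability threshold that maximizes the f-score. A minimal sketch of how PPV, sensitivity, and specificity could be obtained at such a threshold is shown below. The function names and the eight-patient data set are hypothetical, and the search simply tries every distinct score as a cutoff, which may differ from the procedure the authors used.

```python
def metrics_at(labels, scores, threshold):
    """Confusion-matrix metrics when scores >= threshold are called positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    tn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 0)
    ppv = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    total = ppv + sensitivity
    f_score = 2 * ppv * sensitivity / total if total else 0.0
    return {"ppv": ppv, "sensitivity": sensitivity,
            "specificity": specificity, "f_score": f_score}


def best_f_score_threshold(labels, scores):
    """Try every distinct score as the cutoff and keep the best f-score."""
    return max(sorted(set(scores)),
               key=lambda t: metrics_at(labels, scores, t)["f_score"])


# Toy example: 8 patients, 4 of them readmitted.
labels = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]

threshold = best_f_score_threshold(labels, scores)
result = metrics_at(labels, scores, threshold)
print(threshold, result)
```

Because the f-score ignores true negatives, maximizing it can favor a low threshold with high sensitivity and low specificity, which is consistent with the 180-day results reported above.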
The results of this study support the hypothesis that ML methods, with the ability to leverage all available data and their complex relationships, can improve both discrimination and range of prediction over traditional statistical techniques. The performance of the ML methods varies in complex ways, including discrimination and predictive range.
Although the paper describes how the data were obtained, the quality of the data is unknown. Collected data typically contain errors such as outliers, missing values, and incorrect data types, and it is unclear whether such errors were removed. Removing them could improve model accuracy, possibly by as much as 10% depending on the type of data and the model used.
Although random forest, boosting, and SVM are good techniques, neural networks can also be used as the data set grows larger. Here, with over 450 input variables, SVM and random forest would slow down. Simpler models such as logistic regression can also be tried to gauge their accuracy. With the addition of these two models, a comparison among the five classification models would help clarify how these techniques behave.
These techniques work on this set of data. Although the paper mentions some diversity, the research would need to include more diversity in its results. The risk of heart failure readmission could differ among people of different ethnicities and origins, depending on their diet, the climate they grew up in, and changes to their lifestyle over time.
Ideas for Follow-On Work
Future work might focus on further improving predictive ability through advanced methods and more discerning data, to facilitate better targeting of interventions to the subgroups of patients at highest risk for adverse outcomes. Newer models are being developed continually and could improve prediction on these types of data sets.
Although the paper does not mention whether the processing is performed in the cloud, the approach could be incorporated into edge analytics using wearable devices. This combination of edge analytics and cloud processing can be achieved with federated learning. Federated learning is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging their data samples. This approach stands in contrast to traditional centralized machine learning techniques, where all data samples are uploaded to one server, as well as to more classical decentralized approaches, which assume that local data samples are identically distributed.
Federated learning enables multiple actors to build a common, robust machine learning model without sharing data, thus addressing critical issues such as data privacy, data security, data access rights, and access to heterogeneous data, which makes it well suited to this kind of analysis.
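To make the idea concrete, the following is a minimal sketch of federated averaging for a small logistic regression model. It is an illustration under stated assumptions rather than a production design: the two "hospital" data sets are invented, each sample is a (bias, feature) pair with a binary readmission label, and the function names, learning rate, and round counts are hypothetical. Only model weights move between the clients and the server; the raw samples never leave the client lists.

```python
import math


def local_update(weights, data, lr=0.1, epochs=5):
    """Gradient descent on the logistic loss using only one client's data."""
    w = list(weights)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj
        w = [wi - lr * g / len(data) for wi, g in zip(w, grad)]
    return w


def federated_average(client_weights, client_sizes):
    """Server step: average client weights, weighted by local sample counts."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
            for j in range(dim)]


def train_federated(clients, dim, rounds=20):
    """Alternate local training on each client with server-side averaging."""
    global_w = [0.0] * dim
    for _ in range(rounds):
        updates = [local_update(global_w, data) for data in clients]
        global_w = federated_average(updates, [len(d) for d in clients])
    return global_w


# Invented local data sets: x = (bias term, one risk feature), y = readmitted.
hospital_a = [((1.0, -2.0), 0), ((1.0, -1.0), 0), ((1.0, 1.5), 1)]
hospital_b = [((1.0, -0.5), 0), ((1.0, 1.0), 1), ((1.0, 2.0), 1), ((1.0, 0.5), 1)]

global_w = train_federated([hospital_a, hospital_b], dim=2)
print(global_w)
```

Weighting each client's update by its sample count keeps a small site from dominating the shared model, which matters when sites differ in size and patient mix.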