1. Dataset Description
UCI Machine Learning repository – Diabetes 130-US hospitals for years 1999-2008 Data Set
This research uses a publicly available dataset provided by the Center for Clinical and Translational Research, Virginia Commonwealth University. The UCI release contains roughly 100,000 encounter records (101,766 rows) collected across 130 US hospitals and from various healthcare providers over 10 years (1999–2008). It has fifty features describing diabetic patients’ hospital encounters, including their readmission status. Based on our review of the dataset, the features most likely to affect our model are:
- Admission source – Takes 21 distinct values describing how the patient was admitted
- Discharge disposition information – Takes 29 distinct values indicating where the patient was discharged to
- Medication changes – Indicates whether the patient’s medication was changed
- Diagnosis information – Consists of ICD-9 (International Classification of Diseases, Ninth Revision) codes
- Drug usage – Records dosage information for 23 different drugs
- Readmission time – Indicates whether the patient was readmitted within 30 days, after 30 days, or not at all
The data is initially split into 80% training and 20% test sets. In addition, 5-fold cross-validation is to be applied to obtain reliable evaluation results for each model.
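A minimal sketch of the split and cross-validation described above, using scikit-learn. The synthetic data from `make_classification`, the logistic-regression stand-in model, the stratified split, and the `random_state` values are assumptions made for illustration and are not part of the proposal.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic 3-class stand-in for the readmission data (assumption).
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# 80% training / 20% test; stratify keeps class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# 5-fold cross-validation on the training portion only.
model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print(len(X_train), len(X_test), cv_scores.mean())
```

Keeping the test set out of cross-validation means it can still serve as an untouched final check.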
Background: Machine learning techniques have been used to solve a considerable number of problems in the healthcare sector, and we plan to research one such problem. Hospital readmissions are not only costly but also put patients’ health at risk. Moreover, readmission rates have become a decisive factor in ranking the credibility of health centers. An increase in hospital visits after discharge is costly and time-consuming for both hospitals and patients.
Major studies propose that an unplanned readmission within 30 days indicates a treatment or diagnosis error that could have been avoided, whereas a readmission after 30 days depends on the patient’s lifestyle or several other factors. Early prediction of patient readmission therefore becomes an important task.
Existing models in related research predict readmission within 30 days of discharge. Our research aims to predict unplanned readmission in diabetic patients using multiclass classification, that is, whether a patient is readmitted within 30 days, readmitted after 30 days, or not readmitted at all. The primary tasks include data preprocessing steps such as data reduction, data cleaning, and data transformation. Furthermore, a good model requires extracting essential features, so we plan to use several feature selection algorithms to obtain the best features. Using these features, different models such as Random Forest, Support Vector Machine, Logistic Regression, Multilayer Perceptron, Naïve Bayes, and an ensemble model are to be tested and compared on standard evaluation metrics (accuracy, precision, recall, F1-score, AUC).
Following are the goals of our research:
Predict if the patient will be:
- Readmitted within 30 days (<30)
- Readmitted after 30 days (>30)
- Not readmitted (No)
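As a small illustration of the three-class target, the sketch below maps readmission labels to integer codes with pandas. It assumes the target column uses the labels `<30`, `>30`, and `NO`; the integer codes in `label_map` are a hypothetical choice, not something fixed by the proposal.

```python
import pandas as pd

# Toy column using the three assumed labels of the readmission field.
readmitted = pd.Series(["NO", ">30", "<30", "NO", ">30"])

# Hypothetical integer codes for the three target classes.
label_map = {"NO": 0, ">30": 1, "<30": 2}
target = readmitted.map(label_map)
print(target.tolist())  # [0, 1, 2, 0, 1]
```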
To achieve the goal, we will perform the following tasks:
- Task 1: Data Analysis for Decision Making
The first step includes collecting the data, analyzing it by plotting graphs of various features, checking correlations among the features, and interpreting the results. These results give an initial idea of which features are essential and where outliers occur.
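The correlation check in this task could look roughly like the following pandas sketch. The column names, the randomly generated values, and the 0.8 threshold are illustrative assumptions; the real analysis would run on the loaded dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
# Toy numeric features; names and values are illustrative assumptions.
time_in_hospital = rng.integers(1, 15, size=n)
df = pd.DataFrame({
    "time_in_hospital": time_in_hospital,
    # Constructed to correlate strongly with length of stay.
    "num_medications": time_in_hospital * 2 + rng.integers(0, 5, size=n),
    "num_procedures": rng.integers(0, 7, size=n),
})

corr = df.corr()  # Pearson correlation matrix

# List feature pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.8
high_pairs = [(a, b) for i, a in enumerate(corr.columns)
              for b in corr.columns[i + 1:]
              if abs(corr.loc[a, b]) > threshold]
print(high_pairs)
```

Highly correlated pairs are candidates for dropping one of the two features or combining them.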
- Task 2: Data Cleaning
The data uses ‘?’ rather than a standard missing-value marker such as ‘NaN’ or ‘NULL’. Recoding these entries as missing values and removing redundant features is therefore an important task. This step also includes replacing or correcting dirty data.
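One way to handle the ‘?’ marker is sketched below with pandas. The inline CSV is a made-up stand-in for the real file, and its column names and values are assumptions for illustration only.

```python
import io

import numpy as np
import pandas as pd

# Tiny inline CSV standing in for the real file (values are made up).
csv_text = """race,gender,weight,age
Caucasian,Female,?,[50-60)
?,Male,?,[60-70)
AfricanAmerican,Female,?,[70-80)
"""

# Option 1: treat '?' as missing while reading the file.
df = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Option 2: replace '?' after the data has been loaded.
raw = pd.read_csv(io.StringIO(csv_text))
cleaned = raw.replace("?", np.nan)

print(df.isna().sum())
```

Either option leaves the missing entries visible to `isna()`, so later steps can count or drop them.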
- Task 3: Data Preprocessing
Process missing data: Features that have more than 50% missing values and are irrelevant to predicting the target variable are removed.
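The 50% rule can be sketched with pandas as follows. The toy frame and its column names are assumptions; which columns of the real dataset exceed the threshold is not assumed here, and the relevance check mentioned above is left to manual judgment.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values.
df = pd.DataFrame({
    "weight": [np.nan, np.nan, np.nan, 80.0],       # 75% missing
    "race": ["Caucasian", None, "Asian", "Other"],  # 25% missing
    "age": [55, 65, 75, 45],                        # complete
})

missing_share = df.isna().mean()  # fraction of missing values per column
to_drop = missing_share[missing_share > 0.5].index
reduced = df.drop(columns=to_drop)
print(list(reduced.columns))  # ['race', 'age']
```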
Encode categorical data: Categorical features such as gender and race are to be encoded using OneHotEncoder and label encoding.
Scale features and apply transformations: Some of the features in the dataset are highly skewed. To reduce this skew, we plan to try several transformation functions, such as normalization, the sigmoid function, the log function, and the cube root function.
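The effect of such transformations on skew can be checked with a short sketch like the one below. The sample values are made up, and `log1p` (log of 1 + x) is used instead of a plain log as an assumption, so that zero counts stay defined.

```python
import numpy as np
import pandas as pd

# Right-skewed toy counts (made-up values).
visits = pd.Series([0, 0, 1, 1, 2, 3, 5, 8, 40], dtype=float)

log_t = np.log1p(visits)   # log(1 + x), defined at zero
cbrt_t = np.cbrt(visits)   # cube root
# Min-max normalization rescales to [0, 1] but does not change the skew.
minmax = (visits - visits.min()) / (visits.max() - visits.min())

print(visits.skew(), log_t.skew(), cbrt_t.skew())
```

Comparing the skewness before and after each transformation gives a simple criterion for choosing among them per feature.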
- Task 4: Feature Selection and Addition
Selecting essential features for the model: When a dataset has too many features, it is better to keep only the relevant ones. We plan to use feature selection algorithms such as SelectKBest, SelectPercentile, and the Boruta algorithm.
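The two scikit-learn selectors named above can be sketched as follows. The synthetic data, the ANOVA F-score (`f_classif`) scoring function, and the values `k=5` and `percentile=25` are assumptions for illustration; Boruta is a separate package and is not covered here.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

# Synthetic 3-class data with 20 features, 5 of them informative (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Keep the 5 features with the highest ANOVA F-scores.
kbest = SelectKBest(score_func=f_classif, k=5).fit(X, y)
X_k = kbest.transform(X)

# Keep roughly the top 25% of features by the same score.
pct = SelectPercentile(score_func=f_classif, percentile=25).fit(X, y)
X_p = pct.transform(X)

print(X_k.shape, X_p.shape)
print(kbest.get_support(indices=True))  # indices of the selected columns
```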
Feature addition: Combining existing features in the dataset can yield additional features that may help predict the target variable better.
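As one hypothetical example of feature addition, the sketch below sums three visit-count columns into a single total. The column names are assumed to match the public CSV, the values are made up, and `total_visits` is an illustrative name rather than a feature the proposal commits to.

```python
import pandas as pd

# Column names are assumed; values are made up.
df = pd.DataFrame({
    "number_outpatient": [0, 2, 1],
    "number_emergency": [1, 0, 0],
    "number_inpatient": [0, 3, 1],
})

# Hypothetical derived feature: total prior visits of any kind.
visit_cols = ["number_outpatient", "number_emergency", "number_inpatient"]
df["total_visits"] = df[visit_cols].sum(axis=1)
print(df["total_visits"].tolist())  # [1, 5, 2]
```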
- Task 5: Model Building
After splitting the data into training and test sets, each candidate algorithm is trained on the training set, the best model is selected based on accuracy, and that model is then evaluated on the test set.
We plan to use the following multiclass classification algorithms:
- Logistic Regression
- Random Forest
- Decision trees
- Gaussian Naïve Bayes
- Support vector machine
- Neural networks (MLPClassifier, feed-forward backpropagation network)
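A comparison of the listed algorithms could be organized roughly as in the sketch below. The synthetic data, all hyperparameters, the standard-scaling pipelines, and the macro-averaged F1 scoring are assumptions chosen for illustration, not settings fixed by the proposal.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic 3-class stand-in for the readmission data (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Scale-sensitive models are wrapped in a pipeline with StandardScaler.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Support Vector Machine": make_pipeline(StandardScaler(), SVC()),
    "MLP": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0)),
}

# Mean macro-F1 over 5 folds for each model.
results = {name: cross_val_score(m, X, y, cv=5, scoring="f1_macro").mean()
           for name, m in models.items()}

for name, score in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```

Putting the scaler inside each pipeline means it is refit on every training fold, which avoids leaking information from the validation fold.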
- Task 6: Evaluation and Prediction
Evaluation metrics are essential for judging whether the research goal is met. We plan to evaluate the models using metrics such as the confusion matrix, F1 score, precision, and recall.
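The metrics named above can be computed for a three-class problem as in this sketch. The true and predicted labels are made up, and macro averaging (an unweighted mean over classes) is an assumption; weighted or per-class reporting may suit the final study better.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up true and predicted labels for the three classes.
y_true = ["NO", "NO", ">30", ">30", "<30", "<30"]
y_pred = ["NO", ">30", ">30", ">30", "<30", "NO"]
labels = ["<30", ">30", "NO"]

# Rows are true classes, columns are predicted classes, in `labels` order.
cm = confusion_matrix(y_true, y_pred, labels=labels)
precision = precision_score(y_true, y_pred, labels=labels, average="macro")
recall = recall_score(y_true, y_pred, labels=labels, average="macro")
f1 = f1_score(y_true, y_pred, labels=labels, average="macro")

print(cm)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Macro averaging treats the three classes equally, which matters when the within-30-days class is much smaller than the others.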
4. Potential Tools
- Jupyter Notebook
- Kaggle Notebook
- Google Colab
5. Potential Timeline
15th March-30th March: Data analysis for decision making
31st March-5th April: Feature selection and addition
6th April-25th April: Evaluation and prediction
26th April-1st May
6. References
- B. Strack et al., ‘Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records,’ BioMed Research International, 2014. [Online]. Available: https://www.hindawi.com/journals/bmri/2014/781670/.
- Wikipedia, ‘List of ICD-9 codes,’ 30 December 2019. [Online]. Available: https://en.wikipedia.org/wiki/List_of_ICD-9_codes.
- N. Hammoudeh, ‘Predicting Hospital Readmission among Diabetics using Deep Learning,’ November 2018. [Online]. Available: https://www.researchgate.net/publication/328887677_Predicting_Hospital_Readmission_among_Diabetics_using_Deep_Learning.
- D. Mordaunt, ‘Improving 30-day readmission risk predictions using machine learning,’ in Health Informatics New Zealand (HiNZ) Conference, 2016.
- Medicare.gov, ‘30-day unplanned readmission and death measures,’ 2017. [Online]. Available: Medicare.gov.
- Ti’jay Goudjerkan, ‘Predicting 30-Day Hospital Readmission for Diabetes Patients using Multilayer Perceptron,’ International Journal of Advanced Computer Science and Applications, vol. 10, no. 2, pp. 268-275, 2019.