An NLP (Natural Language Processing) Framework to perform risk identification using feature engineering from unstructured data
Hospital readmissions for COPD (Chronic Obstructive Pulmonary Disease) increase medical expenses and require intensive care for patients. Natural Language Processing (NLP) is the set of techniques that extract information from text so that it can be used in computations and algorithms. We aim to develop an NLP framework that analyzes clinical notes, physician entries, x-ray reports, and other unstructured hospital data to predict hospital readmissions. The framework is to be built from NLP techniques for text preprocessing and feature analysis, with a neural network for the classification process.
Keywords: Natural Language Processing, Chronic Obstructive Pulmonary Disease, Bio-Informatics, Preprocessing of unstructured data, Hospital readmissions, Feature analysis, Classification.
About one in five patients hospitalized for COPD (Chronic Obstructive Pulmonary Disease) requires rehospitalization within 30 days. To address rising costs and quality concerns, Medicare’s Hospital Readmissions Reduction Program (HRRP) was enacted in the United States, initially targeting inpatient discharges in the Medicare fee-for-service population for congestive heart failure (CHF), acute myocardial infarction (AMI), and pneumonia. The HRRP mandates up to a 3% reduction in all Medicare reimbursements for hospitals that fail to stay below their expected readmission rates. In October 2014, the HRRP was expanded to include COPD, so it now penalizes hospitals for excess 30-day, all-cause readmissions after a hospitalization for an acute exacerbation of COPD, despite minimal evidence to guide hospitals on how to reduce readmissions. Challenges remain for improving overall COPD care quality and, specifically, for meeting HRRP requirements: limited evidence is available on readmission risk factors and reasons for readmission to guide hospitals in initiating programs to reduce COPD readmissions. In one reported study period, there were 26,798,404 inpatient admissions, of which 3.5% were index COPD admissions, and 20.2% of these were readmitted to the hospital within 30 days. Respiratory-related diseases accounted for only one-half of the reasons for readmission; COPD was the most common diagnosis, explaining 27.6% of all readmissions.
A. Problem Statement
The US government penalizes hospitals for excess patient readmissions. Most of the information held by a hospital is unstructured data, including clinical notes, physician notes, and x-ray reports. An NLP-based approach to extract information from such hospital records is being developed. Because this unstructured information is varied and complex, a protocol is required that standardizes the records by converting the unstructured data into a structured form.
The raw unstructured data must be processed before it can be analyzed. This processing involves natural language processing techniques such as tokenization, stemming, noise reduction, and removal of stop words. A few approaches to preprocessing, analysis, and classification are discussed below.
Preprocessing is the process of extracting the relevant information or records from the unstructured data. It commonly focuses on three components: tokenization, normalization, and substitution. Tokenization splits a string of text into minimal meaningful units called tokens, typically individual keywords. Stop words such as ‘the’, ‘is’, ‘are’, and ‘an’ are usually eliminated once the text has been tokenized. Normalization converts the tokens into a consistent form. As part of normalization, stemming removes affixes (suffixes, prefixes, infixes, circumfixes) from a word to obtain its stem. Noise removal, also called text cleaning, removes material such as text file headers, footers, and metadata, and also extracts records from other formats.
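The preprocessing steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework's actual implementation: the stop-word list, the suffix list, and the convention that metadata lines start with `##` are all hypothetical, and the naive suffix stripper stands in for a proper stemmer.

```python
import re

# Hypothetical, deliberately short stop-word list (assumption for illustration).
STOP_WORDS = {"the", "is", "are", "an", "a", "of", "and", "with", "was", "to"}

# Hypothetical suffix list for a naive stemmer; not a full stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def remove_noise(text):
    """Noise removal: drop metadata lines (assumed to start with '##')
    and collapse runs of whitespace."""
    kept = [line for line in text.splitlines() if not line.startswith("##")]
    return " ".join(" ".join(kept).split())


def tokenize(text):
    """Tokenization: split a string into alphanumeric tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


def normalize(tokens):
    """Normalization: bring every token to lowercase."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [token for token in tokens if token not in STOP_WORDS]


def stem(token):
    """Naive stemming: strip the first matching suffix, keeping at least 3 letters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run noise removal, tokenization, normalization, stop-word removal, stemming."""
    tokens = remove_stop_words(normalize(tokenize(remove_noise(text))))
    return [stem(token) for token in tokens]


if __name__ == "__main__":
    note = "## source: discharge summary\nThe patient is wheezing and was readmitted with COPD."
    print(preprocess(note))
```

Keeping each step as its own function mirrors the component view in the text and makes it easy to swap the naive stemmer for a library stemmer later.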
- Feature Analysis
Feature extraction reduces the resources needed to describe a large set of data. Analyzing a large data set requires considerable time and effort, whereas feature extraction keeps only the vital keywords and removes unwanted words. Two feature extraction methods are used here: BOW and cTAKES. BOW stands for Bag-of-Words, which represents each document by the number of occurrences of each word in the data set fed to the system. Bag-of-Words is a simple and widely used way to extract features from text documents. Higher-level feature extraction is done using Apache cTAKES. cTAKES stands for clinical Text Analysis and Knowledge Extraction System; it extracts clinical information from the unstructured text of electronic health records.
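As a rough illustration of the Bag-of-Words representation described above, the sketch below builds a vocabulary from already-tokenized documents and turns each document into a vector of word counts. The function names and the toy documents are assumptions made for this example; cTAKES is not shown because it is a separate Java-based system.

```python
from collections import Counter


def build_vocabulary(documents):
    """Sorted list of the distinct tokens found across all tokenized documents."""
    return sorted({token for doc in documents for token in doc})


def bag_of_words(doc, vocabulary):
    """Count vector: number of occurrences of each vocabulary word in one document."""
    counts = Counter(doc)
    return [counts[word] for word in vocabulary]


if __name__ == "__main__":
    # Toy tokenized notes (hypothetical data).
    docs = [["copd", "cough", "cough"], ["copd", "fever"]]
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(bag_of_words(doc, vocab))
```

Word order is discarded in this representation; only the counts are passed on to the classifier.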
Text classification can be done with classical machine learning algorithms such as Naïve Bayes, Random Forest, and k-nearest neighbors (KNN). We instead propose a framework that uses a neural network for classification. One drawback of the classical algorithms for text classification is that they provide a score, whereas what we need is a probability. Moreover, these algorithms learn from what is present in a class but not from what is absent. As a result, they do not capture the context of a sentence and instead classify it based on the scores. Neural networks are therefore used to obtain high performance on NLP tasks, where they can provide more accurate results than the classical classification algorithms. The accuracy is improved through two primary methods: the first is a supervised neural network that runs the input through various classifications, and the second is an unsupervised neural network that optimizes the feature selection.
In simple terms, a neural network is fed data and learns from it to solve complex tasks. A basic neural network is organized into three layers: an input layer, a hidden layer, and an output layer. For classification, the final layer is typically a softmax layer, which acts as the classifier. In this way, neurons can be modeled to perform the classification computation. A layer with multiple neurons can be thought of as providing the same data to several classification functions, with each neuron representing a different regression function. A large set of data is fed to the network to train it, and training is achieved through backpropagation. Each layer passes its output to the next layer as input.
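The structure described above (input layer, hidden layer, softmax output, training by backpropagation) can be sketched with a very small network in plain Python. This is an illustrative sketch only: the class name `TinyNetwork`, the layer sizes, the learning rate, and the toy feature vectors are assumptions made for this example and are not the proposed framework's configuration.

```python
import math
import random


def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyNetwork:
    """Input layer -> one sigmoid hidden layer -> softmax output layer."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, x):
        """Each layer passes its output to the next; returns (hidden, class probabilities)."""
        hidden = [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + b)))
                  for row, b in zip(self.w1, self.b1)]
        scores = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        return hidden, softmax(scores)

    def train_step(self, x, label, lr=0.3):
        """One backpropagation step on one example; returns the loss before the update."""
        hidden, probs = self.forward(x)
        # Gradient of the cross-entropy loss w.r.t. the output scores.
        d_scores = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
        # Propagate the error back through the sigmoid hidden layer.
        d_hidden = [h * (1.0 - h) * sum(d_scores[k] * self.w2[k][j] for k in range(len(d_scores)))
                    for j, h in enumerate(hidden)]
        # Gradient-descent updates for both layers.
        for k, dk in enumerate(d_scores):
            for j, h in enumerate(hidden):
                self.w2[k][j] -= lr * dk * h
            self.b2[k] -= lr * dk
        for j, dj in enumerate(d_hidden):
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * dj * xi
            self.b1[j] -= lr * dj
        return -math.log(probs[label])


if __name__ == "__main__":
    # Toy word-count vectors with labels (1 = readmitted, 0 = not); hypothetical data.
    data = [([2.0, 0.0, 1.0], 1), ([0.0, 2.0, 0.0], 0), ([1.0, 0.0, 2.0], 1), ([0.0, 1.0, 0.0], 0)]
    net = TinyNetwork(n_in=3, n_hidden=3, n_out=2)
    for _ in range(300):
        for x, y in data:
            net.train_step(x, y)
    for x, y in data:
        print(y, net.forward(x)[1])
```

The softmax output is what gives a probability per class rather than an unnormalized score, which is the property the text asks of the classifier.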
II. Literature survey
- Natural language processing is a computer technology that concentrates mainly on human-computer interaction. Most data today is in unstructured form, which makes it hard for computers to understand for further use and analysis. This unstructured text needs to be converted into structured form by clearly defining the sentence boundaries, word boundaries, and context-dependent character boundaries. The key steps draw on many algorithms from the fields of data mining and machine learning, so a framework for component selection is created to select the best components. NLP is applied together with processing techniques such as tokenization, stop-word removal, stemming, pruning, semantic analysis, and POS tagging.
- An intelligent system is developed for the analysis and real-time evaluation of a patient’s condition. A hybrid classifier has been implemented on a personal digital assistant, combining a support vector machine, a random forest, and a rule-based system to provide a more advanced categorization scheme for the early, real-time characterization of a COPD episode. This is followed by a severity estimation algorithm that classifies the identified pathological situation into different levels and triggers an alerting mechanism to provide an informative and instructive message or piece of advice to the patient.
- cTAKES builds on existing open-source technologies: the Unstructured Information Management Architecture framework and the OpenNLP natural language processing toolkit. Its components, specifically trained for the clinical domain, create rich linguistic and semantic annotations. The cTAKES annotations are the foundation for methods and modules for higher-level semantic processing of clinical free text. cTAKES is a modular system of pipelined components that combines rule-based and machine learning techniques for information extraction from the clinical narrative.
- The goal of this study is to use machine learning methods and the patient medical records of a reputed Indian hospital to analyze the key factors that affect the all-purpose readmission of a patient with diabetes, to compare different classification models that predict readmission, and to identify the best model. It proposes an architecture for this prediction model and identifies various risk factors using text mining techniques. The study not only discovers risk factors that predict the risk of readmission but also identifies individual factors and groups of factors that are strong indicators of low risk of readmission, along with a cost analysis using real-world data.
- We provide a voice-based Android application with which the user can interact to get an inference of diseases and their remedies by giving symptoms as input. To process the given input, we normalize the data using noun phrase extraction and a medical term identifier. The pre-processing system and the question-answer system are the crucial elements of the proposed system. Question generation is performed using a QA matrix. The response of the system is then delivered to the user in voice format.
- Hospital readmission rates are considered an important indicator of the quality of care because they may be a consequence of actions of commission or omission made during the initial hospitalization of the patient. The framework considers specific COPD-related laboratory test results as part of the structured patient data. These data types are used to develop appropriate regression models for predicting hospital readmission risk for COPD using EHR information.
- In this paper, we are interested in analyzing and pre-processing tweets for NLP and machine learning applications such as machine translation and classification. We propose a pre-processing pipeline for tweets consisting of part-of-speech filtering, named entity recognition, hashtag segmentation, and disambiguation. Our proposed approach is also based on graph theory and groups the words of tweets using the semantic relations of WordNet and the idea of connected components. The integration of WordNet in preprocessing transforms our corpus into a bag of words. We keep the frequency of each word found in the corpus, and each tweet is represented as a sequence of frequencies of the words in the tweet.
- This project creates a novel natural language processing (NLP) pipeline for extraction and classification of temporal information as historic, current, or planned from free-text eligibility criteria. The pipeline uses pattern learning algorithms for extracting temporal information and a trained Random Forest classifier for classification. The pipeline achieved an accuracy of 0.82 in temporal data detection and classification, with an average precision of 0.83 and recall of 0.80 in temporal data classification. The accuracy of the classifier was further tuned by training the Random Forest classifier with different numbers of decision trees.
- Fake reviews are considered spam reviews, which may have a great impact on online marketplace behavior. Many types of features can be used, such as Bag-of-Words, linguistic features, word counts, and n-gram features. We investigate the effects of two different feature selection methods on spam review detection: Bag-of-Words and word count. Different machine learning algorithms are applied: Support Vector Machine, Decision Tree, Naïve Bayes, and Random Forest. There are two categories of spam review detection methods. The first is supervised techniques, which require labeled datasets to detect reviews in unseen data, such as Naïve Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), and Random Forest (RF). The second is unsupervised techniques, which look for hidden patterns in unlabeled data.
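The accuracy, precision, and recall figures quoted in the survey above (for example, for the temporal information pipeline) are computed from the counts of correct and incorrect predictions. The sketch below shows the standard definitions for a binary task; the function name and the example labels are hypothetical and are not taken from any of the surveyed studies.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, and recall for a binary classification task."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # true positive
        elif pred == positive:
            fp += 1  # false positive
        elif truth == positive:
            fn += 1  # false negative
        else:
            tn += 1  # true negative
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical labels: 1 = positive class, 0 = negative class.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
    print(classification_metrics(y_true, y_pred))
```

Precision and recall are reported alongside accuracy because accuracy alone can look high when one class, such as non-readmitted patients, dominates the data.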
III. Comparison table
- NLP-based Clinical Data Analysis for Assessing Readmissions of Patients with COPD. Priyanka V. Medhe, Dinesh D. Puri, 2017
They analyze patient hospital lab reports and discharge summaries, classifying the factors found as primary or secondary. In preprocessing, documents are converted into data schemes and divided into clusters. A predictive model then lists the proposed factors and produces the prediction, using cTAKES, a prediction model technique, text analysis, and UMA models.
- Identification of COPD Patients’ Health Status Using an Intelligent System in the CHRONIOUS Wearable Platform. Christos C. Bellos, Athanasios Papadopoulos, Roberto Rosso, Dimitrios I. Fotiadi, 2014
The patient data record each patient’s likelihood of having COPD. The dataset is collected from sources such as sensors, external devices, and a database. Preprocessing removes baseline wander noise and high-frequency noise; feature extraction then processes the preprocessed sensor-acquired data. Heterogeneous data fusion combines data from different sources in various formats.
- Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation, and applications. Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, Christopher G Chute, 2010
cTAKES consists of components executed in sequence to process the clinical narrative, each incrementally contributing to the cumulative annotation dataset. cTAKES accepts plain text or XML documents. The sentence boundary detector predicts whether a period or question mark marks the end of a sentence. The tokenizer splits the sentence into smaller tokens. The cTAKES POS tagger and shallow parser are wrappers around OpenNLP’s modules.
- Predictive risk modeling for early hospital readmission of patients with diabetes in India. Reena Duggal, Suren Shukla, Sarika Chandra, Balvinder Shukla, Sunil Kumar Khatri, 2016
The system takes raw data on 7100 diabetes patients, collected over a period of 2 years; noisy and inconsistent data are removed during preprocessing. Predictive features include demographics and illness severity. cTAKES is adopted to create one or more pipelines that process clinical notes and identify entities such as diseases and disorders, signs and symptoms, drugs, and procedures. Five classification models were used to build the system: Naïve Bayes, Logistic Regression, Random Forest, AdaBoost, and Neural Networks.
- An Interactive Medical Assistant using Natural Language Processing. Shamli Deshmukh, Ritika Balani, Vijayalaxmi Rohane, Asmita Singh, 2016
In this NLP-based medical assistant, the given input is tokenized with the help of a POS tagger. The input statement mentions only the disease and its duration. Medical terms and keywords are extracted from the input separately by a feature extraction technique; the Question-Generation System then checks the similarities and plots them in the QA map entity, which helps in accurately understanding what the patient requires for disease prediction.
- Predicting Hospital Readmission Risk for COPD Using EHR Information. Ravi Behara, Ankur Agarwal, Faiz Fatteh and Borko Furht, 2013
In this study, Mayo Clinic’s clinical Text Analysis and Knowledge Extraction System (cTAKES) is adopted to create one or more pipelines that process clinical notes and identify entities such as diseases and disorders, signs and symptoms, anatomical sites and procedures, and drugs. The Lexical Analyzer Layer parses all tokenized words from preprocessing. Assertion determines whether the text discussed relates to the patient, and Structured Data Analysis then develops a predictive statistical model that predicts readmission.
- Efficient Natural Language Pre-processing for Analyzing Large Data Sets. Belainine Billal, Alexsandro Fonseca and Fatiha Sadat, 2016
They collected content from Twitter (tweets), which is considered unstructured and highly noisy text. To extract information from such data, they used traditional NLP and machine learning techniques. Once the words are selected, they are clustered with their synsets in a graph. Tweet normalization rewrites words in a standard way so that they can be processed efficiently.
- Natural Language Processing Pipeline for Temporal Information Extraction and Classification from Free Text Eligibility Criteria. Gayathri Parthasarathy, Aspen Olmsted, Paul Anderson, 2016
This project creates a novel natural language processing pipeline for extraction and classification of temporal information. The pipeline uses pattern learning algorithms for extracting data. The initial step extracts the trained temporal patterns from the data set; these patterns are generated using the TEXer algorithm. The pipeline then uses the trained temporal patterns to detect temporal expressions in sentences. As a next step, labeled fragments are used to create a bag of words, to which a classifier is applied.
- The Effects of Features Selection Methods on Spam Review Detection Performance. Wael Etaiwi, Arafat Awajan, 2017
They used feature selection methods and classification algorithms to detect spam reviews. The feature selection method extracts features from the text of a review and scores the words within a review by their frequency of occurrence. Other features, such as lexical and syntactic characteristics, are also included to enhance detection performance. They used four classification algorithms: Naïve Bayes, Decision Tree, Support Vector Machine, and Random Forest.
References
- Priyanka V. Medhe, Dinesh D. Puri, “NLP Based Clinical Data Analysis for Assessing Readmissions of Patients with COPD”. International Conference Proceeding ICGTETM Dec 2017 | ISSN: 2320-2882.
- Christos C. Bellos, Athanasios Papadopoulos, Roberto Rosso, Dimitrios I. Fotiadi, “Identification of COPD Patients’ Health Status Using an Intelligent System in the CHRONIOUS Wearable Platform” IEEE Journal of Biomedical and Health Informatics, Vol. 18, No. 3, May 201.
- Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, Christopher G Chute, “Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation, and application” Application of information technology.
- Reena Duggal, Suren Shukla, Sarika Chandra, Balvinder Shukla, Sunil Kumar Khatri, “Predictive risk modeling for early hospital readmission of patients with diabetes in India” Int J Diabetes Dev Ctries.
- Shamli Deshmukh, Ritika Balani, Vijayalaxmi Rohane, Asmita Singh, “Sia: An Interactive Medical Assistant using Natural Language Processing” 2016 International Conference on Global Trends in Signal Processing, Information Computing, and Communication.
- Ravi Behara, Ankur Agarwal, Faiz Fatteh and Borko Furht, “Predicting Hospital Readmission Risk for COPD Using EHR Information” Handbook of Medical and Healthcare Technologies.
- Belainine Billal, Alexsandro Fonseca and Fatiha Sadat, “Efficient Natural Language Pre-processing for Analyzing Large Data Sets” 978-1-4673-9005-7/1
- Gayathri Parthasarathy, Aspen Olmsted, Paul Anderson, “Natural Language Processing Pipeline for Temporal Information Extraction and Classification from Free Text Eligibility Criteria” International Conference on Information Society (i-Society 2016).
- Wael Etaiwi, Arafat Awajan, “The Effects of Features Selection Methods on Spam Review Detection Performance” 2017 International Conference on New Trends in Computing Sciences.