Abstract— Breast cancer is one of the most common diseases and is responsible for a high number of deaths every year. Although cancer is treatable and often curable in its earliest stages, a large number of patients are diagnosed very late. Data mining and classification are efficient ways to categorise data, particularly in medical fields, where these approaches are widely used to support diagnostic decisions. Mining extracts useful information from the large volumes of data stored in repositories. The present study focuses on implementing several data mining algorithms to find the relations between different attributes and to visualise them. The dataset is taken from the UCI Machine Learning Repository. Our main motivation is the capability of modern data mining algorithms to derive meaning from the given data. The prediction is based on the accuracy of each algorithm, i.e., the algorithm with the best accuracy is used for prediction.
I. Introduction
Many hazardous diseases exist today, and cancer is among the deadliest. Breast cancer predominantly affects women and can be linked to genetic abnormalities, some of them inherited. Breast tumours are of two types, malignant and benign. A benign tumour is non-cancerous and does not spread, whereas a malignant tumour is cancerous and is most treatable when detected early. An additional problem is that the disease can recur even after treatment is complete. As the number of cases rises worldwide, the amount of patient data available also grows, so we can use technology, namely data mining techniques, to help. These techniques can be used to predict the occurrence or recurrence of breast cancer, and numerous papers address recognition, prediction and clustering using principles such as association rule mining and classification rule mining, with algorithms such as C5.0, K-Nearest Neighbour, Support Vector Machine and Fuzzy C-Means. These techniques were applied to datasets such as SEER.
In this project, we take a dataset from the UCI Repository that contains a total of 570 records and 32 attributes. The diagnosis attribute states whether the patient's tumour is malignant, indicated in the dataset as ‘M’, or benign, indicated as ‘B’. Other attributes are values calculated for each cell nucleus, such as radius (the mean of the distances from the centre to points on the perimeter), texture (the standard deviation of grey-scale values), area and perimeter. In this experiment, we use algorithms such as Support Vector Machine (SVM), Decision Tree, Naïve Bayes (NB) and k-Nearest Neighbours (k-NN) to find the relationships between attributes.
We use boosting algorithms such as AdaBoost, Gradient Boost and XGBoost (Extreme Gradient Boosting), which combine many weak learners to improve prediction over a single conventional algorithm. After reviewing various algorithms, we decided to use the AdaBoost and Gradient Boost algorithms. We had already applied simpler algorithms such as KNN and SVM to the dataset while experimenting with them.
II. Related Work
In this section, we first review related work on breast cancer detection using data mining techniques. We then examine related work on breast cancer investigation.
In their paper, Dr. S. N. Singh and Shivani Thakral used four classifiers: J4.8, Simple CART, Naive Bayes and Bayesian Logistic Regression, on a dataset taken from WBCO [bc7 doc]. These yielded considerable accuracy but fell short of modern standards. The results show promise for many of those algorithms, and we extend the idea to boosting algorithms to test whether they perform better in practice.
Dona Sara Jacob, Rakhi Viswan, V Manju, L PadmaSuresh and Shine Raj published another paper, which argues that identifying the tumour at the first stage is the most critical strategy for saving lives. They took a different approach with a different dataset, WPBC, and used the WEKA GUI to obtain comparable statistical visuals of their approach [bc9 doc]. This inspired us to look in more than one direction for ways to address our problem efficiently.
Dr. K. Soma Sundaram’s research paper suggests that blood sample data could be used to predict breast cancer [bc8].
Our view of this approach was not very positive, as it takes time to sample the blood and to generate and format the results for the algorithm. This approach did, however, produce better accuracy than the other methods we reviewed, because it has a set of strong attributes that predict the result efficiently.
A research paper by Qi Fan, Chang-Jie Zhu and Liu Yin [be3doc] particularly caught our attention because it used decision tree algorithms, specifically C5.0, CHAID, C&RT and QUEST, for prediction on a SEER dataset. Their accuracy, however, was relatively low (65-75%), which caused us to reconsider the algorithms we initially wanted to use.
Umesh D R and Dr. B Ramachandra used a dataset obtained from Globocon that had a very large number of entries, which potentially contributed to their higher accuracy of 87%, as the model had a large amount of unbiased data to train on [bc6].
A group of scholars (D. Soria, J.M. Garibaldi, F. Ambrogi, P.J.G. Lisboa, P. Boracchi, E. Biganzoli) used five different algorithms: (i) Hierarchical (H), (ii) Fuzzy C-Means (FCM), (iii) K-means initialized with hierarchical clustering (method average), (iv) Partitioning Around Medoids (PAM), and (v) Adaptive Resonance Theory (ART). They applied these to a local dataset of 1076 patients from the Nottingham Tenovus Primary Breast Carcinoma Series who presented with primary operable invasive breast cancer between 1986 and 1998 [bc5]. The results varied widely, with PAM standing out as the most accurate.
Using Principal Component Analysis (PCA), a data mining technique, Sharaf Hussain, Naveen Zehra Quazilbash, Samita Bai and Shakeel Khoja from IBA presented a paper that selected the most relevant attributes from a large set of features [be2]. This helped us decide which features in our dataset were viable as inputs and which were not, further narrowing down our strategy and bringing us one step closer to developing our algorithm.
III. Proposed Work
Based on what we have learnt from these sources, we decided to add two modern algorithms to our list, having already tested simpler algorithms such as KNN and SVM.
These two algorithms are AdaBoost and Gradient Boost. Both are decision tree boosting algorithms. We saw earlier that some researchers obtained low accuracy with decision tree algorithms, so we plan to approach this problem with boosting algorithms, which require compute power but not much GPU power to give acceptable results.
We chose both of these algorithms for their features, which are detailed below.
AdaBoost
The general idea behind boosting methods is to train predictors sequentially, each trying to correct its predecessor. The two most commonly used boosting algorithms are AdaBoost and Gradient Boosting. This subsection covers AdaBoost. At a high level, AdaBoost is similar to Random Forest in that both tally up the predictions made by each decision tree in the forest to decide on the final classification. There are, however, some subtle differences. For instance, in AdaBoost the decision trees have a depth of 1 (i.e. 2 leaves). In addition, the predictions made by each decision tree have a varying impact on the final prediction made by the model.
In the first step of AdaBoost, each sample is associated with a weight that indicates how important it is to the classification. Initially, all the samples have identical weights.
Next, for each feature, we build a decision tree with a depth of 1. Then, we use every decision tree to classify the data. Afterwards, we compare the predictions made by each tree with the actual labels in the training set. The feature and corresponding tree that did the best job of classifying the training samples becomes the next tree in the forest. Once we have decided on a decision tree, we use the following formula to calculate the amount of say it has in the final classification.
Significance = (1/2) * log((1 - total error) / total error)
Where the total error is the sum of the weights of the incorrectly classified samples.
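The significance formula can be checked numerically with a small sketch. The function names `weighted_error` and `stump_significance` are illustrative, and the clamping constant `eps` is an assumption added here to avoid division by zero for a perfect stump; neither appears in the original description. The logarithm is taken to be the natural logarithm.

```python
import math


def weighted_error(weights, misclassified):
    """Total error: sum of the weights of the incorrectly classified samples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def stump_significance(total_error, eps=1e-10):
    """Amount of say: 0.5 * ln((1 - total_error) / total_error)."""
    # Assumption: clamp the error away from 0 and 1 so the ratio and
    # the logarithm stay finite for perfect or fully wrong stumps.
    total_error = min(max(total_error, eps), 1.0 - eps)
    return 0.5 * math.log((1.0 - total_error) / total_error)


# Four equally weighted samples, one of them misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [True, False, False, False]
error = weighted_error(weights, misclassified)
say = stump_significance(error)
```

A stump that is no better than chance (total error 0.5) receives a significance of zero, and a stump that is worse than chance receives a negative significance, so its vote is effectively reversed.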
We look at the samples that the current tree classified incorrectly and increase their associated weights using the following formula.
New sample weight = sample weight * e^(significance)
Then, we look at the samples that the tree classified correctly and decrease their associated weights using the following formula.
New sample weight = sample weight * e^(-significance)
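The two weight updates can be sketched together as follows. The function name `update_weights` is illustrative. The final normalisation step, which rescales the weights so they sum to 1, is not stated in the text above; it is the usual convention in AdaBoost and is included here as an assumption.

```python
import math


def update_weights(weights, misclassified, significance):
    """Raise the weights of misclassified samples and lower the rest."""
    updated = [
        w * math.exp(significance if wrong else -significance)
        for w, wrong in zip(weights, misclassified)
    ]
    # Assumption: renormalise so the new weights sum to 1.
    total = sum(updated)
    return [w / total for w in updated]


# Four equally weighted samples, the first one misclassified.
# A total error of 0.25 gives a significance of 0.5 * ln(3).
significance = 0.5 * math.log(3)
new_weights = update_weights(
    [0.25, 0.25, 0.25, 0.25],
    [True, False, False, False],
    significance,
)
```

With this normalisation, the misclassified samples together carry half of the total weight after the update, which is what forces the next stump to concentrate on them.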
Next, we create a new, empty dataset of the same size as the original. Imagine a roulette wheel where each pocket corresponds to a sample and is as wide as that sample's weight. We select numbers between 0 and 1 at random, and the pocket in which each number falls determines which sample we place in the new dataset.
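The roulette-style resampling can be sketched with cumulative weights. The function name `resample` and the optional `rng` parameter are illustrative choices, not part of the original description.

```python
import bisect
import itertools
import random


def resample(samples, weights, rng=None):
    """Draw len(samples) items with replacement, in proportion to their weights."""
    rng = rng or random.Random()
    # Each cumulative value marks where one pocket of the wheel ends.
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]
    new_dataset = []
    for _ in range(len(samples)):
        spin = rng.random() * total  # a random point on the wheel
        index = bisect.bisect_right(cumulative, spin)
        index = min(index, len(samples) - 1)  # guard against rounding at the edge
        new_dataset.append(samples[index])
    return new_dataset


# Sample "b" holds most of the weight, so it tends to be drawn most often.
new_data = resample(["a", "b", "c"], [0.1, 0.8, 0.1], random.Random(0))
```

Heavily weighted samples tend to appear several times in the new dataset, while lightly weighted ones may not appear at all. The standard library call `random.choices(samples, weights=weights, k=len(samples))` performs the same kind of weighted draw.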
Repeat the steps above, from building the depth-1 trees to resampling the dataset, until the number of iterations equals the number specified by the hyperparameter (i.e. the number of estimators).
Now use the forest of decision trees to make predictions on data outside of the training set.
The AdaBoost model makes predictions by having each tree in the forest classify the sample. Then, we split the trees into groups according to their decisions. For each group, we add up the significance of every tree inside the group. The final classification made by the forest as a whole is determined by the group with the largest sum.[link1]
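The weighted vote described above can be sketched in a few lines. The function name `weighted_vote` and the example significance values are illustrative, not results from this study.

```python
from collections import defaultdict


def weighted_vote(stump_outputs):
    """Return the label whose supporting stumps have the largest total significance.

    stump_outputs is a list of (predicted_label, significance) pairs,
    one pair per stump in the forest.
    """
    totals = defaultdict(float)
    for label, significance in stump_outputs:
        totals[label] += significance
    return max(totals, key=totals.get)


# Two stumps vote malignant ('M') with a combined say of 0.7,
# one stump votes benign ('B') with a say of 0.9.
label = weighted_vote([("M", 0.3), ("B", 0.9), ("M", 0.4)])
```

In this example the single benign vote wins, because its significance exceeds the combined significance of the two malignant votes. This is the main difference from a simple majority vote.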
Gradient Boost
Gradient Boosting is similar to AdaBoost in that they both use an ensemble of decision trees to predict a target label. However, unlike AdaBoost, the Gradient Boost trees have a depth larger than 1. In practice, you’ll typically see Gradient Boost being used with a maximum number of leaves of between 8 and 32.
When tackling regression problems, we start with a leaf that holds the average value of the variable we want to predict. This leaf is used as a baseline from which to approach the correct solution in the following steps.
For every sample, we calculate the residual with the following formula.
Residual = actual - predicted
Next, we build a tree with the goal of predicting the residuals. In other words, every leaf will contain a prediction as to the value of the residual (not the desired label).
Each sample passes through the decision nodes of the newly formed tree until it reaches a given leaf. The residual in that leaf is used to predict the value.
Adding the full predicted residual would let the model simply memorise the training values and report a very high accuracy after the first tree, so the correction should be scaled by a learning rate.
Hence a learning rate alpha, which lies between 0 and 1, is introduced.
Prediction = first guess (average) + (alpha * residual)
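A single update step can be worked through with toy numbers. The values below are made up for illustration. As a simplification, the sketch uses the true residuals directly, whereas the method described above uses the residuals predicted by a tree.

```python
def boosting_step(actual, predicted, learning_rate):
    """One step: compute residuals, then move predictions a fraction of the way."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    updated = [p + learning_rate * r for p, r in zip(predicted, residuals)]
    return residuals, updated


actual = [10.0, 20.0, 30.0]
first_guess = sum(actual) / len(actual)  # the average, 20.0
residuals, updated = boosting_step(actual, [first_guess] * len(actual), 0.1)
# residuals are -10, 0 and 10; predictions move to about 19, 20 and 21
```

With alpha = 0.1, each prediction moves only one tenth of the way towards its target, so many small steps are needed. That is the intended trade-off for not memorising the training data.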
We then keep computing the residuals and building new trees until the predictions match the actual values or a given bound on the number of trees is reached.
The final prediction is equal to the mean computed in the first step, plus the sum of the residuals predicted by all the trees in the forest, each multiplied by the learning rate. [link 2]
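The whole procedure can be sketched end to end for a regression problem with a single input feature. This is a minimal illustration under stated assumptions, not the implementation used in this study: it uses depth-1 trees instead of the 8 to 32 leaves mentioned earlier, and the names `fit_stump`, `fit_gradient_boost` and `predict` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the single threshold that best splits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant leaf
        mean = sum(residuals) / len(residuals)
        return xs[0], mean, mean
    return best[1], best[2], best[3]


def predict_stump(stump, x):
    """Return the residual stored in the leaf that x falls into."""
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def fit_gradient_boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    baseline = sum(ys) / len(ys)
    predictions = [baseline] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [
            p + learning_rate * predict_stump(stump, x)
            for p, x in zip(predictions, xs)
        ]
    return baseline, learning_rate, stumps


def predict(model, x):
    """Mean plus the learning-rate-scaled sum of every stump's predicted residual."""
    baseline, learning_rate, stumps = model
    return baseline + learning_rate * sum(predict_stump(s, x) for s in stumps)


# Toy data: low targets for small x, high targets for large x.
xs = [1, 2, 3, 4, 5, 6]
ys = [1.0, 1.2, 0.9, 5.0, 5.1, 4.9]
model = fit_gradient_boost(xs, ys, n_trees=100, learning_rate=0.1)
```

Calling `predict(model, 2)` then returns a value near the low group of targets and `predict(model, 5)` a value near the high group. For classification, as in the breast cancer task, the same loop is normally applied to a different loss function, which this sketch does not cover.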
Other Algorithms
We also used simpler algorithms such as KNN, Apriori and SVM to build predictive models, but we focused mainly on the boosting algorithms, as they are more recent, less explored in this area of data mining, and mostly used at the mid to higher levels of industry.
One of the main differences between AdaBoost and Gradient Boost is that AdaBoost gives its stumps unequal say in the final vote, whereas Gradient Boost scales every tree by the same learning rate. This has a slight impact on performance depending on the kind of data given as input.
IV. Conclusion
In this study we took different attributes and compared different algorithms, and we conclude that the Gradient Boosting algorithm performed best among the algorithms we tried.
References
- Dona Sara Jacob, Rakhi Viswan, V Manju, L PadmaSuresh, Shine Raj, “A Survey on Breast Cancer Prediction Using Data Mining Techniques”
- D. Soria, J.M. Garibaldi, F. Ambrogi, P.J.G. Lisboa, P. Boracchi, E. Biganzoli, “Clustering Breast Cancer Data by Consensus of Different Validity Indices”
- S. Muthuselvan, Dr. K. Soma Sundaram, Dr. Prabasheela, “Prediction of Breast Cancer Using Classification Rule Mining Techniques in Blood Test Datasets”
- Umesh D R, Dr. B Ramachandra, “Association Rule Mining Based Predicting Breast Cancer Recurrence on SEER Breast Cancer Data”
- Dr. S. N. Singh, Shivani Thakral, “Using Data Mining Tools for Breast Cancer Prediction and Analysis”