Deaths from prostate and breast cancer are expected to keep rising. These diseases are among the most common cancers in men and women worldwide. Machine learning can help reduce the number of deaths from these diseases through early detection, and one such application is the classification of prostate cancer and breast cancer data. The cancer data used here has a variety of features, but not all of them are essential. In this study, we use Support Vector Machine-Recursive Feature Elimination (SVM-RFE) as the feature selection method, which produces a ranked list of features. Applying this method to the classification of prostate cancer and breast cancer data gives strong evaluation results: an accuracy of 96.50%, a precision of 96.56%, and a recall of 96.50%.
Cancer is a disease caused by abnormal cell growth. These cells arise from changes in gene expression and then develop into a population of cells that can attack specific tissues. This is very dangerous because it can cause death. According to the Global Cancer (GLOBOCAN) statistics of the International Agency for Research on Cancer (IARC) for 2018, among 18.1 million cancer cases, prostate cancer was the second most common cancer in men, while breast cancer was the most common cancer in women. To date, no efficient way to treat cancer has been found.
In prostate cancer, there is an uncontrolled growth of cancer cells formed in prostate tissue. It is one of the most common cancers in men, and the number of cases is expected to keep increasing in many countries. In breast cancer, there is an uncontrolled growth of cancer cells formed in breast tissue. The growing cancer cells form lumps that can spread to other tissues within the body; such a lump is known as a malignant tumor. Cancer data has many features that carry information about the cancer itself. However, not all features are relevant. The benefits of feature selection in machine learning include reducing the amount of data needed for learning, improving predictive accuracy, making the data easier to understand, and reducing execution time.
In the field of health, many methods have been used to diagnose breast cancer and prostate cancer. In this study, we use computational techniques based on machine learning. The proposed method is Support Vector Machine-Recursive Feature Elimination (SVM-RFE). Feature selection and classification methods are expected to make a significant contribution to the health sector, especially in diagnosing prostate cancer and breast cancer. Previous studies on the classification of prostate and breast cancer have used various methods such as Convolutional Neural Networks, Logistic Regression, and Decision Trees.
Support Vector Machines
The basic idea of SVM is to construct an optimal hyperplane that separates the data into their classes. The optimal hyperplane is the separating hyperplane with the largest margin, that is, the largest distance to the closest patterns of each class, where patterns are the points that make up the dataset. Suppose there is a training dataset D = {(x_i, y_i)}, i = 1, …, N, with two classes, consisting of N input vectors x_1, …, x_N and class labels y_i (malignant cancer or benign cancer).
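As a minimal sketch of how a trained linear SVM assigns a class, the snippet below evaluates the decision rule sign(w·x + b) and the margin width 2/||w||. The weight vector `w`, the bias `b`, and the sample points are made-up values for illustration, not parameters from this study, and the mapping of +1 and -1 to malignant and benign is likewise an assumption.

```python
import math


def decision_value(w, b, x):
    """Signed score w·x + b; its sign gives the predicted class."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x lies."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the two margin hyperplanes w·x + b = +1 and -1."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))


# Hypothetical hyperplane and points (illustrative values only).
w, b = [3.0, 4.0], -5.0
print(predict(w, b, [2.0, 1.0]))  # 3*2 + 4*1 - 5 = 5, so +1
print(predict(w, b, [0.0, 0.5]))  # 0 + 2 - 5 = -3, so -1
print(margin_width(w))            # 2 / 5 = 0.4
```

A larger norm of `w` means a narrower margin, which is why SVM training looks for the separating hyperplane with the smallest possible norm.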
Support Vector Machines-Recursive Feature Elimination
SVM-RFE is a combination of Support Vector Machines and Recursive Feature Elimination (RFE). RFE works by recursively eliminating the features with the smallest ranking value. SVM-RFE removes irrelevant features in each iteration, namely the features with the lowest SVM weight. More than one feature can be removed per iteration to speed up the process.
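The elimination loop can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this study: the linear SVM is trained with a simple full-batch subgradient descent on the hinge loss, the ranking criterion is assumed to be the squared weight of each feature (as in Guyon et al.), and the toy data, the feature names, and the hyperparameters `lam`, `lr`, and `epochs` are all hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Fit a linear SVM by full-batch subgradient descent on the
    L2-regularised hinge loss. Labels must be +1 or -1."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # this point violates the margin
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def svm_rfe(X, y, names, step=1):
    """Return feature names ordered from least to most relevant."""
    remaining = list(range(len(names)))  # column indices still in play
    eliminated = []                      # filled in elimination order
    while remaining:
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        scores = [wj * wj for wj in w]   # assumed criterion: squared weight
        # Positions of the `step` lowest-scoring features in this round.
        worst = sorted(range(len(remaining)), key=lambda k: scores[k])[:step]
        eliminated.extend(remaining[k] for k in worst)
        remaining = [j for k, j in enumerate(remaining) if k not in worst]
    return [names[j] for j in eliminated]


# Toy data: "strong" separates the classes, "weak" is a scaled-down copy,
# and "constant" carries no information at all.
X = [[2.0, 0.2, 0.0], [1.0, 0.1, 0.0], [-2.0, -0.2, 0.0], [-1.0, -0.1, 0.0]]
y = [1, 1, -1, -1]
ranking = svm_rfe(X, y, ["strong", "weak", "constant"])
print(ranking)  # ['constant', 'weak', 'strong']
```

Setting `step` to a value above 1 removes several features per round, which trades ranking resolution for fewer SVM fits.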
Performance Evaluation of Model
A classification model maps data to predicted classes. There are four possible cases. If the data has a positive label and is classified as positive, it is a true positive (TP); if it is classified as negative, it is a false negative (FN). If the data has a negative label and is classified as negative, it is a true negative (TN); if it is classified as positive, it is a false positive (FP). From a classifier and a dataset, a 2 × 2 confusion matrix can be formed.
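The four cases can be counted directly from the true and predicted labels. The sketch below assumes a matrix layout with actual classes in rows and predicted classes in columns (conventions vary), and the label vectors are invented for illustration.

```python
def confusion_matrix(y_true, y_pred, positive=1):
    """Return [[TP, FN], [FP, TN]] for a binary classification problem."""
    tp = fn = fp = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1   # positive label, predicted positive
            else:
                fn += 1   # positive label, predicted negative
        else:
            if pred == positive:
                fp += 1   # negative label, predicted positive
            else:
                tn += 1   # negative label, predicted negative
    return [[tp, fn], [fp, tn]]


# Hypothetical labels: 1 = positive class, 0 = negative class.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
matrix = confusion_matrix(y_true, y_pred)
print(matrix)  # [[3, 1], [1, 3]]
```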
A classification report is then computed, which gives the following measures. Precision is the share of samples predicted as positive that are truly positive. Recall is the share of truly positive samples that the model captures and labels as positive. The F1 score is the harmonic mean of the model's precision and recall.
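Given the counts of a confusion matrix, these measures follow from their standard definitions. The sketch below uses hypothetical counts, not results from this study, and returns 0.0 when a denominator is zero as a simplifying assumption.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of actual positives that the model labels as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall."""
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts for illustration.
tp, fp, fn = 8, 2, 4
print(round(precision(tp, fp), 4))     # 8 / 10 = 0.8
print(round(recall(tp, fn), 4))        # 8 / 12 = 0.6667
print(round(f1_score(tp, fp, fn), 4))  # 16 / 22 = 0.7273
```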
Experiments and Results
The prostate cancer and breast cancer data were obtained from the Kaggle website. The prostate cancer data contains 100 observations, of which 62% are malignant and 38% are benign. The breast cancer data consists of 569 observations, of which 212 are malignant and 357 are benign. The features of each dataset are those listed in the feature rankings below.
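To put the two datasets on the same footing, the breast cancer counts can be converted to percentages. The short sketch below does only this arithmetic; the helper name `class_shares` is hypothetical.

```python
def class_shares(counts):
    """Convert per-class counts into percentages of the total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


breast = {"malignant": 212, "benign": 357}  # counts stated in the text
shares = class_shares(breast)
print({label: round(pct, 1) for label, pct in shares.items()})
# {'malignant': 37.3, 'benign': 62.7}
```

The class balance is therefore roughly reversed between the two datasets: malignant cases are the majority in the prostate data (62%) and the minority in the breast data (about 37%).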
This section covers the results and analysis of the classification of prostate and breast cancer with SVM-RFE. The ranking scores obtained using Equation (7) for the feature selection of prostate cancer are listed below in increasing order of feature weight: ['fractal_dimension', 'smoothness', 'compactness', 'symmetry', 'radius', 'texture', 'perimeter', 'area']
The feature with the highest weight is area, with a weight of 23992022.23703918, while the feature with the lowest weight is fractal_dimension, with a weight of only 1.5849710602904137.
The ranking scores obtained using Equation (7) for the feature selection of breast cancer are listed below in increasing order of feature weight: ['fractal dimension error', 'smoothness error', 'concave points error', 'mean fractal dimension', 'symmetry error', 'mean smoothness', 'compactness error', 'concavity error', 'worst fractal dimension', 'radius error', 'worst smoothness', 'mean symmetry', 'mean concave points', 'mean compactness', 'worst symmetry', 'worst concave points', 'mean concavity', 'texture error', 'worst compactness', 'worst concavity', 'perimeter error', 'mean radius', 'worst radius', 'mean texture', 'mean perimeter', 'worst texture', 'worst perimeter', 'area error', 'mean area', 'worst area']
The feature with the highest weight is worst area, with a weight of 147512379.03601986, while the feature with the lowest weight is fractal dimension error, with a weight of only 0.14560705396198254.
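Because the rankings are given in increasing order of weight, keeping the highest-rated features amounts to taking the last entries of the list. The sketch below applies this to the prostate ranking, keeping 2 of the 8 features as in this study. It assumes that selection simply keeps the top-ranked entries, and the helper name `top_k` is hypothetical.

```python
# Prostate cancer ranking from the text, in increasing order of weight.
prostate_ranking = [
    "fractal_dimension", "smoothness", "compactness", "symmetry",
    "radius", "texture", "perimeter", "area",
]


def top_k(ranking_increasing, k):
    """Keep the k highest-weighted features of an increasing ranking."""
    return ranking_increasing[-k:] if k > 0 else []


selected = top_k(prostate_ranking, 2)
print(selected)  # ['perimeter', 'area']
```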
We classified breast and prostate cancer data by selecting feature subsets with Support Vector Machine-Recursive Feature Elimination. For the breast cancer data, the 8 features with the highest SVM weights were selected out of 30; for the prostate cancer data, the 2 features with the highest SVM weights were selected out of 8. In the SVM-RFE experiment, worst area had the highest score for breast cancer, while area had the highest score for prostate cancer. The model achieved an accuracy of 96.50%, a precision of 96.56%, and a recall of 96.50%. In future work, SVM-RFE should be optimized to make the feature selection process more consistent.
- National Cancer Institute (NCI). What is Cancer? https://www.cancer.gov/about-cancer/understanding/what-is-cancer.
- IARC Global Cancer Observatory. 2018.
- Jakkula, Vikramaditya. ‘Tutorial on support vector machine (svm).’ School of EECS, Washington State University 37 (2006).
- Learning: Support Vector Machines https://www.youtube.com/watch?v=_PwhiWxHK8o&t=25s
- Qifeng Zhou, Wencai Hong, Guifang Shao and Weiyou Cai, ‘A new SVM-RFE approach towards ranking problem,’ 2009 IEEE International Conference on Intelligent Computing and Intelligent Systems, Shanghai, 2009, pp. 270-273.
- Bustamam, Alhadi & Bachtiar, Anas & Sarwinda, Devvi. (2019). “Selecting Features Subsets Based on Support Vector Machine-Recursive Features Elimination and One Dimensional-Naïve Bayes Classifier using Support Vector Machines for Classification of Prostate and Breast Cancer”. Procedia Computer Science. 157. 450-458. 10.1016/j.procs.2019.08.238.
- Guyon, I., Weston, J., Barnhill, S., Vapnik, V. (2002) “Gene selection for cancer classification using support vector machines.” Mach. Learn 46: 389–422.
- A. Adorada, R. Permatasari, P. W. Wirawan, A. Wibowo and A. Sujiwo, ‘Support Vector Machine – Recursive Feature Elimination (SVM – RFE) for Selection of MicroRNA Expression Features of Breast Cancer,’ 2018 2nd International Conference on Informatics and Computational Sciences (ICICoS), Semarang, Indonesia, 2018, pp. 1-4.
- P. A. Mundra and J. C. Rajapakse, ‘SVM-RFE With MRMR Filter for Gene Selection,’ in IEEE Transactions on NanoBioscience, vol. 9, no. 1, pp. 31-37, March 2010.