Classifying Stars And Quasars


Abstract

One of the central problems in astrophysics is differentiating between celestial bodies such as stars and quasars which, although physically distinct, can be difficult to tell apart in photometric data. Scientists and researchers alike have a hard time separating the two classes using the collected SDSS catalog data, which comprises feature and classification data. Current methods have proved inefficient and not up to par with modern expectations, and matching the recorded findings against GALEX observations is a long and rather tedious process. The decision tree implemented here tackles this problem and achieves a satisfactory F1 score and accuracy.

PROBLEM STATEMENT

The Galaxy Evolution Explorer (GALEX) is a space telescope that was developed under the NASA Explorer program. It observed astronomical sources in the far-UV and near-UV wavebands. The Sloan Digital Sky Survey (SDSS) is an optical survey that observed large portions of the sky in the u, g, r, i, and z wavebands and obtained the spectra of the sources so that their redshifts could be determined as well.

This project attempts to classify photometric data collected from the Galaxy Evolution Explorer (GALEX) and the Sloan Digital Sky Survey (SDSS) over the North Galactic region and the Equatorial region into the spectroscopic classes of stars and quasars. The Decision Tree machine learning algorithm is used to distinguish between the two classes. Inferences about the data set and the results of the decision tree model are elucidated.

MACHINE LEARNING TECHNIQUES USED

Context

Machine learning uses algorithms and statistical models that allow computer systems to perform a specific task without explicit instructions, relying on patterns and inference instead. Machine learning techniques are broadly classified into two types, supervised and unsupervised:

  • Supervised learning trains a model on well-labeled data, meaning each example is already tagged with the correct answer. The model is then given a new set of examples (test data), and the algorithm uses what it learned from the training data to produce the correct output for the new data.
  • Unsupervised learning trains an algorithm on information that is neither classified nor labeled, letting the algorithm act on that information without guidance. Here the task of the computer is to group unsorted information according to similarities, patterns, and differences, without any previous training on the given data.
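To illustrate the contrast, here is a minimal sketch (using scikit-learn, with toy data invented for this example) that trains a supervised classifier on labeled points and then lets unsupervised k-means group the same points without any labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy 1-D feature: two well-separated groups of points.
X = np.array([[0.1], [0.2], [0.3], [5.1], [5.2], [5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available -> supervised setting

# Supervised: the tree learns the mapping from labeled examples,
# then predicts labels for unseen data.
clf = DecisionTreeClassifier().fit(X, y)
pred_supervised = clf.predict([[0.15], [5.25]])

# Unsupervised: k-means groups the same points with no labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pred_unsupervised = km.labels_
```

The classifier reproduces the given labels on new points, while k-means recovers the two groups on its own (with arbitrary cluster ids).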

Decision Tree Algorithm

A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes. It is one way to display an algorithm that consists only of conditional control statements. A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. heads or tails when flipping a coin), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after evaluating all attributes on the path). The paths from root to leaf represent classification rules.
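This flowchart structure can be made concrete with scikit-learn's export_text, which prints a fitted tree as exactly such a sequence of attribute tests and leaf class labels (the iris data set here is only a stand-in for illustration, not the paper's catalog):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each internal node is a "test" on a feature; each leaf line
# ("class: ...") is the decision reached along that path.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```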

Advantages:

  1. Compared to other models, decision trees require less effort for data preparation during pre-processing.
  2. A decision tree does not require normalization of the data.
  3. A decision tree does not require scaling of the data either.
  4. Missing values in the data also do not affect the process of building the tree to any meaningful extent.
  5. A decision tree model is very intuitive and easy to explain to technical teams and stakeholders alike.

Disadvantages:

  1. A minuscule change in the data can cause a large change in the structure of the decision tree, making the model unstable.
  2. For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
  3. Decision trees often involve a higher training time for the model.
  4. Decision tree training is generally more expensive because of the added complexity and time taken.
  5. The decision tree algorithm is not adequate for regression, i.e. for predicting continuous values.

Synthetic Minority Oversampling Technique

SMOTE stands for Synthetic Minority Oversampling Technique, a statistical technique for increasing the number of cases in a dataset in a balanced way. It works by creating new instances from the existing minority cases supplied as input; the number of majority cases is left unchanged. The new instances are not mere copies of existing minority cases. Instead, the algorithm samples the feature space of each minority case and its nearest neighbors, and creates new examples that combine features of the target case with features of its neighbors. This approach enlarges the region of feature space covered by each class and makes the samples more generic.

SMOTE takes the entire dataset as input but increases only the proportion of minority cases. For example, suppose an imbalanced dataset where just 1 percent of the cases have the target value A (the minority class) and 99 percent have the value B. To double the minority percentage, one would set the SMOTE percentage to 200.
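The core idea can be sketched in a few lines of numpy, assuming a hand-rolled interpolation rather than the reference implementation (the function name and the toy minority points below are invented for illustration):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: synthesize n_new minority samples by
    interpolating between a randomly chosen minority sample and one of
    its k nearest minority-class neighbours (illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()                 # random point on the segment
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# Four minority points at the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=4, rng=0)
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.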

DETAILED METHODOLOGY

Pre-processing the data

Catalog 3, which contains data from both regions (North Galactic Pole and Equatorial Region), was used, keeping only the samples that had fuv values. The columns ‘Galex objid’, ‘SDSS objid’, ‘Pred’, ‘class’, and ‘spectrometric redshift’ were dropped while training the model. As the data was imbalanced, SMOTE was used to balance the two classes.
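These pre-processing steps might be sketched with pandas as follows; the DataFrame contents are invented, and only the column names are taken from the text above:

```python
import pandas as pd

# Hypothetical slice of Catalog 3 (values are made up for illustration).
catalog3 = pd.DataFrame({
    "Galex objid": [1, 2], "SDSS objid": [10, 20],
    "fuv": [18.2, 19.1], "nuv": [18.9, 19.5],
    "spectrometric redshift": [0.01, 1.32],
    "Pred": [0, 1], "class": [0, 1],
})

# Keep only samples that actually have fuv measurements ...
catalog3 = catalog3.dropna(subset=["fuv"])

# ... and separate the label, dropping identifier and leakage columns
# before training.
drop_cols = ["Galex objid", "SDSS objid", "Pred", "spectrometric redshift"]
X = catalog3.drop(columns=drop_cols + ["class"])
y = catalog3["class"]
```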

Selecting a classification model

The Random Forest algorithm is highly efficient and is the first choice of many researchers for this problem, as its trees are more diverse and it handles over-fitting better; a decision tree is essentially a single member of a random forest. A decision tree classifier is a binary tree where predictions are made by traversing the tree from root to leaf: at each node, we go left if a feature is less than a threshold and right otherwise. Each leaf is associated with a class, which is the output of the predictor.
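The root-to-leaf traversal can be sketched as follows; the tuple-based node layout and the example stump are assumptions made for illustration, not the paper's implementation:

```python
# A node is either ("leaf", class_label) or
# ("node", feature_index, threshold, left_subtree, right_subtree).
def predict(node, x):
    """Walk from root to leaf: go left when the tested feature is
    below the node's threshold, right otherwise."""
    while node[0] != "leaf":
        _, feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node[1]

# Hypothetical stump: split on feature 0 at threshold 0.5.
tree = ("node", 0, 0.5, ("leaf", "star"), ("leaf", "quasar"))
```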

To divide the data at each node we use a metric called the Gini impurity (G), which describes how homogeneous or pure a node is. When G = 0 at a node, the node is pure, meaning that all its samples belong to the same class; a node with many samples from different classes has a Gini impurity closer to 1. More formally, the Gini impurity of n training samples split across k classes is defined as G = 1 - Σ_k p[k]^2, where p[k] is the fraction of samples belonging to class k.
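This definition translates directly into a few lines of numpy (the function name is invented for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p[k]^2 of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```

A pure node gives 0, and a 50/50 two-class node gives 0.5, the maximum for two classes.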

The training algorithm is a recursive algorithm called CART (Classification and Regression Trees). Each node is split so that the Gini impurity of its children is minimized. The key to the CART algorithm is finding the optimal feature and threshold such that the Gini impurity of the split is minimized. To do so, we try all possible splits and compute the resulting Gini impurities. This is done as follows:

  • iterate through the sorted feature values as possible thresholds,
  • keep track of the number of samples per class on the left and on the right of the threshold,
  • increment/decrement these counts by 1 after passing each threshold.

Indeed, if m is the size of the node and m[k] the number of samples of class k in the node, then p[k] = m[k]/m, and after seeing the i-th threshold there are i samples on the left and m - i on the right. The Gini impurities G_left and G_right of the two children can therefore be updated from the running counts, and the resulting Gini of the split is a simple weighted average: G = (i/m) G_left + ((m - i)/m) G_right.
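The scan above might be sketched as follows for a single numeric feature; `best_split` and its exact bookkeeping are illustrative, not the paper's code:

```python
import numpy as np

def best_split(x, y):
    """Scan sorted values of one feature, maintaining left/right class
    counts, and return the (threshold, gini) minimising the weighted
    Gini of the two children (sketch of the CART inner loop)."""
    m = len(y)
    classes, y_idx = np.unique(y, return_inverse=True)
    order = np.argsort(x)
    x_sorted, y_sorted = np.asarray(x)[order], y_idx[order]

    left = np.zeros(len(classes))          # class counts left of threshold
    right = np.bincount(y_sorted, minlength=len(classes)).astype(float)

    best_thr, best_g = None, np.inf
    for i in range(1, m):                  # i samples left, m - i right
        c = y_sorted[i - 1]
        left[c] += 1                       # move one sample across
        right[c] -= 1
        if x_sorted[i] == x_sorted[i - 1]:
            continue                       # cannot split between equal values
        g_left = 1.0 - np.sum((left / i) ** 2)
        g_right = 1.0 - np.sum((right / (m - i)) ** 2)
        g = (i / m) * g_left + ((m - i) / m) * g_right
        if g < best_g:
            best_g, best_thr = g, (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_thr, best_g

# Two cleanly separated classes: the best split has Gini 0.
thr, g = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```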

Training and Testing the Classification model

The dataset was split in a 70:30 train-test ratio using sklearn's train_test_split(). The accuracy of the model was compared with and without SMOTE. A score method was implemented in the DecisionTreeClassifier class to obtain the accuracy, and the model was trained over a range of values of the max depth parameter to find its most optimal value.
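The split-and-sweep procedure might look like the following sketch, assuming scikit-learn's stock DecisionTreeClassifier and synthetic stand-in data (the catalog itself is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in two-class data; the paper's catalog is not reproduced here.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)    # 70:30 split as in the text

# Sweep max_depth and keep the value with the best test accuracy.
scores = {d: DecisionTreeClassifier(max_depth=d, random_state=0)
               .fit(X_train, y_train)
               .score(X_test, y_test)
          for d in range(2, 12)}
best_depth = max(scores, key=scores.get)
```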

RESULT AND CONCLUSION

After training and testing the decision tree classifier, it was found that the optimal value of the max depth hyper-parameter was 8, and occasionally 9.

The following are the accuracies obtained with different sets of features, for a max depth of 8, without using SMOTE:

  • All features used - 0.95 accuracy
  • Extinction values dropped - 0.96 accuracy
  • Pair-wise differences dropped - 0.93 accuracy
  • Extinction and pair-wise differences dropped - 0.93 accuracy
