Classifying Stars And Quasars


Abstract

One of the central problems in astrophysics is differentiating between celestial bodies such as stars and quasars which, although physically distinct, can be difficult to tell apart in photometric data. Scientists and researchers alike have a hard time separating the two classes using the collected SDSS catalog data, which comprises feature and classification data. Current methods have proved inefficient and not up to par with modern expectations, and matching the recorded findings against GALEX observations is a long and rather tedious process. The decision tree implemented here tackles this problem and achieves a satisfactory F1 score and accuracy.

PROBLEM STATEMENT

The Galaxy Evolution Explorer (GALEX) is a space telescope that was developed under the NASA Explorer program. It observed astronomical sources in the far-UV and near-UV wavebands. The Sloan Digital Sky Survey (SDSS) is an optical survey that observed large portions of the sky in the u, g, r, i, and z wavebands and obtained the spectra of the sources so that their redshifts could be determined as well.

This project attempts to classify photometric data collected from the Galaxy Evolution Explorer (GALEX) and the Sloan Digital Sky Survey (SDSS) over the North Galactic region and the Equatorial region into the spectroscopic classes of stars and quasars. The Decision Tree machine learning algorithm is used to distinguish between the two classes. Inferences about the data set and the results of the decision tree model are elucidated.

MACHINE LEARNING TECHNIQUES USED

Context

Machine learning uses algorithms and statistical models that allow computer systems to perform a specific task without explicit instructions, relying on patterns and inference instead. Machine learning techniques are broadly classified into two types, supervised and unsupervised:

  • Supervised learning trains a model on well-labeled data, meaning each example is already tagged with the correct answer. The model is then given a new set of examples (test data), and the algorithm uses what it learned from the training data to produce the correct output for the new data.
  • Unsupervised learning trains an algorithm on information that is neither classified nor labeled, letting the algorithm act on that information without guidance. Here the task of the computer is to group unsorted information according to similarities, patterns, and differences, without any previous training on the given data.
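To illustrate the contrast, here is a minimal sketch (using scikit-learn, with toy data invented for this example) that trains a supervised classifier on labeled points and then lets unsupervised k-means group the same points without any labels:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.cluster import KMeans

# Toy 1-D feature: two well-separated groups of points.
X = np.array([[0.1], [0.2], [0.3], [5.1], [5.2], [5.3]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels available -> supervised setting

# Supervised: the tree learns the mapping from labeled examples,
# then predicts labels for unseen data.
clf = DecisionTreeClassifier().fit(X, y)
pred_supervised = clf.predict([[0.15], [5.25]])

# Unsupervised: k-means groups the same points with no labels at all.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
pred_unsupervised = km.labels_
```

The classifier reproduces the given labels on new points, while k-means recovers the two groups on its own (with arbitrary cluster ids).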

Decision Tree Algorithm

A decision tree is a decision-support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes. It is one way to display an algorithm that consists only of conditional control statements. A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. heads or tails when flipping a coin), each branch represents the outcome of the test, and each leaf node represents a class label (the decision taken after evaluating all attributes on the path). The paths from root to leaf represent classification rules.
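This flowchart structure can be made concrete with scikit-learn's export_text, which prints a fitted tree as exactly such a sequence of attribute tests and leaf class labels (the iris data set here is only a stand-in for illustration, not the paper's catalog):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# Each internal node is a "test" on a feature; each leaf line
# ("class: ...") is the decision reached along that path.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```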

Advantages:

  1. Compared to other models, decision trees require less effort for data preparation during pre-processing.
  2. A decision tree does not require normalization of the data.
  3. A decision tree does not require scaling of the data either.
  4. Missing values in the data also do not affect the process of building the tree to any meaningful extent.
  5. A decision tree model is very intuitive and easy to explain to technical teams and stakeholders alike.

Disadvantages:

  1. A minuscule change in the data can cause a large change in the structure of the decision tree, making the model unstable.
  2. For a decision tree, the calculations can sometimes become far more complex than for other algorithms.
  3. Decision trees often involve a higher training time for the model.
  4. Decision tree training is generally more expensive because of the added complexity and time taken.
  5. The decision tree algorithm is not adequate for regression, i.e. for predicting continuous values.

Synthetic Minority Oversampling Technique

SMOTE stands for Synthetic Minority Oversampling Technique, a statistical technique for increasing the number of cases in a dataset in a balanced way. It works by creating new instances from the existing minority cases supplied as input; the number of majority cases is left unchanged. The new instances are not mere copies of existing minority cases. Instead, the algorithm samples the feature space of each minority case and its nearest neighbors, and creates new examples that combine features of the target case with features of its neighbors. This approach enlarges the region of feature space covered by each class and makes the samples more generic.

SMOTE takes the entire dataset as input but increases only the proportion of minority cases. For example, suppose an imbalanced dataset where just 1 percent of the cases have the target value A (the minority class) and 99 percent have the value B. To double the minority percentage, one would set the SMOTE percentage to 200.
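The core idea can be sketched in a few lines of numpy, assuming a hand-rolled interpolation rather than the reference implementation (the function name and the toy minority points below are invented for illustration):

```python
import numpy as np

def smote_sketch(X_min, n_new, k=3, rng=None):
    """Minimal SMOTE sketch: synthesize n_new minority samples by
    interpolating between a randomly chosen minority sample and one of
    its k nearest minority-class neighbours (illustrative only)."""
    rng = np.random.default_rng(rng)
    X_min = np.asarray(X_min, dtype=float)
    new = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # distances from sample i to every other minority sample
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        d[i] = np.inf                      # exclude the sample itself
        neighbours = np.argsort(d)[:k]
        j = rng.choice(neighbours)
        gap = rng.random()                 # random point on the segment
        new.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(new)

# Four minority points at the corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=4, rng=0)
```

Because each synthetic point lies on a segment between two existing minority points, it stays inside the region the minority class already occupies.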

DETAILED METHODOLOGY

Pre-processing the data

Catalog 3, which contains data from both regions (North Galactic Pole and Equatorial Region), was used, keeping only the samples that had fuv values. The columns ‘Galex objid’, ‘SDSS objid’, ‘Pred’, ‘class’, and ‘spectrometric redshift’ were dropped while training the model. As the data was imbalanced, SMOTE was used to balance the two classes.
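These pre-processing steps might be sketched with pandas as follows; the DataFrame contents are invented, and only the column names are taken from the text above:

```python
import pandas as pd

# Hypothetical slice of Catalog 3 (values are made up for illustration).
catalog3 = pd.DataFrame({
    "Galex objid": [1, 2], "SDSS objid": [10, 20],
    "fuv": [18.2, 19.1], "nuv": [18.9, 19.5],
    "spectrometric redshift": [0.01, 1.32],
    "Pred": [0, 1], "class": [0, 1],
})

# Keep only samples that actually have fuv measurements ...
catalog3 = catalog3.dropna(subset=["fuv"])

# ... and separate the label, dropping identifier and leakage columns
# before training.
drop_cols = ["Galex objid", "SDSS objid", "Pred", "spectrometric redshift"]
X = catalog3.drop(columns=drop_cols + ["class"])
y = catalog3["class"]
```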

Selecting a classification model

The Random Forest algorithm is highly efficient and is the first choice of many researchers for this problem, as its trees are more diverse and it handles over-fitting better; a decision tree is essentially a single member of a random forest. A decision tree classifier is a binary tree where predictions are made by traversing the tree from root to leaf: at each node, we go left if a feature is less than a threshold and right otherwise. Each leaf is associated with a class, which is the output of the predictor.
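The root-to-leaf traversal can be sketched as follows; the tuple-based node layout and the example stump are assumptions made for illustration, not the paper's implementation:

```python
# A node is either ("leaf", class_label) or
# ("node", feature_index, threshold, left_subtree, right_subtree).
def predict(node, x):
    """Walk from root to leaf: go left when the tested feature is
    below the node's threshold, right otherwise."""
    while node[0] != "leaf":
        _, feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node[1]

# Hypothetical stump: split on feature 0 at threshold 0.5.
tree = ("node", 0, 0.5, ("leaf", "star"), ("leaf", "quasar"))
```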

To divide the data at each node we use a metric called the Gini impurity (G), which describes how homogeneous or pure a node is. When G = 0 at a node, the node is pure, meaning that all its samples belong to the same class; a node with many samples from different classes has a Gini impurity closer to 1. More formally, the Gini impurity of n training samples split across k classes is defined as G = 1 - Σ_k p[k]^2, where p[k] is the fraction of samples belonging to class k.
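This definition translates directly into a few lines of numpy (the function name is invented for illustration):

```python
import numpy as np

def gini(labels):
    """Gini impurity G = 1 - sum_k p[k]^2 of a node's class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)
```

A pure node gives 0, and a 50/50 two-class node gives 0.5, the maximum for two classes.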

The training algorithm is a recursive algorithm called CART (Classification and Regression Trees). Each node is split so that the Gini impurity of its children is minimized. The key to the CART algorithm is finding the optimal feature and threshold such that the Gini impurity of the split is minimized. To do so, we try all possible splits and compute the resulting Gini impurities. This is done as follows:

  • iterate through the sorted feature values as possible thresholds,
  • keep track of the number of samples per class on the left and on the right of the threshold,
  • increment/decrement these counts by 1 after passing each threshold.

Indeed, if m is the size of the node and m[k] the number of samples of class k in the node, then p[k] = m[k]/m, and after seeing the i-th threshold there are i samples on the left and m - i on the right. The Gini impurities G_left and G_right of the two children can therefore be updated from the running counts, and the resulting Gini of the split is a simple weighted average: G = (i/m) G_left + ((m - i)/m) G_right.
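The scan above might be sketched as follows for a single numeric feature; `best_split` and its exact bookkeeping are illustrative, not the paper's code:

```python
import numpy as np

def best_split(x, y):
    """Scan sorted values of one feature, maintaining left/right class
    counts, and return the (threshold, gini) minimising the weighted
    Gini of the two children (sketch of the CART inner loop)."""
    m = len(y)
    classes, y_idx = np.unique(y, return_inverse=True)
    order = np.argsort(x)
    x_sorted, y_sorted = np.asarray(x)[order], y_idx[order]

    left = np.zeros(len(classes))          # class counts left of threshold
    right = np.bincount(y_sorted, minlength=len(classes)).astype(float)

    best_thr, best_g = None, np.inf
    for i in range(1, m):                  # i samples left, m - i right
        c = y_sorted[i - 1]
        left[c] += 1                       # move one sample across
        right[c] -= 1
        if x_sorted[i] == x_sorted[i - 1]:
            continue                       # cannot split between equal values
        g_left = 1.0 - np.sum((left / i) ** 2)
        g_right = 1.0 - np.sum((right / (m - i)) ** 2)
        g = (i / m) * g_left + ((m - i) / m) * g_right
        if g < best_g:
            best_g, best_thr = g, (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_thr, best_g

# Two cleanly separated classes: the best split has Gini 0.
thr, g = best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1])
```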

Training and Testing the Classification model

The dataset was split in a 70:30 train-test ratio using sklearn's train_test_split(). The accuracy of the model was compared with and without SMOTE. A score method was implemented in the DecisionTreeClassifier class to obtain the accuracy, and the model was trained over a range of values of the max depth parameter to find its most optimal value.
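The split-and-sweep procedure might look like the following sketch, assuming scikit-learn's stock DecisionTreeClassifier and synthetic stand-in data (the catalog itself is not reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in two-class data; the paper's catalog is not reproduced here.
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)    # 70:30 split as in the text

# Sweep max_depth and keep the value with the best test accuracy.
scores = {d: DecisionTreeClassifier(max_depth=d, random_state=0)
               .fit(X_train, y_train)
               .score(X_test, y_test)
          for d in range(2, 12)}
best_depth = max(scores, key=scores.get)
```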

RESULT AND CONCLUSION

After training and testing the decision tree classifier, it was found that the optimal value of the max depth hyper-parameter was 8, and occasionally 9.

The following are the accuracies obtained with different sets of features, for a max depth of 8, without using SMOTE:

  • All features used - 0.95 accuracy
  • Extinction values dropped - 0.96 accuracy
  • Pair-wise differences dropped - 0.93 accuracy
  • Extinction and pair-wise differences dropped - 0.93 accuracy
