Exoplanet Hunting Using Machine Learning
Search for exoplanets — those planets beyond our own solar system — using machine learning, and implement these searches in Python.
Our Solar System formed around 4600 million years ago. We know this from the study of meteorites and radioactivity. It all began with a cloud of gas and dust. A nearby supernova explosion probably perturbed the calm cloud, which then started to contract due to gravity, forming a flat, rotating disk with most of the material concentrated in the center: the protosun. Later, gravity pulled the rest of the material into clumps and rounded some of them, forming the planets and dwarf planets. The leftovers resulted in comets, asteroids, and meteoroids.
But what are Exoplanets?
Exoplanets are planets beyond our own solar system. Thousands have been discovered in the past two decades, mostly with NASA’s Kepler Space Telescope.
These exoplanets come in a huge variety of sizes and orbits. Some are gigantic planets hugging close to their parent stars; others are icy, some rocky. NASA and other agencies are looking for a special kind of planet: one that’s the same size as Earth, orbiting a sun-like star in the habitable zone.
The habitable zone is the area around a star where it is not too hot and not too cold for liquid water to exist on the surface of surrounding planets. Imagine if Earth was where Pluto is. The Sun would be barely visible (about the size of a pea) and Earth’s ocean and much of its atmosphere would freeze.
Why even search for exoplanets?
There are about 100,000,000,000 stars in our Galaxy, the Milky Way. How many exoplanets — planets outside of the Solar System — do we expect to exist? Why are some stars surrounded by planets? How diverse are planetary systems? Does this diversity tell us something about the process of planet formation? These are some of the many questions that motivate the study of exoplanets. Some exoplanets may have the necessary physical conditions (amount and quality of light from the star, temperature, atmospheric composition) for the existence of complex organic chemistry and perhaps for the development of Life (which may be quite different from Life on Earth).
However, detecting exoplanets is no simple task. We may have imagined life on other planets in books and film for centuries, but detecting actual exoplanets is a recent phenomenon. Planets on their own emit very little if any light. We can only see Jupiter or Venus in the night sky because they reflect the sun’s light. If we were to look at an exoplanet (the nearest one is over 4 light-years away), it would be very close to a brilliantly lit star, making the planet extremely hard to see directly.
Scientists found an efficient way around this problem: planets themselves do not emit light, but the stars they orbit do. Taking this into account, astronomers use the transit method, in which a digital-camera-like technology detects and measures tiny dips in a star’s brightness as a planet crosses in front of the star. With observations of transiting planets, astronomers can calculate the ratio of a planet’s radius to that of its star (essentially the size of the planet’s shadow), and with that ratio, they can calculate the planet’s size.
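To a first approximation (ignoring limb darkening and grazing transits), the fractional dip in brightness equals the ratio of the two disk areas, so depth ≈ (planet radius / star radius)². The sketch below illustrates that relation; the function names are illustrative and the radii are rounded reference values, not numbers from the dataset.

```python
import math

# Rounded reference radii in kilometres (approximate values).
SUN_RADIUS_KM = 696_000.0
JUPITER_RADIUS_KM = 71_500.0
EARTH_RADIUS_KM = 6_371.0


def transit_depth(planet_radius_km, star_radius_km):
    """Fractional drop in brightness while the planet is in front of the star."""
    return (planet_radius_km / star_radius_km) ** 2


def planet_radius_from_depth(depth, star_radius_km):
    """Invert the relation: recover the planet radius from a measured depth."""
    return star_radius_km * math.sqrt(depth)


jupiter_depth = transit_depth(JUPITER_RADIUS_KM, SUN_RADIUS_KM)
earth_depth = transit_depth(EARTH_RADIUS_KM, SUN_RADIUS_KM)

print(f"Jupiter-size planet, Sun-like star: {jupiter_depth:.4%} dip")
print(f"Earth-size planet, Sun-like star:   {earth_depth:.4%} dip")
```

A Jupiter-size planet dims a Sun-like star by roughly 1%, while an Earth-size planet dims it by less than 0.01%, which is why small planets are so much harder to find.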
Kepler Space Telescope’s primary method of searching for planets was the “Transit” method.
Transit method: In the diagram below, a star is orbited by a planet. The graph shows that, from our position, the starlight intensity drops while the star is partially obscured by the planet. The starlight rises back to its original value once the planet has finished crossing in front of the star.
Until just a few years ago, astronomers had confirmed fewer than a thousand exoplanets. Then came the Kepler mission, and the number of exoplanets exploded. The Kepler mission sadly ended in 2018, but the Transiting Exoplanet Survey Satellite (TESS) mission has taken its place and is regularly finding new exoplanets in the night sky. TESS is monitoring the brightness of stars for periodic drops caused by planet transits. The TESS mission is finding planets ranging from small, rocky worlds to giant planets, showcasing the diversity of planets in the galaxy.
I wanted to see if I could look at the available exoplanet data and make predictions about which planets might be hospitable to life. The data made publicly available by NASA is beautiful in that it contains many useful features. The goal is to create a model that can predict the existence of an exoplanet, utilizing the flux (light intensity) readings recorded over time for thousands of stars. Each row of the table is one star, and the training table has 3198 columns (the label plus the flux readings).
The dataset can be downloaded from here.
Let us start by importing all the libraries:
Load the train and test data.
Now the target column LABEL consists of two categories: 1 (does not represent an exoplanet) and 2 (represents the presence of an exoplanet). So, convert them to binary values for easier processing of the data.
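The loading and label-conversion steps can be sketched as follows. This is a minimal example that reads a tiny in-memory table instead of the real files; the column names (LABEL, FLUX.1, ...) are an assumed layout, the flux values are made up for illustration, and load_flux_csv is a hypothetical helper name.

```python
import io

import pandas as pd

# Made-up stand-in for the real CSV; assumed layout: LABEL, FLUX.1, FLUX.2, ...
csv_text = """LABEL,FLUX.1,FLUX.2,FLUX.3
2,93.85,83.81,20.10
1,-38.88,-33.83,-58.54
1,532.64,535.92,513.73
"""


def load_flux_csv(source):
    """Read a flux table and map LABEL from {1, 2} to {0, 1}."""
    df = pd.read_csv(source)
    # 1 (no exoplanet) -> 0, 2 (exoplanet) -> 1
    df["LABEL"] = df["LABEL"].map({1: 0, 2: 1})
    return df


train_data = load_flux_csv(io.StringIO(csv_text))
print(train_data)
```

With the real dataset, the same helper would be called with a file path in place of the in-memory buffer.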
Before moving forward let us also reduce the amount of memory used by both test and train data frames.
This step is for memory optimization and has reduced the memory usage of the test_data data frame by 55.1%; you can do the same for the train_data data frame.
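One common way to get this kind of saving is to store each numeric column in a smaller dtype. The sketch below is an assumption about how such a step could look, not the article's exact code: it casts float columns to float32 (which loses some precision) and downcasts integer columns to the smallest integer type that holds their values. The helper name reduce_memory and the random demo table are illustrative.

```python
import numpy as np
import pandas as pd


def reduce_memory(df):
    """Return a copy of df with numeric columns stored in smaller dtypes."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            # Smallest integer type that can represent every value.
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            # float64 -> float32 halves the storage at reduced precision.
            out[col] = out[col].astype(np.float32)
    return out


# Random demo table standing in for the flux data.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.normal(size=(100, 50)),
    columns=[f"FLUX.{i + 1}" for i in range(50)],
)
demo.insert(0, "LABEL", rng.integers(0, 2, size=100))

before = demo.memory_usage(deep=True).sum()
small = reduce_memory(demo)
after = small.memory_usage(deep=True).sum()
saving_pct = 100 * (before - after) / before
print(f"memory reduced by {saving_pct:.1f}%")
```

The saving on the real data depends on the original dtypes, so it will not necessarily match the figure quoted above.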
Now visualize the target column in the train_dataset and get an idea about the class distribution.
It turns out that the data is highly imbalanced. So first let us start with data preprocessing techniques.
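The imbalance can also be quantified without a plot by counting the labels. The label counts below are illustrative only; with the real data they would come from the LABEL column of the training data frame.

```python
import pandas as pd

# Illustrative labels: 95 stars without an exoplanet, 5 with one.
labels = pd.Series([0] * 95 + [1] * 5, name="LABEL")

counts = labels.value_counts()                # absolute count per class
share = labels.value_counts(normalize=True)   # fraction per class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(share.round(3))
print(f"majority/minority ratio: {imbalance_ratio:.1f}")
```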
Let us plot the first 4 rows of the train data and observe the intensity of flux values.
Well, our data is clean but is not normalized. Let us plot the Gaussian histogram of non-exoplanets data.
Now plot Gaussian histogram of the data when exoplanets are present.
So let us first split our dataset and normalize it.
Data Normalization is a technique often applied as part of data preparation for machine learning. The goal of normalization is to change the values of numeric columns in the dataset to a common scale, without distorting differences in the ranges of values.
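As one possible way to do this, scikit-learn's normalize function rescales each row (here, each star's light curve) to unit length. The sketch below uses random stand-in data; whether row-wise L2 normalization is the best choice for the real flux data is an assumption.

```python
import numpy as np
from sklearn.preprocessing import normalize

# Stand-in data: 6 stars, 20 flux readings each.
rng = np.random.default_rng(42)
x_train = rng.normal(loc=100.0, scale=50.0, size=(6, 20))

# Defaults are norm="l2" and axis=1, so every row is scaled independently.
x_train_norm = normalize(x_train)

row_norms = np.linalg.norm(x_train_norm, axis=1)
print(row_norms)  # each row now has Euclidean length 1
```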
The next step is to apply Gaussian filters to both the test and train data to smooth the flux signals.
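A Gaussian filter replaces each reading with a weighted average of its neighbors, which suppresses high-frequency noise. The sketch below smooths each row along the time axis with SciPy's gaussian_filter1d; the choice of sigma=10, the per-row filtering, and the synthetic sine-plus-noise signals are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

# Three noisy copies of the same slow signal, standing in for light curves.
rng = np.random.default_rng(1)
t = np.linspace(0, 4 * np.pi, 500)
clean = np.sin(t)
noisy = np.vstack([clean + rng.normal(scale=0.5, size=t.size) for _ in range(3)])

# Smooth along axis=1 so each star's time series is filtered on its own.
smoothed = gaussian_filter1d(noisy, sigma=10, axis=1)

error_before = np.mean((noisy - clean) ** 2)
error_after = np.mean((smoothed - clean) ** 2)
print(f"mean squared error before: {error_before:.4f}, after: {error_after:.4f}")
```

A larger sigma removes more noise but also flattens short, shallow dips, so the value is worth tuning against the transit durations of interest.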
In probability theory, the normal (or Gaussian or Gauss or Laplace–Gauss) distribution is a very common continuous probability distribution. Normal distributions are important in statistics and are often used in the natural and social sciences to represent real-valued random variables whose distributions are not known.
The normal distribution is useful because of the central limit theorem. In its most general form, under some conditions (which include finite variance), it states that averages of samples of observations of random variables independently drawn from the same distribution converge in distribution to the normal, that is, they become normally distributed when the number of observations is sufficiently large. Physical quantities that are expected to be the sum of many independent processes often have distributions that are nearly normal.
We use feature scaling so that all the values remain in a comparable range.
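A typical way to scale features is scikit-learn's StandardScaler, which shifts each column to zero mean and unit variance. The sketch below, on random stand-in data, fits the scaler on the training set only and reuses its statistics for the test set; the use of StandardScaler here is an assumption about the scaling method.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
x_train = rng.normal(loc=5.0, scale=3.0, size=(200, 10))
x_test = rng.normal(loc=5.0, scale=3.0, size=(50, 10))

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)  # learn mean and std from train only
x_test_scaled = scaler.transform(x_test)        # apply the same statistics to test

print(x_train_scaled.mean(axis=0).round(6))
print(x_train_scaled.std(axis=0).round(6))
```

Fitting on the training data alone keeps information about the test set from leaking into the model.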
The number of columns/features that we have been working with is huge. We have 5087 rows and 3198 columns in our training dataset. We need to decrease the number of features (dimensionality reduction) to reduce the risk of the curse of dimensionality.
For reducing the number of dimensions/features we will use one of the most popular dimensionality reduction algorithms, PCA (Principal Component Analysis).
To perform PCA we have to choose the number of features/dimensions that we want in our data.
The above code gives k=37.
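A component count like this can be chosen by accumulating the explained variance ratio until it crosses a threshold. The sketch below shows one way to do that on synthetic data, so it will not reproduce k=37; the 0.90 threshold and the helper name choose_k are assumptions rather than values taken from the article.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 5 strong latent directions embedded in 40 noisy features.
rng = np.random.default_rng(3)
mixing = rng.normal(size=(5, 40))
x_train = rng.normal(size=(300, 5)) @ mixing + 0.1 * rng.normal(size=(300, 40))
x_test = rng.normal(size=(60, 5)) @ mixing + 0.1 * rng.normal(size=(60, 40))


def choose_k(data, threshold):
    """Smallest number of components whose cumulative variance ratio reaches threshold."""
    cumulative = np.cumsum(PCA().fit(data).explained_variance_ratio_)
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))


k = choose_k(x_train, threshold=0.90)  # assumed threshold

# Fit PCA on the training data and project both sets onto the same k components.
pca = PCA(n_components=k)
x_train_pca = pca.fit_transform(x_train)
x_test_pca = pca.transform(x_test)
print(k, x_train_pca.shape, x_test_pca.shape)
```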
Now let us take k=37 and apply PCA on our independent variables.
The above plot tells us that by selecting 37 components we can preserve roughly 98.8% to 99% of the total variance of the data. This makes sense: we do not aim for 100% of the variance, because that would require all components, and we want only the principal ones.
The number of columns got reduced to 37 in both test and train datasets.
Moving on to the next step: as we know, the target class is not equally distributed and one class dominates the other. So we need to resample our data so that the target class is equally distributed.
There are 4 ways of addressing class imbalance problems like these:
- Synthesis of new minority class instances
- Over-sampling of minority class
- Under-sampling of the majority class
- Tweak the cost function to make misclassification of minority instances more important than misclassification of majority instances.
We have used the SMOTE (Synthetic Minority Over-sampling TEchnique) resampling method. It is an over-sampling method that creates synthetic (not duplicate) samples of the minority class until the minority class is as large as the majority class. SMOTE does this by selecting similar records and altering a record one column at a time by a random amount within the difference to the neighboring records.
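To make the idea concrete, here is a small NumPy sketch of SMOTE-style interpolation. It is an illustration of the principle, not the library implementation: it draws one random fraction per synthetic sample and places the new point on the segment between a minority record and one of its nearest minority neighbors. The function name smote_sketch and the toy data are hypothetical. In practice, the imbalanced-learn package provides a SMOTE class with a fit_resample method.

```python
import numpy as np


def smote_sketch(x_minority, n_new, k_neighbors=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    rng = np.random.default_rng(seed)
    n = len(x_minority)
    k = min(k_neighbors, n - 1)

    # Pairwise Euclidean distances within the minority class.
    diffs = x_minority[:, None, :] - x_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                     # random minority samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]  # one neighbour for each
    gap = rng.random((n_new, 1))                               # position along the segment
    return x_minority[base] + gap * (x_minority[picked] - x_minority[base])


# Toy data: 95 majority points and 5 minority points with 4 features.
rng = np.random.default_rng(5)
x_major = rng.normal(0.0, 1.0, size=(95, 4))
x_minor = rng.normal(3.0, 1.0, size=(5, 4))

synthetic = smote_sketch(x_minor, n_new=len(x_major) - len(x_minor), k_neighbors=3)
x_balanced = np.vstack([x_major, x_minor, synthetic])
y_balanced = np.array([0] * len(x_major) + [1] * (len(x_minor) + len(synthetic)))
print(np.bincount(y_balanced))
```

Because every synthetic point lies between two real minority records, the new samples stay inside the region the minority class already occupies.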
Now it comes to building a model which can classify exoplanets on the test data.
So I’ll create a function model which will:
- fit the model
- perform Cross-validation
- Check the Accuracy of our model
- generate Classification report
- generate Confusion matrix
There is always a need to validate the stability of your machine learning model. I mean you just can’t fit the model to your training data and hope it would accurately work for the real data it has never seen before. You need some kind of assurance that your model has got most of the patterns from the data correct, and that it is not picking up too much on the noise, or in other words that it is low on bias and variance.
Now fit the Support Vector Machine (SVM) algorithm to the training set and do prediction.
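The helper described above and the SVM fit can be sketched together. This is a minimal version under stated assumptions: the signature of model, the returned dictionary, the 5-fold cross-validation, the default SVC settings, and the synthetic data from make_classification are all illustrative choices, not the article's exact code.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC


def model(classifier, x_train, y_train, x_test, y_test, cv=5):
    """Fit, cross-validate on the training set, and evaluate on the test set."""
    classifier.fit(x_train, y_train)
    # cross_val_score clones the classifier, so the fitted one is left untouched.
    scores = cross_val_score(classifier, x_train, y_train, cv=cv, scoring="accuracy")
    predictions = classifier.predict(x_test)
    return {
        "cv_mean": scores.mean(),
        "cv_std": scores.std(),
        "test_accuracy": accuracy_score(y_test, predictions),
        "report": classification_report(y_test, predictions),
        "confusion": confusion_matrix(y_test, predictions),
    }


# Synthetic stand-in for the preprocessed flux features.
x, y = make_classification(n_samples=400, n_features=10, n_informative=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, stratify=y, random_state=0
)

results = model(SVC(), x_train, y_train, x_test, y_test)
print(f"cross-validation accuracy: {results['cv_mean']:.3f} +/- {results['cv_std']:.3f}")
print(results["report"])
print(results["confusion"])
```

On data as imbalanced as the exoplanet set, the per-class recall in the classification report and the confusion matrix are more informative than the overall accuracy.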
Also, try the Random Forest model and get the feature importance. Before doing that, add the feature-importance code to the function model, and then call the Random Forest classification algorithm.
Generally, Feature importance provides a score that indicates how useful or valuable each feature was in the construction of the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance.
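In scikit-learn, a fitted RandomForestClassifier exposes these scores through its feature_importances_ attribute. The sketch below uses toy data in which only the first column determines the label, so that column should rank first; the data and the choice of 100 trees are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: column 0 carries the signal, columns 1-4 are pure noise.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
x = np.column_stack([signal, rng.normal(size=(n, 4))])
y = (signal > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)

importances = forest.feature_importances_   # one score per feature, summing to 1
ranking = np.argsort(importances)[::-1]      # feature indices, most important first

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

After PCA, each feature is a principal component rather than an original flux reading, so the importances rank components, not time steps.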
We can see that we are getting pretty good results from SVM and Random forest algorithms. However, you can go ahead and tweak the parameters and also use other algorithms and check the difference in Accuracy.
Now let's try to solve the same problem with an artificial neural network (ANN) using the Keras Python library.
The neural network model gave an accuracy mean of 91.86% and an accuracy variance of 7.30% after cross-validation, which is a pretty good result.
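The cross-validated mean and spread of accuracy can be computed as in the sketch below. To keep the example dependency-light it swaps in scikit-learn's MLPClassifier for the Keras network, so it shows the evaluation procedure rather than the article's model; the layer sizes, the synthetic data, and the use of the standard deviation as the spread measure are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed, resampled training data.
x, y = make_classification(n_samples=300, n_features=10, n_informative=5, random_state=1)

# Scaling inside the pipeline is refit on each training fold.
network = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=1),
)

accuracies = cross_val_score(network, x, y, cv=5, scoring="accuracy")
print(f"accuracy mean: {accuracies.mean():.2%}")
print(f"accuracy std:  {accuracies.std():.2%}")
```

A large spread across folds suggests the score depends heavily on which samples land in each fold, which is common when the minority class is small.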
Conclusion: The Future
It’s amazing we are able to gather light from distant stars, study this light that has been traveling for thousands of years, and make conclusions about what potential worlds these stars might harbor.
Within the next 10 years, telescopes 30 to 40 m in diameter will operate from the Earth to detect exoplanets by imaging and velocity variations of the stars. Satellite telescopes including Cheops, JWST, Plato, and Ariel, will be launched to detect planets by the transit method. The JWST will also do direct imaging. Large space telescopes 8 to 18 m in diameter (LUVOIR, HabEx) are being designed at NASA to detect signs of life on exoplanets by 2050.
In the more distant future, huge space interferometers will make detailed maps of planets. And possibly, interstellar probes will be launched towards the nearest exoplanets to take close-up images. Engineers are already working on propulsion techniques to reach such distant targets.
So in this article, we predicted the presence of an exoplanet using machine learning models and neural networks.
Well, that’s all for this article. I hope you have enjoyed reading it. I’ll be glad if the article is of any help. Feel free to share your comments, thoughts, or feedback in the comment section.
You can find the code in my Github repository:
Thanks for reading!!!
Bio: Nagesh Singh Chauhan is a Data Science enthusiast. Interested in Big Data, Python, Machine Learning.
Original. Reposted with permission.