2025-10-01

·

Exoplanet Detection using Hybrid Ensemble ML Architecture and NASA Open Source Datasets

·

ML & Astronomy

·

4 min read

1. Introduction

This project presents an ensemble model for exoplanet detection. The methodology draws on previous work by Saha (2021) and Luz et al. (2024).

2. Machine Learning Workflow

The process follows the standard machine learning workflow: data retrieval, preprocessing, model training, and evaluation of results.

2.1 Data Retrieval

Data is retrieved programmatically via the Table Access Protocol (TAP) service. The available tables can be found at:

https://exoplanetarchive.ipac.caltech.edu/docs/TAP/usingTAP.html

Astronomical Data Query Language (ADQL) is used to query records from the following tables:

  • TOI (TESS Objects of Interest)
  • KOI (Kepler Objects of Interest)
  • K2PandC (K2 Planets and Candidates)

The query structure follows this format:

query = "SELECT * FROM <table>"

After the query is URL-encoded, the request below returns the data of interest in CSV format.

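import urllib.parse

# URL-encode the ADQL query so it can be passed as a query-string parameter
encoded_query = urllib.parse.quote(query)

# TAP URL (synchronous endpoint, CSV output)
tap_url = f"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query={encoded_query}&format=csv"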
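The retrieval step can be sketched end to end using only the standard library. This is a minimal illustration rather than the project's actual code: the helper names `build_tap_url` and `fetch_rows` are hypothetical, and the table names `toi`, `cumulative`, and `k2pandc` are assumptions that should be checked against the archive's TAP documentation.

```python
import csv
import io
import urllib.parse
import urllib.request

TAP_SYNC_URL = "https://exoplanetarchive.ipac.caltech.edu/TAP/sync"


def build_tap_url(table: str) -> str:
    """Return the synchronous TAP URL that selects every column of `table` as CSV."""
    query = f"SELECT * FROM {table}"
    encoded_query = urllib.parse.quote(query)
    return f"{TAP_SYNC_URL}?query={encoded_query}&format=csv"


def fetch_rows(table: str, timeout: float = 120.0) -> list[dict[str, str]]:
    """Download `table` and parse the CSV response into a list of row dicts."""
    with urllib.request.urlopen(build_tap_url(table), timeout=timeout) as response:
        text = response.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    # Assumed table names; only the URLs are printed here, nothing is downloaded.
    for table in ("toi", "cumulative", "k2pandc"):
        print(table, build_tap_url(table))
```

Calling `fetch_rows("toi")` would perform the actual download. The same URL can also be passed directly to `pandas.read_csv` if a DataFrame is preferred.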

2.2 Data Preprocessing

In the data preprocessing phase, the data are explored to understand their key features. For each dataset, transit duration is plotted against transit depth.

Fig. 1: Transit Duration vs Transit Depth for Kepler dataset.

Similarly for K2,

Fig. 2: Transit Duration vs Transit Depth for K2 dataset.

Finally for TESS,

Fig. 3: Transit Duration vs Transit Depth for TESS dataset.
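Scatter plots like the three above could be produced along the following lines. This is a sketch under assumptions: the helper `plot_duration_vs_depth` is hypothetical, the column names `koi_duration` and `koi_depth` are assumed for the Kepler table, and the three rows are toy values rather than real measurements.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch also runs without a display

import matplotlib.pyplot as plt
import pandas as pd


def plot_duration_vs_depth(df, duration_col, depth_col, title):
    """Scatter transit depth against transit duration for one dataset."""
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.scatter(df[duration_col], df[depth_col], s=8, alpha=0.5)
    ax.set_xlabel("Transit duration")
    ax.set_ylabel("Transit depth")
    ax.set_title(title)
    return fig, ax


# Toy rows standing in for the Kepler table; column names are assumptions.
koi = pd.DataFrame({
    "koi_duration": [2.1, 3.4, 5.0],
    "koi_depth": [120.0, 850.0, 14000.0],
})
fig, ax = plot_duration_vs_depth(
    koi, "koi_duration", "koi_depth", "Kepler: Transit Duration vs Transit Depth"
)
```

The same helper can be reused for K2 and TESS by passing each table's own duration and depth column names, and `fig.savefig(...)` writes the figure to disk.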

After removing all classes other than CONFIRMED and CANDIDATE (that is, dropping the FALSE POSITIVES), we are left with:

KOI Disposition

Upon removing FALSE POSITIVES: 4725 rows
Remaining classes: koi_disposition
CONFIRMED    2746
CANDIDATE    1979
Name: count, dtype: int64

K2 Disposition

Class distribution
Upon removing FALSE POSITIVES: 3689 rows
Remaining classes: disposition
CONFIRMED    2315
CANDIDATE    1374
Name: count, dtype: int64

TESS Disposition


Upon removing FALSE POSITIVES: 5946 rows
Remaining classes: tfopwg_disp
PC    4679 # planetary candidates (CANDIDATES)
CP     684 # confirmed planets (CONFIRMED)
KP     583 # known planets  (CONFIRMED)
Name: count, dtype: int64

Candidates are mapped to 0 and confirmed planets are mapped to 1. For TESS, both CP and KP count as confirmed.
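The label mapping can be illustrated with a small pandas sketch. The helper `encode_disposition` and the `label` column name are hypothetical, and the frame below is toy data in the style of the TESS disposition column shown above.

```python
import pandas as pd


def encode_disposition(df, column, confirmed, candidate):
    """Keep only confirmed/candidate rows and add a binary `label` column."""
    mapping = {name: 1 for name in confirmed}
    mapping.update({name: 0 for name in candidate})
    kept = df[df[column].isin(list(mapping))].copy()
    kept["label"] = kept[column].map(mapping)
    return kept


# Toy TESS-style frame; FP (false positive) rows are filtered out.
toi = pd.DataFrame({"tfopwg_disp": ["PC", "CP", "KP", "FP", "PC"]})
encoded = encode_disposition(
    toi, "tfopwg_disp", confirmed=["CP", "KP"], candidate=["PC"]
)
print(encoded["label"].tolist())  # [0, 1, 1, 0]
```

For the Kepler and K2 tables the same call would use `confirmed=["CONFIRMED"]` and `candidate=["CANDIDATE"]` with the corresponding disposition column.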

Next, null values are inspected: columns with more than 50% of their values missing are identified and dropped.
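A minimal sketch of this filtering step, assuming the data are held in a pandas DataFrame. The helper name `drop_sparse_columns` and the toy columns are illustrative only.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.5):
    """Drop columns whose fraction of missing values exceeds `max_missing`."""
    missing_fraction = df.isna().mean()
    sparse = missing_fraction[missing_fraction > max_missing].index
    return df.drop(columns=sparse), list(sparse)


df = pd.DataFrame({
    "period": [1.0, 2.0, 3.0, 4.0],                  # no missing values
    "depth": [np.nan, 5.0, 6.0, 7.0],                # 25% missing, kept
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],   # 75% missing, dropped
})
cleaned, dropped = drop_sparse_columns(df)
print(dropped)  # ['mostly_empty']
```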

2.2.1 Feature Engineering & Selection

In the feature engineering phase, the datasets are processed to reduce dimensionality. The first step computes Pearson correlation coefficients. The disposition column of each dataset is then used as the target variable to identify the top 20 and top 30 features.
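One plausible way to turn Pearson correlation into a feature ranking is to sort features by the absolute value of their correlation with the binary target. The sketch below assumes that interpretation, which the text does not spell out. The helper `top_k_by_pearson` and the toy columns are hypothetical.

```python
import pandas as pd


def top_k_by_pearson(df, target, k):
    """Rank numeric features by |Pearson r| with `target` and return the top k names."""
    features = df.drop(columns=[target]).select_dtypes(include="number")
    scores = features.corrwith(df[target]).abs()
    return scores.sort_values(ascending=False).head(k).index.tolist()


toy = pd.DataFrame({
    "label":  [0, 1, 0, 1, 1, 0, 1, 0],
    "strong": [0.1, 0.9, 0.2, 1.1, 0.8, 0.0, 1.0, 0.3],  # tracks the label closely
    "medium": [5, 3, 4, 6, 2, 7, 1, 8],                  # moderately anti-correlated
    "noise":  [1, 2, 3, 4, 5, 6, 7, 8],                  # uncorrelated with the label
})
print(top_k_by_pearson(toy, "label", k=2))  # ['strong', 'medium']
```

Taking the absolute value keeps strongly anti-correlated features, which carry as much linear signal as positively correlated ones. With the real data, `k` would be 20 or 30.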

2.2.2 Apply Scikit-learn Iterative Imputer

Although dimensionality has been reduced, null entries remain in the dataset. Dropping further columns that contain null values would discard information and could reduce the accuracy of the model. To address this, scikit-learn's IterativeImputer is applied.

The iterative imputer models each feature with missing values, x, as a function of the other features, y, and uses the resulting estimate of p(x|y) to fill in the missing entries, cycling through the features in round-robin fashion.
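A small example of the imputer on toy data, where the second column is roughly twice the first. The array and the `max_iter` and `random_state` values are illustrative assumptions, not the project's settings. Note that IterativeImputer is still marked experimental in scikit-learn and must be enabled explicitly.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables the import below)
from sklearn.impute import IterativeImputer

# Toy data: the missing entry in row 3 should be estimated from column 0.
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [4.0, np.nan],
    [5.0, 10.0],
])

imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)

print(X_imputed[3, 1])  # estimate for the missing value
```

Observed values are left unchanged; only the entries that were NaN are replaced by model estimates.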

2.3 Model Training

For model training, each dataset is split into training and testing sets in an 80:20 ratio. StandardScaler is then applied after the split, rather than before it, to prevent information from the test set leaking into the model.
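The usual leakage-free pattern is to fit the scaler on the training portion only and reuse its statistics for the test portion. The sketch below shows that pattern on synthetic data. The random arrays and the `random_state` value are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the selected features and binary labels.
rng = np.random.default_rng(42)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 4))
y = rng.integers(0, 2, size=100)

# 80:20 split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training set only, then apply the same transform to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Because the mean and standard deviation come from the training rows alone, the test set has no influence on the transformation the model sees during fitting.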

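A soft-voting VotingClassifier is defined as an ensemble of five models: a random forest, XGBoost, a support vector machine, logistic regression, and a multilayer perceptron.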

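from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

ensemble_20 = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),  # probability=True is required for soft voting
    ('lr', LogisticRegression(random_state=42, max_iter=2000)),
    ('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')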
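With `voting='soft'`, the ensemble averages the class probabilities predicted by its members and picks the class with the highest mean, instead of counting one vote per model. The NumPy sketch below illustrates the difference for a single sample. The three probability rows are invented numbers, not outputs of the models above.

```python
import numpy as np

# Hypothetical predict_proba outputs of three models for one sample:
# each row is [P(candidate), P(confirmed)].
probabilities = np.array([
    [0.90, 0.10],  # very confident "candidate"
    [0.40, 0.60],  # mildly "confirmed"
    [0.45, 0.55],  # mildly "confirmed"
])

# Soft voting: average the probabilities, then take the most likely class.
soft_vote = int(np.argmax(probabilities.mean(axis=0)))

# Hard voting: each model casts one vote for its own most likely class.
hard_votes = probabilities.argmax(axis=1)
hard_vote = int(np.bincount(hard_votes, minlength=2).argmax())

print(soft_vote, hard_vote)  # 0 1
```

Soft voting lets one confident model outweigh two lukewarm ones, which is why every member must be able to produce probability estimates.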

2.4 Results

The hybrid ensemble performs slightly better with the top 30 features than with the top 20 on the Kepler and K2 datasets, while on TESS the top-20 model is marginally ahead. This binary classification approach yields the following accuracy results:

Dataset   Features   Accuracy
Kepler    30         88.4%
Kepler    20         87.2%
K2        30         99.3%
K2        20         99.2%
TESS      20         86.1%
TESS      30         85.8%

3. Notebooks

View Kepler Notebook
View K2 Notebook
View TESS Notebook

4. References

Saha, R. (2021). Comparing Classification Models on Kepler Data. arXiv preprint arXiv:2101.01904 [astro-ph.EP]. Viewed 5/10/2025. https://arxiv.org/abs/2101.01904

Luz, T. S. F., Braga, R. A. S., & Ribeiro, E. R. (2024). Assessment of Ensemble-Based Machine Learning Algorithms for Exoplanet Identification. Electronics, 13(19), 3950. https://doi.org/10.3390/electronics13193950

Topics

  • Exoplanet Detection
  • Photometric Data Analysis
  • NASA Space Mission Data
  • Astronomical Data Analysis
  • Time Series Analysis
  • Transit Detection

Tech Stack

  • Python
  • Scikit-learn
  • TensorFlow/PyTorch
  • Pandas
  • NumPy
  • Matplotlib/Plotly
  • Astronomical Data Query Language (ADQL)