Exoplanet Detection using Hybrid Ensemble ML Architecture (Part I)¶

Kepler Object of Interest (KOI)¶

In this notebook we focus on the Kepler Project mission data source that can be found here.

We explore the open-source data collected by the Kepler transit method satellite, and build a suitable ML model for exoplanet identification.

The historical data cover the period 2009-2013. This work is built upon previous studies you can find in the references section.

In [ ]:
# Adding imports
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt


import requests
import urllib.parse
import io

import joblib
In [ ]:
import warnings
# suppress all warnings
warnings.filterwarnings('ignore')

The data is queried using Astronomical Data Query Language (ADQL) is used to access the data programmatically via TAP service.

In [ ]:
# define query to select all data from cumulative table
kepler_query = "SELECT * FROM cumulative"

# use HTTP client to parse the query
encoded_query = urllib.parse.quote(kepler_query)

# use TAP url to programmatically fetch Kepler historical data
tap_url = f"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query={encoded_query}&format=csv"

print("Downloading Kepler KOI Cumulative data...")
print(f"URL: {tap_url}")

response = requests.get(tap_url)

if response.status_code == 200:
  print("Successful download...")
else:
  print(f"Encounter error while downloading: {response.status_code}")
Downloading Kepler KOI Cumulative data...
URL: https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=SELECT%20%2A%20FROM%20cumulative&format=csv
Successful download...
In [ ]:
# Convert the downloaded data into a panda dataframe
kepler_data = pd.read_csv(io.StringIO(response.text))

print(f"Dataset shape: {kepler_data.shape}")
print(f"# of columns: {len(kepler_data.columns)}")
print(f"Sample of column names: {list(kepler_data.columns[:10])}")
Dataset shape: (9564, 153)
# of columns: 153
Sample of column names: ['kepid', 'kepoi_name', 'kepler_name', 'ra', 'ra_err', 'ra_str', 'dec', 'dec_err', 'dec_str', 'koi_gmag']
In [ ]:
print(kepler_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9564 entries, 0 to 9563
Columns: 153 entries, kepid to koi_score
dtypes: float64(127), int64(6), object(20)
memory usage: 11.2+ MB
None
In [ ]:
print("\n=== First 5 rows ===")
print(kepler_data.head())

print("\n=== Disposition counts ===")
print(kepler_data['koi_disposition'].value_counts())
=== First 5 rows ===
      kepid kepoi_name   kepler_name         ra  ra_err        ra_str  \
0  10797460  K00752.01  Kepler-227 b  291.93423     0.0  19h27m44.22s   
1  10797460  K00752.02  Kepler-227 c  291.93423     0.0  19h27m44.22s   
2  10811496  K00753.01           NaN  297.00482     0.0  19h48m01.16s   
3  10848459  K00754.01           NaN  285.53461     0.0  19h02m08.31s   
4  10854555  K00755.01  Kepler-664 b  288.75488     0.0  19h15m01.17s   

         dec  dec_err       dec_str  koi_gmag  ...  koi_fpflag_co  \
0  48.141651      0.0  +48d08m29.9s    15.890  ...              0   
1  48.141651      0.0  +48d08m29.9s    15.890  ...              0   
2  48.134129      0.0  +48d08m02.9s    15.943  ...              0   
3  48.285210      0.0  +48d17m06.8s    16.100  ...              0   
4  48.226200      0.0  +48d13m34.3s    16.015  ...              0   

   koi_fpflag_ec  koi_insol  koi_insol_err1  koi_insol_err2  koi_srho  \
0              0      93.59           29.45          -16.65   3.20796   
1              0       9.11            2.87           -1.62   3.02368   
2              0      39.30           31.04          -10.49   7.29555   
3              0     891.96          668.95         -230.35   0.22080   
4              0     926.16          874.33         -314.24   1.98635   

   koi_srho_err1  koi_srho_err2  koi_fittype  koi_score  
0        0.33173       -1.09986      LS+MCMC      1.000  
1        2.20489       -2.49638      LS+MCMC      0.969  
2       35.03293       -2.75453      LS+MCMC      0.000  
3        0.00917       -0.01837      LS+MCMC      0.000  
4        2.71141       -1.74541      LS+MCMC      1.000  

[5 rows x 153 columns]

=== Disposition counts ===
koi_disposition
FALSE POSITIVE    4839
CONFIRMED         2746
CANDIDATE         1979
Name: count, dtype: int64
In [ ]:
kepler_data.tail(5)
Out[ ]:
kepid kepoi_name kepler_name ra ra_err ra_str dec dec_err dec_str koi_gmag ... koi_fpflag_co koi_fpflag_ec koi_insol koi_insol_err1 koi_insol_err2 koi_srho koi_srho_err1 koi_srho_err2 koi_fittype koi_score
9559 10090151 K07985.01 NaN 297.18875 0.0 19h48m45.30s 47.093819 0.0 +47d05m37.7s 14.082 ... 1 0 4500.53 3406.38 -1175.26 0.16318 0.01693 -0.01718 LS+MCMC 0.000
9560 10128825 K07986.01 NaN 286.50937 0.0 19h06m02.25s 47.163219 0.0 +47d09m47.6s 15.189 ... 0 0 1585.81 1537.86 -502.22 0.50770 0.07330 -0.26989 LS+MCMC 0.497
9561 10147276 K07987.01 NaN 294.16489 0.0 19h36m39.57s 47.176281 0.0 +47d10m34.6s 15.853 ... 1 0 5713.41 5675.74 -1836.94 8.97692 23.11894 -8.63287 LS+MCMC 0.021
9562 10155286 K07988.01 NaN 296.76288 0.0 19h47m03.09s 47.145142 0.0 +47d08m42.5s 11.733 ... 0 0 22.68 2.07 -10.95 85.88623 17.31552 -41.55038 LS+MCMC 0.092
9563 10156110 K07989.01 NaN 297.00977 0.0 19h48m02.34s 47.121021 0.0 +47d07m15.7s 15.210 ... 1 1 607.42 600.39 -194.33 1.40645 0.16166 -0.95964 LS+MCMC 0.000

5 rows × 153 columns

Exploratory data analysis¶

In [ ]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(kepler_data['koi_depth'], kepler_data['koi_duration'],
           alpha=0.6, s=20, c='#1e40af', edgecolors='none')

# Add labels and title
plt.xlabel('KOI Depth [ppm]', fontsize=12, fontweight='bold')
plt.ylabel('KOI Duration [hrs]', fontsize=12, fontweight='bold')
plt.title('Transit Duration vs Transit Depth', fontsize=14, fontweight='bold')

# Add grid for better readability
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# where are the outliers?

# Box plots to identify outliers
plt.figure(figsize=(15, 5))
kepler_data[['koi_period', 'koi_depth', 'koi_duration']].boxplot()
plt.title('Outlier Detection in Key Features')
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image
In [ ]:
# class distribution w.t.r. to koi_disposition

plt.figure(figsize=(8, 6))
class_counts = kepler_data['koi_disposition'].value_counts()
colors = ['#10b981', '#f59e0b', '#ef4444']  # Green, Orange, Red

bars = plt.bar(class_counts.index, class_counts.values, color=colors, alpha=0.8,width=0.3)
plt.title('Distribution of Exoplanet Classes (Original KOI Dataset)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

# Add value labels on bars
for bar, count in zip(bars, class_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
             f'{count:,}', ha='center', va='bottom', fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("Class distribution:")
for class_name, count in class_counts.items():
    percentage = (count / len(kepler_data)) * 100
    print(f"{class_name}: {count:,} ({percentage:.1f}%)")
No description has been provided for this image
Class distribution:
FALSE POSITIVE: 4,839 (50.6%)
CONFIRMED: 2,746 (28.7%)
CANDIDATE: 1,979 (20.7%)

Data Preprocessing¶

Only koi_disposition values corresponding to CONFIRMED and CANDIDATE are considered for classification. Initially, multi-class classification was used to include FALSE POSITIVE entries; however, this method yielded worse accuracy in comparison.

In [ ]:
# Removing FALSE POSITIVES for binary classification
kepler_data_binary = kepler_data[kepler_data['koi_disposition'] != 'FALSE POSITIVE'].copy()
print(f"Upon removing FALSE POSITIVES: {kepler_data_binary.shape[0]} rows")
print(f"Remaining classes: {kepler_data_binary['koi_disposition'].value_counts()}")

# Create y_binary target
y_binary = kepler_data_binary['koi_disposition'].map({
    'CONFIRMED': 1,
    'CANDIDATE': 0
})
Upon removing FALSE POSITIVES: 4725 rows
Remaining classes: koi_disposition
CONFIRMED    2746
CANDIDATE    1979
Name: count, dtype: int64

In the next step, null values are inspected. Columns with > 50% values missing are identified.

In [ ]:
missing_percentages = (kepler_data_binary.isnull().sum() / len(kepler_data_binary)) * 100

high_missing = missing_percentages[missing_percentages > 50]
print(f"Columns with >50% missing data: {len(high_missing)}")
print(high_missing.sort_values(ascending=False))

# Show columns with >80% missing
excessive_missing = missing_percentages[missing_percentages > 80]
print(f"\nColumns with >80% missing data: {len(excessive_missing)}")
print(excessive_missing.sort_values(ascending=False))
Columns with >50% missing data: 24
koi_gmag_err        100.0
koi_rmag_err        100.0
koi_imag_err        100.0
koi_zmag_err        100.0
koi_kepmag_err      100.0
koi_model_dof       100.0
koi_model_chisq     100.0
koi_eccen_err1      100.0
koi_eccen_err2      100.0
koi_longp           100.0
koi_longp_err1      100.0
koi_longp_err2      100.0
koi_sma_err1        100.0
koi_sma_err2        100.0
koi_ingress         100.0
koi_ingress_err1    100.0
koi_ingress_err2    100.0
koi_incl_err1       100.0
koi_incl_err2       100.0
koi_teq_err1        100.0
koi_teq_err2        100.0
koi_sage            100.0
koi_sage_err1       100.0
koi_sage_err2       100.0
dtype: float64

Columns with >80% missing data: 24
koi_gmag_err        100.0
koi_rmag_err        100.0
koi_imag_err        100.0
koi_zmag_err        100.0
koi_kepmag_err      100.0
koi_model_dof       100.0
koi_model_chisq     100.0
koi_eccen_err1      100.0
koi_eccen_err2      100.0
koi_longp           100.0
koi_longp_err1      100.0
koi_longp_err2      100.0
koi_sma_err1        100.0
koi_sma_err2        100.0
koi_ingress         100.0
koi_ingress_err1    100.0
koi_ingress_err2    100.0
koi_incl_err1       100.0
koi_incl_err2       100.0
koi_teq_err1        100.0
koi_teq_err2        100.0
koi_sage            100.0
koi_sage_err1       100.0
koi_sage_err2       100.0
dtype: float64
In [ ]:
# drop column above due to null values

columns_drop = missing_percentages[missing_percentages > 50].index.tolist()

print(f"Columns to drop: {len(columns_drop)}")
print("Dropping these columns:")
for col in columns_drop:
    print(f"  - {col} ({missing_percentages[col]:.1f}% missing)")

# Apply the drops
kepler_data_clean = kepler_data_binary.drop(columns=columns_drop)

print(f"\nInitial dataset: {kepler_data_clean.shape}")
print(f"Cleaned dataset: {kepler_data_clean.shape}")
print(f"Removed {kepler_data.shape[1] - kepler_data_clean.shape[1]} columns")
Columns to drop: 24
Dropping these columns:
  - koi_gmag_err (100.0% missing)
  - koi_rmag_err (100.0% missing)
  - koi_imag_err (100.0% missing)
  - koi_zmag_err (100.0% missing)
  - koi_kepmag_err (100.0% missing)
  - koi_model_dof (100.0% missing)
  - koi_model_chisq (100.0% missing)
  - koi_eccen_err1 (100.0% missing)
  - koi_eccen_err2 (100.0% missing)
  - koi_longp (100.0% missing)
  - koi_longp_err1 (100.0% missing)
  - koi_longp_err2 (100.0% missing)
  - koi_sma_err1 (100.0% missing)
  - koi_sma_err2 (100.0% missing)
  - koi_ingress (100.0% missing)
  - koi_ingress_err1 (100.0% missing)
  - koi_ingress_err2 (100.0% missing)
  - koi_incl_err1 (100.0% missing)
  - koi_incl_err2 (100.0% missing)
  - koi_teq_err1 (100.0% missing)
  - koi_teq_err2 (100.0% missing)
  - koi_sage (100.0% missing)
  - koi_sage_err1 (100.0% missing)
  - koi_sage_err2 (100.0% missing)

Initial dataset: (4725, 129)
Cleaned dataset: (4725, 129)
Removed 24 columns

Feature Engineering & Selection¶

Of the initial 153 columns, 25 columns were removed. To reduce dimensionality further, Pearson correlation is used.

In [ ]:
# plot correlation for numeric columns
numeric_columns = kepler_data_clean.select_dtypes(include=[np.number]).columns
correlation_matrix = kepler_data_clean[numeric_columns].corr()

print(f"Correlation matrix shape: {correlation_matrix.shape}")
Correlation matrix shape: (109, 109)
In [ ]:
# Find pairs with high correlation (>0.8 or <-0.8)
high_corr_pairs = []

for i in range(len(correlation_matrix.columns)):
    for j in range(i+1, len(correlation_matrix.columns)):
        corr_value = correlation_matrix.iloc[i, j]
        if abs(corr_value) > 0.8:  # High correlation threshold set to 0.8
            high_corr_pairs.append({
                'feature1': correlation_matrix.columns[i],
                'feature2': correlation_matrix.columns[j],
                'correlation': corr_value
            })


high_corr_df = pd.DataFrame(high_corr_pairs)
print(f"Found {len(high_corr_pairs)} highly correlated pairs:")
print(high_corr_df.sort_values('correlation', key=abs, ascending=False))
Found 91 highly correlated pairs:
            feature1         feature2  correlation
37  koi_time0bk_err1   koi_time0_err1     1.000000
87    koi_time0_err1   koi_time0_err2    -1.000000
66      koi_dor_err1     koi_dor_err2    -1.000000
62   koi_period_err1  koi_period_err2    -1.000000
58    koi_depth_err1   koi_depth_err2    -1.000000
..               ...              ...          ...
45     koi_prad_err1          koi_ror     0.821799
52        koi_impact     koi_ror_err2    -0.819624
63           koi_ror     koi_ror_err2    -0.819558
44     koi_prad_err1       koi_impact     0.814026
50        koi_impact  koi_impact_err2    -0.806706

[91 rows x 3 columns]
In [ ]:
# Create correlation heatmap
plt.figure(figsize=(8, 6))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle

sns.heatmap(correlation_matrix,
            mask=mask,
            annot=False,
            cmap='RdBu_r',
            center=0,
            square=True,
            cbar_kws={'shrink': 0.8})

plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

Given the previous graph dimensionality canbe reduced by firstly removikng highly correlated features, secondly by using the target variable koi_disposition and exatracting the top 20 or 30 features the variable is correlated with, while dropping everything else.

This will allow us to keep only the information necessary.

In [ ]:
# Remove highly correlated features
def remove_highly_correlated_features(df, threshold=0.8):
    # Compute correlation matrix
    corr_matrix = df.select_dtypes(include=[np.number]).corr()

    # array containing highly correlated pairs
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j]))

    features_to_remove = set()
    for feat1, feat2 in high_corr_pairs:

        if df[feat1].isnull().sum() <= df[feat2].isnull().sum():
            features_to_remove.add(feat2)
        else:
            features_to_remove.add(feat1)

    return df.drop(columns=list(features_to_remove)), list(features_to_remove)

kepler_data_final, removed_features = remove_highly_correlated_features(kepler_data_clean, threshold=0.8)

print(f"Initial features: {kepler_data_clean.shape[1]}")
print(f"After removal correlated features: {kepler_data_final.shape[1]}")
print(f"Removed {len(removed_features)} features:")
for feat in removed_features:
    print(f"  - {feat}")
Initial features: 129
After removal correlated features: 85
Removed 44 features:
  - koi_fwm_pdeco_err
  - koi_srho_err2
  - koi_time0_err2
  - koi_dor_err1
  - koi_imag
  - koi_steff_err2
  - koi_fwm_sdec
  - koi_gmag
  - koi_impact
  - koi_time0
  - dec
  - koi_dicco_mdec_err
  - koi_prad_err2
  - koi_insol_err1
  - koi_impact_err2
  - koi_max_mult_ev
  - koi_fwm_srao_err
  - koi_ror_err2
  - koi_period_err2
  - koi_dicco_msky_err
  - koi_dor_err2
  - koi_dikco_msky_err
  - koi_dicco_mra_err
  - koi_insol_err2
  - koi_hmag
  - koi_fwm_prao_err
  - koi_prad_err1
  - koi_depth_err2
  - koi_srad_err2
  - koi_time0_err1
  - koi_impact_err1
  - koi_zmag
  - koi_ldm_coeff1
  - koi_fwm_sdec_err
  - koi_kepmag
  - koi_kmag
  - koi_duration_err2
  - koi_fwm_sra
  - koi_ror
  - koi_dor
  - koi_rmag
  - koi_fwm_sdeco_err
  - koi_sma
  - koi_time0bk_err2

In the following step koi_disposition target is now used to further reduce the dataframe dimensionality.

In [ ]:
target = 'koi_disposition'

if target in kepler_data_clean.columns:
    # Calculate correlation with 'koi_disposition'
    target_corr = kepler_data_clean.select_dtypes(include=[np.number]).corrwith(
        pd.get_dummies(kepler_data_clean[target]).iloc[:, 0]
    ).abs().sort_values(ascending=False)

    print("Top features correlated with target 'koi_disposition':")
    print(target_corr.head(10))
Top features correlated with target 'koi_disposition':
koi_fwm_sdeco_err     0.547956
koi_fwm_sdec_err      0.547860
koi_fwm_srao_err      0.547552
koi_fwm_sra_err       0.541306
koi_dicco_msky_err    0.447153
koi_dikco_msky_err    0.437682
koi_dicco_mra_err     0.425693
koi_dikco_mra_err     0.421518
koi_dicco_mdec_err    0.412034
koi_dikco_mdec_err    0.407862
dtype: float64
In [ ]:
# Compute correlation with target (koi_disposition)
target_corr = kepler_data_final.select_dtypes(include=[np.number]).corrwith(
    pd.get_dummies(kepler_data_final['koi_disposition']).iloc[:, 0]
).abs().sort_values(ascending=False)

# Select top features without the target
top_20_features = target_corr.head(20).index.tolist()

top_30_features = target_corr.head(30).index.tolist()

X_20_features = kepler_data_final[top_20_features]
X_30_features = kepler_data_final[top_30_features]


print(f"X_20 features shape: {X_20_features.shape}")
print(f"X_30 features shape: {X_30_features.shape}")
print(f"y_binary shape: {y_binary.shape}")
print(f"Target: koi_disposition")
X_20 features shape: (4725, 20)
X_30 features shape: (4725, 30)
y_binary shape: (4725,)
Target: koi_disposition
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Calculate correlation matrix for top 20 features
correlation_matrix = kepler_data_clean[top_20_features].corr()

# Create the heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm',
            center=0,
            square=True,
            fmt='.2f',
            cbar_kws={'shrink': 0.8})
# Rotate the labels
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

plt.title('Kepler Top 20 Features Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('kepler_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [ ]:
# Differences with 30-feature model
additional_features = set(top_30_features) - set(top_20_features)
print(f"\nAdditional features in 30-feature model:")
for i, feature in enumerate(additional_features, 1):
    importance = target_corr[feature]
    print(f"  {i:2d}. {feature:25s} {importance:.3f}")
Additional features in 30-feature model:
   1. koi_depth_err1            0.151
   2. koi_ldm_coeff2            0.100
   3. koi_duration              0.133
   4. koi_steff                 0.136
   5. koi_slogg                 0.162
   6. koi_smass                 0.159
   7. koi_slogg_err1            0.129
   8. koi_model_snr             0.112
   9. koi_srho                  0.161
  10. koi_jmag_err              0.088
In [ ]:
# Let us plot a graph to understand
fig, ax = plt.subplots(figsize=(8, 6))
top_20_features_corr = target_corr.head(20)

# Create bars
bars = ax.barh(range(len(top_20_features_corr)), top_20_features_corr.values,
               color='#1e40af', alpha=0.8, height=0.8)

# Styling
ax.set_yticks(range(len(top_20_features_corr)))
ax.set_yticklabels(top_20_features_corr.index, fontsize=10)
ax.set_xlabel('Correlation Coefficient', fontsize=12, fontweight='bold')
ax.set_title('Kepler Feature Scores for Top 20 Features', fontsize=14, fontweight='bold')

# create a grid
ax.grid(axis='x', alpha=0.3, linestyle='-', linewidth=0.5)

# add statistics
n_features = len(top_20_features_corr)
max_corr = top_20_features_corr.max()
ax.text(0.02, 0.98, f'Top {n_features} Features\nMax Correlation: {max_corr:.3f}',
        transform=ax.transAxes, fontsize=10,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax.invert_yaxis()
plt.tight_layout()
plt.show()
No description has been provided for this image

Apply Scikit-Learn Iterative Imputer¶

Dispite the dimensionality is reduce there exists still null entries in the dataset. Dropping the column containing null values can cause loss of information effecting the accuracy of the model. To address this problem scikit-learn iterative imputater is used as per [1]

The iterative imputer computes p(x|y) with x being the feature with missing values and y the features containing velues.

In [ ]:
# check if X_features datasets are pd
print(type(X_20_features))
print(type(X_30_features))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
In [ ]:
# Use RandomForestImputer

# Apply imputation to feature datasets only
imputer_20 = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
    verbose=1
)

imputer_30 = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
    verbose=1
)
In [ ]:
# Impute missing values
X_20_imputed = imputer_20.fit_transform(X_20_features)
X_30_imputed = imputer_30.fit_transform(X_30_features)

# Convert to DataFrames
X_20_imputed_df = pd.DataFrame(X_20_imputed, columns=X_20_features.columns, index=X_20_features.index)
X_30_imputed_df = pd.DataFrame(X_30_imputed, columns=X_30_features.columns, index=X_30_features.index)

print(f"20-feature DataFrame after imputation: {X_20_imputed_df.shape}")
print(f"30-feature DataFrame after imputation: {X_30_imputed_df.shape}")
[IterativeImputer] Completing matrix with shape (4725, 20)
[IterativeImputer] Change: 100.74282401559108, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 111.30440420000002, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 81.99496367999998, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 132.14853723999997, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 108.8969121, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 149.60129577, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 191.11078299999997, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 289.4939520000001, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 307.45984660000005, scaled tolerance: 0.90704471 
[IterativeImputer] Change: 193.61510960000004, scaled tolerance: 0.90704471 
[IterativeImputer] Completing matrix with shape (4725, 30)
/usr/local/lib/python3.12/dist-packages/sklearn/impute/_iterative.py:895: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  warnings.warn(
[IterativeImputer] Change: 440.46695557890655, scaled tolerance: 10.894 
[IterativeImputer] Change: 348.9466892399001, scaled tolerance: 10.894 
[IterativeImputer] Change: 540.7758820600006, scaled tolerance: 10.894 
[IterativeImputer] Change: 427.88395509999975, scaled tolerance: 10.894 
[IterativeImputer] Change: 299.6462616999996, scaled tolerance: 10.894 
[IterativeImputer] Change: 432.3645534999995, scaled tolerance: 10.894 
[IterativeImputer] Change: 270.78313480000014, scaled tolerance: 10.894 
[IterativeImputer] Change: 463.99424001000045, scaled tolerance: 10.894 
[IterativeImputer] Change: 656.6959960500004, scaled tolerance: 10.894 
[IterativeImputer] Change: 382.7486146350007, scaled tolerance: 10.894 
20-feature DataFrame after imputation: (4725, 20)
30-feature DataFrame after imputation: (4725, 30)
/usr/local/lib/python3.12/dist-packages/sklearn/impute/_iterative.py:895: ConvergenceWarning: [IterativeImputer] Early stopping criterion not reached.
  warnings.warn(

Model Training¶

For the model training, the dataset is split into training and testing dataset. The ratio is 80/20. Subsequently StandardScaler is applied to the split data to avoid any data leakege of the test dataset into the model.

In [ ]:
# Split data using the y_binary by 80/20 split
X_train_20, X_test_20, y_train_20, y_test_20 = train_test_split(
    X_20_imputed_df, y_binary, test_size=0.2, random_state=42
)

X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(
    X_30_imputed_df, y_binary, test_size=0.2, random_state=42
)

print(f"20-feature training set: {X_train_20.shape}")
print(f"20-feature test set: {X_test_20.shape}")
print(f"30-feature training set: {X_train_30.shape}")
print(f"30-feature test set: {X_test_30.shape}")
20-feature training set: (3780, 20)
20-feature test set: (945, 20)
30-feature training set: (3780, 30)
30-feature test set: (945, 30)
In [ ]:
# Define a standard scaler
scaler_20 = StandardScaler()
scaler_30 = StandardScaler()

# Fit scalers on training data only
X_train_20_scaled = scaler_20.fit_transform(X_train_20)
X_test_20_scaled = scaler_20.transform(X_test_20)

X_train_30_scaled = scaler_30.fit_transform(X_train_30)
X_test_30_scaled = scaler_30.transform(X_test_30)

# Convert back to DataFrames
X_train_20_scaled_df = pd.DataFrame(X_train_20_scaled, columns=X_train_20.columns, index=X_train_20.index)
X_test_20_scaled_df = pd.DataFrame(X_test_20_scaled, columns=X_test_20.columns, index=X_test_20.index)

X_train_30_scaled_df = pd.DataFrame(X_train_30_scaled, columns=X_train_30.columns, index=X_train_30.index)
X_test_30_scaled_df = pd.DataFrame(X_test_30_scaled, columns=X_test_30.columns, index=X_test_30.index)
In [ ]:
print(f"20-feature training set (scaled): {X_train_20_scaled_df.shape}")
print(f"20-feature test set (scaled): {X_test_20_scaled_df.shape}")
20-feature training set (scaled): (3780, 20)
20-feature test set (scaled): (945, 20)

Now the ensemble for the datafrane are created. Note that the maximum iteration is setto 500, after experimenting it was noted using 1000 provides negligible improvement to the model accurancy whilest increasing the training time.

In [ ]:
ensemble_20 = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(random_state=42, max_iter=2000)),
    ('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')


ensemble_30 = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(random_state=42, max_iter=2000)),
    ('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')

Train 20-feature model¶

In [ ]:
# Train ensemble
ensemble_20.fit(X_train_20_scaled_df, y_train_20)

# Evaluation
y_pred_20 = ensemble_20.predict(X_test_20_scaled_df)
In [ ]:
print(f"Ensemble_20 Accuracy: {accuracy_score(y_test_20, y_pred_20):.3f}")
print(classification_report(y_test_20, y_pred_20))
Ensemble_20 Accuracy: 0.869
              precision    recall  f1-score   support

           0       0.87      0.80      0.84       394
           1       0.87      0.92      0.89       551

    accuracy                           0.87       945
   macro avg       0.87      0.86      0.86       945
weighted avg       0.87      0.87      0.87       945

In [ ]:
# Compare with 20-feature model
y_pred_proba_20 = ensemble_20.predict_proba(X_test_20_scaled)
auc_20 = roc_auc_score(y_test_20, y_pred_proba_20[:, 1])
print(f"20-Feature Model AUC Score: {auc_20:.3f}")
20-Feature Model AUC Score: 0.952
In [ ]:
# Saving 20-feature model
model_package_20 = {
    'model': ensemble_20,
    'imputer': imputer_20,
    'scaler': scaler_20,
    'features': top_20_features,
    'accuracy': 0.869,
    'auc_score': 0.952,
    'mission': 'kepler',
    'classes': ['CANDIDATE', 'CONFIRMED'],
    'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
    'n_features': 20,
    'performance': {
        'candidate_precision': 0.87,
        'candidate_recall': 0.80,
        'confirmed_precision': 0.87,
        'confirmed_recall': 0.92
    }
}
In [ ]:
joblib.dump(model_package_20, 'kepler_20_features_model.pkl')
Out[ ]:
['kepler_20_features_model.pkl']

Train 30-feature model¶

In [ ]:
# Train ensemble
ensemble_30.fit(X_train_30_scaled_df, y_train_30)

# Evaluation
y_pred_30 = ensemble_30.predict(X_test_30_scaled_df)
In [ ]:
print(f"Ensemble_30 Accuracy: {accuracy_score(y_test_30, y_pred_30):.3f}")
print(classification_report(y_test_30, y_pred_30))
Ensemble_30 Accuracy: 0.884
              precision    recall  f1-score   support

           0       0.89      0.82      0.85       394
           1       0.88      0.93      0.90       551

    accuracy                           0.88       945
   macro avg       0.89      0.87      0.88       945
weighted avg       0.88      0.88      0.88       945

In [ ]:
# Get prediction probabilities
y_pred_proba_30 = ensemble_30.predict_proba(X_test_30_scaled)

# Calculate AUC score
auc_30 = roc_auc_score(y_test_30, y_pred_proba_30[:, 1])
print(f"30-Feature Model AUC Score: {auc_30:.3f}")
30-Feature Model AUC Score: 0.956

Note that using 30 features improve the ensamble accuracy across all metrics. Now both models are saved using .pkl

In [ ]:
# Save the 30-feature model
model_package_30 = {
    'model': ensemble_30,
    'imputer': imputer_30,
    'scaler': scaler_30,
    'features': top_30_features,
    'accuracy': 0.884,
    'auc_score': 0.956,
    'mission': 'kepler',
    'classes': ['CANDIDATE', 'CONFIRMED'],
    'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
    'n_features': 30,
    'performance': {
        'candidate_precision': 0.89,
        'candidate_recall': 0.82,
        'confirmed_precision': 0.88,
        'confirmed_recall': 0.93
    }
}

joblib.dump(model_package_30, 'kepler_final_30_features.pkl')
print("Final Kepler model saved with 30 features")
Final Kepler model saved with 30 features

Results & Discussion¶

The Hybrid ML Ensemble Architecture provides improved performance for the 30 top features comapred to Top-20 features model.

model_package_30 = {
    'model': ensemble_30,
    'imputer': imputer_30,
    'scaler': scaler_30,
    'features': top_30_features,
    'accuracy': 0.884,
    'auc_score': 0.956,  
    'mission': 'kepler',
    'classes': ['CANDIDATE', 'CONFIRMED'],
    'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
    'n_features': 30,
    'performance': {
        'candidate_precision': 0.89,
        'candidate_recall': 0.82,
        'confirmed_precision': 0.88,
        'confirmed_recall': 0.93
    }
}

The methodology used in this study borrows from Rohan Saha (2019) and S. F. Luz et al. (2024), while employing a different ml architecture.

This model has the following metrics:

  • Accuracy: 88.4%
  • AUC Score: 95.6%
  • Confirmed Exoplanet Recall: 93%
  • Candidate Precision: 89%

This suggests that for binary classification using a hybrid ensemble ML architecture yields good results.

Summary & Conclusion¶

The work done in this notebook uses as reference previous studies from Rohan Saha (2019) and S. F. Luz et al. (2024).

The machine learning workflow made use of StandardScaler and feature extraction to reduce the dimensionality of the dataframe. The models are trained using the historical data recorded by the Kepler mission between 2009 and 2018.

Following are the main features:

  • Hybrid Ensemble Architecture: Using a voting ensemble combining Random Forest, XGBoost, SVM, Logistic Regression, and Neural Networks for exoplanet detection.

  • Binary Classification Approach: Focused on distinguishing confirmed exoplanets from candidates, achieving 88.4% accuracy.

  • Iterative Imputation Preprocessing: Using iterative imputation to prevent loss of information.

  • High Discrimination Ability: AUC was computed, achieving 95.6%.

The next step is to apply this same methodology to K2 and TESS.

References¶

Saha, R 2021. Comparing Classification Models on Kepler Data. arXiv preprint arXiv:2101.01904 [astro-ph.EP], viewed 5/10/2025, https://arxiv.org/abs/2101.01904.

Luz, T. S. F., Braga, R. A. S., & Ribeiro, E. R. (2024). Assessment of Ensemble-Based Machine Learning Algorithms for Exoplanet Identification. Electronics, 13(19), 3950. https://doi.org/10.3390/electronics13193950

In [ ]: