Exoplanet Detection using Hybrid Ensemble ML Architecture (Part II)¶

K2 Object of Interest¶

In this notebook we focus on the Kepler Project mission data source that can be found here.

We explore the open-source data collected by the Kepler 2 transit method satellite, and build a suitable ML model for exoplanet identification.

The historical data cover the period 2014-2018. This work is built upon previous studies you can find in the references section.

In [ ]:
# Adding imports
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

import pandas as pd
import numpy as np
import seaborn as sns

import matplotlib.pyplot as plt


import requests
import urllib.parse
import io

import joblib
In [ ]:
import warnings
# suppress all warnings
warnings.filterwarnings('ignore')

Similarly to the Kepler notebook the data are programmatically loaded using TAP service

In [ ]:
import pandas as pd
import requests
import urllib.parse

# Define the query for the cumulative table
query = "SELECT * FROM k2pandc"

# Encode the query for the URL
encoded_query = urllib.parse.quote(query)

# TAP URL
tap_url = f"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query={encoded_query}&format=csv"

print("Downloading K2 Planets and Candidates data...")
print(f"URL: {tap_url}")

# Download the data
response = requests.get(tap_url)

# Check if successful
if response.status_code == 200:
  print("Successful download...")
else:
  print(f"Encounter error while downloading: {response.status_code}")
Downloading K2 Planets and Candidates data...
URL: https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=SELECT%20%2A%20FROM%20k2pandc&format=csv
Successful download...
In [ ]:
import io

# Convert response to DataFrame
k2_data = pd.read_csv(io.StringIO(response.text))

print(f"Dataset shape: {k2_data.shape}")
print(f"Number of columns: {len(k2_data.columns)}")
print(f"First few column names: {list(k2_data.columns[:10])}")
Dataset shape: (4004, 361)
Number of columns: 361
First few column names: ['pl_name', 'pl_letter', 'k2_name', 'epic_hostname', 'epic_candname', 'hostname', 'hd_name', 'hip_name', 'tic_id', 'gaia_id']
In [ ]:
# Check the data types and basic info
print("\n=== Dataset Info ===")
print(k2_data.info())

print("\n=== First 5 rows ===")
print(k2_data.head())

print("\n=== Disposition counts ===")
print(k2_data['disposition'].value_counts())
=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4004 entries, 0 to 4003
Columns: 361 entries, pl_name to sy_kmagerr2
dtypes: float64(241), int64(26), object(94)
memory usage: 11.0+ MB
None

=== First 5 rows ===
             pl_name pl_letter   k2_name   epic_hostname      epic_candname  \
0  EPIC 220192485.01       NaN       NaN  EPIC 220192485  EPIC 220192485.01   
1           K2-151 b         b  K2-151 b  EPIC 220621087  EPIC 220621087.01   
2           K2-369 b         b  K2-369 b  EPIC 206146957  EPIC 206146957.01   
3           K2-232 b         b  K2-232 b  EPIC 247098361                NaN   
4  EPIC 211946007.01       NaN       NaN  EPIC 211946007  EPIC 211946007.01   

         hostname    hd_name hip_name         tic_id  \
0  EPIC 220192485        NaN      NaN  TIC 248387177   
1          K2-151        NaN      NaN  TIC 376936928   
2          K2-369        NaN      NaN   TIC 49984349   
3          K2-232  HD 286123      NaN   TIC 68577662   
4  EPIC 211946007        NaN      NaN  TIC 184892124   

                        gaia_id  ...  sy_jmagerr1 sy_jmagerr2  \
0  Gaia DR2 2534555412207560960  ...        0.030      -0.030   
1  Gaia DR2 2579620343673729408  ...        0.027      -0.027   
2  Gaia DR2 2616766622463761024  ...        0.027      -0.027   
3  Gaia DR2 3406687485600728192  ...        0.030      -0.030   
4   Gaia DR2 661229701489213824  ...        0.032      -0.032   

            sy_jmagstr sy_hmag sy_hmagerr1 sy_hmagerr2            sy_hmagstr  \
0  10.463&plusmn;0.030  10.084       0.025      -0.025   10.084&plusmn;0.025   
1  10.930&plusmn;0.027  10.335       0.022      -0.022   10.335&plusmn;0.022   
2  10.357&plusmn;0.027   9.980       0.021      -0.021  9.9800&plusmn;0.0210   
3   8.739&plusmn;0.030   8.480       0.018      -0.018    8.480&plusmn;0.018   
4  14.348&plusmn;0.032  13.769       0.037      -0.037   13.769&plusmn;0.037   

  sy_kmag  sy_kmagerr1 sy_kmagerr2  
0   9.947        0.019      -0.019  
1  10.117        0.021      -0.021  
2   9.934        0.019      -0.019  
3   8.434        0.017      -0.017  
4  13.499        0.043      -0.043  

[5 rows x 361 columns]

=== Disposition counts ===
disposition
CONFIRMED         2315
CANDIDATE         1374
FALSE POSITIVE     293
REFUTED             22
Name: count, dtype: int64
In [ ]:
k2_data.tail(5)
Out[ ]:
pl_name pl_letter k2_name epic_hostname epic_candname hostname hd_name hip_name tic_id gaia_id ... sy_jmagerr1 sy_jmagerr2 sy_jmagstr sy_hmag sy_hmagerr1 sy_hmagerr2 sy_hmagstr sy_kmag sy_kmagerr1 sy_kmagerr2
3999 EPIC 211796070.01 NaN NaN EPIC 211796070 EPIC 211796070.01 EPIC 211796070 NaN NaN TIC 20497205 Gaia DR2 658897706044705280 ... 0.021 -0.021 12.163&plusmn;0.021 11.574 0.024 -0.024 11.574&plusmn;0.024 11.466 0.023 -0.023
4000 EPIC 212044495.01 NaN NaN EPIC 212044495 EPIC 212044495.01 EPIC 212044495 NaN NaN TIC 386908279 Gaia DR2 637093252297660416 ... 0.024 -0.024 11.757&plusmn;0.024 11.426 0.032 -0.032 11.426&plusmn;0.032 11.345 0.022 -0.022
4001 EPIC 220542353.01 NaN NaN EPIC 220542353 EPIC 220542353.01 HIP 4601 HD 5731 HIP 4601 TIC 266074740 Gaia DR2 2577606244530236416 ... 0.020 -0.020 7.842&plusmn;0.020 7.616 0.040 -0.040 7.616&plusmn;0.040 7.560 0.026 -0.026
4002 EPIC 211965883.01 NaN NaN EPIC 211965883 EPIC 211965883.01 EPIC 211965883 NaN NaN TIC 203287539 Gaia DR2 636574699422681088 ... 0.022 -0.022 12.184&plusmn;0.022 11.565 0.022 -0.022 11.565&plusmn;0.022 11.361 0.018 -0.018
4003 EPIC 210852232.01 NaN NaN EPIC 210852232 EPIC 210852232.01 EPIC 210852232 NaN NaN TIC 14080244 Gaia DR2 51555211171046016 ... 0.024 -0.024 11.807&plusmn;0.024 11.373 0.027 -0.027 11.373&plusmn;0.027 11.248 0.022 -0.022

5 rows × 361 columns

In [ ]:
# Create the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(k2_data['pl_trandep'], k2_data['pl_trandur'],
           alpha=0.6, s=20, c='#1e40af', edgecolors='none')

# Add labels and title
plt.xlabel('K2 Depth [ppm]', fontsize=12, fontweight='bold')
plt.ylabel('K2 Duration [hrs]', fontsize=12, fontweight='bold')
plt.title('Transit Duration vs Transit Depth', fontsize=14, fontweight='bold')

# Add grid for better readability
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
# where are the outliers?

# Box plots to identify outliers
plt.figure(figsize=(15, 5))
k2_data[['pl_orbper', 'pl_trandep', 'pl_trandur']].boxplot()
plt.title('Outlier Detection in Key Features')
plt.xticks(rotation=45)
plt.show()
No description has been provided for this image
In [ ]:
# class distribution w.t.r. to koi_disposition

plt.figure(figsize=(8, 6))
class_counts = k2_data['disposition'].value_counts()
colors = ['#10b981', '#f59e0b', '#ef4444']  # Green, Orange, Red

bars = plt.bar(class_counts.index, class_counts.values, color=colors, alpha=0.8,width=0.3)
plt.title('Distribution of Exoplanet Classes (Original K2 Dataset)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)

# Add value labels on bars
for bar, count in zip(bars, class_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
             f'{count:,}', ha='center', va='bottom', fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()

print("Class distribution:")
for class_name, count in class_counts.items():
    percentage = (count / len(k2_data)) * 100
    print(f"{class_name}: {count:,} ({percentage:.1f}%)")
No description has been provided for this image
Class distribution:
CONFIRMED: 2,315 (57.8%)
CANDIDATE: 1,374 (34.3%)
FALSE POSITIVE: 293 (7.3%)
REFUTED: 22 (0.5%)

Data Preprocessing¶

Only disposition values corresponding to CONFIRMED FALSE POSITIVE and CANDIDATE are considered for classification.

In [ ]:
# Removing FALSE POSITIVES for binary classification
k2_data_binary = k2_data[(k2_data['disposition'] != 'REFUTED') & (k2_data['disposition'] != 'FALSE POSITIVE')].copy()
print(f"Upon removing FALSE POSITIVES: {k2_data_binary.shape[0]} rows")
print(f"Remaining classes: {k2_data_binary['disposition'].value_counts()}")

# Drop rows where 'disposition' is NaN before creating y_binary
k2_data_binary.dropna(subset=['disposition'], inplace=True)

# Create y_binary target
# STEP 2: Create target variable FIRST
y_binary = k2_data_binary['disposition'].map({
    'CANDIDATE': 0,
    'CONFIRMED': 1
})
Upon removing FALSE POSITIVES: 3689 rows
Remaining classes: disposition
CONFIRMED    2315
CANDIDATE    1374
Name: count, dtype: int64

In the next step, null values are inspected. Columns with > 50% values missing are identified.

In [ ]:
missing_percentages = (k2_data_binary.isnull().sum() / len(k2_data_binary)) * 100

high_missing = missing_percentages[missing_percentages > 50]
print(f"Columns with >50% missing data: {len(high_missing)}")
print(high_missing.sort_values(ascending=False))

# Show columns with >80% missing
excessive_missing = missing_percentages[missing_percentages > 80]
print(f"\nColumns with >80% missing data: {len(excessive_missing)}")
print(excessive_missing.sort_values(ascending=False))
Columns with >50% missing data: 179
pl_occdeperr2     100.000000
pl_occdeperr1     100.000000
sy_kepmagerr1     100.000000
sy_icmagerr1      100.000000
sy_icmag          100.000000
                     ...    
sy_rmag            56.465167
pl_trandurerr2     51.070751
pl_trandurerr1     50.989428
st_masserr2        50.365953
st_masserr1        50.203307
Length: 179, dtype: float64

Columns with >80% missing data: 126
pl_occdeperr2    100.000000
pl_occdeperr1    100.000000
sy_icmagerr2     100.000000
sy_icmag         100.000000
sy_kepmagerr2    100.000000
                    ...    
pl_insolstr       83.247493
pl_eqterr2        81.621036
pl_eqterr1        81.621036
st_denserr1       81.404175
st_denserr2       81.404175
Length: 126, dtype: float64
In [ ]:
# drop column above due to null values

columns_drop = missing_percentages[missing_percentages > 50].index.tolist()

print(f"Columns to drop: {len(columns_drop)}")
print("Dropping these columns:")
for col in columns_drop:
    print(f"  - {col} ({missing_percentages[col]:.1f}% missing)")

# Apply the drops
k2_data_clean = k2_data_binary.drop(columns=columns_drop)

print(f"\nInitial dataset: {k2_data_clean.shape}")
print(f"Cleaned dataset: {k2_data_clean.shape}")
print(f"Removed {k2_data.shape[1] - k2_data_clean.shape[1]} columns")
Columns to drop: 179
Dropping these columns:
  - hd_name (97.2% missing)
  - hip_name (96.4% missing)
  - pl_orblpererr1 (94.2% missing)
  - pl_orblper (93.3% missing)
  - pl_orblpererr2 (94.2% missing)
  - pl_orblperlim (93.3% missing)
  - pl_orblperstr (93.3% missing)
  - pl_orbsmax (78.0% missing)
  - pl_orbsmaxerr1 (78.2% missing)
  - pl_orbsmaxerr2 (78.2% missing)
  - pl_orbsmaxlim (78.0% missing)
  - pl_orbsmaxstr (78.0% missing)
  - pl_orbincl (73.5% missing)
  - pl_orbinclerr1 (74.5% missing)
  - pl_orbinclerr2 (74.5% missing)
  - pl_orbincllim (73.5% missing)
  - pl_orbinclstr (73.5% missing)
  - pl_orbtper (99.2% missing)
  - pl_orbtpererr1 (99.2% missing)
  - pl_orbtpererr2 (99.2% missing)
  - pl_orbtperlim (99.2% missing)
  - pl_orbtperstr (99.2% missing)
  - pl_orbeccen (88.4% missing)
  - pl_orbeccenerr1 (93.9% missing)
  - pl_orbeccenerr2 (93.9% missing)
  - pl_orbeccenlim (88.4% missing)
  - pl_orbeccenstr (88.4% missing)
  - pl_eqt (77.2% missing)
  - pl_eqterr1 (81.6% missing)
  - pl_eqterr2 (81.6% missing)
  - pl_eqtlim (77.2% missing)
  - pl_eqtstr (77.2% missing)
  - pl_occdep (100.0% missing)
  - pl_occdeperr1 (100.0% missing)
  - pl_occdeperr2 (100.0% missing)
  - pl_occdeplim (100.0% missing)
  - pl_occdepstr (100.0% missing)
  - pl_insol (83.2% missing)
  - pl_insolerr1 (87.9% missing)
  - pl_insolerr2 (87.9% missing)
  - pl_insollim (83.2% missing)
  - pl_insolstr (83.2% missing)
  - pl_dens (90.3% missing)
  - pl_denserr1 (90.9% missing)
  - pl_denserr2 (90.9% missing)
  - pl_denslim (90.3% missing)
  - pl_densstr (90.3% missing)
  - pl_trandeperr1 (65.4% missing)
  - pl_trandeperr2 (65.4% missing)
  - pl_trandurerr1 (51.0% missing)
  - pl_trandurerr2 (51.1% missing)
  - sy_umag (56.5% missing)
  - sy_umagerr1 (56.5% missing)
  - sy_umagerr2 (56.5% missing)
  - sy_umagstr (56.5% missing)
  - sy_rmag (56.5% missing)
  - sy_rmagerr1 (56.5% missing)
  - sy_rmagerr2 (56.5% missing)
  - sy_rmagstr (56.5% missing)
  - sy_imag (56.5% missing)
  - sy_imagerr1 (56.5% missing)
  - sy_imagerr2 (56.5% missing)
  - sy_imagstr (56.5% missing)
  - sy_zmag (56.5% missing)
  - sy_zmagerr1 (56.5% missing)
  - sy_zmagerr2 (56.5% missing)
  - sy_zmagstr (56.5% missing)
  - sy_w4magerr1 (84.6% missing)
  - sy_w4magerr2 (84.6% missing)
  - sy_gmag (56.5% missing)
  - sy_gmagerr1 (56.5% missing)
  - sy_gmagerr2 (56.5% missing)
  - sy_gmagstr (56.5% missing)
  - st_metratio (57.3% missing)
  - st_spectype (88.2% missing)
  - sy_kepmagerr1 (100.0% missing)
  - sy_kepmagerr2 (100.0% missing)
  - st_rotp (95.3% missing)
  - st_rotperr1 (95.3% missing)
  - st_rotperr2 (95.3% missing)
  - st_rotplim (95.3% missing)
  - st_rotpstr (95.3% missing)
  - pl_projobliq (99.2% missing)
  - pl_projobliqerr1 (99.2% missing)
  - pl_projobliqerr2 (99.2% missing)
  - pl_projobliqlim (99.2% missing)
  - pl_projobliqstr (99.2% missing)
  - pl_rvamp (91.8% missing)
  - pl_rvamperr1 (92.4% missing)
  - pl_rvamperr2 (92.4% missing)
  - pl_rvamplim (91.8% missing)
  - pl_rvampstr (91.8% missing)
  - pl_trueobliq (99.7% missing)
  - pl_trueobliqerr1 (99.7% missing)
  - pl_trueobliqerr2 (99.7% missing)
  - pl_trueobliqlim (99.7% missing)
  - pl_trueobliqstr (99.7% missing)
  - sy_icmag (100.0% missing)
  - sy_icmagerr1 (100.0% missing)
  - sy_icmagerr2 (100.0% missing)
  - sy_icmagstr (100.0% missing)
  - pl_imppar (60.4% missing)
  - pl_impparerr1 (70.9% missing)
  - pl_impparerr2 (70.9% missing)
  - pl_impparlim (60.4% missing)
  - pl_impparstr (60.4% missing)
  - pl_cmassj (99.6% missing)
  - pl_cmassjerr1 (99.6% missing)
  - pl_cmassjerr2 (99.6% missing)
  - pl_cmassjlim (99.6% missing)
  - pl_cmassjstr (99.6% missing)
  - pl_cmasse (99.6% missing)
  - pl_cmasseerr1 (99.6% missing)
  - pl_cmasseerr2 (99.6% missing)
  - pl_cmasselim (99.6% missing)
  - pl_cmassestr (99.6% missing)
  - pl_massj (88.8% missing)
  - pl_massjerr1 (89.9% missing)
  - pl_massjerr2 (89.9% missing)
  - pl_massjlim (88.8% missing)
  - pl_massjstr (88.8% missing)
  - pl_masse (88.8% missing)
  - pl_masseerr1 (89.9% missing)
  - pl_masseerr2 (89.9% missing)
  - pl_masselim (88.8% missing)
  - pl_massestr (88.8% missing)
  - pl_bmassj (88.2% missing)
  - pl_bmassjerr1 (89.3% missing)
  - pl_bmassjerr2 (89.3% missing)
  - pl_bmassjlim (88.2% missing)
  - pl_bmassjstr (88.2% missing)
  - pl_bmasse (88.2% missing)
  - pl_bmasseerr1 (89.3% missing)
  - pl_bmasseerr2 (89.3% missing)
  - pl_bmasselim (88.2% missing)
  - pl_bmassestr (88.2% missing)
  - pl_bmassprov (88.2% missing)
  - pl_msinij (99.2% missing)
  - pl_msinijerr1 (99.2% missing)
  - pl_msinijerr2 (99.2% missing)
  - pl_msinijlim (99.2% missing)
  - pl_msinijstr (99.2% missing)
  - pl_msinie (99.2% missing)
  - pl_msinieerr1 (99.2% missing)
  - pl_msinieerr2 (99.2% missing)
  - pl_msinielim (99.2% missing)
  - pl_msiniestr (99.2% missing)
  - st_met (57.3% missing)
  - st_meterr1 (58.0% missing)
  - st_meterr2 (58.0% missing)
  - st_metlim (57.3% missing)
  - st_metstr (57.3% missing)
  - st_radv (95.1% missing)
  - st_radverr1 (95.1% missing)
  - st_radverr2 (95.3% missing)
  - st_radvlim (95.1% missing)
  - st_radvstr (95.1% missing)
  - st_vsin (89.4% missing)
  - st_vsinerr1 (91.2% missing)
  - st_vsinerr2 (91.4% missing)
  - st_vsinlim (89.4% missing)
  - st_vsinstr (89.4% missing)
  - st_lum (83.8% missing)
  - st_lumerr1 (86.0% missing)
  - st_lumerr2 (86.0% missing)
  - st_lumlim (83.8% missing)
  - st_lumstr (83.8% missing)
  - st_age (91.4% missing)
  - st_ageerr1 (92.3% missing)
  - st_ageerr2 (92.3% missing)
  - st_agelim (91.4% missing)
  - st_agestr (91.4% missing)
  - st_masserr1 (50.2% missing)
  - st_masserr2 (50.4% missing)
  - st_dens (79.3% missing)
  - st_denserr1 (81.4% missing)
  - st_denserr2 (81.4% missing)
  - st_denslim (79.3% missing)
  - st_densstr (79.3% missing)

Initial dataset: (3689, 182)
Cleaned dataset: (3689, 182)
Removed 179 columns
In [ ]:
# check if any values for disposition is Nan

# Check for NaN values in disposition column
nan_count = k2_data_clean['disposition'].isna().sum()
print(f"Number of NaN values in disposition column: {nan_count}")
Number of NaN values in disposition column: 0
In [ ]:
empty_strings = (k2_data_clean['disposition'] == '').sum()
print(f"Empty strings: {empty_strings}")
Empty strings: 0

Feature Engineering & Selection¶

Of the initial 361 columns, 179 columns were removed. To reduce dimensionality further, Pearson correlation is used.

In [ ]:
# plot correlation for numeric columns
numeric_columns = k2_data_clean.select_dtypes(include=[np.number]).columns
correlation_matrix = k2_data_clean[numeric_columns].corr()

print(f"Correlation matrix shape: {correlation_matrix.shape}")
Correlation matrix shape: (126, 126)
In [ ]:
# Create correlation heatmap
plt.figure(figsize=(8, 6))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))  # Mask upper triangle

sns.heatmap(correlation_matrix,
            mask=mask,
            annot=False,
            cmap='RdBu_r',
            center=0,
            square=True,
            cbar_kws={'shrink': 0.8})

plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
No description has been provided for this image

Given the previous graph dimensionality can be reduced by firstly removikng highly correlated features, secondly by using the target variable disposition and exatracting the top 20 or 30 features the variable is correlated with, while dropping everything else.

This will allow us to keep only the information necessary.

In [ ]:
# Remove highly correlated features
def remove_highly_correlated_features(df, threshold=0.8):
    # Compute correlation matrix
    corr_matrix = df.select_dtypes(include=[np.number]).corr()

    # array containing highly correlated pairs
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j]))

    features_to_remove = set()
    for feat1, feat2 in high_corr_pairs:

        if df[feat1].isnull().sum() <= df[feat2].isnull().sum():
            features_to_remove.add(feat2)
        else:
            features_to_remove.add(feat1)

    return df.drop(columns=list(features_to_remove)), list(features_to_remove)

k2_data_final, removed_features = remove_highly_correlated_features(k2_data_clean, threshold=0.8)

print(f"Initial features: {k2_data_clean.shape[1]}")
print(f"After removal correlated features: {k2_data_final.shape[1]}")
print(f"Removed {len(removed_features)} features:")
for feat in removed_features:
    print(f"  - {feat}")
Initial features: 182
After removal correlated features: 133
Removed 49 features:
  - pl_radeerr1
  - elon
  - sy_w3magerr2
  - sy_pmraerr2
  - sy_w3magerr1
  - sy_gaiamag
  - sy_jmagerr2
  - sy_pmraerr1
  - sy_w1magerr2
  - sy_tmagerr2
  - pl_ratrorerr2
  - pl_tranmiderr2
  - pl_radj
  - sy_hmagerr2
  - y
  - pl_orbpererr1
  - sy_gaiamagerr2
  - sy_kmag
  - z
  - sy_vmagerr2
  - sy_bmag
  - sy_plxerr2
  - st_tefferr2
  - sy_jmag
  - sy_vmag
  - pl_ratdorlim
  - pl_radelim
  - sy_w1mag
  - sy_disterr2
  - sy_w2mag
  - sy_disterr1
  - pl_radjerr1
  - x
  - sy_w3mag
  - sy_hmag
  - sy_w2magerr2
  - pl_radjerr2
  - sy_pmerr2
  - sy_bmagerr2
  - pl_rade
  - sy_kmagerr1
  - sy_pmdecerr2
  - sy_kepmagstr
  - pl_orbpererr2
  - sy_pmdecerr1
  - sy_kmagerr2
  - sy_kepmag
  - st_logg
  - pl_radeerr2

In the following step disposition target is now used to further reduce the dataframe dimensionality.

In [ ]:
target = 'disposition'

if target in k2_data_clean.columns:
    # Calculate correlation with 'koi_disposition'
    target_corr = k2_data_clean.select_dtypes(include=[np.number]).corrwith(
        pd.get_dummies(k2_data_clean[target]).iloc[:, 0]
    ).abs().sort_values(ascending=False)

    print("Top features correlated with target 'koi_disposition':")
    print(target_corr.head(10))
Top features correlated with target 'koi_disposition':
sy_pnum          0.655884
rv_flag          0.463046
default_flag     0.450125
pl_nnotes        0.363169
sy_hmag          0.307843
sy_kmag          0.307203
pl_ratrorerr2    0.304416
sy_jmag          0.302544
sy_tmag          0.292194
sy_w1mag         0.288758
dtype: float64
In [ ]:
# Compute correlation with target (koi_disposition)
target_corr = k2_data_final.select_dtypes(include=[np.number]).corrwith(
    pd.get_dummies(k2_data_final['disposition']).iloc[:, 0]
).abs().sort_values(ascending=False)

# Select top features without the target
top_20_features = target_corr.head(20).index.tolist()

top_30_features = target_corr.head(30).index.tolist()

X_20_features = k2_data_final[top_20_features]
X_30_features = k2_data_final[top_30_features]

print(f"X_20 features shape: {X_20_features.shape}")
print(f"X_30 features shape: {X_30_features.shape}")
print(f"y_binary shape: {y_binary.shape}")
print(f"Target: disposition")
X_20 features shape: (3689, 20)
X_30 features shape: (3689, 30)
y_binary shape: (3689,)
Target: disposition
In [ ]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np


# Calculate correlation matrix for top 20 features
correlation_matrix = k2_data_clean[top_20_features].corr()

# Create the heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix,
            annot=True,
            cmap='coolwarm',
            center=0,
            square=True,
            fmt='.2f',
            cbar_kws={'shrink': 0.8})
# Rotate the labels
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)

plt.title('K2 Top 20 Features Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('k2_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [ ]:
# Differences with 30-feature model
additional_features = set(top_30_features) - set(top_20_features)
print(f"\nAdditional features in 30-feature model:")
for i, feature in enumerate(additional_features, 1):
    importance = target_corr[feature]
    print(f"  {i:2d}. {feature:25s} {importance:.3f}")
Additional features in 30-feature model:
   1. pl_ntranspec              0.108
   2. sy_pmra                   0.113
   3. st_raderr2                0.120
   4. sy_hmagerr1               0.091
   5. sy_pmdec                  0.091
   6. st_tefferr1               0.115
   7. disc_year                 0.094
   8. st_mass                   0.100
   9. st_nphot                  0.121
  10. st_loggerr2               0.093
In [ ]:
# Let us plot a graph to understand
fig, ax = plt.subplots(figsize=(8, 6))
top_20_features_corr = target_corr.head(20)

# Create bars
bars = ax.barh(range(len(top_20_features_corr)), top_20_features_corr.values,
               color='#1e40af', alpha=0.8, height=0.8)

# Styling
ax.set_yticks(range(len(top_20_features_corr)))
ax.set_yticklabels(top_20_features_corr.index, fontsize=10)
ax.set_xlabel('Correlation Coefficient', fontsize=12, fontweight='bold')
ax.set_title('K2 Feature Scores for Top 20 Features', fontsize=14, fontweight='bold')

# create a grid
ax.grid(axis='x', alpha=0.3, linestyle='-', linewidth=0.5)

# add statistics
n_features = len(top_20_features_corr)
max_corr = top_20_features_corr.max()
ax.text(0.02, 0.98, f'Top {n_features} Features\nMax Correlation: {max_corr:.3f}',
        transform=ax.transAxes, fontsize=10,
        verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))

ax.invert_yaxis()
plt.tight_layout()
plt.show()
No description has been provided for this image

Apply Scikit-Learn Iterative Imputer¶

Dispite the dimensionality is reduce there exists still null entries in the dataset. Dropping the column containing null values can cause loss of information effecting the accuracy of the model. To address this problem scikit-learn iterative imputater is used as per [1]

The iterative imputer computes p(x|y) with x being the feature with missing values and y the features containing velues.

In [ ]:
# check if X_features datasets are pd
print(type(X_20_features))
print(type(X_30_features))
<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
In [ ]:
# Use RandomForestImputer

# Apply imputation to feature datasets only
imputer_20 = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
    verbose=1
)

imputer_30 = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=10, random_state=42),
    max_iter=10,
    random_state=42,
    verbose=1
)
In [ ]:
# Impute missing values
X_20_imputed = imputer_20.fit_transform(X_20_features)
X_30_imputed = imputer_30.fit_transform(X_30_features)

# Convert to DataFrames
X_20_imputed_df = pd.DataFrame(X_20_imputed, columns=X_20_features.columns, index=X_20_features.index)
X_30_imputed_df = pd.DataFrame(X_30_imputed, columns=X_30_features.columns, index=X_30_features.index)

print(f"20-feature DataFrame after imputation: {X_20_imputed_df.shape}")
print(f"30-feature DataFrame after imputation: {X_30_imputed_df.shape}")
[IterativeImputer] Completing matrix with shape (3689, 20)
[IterativeImputer] Change: 1365.8289471095313, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 84.066111851, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 46.85257698000026, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 90.6564069, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 56.4809425, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 54.82016341700001, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 48.649899100000006, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 86.62866443300004, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 33.11205755000007, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 31.85761967000003, scaled tolerance: 9.319510000000001 
[IterativeImputer] Completing matrix with shape (3689, 30)
[IterativeImputer] Change: 1584.8118238046268, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 1178.94519572, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 1913.3184013999994, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 1139.129011206, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 473.13393140000005, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 597.9553620999999, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 1779.0361205, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 1156.8816909999996, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 567.4880454999999, scaled tolerance: 9.319510000000001 
[IterativeImputer] Change: 825.5374065620001, scaled tolerance: 9.319510000000001 
20-feature DataFrame after imputation: (3689, 20)
30-feature DataFrame after imputation: (3689, 30)

Model Training¶

For the model training, the dataset is split into training and testing dataset. The ratio is 80/20. Subsequently StandardScaler is applied to the split data to avoid any data leakege of the test dataset into the model.

In [ ]:
# Split data using the y_binary by 80/20 split
X_train_20, X_test_20, y_train_20, y_test_20 = train_test_split(
    X_20_imputed_df, y_binary, test_size=0.2, random_state=42
)

X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(
    X_30_imputed_df, y_binary, test_size=0.2, random_state=42
)

print(f"20-feature training set: {X_train_20.shape}")
print(f"20-feature test set: {X_test_20.shape}")
print(f"30-feature training set: {X_train_30.shape}")
print(f"30-feature test set: {X_test_30.shape}")
20-feature training set: (2951, 20)
20-feature test set: (738, 20)
30-feature training set: (2951, 30)
30-feature test set: (738, 30)
In [ ]:
# Define a standard scaler
scaler_20 = StandardScaler()
scaler_30 = StandardScaler()

# Fit scalers on training data only
X_train_20_scaled = scaler_20.fit_transform(X_train_20)
X_test_20_scaled = scaler_20.transform(X_test_20)

X_train_30_scaled = scaler_30.fit_transform(X_train_30)
X_test_30_scaled = scaler_30.transform(X_test_30)

# Convert back to DataFrames
X_train_20_scaled_df = pd.DataFrame(X_train_20_scaled, columns=X_train_20.columns, index=X_train_20.index)
X_test_20_scaled_df = pd.DataFrame(X_test_20_scaled, columns=X_test_20.columns, index=X_test_20.index)

X_train_30_scaled_df = pd.DataFrame(X_train_30_scaled, columns=X_train_30.columns, index=X_train_30.index)
X_test_30_scaled_df = pd.DataFrame(X_test_30_scaled, columns=X_test_30.columns, index=X_test_30.index)
In [ ]:
print(f"20-feature training set (scaled): {X_train_20_scaled_df.shape}")
print(f"20-feature test set (scaled): {X_test_20_scaled_df.shape}")
20-feature training set (scaled): (2951, 20)
20-feature test set (scaled): (738, 20)

Now the ensemble for the datafrane are created. Note that the maximum iteration is setto 500, after experimenting it was noted using 1000 provides negligible improvement to the model accurancy whilest increasing the training time.

In [ ]:
ensemble_20 = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(random_state=42, max_iter=2000)),
    ('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')


ensemble_30 = VotingClassifier([
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('xgb', XGBClassifier(random_state=42)),
    ('svm', SVC(probability=True, random_state=42)),
    ('lr', LogisticRegression(random_state=42, max_iter=2000)),
    ('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')
In [ ]:
print("NaN values in y_binary:", y_binary.isna().sum())
print("Total y_binary values:", len(y_binary))
print("Unique y_binary values:", y_binary.unique())
NaN values in y_binary: 0
Total y_binary values: 3689
Unique y_binary values: [0 1]
In [ ]:
# Train ensemble
ensemble_20.fit(X_train_20_scaled_df, y_train_20)

# Evaluation
y_pred_20 = ensemble_20.predict(X_test_20_scaled_df)

Note when using a multi-classifier

y_binary = k2_data_binary['disposition'].map({
    'CANDIDATE': 0,
    'FALSE POSITIVE': 1,
    'CONFIRMED': 2
})

the following metrics were obtained:


Ensemble_20 Accuracy: 0.949
              precision    recall  f1-score   support

           0       0.92      0.93      0.92       268
           1       0.65      0.53      0.59        45
           2       0.99      1.00      0.99       484

    accuracy                           0.95       797
   macro avg       0.85      0.82      0.83       797
weighted avg       0.95      0.95      0.95       797

Therefore it made sense to focus on a binary classfier for higher accuracy.

In [ ]:
print(f"Ensemble_20 Accuracy: {accuracy_score(y_test_20, y_pred_20):.3f}")
print(classification_report(y_test_20, y_pred_20))
Ensemble_20 Accuracy: 0.992
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       265
           1       0.99      1.00      0.99       473

    accuracy                           0.99       738
   macro avg       0.99      0.99      0.99       738
weighted avg       0.99      0.99      0.99       738

In [ ]:
# Compare with 20-feature model
y_pred_proba_20 = ensemble_20.predict_proba(X_test_20_scaled)
auc_20 = roc_auc_score(y_test_20, y_pred_proba_20[:, 1])
print(f"20-Feature Model AUC Score: {auc_20:.3f}")
20-Feature Model AUC Score: 0.999
In [ ]:
# Saving 20-feature model
model_package_20 = {
    'model': ensemble_20,
    'imputer': imputer_20,
    'scaler': scaler_20,
    'features': top_20_features,
    'accuracy': 0.992,
    'auc_score': 0.999,
    'mission': 'kepler',
    'classes': ['CANDIDATE', 'CONFIRMED'],
    'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
    'n_features': 20,
    'performance': {
        'candidate_precision': 1.00,
        'candidate_recall': 0.98,
        'confirmed_precision': 0.99,
        'confirmed_recall': 1.00
    }
}
In [ ]:
joblib.dump(model_package_20, 'k2_20_features_model.pkl')
Out[ ]:
['k2_20_features_model.pkl']

Train 30-feature model¶

In [ ]:
# Train ensemble
ensemble_30.fit(X_train_30_scaled_df, y_train_30)

# Evaluation
y_pred_30 = ensemble_30.predict(X_test_30_scaled_df)
In [ ]:
print(f"Ensemble_30 Accuracy: {accuracy_score(y_test_30, y_pred_30):.3f}")
print(classification_report(y_test_30, y_pred_30))
Ensemble_30 Accuracy: 0.993
              precision    recall  f1-score   support

           0       1.00      0.98      0.99       265
           1       0.99      1.00      0.99       473

    accuracy                           0.99       738
   macro avg       0.99      0.99      0.99       738
weighted avg       0.99      0.99      0.99       738

In [ ]:
# Get prediction probabilities
y_pred_proba_30 = ensemble_30.predict_proba(X_test_30_scaled)

# Calculate AUC score
auc_30 = roc_auc_score(y_test_30, y_pred_proba_30[:, 1])
print(f"30-Feature Model AUC Score: {auc_30:.3f}")
30-Feature Model AUC Score: 0.999

Note that using 30 features improve the ensamble accuracy across all metrics. Now both models are saved using .pkl

In [ ]:
# Save the 30-feature model
model_package_30 = {
    'model': ensemble_30,
    'imputer': imputer_30,
    'scaler': scaler_30,
    'features': top_30_features,
    'accuracy': 0.993,
    'auc_score': 0.999,
    'mission': 'kepler',
    'classes': ['CANDIDATE', 'CONFIRMED'],
    'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
    'n_features': 30,
    'performance': {
        'candidate_precision': 1.00,
        'candidate_recall': 0.98,
        'confirmed_precision': 0.99,
        'confirmed_recall': 1.00
    }
}

joblib.dump(model_package_30, 'k2_final_30_features.pkl')
print("Final Kepler model saved with 30 features")
Final Kepler model saved with 30 features

Results & Discussion¶

The Hybrid ML Ensemble Architecture provides improved performance for the 30 top features comapred to Top-20 features model.

Both ensamble achieved high accuracy ~ 0.99.

References¶

Saha, R 2021. Comparing Classification Models on Kepler Data. arXiv preprint arXiv:2101.01904 [astro-ph.EP], viewed 5/10/2025, https://arxiv.org/abs/2101.01904.

Luz, T. S. F., Braga, R. A. S., & Ribeiro, E. R. (2024). Assessment of Ensemble-Based Machine Learning Algorithms for Exoplanet Identification. Electronics, 13(19), 3950. https://doi.org/10.3390/electronics13193950