Exoplanet Detection using Hybrid Ensemble ML Architecture (Part II)¶
K2 Object of Interest¶
In this notebook we focus on the K2 mission data source from the NASA Exoplanet Archive, which can be found here.
We explore the open-source data collected by the Kepler spacecraft during its K2 mission, which detects exoplanets via the transit method, and build a suitable ML model for exoplanet identification.
The historical data cover the period 2014-2018. This work builds upon previous studies that you can find in the references section.
# Adding imports
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from xgboost import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import urllib.parse
import io
import joblib
import warnings
# suppress all warnings
warnings.filterwarnings('ignore')
Similarly to the Kepler notebook, the data are programmatically loaded using the Exoplanet Archive TAP service.
import pandas as pd
import requests
import urllib.parse
# Define the query for the cumulative table
query = "SELECT * FROM k2pandc"
# Encode the query for the URL
encoded_query = urllib.parse.quote(query)
# TAP URL
tap_url = f"https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query={encoded_query}&format=csv"
print("Downloading K2 Planets and Candidates data...")
print(f"URL: {tap_url}")
# Download the data
response = requests.get(tap_url)
# Check if successful
if response.status_code == 200:
    print("Successful download...")
else:
    print(f"Encountered error while downloading: {response.status_code}")
Downloading K2 Planets and Candidates data... URL: https://exoplanetarchive.ipac.caltech.edu/TAP/sync?query=SELECT%20%2A%20FROM%20k2pandc&format=csv Successful download...
import io
# Convert response to DataFrame
k2_data = pd.read_csv(io.StringIO(response.text))
print(f"Dataset shape: {k2_data.shape}")
print(f"Number of columns: {len(k2_data.columns)}")
print(f"First few column names: {list(k2_data.columns[:10])}")
Dataset shape: (4004, 361) Number of columns: 361 First few column names: ['pl_name', 'pl_letter', 'k2_name', 'epic_hostname', 'epic_candname', 'hostname', 'hd_name', 'hip_name', 'tic_id', 'gaia_id']
# Check the data types and basic info
print("\n=== Dataset Info ===")
print(k2_data.info())
print("\n=== First 5 rows ===")
print(k2_data.head())
print("\n=== Disposition counts ===")
print(k2_data['disposition'].value_counts())
=== Dataset Info ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4004 entries, 0 to 4003
Columns: 361 entries, pl_name to sy_kmagerr2
dtypes: float64(241), int64(26), object(94)
memory usage: 11.0+ MB
None
=== First 5 rows ===
pl_name pl_letter k2_name epic_hostname epic_candname \
0 EPIC 220192485.01 NaN NaN EPIC 220192485 EPIC 220192485.01
1 K2-151 b b K2-151 b EPIC 220621087 EPIC 220621087.01
2 K2-369 b b K2-369 b EPIC 206146957 EPIC 206146957.01
3 K2-232 b b K2-232 b EPIC 247098361 NaN
4 EPIC 211946007.01 NaN NaN EPIC 211946007 EPIC 211946007.01
hostname hd_name hip_name tic_id \
0 EPIC 220192485 NaN NaN TIC 248387177
1 K2-151 NaN NaN TIC 376936928
2 K2-369 NaN NaN TIC 49984349
3 K2-232 HD 286123 NaN TIC 68577662
4 EPIC 211946007 NaN NaN TIC 184892124
gaia_id ... sy_jmagerr1 sy_jmagerr2 \
0 Gaia DR2 2534555412207560960 ... 0.030 -0.030
1 Gaia DR2 2579620343673729408 ... 0.027 -0.027
2 Gaia DR2 2616766622463761024 ... 0.027 -0.027
3 Gaia DR2 3406687485600728192 ... 0.030 -0.030
4 Gaia DR2 661229701489213824 ... 0.032 -0.032
sy_jmagstr sy_hmag sy_hmagerr1 sy_hmagerr2 sy_hmagstr \
0 10.463±0.030 10.084 0.025 -0.025 10.084±0.025
1 10.930±0.027 10.335 0.022 -0.022 10.335±0.022
2 10.357±0.027 9.980 0.021 -0.021 9.9800±0.0210
3 8.739±0.030 8.480 0.018 -0.018 8.480±0.018
4 14.348±0.032 13.769 0.037 -0.037 13.769±0.037
sy_kmag sy_kmagerr1 sy_kmagerr2
0 9.947 0.019 -0.019
1 10.117 0.021 -0.021
2 9.934 0.019 -0.019
3 8.434 0.017 -0.017
4 13.499 0.043 -0.043
[5 rows x 361 columns]
=== Disposition counts ===
disposition
CONFIRMED 2315
CANDIDATE 1374
FALSE POSITIVE 293
REFUTED 22
Name: count, dtype: int64
k2_data.tail(5)
| pl_name | pl_letter | k2_name | epic_hostname | epic_candname | hostname | hd_name | hip_name | tic_id | gaia_id | ... | sy_jmagerr1 | sy_jmagerr2 | sy_jmagstr | sy_hmag | sy_hmagerr1 | sy_hmagerr2 | sy_hmagstr | sy_kmag | sy_kmagerr1 | sy_kmagerr2 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3999 | EPIC 211796070.01 | NaN | NaN | EPIC 211796070 | EPIC 211796070.01 | EPIC 211796070 | NaN | NaN | TIC 20497205 | Gaia DR2 658897706044705280 | ... | 0.021 | -0.021 | 12.163±0.021 | 11.574 | 0.024 | -0.024 | 11.574±0.024 | 11.466 | 0.023 | -0.023 |
| 4000 | EPIC 212044495.01 | NaN | NaN | EPIC 212044495 | EPIC 212044495.01 | EPIC 212044495 | NaN | NaN | TIC 386908279 | Gaia DR2 637093252297660416 | ... | 0.024 | -0.024 | 11.757±0.024 | 11.426 | 0.032 | -0.032 | 11.426±0.032 | 11.345 | 0.022 | -0.022 |
| 4001 | EPIC 220542353.01 | NaN | NaN | EPIC 220542353 | EPIC 220542353.01 | HIP 4601 | HD 5731 | HIP 4601 | TIC 266074740 | Gaia DR2 2577606244530236416 | ... | 0.020 | -0.020 | 7.842±0.020 | 7.616 | 0.040 | -0.040 | 7.616±0.040 | 7.560 | 0.026 | -0.026 |
| 4002 | EPIC 211965883.01 | NaN | NaN | EPIC 211965883 | EPIC 211965883.01 | EPIC 211965883 | NaN | NaN | TIC 203287539 | Gaia DR2 636574699422681088 | ... | 0.022 | -0.022 | 12.184±0.022 | 11.565 | 0.022 | -0.022 | 11.565±0.022 | 11.361 | 0.018 | -0.018 |
| 4003 | EPIC 210852232.01 | NaN | NaN | EPIC 210852232 | EPIC 210852232.01 | EPIC 210852232 | NaN | NaN | TIC 14080244 | Gaia DR2 51555211171046016 | ... | 0.024 | -0.024 | 11.807±0.024 | 11.373 | 0.027 | -0.027 | 11.373±0.027 | 11.248 | 0.022 | -0.022 |
5 rows × 361 columns
# Create the scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(k2_data['pl_trandep'], k2_data['pl_trandur'],
alpha=0.6, s=20, c='#1e40af', edgecolors='none')
# Add labels and title
plt.xlabel('K2 Depth [ppm]', fontsize=12, fontweight='bold')
plt.ylabel('K2 Duration [hrs]', fontsize=12, fontweight='bold')
plt.title('Transit Duration vs Transit Depth', fontsize=14, fontweight='bold')
# Add grid for better readability
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# where are the outliers?
# Box plots to identify outliers
plt.figure(figsize=(15, 5))
k2_data[['pl_orbper', 'pl_trandep', 'pl_trandur']].boxplot()
plt.title('Outlier Detection in Key Features')
plt.xticks(rotation=45)
plt.show()
# Class distribution w.r.t. disposition
plt.figure(figsize=(8, 6))
class_counts = k2_data['disposition'].value_counts()
colors = ['#10b981', '#f59e0b', '#ef4444'] # Green, Orange, Red
bars = plt.bar(class_counts.index, class_counts.values, color=colors, alpha=0.8,width=0.3)
plt.title('Distribution of Exoplanet Classes (Original K2 Dataset)')
plt.xlabel('Class')
plt.ylabel('Count')
plt.xticks(rotation=0)
# Add value labels on bars
for bar, count in zip(bars, class_counts.values):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
             f'{count:,}', ha='center', va='bottom', fontweight='bold')
plt.grid(axis='y', alpha=0.3)
plt.tight_layout()
plt.show()
print("Class distribution:")
for class_name, count in class_counts.items():
    percentage = (count / len(k2_data)) * 100
    print(f"{class_name}: {count:,} ({percentage:.1f}%)")
Class distribution: CONFIRMED: 2,315 (57.8%) CANDIDATE: 1,374 (34.3%) FALSE POSITIVE: 293 (7.3%) REFUTED: 22 (0.5%)
Data Preprocessing¶
Only rows whose disposition is CONFIRMED or CANDIDATE are retained for binary classification; FALSE POSITIVE and REFUTED rows are removed.
# Removing FALSE POSITIVE and REFUTED rows for binary classification
k2_data_binary = k2_data[(k2_data['disposition'] != 'REFUTED') & (k2_data['disposition'] != 'FALSE POSITIVE')].copy()
print(f"Upon removing FALSE POSITIVES: {k2_data_binary.shape[0]} rows")
print(f"Remaining classes: {k2_data_binary['disposition'].value_counts()}")
# Drop rows where 'disposition' is NaN before creating y_binary
k2_data_binary.dropna(subset=['disposition'], inplace=True)
# Create the binary target variable (0 = CANDIDATE, 1 = CONFIRMED)
y_binary = k2_data_binary['disposition'].map({
'CANDIDATE': 0,
'CONFIRMED': 1
})
Upon removing FALSE POSITIVES: 3689 rows Remaining classes: disposition CONFIRMED 2315 CANDIDATE 1374 Name: count, dtype: int64
In the next step, null values are inspected and columns with more than 50% missing values are identified.
missing_percentages = (k2_data_binary.isnull().sum() / len(k2_data_binary)) * 100
high_missing = missing_percentages[missing_percentages > 50]
print(f"Columns with >50% missing data: {len(high_missing)}")
print(high_missing.sort_values(ascending=False))
# Show columns with >80% missing
excessive_missing = missing_percentages[missing_percentages > 80]
print(f"\nColumns with >80% missing data: {len(excessive_missing)}")
print(excessive_missing.sort_values(ascending=False))
Columns with >50% missing data: 179
pl_occdeperr2 100.000000
pl_occdeperr1 100.000000
sy_kepmagerr1 100.000000
sy_icmagerr1 100.000000
sy_icmag 100.000000
...
sy_rmag 56.465167
pl_trandurerr2 51.070751
pl_trandurerr1 50.989428
st_masserr2 50.365953
st_masserr1 50.203307
Length: 179, dtype: float64
Columns with >80% missing data: 126
pl_occdeperr2 100.000000
pl_occdeperr1 100.000000
sy_icmagerr2 100.000000
sy_icmag 100.000000
sy_kepmagerr2 100.000000
...
pl_insolstr 83.247493
pl_eqterr2 81.621036
pl_eqterr1 81.621036
st_denserr1 81.404175
st_denserr2 81.404175
Length: 126, dtype: float64
# Drop the columns identified above due to excessive null values
columns_drop = missing_percentages[missing_percentages > 50].index.tolist()
print(f"Columns to drop: {len(columns_drop)}")
print("Dropping these columns:")
for col in columns_drop:
    print(f" - {col} ({missing_percentages[col]:.1f}% missing)")
# Apply the drops
k2_data_clean = k2_data_binary.drop(columns=columns_drop)
print(f"\nInitial dataset: {k2_data_clean.shape}")
print(f"Cleaned dataset: {k2_data_clean.shape}")
print(f"Removed {k2_data.shape[1] - k2_data_clean.shape[1]} columns")
Columns to drop: 179 Dropping these columns: - hd_name (97.2% missing) - hip_name (96.4% missing) - pl_orblpererr1 (94.2% missing) - pl_orblper (93.3% missing) - pl_orblpererr2 (94.2% missing) - pl_orblperlim (93.3% missing) - pl_orblperstr (93.3% missing) - pl_orbsmax (78.0% missing) - pl_orbsmaxerr1 (78.2% missing) - pl_orbsmaxerr2 (78.2% missing) - pl_orbsmaxlim (78.0% missing) - pl_orbsmaxstr (78.0% missing) - pl_orbincl (73.5% missing) - pl_orbinclerr1 (74.5% missing) - pl_orbinclerr2 (74.5% missing) - pl_orbincllim (73.5% missing) - pl_orbinclstr (73.5% missing) - pl_orbtper (99.2% missing) - pl_orbtpererr1 (99.2% missing) - pl_orbtpererr2 (99.2% missing) - pl_orbtperlim (99.2% missing) - pl_orbtperstr (99.2% missing) - pl_orbeccen (88.4% missing) - pl_orbeccenerr1 (93.9% missing) - pl_orbeccenerr2 (93.9% missing) - pl_orbeccenlim (88.4% missing) - pl_orbeccenstr (88.4% missing) - pl_eqt (77.2% missing) - pl_eqterr1 (81.6% missing) - pl_eqterr2 (81.6% missing) - pl_eqtlim (77.2% missing) - pl_eqtstr (77.2% missing) - pl_occdep (100.0% missing) - pl_occdeperr1 (100.0% missing) - pl_occdeperr2 (100.0% missing) - pl_occdeplim (100.0% missing) - pl_occdepstr (100.0% missing) - pl_insol (83.2% missing) - pl_insolerr1 (87.9% missing) - pl_insolerr2 (87.9% missing) - pl_insollim (83.2% missing) - pl_insolstr (83.2% missing) - pl_dens (90.3% missing) - pl_denserr1 (90.9% missing) - pl_denserr2 (90.9% missing) - pl_denslim (90.3% missing) - pl_densstr (90.3% missing) - pl_trandeperr1 (65.4% missing) - pl_trandeperr2 (65.4% missing) - pl_trandurerr1 (51.0% missing) - pl_trandurerr2 (51.1% missing) - sy_umag (56.5% missing) - sy_umagerr1 (56.5% missing) - sy_umagerr2 (56.5% missing) - sy_umagstr (56.5% missing) - sy_rmag (56.5% missing) - sy_rmagerr1 (56.5% missing) - sy_rmagerr2 (56.5% missing) - sy_rmagstr (56.5% missing) - sy_imag (56.5% missing) - sy_imagerr1 (56.5% missing) - sy_imagerr2 (56.5% missing) - sy_imagstr (56.5% missing) - sy_zmag (56.5% missing) - sy_zmagerr1 (56.5% missing) - sy_zmagerr2 (56.5% missing) - sy_zmagstr (56.5% missing) - sy_w4magerr1 (84.6% missing) - sy_w4magerr2 (84.6% missing) - sy_gmag (56.5% missing) - sy_gmagerr1 (56.5% missing) - sy_gmagerr2 (56.5% missing) - sy_gmagstr (56.5% missing) - st_metratio (57.3% missing) - st_spectype (88.2% missing) - sy_kepmagerr1 (100.0% missing) - sy_kepmagerr2 (100.0% missing) - st_rotp (95.3% missing) - st_rotperr1 (95.3% missing) - st_rotperr2 (95.3% missing) - st_rotplim (95.3% missing) - st_rotpstr (95.3% missing) - pl_projobliq (99.2% missing) - pl_projobliqerr1 (99.2% missing) - pl_projobliqerr2 (99.2% missing) - pl_projobliqlim (99.2% missing) - pl_projobliqstr (99.2% missing) - pl_rvamp (91.8% missing) - pl_rvamperr1 (92.4% missing) - pl_rvamperr2 (92.4% missing) - pl_rvamplim (91.8% missing) - pl_rvampstr (91.8% missing) - pl_trueobliq (99.7% missing) - pl_trueobliqerr1 (99.7% missing) - pl_trueobliqerr2 (99.7% missing) - pl_trueobliqlim (99.7% missing) - pl_trueobliqstr (99.7% missing) - sy_icmag (100.0% missing) - sy_icmagerr1 (100.0% missing) - sy_icmagerr2 (100.0% missing) - sy_icmagstr (100.0% missing) - pl_imppar (60.4% missing) - pl_impparerr1 (70.9% missing) - pl_impparerr2 (70.9% missing) - pl_impparlim (60.4% missing) - pl_impparstr (60.4% missing) - pl_cmassj (99.6% missing) - pl_cmassjerr1 (99.6% missing) - pl_cmassjerr2 (99.6% missing) - pl_cmassjlim (99.6% missing) - pl_cmassjstr (99.6% missing) - pl_cmasse (99.6% missing) - pl_cmasseerr1 (99.6% missing) - pl_cmasseerr2 (99.6% missing) - pl_cmasselim 
(99.6% missing) - pl_cmassestr (99.6% missing) - pl_massj (88.8% missing) - pl_massjerr1 (89.9% missing) - pl_massjerr2 (89.9% missing) - pl_massjlim (88.8% missing) - pl_massjstr (88.8% missing) - pl_masse (88.8% missing) - pl_masseerr1 (89.9% missing) - pl_masseerr2 (89.9% missing) - pl_masselim (88.8% missing) - pl_massestr (88.8% missing) - pl_bmassj (88.2% missing) - pl_bmassjerr1 (89.3% missing) - pl_bmassjerr2 (89.3% missing) - pl_bmassjlim (88.2% missing) - pl_bmassjstr (88.2% missing) - pl_bmasse (88.2% missing) - pl_bmasseerr1 (89.3% missing) - pl_bmasseerr2 (89.3% missing) - pl_bmasselim (88.2% missing) - pl_bmassestr (88.2% missing) - pl_bmassprov (88.2% missing) - pl_msinij (99.2% missing) - pl_msinijerr1 (99.2% missing) - pl_msinijerr2 (99.2% missing) - pl_msinijlim (99.2% missing) - pl_msinijstr (99.2% missing) - pl_msinie (99.2% missing) - pl_msinieerr1 (99.2% missing) - pl_msinieerr2 (99.2% missing) - pl_msinielim (99.2% missing) - pl_msiniestr (99.2% missing) - st_met (57.3% missing) - st_meterr1 (58.0% missing) - st_meterr2 (58.0% missing) - st_metlim (57.3% missing) - st_metstr (57.3% missing) - st_radv (95.1% missing) - st_radverr1 (95.1% missing) - st_radverr2 (95.3% missing) - st_radvlim (95.1% missing) - st_radvstr (95.1% missing) - st_vsin (89.4% missing) - st_vsinerr1 (91.2% missing) - st_vsinerr2 (91.4% missing) - st_vsinlim (89.4% missing) - st_vsinstr (89.4% missing) - st_lum (83.8% missing) - st_lumerr1 (86.0% missing) - st_lumerr2 (86.0% missing) - st_lumlim (83.8% missing) - st_lumstr (83.8% missing) - st_age (91.4% missing) - st_ageerr1 (92.3% missing) - st_ageerr2 (92.3% missing) - st_agelim (91.4% missing) - st_agestr (91.4% missing) - st_masserr1 (50.2% missing) - st_masserr2 (50.4% missing) - st_dens (79.3% missing) - st_denserr1 (81.4% missing) - st_denserr2 (81.4% missing) - st_denslim (79.3% missing) - st_densstr (79.3% missing) Initial dataset: (3689, 182) Cleaned dataset: (3689, 182) Removed 179 columns
# Check for NaN values in the disposition column
nan_count = k2_data_clean['disposition'].isna().sum()
print(f"Number of NaN values in disposition column: {nan_count}")
Number of NaN values in disposition column: 0
empty_strings = (k2_data_clean['disposition'] == '').sum()
print(f"Empty strings: {empty_strings}")
Empty strings: 0
Feature Engineering & Selection¶
Of the initial 361 columns, 179 columns were removed. To reduce dimensionality further, Pearson correlation is used.
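As a reminder, the Pearson correlation between two features $X$ and $Y$ is $r_{XY} = \operatorname{cov}(X, Y) / (\sigma_X \sigma_Y)$, which lies in $[-1, 1]$; feature pairs with $|r_{XY}|$ close to 1 carry largely redundant information and are candidates for removal.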
# plot correlation for numeric columns
numeric_columns = k2_data_clean.select_dtypes(include=[np.number]).columns
correlation_matrix = k2_data_clean[numeric_columns].corr()
print(f"Correlation matrix shape: {correlation_matrix.shape}")
Correlation matrix shape: (126, 126)
# Create correlation heatmap
plt.figure(figsize=(8, 6))
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool)) # Mask upper triangle
sns.heatmap(correlation_matrix,
mask=mask,
annot=False,
cmap='RdBu_r',
center=0,
square=True,
cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
Given the previous graph, dimensionality can be reduced further: first by removing highly correlated features, and second by using the target variable disposition, extracting the top 20 or 30 features most correlated with it and dropping everything else.
This keeps only the information necessary for classification.
# Remove highly correlated features
def remove_highly_correlated_features(df, threshold=0.8):
    # Compute the correlation matrix over numeric columns
    corr_matrix = df.select_dtypes(include=[np.number]).corr()
    # Collect highly correlated column pairs
    high_corr_pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):
            if abs(corr_matrix.iloc[i, j]) > threshold:
                high_corr_pairs.append((corr_matrix.columns[i], corr_matrix.columns[j]))
    # From each pair, remove the feature with the larger number of missing values
    features_to_remove = set()
    for feat1, feat2 in high_corr_pairs:
        if df[feat1].isnull().sum() <= df[feat2].isnull().sum():
            features_to_remove.add(feat2)
        else:
            features_to_remove.add(feat1)
    return df.drop(columns=list(features_to_remove)), list(features_to_remove)
k2_data_final, removed_features = remove_highly_correlated_features(k2_data_clean, threshold=0.8)
print(f"Initial features: {k2_data_clean.shape[1]}")
print(f"After removal correlated features: {k2_data_final.shape[1]}")
print(f"Removed {len(removed_features)} features:")
for feat in removed_features:
    print(f" - {feat}")
Initial features: 182 After removal correlated features: 133 Removed 49 features: - pl_radeerr1 - elon - sy_w3magerr2 - sy_pmraerr2 - sy_w3magerr1 - sy_gaiamag - sy_jmagerr2 - sy_pmraerr1 - sy_w1magerr2 - sy_tmagerr2 - pl_ratrorerr2 - pl_tranmiderr2 - pl_radj - sy_hmagerr2 - y - pl_orbpererr1 - sy_gaiamagerr2 - sy_kmag - z - sy_vmagerr2 - sy_bmag - sy_plxerr2 - st_tefferr2 - sy_jmag - sy_vmag - pl_ratdorlim - pl_radelim - sy_w1mag - sy_disterr2 - sy_w2mag - sy_disterr1 - pl_radjerr1 - x - sy_w3mag - sy_hmag - sy_w2magerr2 - pl_radjerr2 - sy_pmerr2 - sy_bmagerr2 - pl_rade - sy_kmagerr1 - sy_pmdecerr2 - sy_kepmagstr - pl_orbpererr2 - sy_pmdecerr1 - sy_kmagerr2 - sy_kepmag - st_logg - pl_radeerr2
In the following step, the disposition target is used to further reduce the dataframe dimensionality.
target = 'disposition'
if target in k2_data_clean.columns:
    # Calculate correlation with the dummy-encoded 'disposition' target
    target_corr = k2_data_clean.select_dtypes(include=[np.number]).corrwith(
        pd.get_dummies(k2_data_clean[target]).iloc[:, 0]
    ).abs().sort_values(ascending=False)
    print("Top features correlated with target 'disposition':")
    print(target_corr.head(10))
Top features correlated with target 'disposition': sy_pnum 0.655884 rv_flag 0.463046 default_flag 0.450125 pl_nnotes 0.363169 sy_hmag 0.307843 sy_kmag 0.307203 pl_ratrorerr2 0.304416 sy_jmag 0.302544 sy_tmag 0.292194 sy_w1mag 0.288758 dtype: float64
# Compute correlation with the target (disposition)
target_corr = k2_data_final.select_dtypes(include=[np.number]).corrwith(
pd.get_dummies(k2_data_final['disposition']).iloc[:, 0]
).abs().sort_values(ascending=False)
# Select top features without the target
top_20_features = target_corr.head(20).index.tolist()
top_30_features = target_corr.head(30).index.tolist()
X_20_features = k2_data_final[top_20_features]
X_30_features = k2_data_final[top_30_features]
print(f"X_20 features shape: {X_20_features.shape}")
print(f"X_30 features shape: {X_30_features.shape}")
print(f"y_binary shape: {y_binary.shape}")
print(f"Target: disposition")
X_20 features shape: (3689, 20) X_30 features shape: (3689, 30) y_binary shape: (3689,) Target: disposition
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
# Calculate correlation matrix for top 20 features
correlation_matrix = k2_data_clean[top_20_features].corr()
# Create the heatmap
plt.figure(figsize=(10, 10))
sns.heatmap(correlation_matrix,
annot=True,
cmap='coolwarm',
center=0,
square=True,
fmt='.2f',
cbar_kws={'shrink': 0.8})
# Rotate the labels
plt.xticks(rotation=45, ha='right')
plt.yticks(rotation=0)
plt.title('K2 Top 20 Features Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('k2_correlation_matrix.png', dpi=300, bbox_inches='tight')
plt.show()
# Differences with 30-feature model
additional_features = set(top_30_features) - set(top_20_features)
print(f"\nAdditional features in 30-feature model:")
for i, feature in enumerate(additional_features, 1):
    importance = target_corr[feature]
    print(f" {i:2d}. {feature:25s} {importance:.3f}")
Additional features in 30-feature model: 1. pl_ntranspec 0.108 2. sy_pmra 0.113 3. st_raderr2 0.120 4. sy_hmagerr1 0.091 5. sy_pmdec 0.091 6. st_tefferr1 0.115 7. disc_year 0.094 8. st_mass 0.100 9. st_nphot 0.121 10. st_loggerr2 0.093
# Plot the top-20 feature correlations as a horizontal bar chart
fig, ax = plt.subplots(figsize=(8, 6))
top_20_features_corr = target_corr.head(20)
# Create bars
bars = ax.barh(range(len(top_20_features_corr)), top_20_features_corr.values,
color='#1e40af', alpha=0.8, height=0.8)
# Styling
ax.set_yticks(range(len(top_20_features_corr)))
ax.set_yticklabels(top_20_features_corr.index, fontsize=10)
ax.set_xlabel('Correlation Coefficient', fontsize=12, fontweight='bold')
ax.set_title('K2 Feature Scores for Top 20 Features', fontsize=14, fontweight='bold')
# create a grid
ax.grid(axis='x', alpha=0.3, linestyle='-', linewidth=0.5)
# add statistics
n_features = len(top_20_features_corr)
max_corr = top_20_features_corr.max()
ax.text(0.02, 0.98, f'Top {n_features} Features\nMax Correlation: {max_corr:.3f}',
transform=ax.transAxes, fontsize=10,
verticalalignment='top', bbox=dict(boxstyle='round', facecolor='white', alpha=0.8))
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Apply Scikit-Learn Iterative Imputer¶
Despite the reduced dimensionality, null entries still exist in the dataset. Dropping the columns containing null values can cause loss of information, affecting the accuracy of the model. To address this problem, the scikit-learn iterative imputer is used, as per [1].
The iterative imputer estimates p(x|y), where x is a feature with missing values and y are the features containing observed values, by iteratively regressing each incomplete feature on the others.
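As a minimal illustrative sketch (a toy matrix, not the K2 data), the snippet below shows the imputer regressing the incomplete column on the other columns and filling the gap with the resulting estimate:
# Illustrative sketch only: IterativeImputer on a tiny toy matrix.
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
toy = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [3.0, np.nan, 9.0],   # the missing entry to be estimated from the other columns
    [4.0, 8.0, 12.0],
])
toy_imputer = IterativeImputer(max_iter=10, random_state=42)
print(toy_imputer.fit_transform(toy))  # the NaN is replaced by a value close to 6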
# check if X_features datasets are pd
print(type(X_20_features))
print(type(X_30_features))
<class 'pandas.core.frame.DataFrame'> <class 'pandas.core.frame.DataFrame'>
# Apply an IterativeImputer with a RandomForestRegressor estimator to the feature datasets only
imputer_20 = IterativeImputer(
estimator=RandomForestRegressor(n_estimators=10, random_state=42),
max_iter=10,
random_state=42,
verbose=1
)
imputer_30 = IterativeImputer(
estimator=RandomForestRegressor(n_estimators=10, random_state=42),
max_iter=10,
random_state=42,
verbose=1
)
# Impute missing values
X_20_imputed = imputer_20.fit_transform(X_20_features)
X_30_imputed = imputer_30.fit_transform(X_30_features)
# Convert to DataFrames
X_20_imputed_df = pd.DataFrame(X_20_imputed, columns=X_20_features.columns, index=X_20_features.index)
X_30_imputed_df = pd.DataFrame(X_30_imputed, columns=X_30_features.columns, index=X_30_features.index)
print(f"20-feature DataFrame after imputation: {X_20_imputed_df.shape}")
print(f"30-feature DataFrame after imputation: {X_30_imputed_df.shape}")
[IterativeImputer] Completing matrix with shape (3689, 20) [IterativeImputer] Change: 1365.8289471095313, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 84.066111851, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 46.85257698000026, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 90.6564069, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 56.4809425, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 54.82016341700001, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 48.649899100000006, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 86.62866443300004, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 33.11205755000007, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 31.85761967000003, scaled tolerance: 9.319510000000001 [IterativeImputer] Completing matrix with shape (3689, 30) [IterativeImputer] Change: 1584.8118238046268, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 1178.94519572, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 1913.3184013999994, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 1139.129011206, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 473.13393140000005, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 597.9553620999999, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 1779.0361205, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 1156.8816909999996, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 567.4880454999999, scaled tolerance: 9.319510000000001 [IterativeImputer] Change: 825.5374065620001, scaled tolerance: 9.319510000000001 20-feature DataFrame after imputation: (3689, 20) 30-feature DataFrame after imputation: (3689, 30)
Model Training¶
For model training, the dataset is split into training and test sets with an 80/20 ratio. Subsequently, a StandardScaler is fitted on the training data only and applied to both splits, to avoid any data leakage from the test set into the model.
# Split data using the y_binary by 80/20 split
X_train_20, X_test_20, y_train_20, y_test_20 = train_test_split(
X_20_imputed_df, y_binary, test_size=0.2, random_state=42
)
X_train_30, X_test_30, y_train_30, y_test_30 = train_test_split(
X_30_imputed_df, y_binary, test_size=0.2, random_state=42
)
print(f"20-feature training set: {X_train_20.shape}")
print(f"20-feature test set: {X_test_20.shape}")
print(f"30-feature training set: {X_train_30.shape}")
print(f"30-feature test set: {X_test_30.shape}")
20-feature training set: (2951, 20) 20-feature test set: (738, 20) 30-feature training set: (2951, 30) 30-feature test set: (738, 30)
# Define a standard scaler
scaler_20 = StandardScaler()
scaler_30 = StandardScaler()
# Fit scalers on training data only
X_train_20_scaled = scaler_20.fit_transform(X_train_20)
X_test_20_scaled = scaler_20.transform(X_test_20)
X_train_30_scaled = scaler_30.fit_transform(X_train_30)
X_test_30_scaled = scaler_30.transform(X_test_30)
# Convert back to DataFrames
X_train_20_scaled_df = pd.DataFrame(X_train_20_scaled, columns=X_train_20.columns, index=X_train_20.index)
X_test_20_scaled_df = pd.DataFrame(X_test_20_scaled, columns=X_test_20.columns, index=X_test_20.index)
X_train_30_scaled_df = pd.DataFrame(X_train_30_scaled, columns=X_train_30.columns, index=X_train_30.index)
X_test_30_scaled_df = pd.DataFrame(X_test_30_scaled, columns=X_test_30.columns, index=X_test_30.index)
print(f"20-feature training set (scaled): {X_train_20_scaled_df.shape}")
print(f"20-feature test set (scaled): {X_test_20_scaled_df.shape}")
20-feature training set (scaled): (2951, 20) 20-feature test set (scaled): (738, 20)
Now the ensembles for the two feature sets are created. Note that the MLP's maximum number of iterations is set to 500; after experimenting it was noted that using 1000 provides negligible improvement to model accuracy whilst increasing the training time.
ensemble_20 = VotingClassifier([
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('xgb', XGBClassifier(random_state=42)),
('svm', SVC(probability=True, random_state=42)),
('lr', LogisticRegression(random_state=42, max_iter=2000)),
('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')
ensemble_30 = VotingClassifier([
('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
('xgb', XGBClassifier(random_state=42)),
('svm', SVC(probability=True, random_state=42)),
('lr', LogisticRegression(random_state=42, max_iter=2000)),
('deep_nn', MLPClassifier(hidden_layer_sizes=(200, 100, 50, 25, 10), max_iter=500, random_state=42))
], voting='soft')
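For clarity, voting='soft' averages the class-probability estimates of the base learners and predicts the class with the highest mean probability. A minimal sketch with hypothetical probabilities from three of the base learners:
# Sketch of what voting='soft' does: average per-class probabilities and take the argmax.
import numpy as np
proba_rf  = np.array([0.30, 0.70])   # hypothetical P(CANDIDATE), P(CONFIRMED) from the random forest
proba_xgb = np.array([0.40, 0.60])   # ... from XGBoost
proba_svm = np.array([0.45, 0.55])   # ... from the SVM
mean_proba = np.mean([proba_rf, proba_xgb, proba_svm], axis=0)
print(mean_proba, mean_proba.argmax())  # approximately [0.383, 0.617] -> class 1 (CONFIRMED)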
print("NaN values in y_binary:", y_binary.isna().sum())
print("Total y_binary values:", len(y_binary))
print("Unique y_binary values:", y_binary.unique())
NaN values in y_binary: 0 Total y_binary values: 3689 Unique y_binary values: [0 1]
# Train ensemble
ensemble_20.fit(X_train_20_scaled_df, y_train_20)
# Evaluation
y_pred_20 = ensemble_20.predict(X_test_20_scaled_df)
Note that when using a multi-class classifier with the mapping
y_binary = k2_data_binary['disposition'].map({
'CANDIDATE': 0,
'FALSE POSITIVE': 1,
'CONFIRMED': 2
})
the following metrics were obtained:
Ensemble_20 Accuracy: 0.949
precision recall f1-score support
0 0.92 0.93 0.92 268
1 0.65 0.53 0.59 45
2 0.99 1.00 0.99 484
accuracy 0.95 797
macro avg 0.85 0.82 0.83 797
weighted avg 0.95 0.95 0.95 797
Therefore it made sense to focus on a binary classifier for higher accuracy.
print(f"Ensemble_20 Accuracy: {accuracy_score(y_test_20, y_pred_20):.3f}")
print(classification_report(y_test_20, y_pred_20))
Ensemble_20 Accuracy: 0.992
precision recall f1-score support
0 1.00 0.98 0.99 265
1 0.99 1.00 0.99 473
accuracy 0.99 738
macro avg 0.99 0.99 0.99 738
weighted avg 0.99 0.99 0.99 738
# Compute the AUC score for the 20-feature model
y_pred_proba_20 = ensemble_20.predict_proba(X_test_20_scaled)
auc_20 = roc_auc_score(y_test_20, y_pred_proba_20[:, 1])
print(f"20-Feature Model AUC Score: {auc_20:.3f}")
20-Feature Model AUC Score: 0.999
# Saving 20-feature model
model_package_20 = {
'model': ensemble_20,
'imputer': imputer_20,
'scaler': scaler_20,
'features': top_20_features,
'accuracy': 0.992,
'auc_score': 0.999,
'mission': 'k2',
'classes': ['CANDIDATE', 'CONFIRMED'],
'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
'n_features': 20,
'performance': {
'candidate_precision': 1.00,
'candidate_recall': 0.98,
'confirmed_precision': 0.99,
'confirmed_recall': 1.00
}
}
joblib.dump(model_package_20, 'k2_20_features_model.pkl')
['k2_20_features_model.pkl']
Train 30-feature model¶
# Train ensemble
ensemble_30.fit(X_train_30_scaled_df, y_train_30)
# Evaluation
y_pred_30 = ensemble_30.predict(X_test_30_scaled_df)
print(f"Ensemble_30 Accuracy: {accuracy_score(y_test_30, y_pred_30):.3f}")
print(classification_report(y_test_30, y_pred_30))
Ensemble_30 Accuracy: 0.993
precision recall f1-score support
0 1.00 0.98 0.99 265
1 0.99 1.00 0.99 473
accuracy 0.99 738
macro avg 0.99 0.99 0.99 738
weighted avg 0.99 0.99 0.99 738
# Get prediction probabilities
y_pred_proba_30 = ensemble_30.predict_proba(X_test_30_scaled)
# Calculate AUC score
auc_30 = roc_auc_score(y_test_30, y_pred_proba_30[:, 1])
print(f"30-Feature Model AUC Score: {auc_30:.3f}")
30-Feature Model AUC Score: 0.999
Note that using 30 features marginally improves the ensemble accuracy (0.993 vs 0.992), while the AUC and per-class metrics remain essentially unchanged. With the step below, both models have been saved as .pkl files.
# Save the 30-feature model
model_package_30 = {
'model': ensemble_30,
'imputer': imputer_30,
'scaler': scaler_30,
'features': top_30_features,
'accuracy': 0.993,
'auc_score': 0.999,
'mission': 'k2',
'classes': ['CANDIDATE', 'CONFIRMED'],
'class_mapping': {'CANDIDATE': 0, 'CONFIRMED': 1},
'n_features': 30,
'performance': {
'candidate_precision': 1.00,
'candidate_recall': 0.98,
'confirmed_precision': 0.99,
'confirmed_recall': 1.00
}
}
joblib.dump(model_package_30, 'k2_final_30_features.pkl')
print("Final Kepler model saved with 30 features")
Final Kepler model saved with 30 features
Results & Discussion¶
The hybrid ML ensemble architecture provides marginally improved performance with the top-30 features compared to the top-20 feature model.
Both ensembles achieved high accuracy (~0.99) and an AUC score of 0.999.
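As a usage sketch (assuming the k2_final_30_features.pkl file produced above is on disk), the saved package can be reloaded and applied to new K2 rows; new_rows and classify_k2_rows below are hypothetical names, and the preprocessing mirrors the training order (feature selection, imputation, scaling):
# Minimal sketch: load the saved 30-feature package and classify new K2 rows.
# `new_rows` is a hypothetical DataFrame with the same raw K2 columns used during training.
import joblib
import pandas as pd
package = joblib.load('k2_final_30_features.pkl')
def classify_k2_rows(new_rows: pd.DataFrame) -> pd.Series:
    X = new_rows[package['features']]                  # select the 30 training features
    X_imputed = package['imputer'].transform(X)        # impute missing values as during training
    X_scaled = package['scaler'].transform(X_imputed)  # apply the training StandardScaler
    preds = package['model'].predict(X_scaled)         # 0 = CANDIDATE, 1 = CONFIRMED
    return pd.Series(preds, index=new_rows.index).map({0: 'CANDIDATE', 1: 'CONFIRMED'})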
References¶
Saha, R. (2021). Comparing Classification Models on Kepler Data. arXiv preprint arXiv:2101.01904 [astro-ph.EP]. Viewed 5/10/2025, https://arxiv.org/abs/2101.01904
Luz, T. S. F., Braga, R. A. S., & Ribeiro, E. R. (2024). Assessment of Ensemble-Based Machine Learning Algorithms for Exoplanet Identification. Electronics, 13(19), 3950. https://doi.org/10.3390/electronics13193950