Exoplanet Habitability

The Planetary Habitability Laboratory (PHL) describes the Habitable Exoplanets Catalog as "a database of potentially habitable worlds discovered by ground and space telescopes in the last decade. The exoplanet data comes from the NASA Exoplanet Archive and includes planets up to 2.5 Earth radii or 10 Earth masses orbiting within the optimistic stellar habitable zone to be as inclusive as possible."

As our capability to gather this type of data improves, we can turn to machine learning techniques to help process and understand the growing volume of exoplanet data. In this project I modelled planet habitability as a classification target, using the other features in the Habitable Exoplanets Catalog as predictors.

Missing values in the dataset were either dropped or imputed using median values and a KNN imputer. Categorical features were encoded using a target encoder or, where appropriate, ordinal mappings. Interestingly, although the dataset is highly imbalanced, resampling the training set did not improve performance for nearly any of the models I used; in fact, it had the opposite effect.
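To make the imputation and ordinal-mapping steps concrete, here is a minimal sketch using scikit-learn. It is not the project's actual code: the toy values, the column meanings, the choice of `n_neighbors=2`, and the `SPECTRAL_ORDER` mapping are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy feature matrix (not catalog data).
# Assumed columns: radius (Earth radii), mass (Earth masses), temperature (K).
X = np.array([
    [1.0, 1.0, 255.0],
    [1.2, np.nan, 270.0],
    [2.1, 6.5, np.nan],
    [1.1, 1.3, 260.0],
    [2.3, 8.0, 400.0],
])

# Median imputation: each missing value is replaced by its column's median.
median_imputer = SimpleImputer(strategy="median")
X_median = median_imputer.fit_transform(X)

# KNN imputation: each missing value is replaced by the mean of that feature
# over the nearest rows, measured on the features both rows have present.
knn_imputer = KNNImputer(n_neighbors=2)
X_knn = knn_imputer.fit_transform(X)

# Ordinal mapping for a categorical feature with a natural order.
# The category names and their ordering here are assumed, not from the project.
SPECTRAL_ORDER = {"M": 0, "K": 1, "G": 2, "F": 3}
star_types = ["G", "M", "K", "M", "F"]
star_type_codes = [SPECTRAL_ORDER[s] for s in star_types]
```

Because KNN imputation relies on distances between rows, features on very different scales (such as radius and temperature above) would usually be scaled first so that no single feature dominates the neighbour search.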

The best-performing models were AdaBoost and XGBoost, both achieving near-perfect precision, recall, and F1 score for each target class. In my opinion, their success can be attributed to the preprocessing, feature selection, and hyperparameter tuning methods used.
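Since the dataset is imbalanced, per-class precision, recall, and F1 say more than overall accuracy. The sketch below shows how these three metrics are computed for each class from true and predicted labels. The function name `per_class_metrics` and the example labels are hypothetical and are not results from the project.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1"}} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class was wrongly claimed
            fn[truth] += 1   # true class was missed

    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Made-up labels for an imbalanced three-class problem (illustration only).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]
report = per_class_metrics(y_true, y_pred)
```

In this made-up example, class 2 has perfect precision but only 0.5 recall, a gap that a single accuracy figure would hide.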

Here is a link to the GitHub repository containing the entire project.