Python Pandas Matplotlib Scikit-learn Environmental Data Apr – Jun 2025

Ocean Cleaning Analysis

Applied ML models to global ocean monitoring data to identify pH levels and species counts as leading indicators of marine heatwave events — then synthesized findings into decision-ready insights for marine conservation stakeholders.


12+ Global Monitoring Sites
40+ Environmental Features
3 ML Models Compared

01 / The Problem

Marine heatwaves are accelerating — and largely unpredicted.

Marine heatwaves — prolonged periods of anomalously high sea surface temperatures — are increasing in frequency and severity. They bleach coral reefs, disrupt fisheries, and destabilize entire ecosystems. Yet most monitoring programs report conditions after stress events occur rather than flagging leading indicators early enough to act.

The goal: determine whether existing ocean chemistry and biodiversity measurements — pH levels, dissolved oxygen, species count changes — could reliably predict heatwave onset before temperatures spike, giving conservation teams a useful early-warning window.


02 / The Dataset

Global ocean monitoring — 40+ features, multiple sites.

Environmental Indicators

Sea surface temperature, pH level, dissolved oxygen concentration, salinity, chlorophyll-a density, ocean current velocity, atmospheric pressure, solar irradiance.

Biodiversity Metrics

Species count, diversity index, migration patterns — tracked across 12+ global monitoring sites spanning the Pacific, Atlantic, and Indian Oceans over multi-year observation windows.

Temporal alignment across sites with different sampling frequencies required careful resampling and interpolation. Missing data was also non-random: sensors went offline more often during storm events, which are correlated with the stress conditions we were trying to predict, so gaps could not be treated as ignorable noise.
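The resampling step can be sketched roughly as follows. This is a minimal illustration rather than the project's actual pipeline: the helper name `align_site`, the weekly target frequency, the gap limit, and the toy pH series are all assumptions. Because outages were not random, the sketch also keeps a flag for periods that had no readings before filling.

```python
import numpy as np
import pandas as pd


def align_site(readings: pd.DataFrame, freq: str = "W", max_gap: int = 2) -> pd.DataFrame:
    """Resample one site's readings to a shared frequency and fill short gaps.

    Assumes `readings` has a DatetimeIndex and numeric sensor columns.
    """
    resampled = readings.resample(freq).mean()
    # Flag periods with no readings before filling, so outages remain
    # visible as a feature (an assumption, not necessarily the original approach).
    missing = resampled.isna().astype(int).add_suffix("_missing")
    # Time-weighted linear interpolation, filling at most `max_gap` consecutive
    # periods and only between known values (no extrapolation at the edges).
    filled = resampled.interpolate(method="time", limit=max_gap, limit_area="inside")
    return pd.concat([filled, missing], axis=1)


# Hypothetical daily pH series with a one-week sensor outage.
days = pd.date_range("2024-01-01", periods=28, freq="D")
ph = pd.Series(np.linspace(8.10, 8.00, 28), index=days, name="ph")
ph.iloc[7:14] = np.nan
aligned = align_site(ph.to_frame())
print(aligned)
```

Each site would be aligned separately and then combined on the shared weekly index.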


03 / Methodology

01

Exploratory Data Analysis

Visualized temperature anomalies, pH trends, and species count fluctuations across all monitoring sites using Matplotlib and Seaborn. Identified strong site-level variation — Pacific equatorial buoys showed pH decline rates nearly double those of Atlantic sites — and surfaced clear seasonal patterns that needed to be controlled for in modeling.
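One common way to control for seasonality is to subtract a per-site monthly climatology and work with the anomaly. The sketch below assumes a long-format table with `site`, `date`, and `sst` columns; those column names, the site labels, and the synthetic readings are hypothetical, and the original analysis may have handled seasonality differently.

```python
import numpy as np
import pandas as pd


def seasonal_anomaly(df: pd.DataFrame, value_col: str,
                     site_col: str = "site", time_col: str = "date") -> pd.Series:
    """Reading minus the mean for the same site and calendar month."""
    month = df[time_col].dt.month.rename("month")
    climatology = df[value_col].groupby([df[site_col], month]).transform("mean")
    return df[value_col] - climatology


# Two hypothetical sites with the same seasonal cycle around different baselines.
dates = pd.date_range("2022-01-01", periods=24, freq="MS")
seasonal = 2.0 * np.sin(2 * np.pi * dates.month.to_numpy() / 12)
frames = [
    pd.DataFrame({"site": name, "date": dates, "sst": base + seasonal})
    for name, base in [("pacific_eq", 28.0), ("atlantic_n", 18.0)]
]
obs = pd.concat(frames, ignore_index=True)
obs.loc[18, "sst"] += 1.5  # inject a warm spike at the first site in July 2023
obs["sst_anomaly"] = seasonal_anomaly(obs, "sst")
```

With only two years of data the spike pulls its own monthly mean upward, so the anomaly is understated. A longer baseline period reduces that effect.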

02

Feature Engineering & Selection

Created lag features (pH and species count readings 2–8 weeks prior to target events), rolling averages, and rate-of-change derivatives. A Stress Index — combining pH deviation, dissolved oxygen drop, and species count decline into a single composite score — proved more predictive than any individual sensor reading. Correlation analysis removed 14 redundant features before modeling.
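The feature steps above might look something like the sketch below. The exact Stress Index formula is not given here, so `stress_index` is a stand-in: it averages three deficits relative to a baseline, each scaled and clipped to the 0 to 1 range. The function names, baseline values, scaling constants, correlation threshold, and synthetic data are all illustrative assumptions.

```python
import numpy as np
import pandas as pd


def add_temporal_features(df, cols, lags=(2, 4, 8), window=4):
    """Add lags, a rolling mean, and week-over-week change per column.

    Assumes one row per week for a single site, sorted by time.
    """
    out = df.copy()
    for col in cols:
        for lag in lags:
            out[f"{col}_lag{lag}w"] = out[col].shift(lag)
        out[f"{col}_roll{window}w"] = out[col].rolling(window).mean()
        out[f"{col}_delta"] = out[col].diff()
    return out


def stress_index(df, baseline, scale):
    """Hypothetical composite: mean of three clipped, scaled deficits."""
    cols = ("ph", "dissolved_oxygen", "species_count")
    parts = [((baseline[c] - df[c]) / scale[c]).clip(0, 1) for c in cols]
    return sum(parts) / len(parts)


def drop_correlated(df, threshold=0.95):
    """Drop the later column of any pair whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    redundant = [c for c in upper.columns if (upper[c] > threshold).any()]
    return df.drop(columns=redundant)


# Synthetic weekly readings for one site: slow decline plus noise.
rng = np.random.default_rng(0)
weeks = pd.date_range("2024-01-07", periods=30, freq="W")
site = pd.DataFrame(
    {
        "ph": 8.10 - np.linspace(0, 0.15, 30) + rng.normal(0, 0.01, 30),
        "dissolved_oxygen": 7.5 - np.linspace(0, 1.5, 30) + rng.normal(0, 0.1, 30),
        "species_count": 120 - np.linspace(0, 30, 30) + rng.normal(0, 3, 30),
    },
    index=weeks,
)
baseline = {"ph": 8.10, "dissolved_oxygen": 7.5, "species_count": 120.0}  # assumed
scale = {"ph": 0.2, "dissolved_oxygen": 2.0, "species_count": 40.0}  # assumed

features = add_temporal_features(site, ["ph", "species_count"])
features["stress_index"] = stress_index(site, baseline, scale)
reduced = drop_correlated(features.dropna())
```

Lagging by 2, 4, and 8 rows matches the 2 to 8 week range only if the data is weekly, which is why the alignment step has to come first.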

03

Model Training & Evaluation

Trained and compared a Logistic Regression baseline, Random Forest, and Gradient Boosting classifier on labeled heatwave onset events. Used time-series cross-validation to prevent data leakage across temporal folds. Evaluated on recall and lead-time accuracy — a model that correctly flags a heatwave 3 weeks out is far more valuable than one that detects it 3 days out.
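A minimal sketch of this comparison, assuming scikit-learn and rows already sorted by time. `TimeSeriesSplit` always trains on earlier rows and tests on later ones, which is what keeps future information out of the training folds. The helper name `compare_models`, the hyperparameters, and the synthetic data are placeholders, not the project's actual configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def compare_models(X, y, n_splits=4):
    """Mean recall per model across forward-chaining temporal folds."""
    models = {
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
        "gradient_boosting": GradientBoostingClassifier(random_state=0),
    }
    splitter = TimeSeriesSplit(n_splits=n_splits)
    scores = {}
    for name, model in models.items():
        fold_recalls = []
        for train_idx, test_idx in splitter.split(X):
            model.fit(X[train_idx], y[train_idx])
            predictions = model.predict(X[test_idx])
            fold_recalls.append(recall_score(y[test_idx], predictions, zero_division=0))
        scores[name] = float(np.mean(fold_recalls))
    return scores


# Synthetic stand-in for the engineered features and heatwave onset labels.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 300) > 0.8).astype(int)
scores = compare_models(X, y)
print(scores)
```

Recall is the headline metric here because a missed heatwave costs more than a false alarm. Precision would be tracked alongside it in practice.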

04

Insight Synthesis

Translated model outputs into plain-language findings structured for two audiences: technical conservation scientists (feature importance rankings, confidence intervals) and policy stakeholders (risk threshold maps, recommended monitoring trigger points). Findings were visualized as geographic heatmaps and time-series trend panels.


04 / Key Findings

The ocean signals were there — weeks in advance.

pH drop below 8.0 (lagged 4 weeks): strongest heatwave predictor
Species count decline (>15% over 3 weeks): 78% heatwave co-occurrence
Dissolved oxygen < 6 mg/L: 65% correlation with thermal stress
Composite Stress Index > 0.7: 3-week early-warning window
Pacific equatorial sites vs. Atlantic: 2× faster acidification rate

05 / Results & Impact

From raw sensor data to actionable policy.

The Gradient Boosting model achieved the best balance of precision and recall, with an average lead time of 19 days before confirmed heatwave onset. That is long enough for conservation teams to reposition monitoring equipment, alert local fisheries, and initiate protective measures for sensitive reef zones.
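Lead time can be measured in several ways. The sketch below uses one plausible definition, which is an assumption rather than the project's documented metric: for each confirmed onset, count the days back to the first alert raised within a fixed look-back window. The function name, the 56-day window, and the example dates are hypothetical.

```python
import pandas as pd


def lead_times(alerts: pd.Series, onsets, max_lead_days: int = 56) -> pd.Series:
    """Days from the first alert in the look-back window to each onset.

    `alerts` is a boolean Series on a sorted daily DatetimeIndex.
    Onsets with no alert in the window get NaN (a missed event).
    """
    result = {}
    for onset in pd.to_datetime(onsets):
        window = alerts.loc[onset - pd.Timedelta(days=max_lead_days):onset]
        fired = window[window].index
        result[onset] = (onset - fired[0]).days if len(fired) else float("nan")
    return pd.Series(result, name="lead_days")


# Hypothetical alert history: the model fires from 25 Feb through 10 Mar.
days = pd.date_range("2024-01-01", "2024-03-31", freq="D")
alerts = pd.Series(False, index=days)
alerts.loc["2024-02-25":"2024-03-10"] = True

leads = lead_times(alerts, ["2024-01-20", "2024-03-10"])
mean_lead = leads.mean()            # averages detected events only
detected_share = leads.notna().mean()
```

An unrelated early alert inside the window would inflate the lead time under this definition, so a stricter version might require the alert to persist until onset.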

Average lead time: 19 days before heatwave onset
Model recall: 84% of heatwave events detected
Features reduced: 40 → 18 after feature selection

Policy Recommendations

Prioritize pH and species count sensors in Pacific equatorial zones — these showed the strongest predictive signal and the fastest deterioration rates.

Adopt the Composite Stress Index as a standardized alert threshold: values above 0.7 sustained for 5+ days should trigger conservation review protocols.

Increase sampling at the 4 highest-risk sites identified by the model from weekly to daily intervals.
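The Stress Index trigger in the second recommendation (above 0.7, sustained for 5+ days) could be checked with a run-length rule like the one sketched here. It assumes one index value per day with no gaps. The function name and sample values are illustrative.

```python
import pandas as pd


def sustained_alert(index_values: pd.Series, threshold: float = 0.7,
                    min_days: int = 5) -> pd.Series:
    """True on days where the value has been above threshold for min_days or more in a row."""
    above = index_values > threshold
    # A new run starts whenever the above/below state changes.
    run_id = above.ne(above.shift()).cumsum()
    # Count consecutive days above threshold within each run (0 for runs below).
    run_length = above.astype(int).groupby(run_id).cumsum()
    return run_length >= min_days


# Hypothetical daily Stress Index values for one site.
days = pd.date_range("2025-05-01", periods=10, freq="D")
values = pd.Series([0.5, 0.72, 0.75, 0.71, 0.8, 0.9, 0.6, 0.75, 0.8, 0.65], index=days)
alerts = sustained_alert(values)
```

In this example only the sixth day triggers a review, because it completes the first run of five consecutive days above 0.7.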


Tools & Technologies

Python 3 Pandas NumPy Scikit-learn Matplotlib Seaborn Gradient Boosting Random Forest Logistic Regression Jupyter Notebook


© 2026 Owen Duffy