When Data Science Is Not Enough: Deriving Signal from Maritime Observations

We examine the limits of "data science-first" thinking - letting technical skills drive the analysis, and only later adding domain understanding.

By Ian White.

Illegal Fishing

I recently read an article (paywall) in the WSJ about Paul Allen's Vulcan initiative to curb illegal fishing. It's insightful and sheds light on Big Data techniques for addressing societal problems. After thinking about the story, I realized it could serve as a pedagogical tool for synthesizing data science with domain knowledge. To me, the gap between the two is the biggest limitation of what I refer to as 'data science thinking': letting technical skills drive the analysis and only later incorporating domain understanding.

This post reads somewhat like a case note from business school; the idea is to get data scientists, product managers and engineers talking earlier in the process. I've laid it out to provide sufficient context around illegal fishing and how one might develop models to answer the key business question: can illegal fishing be combated through novel approaches?

Next, I reframe the issue by considering how additional data can help narrow uncertainty and offer a fresh perspective on the problem. Finally, I seek to reconcile a science-driven approach with one that incorporates more domain thinking. I suggest the reader start with the article (if you don't have a subscription, try Googling the article title and you might find it).


Why do we care about illegal fishing and poaching? It raises multiple economic and environmental concerns.

Monitoring/policing/enforcing illegal fishing activities is difficult for a variety of reasons:
  • No Registry: There are no unique vessel ownership IDs, which allows reflagging, renaming and other tricks to mask ship identity/activity (oddly, the International Maritime Organization requires persistent IDs for other types of seafaring vessels and is dragging its heels/anchor when it comes to fishing vessels)
  • Size: The oceans are very large and there is no magic technology to track assets
  • Rogue Actors: Non-signatories to illegal fishing regimes may harbor bad actors
  • Compliance: Enforcement activities are underfunded and lack centralization, training, etc.

So, the overarching business issue is how best to use data to stop illegal fishing.

From the article:
Australian government scientists and Vulcan Inc., Mr. Allen's private company, have developed a notification system that alerts authorities when suspected pirate vessels from West Africa arrive at ports on remote Pacific islands and South America.
The system, announced Sunday U.S. time, relies on anti-collision transponders installed on nearly all oceangoing craft as a requirement under maritime law. These devices are detectable by satellite.

A statistical model helps identify vessels whose transponders have been intentionally shut off. Other data identifies fishing boats that are loitering in risk areas, such as near national maritime boundaries.

The article references "anti-collision transponders," meaning the Automatic Identification System (AIS), which maritime traffic uses to monitor and track all passenger ships and most cargo vessels. Then there's the bit about "statistical models," which I suppose means some flavor of machine learning that estimates when a transponder has been turned off and a vessel is engaged in nefarious activity. How the "notification system alerts authorities when suspected pirate vessels...arrive at ports" if AIS is not active is unclear; the system likely predicts movement in some way.
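
To make the "transponder turned off" idea concrete, here is a minimal sketch of the simplest possible rule: flag any silence between consecutive reports that exceeds a threshold. This is not the statistical model the article mentions (which is not described); the function name find_ais_gaps, the 6-hour threshold and the report times are all hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def find_ais_gaps(timestamps, max_gap=timedelta(hours=6)):
    """Return (last_seen, next_seen) pairs where the silence exceeds max_gap.

    The 6-hour default is an arbitrary illustrative threshold, not a
    value taken from the article or from any AIS standard.
    """
    ordered = sorted(timestamps)
    gaps = []
    # Compare each report with the one that follows it.
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_gap:
            gaps.append((earlier, later))
    return gaps


# Hypothetical report times for a single vessel.
pings = [
    datetime(2012, 3, 14, 11, 0),
    datetime(2012, 3, 14, 11, 30),
    datetime(2012, 3, 15, 2, 0),   # first report after 14.5 hours of silence
    datetime(2012, 3, 15, 2, 20),
]

gaps = find_ais_gaps(pings)
for last_seen, next_seen in gaps:
    print(f"Silent from {last_seen} to {next_seen}")
```

A rule like this cannot tell a deliberately dark ship from one that simply sailed out of receiver coverage, which is exactly the kind of domain question the rest of this post tries to raise.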

"Other data identifies fishing boats that are loitering in risk areas" is vague but may be of immense value. Does this mean other vessels visually identifying a target, a very large-scale geofence, satellite imagery or something else?
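
If the "other data" were something like a geofence, one way it might work is sketched below: measure how long a vessel's position fixes stay continuously inside a circular risk area. Everything here is an assumption made for illustration (the circular zone, its center and radius, the track, and the helper names haversine_km and longest_stay_hours); the article does not say how loitering is actually detected.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometers."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def longest_stay_hours(track, center, radius_km):
    """Longest continuous run of fixes inside the circle, in hours.

    track is an iterable of (time, lat, lon) tuples; center is (lat, lon).
    """
    longest = 0.0
    run_start = None
    for when, lat, lon in sorted(track):
        if haversine_km(lat, lon, center[0], center[1]) <= radius_km:
            if run_start is None:
                run_start = when  # vessel has just entered the zone
            longest = max(longest, (when - run_start).total_seconds() / 3600)
        else:
            run_start = None  # vessel left the zone; reset the run
    return longest


# Hypothetical track around a hypothetical risk area centered on (0, 0).
track = [
    (datetime(2012, 3, 14, 0, 0), 0.5, 0.5),    # outside
    (datetime(2012, 3, 14, 2, 0), 0.1, 0.0),    # inside
    (datetime(2012, 3, 14, 5, 0), 0.05, 0.05),  # inside
    (datetime(2012, 3, 14, 9, 0), 0.0, 0.1),    # inside
    (datetime(2012, 3, 14, 11, 0), 1.0, 1.0),   # outside
]

stay = longest_stay_hours(track, center=(0.0, 0.0), radius_km=20.0)
print(f"Longest stay inside the zone: {stay} hours")
```

Whether seven hours inside a zone counts as loitering is a judgment call that someone who knows fishing operations is better placed to make than the person writing the code.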

Is this information sufficient to know where and when illegal fishing occurs? With what level of confidence? How can we test our predictions? And what about 'the last mile' of relaying this to local authorities: does that happen in real time, or is there a lag of (say) a week, further impeding enforcement? To proceed we need to better understand what we do not know, clarify what we do know and make some informed assumptions about how ocean fishing works.

What We Don't Know

The answers below are clearly knowable, but arriving at the questions is the hard/interesting part. The relevant unknowns I identified are:
  • How often is AIS relayed? Is it standard to have continuous broadcast or every x hours? Is the interval such that it will provide an area of uncertainty that is too vast to send maritime police to intercept?
  • How extensive is AIS coverage? Just because an illegally fishing vessel turns off its transponder doesn't mean anybody will know. Knowing which swaths of the earth are covered and how frequently they are refreshed by satellites is crucial info. To reiterate, the earth is a big place.
  • Are AIS messages authentic/legitimate? AIS message types include a fair amount of metadata, some of which could be spoofed/incorrect in the hope of confusing enforcement regimes.
  • Why would AIS be inactive? Was the transponder turned off because of a technical issue (loss of power), by accident (not knowing you unplugged the radio) or for some other reason (pirates)? A 'dark ship' does not necessarily indicate nefarious activity, and a broadcasting ship does not imply full compliance.
  • How does Vulcan's initiative fit with Leo's World Fishing Watch or Pew's Project Eyes On The Sea? From what I understand, all rely on AIS data but focus on different regions. I'd hope they collaborate on their different approaches, but who knows.
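
The first question, about reporting intervals, can be turned into back-of-the-envelope arithmetic. A vessel that has been silent for t hours and can make at most v knots could be anywhere within a circle of radius v × t nautical miles around its last known position. The sketch below computes that radius and area; the function names and the 10-knot, 6-hour figures are hypothetical, and it assumes open water in every direction.

```python
import math


def search_radius_nm(speed_knots, hours_silent):
    """Farthest the vessel could have travelled, in nautical miles."""
    # One knot is one nautical mile per hour.
    return speed_knots * hours_silent


def search_area_nm2(speed_knots, hours_silent):
    """Area of the circle the vessel could be in, in square nautical miles."""
    return math.pi * search_radius_nm(speed_knots, hours_silent) ** 2


# Hypothetical case: a boat capable of 10 knots, silent for 6 hours.
radius = search_radius_nm(10, 6)   # 60 nautical miles
area = search_area_nm2(10, 6)      # roughly 11,300 square nautical miles
print(f"radius = {radius} nm, area = {area:,.0f} nm^2")
```

Because the area grows with the square of the silent period, doubling the gap between reports quadruples the patch of ocean a patrol would have to search.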

Understanding Inputs

First, let's look at the (data) inputs. Below is a sample AIS message, formatted as JSON from the source. The details don't matter; essentially, longitude/latitude is broadcast periodically along with a bunch of attributes.

{
  "day": 14,
  "fix_type": 1,
  "hour": 11,
  "id": 4,
  "minute": 33,
  "mmsi": 2320717,
  "month": 3,
  "position_accuracy": 0,
  "raim": false,
  "repeat_indicator": 3,
  "second": 30,
  "slot_offset": 2250,
  "slot_timeout": 0,
  "sync_state": 0,
  "transmission_ctl": 0,
  "x": -5.782454967498779,
  "y": 57.842193603515625,
  "year": 2012
}

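For analysis, a message like this usually needs to be reduced to who, when and where. The sketch below does that for the sample above, assuming that "x" is longitude, "y" is latitude and the time fields are UTC; those readings fit the values shown but are my assumptions, and parse_fix is a hypothetical helper rather than part of any AIS library.

```python
import json
from datetime import datetime, timezone

# A subset of the sample message above, as a JSON string.
raw = """
{
  "day": 14, "hour": 11, "minute": 33, "second": 30,
  "month": 3, "year": 2012,
  "mmsi": 2320717,
  "x": -5.782454967498779,
  "y": 57.842193603515625
}
"""


def parse_fix(message):
    """Reduce a decoded AIS message (a dict) to (mmsi, utc_time, lat, lon).

    Assumes x = longitude, y = latitude and UTC time fields.
    """
    when = datetime(
        message["year"], message["month"], message["day"],
        message["hour"], message["minute"], message["second"],
        tzinfo=timezone.utc,
    )
    return message["mmsi"], when, message["y"], message["x"]


mmsi, when, lat, lon = parse_fix(json.loads(raw))
print(mmsi, when.isoformat(), lat, lon)
```

A list of such tuples per vessel is the natural input for questions like how long a ship was silent or how far it could have moved in the meantime.
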
I imagine looking at AIS data on a screen is what one would expect: you'll see a blip/ship with whatever icon you choose, with vectors displaying bearing and speed. When the transponder is turned off, the blip/ship disappears. The graphic shows this (obvious) concept, but it also illustrates a limitation of visualization and user interfaces: capturing temporal changes can be much more difficult if the user is not technical.

[Figure: Maritime AIS display, ship not found]

Now it's time to think like somebody in the ocean fishing business. Why would you (willfully) turn off your transponder? What could cause the transponder to stop broadcasting without human intervention? And what about turning it off unwillingly?