When Data Science Is Not Enough: Deriving Signal from Maritime Observations
We examine the limits of "data science-first" thinking: letting technical skills drive the analysis and adding domain understanding only later.
Thinking about feature engineering, we can begin to get a sense of which patterns of activity might indicate fishing: perhaps a reduction in speed, irregular course changes, or both. Of course, there might be shallow or dangerous areas that require slower speeds, so we also have to consider geographic target areas, i.e., where the fish are. Weather could also play a part.
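As a minimal sketch of what such features might look like, the snippet below summarizes a vessel track by its mean speed, the fraction of fixes below a speed threshold, and the mean course change between consecutive fixes. The `AisFix` record, the field names, and the 4-knot threshold are assumptions made for illustration; they are not part of any AIS standard or of the analysis described here.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AisFix:
    """One position report (hypothetical, simplified record)."""
    sog: float  # speed over ground, in knots
    cog: float  # course over ground, in degrees [0, 360)


def heading_change(a: float, b: float) -> float:
    """Smallest absolute angle between two courses, in degrees (0..180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def track_features(fixes: list[AisFix], slow_knots: float = 4.0) -> dict[str, float]:
    """Summarize a track; slow_knots is an illustrative threshold, not a known value."""
    if len(fixes) < 2:
        raise ValueError("need at least two fixes")
    # Course change between each pair of consecutive fixes.
    turns = [heading_change(p.cog, q.cog) for p, q in zip(fixes, fixes[1:])]
    slow = [f.sog < slow_knots for f in fixes]
    return {
        "mean_sog": mean(f.sog for f in fixes),
        "slow_fraction": sum(slow) / len(fixes),
        "mean_turn_deg": mean(turns),
    }


# Made-up tracks: a steady transit versus slow, erratic movement.
transit = [AisFix(12.0, 90.0), AisFix(12.0, 91.0), AisFix(12.0, 90.0), AisFix(12.0, 92.0)]
loiter = [AisFix(2.0, 10.0), AisFix(3.0, 200.0), AisFix(2.5, 350.0), AisFix(3.0, 120.0)]
print(track_features(transit))
print(track_features(loiter))
```

Features like these say nothing on their own about legality; a slow, erratic track is equally consistent with shallow water or bad weather, which is why location and environment have to enter the picture.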
Does illegal fishing involve multiple ships? Maybe a large trawler rendezvouses with smaller fishing vessels so they can transfer their catch to its larger cargo hold? Now we also need to understand the interactions of multiple ships in proximity, considering transponder status, movement, and location. It is helpful to think of vessel activities that might be indicative of illegal behavior.
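One hedged way to picture the multi-ship question is a check for two vessels staying close together at low speed over several consecutive reports. The sketch below assumes each track is a dictionary keyed by a shared timestamp with `(lat, lon, sog)` values; that layout, the function names, and the thresholds (0.5 km, 3 knots, 3 fixes) are all hypothetical choices for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def possible_rendezvous(track_a, track_b, max_km=0.5, max_knots=3.0, min_fixes=3):
    """Return True if both vessels are close and slow on at least
    min_fixes consecutive shared timestamps.

    Tracks map timestamp -> (lat, lon, sog). All thresholds are
    illustrative assumptions, not values taken from the maritime domain.
    """
    run = 0
    for t in sorted(track_a.keys() & track_b.keys()):
        lat_a, lon_a, sog_a = track_a[t]
        lat_b, lon_b, sog_b = track_b[t]
        close = haversine_km(lat_a, lon_a, lat_b, lon_b) <= max_km
        slow = sog_a <= max_knots and sog_b <= max_knots
        run = run + 1 if (close and slow) else 0
        if run >= min_fixes:
            return True
    return False


# Made-up tracks: a trawler drifting, one vessel alongside, one far away.
trawler = {t: (10.0, 20.0, 1.0) for t in range(4)}
alongside = {t: (10.001, 20.0, 1.5) for t in range(4)}
distant = {t: (10.5, 20.0, 1.5) for t in range(4)}
print(possible_rendezvous(trawler, alongside), possible_rendezvous(trawler, distant))
```

A check like this only works while both transponders are broadcasting, which is exactly the condition a bad actor can control.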
So far we have developed a fairly complex model, and we have more work to do! There are multiple factors to consider:
- AIS status
- Vessel movement patterns
- Geographic location
- Ships in proximity
- Weather/environmental conditions
This gets even more complicated when thinking about probabilities, as we lack a robust ground truth: an indisputable source to help train our model. Without it, our models will have an unknown level of confidence.
So after all this (and a lot of technical work), we may not have enough varied data to make a meaningful impact on illegal fishing. Sigh. However, there still might be value in a rule-based system that encodes domain knowledge, as that will help all future parties (it could also encourage adverse behavior to game the system, as described above).
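A rule-based system of this kind could be as simple as a table of named rules with weights. The sketch below is one possible shape, loosely following the factors listed earlier; the rule names and weights are invented for illustration and would in practice come from maritime domain experts rather than from a data scientist's guess.

```python
# Hypothetical rules and weights; positive values raise suspicion,
# negative values offer an innocent explanation for the behaviour.
RULES = {
    "ais_gap": 3,             # transponder went silent
    "loitering": 2,           # slow speed with irregular course changes
    "in_restricted_area": 3,  # position inside a protected zone
    "vessel_nearby": 1,       # another ship in close proximity
    "bad_weather": -1,        # slow speed may simply be caution
}


def suspicion_score(observation: dict[str, bool]) -> tuple[int, list[str]]:
    """Return the total score and the names of the rules that fired."""
    fired = [name for name in RULES if observation.get(name, False)]
    return sum(RULES[name] for name in fired), fired


# Made-up observation for one vessel.
score, fired = suspicion_score({"ais_gap": True, "loitering": True, "bad_weather": True})
print(score, fired)
```

Returning the list of fired rules alongside the score keeps the output explainable, which matters when there is no ground truth to validate against.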
From AIS data alone, our analysis would require boiling the ocean to arrive at a manageable solution set that would still be challenging, at best, to test. The good news is that this was done in a relatively short timeframe using one brain.
But Wait, There's More Data Out There
Enter a second, independent data source that could help increase our confidence in identifying a bad actor. We'll use some flavor of remote sensing, which permits us to observe ship location. It is unclear if we can track ships with imagery alone (see below). Cloud cover and other environmental events might limit what we can see or infer. As with AIS, there are important questions to ask about this data type:
- Revisit rate: Are there enough satellite passes to track ships?
- Resolution: Can a human or machine identify the entity or is the picture too coarse?
- Sensor type: A bit technical, but are optics used or another instrument? This matters when environmental conditions are not favorable (synthetic aperture radar, or SAR, for example, can image through clouds).
We've already defined a set of rule-based activities that indicate illegal fishing behaviors. The maritime industry would be a great source of knowledge, and some common-sense ideas could also be considered.
Now let's revisit our conceptual model using both data types: ship-broadcast AIS and remotely sensed imagery. The beauty of this approach is that the sources are not correlated, meaning a change in one does not impact the other. With this independence of measurement, we can use one source to validate the other. For example, what if every time an AIS transponder went dark we could light up the target vessel using imagery, allowing us to track it using a different data source?
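The "went dark" idea can be sketched in two steps: find the silent periods in a vessel's AIS timestamps, then keep the imagery detections that fall inside them. The function names, the one-hour silence threshold, and the `(timestamp, lat, lon)` detection layout are assumptions for illustration. The sketch also assumes the detections have already been attributed to the vessel in question, which in practice is a hard problem in its own right.

```python
def ais_gaps(timestamps, max_silence_s=3600):
    """Return (start, end) pairs where consecutive AIS reports are more
    than max_silence_s seconds apart. The threshold is illustrative."""
    ts = sorted(timestamps)
    return [(a, b) for a, b in zip(ts, ts[1:]) if b - a > max_silence_s]


def detections_during_gaps(gaps, detections):
    """Keep imagery detections whose timestamp falls strictly inside a gap.

    detections is a list of (timestamp, lat, lon) tuples, assumed to be
    already associated with the vessel whose gaps are passed in.
    """
    return [d for d in detections if any(start < d[0] < end for start, end in gaps)]


# Made-up data: reports every 10 minutes, then silence for over two hours.
reports = [0, 600, 1200, 9000, 9600]
imagery = [(5000, 1.0, 2.0), (9300, 1.1, 2.1)]

gaps = ais_gaps(reports)
print(gaps)
print(detections_during_gaps(gaps, imagery))
```

Whether a detection actually lands inside a gap depends on the revisit rate discussed above: a sparse imaging schedule will miss most short gaps.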
The convergence and interplay of these two data sources are what allow us to derive signal: confidence with the ability to act. The approach is well established in quant hedge funds, but its applicability outside financial markets is vast.
To get to this point we sought to make explicit what we didn't know and made (we hope) reasonable assumptions. By walking through the analysis it became clear that uncertainty was reduced by an order of magnitude when introducing the second data source.
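One way to make this reduction in uncertainty concrete is a simple Bayesian update, applied once per data source. Every number below is a made-up assumption chosen only to show the mechanics; none comes from real fishing data. Chaining the two updates also assumes the sources are independent given the hypothesis, which is the property argued for above.

```python
def bayes_update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """Posterior P(H | E) from a prior P(H) and the two likelihoods
    P(E | H) and P(E | not H), via Bayes' rule."""
    numerator = p_e_given_h * prior
    denominator = numerator + p_e_given_not_h * (1.0 - prior)
    return numerator / denominator


# All values are illustrative assumptions, not measurements.
prior = 0.01  # assumed share of vessels fishing illegally

# Evidence 1: the AIS transponder went dark.
after_ais = bayes_update(prior, p_e_given_h=0.6, p_e_given_not_h=0.1)

# Evidence 2: imagery shows the vessel inside a restricted area.
after_imagery = bayes_update(after_ais, p_e_given_h=0.7, p_e_given_not_h=0.05)

print(f"prior={prior:.3f} after_ais={after_ais:.3f} after_imagery={after_imagery:.3f}")
```

With these invented inputs, the AIS evidence alone leaves the hypothesis unlikely, while the second source moves it to roughly even odds. The point is the shape of the change, not the specific values.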
Like many things, this may seem obvious, but hopefully only in hindsight. I wrote this case as a way to explain how domain-specific thinking can bolster data science. It is an emerging skill, already dominant in hedge funds with the rise of the quantalist. This is not a poke at data scientists; rather, it points to a gap in how they can best collaborate with product management and business strategy. Simply put, don't spend time on high-cost activities until it makes sense to do so, as represented below. This chart is a sort of conceptual Bayesian inference in its simplest form.
Bio: Ian White is a founder who loves building data-derived products to support revenue growth and uncover hidden insight.
Original. Reposted with permission.