Does Your Company Need a Data Scientist?

Your company needs a data scientist... doesn't it? It very well may not, but you need to know either way. Read on to determine whether or not your company could benefit from the skills of an on-board data scientist.

Not Enough Historical Data


Data science at its core is about looking at the past to make predictions about the future.

One of the most common problems I’ve encountered in data science, however, is not having enough historical data.

Time and again, I’d start a cohort analysis or look at aggregate counts of users. But to get true insight from metrics, you really need to put them into historical context. What was the metric last month, or last year? Month-over-month (M/M) or year-over-year (Y/Y) trends give better context as to whether an observed behavior is an anomaly or part of a seasonal trend.
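To make the M/M and Y/Y comparison concrete, here is a minimal sketch in Python. The function names, the (year, month) keys, and the user counts are all made up for illustration; the point is only that the same number can look very different once it is placed in historical context, and that the comparison is unavailable when the history is missing.

```python
def pct_change(current, previous):
    """Relative change from previous to current, or None if it cannot be computed."""
    if previous is None or previous == 0:
        return None
    return (current - previous) / previous


def trend_context(monthly, month):
    """Return (M/M, Y/Y) changes for `month`, given a dict keyed by (year, month)."""
    year, m = month
    prev_month = (year, m - 1) if m > 1 else (year - 1, 12)
    prev_year = (year - 1, m)
    current = monthly[month]
    return (
        pct_change(current, monthly.get(prev_month)),
        pct_change(current, monthly.get(prev_year)),
    )


# Hypothetical monthly active-user counts (illustrative numbers only).
monthly_active_users = {
    (2015, 12): 1000,
    (2016, 11): 1500,
    (2016, 12): 1200,
}

mom, yoy = trend_context(monthly_active_users, (2016, 12))
print(f"M/M: {mom:.0%}, Y/Y: {yoy:.0%}")

# The earliest month has nothing to compare against.
no_history = trend_context(monthly_active_users, (2015, 12))
print(no_history)
```

In this toy data, December 2016 is down month over month but up year over year, which is the kind of seasonal pattern that only shows up with a year of history. For the earliest month, both comparisons come back as None.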

And with predictive modeling, you need historical data to build a training set for the features you want to assess. Without enough historical data, you can’t train a model to predict a specific signal in the future.
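As a rough illustration of why a training set depends on history, the sketch below labels users as churned or retained within a fixed outcome window after signup. Everything here is an assumption for the example: the function name, the dictionary inputs, the dates, and the choice of churn as the signal. A user can only be labeled once their full outcome window has elapsed.

```python
from datetime import date, timedelta


def build_training_rows(signup_dates, churn_dates, today, outcome_days):
    """Label each user 1 (churned) or 0 (retained) within `outcome_days` of signup.

    Users whose outcome window has not fully elapsed by `today` are skipped,
    because their label is not yet known.
    """
    window = timedelta(days=outcome_days)
    rows = []
    for user, signup in signup_dates.items():
        if signup + window > today:
            continue  # not enough history yet to know this user's outcome
        churned_on = churn_dates.get(user)
        label = int(churned_on is not None and churned_on <= signup + window)
        rows.append((user, label))
    return rows


# Hypothetical users (illustrative dates only).
signups = {
    "a": date(2016, 1, 10),
    "b": date(2016, 2, 15),
    "c": date(2016, 5, 20),
}
churns = {"a": date(2016, 2, 1)}
today = date(2016, 6, 1)

rows_90 = build_training_rows(signups, churns, today, outcome_days=90)
rows_365 = build_training_rows(signups, churns, today, outcome_days=365)
print(rows_90)
print(rows_365)
```

With a 90-day window, only the two older users can be labeled. With a 365-day window, none of them can, so there is nothing to train on yet.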

Even with large event volumes, it’s likely that you don’t have enough historical reference data to compare current data to. This could be due to your company simply being too new – it’s hard to do Y/Y growth rate analysis if you’re a seed-stage company. Or it could be that your company only recently instrumented its product with a client-side analytics tool or event tagging.

Not having enough historical data makes it difficult for a data scientist to actually shine at their core strength – finding historical trends and making predictions for the future.

High Latency in Your Signal

[Figure: true/false positive ROC curve]

Even with enough historical data, there are circumstances where a company’s business doesn’t lend itself to predictive modeling.

One such circumstance is high latency in the signal.

Again, data science is about making predictions or probabilistic assessments of some future behavior based on past behavior. Different companies have different signal sets that they are trying to optimize for – signups, churn, sales, etc.

But depending on your company, those signals may have different levels of latency. For a gaming company, churn of active users can be detected within days, if not hours. But for a SaaS or B2B company churn is seen on the order of months if not years, due to the long-term nature of contracts.

As a consequence, building predictive models around behaviors where the gap between input and signal is on the order of years is extremely difficult. The number of externalities that can come into play makes it difficult for any one feature to have high predictive power, and sometimes makes building meaningful ROC curves impossible.
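For readers who have not built one, a ROC curve plots the true positive rate against the false positive rate as the score threshold is lowered. The sketch below is a plain-Python illustration, not a production implementation; the function names, labels, and scores are invented for the example. It also shows one way a curve can become impossible to build: if no positive outcomes have been observed yet, the true positive rate is undefined.

```python
def roc_points(labels, scores):
    """Return ROC points as (false positive rate, true positive rate) pairs."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")
    # Walk through examples from highest score to lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last example sharing this score (ties).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical outcomes (1 = churned) and model scores, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

area = auc(roc_points(labels, scores))
print(round(area, 3))  # 0.889
```

An area of 0.5 corresponds to random guessing and 1.0 to perfect ranking. Calling roc_points with labels that are all 0, as you would have before any long-term contract has had time to churn, raises a ValueError.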

Of course, extremely high volumes of data can compensate for high latency in signal. But such conditions can be tough to find or optimize for. If you are a data scientist, be cautious of businesses or industries where high latency in signal is expected, as it will make your job quite difficult.

Low Signal to Noise Ratio

[Figure: signal to noise]

A circumstance where high event volumes may not compensate is when your signal to noise ratio is low.

Regardless of event volume, if the segment of data that actually carries the signal you are trying to optimize for is minuscule, it will be difficult to do most modeling.

For example, if you collect millions of user-action events a day, but only a tiny percentage of users ever perform an action core to your business, you’re unlikely to find meaningful insights into what inputs drive adoption of your product.
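A quick back-of-the-envelope check can show how thin the signal is. The sketch below is only arithmetic under assumed numbers: the function names, the user counts, and the target of 1,000 positive examples are hypothetical and not a recommended threshold.

```python
def base_rate(positives, total):
    """Fraction of users who performed the core action."""
    if total <= 0:
        raise ValueError("total must be positive")
    return positives / total


def users_needed(positives, total, target_positives):
    """Users to observe so the expected number of positives reaches the target.

    Uses integer ceiling division to avoid floating-point rounding.
    Returns None when no positives have been observed at all.
    """
    if positives <= 0:
        return None
    return -(-target_positives * total // positives)


# Hypothetical sample: 10,000 users, 5 of whom performed the core action.
rate = base_rate(5, 10_000)
needed = users_needed(5, 10_000, target_positives=1_000)
print(f"base rate: {rate:.2%}, users needed: {needed:,}")
```

At this assumed base rate, it would take about two million users to expect a thousand positive examples, regardless of how many raw events those users generate.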

When the signal to noise ratio in your data sets is low, it probably calls for larger discussions about your product, and it’s unlikely that a data science project could help answer the necessary questions in such circumstances.

When You Need a Data Scientist

There are many reasons you should hire a data scientist.

A data scientist is a powerful asset in both data product and decision science – they are the aforementioned data sledgehammer. They can help your product with new recommendation engines, and assess affinity maps and other data to inform product direction. They can help guide your company’s operating metrics around KPIs essential to your business.

But leveraging a data science team appropriately requires a certain data maturity and infrastructure in place. You need some basic volume of events, and historical data for a data science team to provide meaningful insights on the future. Ideally your business operates on a model with low latency in signal and high signal to noise ratio.

Without these elements in place, you’ll have a sports car with no fuel. Ask yourself if more traditional roles like data analysts and business intelligence may suffice.

Sincerest thanks to Charles Pensig, data scientist at Optimizely and Jawbone, for his feedback on this essay.

Bilal Mahmood is a cofounder of Bolt. He formerly led data warehousing and analytics at Optimizely, and is passionate about helping companies turn data into action.

Bolt is a data integration platform that immediately connects and transforms your user data from analytics, marketing, and payment platforms. We automate the data engineering so you can focus on the data science.

Original. Reposted with permission.