Reasons Why Data Projects Fail

Many companies seem to go through a pattern of hiring a data science team only for the entire team to quit or be fired around 12 months later. Why is the failure rate so high?

You need people who live and breathe selection bias, measurement bias, etc. or you'll never know the results are meaningless. These people are called scientists.

Oh, by the way:

Likewise - the opposite is very often true:

The hype around machine learning means there is a lot of readily available content out there. This can lead to the 'instant expert' phenomenon: now everyone has a great machine learning idea. Symptoms include your boss using words like 'regularize' and 'ensemble method' in the wrong context. Trust me, this will not end well.

The Cost-Effective HealthCare project used data from hospitals to process Emergency Room patients who had symptoms of pneumonia. They aimed to build a system that could identify patients with a low probability of death, so they could simply be sent home with antibiotics. This would allow care to be focused on the most serious cases, who were likely to suffer complications.

The neural network they developed had a very high accuracy but, strangely, it always decided to send asthma sufferers home. Weird, since asthmatics are actually at high risk of complications from pneumonia.

It turned out that asthmatics who present with pneumonia are always admitted to Intensive Care. Because of this, there were no cases of any asthmatics dying in the training data. The model concluded that asthmatics were low risk, when the opposite was actually true. The model had great accuracy but if deployed in production it would certainly have killed people.
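To see how this kind of bias arises, here is a minimal simulation sketch. Every probability in it is invented for illustration and none comes from the actual project; the only thing borrowed from the story is the rule that asthmatics always go to Intensive Care. The names `simulate`, `RISK_WITHOUT_ICU` and the other constants are hypothetical.

```python
import random

# All numbers below are made up for illustration; none come from the study.
P_ASTHMA = 0.10                  # share of pneumonia patients with asthma
P_ICU_NON_ASTHMATIC = 0.10       # share of non-asthmatics admitted to ICU
RISK_WITHOUT_ICU = {True: 0.30, False: 0.05}  # death risk keyed by has_asthma
ICU_RISK_MULTIPLIER = 0.10       # ICU care cuts the risk to a tenth


def simulate(n, seed=0):
    """Return the observed death rate for asthmatics and non-asthmatics."""
    rng = random.Random(seed)
    counts = {True: [0, 0], False: [0, 0]}  # has_asthma -> [patients, deaths]
    for _ in range(n):
        has_asthma = rng.random() < P_ASTHMA
        # The hospital policy from the story: asthmatics always go to ICU.
        in_icu = has_asthma or rng.random() < P_ICU_NON_ASTHMATIC
        risk = RISK_WITHOUT_ICU[has_asthma]
        if in_icu:
            risk *= ICU_RISK_MULTIPLIER
        died = rng.random() < risk
        counts[has_asthma][0] += 1
        counts[has_asthma][1] += died
    return {k: deaths / patients for k, (patients, deaths) in counts.items()}


observed = simulate(100_000)
print("observed death rate, asthma:   ", round(observed[True], 3))
print("observed death rate, no asthma:", round(observed[False], 3))
print("risk if sent home, asthma:     ", RISK_WITHOUT_ICU[True])
print("risk if sent home, no asthma:  ", RISK_WITHOUT_ICU[False])
```

Under these assumptions the recorded death rate for asthmatics comes out lower than for everyone else, even though their risk if sent home is six times higher. A model trained on the recorded outcomes alone only ever sees the first pair of numbers.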

Moral of the story: use a simple model you can understand [1]. Only then move on to something more complex, and only if you need to.
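As a rough sketch of what 'simple' can mean, here is about the simplest model there is: an odds ratio for one feature at a time. This is my own illustration, not what the project team did. The toy records are fabricated, and `odds_ratio` is a hypothetical helper. It adds 0.5 to every cell of the 2x2 table (the Haldane-Anscombe correction) so that a feature with zero deaths, like asthma in the story, doesn't cause a division by zero.

```python
def odds_ratio(records, feature, outcome="died"):
    """Odds ratio of `outcome` for records with `feature` vs. without it."""
    # Start each cell at 0.5 (Haldane-Anscombe correction) to survive empty cells.
    with_died = with_survived = without_died = without_survived = 0.5
    for r in records:
        if r[feature] and r[outcome]:
            with_died += 1
        elif r[feature]:
            with_survived += 1
        elif r[outcome]:
            without_died += 1
        else:
            without_survived += 1
    return (with_died * without_survived) / (with_survived * without_died)


# Fabricated toy data: no asthmatic dies, mirroring the story above.
records = (
    [{"asthma": True, "over_65": False, "died": False}] * 20
    + [{"asthma": False, "over_65": True, "died": True}] * 8
    + [{"asthma": False, "over_65": False, "died": True}] * 2
    + [{"asthma": False, "over_65": True, "died": False}] * 40
    + [{"asthma": False, "over_65": False, "died": False}] * 130
)

for feature in ("asthma", "over_65"):
    ratio = odds_ratio(records, feature)
    verdict = "looks protective, ask a doctor" if ratio < 1 else "looks risky"
    print(f"{feature}: odds ratio {ratio:.2f} ({verdict})")
```

A table like this puts 'asthma looks protective' right in front of you, where anyone with medical knowledge can call it out. A neural network buries the same relationship in its weights.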

The core of science is reproducibility. Please do all of these things. Don't say I didn't warn you.

An applied science lab is a big commitment. Data can often be quite threatening to people who prefer to trust their instincts. R&D has a high risk of failure and unusually high levels of perseverance are table stakes. Do some soul searching - will your company really accept this culture?

Never let UX designers and product managers design a data product (even wireframes) using fake data. As soon as you apply real data it will be apparent that the wireframe is complete fantasy.

The real data will have weird outliers, or be boring. It will be too dynamic. It will be either too predictable or not predictable enough. Use live data from the beginning or your project will end in misery and self-hatred. Just like this poor leopard, weasel thing.

[1] See here for more.

Have I missed anything? This is a live project. Please get in touch with your own data science failure stories.

Bio: Martin Goodson has worked in data science for over 15 years. His interests are in natural language processing and statistical modeling using internet-scale data sets. He blogs, speaks and offers consultancy on data science strategy and product development.

Original. Reposted with permission.