The hardest parts of data science
The hardest part of data science is not building an accurate model or obtaining good, clean data, but defining feasible problems and coming up with reasonable ways of measuring solutions.
By Yanir Seroussi.
Contrary to common belief, the hardest part of data science isn’t building an accurate model or obtaining good, clean data. It is much harder to define feasible problems and come up with reasonable ways of measuring solutions. This post discusses some examples of these issues and how they can be addressed.
The not-so-hard parts
Before discussing the hardest parts of data science, it’s worth quickly addressing the two main contenders: model fitting and data collection/cleaning.
Model fitting is seen by some as particularly hard, or as real data science. This belief is fueled in part by the success of Kaggle, that calls itself the home of data science. Most Kaggle competitions are focused on model fitting: Participants are given a well-defined problem, a dataset, and a measure to optimise, and they compete to produce the most accurate model. Coupling Kaggle’s excellent marketing with their competition setup leads many people to believe that data science is all about fitting models. In reality, building reasonably-accurate models is not that hard, because many model-building phases can easily be automated. Indeed, there are many companies that offer model fitting as a service (e.g., Microsoft, Amazon, Google and others). Even Ben Hamner, CTO of Kaggle, has said that he is “surprised at the number of ‘black box machine learning in the cloud’ services emerging: model fitting is easy. Problem definition and data collection are not.”
Data collection/cleaning is the essential part that everyone loves to hate. DJ Patil (US Chief Data Scientist) is quoted as saying that “the hardest part of data science is getting good, clean data. Cleaning data is often 80% of the work.” While I agree that collecting data and cleaning it can be a lot of work, I don’t think of this part as particularly hard. It’s definitely important and may require careful planning, but in many cases it just isn’t very challenging. In addition, it is often the case that the data is already given, or is collected using previously-developed methods.
Problem definition is hard
There are many reasons why problem definition can be hard. It is sometimes due to stakeholders who don’t know what they want, and expect data scientists to solve all their data problems (either real or imagined). This type of situation is summarised by the following Dilbert strip. It is best handled by cleverly managing stakeholder expectations, while stirring them towards better-defined problem.