Biernbaum: Data Science is going 99% too fast

Currently Data Science (and Big Data) is a train that is going 99% faster than the rails can support. It can derail and become a failed fad. To succeed, Data Science and Big Data need to specialize.

Guest blog by Mark Biernbaum, Dec 25, 2013.

I am very interested in Big Data and Data Science, and I want Data Science to succeed.

I want to offer some suggestions. I am in agreement that current Data Science is not yet a mature field of inquiry. I think this is a critical time in the development of Data Science and that there are certain steps that the field needs to embrace right now.

My current visualization of Data Science is a train that is going 99% faster than the rails it's on can support. It could derail, go off a cliff, and become nothing but a failed fad. And if it does this, there is a good chance that it will take statistics and other quant fields with it, and we will enter a phase where business will decide that analysis has nothing to offer. I know that everyone reading this knows that if business entered an anti-data phase, a place they have been before, this would be very bad for everyone.

So, it is very important that Data Science stop running 99% too fast. I read an article recently where some Data Scientists were bemoaning the fact that there are many CEOs out there who are not jumping on the Big Data bandwagon, and they couldn't understand why. They suggested it was because these CEOs really didn't understand Big Data. But there is another possible explanation: it's Big Data's brand that is impeding its adoption by more businesses. Right now, Big Data looks like a total mess, like a drag queen whose wig is crooked, who lost a heel, whose lipstick is smeared and who is racing down a runway so fast you know that crazy queen is heading straight for a major collision. Really.

Why? Well, I'm learning that there is a lot of Big Data out there, the variety is huge, and some types of Big Data and methods barely resemble other types. Consumer data, finance data, energy data: they are quite different, they want different things, they approach their data very differently. Financial data really needs to be clean and exact. Predictions have to be extremely focused to be useful. Not so for consumer data, where trends are important, "exact" doesn't mean much, but speed of analysis and immediate applicability to marketing are vital. Smashing all these different types of data under one tent looks like the biggest monster truck collision ever. All some fields have in common with others is the "Big" part, and that is making Big Data look like a Big Mess.

Psychology went through a phase just like this. People who were concerned with stress in the workplace were smashed together with people who were trying to understand how babies learn to walk, with others who wanted to better understand how mobs form, with people who were interested in minimal brain damage, etc. Selling psychology in that form was really difficult because no one could tell what psychology was, and was not.

Medicine went through this too. The answer was specialization. That way, people knew not to take their grandmother to a Pediatrician, and when they had stomach problems they didn't see a Neurologist. One potential strategy that will neaten up Big Data is separate tents. And that doesn't mean that you all can't get together sometimes. The first division in the American Psychological Association is "General Psychology," which publishes a journal with articles that have wide appeal.

So if Big Data created specializations, then business would be able to pick out the specializations most central to their particular work. So could energy firms, architects and engineers, etc. They all could look at Big Data and easily know where to go with their questions. Right now Big Data is like "speed dating: 50 guys in 50 minutes!" You meet with a hot guy who works on cars, then an enginerd with impossible hair, then a CEO who is old enough to be your father. It shouldn't be this hard to find the right guy.

But specialization only works if additional steps that define the specialties and clearly show them off are also taken. Professional societies need to be formed, and each society needs to have a public avenue that demonstrates expertise and excellence. And I'm sorry to say that a blog will not do it. There are so many Big Data blogs right now, and I know that you will agree with me when I say that some of them are absolutely terrible and are making Data Science look really bad.

I know you will moan and cry out in pain when I tell you this, but I am trying to help. Folks, you need professional journals that are peer reviewed. Sorry to say that Data Science is not exempt from this requirement that all other fields meet in order to demonstrate their worth. It is a necessary evil. There was a lot of posting recently about a method to reduce error in a Logistic regression. This method had been presented at conferences, but these conferences were not really peer reviewed, and your conferences must be peer reviewed. The method was presented in a book, but the book was not peer reviewed. The only way new data and methods can really be regarded as helpful is if they are peer reviewed. CEOs tend to trust fields that are peer reviewed and they are 100% right. A field that isn't peer reviewed is equivalent to a hobby.

Transparency of method is what tells you if the results of an analysis can be trusted. It's the methods that put a stamp of approval on an analysis, not the results, since results can only be as good as the method. Data Science needs methodology badly. This is where statisticians and scientists can come in and be of real help to Big Data. If new methods need to be created, I would rather they be created in collaboration with someone who knows the importance of methodology. Every field needs trusted methods. Even auto mechanics use them, and thank goodness they do, because if they didn't, when you go in for brake service, they might tear your transmission apart. And that's very bad.

Statisticians and scientists need to learn some things that Big Data excels in. First is data visualization, where it is clear that Data Science has far surpassed traditional statistical graphs and figures. We tend to put most of our data in tabular form, and we need new ways to "show" our data. Help us. Data Science has also focused on Pattern Recognition. We have a few methods to do this (Fourier Analysis), but they are quite cumbersome and do not produce the kind of easily comprehensible results that Big Data is currently getting in its Pattern Recognition work. Help us.

And let us help you. Predictive Analytics is an area of great importance to Data Science. It is also the area of Data Science that is most unstable right now, most subject to criticism, and most likely to be done poorly. That's because most Data Scientists don't really know the first thing about inferential statistics. And to be honest, you cannot learn inferential statistics from blogs. And just because you can run a Logistic regression does not mean that you have any idea what you did to the data. And you should know exactly what that procedure did to your data. But hey, no need to go back to college. Instead, find a colleague, a statistician perhaps, who actually does know what that procedure does to the data, what specifications need to be made, whether Logistic regression is really the method you want (Discriminant Function Analysis might be better), and how to describe the findings in the way most consistent with the method and the data.

For the holidays, I thought I would try being constructive. Data Science and Statistics need each other. We can manage on our own, but damn, don't we look good when we put it all together!

Mark A. Biernbaum, PhD is a Researcher with 25 years of experience; Children's Institute, Clinical/Social Science in Psychology, University of Rochester.

Gregory Piatetsky, Editor: This excellent post was one of the responses in the intense debate (over 250 comments) on LinkedIn Group Advanced BA, DM and PM, prompted by my post "Why statistical community is disconnected from Big Data and how to fix it." It is re-posted here, with small edits, by permission of the author.