Will Data Science Eliminate Data Science?

There are elements of what we do which are AI-complete. Eventually, Artificial General Intelligence will eliminate the data scientist, but it’s not around the corner.

By Balázs Kégl, Data scientist, co-creator of RAMP.

Data science, and beyond!

A friend of mine asked me this question: is there really a market for premium ML as a service, with hundreds of data scientists thinking about features and ever more clever algorithms? Or will big data take over even our jobs, since more data eventually trumps clever algorithms?

The short answer: there are elements of what we do which are AI-complete. Eventually, Artificial General Intelligence will eliminate the data scientist, but it’s not around the corner.

Still, I found it a very good question. Here are some loosely connected thoughts to elaborate on my short answer.

  1. The combination of i) the data science buzz, ii) the slowness of higher education to answer such a sudden demand, and iii) the brain drain fueled by limitless cash flow in the IT-centric industry puts enormous pressure on the meta data scientist to automate her own job. We are in the right position: automating stuff is at the core of our job description. So don’t be surprised if data science automation (and standardization!) evolves faster than predicted.

    For the same reason, I do believe in horizontal MLaaS (unlike some of the gurus of data science and entrepreneurship). A well-designed MLaaS can bring value to small and medium-sized businesses, or even larger non-IT companies, which cannot afford or do not want to build a data science ecosystem internally. Automated and crowdsourced ML may not be able to compete with the top data science R&D teams assembled by, say, Google or Facebook, but it can compete with a two-person team whose members became “data scientists” after a 60-hour fast-track course.

  2. There are certainly cases where collecting more data is an option. In these cases we can estimate how much more data will help (by extrapolating learning curves). In this regime error is dominated by overfitting, so tricks (e.g., dropout, regularization) compete with data collection, and (expensive) data science time competes with (cheap) labeler time.
  3. There are certainly cases where more data, even though it would be available, won’t help. In our HEP anomaly challenge we had practically infinite data, yet all off-the-shelf predictors were stuck. We were dominated by underfitting: we could not express (or approximate) the right predictor well enough with off-the-shelf techniques. Underfitting problems are design problems, and this is where human ingenuity (informed, knowledge-driven model search) can help. Deep nets also solved an underfitting problem: classical computer vision methods had saturated with data size (see slide 8 here).
  4. Finally, there are certainly cases where collecting more data is not an option. Many times distributions are shifting, so you have a limited amount of unbiased data at any moment. Many times data scales linearly with the number of tasks or classes, so per task we are data-limited. In these cases we are stuck in a small-data world where overfitting rules. Generic (automated, hyperopted) regularization techniques may bring you to a certain level, but there is nothing more valuable here than informed priors (aka domain knowledge). And whenever you have domain knowledge, it will have to be absorbed and transformed, giving the data scientist ample job security.
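
To make point 2 a bit more concrete, here is a minimal sketch of what extrapolating a learning curve could look like. It is an illustration under loud assumptions, not a recipe: the validation errors are made-up numbers, the function names are hypothetical, and the curve is assumed to follow a pure power law, error(n) = a * n^(-b), with no irreducible error floor. Real learning curves usually flatten out, so treat the output as an optimistic guess.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error(n) = a * n**(-b) by least squares in log-log space.

    Assumption: a pure power law with no error floor (hypothetical model).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = intercept + slope * log(n), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def predict_error(a, b, n):
    """Extrapolated validation error at training set size n."""
    return a * n ** (-b)


def samples_needed(a, b, target_error):
    """Smallest n for which the fitted curve reaches target_error."""
    return math.ceil((a / target_error) ** (1.0 / b))


# Made-up measurements: validation error at four training set sizes.
sizes = [1000, 2000, 4000, 8000]
errors = [0.20, 0.16, 0.128, 0.1024]

a, b = fit_power_law(sizes, errors)
print("predicted error at 16000 samples:", predict_error(a, b, 16000))
print("samples needed for 5% error:", samples_needed(a, b, 0.05))
```

Comparing the cost of labeling that many extra samples with the cost of a data scientist trying dropout or better regularization is exactly the trade-off described in point 2.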

Of course, this is just a short excerpt of what could be said about the topic. What do you think? Are our data science jobs in danger?

Bio: Balázs Kégl is a senior research scientist at CNRS and head of the Center for Data Science of the Université Paris-Saclay. He is co-creator of RAMP (www.ramp.studio).

Original. Reposted with permission.