What 70% of Data Science Learners Do Wrong

Lessons learned from repeatedly smashing my head with a 2-meter-long metal pole for a college engineering course.

By Dan Becker, Team Lead for Kaggle Learn

I actively searched for hard and useful classes for most of my time in college. But, I got tired by my final year, and I wanted a break. So I took a “fun” class from the engineering department called “The Physics of Sailing.”

We diagrammed the forces allowing a sailboat to go faster than the wind. We learned how the shape of a boat can make it stable or unstable. I’d already taken more physics than most of my classmates. So, I did well on the homework and assumed I’d be a natural if I ever went sailing.

I tested this assumption at the end of the semester when my class went to the small Mascoma Lake to try sailing on a real boat. It didn’t go like I’d expected.


Boats on Mascoma Lake. They aren’t as gentle as they seem.


The boats felt tippy, and my knowledge of buoyancy and “righting arms” didn’t keep me in the boat. Turning required coordinating multiple movements. And when I got the timing wrong, a two-meter-long metal pole (called the boom) swung around and clocked me in the head. The cracking sound of the boom on my head made my ears ring for minutes each time.

The physics of sailing are fun to learn about, but apparently they don’t help you actually sail.

What does this have to do with data science?

Much as I learned the physics of sailing without learning to sail, most data science courses go into great detail about a few algorithms while glossing over the skills needed for successful data science projects.

Corporate data science is still a new field. Many academics haven’t worked on real problems for real businesses yet. So they teach textbook algorithms in a way that’s separated from data and business context. This can be intellectually fun. But, students are mistaken if they assume these courses prepare them well to work as data scientists.

So, how can you focus your efforts on practically important skills? Here are some guidelines:

  1. Use standard open source libraries. Practical data science relies on libraries that are well-documented, well-tested, and have well-designed APIs. Implementing alternative versions yourself is a source of complexity (and bugs) that distracts you from the data and from the context where your model will be applied.
  2. Spend more time looking at your data and manipulating it into the format you need. Most projects involve a lot of data manipulation and relatively little model tuning. Friends who are currently hiring tell me many job candidates can describe algorithms, but the vast majority lack the Pandas skills to be efficient with real work.
  3. Learn about techniques in the context of applications. If you need technical jargon to describe the practical relevance of what you are learning, you probably aren’t ready to apply it.
  4. Learn how to interpret model output. For example, you need to understand measures of model accuracy to know if you can trust a model. Learn machine learning explainability techniques, like permutation importance.
  5. Build projects in a domain you find interesting. It can be about movies, current events, sports, food, or anything else. This will teach you how to frame amorphous questions about the world in a way where you can apply technical tools. This is one of the most important skills for data scientists. Sharing your work will teach you how to interpret and discuss results, which might be the single most important skill.

If you skip the algorithmic theory underlying many books and courses, will it be easy to become a data scientist? No.

There’s a lot to learn about manipulating data, interpreting it, and connecting your tools to reality. I’ve intentionally reduced the amount of abstract theory I teach, to help learners focus on practical skills. I think this approach will keep you from smacking yourself in the back of the head when you start your serious projects.
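
To make a few of the guidelines concrete, here is a minimal sketch that leans on standard libraries, looks at the data with pandas before modeling, and then uses permutation importance to interpret the model. Everything in it is an assumption for illustration: the data is synthetic, the column names (`size_sqm`, `age_years`, `noise`, `price`) are made up, and the random forest is just one reasonable model choice, not a recommendation from this article.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic, made-up data: price depends strongly on size, weakly on age,
# and not at all on the "noise" column.
rng = np.random.default_rng(0)
n = 500
homes = pd.DataFrame({
    "size_sqm": rng.uniform(40, 200, n),
    "age_years": rng.uniform(0, 50, n),
    "noise": rng.normal(0, 1, n),
})
homes["price"] = (
    3000 * homes["size_sqm"]
    - 500 * homes["age_years"]
    + rng.normal(0, 5000, n)
)

# Look at the data before modeling: summary statistics, then the
# average price within four equal-width size bands.
print(homes.describe().round(1))
size_band = pd.cut(homes["size_sqm"], bins=4)
print(homes.groupby(size_band, observed=True)["price"].mean().round(0))

# Fit a standard library model instead of writing one by hand.
features = ["size_sqm", "age_years", "noise"]
X_train, X_valid, y_train, y_valid = train_test_split(
    homes[features], homes["price"], random_state=0
)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: shuffle one column of the validation data at a
# time and measure how much the model's score drops.
result = permutation_importance(
    model, X_valid, y_valid, n_repeats=10, random_state=0
)
importances = pd.Series(result.importances_mean, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
```

In this toy setup the size column should come out far ahead of the others, because the fake prices were built mostly from size. On real data the ranking is the thing you would want to sanity-check against what you know about the domain.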

Bio: Dan Becker (@dan_s_becker) has consulted for Fortune 100 companies and contributed to the Keras library for deep learning. He now runs Kaggle Learn.

Original. Reposted with permission.