Top 7 Things I Learned in my Data Science Masters

Even though I’m still in my studies, here’s a list of the most important things I’ve learned so far.

By Dario Radečić, Data Science Student

Some of them will already be familiar to you, but I wouldn’t suggest skipping them. Another opinion always comes in handy.

Photo by Charles DeLoye on Unsplash

1. Always Consult with a Domain Expert

By ‘Always’ I mean if you have the luxury of doing so.

This was one of the first things I learned. We were introduced to a professor who is like a rock star in the world of data, churn modeling to be more precise. Hearing that sentence from him was probably the moment the first myth I had about data science got destroyed.

You can read more about him and the whole case here:

Attribute Relevance Analysis in Python — IV and WoE
Recently I’ve written about Recursive Feature Elimination — one of many feature selection techniques I use most often…

Originally I thought that data scientists were this rare species that can do almost anything given the right data. In most cases, that couldn’t be further from the truth. Yes, you can analyze everything against everything, and you will find something interesting by doing so, but is it really the best use of your time?

Ask yourself,

What is the question? How does X connect to Y? Why?

Knowing the answers will lead you in a good general direction for solving the problem. And that’s when domain experts come in handy.

Another really important thing is feature engineering. The same professor stressed that you can use domain experts in the feature engineering process. It makes sense if you take a minute to think about it.

2. You’ll Spend Most of the Time Preparing Data

Yes, you’ve read that right.

One of the top reasons I enrolled in a Data Science master’s was machine learning. I didn’t care much what the data was about, or how it was gathered and prepared. Due to this attitude, I was a bit shocked and disappointed when the semester started.

Of course, if you work in a bigger company that has both data engineers and machine learning engineers, this probably won’t apply to you, because you’ll be doing machine learning for most of the day. But if that isn’t the case, you’ll spend only about 15% of your time doing machine learning.

Which is actually great. Machine learning isn’t all that interesting. Hear me out on this before you jump to the comment section with your big F-words.

The reason I think machine learning isn’t all that interesting is that, for most projects, it boils down to trying out several learning algorithms and then optimizing the best one.

If you didn’t do a good job before this, that is, in the data preparation process, your model will most probably suck and there won’t be much you can do about it, except tweak hyperparameters, adjust the threshold, and similar.

That’s why data preparation and exploratory analysis are king, and machine learning is just something that comes naturally after them.

Once I realized that, I lost most of my hype for machine learning. I found out that I enjoy data gathering and visualization more, because that’s where I learn the most about the data.

In particular, I really enjoy web scraping, because a relevant dataset is hard to find. If that sounds like something you would enjoy, please check out this article:

No Dataset? No Problem. Scrape one Yourself.
Use the Power of Python and BeautifulSoup to Scrape Data that Matters to You.

3. Don’t Reinvent the Wheel

Libraries are made for a reason. Google before you act.

I’ll show you a very trivial example of a ‘mistake’ you probably haven’t made, but it will help you understand this point.

It’s about two ways to calculate the median. Median is defined as:

The middle of a sorted list of numbers.

So to calculate it you would have to implement the following logic:

Sort the input list
Check if the length of the list is even or odd
If even, the median is the average of two middle numbers
If odd, the median is the middle number

Luckily, there are libraries such as NumPy that do all the heavy lifting for you. Just take a look at the code below: the first 17 rows calculate the median by hand, and the last two rows use the power of NumPy to achieve the same:

Median calculation — https://gist.github.com/dradecic/7f295913c01172ffebe84052c8158703
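The gist itself is not reproduced here, so what follows is only a minimal sketch of the same comparison, not the original code. The function name `manual_median` and the sample numbers are made up for illustration.

```python
import numpy as np


def manual_median(values):
    """Return the median of a non-empty list of numbers (hand-written version)."""
    ordered = sorted(values)       # sort the input list
    n = len(ordered)
    mid = n // 2
    if n % 2 == 0:                 # even length: average the two middle numbers
        return (ordered[mid - 1] + ordered[mid]) / 2
    return ordered[mid]            # odd length: take the middle number


numbers = [7, 1, 5, 3, 9, 11]      # made-up sample data

manual_result = manual_median(numbers)

# The library version: a single call
numpy_result = np.median(numbers)

print(manual_result, numpy_result)
```

Both approaches give the same value for this list; the library call is simply shorter and already tested by many other people.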

As I said, this is only a trivial example of a mistake you probably haven’t made yourself. But just imagine how many lines of code you have written in vain because you didn’t realize that there’s already a library for that.

4. Master lambdas and List Comprehensions

Although they’re not specific to data science, I use list comprehensions all the time for things like feature engineering, and lambda functions for data cleaning and preparation.

Below is a simple example of feature engineering. Given a list of strings, you need to create a variable that equals 1 if the given string contains a question mark (?) and 0 otherwise. You can see how you could achieve this with and without list comprehensions (hint: they are a massive time saver):

List Comprehension Example — https://gist.github.com/dradecic/9f23eb0c8073ecc8957f8fd533388cef
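As a rough sketch of what such a comparison might look like (the sample strings and variable names here are invented, and this is not the code from the gist):

```python
# Made-up sample data
texts = ["How are you?", "Nice weather today.", "Is this a question?", "No"]

# Without a list comprehension: build the flag list step by step
has_question_loop = []
for text in texts:
    if "?" in text:
        has_question_loop.append(1)
    else:
        has_question_loop.append(0)

# With a list comprehension: the same logic in one line
has_question = [1 if "?" in text else 0 for text in texts]

print(has_question_loop)
print(has_question)
```

Both lists hold the same flags; the comprehension just expresses the loop and the condition in a single expression.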

And now for the lambdas. Let’s say you have a list of phone numbers in a format you don’t like: you want to replace ‘/’ with ‘-’. This is an almost trivial process, provided that your dataset is in a Pandas DataFrame:

Lambdas — https://gist.github.com/dradecic/68e81f6610b26fe8da68e25d217c5052
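A minimal sketch of this idea, assuming a DataFrame with a hypothetical `phone` column and made-up numbers (the gist may use different names and data):

```python
import pandas as pd

# Hypothetical data: the column name and the numbers are invented
df = pd.DataFrame({"phone": ["091/123/4567", "095/765/4321"]})

# Apply a lambda to every value in the column, replacing '/' with '-'
df["phone_clean"] = df["phone"].apply(lambda number: number.replace("/", "-"))

print(df)
```

For a plain string replacement like this one, pandas also offers the vectorized `df["phone"].str.replace("/", "-", regex=False)`; a lambda earns its keep once the cleaning logic no longer fits a single built-in string method.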

Take a moment to think about how you could apply those to your dataset. Cool, right?

5. Know your Stats

If you haven’t been living under a rock, you know the importance of statistics in data science. It’s a fundamental skill you must develop.

Let me quote Edureka:

Statistics is used to process complex problems in the real world so that Data Scientists and Analysts can look for meaningful trends and changes in Data. In simple words, Statistics can be used to derive meaningful insights from data by performing mathematical computations on it.[1]

What I’ve learned about statistics in my master’s so far is that you need to know it to be able to ask the right question.

If your stat skills are rusty, I would strongly suggest you check out the StatQuest channel on YouTube, more precisely this playlist on the basics of statistics:

Statistics Playlist

6. Learn Algorithms and Data Structures

There’s no point in being able to ask the right question (see point 5) if you can’t deliver the solution — right?

I’ve been guilty of neglecting algorithms and data structures because I thought that only software engineers should worry about those. I was terribly wrong, to say the least.

Now, I’m not saying that you must be able to code a binary search algorithm in your sleep, but a basic understanding will help you see a clearer picture of how to think in code, that is, how to write code that not only gets the job done, but gets it done as fast as possible.
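To make the "as fast as possible" part concrete, here is a small illustrative sketch (my own example data and function names, not taken from any course) that finds a value in a sorted list two ways. The linear scan may have to look at every element, while binary search halves the remaining range at each step.

```python
def linear_search(items, target):
    """Check elements one by one; return the index of target or -1."""
    for index, item in enumerate(items):
        if item == target:
            return index
    return -1


def binary_search(sorted_items, target):
    """Repeatedly halve the search range; requires sorted input."""
    low, high = 0, len(sorted_items) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_items[mid] == target:
            return mid
        if sorted_items[mid] < target:
            low = mid + 1          # target can only be in the upper half
        else:
            high = mid - 1         # target can only be in the lower half
    return -1


data = list(range(0, 1_000_000, 2))   # made-up sorted data: even numbers

print(linear_search(data, 123456))
print(binary_search(data, 123456))
```

Both functions return the same index, but on 500,000 elements the binary search needs about 20 comparisons. In real code, Python’s standard `bisect` module already covers this, which ties back to point 3.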

For a person without a computer science background, I would strongly recommend this course:

Learn Python for Data Structures, Algorithms & Interviews
PLEASE NOTE: IF YOU ARE A COMPLETE BEGINNER TO PYTHON, CHECK OUT MY OTHER COURSE: COMPLETE PYTHON BOOTCAMP TO LEARN…

Also, make sure to check out the interview questions — they help, A LOT!

7. Go Beyond the Scope

Always be that person who works the hardest. It pays off.

At least in my case, my group was evaluated based on initial performance in one of the classes. It wasn’t about who knew the most, because that would be a stupid thing to measure in the first semester; it was about who showed work ethic and discipline.

As I wasn’t working full time then, I worked my ass off for this project. Because I did, and others didn’t, I was appointed to a full-scale data science project, which will last for two years and will serve for my master’s thesis.

And yeah, I’m able to put that on my CV.

So, was sacrificing a couple of weeks of my personal life worth it? Judge for yourself, but I would say that it was.

References

[1] https://www.edureka.co/blog/math-and-statistics-for-data-science/

Bio: Dario Radečić is a 22-year-old student of Data Science, who has also been working in the field for a while. Writer at Medium and Towards Data Science.

Original. Reposted with permission.
