My Data Science Learning Journey So Far
These are some obstacles the author faced in their data science learning journey in the past year, including how much time it took to overcome each obstacle and what it has taught the author.
By Arnuld on Data, freelance Data Scientist
Eric Weber (yes, that nice-looking guy with a lovely dog) recently wrote a post on LinkedIn about 10 things he wished he had done less of when he started his data science career. This post is my journey through those 10 points. First, you should go ahead and read his post.
First things first, this is not going to be a “content” post.
There are so many articles and blog-posts on that already, so check them out. Here we will talk about your focus and direction when it comes to your desire to become a data scientist and get noticed by the industry.
1) Thinking I needed to learn everything
Yeah, this one takes lots of your time and energy. This obstacle is one you should deal with right away. I struggled with it in the beginning but in a few months, it died down. I attribute this breakthrough to my daily reading habit.
I keep reading LinkedIn posts (especially from Eric Weber himself). I also read a lot on Towards Data Science, Medium, KDnuggets, and individual blogs by data scientists and machine learning engineers, for an hour or two or more every single day. This has taught me what matters about data science in industrial work: how much value you add to an organization with your skill set. You create value by building something you are interested in or by building something that solves a problem. Asking yourself what value you want to add gives you an idea of what to learn and what to skip.
It took me several months to realize this (I guess 6 months). I will add these months together as we progress point by point to see how much time we could have saved.
2) Prepping for interview trivia.
Yes, this is another struggle, for two main reasons:
- There is no agreed-upon definition of what a data scientist is, only a vague idea of the job responsibilities and how they differ from those of a data analyst or a machine learning engineer.
- Then there are confusing job descriptions. Since there is no agreed-upon definition of a data scientist, you will see descriptions that want you to be a master of everything: machine learning, software engineering, Python, R, years of Statistics, Calculus, Linear Algebra, Big-O, and whatnot. Looking at the job description, you feel like you need to be 50+ to apply for the jobs.
Don’t fall for this. Don’t take a job description to heart. “Interview trivia” mostly comes from the newness of data science combined with poor communication between the talent acquisition, data science, and software engineering teams in an organization. Rather than feeling overwhelmed by it, focus on how to crack it.
One way to crack this is by looking at reality. If you know any real-life data scientists, data analysts, and machine learning engineers (offline, in the physical world), it will be a great idea to talk to them about their work. If you don’t know anyone then you can always check blogs and articles.
I don’t know any professional in this field offline, so I learned by reading blogs and articles. What I learned is that companies interview many people who “know” stuff but very few who have “built” stuff. So focus on building stuff rather than on mere learning and education (deployment and production, for example, are two major things). It took me 5–6 months to realize this.
6 + 6 = 12 months so far
3) Trying to emulate someone else’s path
Aha, this is my favorite :-) because this is where I wasted most of my time:
- Tetiana Ivanova landed a job in 6 months
- Kelly Peng landed a job in one year after she quit her data analyst job
- Natassha Selvaraj landed a job and she is studying in college
- Mikko Koskinen did not even plan to become a data scientist
- Thomas Hepner felt lost at anything other than the Titanic dataset, and a year later he landed a job as a data scientist in the industry
Look at my profile: I have 4.5 years of experience in software development (C language) and have been doing data science for 8 months now, and I am still nowhere near answering this question:
What’s your favorite machine learning algorithm and why?
Yes, I agree my case looks like the worst case of Big-O: O(n^n)
I have read hundreds and thousands (no, I am not exaggerating) of blog posts and articles by people who have landed data science jobs and changed industries. I traced and emulated their data science journeys in my own life, from their thinking patterns to their choice of courses, even their choice of certain chapters in certain books, like a perfect carbon copy. And I still failed to answer the question above because I don’t even know why I would like one machine learning algorithm over another. After all, I was just mindlessly chewing through all the models in the name of “becoming like them”.
Two days ago I gave it up and decided to follow what I think I should do. (Surprisingly, I came across Eric’s post today. It is as if the Universe is trying to tell me I am on the right path, a path that belongs to me.)
I think each of us has to personalize our journey. Our environment, our talent, our experience, our attitude, our work ethic, our backgrounds, and our learning capacities are all different and unique. That is maybe why tracing someone else’s path never works.
So I decided I will experiment and carve my own path to becoming a data scientist. This is not to say that I will stop reading other people’s journeys. I will still read them, but instead of following them blindly and trying to copy them into my life, I will use them as a compass, as a guiding mechanism. This has cost me 8 months. Better late than never though.
6 + 6 + 8 = 20 months
4) Focusing on perfect solutions.
My computer programming experience took care of this. I spent half a decade programming in the industry, writing code to generate money for my employers, and that already taught me that “done” is better than “perfect”. Finding a problem someone is facing and building a solution is the only thing that really matters. Mere learning and education don’t.
6 + 6 + 8 + 0 = 20 months
5) Learning advanced stats I rarely used
Back in 2018, I spent a lot of time learning Mathematics and Statistics for data science. I spent 4 months studying:
- Algebra I and II at Khan Academy
- College Level Algebra and Problem Solving from Arizona State University at edX
- MIT Big Picture Calculus from YouTube
- Calculus Made Easy by Silvanus P. Thompson. Available for free from Project Gutenberg
- Calculus 1A: Differentiation from MIT at edX.
- Limits and Integral Calculus from Calculus-1 at Khan Academy.
- Reading different books on Statistics to get a statistical mindset
What a mistake it was :-( . From what I know today, all I needed was this:
- Basics of Statistics. Not Statistics per se but only the topics specifically necessary for Machine Learning and Data Analysis
- Basics of Bayes Theorem
- Basics of Linear Algebra (only a few small things like matrix multiplication and transposing, etc.)
- Basics of Big-O Notation (Check out Interview Cake’s Explanation)
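As an illustration of how small these basics are in practice, here is a minimal sketch of Bayes' theorem, matrix multiplication, and transposing. The numbers, the `posterior` helper, and the variable names are hypothetical and chosen only for this example; NumPy is assumed to be installed.

```python
import numpy as np


def posterior(prior, likelihood, false_positive_rate):
    """Return P(condition | positive test) using Bayes' theorem."""
    # P(positive) = P(pos | cond) * P(cond) + P(pos | no cond) * P(no cond)
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence


# Hypothetical screening test: 1% prevalence, 95% sensitivity,
# 5% false-positive rate.
p = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")

# Linear algebra: 3 samples (rows) with 2 features (columns).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])   # made-up weights of a linear model

predictions = X @ w         # matrix-vector product, shape (3,)
gram = X.T @ X              # transpose times matrix, shape (2, 2)
print(predictions)
print(gram)
```

Even with a 95% sensitive test, the posterior stays low because the condition is rare; that kind of intuition is arguably more useful day to day than memorizing the formula.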
Yes, nothing fancy, only the basics. You can do all the fancy stuff after you land a job. Until then, use Python or R libraries. Instead of trying to learn mathematical formulas the way you did in school or college, learn how to apply them through library calls in Python (e.g. calculating a t-test using Scipy), and learn the math needed to understand the result:
3.1. Statistics in Python - Scipy lecture notes (covers, among other things, a simple linear regression on two sets of observations, x and y)
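For instance, a two-sample t-test through a library call might look like the sketch below. The two groups are made-up measurements, and Welch's variant (`equal_var=False`) is a choice made for this example, not something the lecture notes prescribe.

```python
from scipy import stats

# Hypothetical measurements from two groups (illustrative numbers only).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3]
group_b = [6.2, 6.8, 6.5, 7.1, 6.9, 6.4]

# Welch's t-test: compares the two means without assuming equal variances.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("The difference in means is significant at the 5% level.")
else:
    print("No significant difference at the 5% level.")
```

The math worth learning here is what the t statistic and the p-value mean, not how to compute them by hand.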
Well, there went 8–10 months:
6 + 6 + 8 + 0 + 10 = 30 months
6) Thinking the R vs. Python debate required picking just 1.
I struggled with this one:
- Started with R for Data Science by Hadley Wickham. Read a few chapters and then gave up because I read Python is gaining ground in the industrial world.
- I started with Python and tried a few books and then I came back to R because ggplot looked better than matplotlib.
- Then I went back to Python because it had a more software engineering feel to it.
- Went back to R because tidyverse as a package looked much more mature for data analysis and visualization than Python’s tools.
This problem went away when I got a take-home assignment from a company that approached me for R-related work. After using both R and Python for the take-home assignment, I never wanted to touch R again. In my experience, Python is better suited to software engineering practices, and those practices are definitely needed when writing data science code for real-life industrial work. It is almost the same as doing software development. I went fully Python after that. Personally, if I ever have to use another language, I will use Julia instead. I spent around 4–6 months on this.
6 + 6 + 8 + 0 + 10 + 4 = 34 months
7) Spending lots of time thinking about unstructured data
I made this mistake after “the math mistake”. I spent months contemplating SQL vs NoSQL. We look at something, interpret it from our own viewpoint, and decide that is what it means. For example, we all know this is the age of data: millions and millions of megabytes of data are generated each day, and most of it is unstructured. So I guessed I should learn NoSQL. But then almost all of the job descriptions mentioned only SQL, so I would think of learning SQL instead.
I learned neither SQL nor NoSQL. This is how being in two minds about a thing kills months of your time.
Instead of interpreting things in my way, I started looking at the people who landed data science jobs and what they learned. All of them had listed SQL as a skill. So I switched to SQL. A good place to start is SQLBolt.
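To give a feel for the kind of SQL that keeps coming up, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 120.0), ("bob", 40.0), ("alice", 60.0), ("carol", 75.0)],
)

# Total spend per customer, keeping only customers above 50,
# sorted from the largest total to the smallest.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) > 50
    ORDER BY total DESC
    """
).fetchall()
conn.close()

for customer, total in rows:
    print(customer, total)
```

Filtering, grouping, and aggregating like this covers a large share of everyday analysis queries, which may be why SQL shows up in so many job descriptions.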
I won’t count any wasted time here because, even though I did not learn SQL or NoSQL during that period, I used the time to learn other stuff. So, the equation so far is:
6 + 6 + 8 + 0 + 10 + 4 + 0 = 34 months
8) Thinking about the tech, not the business
This is one area where you need a serious change in mindset, and I needed such a change too. My computer programming background makes me a 100% tech guy who really does not know how to be more than a team worker. Contributing to the team is where my social and communication skills ended.
I never knew this in the beginning, but thanks to my reading habit I came across many characteristics of data science that set it apart from other tech jobs. One way I work on this is by talking about Big Data with people I know or meet, and by explaining data science and machine learning concepts to my friends and other people. But because my freelance work and data science learning require me to spend a lot of time in front of my computer, I don’t get much opportunity to practice this.
Data science is not just programming, it is not just web development, and it is not just about analyzing data and building models. That is half of the story. The other half of data science is being able to communicate with people who are not so tech-savvy. Business stakeholders, decision-makers in management, and clients are three different types of non-tech people you are going to deal with. So collaborating with people is going to be a big pain if we think of it as “another tech job”. There is an excellent book on communicating data insights titled “Storytelling With Data” by Cole Nussbaumer Knaflic. It is kind of a must-read.
There is another side to this: business problems. The model you built, the comparisons you did, and the accuracy you achieved: how do they benefit the business? You see, a data scientist’s job has no meaning if they can’t bring some profit, benefit, or added value to the business. This is a hard thing to get hold of and become good at if you come from a tech background like mine. What the tech mentality does, in this case, is make your mind focus only on building the model and analyzing data, because that is what we do. We do not have a business context.
I don’t have a great solution for this because I never had any personal experience with it. So take my advice here with a grain of salt, and do your own research too. I could only read blogs, posts, and articles to understand what to do. I don’t know any product manager either (I have met one or two managers in IT services, but I don’t know if that qualifies). The only method I have come across to solve this is two-fold:
- Read about case-studies, product case-studies. This is what a product manager does. So if you know any product manager (or even a project manager) you should talk to them about how their product/project brought value to the company.
- Read books like Cracking the PM Interview by Gayle Laakmann McDowell and Jackie (Bodine) Bavaro
Not understanding this makes you work long and hard on your tech skills alone if you are a programmer or a software developer. That wasted 6 months:
6 + 6 + 8 + 0 + 10 + 4 + 0 + 6 = 40 months
9) Trying to keep up with all the papers
Another pitfall you need to avoid. I got stuck in this for a while. I still want to implement a paper or two myself, but now my first focus is always on “building something”. Learn only as much as you need to start building something.
Yes, all those papers look really, really impressive and beautiful. But papers are mostly about academia, and you are trying to land a job in the industry. Academia and industry do not match, with two possible exceptions:
- You are looking for a research position within the industry. In this case, your options will be limited to only 10–20% of the employers.
- You want to work for the big 4 a.k.a Facebook, Amazon, Google, and Microsoft.
Except for the above, I don’t see any point in drifting from my focus of landing a data scientist position at a good tier I or II company. Don’t get me wrong, I love doing research. In fact, back in college, I wanted to do a Ph.D. in microkernel research. But research work takes a hell of a lot of time and energy. I think a better way to live is to find balance in your career: a balance between your interests and the needs of the market and industry. Avoid falling on either side.
Instead of keeping up with all the papers, a better way to balance your learning is:
- Learn the basics of data cleaning using Pandas (Kaggle datasets have done 90% of the work for you. In real life, you have to do all the cleaning yourself. Learn to scrape some data and clean it)
- Learn the basics of machine learning modeling and why we choose one model over another: what kinds of models fit what kinds of domain problems, e.g. healthcare vs finance
- Learn how to deploy a model into production (you will get a feel for real work when you use Streamlit, Heroku, and Voila. I have implemented the bear-detection model using Voila here.)
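As a rough idea of what the first item involves, here is a minimal Pandas cleaning sketch. The `raw` table, its column names, and the specific cleaning steps are hypothetical and picked only to show common operations; real scraped data will need its own rules.

```python
import pandas as pd

# Hypothetical scraped records with typical problems: stray whitespace,
# inconsistent casing, a duplicated row, a missing name, and a
# non-numeric age placeholder.
raw = pd.DataFrame({
    "name": [" Alice ", "bob", "bob", None],
    "age": ["34", "n/a", "n/a", "29"],
    "city": ["Pune", "delhi ", "delhi ", "Mumbai"],
})

clean = (
    raw.drop_duplicates()              # remove exact duplicate rows
       .dropna(subset=["name"])        # drop rows that have no name
       .assign(
           name=lambda d: d["name"].str.strip().str.title(),
           city=lambda d: d["city"].str.strip().str.title(),
           # Turn unparseable ages such as "n/a" into NaN.
           age=lambda d: pd.to_numeric(d["age"], errors="coerce"),
       )
       .reset_index(drop=True)
)

print(clean)
```

Whether to drop, fill, or keep missing values depends on the problem; this sketch simply leaves the unparseable age as missing.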
6 + 6 + 8 + 0 + 10 + 4 + 0 + 6 + 10 = 50 months
10) Believing there was only one way to do something
This one is a biggie. I think I will be struggling with this for life. Some people have it and some people don’t. I am inclined to say that maybe smart people don’t have this problem (the smart ones I have met or read about don’t). People like me spend a lifetime trying to beat it. It is a jail, trust me. It is quite frustrating to live with a mindset of “only one way to do something”. If you look at real-life stories, ideas don’t have any limits.
This is more of a personal-development obstacle than a technical one: no matter which field you work in, it will show up there, and it has absolutely nothing to do with tech. I am still working on it. The solution I have found so far is this: when I can’t find my way around a problem, I get off the machine and go for a walk if it is evening, read a completely unrelated book (some non-fiction, for example) if it is not, or go on a motorcycle ride and completely forget about the problem. Then I come back later and try to learn the same thing from a different article or blog post, without referring back to the original point where I was stuck. Just a fresh new perspective on the same problem from someone else.
I can’t put any time limit on this. I have struggled with this all my life:
6 + 6 + 8 + 0 + 10 + 4 + 0 + 6 + 10 + Life = 50 + Life
So, I wasted almost 50 months?
All of these points overlap with each other when it comes to where I wasted time. It is actually 12 months, from Dec 2019 to Nov 2020. For a few months in the beginning, I did not even know what I needed to do. Things started making sense only in March 2020. I think I could have saved 4–6 months if things had been clearer to me, but this is just a wild guess. Some really smart people have told me: it takes whatever time it takes to break down the obstacles. Let me reiterate:
Each of us has a personal data science journey. Our environment, our talent, our experience, our attitude, our work ethic, our backgrounds, and our learning capacities are all different and unique. That is maybe why tracing someone else’s path never works. That is why you need to keep pushing yourself to learn what you can, to keep yourself informed of what is going on in the industry, and to keep correcting your path (just like map apps on our smartphones keep correcting us and showing the way).
BONUS — Your Mental Outlook
I was trying to learn neural networks before I could comprehend what kind of problems logistic regression fits better than linear regression. I was doing deep learning before machine learning made any sense. In my case it was because of:
- Media-hype about AI and deep learning
- My focus on building something great and truly impressive
- The assumption that everyone is doing it and I need to do better than them if I want to land a job. After all, the market is so competitive.
- Focusing on the big 4
- I have an interest in healthcare data and Practical Deep Learning for Coders has chapters on medical imaging diagnosis. You can see one example here.
Deep learning and AI are everywhere in the media. We tend to think we need to be better than everyone else, while others are already writing highly mathematical blog posts with flashy formulas and lots of code. Don’t believe me? Check this out then. Who will approach us when such people have already mastered deep learning and data science?
Yeah, it is so common that it has a name: “Imposter Syndrome”. Go read about it a bit. I thought I was the only one suffering from it, but then I realized how common it is. Yes, the market is competitive, and because of the current pandemic many have lost jobs. I have seen posts on LinkedIn from several data scientists and machine learning engineers who have lost their jobs. I have even seen them literally begging people to “like and share” that they are looking for a job. It is heartbreaking to see that. Everyone deserves a good life.
Let’s look at the positive side. This pandemic has disrupted the world: it has brought many businesses to a halt, while some businesses have seen their client numbers shoot sky-high (podcast and video conferencing services, for one). In such disruptive times, we need to be more resilient to pain and suffering and find ways to strengthen our resolve. I believe it is not by chance that we were born in a certain year and ended up in the middle of this pandemic. I think we are supposed to learn from it and make a better life out of these times. I wish you good luck in your data science learning journey, and I hope we keep on learning from each other to make ourselves better.
Bio: Arnuld is an industrial software developer with 5 years of experience working in C, C++, Linux, and UNIX. After transitioning to Data Science and working as a data science content writer for over a year, Arnuld currently works as a freelance data scientist.
Original. Reposted with permission.