How to Become a Data Scientist: The Definitive Guide
Data science educator Jose Portilla provides this definitive guide on becoming a data scientist, which includes everything from resources for acquiring specific skills, to searching for the first job, to mastering the interview.
By Jose Portilla, Udemy Data Science Instructor.
How to become a Data Scientist
Hi! I’m Jose Portilla and I’m an instructor on Udemy with over 250,000 students enrolled across various courses on Python for Data Science and Machine Learning, R Programming for Data Science, Python for Big Data, and many more.
Almost every day a student will ask me some form of this question:
What should I do to become a data scientist?
In this post, I’ll try my best to answer this question and point to resources that can help guide you to an answer. Hopefully this post will also serve as something I can quickly link my students to :)
I’m also currently writing a book on acing data scientist interviews! Check it out here.
Now on to the rest of this post! I’ve broken down the steps into some key topics and discussed helpful details for each.
“The secret of getting ahead is getting started.” — Mark Twain
If you are interested in becoming a data scientist the best advice is to begin preparing for your journey now! Taking the time to understand core concepts will not only be very useful once you are interviewing, but it will also help you decide whether you are truly interested in this field.
Before starting on the path to becoming a data scientist, it’s important that you are honest with yourself about why you want to do this. There are probably some questions you should ask yourself:
- Do you enjoy statistics and programming? (Or at least what you’ve learned so far about them?)
- Do you enjoy working in a field where you need to constantly be learning about the latest techniques and technologies in this space?
- Are you interested in becoming a data scientist, even if it just paid an average salary?
- Are you okay with other job titles (e.g. Data Analyst, Business Analyst, etc…)?
Ask yourself these questions and be honest with yourself. If you answered yes, then you are on your way to becoming a data scientist!
The path to becoming a data scientist will most likely take you some time, depending on your previous experience and your network. Leveraging these two can help place you in a data scientist role faster, but be prepared to always be learning! Let’s now jump to discussions on some more tangible topics!
“Do not worry about your difficulties in Mathematics. I can assure you mine are still greater.” — Albert Einstein
The main topics concerning mathematics that you should familiarize yourself with if you want to go into data science are probability, statistics, and linear algebra. As you learn more about other topics such as statistical learning (machine learning) these core mathematical foundations will serve as a base for you to continue learning from. Let’s briefly describe each and give you a few resources to learn from!
Probability — the measure of the likelihood that an event will occur. A lot of data science is based on attempting to measure the likelihood of events, everything from the odds of an advertisement getting clicked on to the probability of failure for a part on an assembly line.
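As a rough illustration of the ad-click example, here is a minimal sketch that simulates impressions and estimates the click probability from the observed fraction. The 3% click rate, the function name, and the sample size are all made up for illustration; real click data would replace the simulation.

```python
import random


def estimate_click_probability(true_rate, n_impressions, seed=0):
    """Simulate ad impressions and return the observed fraction of clicks."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    clicks = sum(1 for _ in range(n_impressions) if rng.random() < true_rate)
    return clicks / n_impressions


# Hypothetical 3% click-through rate, chosen only for illustration.
TRUE_RATE = 0.03
estimate = estimate_click_probability(TRUE_RATE, n_impressions=100_000)
print(f"estimated click probability: {estimate:.4f}")
```

With more simulated impressions the estimate settles closer to the underlying rate, which is the intuition behind measuring likelihood from data.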
For this classic topic I recommend going with a book, such as A First Course in Probability by Sheldon Ross or Probability Theory by E.T. Jaynes. Since these are textbooks, they can be quite expensive if you buy them new directly from Amazon, so I suggest looking at used copies online or at PDF versions to save yourself some money!
If you prefer learning through a video format, you can also check out Khan Academy’s video series on probability or MIT’s OpenCourseWare lectures on Probability and Statistics. Both can be found easily for free on YouTube with a simple search.
Statistics — Once you have a firm grasp on probability theory you can move on to learning about statistics, which is the general branch of mathematics that deals with analyzing and interpreting data. Having a full understanding of the techniques used in statistics requires you to understand probability and probability notation!
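To make "analyzing and interpreting data" a little more concrete, here is a small sketch using only Python's built-in statistics module. The page-load times are hypothetical numbers invented for the example.

```python
import math
import statistics

# Hypothetical sample: page-load times in seconds.
load_times = [1.2, 0.9, 1.5, 1.1, 1.3, 0.8, 1.4, 1.0]

mean = statistics.mean(load_times)
median = statistics.median(load_times)
sample_sd = statistics.stdev(load_times)  # sample standard deviation (divides by n - 1)
std_error = sample_sd / math.sqrt(len(load_times))  # standard error of the mean

print(f"mean={mean:.3f} median={median:.3f} sd={sample_sd:.3f} se={std_error:.3f}")
```

The standard error is where probability comes back in: it describes how much the sample mean would be expected to vary from one sample to the next.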
Again, I’m more of a textbook person, and fortunately there are two great online textbooks that are completely free for you to reference:
If you prefer more old-school textbooks, I like Statistics by David Freedman. I would suggest using this book as your main base and then checking out the other resources listed here for deeper dives into other topics (like ANOVA).
For practice problems I really enjoyed using the Schaum’s Outlines series (you can find books in this series for both Probability and Statistics).
If you prefer video, check out Brandon Foltz’s great series on statistics on YouTube!
Linear Algebra — the branch of math that covers the study of vector spaces and linear mappings between these spaces. It’s used heavily in machine learning, and if you really want to understand how these algorithms work, you will need to build a basic understanding of linear algebra.
I recommend checking out Linear Algebra and Its Applications by Strang. It’s a great textbook that is also used in the MIT Linear Algebra course you can access via OpenCourseWare! With these two resources you should be able to build a solid foundation in linear algebra.
Depending on your position and workflow, you may not need to dive very deep into some of the more complex details of linear algebra. Once you get more familiar with programming, you’ll see that libraries tend to handle a lot of the linear algebra tasks for you. But it is still important to understand how these algorithms work!
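As one sketch of what "libraries handle the linear algebra for you" can look like, the example below fits a straight line two ways with NumPy: by solving the normal equations by hand and by calling the library's least-squares routine. The data points are made up and lie exactly on y = 2x + 1, so both approaches should recover the same slope and intercept.

```python
import numpy as np

# Hypothetical data lying exactly on the line y = 2x + 1.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the slope, one column of ones for the intercept.
A = np.column_stack([x, np.ones_like(x)])

# Solve the normal equations (A^T A) w = A^T y by hand...
w_manual = np.linalg.solve(A.T @ A, A.T @ y)

# ...or let the library's least-squares routine do the same job.
w_library, *_ = np.linalg.lstsq(A, y, rcond=None)

print("manual :", w_manual)
print("library:", w_library)
```

Knowing the normal equations tells you what the library call is doing, even if you rarely write them out yourself.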
“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.” — Bill Gates
The data science community has largely adopted R and Python as its main programming languages. Other languages such as Julia and MATLAB are used as well, but R and Python are by far the most popular in this space.
In this section I’m going to describe some of the main basic topics of programming and data science, and then point out the main libraries used for both R and Python!
Your choice of development environment depends heavily on personal preference, so I’m just going to briefly describe some of the more popular options (IDEs) for data science with R and Python.
Python — Since Python is a general-purpose programming language, lots of options are available! You could just use a plain text editor such as Sublime Text or Atom and then customize it to your liking; I personally use this approach for larger projects. Another popular IDE for Python is PyCharm from JetBrains, which provides a free community edition that has plenty of features for most users.
My favorite environment for Python has to be the Jupyter Notebook, previously known as the IPython Notebook. This notebook environment uses cells to break up your code and provides instant output, so you can interact with the code and visualizations easily! Jupyter Notebook supports many kernels, including Scala, R, Julia, and more. Python is by far the best supported of these, although support for the other languages improves all the time! Jupyter notebooks are extremely popular in the field of data science and machine learning. I use them for all my Python courses and most students have really enjoyed them. While probably not the best solution for larger projects that need to be deployed, Jupyter is fantastic as a learning environment.
As far as getting Python installed on your computer, you can always use the official source, python.org, but I usually suggest using the Anaconda distribution, which comes with many of the packages I’ll discuss in this section!
R — RStudio is probably the most popular development environment for R. It has a great community behind it, and its basic full version is completely free. It displays visualizations well, gives you lots of options for customizing your experience, and a lot more. It is pretty much my go-to for anything with R! Jupyter Notebooks also support R kernels, and while I have used them, I have found the experience lacking compared to Jupyter Notebook’s capabilities with Python.
Python — For data analysis, two libraries are the main workhorses of Python: NumPy and pandas. NumPy is a numerical scientific computing package that serves as the base for almost all the other packages in the Python data science ecosystem. Pandas is a data analysis library built directly on top of NumPy and designed to mimic many of the built-in features of R, such as data frames! You can think of it as a super version of Excel that allows you to quickly clean and analyze data. If you become a data scientist who uses Python, pandas will quickly become one of your main tools! It is personally my favorite Python library! I would also recommend checking out SciPy for details and links for the libraries in the PyData ecosystem.
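Here is a minimal sketch of the "clean and analyze" workflow with pandas. The sales table, its column names, and the choice to fill the gap with the median are assumptions made for the example, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical sales records with one missing unit count.
df = pd.DataFrame({
    "region": ["east", "west", "east", "west", "east"],
    "units": [10, 7, np.nan, 12, 5],
    "price": [2.5, 3.0, 2.5, 3.0, 2.5],
})

# Clean: fill the missing unit count with the column median.
df["units"] = df["units"].fillna(df["units"].median())

# Analyze: add a revenue column and total it per region.
df["revenue"] = df["units"] * df["price"]
revenue_by_region = df.groupby("region")["revenue"].sum()

print(revenue_by_region)
```

The same three steps (load, clean, aggregate) cover a surprising share of day-to-day analysis work.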
R — For the most part R already comes with a lot of data analysis features built in, such as data frames! But the R community has also created a lot of useful packages to help you deal with data even more efficiently! Together these packages are known as the “tidyverse”, a collection of packages for data science that all share a similar philosophy of working with data, meaning that they all work very well together. These packages include dplyr for data manipulation, tidyr for cleaning your data, readr for reading in data, and packages like purrr and tibble, which improve some built-in functionality of R. Learning the tidyverse packages is a must for a data scientist using R! ggplot2 is also part of the tidyverse, but is for data visualization, so let’s jump to that topic next!
Python — The “grandfather” of visualization with Python is matplotlib. Matplotlib was created to provide a visualization API for Python reminiscent of the style used in MATLAB. If you have used MATLAB for visualization before, the transition will feel very natural. However, due to its huge library of capabilities, a lot of other visualization libraries have been built on top of matplotlib in an attempt to simplify things or provide more specific functionality!
Seaborn is a great statistical plotting library that works very well with pandas and is built on top of matplotlib. It creates beautiful plots with just a few lines of code.
Pandas also comes with built-in plotting capabilities built off of matplotlib!
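The sketch below draws the same made-up data twice, once with plain matplotlib and once with the plotting method built into pandas, to show how the two relate. The monthly visit counts are hypothetical, and the "Agg" backend is only there so the script runs without a display; in a Jupyter notebook you would typically leave that line out.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly site visits.
df = pd.DataFrame({"month": [1, 2, 3, 4], "visits": [120, 150, 170, 160]})

fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# Plain matplotlib: a line plot drawn directly on an Axes object.
ax_left.plot(df["month"], df["visits"], marker="o")
ax_left.set_title("matplotlib")
ax_left.set_xlabel("month")
ax_left.set_ylabel("visits")

# pandas: DataFrame.plot draws onto a matplotlib Axes under the hood.
df.plot(x="month", y="visits", kind="bar", ax=ax_right,
        legend=False, title="pandas .plot")

fig.tight_layout()

# Save to an in-memory buffer; passing a filename such as "visits.png" also works.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```

Because pandas hands back ordinary matplotlib objects, anything you learn about matplotlib axes and figures carries over.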
R — By far the most popular plotting library for R is ggplot2. Its design philosophy and its layer-based API make it easy to use and allow you to make basically any major plot you can think of! What is also great is that it works easily with Plotly, allowing you to quickly convert ggplot2 graphs into interactive visualizations through the use of ggplotly!
Python — Scikit-learn is the most popular machine learning library for Python, with built-in algorithms and models for classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. If you are more interested in building statistical inference models (such as analyzing p-values after a linear regression), you should check out statsmodels, which is also a great choice for working with time series data! For Deep Learning, check out TensorFlow, PyTorch, or Keras. I recommend Keras for beginners due to its simplified API. For Deep Learning topics you should always reference the official documentation, as this is a field that changes very fast!
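As a minimal sketch of a scikit-learn classification workflow, the example below trains a logistic regression on the iris dataset that ships with the library. The model, the 25% test split, and the random seed are arbitrary choices for illustration; other estimators follow the same fit and score pattern.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Small built-in dataset: 150 flowers, 4 measurements each, 3 species.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit a classifier on the training rows only.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Fraction of held-out flowers classified correctly.
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.2f}")
```

Evaluating on held-out data rather than the training rows is the habit worth building early, whichever library you end up using.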
R — One of the issues with R for beginner data scientists is that it has a huge variety of options for packages when it comes to machine learning. Each major algorithm can have its own separate packages, each with a different focus. When you are starting out I recommend first checking out the caret package, which provides a nice interface for classification and regression tasks. Once you’ve moved on to unsupervised learning techniques such as clustering, your best bet is to do a quick Google search to see which packages are the most popular for whatever technique you plan to use. You’ll even discover that R already has some of the basic algorithms built in, such as kmeans clustering.
Where to learn these libraries and skills?
I teach these topics in full, and you can check out the courses for 95% off by using the links below.
My Python for Data Science and Machine Learning Bootcamp:
Learn how to use NumPy, Pandas, Seaborn, Matplotlib, Plotly, Scikit-Learn, Machine Learning, TensorFlow, and more! (www.udemy.com)
My course on R for Data Science, Visualization, and Machine Learning:
Learn how to use the R programming language for data science and machine learning and data visualization! (www.udemy.com)
Now that we’ve gone over the general background of programming topics, let’s discuss the path to actually landing a data science job!