How to Learn Python for Data Science the Right Way

The biggest mistake you can make while learning Python for data science is to learn Python programming from courses meant for programmers. Avoid this mistake, and learn Python the right way by following this approach.



The right way to learn Python for data science

Most aspiring data scientists begin to learn Python by taking programming courses meant for developers. They also start solving Python programming challenges on websites like LeetCode, with the assumption that they have to get good at programming concepts before they can start analyzing data using Python.

This is a huge mistake, because data scientists use Python for retrieving, cleaning, and visualizing data and for building models, not for developing software applications. Therefore, you should focus most of your time on learning the modules and libraries in Python that perform these tasks.

Follow these incremental steps to learn Python for data science.

Configure your programming environment

The Jupyter Notebook is a powerful programming environment for developing and presenting data science projects.

The simplest way to install Jupyter Notebook on your computer is to install Anaconda. Anaconda is the most widely used Python distribution for data science and comes pre-loaded with the most popular libraries.

You can go through the blog post titled "A Beginner’s Guide to Installing Jupyter Notebook Using Anaconda Distribution" to learn how to install Anaconda. While installing Anaconda, choose the latest Python 3 version.

After installing Anaconda, go through this article on Codecademy to learn how to use Jupyter Notebooks.

Learn just the basics of Python

Codecademy has an excellent course on Python that takes approximately 20 hours to complete. You don’t have to upgrade to the Pro version, as your goal is just to get familiar with the basics of the Python programming language.
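
To give you a sense of the level to aim for, here is a minimal sketch (my own illustration, not from the course) of the kind of basic Python that is enough before moving on: variables, lists, functions, loops, and dictionaries.

```python
# A quick taste of the Python basics you need before moving on.
prices = [19.99, 5.49, 3.75, 12.00]

def average(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# Build a dictionary of made-up labels mapped to rounded prices.
catalog = {f"item_{i}": round(p, 1) for i, p in enumerate(prices)}

print("Average price:", average(prices))
print("Catalog:", catalog)
```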

Learn NumPy and Pandas

Pure Python is slow for numerically heavy algorithms and for handling large amounts of data. You might then ask: why is Python the most popular programming language for data science?

The answer is that Python makes it easy to offload number-crunching tasks to lower layers written in C or Fortran. That is exactly what NumPy and Pandas do.

First, you should learn NumPy. It is the most fundamental module for scientific computing in Python. NumPy provides support for highly optimized multidimensional arrays, which are the basic data structure behind most machine learning algorithms.
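
As a minimal sketch (synthetic data, my own illustration), here is how NumPy arrays let you vectorize a computation instead of writing a Python loop:

```python
import numpy as np

# A 3x4 array of random numbers: rows could be samples, columns features.
data = np.random.rand(3, 4)

# Vectorized operations run in optimized compiled code, not a Python loop.
col_means = data.mean(axis=0)                      # mean of each column
standardized = (data - col_means) / data.std(axis=0)

print(standardized.shape)                          # (3, 4)
```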

Next, you should learn Pandas. Data scientists spend most of their time cleaning data, which is also called data munging or data wrangling.

Pandas is the most popular Python library for manipulating data. It is built as an extension of NumPy, and its underlying code uses the NumPy library extensively. The primary data structure in Pandas is called a DataFrame.
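
For example, here is a minimal sketch of typical data-wrangling steps on a DataFrame (the column names and values are made up for illustration):

```python
import numpy as np
import pandas as pd

# A small, made-up dataset with a missing value.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "sales": [250.0, np.nan, 310.0, 180.0],
})

df["sales"] = df["sales"].fillna(df["sales"].mean())  # impute the missing value
summary = df.groupby("city")["sales"].sum()           # aggregate by group

print(summary)
```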

Wes McKinney, the creator of Pandas, has written a fantastic book called "Python for Data Analysis". Go through chapters 4, 5, 7, 8, and 10 to learn Pandas and NumPy. These chapters cover the most frequently used NumPy and Pandas features for manipulating data.

Learn to visualize data using Matplotlib

Matplotlib is the fundamental Python package for creating basic visualizations. You must learn how to use it to create some of the most common charts: line charts, bar charts, scatter plots, histograms, and box plots.
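
As a minimal sketch with made-up data, creating a couple of these basic charts looks like this:

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(10)
y = x ** 2
values = np.random.randn(1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot(x, y, marker="o")           # line chart
ax1.set_title("Line chart")
ax2.hist(values, bins=30)            # histogram
ax2.set_title("Histogram")

plt.tight_layout()
plt.show()
```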

Another good plotting library, built on top of Matplotlib and closely integrated with Pandas, is Seaborn. At this stage, I suggest you quickly learn just how to create the basic charts in Matplotlib and not focus on Seaborn.

I have written a four-part tutorial on how to develop basic graphs using Matplotlib.

Part one: Basic figures in Matplotlib

Part two: Controlling the style and color of a figure, such as markers, line thickness, line patterns, and color maps

Part three: Annotation, controlling the axis range, aspect ratio and coordinate system

Part four: Working with complex figures

Go through these tutorials to grasp the basics of Matplotlib.

A quick note: you don’t have to spend too much time learning Matplotlib, because many companies have started adopting tools like Tableau and Qlik for creating interactive visualizations.

How to use SQL and Python

In most organizations, data resides in databases. Therefore, you need to know how to retrieve data using SQL and perform the analysis in a Jupyter Notebook using Python.

Data scientists manipulate data using both SQL and Pandas, because certain data manipulation tasks are easier to perform in SQL while others are done more efficiently in Pandas. I personally like to use SQL for retrieving data and Pandas for the manipulation.

Today, companies use analytics platforms like Mode Analytics and Databricks to easily work with Python and SQL.

So, you should know how to use SQL and Python together efficiently. To learn that, you can install the SQLite database on your computer, store a CSV file in it, and analyze the data using Python and SQL. Here’s an amazing blog post that shows you how to do that: Programming with Databases in Python using SQLite.
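
To make the workflow concrete, here is a minimal sketch (the file, table, and column names are placeholders of my own): load a CSV into SQLite with Pandas, query it back with SQL, and keep analyzing in Pandas.

```python
import sqlite3
import pandas as pd

# Load a CSV file into a Pandas DataFrame (the path is a placeholder).
df = pd.read_csv("sales.csv")

# Store it as a table in a local SQLite database.
conn = sqlite3.connect("analysis.db")
df.to_sql("sales", conn, if_exists="replace", index=False)

# Retrieve data with SQL, then continue the analysis in Pandas.
query = "SELECT region, SUM(revenue) AS total FROM sales GROUP BY region"
result = pd.read_sql_query(query, conn)
print(result)

conn.close()
```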

Before you go through the above blog post, you should understand the basics of SQL. Mode Analytics has a good tutorial on SQL: Introduction to SQL. Go through their Basic SQL section to understand the fundamentals really well, as every data scientist should know how to retrieve data efficiently using SQL.

Learn basic Statistics with Python

Most aspiring data scientists jump straight into machine learning without learning the basics of statistics.

Don’t make that mistake, because statistics is the backbone of data science. On the other hand, aspiring data scientists who do study statistics often learn only the theoretical concepts instead of the practical ones.

By practical concepts, I mean knowing what sorts of problems can be solved with statistics and what challenges you can overcome using it.

Here are some of the basic Statistical concepts you should know:

Sampling, frequency distributions, mean, median, mode, measures of variability, basic probability, significance testing, standard deviation, z-scores, confidence intervals, and hypothesis testing (including A/B testing).
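
As a small sketch of a few of these concepts in code (synthetic data, using NumPy and SciPy rather than any particular course’s functions):

```python
import numpy as np
from scipy import stats

np.random.seed(0)
sample = np.random.normal(loc=100, scale=15, size=200)   # synthetic sample

print("mean:", sample.mean(), "median:", np.median(sample))
print("standard deviation:", sample.std(ddof=1))

# 95% confidence interval for the mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI:", ci)

# Two-sample hypothesis test (a simple A/B-test style comparison)
group_b = np.random.normal(loc=104, scale=15, size=200)
t_stat, p_value = stats.ttest_ind(sample, group_b)
print("t =", t_stat, "p =", p_value)
```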

A very good book for learning practical statistics is "Practical Statistics for Data Scientists: 50 Essential Concepts". Unfortunately for Python lovers like me, the code examples in the book are written in R. I recommend going through the first four chapters to understand the basic statistical concepts mentioned above; ignore the code examples and just focus on the concepts. The rest of the book mostly covers machine learning, which I talk about in the next section.

Most people recommend Think Stats for learning statistics with Python, but the author teaches his own custom functions instead of using standard Python libraries like StatsModels. That’s why I don’t recommend it.

After this, your goal is to implement in Python the basic concepts you have learned. StatsModels is a popular library for building statistical models in Python, and its website has good tutorials on how to implement statistical concepts using Python.
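
For instance, here is a minimal sketch of fitting an ordinary least squares regression with StatsModels on synthetic data (the variable names are made up for illustration):

```python
import numpy as np
import statsmodels.api as sm

np.random.seed(1)
ad_spend = np.random.uniform(0, 100, size=50)            # synthetic predictor
revenue = 3.0 * ad_spend + np.random.normal(0, 20, 50)   # synthetic response

X = sm.add_constant(ad_spend)        # add an intercept term
model = sm.OLS(revenue, X).fit()     # ordinary least squares

print(model.summary())               # coefficients, p-values, confidence intervals
```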

Alternatively, you can watch this video by Gaël Varoquaux, in which he shows how to perform inferential and exploratory statistics using Pandas and StatsModels.

Perform Machine Learning using Scikit-Learn 

Scikit-Learn is one of the most popular Machine Learning Libraries in Python. Your goal is to learn how to implement some of the most common machine learning algorithms using Scikit-Learn.

Here’s how to do that.

First, watch the week 1, 2, 3, 6, 7, and 8 videos of Andrew Ng’s Machine Learning course on Coursera. I skipped the sections on neural networks because, as a starting point, you should focus on the most widely used machine learning techniques.

Once you are done with that, read the book “Hands-On Machine Learning with Scikit-Learn and TensorFlow”. Just go through the first part of the book (around 300 pages). It is one of the most practical Machine Learning books available.

By doing the coding exercises in this book, you will learn how to implement the theoretical concepts you learned in Andrew Ng’s course using Python.
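
To show what implementing a common algorithm with Scikit-Learn looks like in practice, here is a minimal sketch using the built-in iris dataset and logistic regression (my own illustration, not taken from either resource above):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a small built-in dataset and split it into train and test sets.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Fit a simple classifier and evaluate it on held-out data.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

print("Test accuracy:", accuracy_score(y_test, predictions))
```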

Conclusion

Your final step is to do a data science project that covers all of the above steps. Find a data set you like and come up with interesting business questions that you can answer by analyzing it, but don’t choose a generic dataset like the Titanic one for your project. You can read "19 places to find free data sets for your data science project" to find data sets.

Another way is to apply data science to an area you are passionate about. For example, if you want to predict stock prices, you can scrape real-time data from Yahoo Finance, store it in a SQL database, and use machine learning to make the predictions, as in the rough sketch below.
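
Here is one way the pieces of such a project could fit together. The yfinance library, the ticker, and the one-lag feature are my own assumptions for illustration; the article only mentions Yahoo Finance.

```python
import sqlite3
import pandas as pd
import yfinance as yf                          # assumption: one common way to pull Yahoo Finance data
from sklearn.linear_model import LinearRegression

# 1. Fetch recent prices and store them in a SQL database.
prices = yf.Ticker("AAPL").history(period="6mo")[["Close"]].reset_index()
prices["Date"] = prices["Date"].astype(str)    # SQLite has no native datetime type
conn = sqlite3.connect("stocks.db")
prices.to_sql("prices", conn, if_exists="replace", index=False)

# 2. Retrieve the data with SQL and build a naive feature: yesterday's close.
df = pd.read_sql_query("SELECT Date, Close FROM prices ORDER BY Date", conn)
df["prev_close"] = df["Close"].shift(1)
df = df.dropna()

# 3. Fit a very simple model (illustration only, not a trading strategy).
model = LinearRegression().fit(df[["prev_close"]], df["Close"])
print("Predicted next close:", model.predict(df[["prev_close"]].tail(1)))

conn.close()
```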

If you are looking to transition to data science from a different industry, I recommend working on a project that leverages your domain expertise. I have given an in-depth explanation of this approach in my previous blog posts "A Step-by-Step Guide to Transitioning your Career to Data Science – Part 1" and "A Step-by-Step Guide to Transitioning your Career to Data Science – Part 2".

Let me know if you have any questions in the comments!
