Comprehensive Guide to Learning Python for Data Analysis and Data Science

Want to make a career change to data science using Python? Learning anything on your own can be a challenge, and a little guidance can be a great help. That is exactly what this article provides.

Python is widely used for data analysis and you might have considered learning it yourself (if not, or if you’re still looking for that bit of extra motivation to get started, see why you should be learning Python below). Of course, learning on your own can be a challenge and some guidance is always helpful. This article provides exactly that: guidance on learning Python for working with data.

We will discuss the steps you should take to learn Python, along with some essential resources, such as the free Python for Data Analysis courses and tutorials from DataCamp, as well as reading and learning materials.

Step 0: Reasons to Learn Python

Why learn Python as a data analytics tool?

  • It’s a Popular Data Analysis Tool: Firstly, by itself Python is one of the most popular tools for data analysis. With 35% of data scientists using Python, it is ahead of SQL and SAS, and behind only R.
  • General Purpose Programming: Although there are other popular and capable tools for analyzing data (e.g. R, SAS), Python is the only one of these that is a true general purpose programming language. Check out this infographic for a more thorough comparison.
  • Popular Programming Language: In addition, Python is one of the most popular programming languages when compared with other general purpose languages (e.g. Java, C++, PHP).
  • If that’s not enough, Python is also the language of choice for teaching computer science in top U.S. universities.

As a side note: as described in “R or Python? Consider learning both” we don’t recommend that you only learn Python and forget about the rest. However, learning Python is one of the best things you can do for your career. There are good reasons why Python is being adopted so widely by computer scientists, and why it’s a data analysis tool of choice for so many, the main one being the ease of learning and using Python. Nonetheless, it can be challenging to set a learning path, so that’s what we will do now.


Step 1: Setting up your Python Environment for Data Analysis

  • Download Python from Continuum Analytics: ANACONDA
  • Optional (Download Rodeo from Yhat): Rodeo IDE

Setting up your Python environment for performing data analysis is relatively simple. The most convenient way to go about this is to download the free Anaconda package from Continuum Analytics, as it contains the core Python language, as well as all of the essential libraries including NumPy, Pandas, SciPy, Matplotlib, and IPython. With the graphical installer, installing Python is as easy as installing any other computer program.

After installing, you will get a launcher containing a number of programs. The most important one is the IPython Notebook, which is also called the Jupyter Notebook. Once you launch the notebook, a terminal opens and a notebook opens in your browser. Don’t get confused here! You don’t need an internet connection to create or use notebooks. The browser is simply used instead of a separate program and serves as the environment where you can code.

However, you are not limited to using the browser-based Jupyter notebooks. If you prefer an IDE, a great option for data analysis is Rodeo from Yhat. If you are familiar with RStudio for R, Rodeo is something very similar for Python. Be sure to try out both alternatives, as, ultimately, the Python environment you use will depend on your personal preference.
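
Once the installation is done, it can be reassuring to confirm that the libraries mentioned above are actually available. The following is only a minimal sketch, not an official Anaconda tool: the helper name check_libraries is made up for this illustration, and the list of import names is simply taken from the libraries named in this step. You could paste it into a notebook cell or run it from Rodeo.

```python
import importlib.util
import sys

# Import names of the libraries this step says ship with Anaconda.
# Import names are case-sensitive (note "IPython").
EXPECTED = ["numpy", "pandas", "scipy", "matplotlib", "IPython"]


def check_libraries(names=EXPECTED):
    """Hypothetical helper: map each import name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    print("Python", sys.version.split()[0])
    for name, found in check_libraries().items():
        print(f"{name:12s} {'installed' if found else 'MISSING'}")
```

If any library is reported as missing, the installation probably did not complete, or the code is running under a different Python than the one Anaconda installed.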


Step 2: Learning the Basics and Fundamentals

Now you are ready to begin learning to code with Python. There are a couple of good ways to go about this. Given your interest in learning Python for data analysis, your best option is the Introduction to Python for Data Science course from DataCamp. This free course consists of video tutorials and interactive in-browser exercises and is a great way to learn by doing, as opposed to simply reading concepts and looking at examples. You wouldn’t begin learning how to paint by reading a book about it. You would pick up a brush and start painting. That’s how we suggest you start learning Python! In addition to the introductory course, DataCamp offers an Intermediate Python for Data Science course, which takes you even further.

Another quite useful resource is the Python course from Codecademy. Although this course is about programming with Python rather than about data, it is a great way to both practice Python syntax and gain exposure to programming concepts that will be useful to you when working with data.

Step 3: Python Packages for Data Analysis

Python is a general purpose language and is often used for things other than data analysis and data science. What makes Python extremely useful for working with data, however, are the libraries that give users the necessary functionality. Below are the major Python libraries that are used for working with data. You should take some time to familiarize yourself with the basic purposes of these packages.

  • NumPy and SciPy – fundamental scientific computing.
  • Pandas – data manipulation and analysis.
  • Matplotlib – plotting and visualization.
  • Scikit-learn – machine learning and data mining.
  • StatsModels – statistical modeling, testing, and analysis.
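
To give a feel for how the first two libraries in the list fit together, here is a small sketch. The data is made up purely for illustration, and it assumes NumPy and pandas are installed (as they are with Anaconda). NumPy handles the numeric array work, and pandas wraps the result in a labeled table that can be summarized by group.

```python
import numpy as np
import pandas as pd

# Made-up sample data, for illustration only.
heights_cm = np.array([170.0, 180.0, 165.0, 175.0])

# NumPy: arithmetic is applied to the whole array at once.
heights_m = heights_cm / 100
mean_height_m = heights_m.mean()

# pandas: labeled, tabular data built on top of NumPy arrays.
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy", "Dee"],
    "group": ["a", "b", "a", "b"],
    "height_m": heights_m,
})

# Group-wise summary: mean height per group.
mean_by_group = people.groupby("group")["height_m"].mean()

print(mean_height_m)
print(mean_by_group)
```

The other libraries follow the same pattern: Matplotlib could plot the height_m column, while Scikit-learn and StatsModels would take such a table as input for models.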