Using Excel with Pandas
In this tutorial, we are going to show you how to work with Excel files in pandas, covering computer setup, reading in data from Excel files into pandas, data exploration in pandas, and more.
By Harish Garg, Dataquest
Excel is one of the most popular and widely-used data tools; it's hard to find an organization that doesn't work with it in some way. From analysts, to sales VPs, to CEOs, various professionals use Excel for both quick stats and serious data crunching.
With Excel being so pervasive, data professionals must be familiar with it. You'll also want a tool that can easily read and write Excel files — pandas is perfect for this.
Pandas has excellent methods for reading all kinds of data from Excel files. You can also export your results from pandas back to Excel, if that's preferred by your intended audience. Pandas is great for other routine data analysis tasks, such as:
- quick Exploratory Data Analysis (EDA)
- drawing attractive plots
- feeding data into machine learning tools like scikit-learn
- building machine learning models on your data
- taking cleaned and processed data to any number of data tools
Pandas is better at automating data processing tasks than Excel, including processing Excel files.
In this tutorial, we are going to show you how to work with Excel files in pandas. We will cover the following concepts.
- setting up your computer with the necessary software
- reading in data from Excel files into pandas
- data exploration in pandas
- visualizing data in pandas using the matplotlib visualization library
- manipulating and reshaping data in pandas
- moving data from pandas into Excel
Note that this tutorial does not provide a deep dive into pandas. To explore pandas more, check out our course.
In addition to Python and Jupyter Notebook, you will need the following Python modules:
- matplotlib - data visualization
- NumPy - numerical data functionality
- OpenPyXL - read/write Excel 2010 xlsx/xlsm files
- pandas - data import, clean-up, exploration, and analysis
- xlrd - read Excel data
- xlwt - write to Excel
- XlsxWriter - write to Excel (xlsx) files
There are multiple ways to get set up with all the modules. We cover three of the most common scenarios below.
- If you have Python installed via Anaconda package manager, you can install the required modules using the command
conda install. For example, to install pandas, you would execute the command -
conda install pandas.
- If you already have a regular, non-Anaconda Python installed on the computer, you can install the required modules using
pip. Open your command line program and execute command
pip install <module name>to install a module. You should replace
<module name>with the actual name of the module you are trying to install. For example, to install pandas, you would execute command -
pip install pandas.
- If you don't have Python already installed, you should get it through the Anaconda package manager. Anaconda provides installers for Windows, Mac, and Linux Computers. If you choose the full installer, you will get all the modules you need, along with Python and pandas within a single package. This is the easiest and fastest way to get started.
The data set
In this tutorial, we will use a multi-sheet Excel file we created from Kaggle's IMDB Scores data. You can download the file here.
Our Excel file has three sheets: '1900s,' '2000s,' and '2010s.' Each sheet has data for movies from those years.
We will use this data set to find the ratings distribution for the movies, visualize movies with highest ratings and net earnings and calculate statistical information about the movies. We will be analyzing and exploring this data using pandas, thus demonstrating pandas capabilities to work with Excel data.
Read data from the Excel file
We need to first import the data from the Excel file into pandas. To do that, we start by importing the pandas module.
We then use the pandas' read_excel method to read in data from the Excel file. The easiest way to call this method is to pass the file name. If no sheet name is specified then it will read the first sheet in the index (as shown below).
read_excel method read the data from the Excel file into a pandas DataFrame object. Pandas defaults to storing data in DataFrames. We then stored this DataFrame into a variable called
Pandas has a built-in
DataFrame.head() method that we can use to easily display the first few rows of our DataFrame. If no argument is passed, it will display first five rows. If a number is passed, it will display the equal number of rows from the top.
Excel files quite often have multiple sheets and the ability to read a specific sheet or all of them is very important. To make this easy, the pandas
read_excel method takes an argument called
sheetname that tells pandas which sheet to read in the data from. For this, you can either use the sheet name or the sheet number. Sheet numbers start with zero. If the
sheetname argument is not given, it defaults to zero and pandas will import the first sheet.
By default, pandas will automatically assign a numeric index or row label starting with zero. You may want to leave the default index as such if your data doesn't have a column with unique values that can serve as a better index. In case there is a column that you feel would serve as a better index, you can override the default behavior by setting
index_col property to a column. It takes a numeric value for setting a single column as index or a list of numeric values for creating a multi-index.
In the below code, we are choosing the first column, 'Title', as index (index=0) by passing zero to the
As you noticed above, our Excel data file has three sheets. We already read the first sheet in a DataFrame above. Now, using the same syntax, we will read in rest of the two sheets too.
Since all the three sheets have similar data but for different records\movies, we will create a single DataFrame from all the three DataFrames we created above. We will use the pandas
concat method for this and pass in the names of the three DataFrames we just created and assign the results to a new DataFrame object,
movies. By keeping the DataFrame name same as before, we are over-writing the previously created DataFrame.
We can check if this concatenation by checking the number of rows in the combined DataFrame by calling the method
shape on it that will give us the number of rows and columns.
Using the ExcelFile class to read multiple sheets
We can also use the ExcelFile class to work with multiple sheets from the same Excel file. We first wrap the Excel file using
ExcelFile and then pass it to
If you are reading an Excel file with a lot of sheets and are creating a lot of DataFrames,
ExcelFile is more convenient and efficient in comparison to
read_excel. With ExcelFile, you only need to pass the Excel file once, and then you can use it to get the DataFrames. When using
read_excel, you pass the Excel file every time and hence the file is loaded again for every sheet. This can be a huge performance drag if the Excel file has many sheets with a large number of rows.
Exploring the data
Now that we have read in the movies data set from our Excel file, we can start exploring it using pandas. A pandas DataFrame stores the data in a tabular format, just like the way Excel displays the data in a sheet. Pandas has a lot of built-in methods to explore the DataFrame we created from the Excel file we just read in.
We already introduced the method
head in the previous section that displays few rows from the top from the DataFrame. Let's look at few more methods that come in handy while exploring the data set.
We can use the
shape method to find out the number of rows and columns for the DataFrame.
This tells us our Excel file has 5042 records and 25 columns or observations. This can be useful in reporting the number of records and columns and comparing that with the source data set.
We can use the
tail method to view the bottom rows. If no parameter is passed, only the bottom five rows are returned.
In Excel, you're able to sort a sheet based on the values in one or more columns. In pandas, you can do the same thing with the
sort_values method. For example, let's sort our movies DataFrame based on the Gross Earnings column.
Since we have the data sorted by values in a column, we can do few interesting things with it. For example, we can display the top 10 movies by Gross Earnings.
We can also create a plot for the top 10 movies by Gross Earnings. Pandas makes it easy to visualize your data with plots and charts through matplotlib, a popular data visualization library. With a couple lines of code, you can start plotting. Moreover, matplotlib plots work well inside Jupyter Notebooks since you can displace the plots right under the code.
First, we import the matplotlib module and set matplotlib to display the plots right in the Jupyter Notebook.
We will draw a bar plot where each bar will represent one of the top 10 movies. We can do this by calling the
plot method and setting the argument
barh. This tells
matplotlibto draw a horizontal bar plot.
Let's create a histogram of IMDB Scores to check the distribution of IMDB Scores across all movies. Histograms are a good way to visualize the distribution of a data set. We use the
plotmethod on the IMDB Scores series from our movies DataFrame and pass it the argument
This data visualization suggests that most of the IMDB Scores fall between six and eight.
|Top Stories Past 30 Days|