Using Excel with Pandas
In this tutorial, we are going to show you how to work with Excel files in pandas, covering computer setup, reading in data from Excel files into pandas, data exploration in pandas, and more.
Getting statistical information about the data
Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the
describe method to get a statistical summary of the data set.
describe method displays below information for each of the columns.
- the count or number of values
- standard deviation
- minimum, maximum
- 25%, 50%, and 75% quantile
Please note that this information will be calculated only for the numeric values.
We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the
mean method on that column.
Just like mean, there are methods available for each of the statistical information we want to access. You can read about these methods in our free pandas cheat sheet.
Reading files with no header and skipping records
Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to consider the first row as header or columns names. And If the Excel sheet's first few rows contain data that should not be read in, you can ask the
read_excel method to skip a certain number of rows, starting from the top.
For example, look at the top few rows of this Excel file.
This file obviously has no header and first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting argument
Noneand we can skip first four rows by setting argument
skiprows to four.
We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use option
skip_footer, which works just like
skiprows, the only difference being the rows are counted from the bottom upwards.
The column names in the previous DataFrame are numeric and were allotted as default by the pandas. We can rename the column names to descriptive ones by calling the method
columnson the DataFrame and passing the column names as a list.
Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.
Reading a subset of columns
Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. By passing parse_cols=6, we are telling the
read_excel method to read only the first columns till index six or first seven columns (the first column being indexed zero).
|0||Intolerance: Love's Struggle Throughout the Ages||1916||Drama|History|War||NaN||USA||Not Rated||123|
|1||Over the Hill to the Poorhouse||1920||Crime|Drama||NaN||USA||NaN||110|
|2||The Big Parade||1925||Drama|Romance|War||NaN||USA||Not Rated||151|
|4||Pandora's Box||1929||Crime|Drama|Romance||German||Germany||Not Rated||110|
Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.
Applying formulas on the columns
One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.
Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes - with Python, the calculations happen and the values are stored - if Gross Earnings for one movie was manually changed, Net Earnings won't be updated.
Let's use the
sot_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.
Pivot Table in pandas
Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.
We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.
We now call
pivot_table on this subset of data. The method
pivot_table takes a parameter
index. As mentioned, we want to use Year as the index.
This gave us a pivot table with grouping on Year and summarization on the sum of Gross Earnings. Notice, we didn't need to specify Gross Earnings column explicitly as pandas automatically identified it the values on which summarization should be applied.
We can use this pivot table to create some data visualizations. We can call the
plot method on the DataFrame to create a line plot and call the
show method to display the plot in the notebook.
We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let's create another DataFrame subset but this time we will choose the columns, Country, Language and Gross Earnings.
We will use columns Country and Language as the index for the pivot table. We will use Gross Earnings as summarization table, however, we do not need to specify this explicitly as we saw earlier.
Let's visualize this pivot table with a bar plot. Since there are still few hundred records in this pivot table, we will plot just a few of them.
Exporting the results to Excel
If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using pandas
to_excel method. Pandas uses the
xlwt Python module internally for writing to Excel files. The
to_excel method is called on the DataFrame we want to export.We also need to pass a filename to which this DataFrame will be written.
By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the
movies DataFrame has a numeric auto-increment index, that was not part of the original Excel data.
You can choose to skip the index by passing along index-False.
We need to be able to make our output files look nice before we can send it out to our co-workers. We can use pandas
ExcelWriter class along with the
XlsxWriter Python module to apply the formatting.
We can do use these advanced output options by creating a
ExcelWriter object and use this object to write to the EXcel file.
We can apply customizations by calling
add_format on the workbook we are writing to. Here we are setting header format as bold.
Finally, we save the output file by calling the method
save on the writer object.
As an example, we saved the data with column headers set as bold. And the saved file looks like the image below.
Like this, one can use
XlsxWriter to apply various formatting to the output Excel file.
Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be very great companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulations, which depending on your need and expertise, can go beyond what you can achieve if you are just using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too if that's preferred by your intended audience.
On the other hand, Excel is a such a widely used data tool, it's not a wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.
Bio: Harish Garg is an Entrepreneur, Technical Trainer, and Lead Software Developer with extensive experience in Data Science, Python, Web, and Mobile Development. He is passionate about Data Science and Artificial Intelligence.
Original. Reposted with permission.
- Getting Started with Analytics: What’s the Upfront Investment?
- Python Data Preparation Case Files: Removing Instances & Basic Imputation
- Python Data Preparation Case Files: Group-based Imputation