Using Excel with Pandas
In this tutorial, we are going to show you how to work with Excel files in pandas, covering computer setup, reading in data from Excel files into pandas, data exploration in pandas, and more.
Getting statistical information about the data
Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe
method to get a statistical summary of the data set.
movies.describe()
Year | Duration | Aspect Ratio | Budget | Gross Earnings | Facebook Likes - Director | Facebook Likes - Actor 1 | Facebook Likes - Actor 2 | Facebook Likes - Actor 3 | Facebook Likes - cast Total | Facebook likes - Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4935.000000 | 5028.000000 | 4714.000000 | 4.551000e+03 | 4.159000e+03 | 4938.000000 | 5035.000000 | 5029.000000 | 5020.000000 | 5042.000000 | 5042.000000 | 5029.000000 | 5.042000e+03 | 5022.000000 | 4993.000000 | 5042.000000 |
mean | 2002.470517 | 107.201074 | 2.220403 | 3.975262e+07 | 4.846841e+07 | 686.621709 | 6561.323932 | 1652.080533 | 645.009761 | 9700.959143 | 7527.457160 | 1.371446 | 8.368475e+04 | 272.770808 | 140.194272 | 6.442007 |
std | 12.474599 | 25.197441 | 1.385113 | 2.061149e+08 | 6.845299e+07 | 2813.602405 | 15021.977635 | 4042.774685 | 1665.041728 | 18165.101925 | 19322.070537 | 2.013683 | 1.384940e+05 | 377.982886 | 121.601675 | 1.125189 |
min | 1916.000000 | 7.000000 | 1.180000 | 2.180000e+02 | 1.620000e+02 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.000000e+00 | 1.000000 | 1.000000 | 1.600000 |
25% | 1999.000000 | 93.000000 | 1.850000 | 6.000000e+06 | 5.340988e+06 | 7.000000 | 614.500000 | 281.000000 | 133.000000 | 1411.250000 | 0.000000 | 0.000000 | 8.599250e+03 | 65.000000 | 50.000000 | 5.800000 |
50% | 2005.000000 | 103.000000 | 2.350000 | 2.000000e+07 | 2.551750e+07 | 49.000000 | 988.000000 | 595.000000 | 371.500000 | 3091.000000 | 166.000000 | 1.000000 | 3.437100e+04 | 156.000000 | 110.000000 | 6.600000 |
75% | 2011.000000 | 118.000000 | 2.350000 | 4.500000e+07 | 6.230944e+07 | 194.750000 | 11000.000000 | 918.000000 | 636.000000 | 13758.750000 | 3000.000000 | 2.000000 | 9.634700e+04 | 326.000000 | 195.000000 | 7.200000 |
max | 2016.000000 | 511.000000 | 16.000000 | 1.221550e+10 | 7.605058e+08 | 23000.000000 | 640000.000000 | 137000.000000 | 23000.000000 | 656730.000000 | 349000.000000 | 43.000000 | 1.689764e+06 | 5060.000000 | 813.000000 | 9.500000 |
The describe
method displays below information for each of the columns.
- the count or number of values
- mean
- standard deviation
- minimum, maximum
- 25%, 50%, and 75% quantile
Please note that this information will be calculated only for the numeric values.
We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the mean
method on that column.
movies["Gross Earnings"].mean()
48468407.526809327
Just like mean, there are methods available for each of the statistical information we want to access. You can read about these methods in our free pandas cheat sheet.
Reading files with no header and skipping records
Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to consider the first row as header or columns names. And If the Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel
method to skip a certain number of rows, starting from the top.
For example, look at the top few rows of this Excel file.
This file obviously has no header and first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting argument header
to None
and we can skip first four rows by setting argument skiprows
to four.
movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4) movies_skip_rows.head(5)
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | ... | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
1 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | ... | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
2 | The Broadway Melody | 1929 | Musical|Romance | English | USA | Passed | 100 | 1.37 | 379000.0 | 2808000.0 | ... | 77 | 28 | 4.0 | 109 | 167 | 8 | 4546 | 71 | 36.0 | 6.3 |
3 | Hell's Angels | 1930 | Drama|War | English | USA | Passed | 96 | 1.20 | 3950000.0 | NaN | ... | 431 | 12 | 4.0 | 457 | 279 | 1 | 3753 | 53 | 35.0 | 7.8 |
4 | A Farewell to Arms | 1932 | Drama|Romance|War | English | USA | Unrated | 79 | 1.37 | 800000.0 | NaN | ... | 998 | 164 | 99.0 | 1284 | 213 | 1 | 3519 | 46 | 42.0 | 6.6 |
5 rows × 25 columns
We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use option skip_footer
, which works just like skiprows
, the only difference being the rows are counted from the bottom upwards.
The column names in the previous DataFrame are numeric and were allotted as default by the pandas. We can rename the column names to descriptive ones by calling the method columns
on the DataFrame and passing the column names as a list.
movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating', 'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director', 'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director', 'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2', 'Facebook Likes - Actor 3', 'Facebook Likes - cast Total', 'Facebook likes - Movie', 'Facenumber in posters', 'User Votes', 'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score'] movies_skip_rows.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | ... | Facebook Likes - Actor 1 | Facebook Likes - Actor 2 | Facebook Likes - Actor 3 | Facebook Likes - cast Total | Facebook likes - Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 | 1.33 | 6000000.0 | 26435.0 | ... | 136 | 23 | 18.0 | 203 | 12000 | 1 | 111841 | 413 | 260.0 | 8.3 |
1 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 | 1.33 | NaN | 9950.0 | ... | 426 | 20 | 3.0 | 455 | 926 | 1 | 7431 | 84 | 71.0 | 8.0 |
2 | The Broadway Melody | 1929 | Musical|Romance | English | USA | Passed | 100 | 1.37 | 379000.0 | 2808000.0 | ... | 77 | 28 | 4.0 | 109 | 167 | 8 | 4546 | 71 | 36.0 | 6.3 |
3 | Hell's Angels | 1930 | Drama|War | English | USA | Passed | 96 | 1.20 | 3950000.0 | NaN | ... | 431 | 12 | 4.0 | 457 | 279 | 1 | 3753 | 53 | 35.0 | 7.8 |
4 | A Farewell to Arms | 1932 | Drama|Romance|War | English | USA | Unrated | 79 | 1.37 | 800000.0 | NaN | ... | 998 | 164 | 99.0 | 1284 | 213 | 1 | 3519 | 46 | 42.0 | 6.6 |
5 rows × 25 columns
Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.
Reading a subset of columns
Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. By passing parse_cols=6, we are telling the read_excel
method to read only the first columns till index six or first seven columns (the first column being indexed zero).
movies_subset_columns = pd.read_excel(excel_file, parse_cols=6) movies_subset_columns.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | |
---|---|---|---|---|---|---|---|
0 | Intolerance: Love's Struggle Throughout the Ages | 1916 | Drama|History|War | NaN | USA | Not Rated | 123 |
1 | Over the Hill to the Poorhouse | 1920 | Crime|Drama | NaN | USA | NaN | 110 |
2 | The Big Parade | 1925 | Drama|Romance|War | NaN | USA | Not Rated | 151 |
3 | Metropolis | 1927 | Drama|Sci-Fi | German | Germany | Not Rated | 145 |
4 | Pandora's Box | 1929 | Crime|Drama|Romance | German | Germany | Not Rated | 110 |
Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.
Applying formulas on the columns
One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.
movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]
Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes - with Python, the calculations happen and the values are stored - if Gross Earnings for one movie was manually changed, Net Earnings won't be updated.
Let's use the sot_values
method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.
sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False]) sorted_movies.head(10)['Net Earnings'].plot.barh() plt.show()
Pivot Table in pandas
Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.
We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.
movies_subset = movies[['Year', 'Gross Earnings']] movies_subset.head()
Year | Gross Earnings | |
---|---|---|
0 | 1916.0 | NaN |
1 | 1920.0 | 3000000.0 |
2 | 1925.0 | NaN |
3 | 1927.0 | 26435.0 |
4 | 1929.0 | 9950.0 |
We now call pivot_table
on this subset of data. The method pivot_table
takes a parameter index
. As mentioned, we want to use Year as the index.
earnings_by_year = movies_subset.pivot_table(index=['Year']) earnings_by_year.head()
Gross Earnings | |
---|---|
Year | |
1916.0 | NaN |
1920.0 | 3000000.0 |
1925.0 | NaN |
1927.0 | 26435.0 |
1929.0 | 1408975.0 |
This gave us a pivot table with grouping on Year and summarization on the sum of Gross Earnings. Notice, we didn't need to specify Gross Earnings column explicitly as pandas automatically identified it the values on which summarization should be applied.
We can use this pivot table to create some data visualizations. We can call the plot
method on the DataFrame to create a line plot and call the show
method to display the plot in the notebook.
earnings_by_year.plot() plt.show()
We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let's create another DataFrame subset but this time we will choose the columns, Country, Language and Gross Earnings.
movies_subset = movies[['Country', 'Language', 'Gross Earnings']] movies_subset.head()
Country | Language | Gross Earnings | |
---|---|---|---|
0 | USA | NaN | NaN |
1 | USA | NaN | 3000000.0 |
2 | USA | NaN | NaN |
3 | Germany | German | 26435.0 |
4 | Germany | German | 9950.0 |
We will use columns Country and Language as the index for the pivot table. We will use Gross Earnings as summarization table, however, we do not need to specify this explicitly as we saw earlier.
earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language']) earnings_by_co_lang.head()
Gross Earnings | ||
---|---|---|
Country | Language | |
Afghanistan | Dari | 1.127331e+06 |
Argentina | Spanish | 7.230936e+06 |
Aruba | English | 1.007614e+07 |
Australia | Aboriginal | 6.165429e+06 |
Dzongkha | 5.052950e+05 |
Let's visualize this pivot table with a bar plot. Since there are still few hundred records in this pivot table, we will plot just a few of them.
earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8)) plt.show()
Exporting the results to Excel
If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using pandas to_excel
method. Pandas uses the xlwt
Python module internally for writing to Excel files. The to_excel
method is called on the DataFrame we want to export.We also need to pass a filename to which this DataFrame will be written.
movies.to_excel('output.xlsx')
By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the movies
DataFrame has a numeric auto-increment index, that was not part of the original Excel data.
movies.head()
Title | Year | Genres | Language | Country | Content Rating | Duration | Aspect Ratio | Budget | Gross Earnings | ... | Facebook Likes - Actor 2 | Facebook Likes - Actor 3 | Facebook Likes - cast Total | Facebook likes - Movie | Facenumber in posters | User Votes | Reviews by Users | Reviews by Crtiics | IMDB Score | Net Earnings | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Intolerance: Love's Struggle Throughout the Ages | 1916.0 | Drama|History|War | NaN | USA | Not Rated | 123.0 | 1.33 | 385907.0 | NaN | ... | 22.0 | 9.0 | 481 | 691 | 1.0 | 10718 | 88.0 | 69.0 | 8.0 | NaN |
1 | Over the Hill to the Poorhouse | 1920.0 | Crime|Drama | NaN | USA | NaN | 110.0 | 1.33 | 100000.0 | 3000000.0 | ... | 2.0 | 0.0 | 4 | 0 | 1.0 | 5 | 1.0 | 1.0 | 4.8 | 2900000.0 |
2 | The Big Parade | 1925.0 | Drama|Romance|War | NaN | USA | Not Rated | 151.0 | 1.33 | 245000.0 | NaN | ... | 12.0 | 6.0 | 108 | 226 | 0.0 | 4849 | 45.0 | 48.0 | 8.3 | NaN |
3 | Metropolis | 1927.0 | Drama|Sci-Fi | German | Germany | Not Rated | 145.0 | 1.33 | 6000000.0 | 26435.0 | ... | 23.0 | 18.0 | 203 | 12000 | 1.0 | 111841 | 413.0 | 260.0 | 8.3 | -5973565.0 |
4 | Pandora's Box | 1929.0 | Crime|Drama|Romance | German | Germany | Not Rated | 110.0 | 1.33 | NaN | 9950.0 | ... | 20.0 | 3.0 | 455 | 926 | 1.0 | 7431 | 84.0 | 71.0 | 8.0 | NaN |
5 rows × 26 columns
You can choose to skip the index by passing along index-False.
movies.to_excel('output.xlsx', index=False)
We need to be able to make our output files look nice before we can send it out to our co-workers. We can use pandas ExcelWriter
class along with the XlsxWriter
Python module to apply the formatting.
We can do use these advanced output options by creating a ExcelWriter
object and use this object to write to the EXcel file.
writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter') movies.to_excel(writer, index=False, sheet_name='report') workbook = writer.book worksheet = writer.sheets['report']
We can apply customizations by calling add_format
on the workbook we are writing to. Here we are setting header format as bold.
header_fmt = workbook.add_format({'bold': True}) worksheet.set_row(0, None, header_fmt)
Finally, we save the output file by calling the method save
on the writer object.
writer.save()
As an example, we saved the data with column headers set as bold. And the saved file looks like the image below.
Like this, one can use XlsxWriter
to apply various formatting to the output Excel file.
Conclusion
Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be very great companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulations, which depending on your need and expertise, can go beyond what you can achieve if you are just using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too if that's preferred by your intended audience.
On the other hand, Excel is a such a widely used data tool, it's not a wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.
Bio: Harish Garg is an Entrepreneur, Technical Trainer, and Lead Software Developer with extensive experience in Data Science, Python, Web, and Mobile Development. He is passionate about Data Science and Artificial Intelligence.
Original. Reposted with permission.
Related:
- Getting Started with Analytics: What’s the Upfront Investment?
- Python Data Preparation Case Files: Removing Instances & Basic Imputation
- Python Data Preparation Case Files: Group-based Imputation