Topics: Coronavirus | AI | Data Science | Deep Learning | Machine Learning | Python | R | Statistics

KDnuggets Home » News » 2018 » Jan » Tutorials, Overviews » Using Excel with Pandas ( 18:n04 )

Using Excel with Pandas


In this tutorial, we are going to show you how to work with Excel files in pandas, covering computer setup, reading in data from Excel files into pandas, data exploration in pandas, and more.



Getting statistical information about the data

 
Pandas has some very handy methods to look at the statistical data about our data set. For example, we can use the describe method to get a statistical summary of the data set.

movies.describe()


Year Duration Aspect Ratio Budget Gross Earnings Facebook Likes - Director Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
count 4935.000000 5028.000000 4714.000000 4.551000e+03 4.159000e+03 4938.000000 5035.000000 5029.000000 5020.000000 5042.000000 5042.000000 5029.000000 5.042000e+03 5022.000000 4993.000000 5042.000000
mean 2002.470517 107.201074 2.220403 3.975262e+07 4.846841e+07 686.621709 6561.323932 1652.080533 645.009761 9700.959143 7527.457160 1.371446 8.368475e+04 272.770808 140.194272 6.442007
std 12.474599 25.197441 1.385113 2.061149e+08 6.845299e+07 2813.602405 15021.977635 4042.774685 1665.041728 18165.101925 19322.070537 2.013683 1.384940e+05 377.982886 121.601675 1.125189
min 1916.000000 7.000000 1.180000 2.180000e+02 1.620000e+02 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000e+00 1.000000 1.000000 1.600000
25% 1999.000000 93.000000 1.850000 6.000000e+06 5.340988e+06 7.000000 614.500000 281.000000 133.000000 1411.250000 0.000000 0.000000 8.599250e+03 65.000000 50.000000 5.800000
50% 2005.000000 103.000000 2.350000 2.000000e+07 2.551750e+07 49.000000 988.000000 595.000000 371.500000 3091.000000 166.000000 1.000000 3.437100e+04 156.000000 110.000000 6.600000
75% 2011.000000 118.000000 2.350000 4.500000e+07 6.230944e+07 194.750000 11000.000000 918.000000 636.000000 13758.750000 3000.000000 2.000000 9.634700e+04 326.000000 195.000000 7.200000
max 2016.000000 511.000000 16.000000 1.221550e+10 7.605058e+08 23000.000000 640000.000000 137000.000000 23000.000000 656730.000000 349000.000000 43.000000 1.689764e+06 5060.000000 813.000000 9.500000


The describe method displays below information for each of the columns.

  • the count or number of values
  • mean
  • standard deviation
  • minimum, maximum
  • 25%, 50%, and 75% quantile

Please note that this information will be calculated only for the numeric values.

We can also use the corresponding method to access this information one at a time. For example, to get the mean of a particular column, you can use the mean method on that column.

movies["Gross Earnings"].mean()


48468407.526809327


Just like mean, there are methods available for each of the statistical information we want to access. You can read about these methods in our free pandas cheat sheet.

 

Reading files with no header and skipping records

 
Earlier in this tutorial, we saw some ways to read a particular kind of Excel file that had headers and no rows that needed skipping. Sometimes, the Excel sheet doesn't have any header row. For such instances, you can tell pandas not to consider the first row as header or columns names. And If the Excel sheet's first few rows contain data that should not be read in, you can ask the read_excel method to skip a certain number of rows, starting from the top.

For example, look at the top few rows of this Excel file.
img-excel-no-header-1

This file obviously has no header and first four rows are not actual records and hence should not be read in. We can tell read_excel there is no header by setting argument header to Noneand we can skip first four rows by setting argument skiprows to four.

movies_skip_rows = pd.read_excel("movies-no-header-skip-rows.xls", header=None, skiprows=4)
movies_skip_rows.head(5)


0 1 2 3 4 5 6 7 8 9 ... 15 16 17 18 19 20 21 22 23 24
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 ... 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 ... 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN ... 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN ... 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns


We skipped four rows from the sheet and used none of the rows as the header. Also, notice that one can combine different options in a single read statement. To skip rows at the bottom of the sheet, you can use option skip_footer, which works just like skiprows, the only difference being the rows are counted from the bottom upwards.

The column names in the previous DataFrame are numeric and were allotted as default by the pandas. We can rename the column names to descriptive ones by calling the method columnson the DataFrame and passing the column names as a list.

movies_skip_rows.columns = ['Title', 'Year', 'Genres', 'Language', 'Country', 'Content Rating',
       'Duration', 'Aspect Ratio', 'Budget', 'Gross Earnings', 'Director',
       'Actor 1', 'Actor 2', 'Actor 3', 'Facebook Likes - Director',
       'Facebook Likes - Actor 1', 'Facebook Likes - Actor 2',
       'Facebook Likes - Actor 3', 'Facebook Likes - cast Total',
       'Facebook likes - Movie', 'Facenumber in posters', 'User Votes',
       'Reviews by Users', 'Reviews by Crtiics', 'IMDB Score']
movies_skip_rows.head()


Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 1 Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score
0 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145 1.33 6000000.0 26435.0 ... 136 23 18.0 203 12000 1 111841 413 260.0 8.3
1 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110 1.33 NaN 9950.0 ... 426 20 3.0 455 926 1 7431 84 71.0 8.0
2 The Broadway Melody 1929 Musical|Romance English USA Passed 100 1.37 379000.0 2808000.0 ... 77 28 4.0 109 167 8 4546 71 36.0 6.3
3 Hell's Angels 1930 Drama|War English USA Passed 96 1.20 3950000.0 NaN ... 431 12 4.0 457 279 1 3753 53 35.0 7.8
4 A Farewell to Arms 1932 Drama|Romance|War English USA Unrated 79 1.37 800000.0 NaN ... 998 164 99.0 1284 213 1 3519 46 42.0 6.6

5 rows × 25 columns


Now that we have seen how to read a subset of rows from the Excel file, we can learn how to read a subset of columns.

 

Reading a subset of columns

 
Although read_excel defaults to reading and importing all columns, you can choose to import only certain columns. By passing parse_cols=6, we are telling the read_excel method to read only the first columns till index six or first seven columns (the first column being indexed zero).

movies_subset_columns = pd.read_excel(excel_file, parse_cols=6)
movies_subset_columns.head()


Title Year Genres Language Country Content Rating Duration
0 Intolerance: Love's Struggle Throughout the Ages 1916 Drama|History|War NaN USA Not Rated 123
1 Over the Hill to the Poorhouse 1920 Crime|Drama NaN USA NaN 110
2 The Big Parade 1925 Drama|Romance|War NaN USA Not Rated 151
3 Metropolis 1927 Drama|Sci-Fi German Germany Not Rated 145
4 Pandora's Box 1929 Crime|Drama|Romance German Germany Not Rated 110

Alternatively, you can pass in a list of numbers, which will let you import columns at particular indexes.

 

Applying formulas on the columns

 
One of the much-used features of Excel is to apply formulas to create new columns from existing column values. In our Excel file, we have Gross Earnings and Budget columns. We can get Net earnings by subtracting Budget from Gross earnings. We could then apply this formula in the Excel file to all the rows. We can do this in pandas also as shown below.

movies["Net Earnings"] = movies["Gross Earnings"] - movies["Budget"]


Above, we used pandas to create a new column called Net Earnings, and populated it with the difference of Gross Earnings and Budget. It's worth noting the difference here in how formulas are treated in Excel versus pandas. In Excel, a formula lives in the cell and updates when the data changes - with Python, the calculations happen and the values are stored - if Gross Earnings for one movie was manually changed, Net Earnings won't be updated.

Let's use the sot_values method to sort the data by the new column we created and visualize the top 10 movies by Net Earnings.

sorted_movies = movies[['Net Earnings']].sort_values(['Net Earnings'], ascending=[False])
sorted_movies.head(10)['Net Earnings'].plot.barh()
plt.show()


python-pandas-and-excel_44_0

 

Pivot Table in pandas

 
Advanced Excel users also often use pivot tables. A pivot table summarizes the data of another table by grouping the data on an index and applying operations such as sorting, summing, or averaging. You can use this feature in pandas too.

We need to first identify the column or columns that will serve as the index, and the column(s) on which the summarizing formula will be applied. Let's start small, by choosing Year as the index column and Gross Earnings as the summarization column and creating a separate DataFrame from this data.

movies_subset = movies[['Year', 'Gross Earnings']]
movies_subset.head()


Year Gross Earnings
0 1916.0 NaN
1 1920.0 3000000.0
2 1925.0 NaN
3 1927.0 26435.0
4 1929.0 9950.0


We now call pivot_table on this subset of data. The method pivot_table takes a parameter index. As mentioned, we want to use Year as the index.

earnings_by_year = movies_subset.pivot_table(index=['Year'])
earnings_by_year.head()


Gross Earnings
Year
1916.0 NaN
1920.0 3000000.0
1925.0 NaN
1927.0 26435.0
1929.0 1408975.0


This gave us a pivot table with grouping on Year and summarization on the sum of Gross Earnings. Notice, we didn't need to specify Gross Earnings column explicitly as pandas automatically identified it the values on which summarization should be applied.

We can use this pivot table to create some data visualizations. We can call the plot method on the DataFrame to create a line plot and call the show method to display the plot in the notebook.

earnings_by_year.plot()
plt.show()


python-pandas-and-excel_50_0

We saw how to pivot with a single column as the index. Things will get more interesting if we can use multiple columns. Let's create another DataFrame subset but this time we will choose the columns, Country, Language and Gross Earnings.

movies_subset = movies[['Country', 'Language', 'Gross Earnings']]
movies_subset.head()


Country Language Gross Earnings
0 USA NaN NaN
1 USA NaN 3000000.0
2 USA NaN NaN
3 Germany German 26435.0
4 Germany German 9950.0


We will use columns Country and Language as the index for the pivot table. We will use Gross Earnings as summarization table, however, we do not need to specify this explicitly as we saw earlier.

earnings_by_co_lang = movies_subset.pivot_table(index=['Country', 'Language'])
earnings_by_co_lang.head()


Gross Earnings
Country Language
Afghanistan Dari 1.127331e+06
Argentina Spanish 7.230936e+06
Aruba English 1.007614e+07
Australia Aboriginal 6.165429e+06
Dzongkha 5.052950e+05


Let's visualize this pivot table with a bar plot. Since there are still few hundred records in this pivot table, we will plot just a few of them.

earnings_by_co_lang.head(20).plot(kind='bar', figsize=(20,8))
plt.show()


python-pandas-and-excel_56_0

 

Exporting the results to Excel

 
If you're going to be working with colleagues who use Excel, saving Excel files out of pandas is important. You can export or write a pandas DataFrame to an Excel file using pandas to_excel method. Pandas uses the xlwt Python module internally for writing to Excel files. The to_excel method is called on the DataFrame we want to export.We also need to pass a filename to which this DataFrame will be written.

movies.to_excel('output.xlsx')


By default, the index is also saved to the output file. However, sometimes the index doesn't provide any useful information. For example, the movies DataFrame has a numeric auto-increment index, that was not part of the original Excel data.

movies.head()


Title Year Genres Language Country Content Rating Duration Aspect Ratio Budget Gross Earnings ... Facebook Likes - Actor 2 Facebook Likes - Actor 3 Facebook Likes - cast Total Facebook likes - Movie Facenumber in posters User Votes Reviews by Users Reviews by Crtiics IMDB Score Net Earnings
0 Intolerance: Love's Struggle Throughout the Ages 1916.0 Drama|History|War NaN USA Not Rated 123.0 1.33 385907.0 NaN ... 22.0 9.0 481 691 1.0 10718 88.0 69.0 8.0 NaN
1 Over the Hill to the Poorhouse 1920.0 Crime|Drama NaN USA NaN 110.0 1.33 100000.0 3000000.0 ... 2.0 0.0 4 0 1.0 5 1.0 1.0 4.8 2900000.0
2 The Big Parade 1925.0 Drama|Romance|War NaN USA Not Rated 151.0 1.33 245000.0 NaN ... 12.0 6.0 108 226 0.0 4849 45.0 48.0 8.3 NaN
3 Metropolis 1927.0 Drama|Sci-Fi German Germany Not Rated 145.0 1.33 6000000.0 26435.0 ... 23.0 18.0 203 12000 1.0 111841 413.0 260.0 8.3 -5973565.0
4 Pandora's Box 1929.0 Crime|Drama|Romance German Germany Not Rated 110.0 1.33 NaN 9950.0 ... 20.0 3.0 455 926 1.0 7431 84.0 71.0 8.0 NaN

5 rows × 26 columns


You can choose to skip the index by passing along index-False.

movies.to_excel('output.xlsx', index=False)


We need to be able to make our output files look nice before we can send it out to our co-workers. We can use pandas ExcelWriter class along with the XlsxWriter Python module to apply the formatting.

We can do use these advanced output options by creating a ExcelWriter object and use this object to write to the EXcel file.

writer = pd.ExcelWriter('output.xlsx', engine='xlsxwriter')

movies.to_excel(writer, index=False, sheet_name='report')

workbook = writer.book

worksheet = writer.sheets['report']


We can apply customizations by calling add_format on the workbook we are writing to. Here we are setting header format as bold.

header_fmt = workbook.add_format({'bold': True})
worksheet.set_row(0, None, header_fmt)


Finally, we save the output file by calling the method save on the writer object.

writer.save()


As an example, we saved the data with column headers set as bold. And the saved file looks like the image below.

img-excel-output-bold-1

Like this, one can use XlsxWriter to apply various formatting to the output Excel file.

 

Conclusion

 
Pandas is not a replacement for Excel. Both tools have their place in the data analysis workflow and can be very great companion tools. As we demonstrated, pandas can do a lot of complex data analysis and manipulations, which depending on your need and expertise, can go beyond what you can achieve if you are just using Excel. One of the major benefits of using Python and pandas over Excel is that it helps you automate Excel file processing by writing scripts and integrating with your automated data workflow. Pandas also has excellent methods for reading all kinds of data from Excel files. You can export your results from pandas back to Excel too if that's preferred by your intended audience.

On the other hand, Excel is a such a widely used data tool, it's not a wise to ignore it. Acquiring expertise in both pandas and Excel and making them work together gives you skills that can help you stand out in your organization.

 
Bio: Harish Garg is an Entrepreneur, Technical Trainer, and Lead Software Developer with extensive experience in Data Science, Python, Web, and Mobile Development. He is passionate about Data Science and Artificial Intelligence.

Original. Reposted with permission.

Related:


Sign Up

By subscribing you accept KDnuggets Privacy Policy