KDnuggets Home » News » 2019 » Oct » Tutorials, Overviews » 5 Advanced Features of Pandas and How to Use Them ( 19:n41 )

5 Advanced Features of Pandas and How to Use Them


The pandas library offers core functionality when preparing your data using Python. But, many don't go beyond the basics, so learn about these lesser-known advanced methods that will make handling your data easier and cleaner.



Pandas is the gold standard library for all things data. With the functionality to load, filter, manipulate, and explore data, it’s no wonder that it’s a favorite among Data Scientists.

Most of us naturally stick to the very basics of Pandas. Load up data from a CSV file, filter a few columns, and then jump right into the data visualizations. Yet Pandas actually comes with many lesser-known but useful functions that can make handling data a whole lot easier and cleaner.

This tutorial will guide you through 5 of those more advanced functions — what they do and how to use them. Even more fun with data!

 

(1) Configuring Options and Settings

Pandas comes with a set of user-configurable options and settings. They’re huge productivity boosters since they let you tailor your Pandas environment exactly to your liking.

We can, for example, change some of Pandas’s display settings to change how many rows and columns are shown and to what precision floating point numbers are displayed.

Setting display.max_rows and display.max_columns to 10 and display.precision to 2 ensures that Pandas displays at most 10 rows and 10 columns, with floating-point values shown to 2 decimal places. That way, our terminal or Jupyter Notebook won’t look like a mess when we try to print out a big DataFrame!
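As a minimal sketch of those display settings (the original snippet is not reproduced here, so the exact calls are an assumption), the options can be set with pd.set_option and read back with pd.get_option:

```python
import pandas as pd

# Limit how much of a large DataFrame is printed.
pd.set_option("display.max_rows", 10)
pd.set_option("display.max_columns", 10)

# Show floating-point values with 2 decimal places.
pd.set_option("display.precision", 2)

# Read the settings back to confirm they took effect.
max_rows = pd.get_option("display.max_rows")
max_columns = pd.get_option("display.max_columns")
precision = pd.get_option("display.precision")
print(max_rows, max_columns, precision)
```

Calling pd.reset_option("display.max_rows") restores the default for a single option.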

That’s just a basic example. There’s a lot more to explore beyond the simple display settings. You can check out all the options in the official documentation.

 

(2) Combining DataFrames

A relatively unknown part of Pandas DataFrames is that there are actually two different ways to combine them. Each method produces a different result, so selecting the proper one based on what you want to achieve is very important. In addition, both accept many parameters that further customize how the DataFrames are combined. Let’s check them out.

Concatenating

Concatenating is the most well-known method of combining DataFrames and can be thought of intuitively as “stacking.” That stacking can be done either horizontally or vertically.

Imagine that you have a huge dataset in CSV format. It makes sense to split it up into multiple files for easier handling (this is common practice for large datasets, referred to as sharding).

When you load the files into Pandas, you can vertically stack the DataFrame of each CSV to create one big DataFrame for all of the data. For example, if we have 3 shards, each with 5 million rows, then after we vertically stack them all, our final DataFrame will have 15 million rows.

Vertical concatenation in Pandas is done with the concat function, which stacks DataFrames along the rows by default.
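Here is a small sketch of vertical concatenation. The three tiny DataFrames stand in for shards loaded from CSV files; their names and contents are made up for illustration.

```python
import pandas as pd

# Three small stand-ins for shards of one dataset (hypothetical data).
shard_1 = pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})
shard_2 = pd.DataFrame({"id": [3, 4], "value": [30.0, 40.0]})
shard_3 = pd.DataFrame({"id": [5, 6], "value": [50.0, 60.0]})

# axis=0 stacks rows; ignore_index=True renumbers the index from 0.
combined = pd.concat([shard_1, shard_2, shard_3], axis=0, ignore_index=True)
print(combined.shape)  # (6, 2)
```

Without ignore_index=True, each shard keeps its own index labels, so the combined DataFrame would contain repeated labels.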

You can do something similar by splitting up your dataset according to the columns instead of the rows — a few columns for each CSV file (with all the rows of the dataset). It’s like we’re dividing up the dataset’s features into different shards. You would then horizontally stack them to combine those columns/features.
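Horizontal stacking uses the same concat function with axis=1. The sketch below assumes two hypothetical feature shards that share the same row index; rows are aligned on that index.

```python
import pandas as pd

# Two column-wise shards describing the same three rows (hypothetical data).
features_a = pd.DataFrame({"height": [1.7, 1.8, 1.6]})
features_b = pd.DataFrame({"weight": [65, 80, 55], "age": [30, 41, 25]})

# axis=1 places the DataFrames side by side, matching rows by index label.
wide = pd.concat([features_a, features_b], axis=1)
print(wide.shape)  # (3, 3)
```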

Merging

Merging is more complicated yet more powerful, combining Pandas DataFrames in an SQL-like style, i.e., the DataFrames will be joined by some common attribute.

Imagine that you have two DataFrames describing your YouTube channel. One of them contains a list of user IDs and how much time each user has spent on your channel in total. The other contains a similar list of user IDs and how many videos each user has seen. Merging allows us to combine the 2 DataFrames into a single one by matching up the user IDs and then putting the ID, time spent, and video count into a single row for each user.

Merging two DataFrames in Pandas is done with the merge function. The left and right parameters refer to the two DataFrames you wish to merge, while on specifies the column to be used for the matching.
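A minimal sketch of the YouTube example follows. The column names user_id, minutes_watched, and videos_seen are assumptions chosen for illustration.

```python
import pandas as pd

# Total watch time per user (hypothetical data).
watch_time = pd.DataFrame({"user_id": [1, 2, 3], "minutes_watched": [120, 45, 300]})

# Number of videos seen per user (hypothetical data).
video_counts = pd.DataFrame({"user_id": [1, 2, 3], "videos_seen": [10, 4, 25]})

# Match rows on user_id and place both measurements in one row per user.
merged = pd.merge(left=watch_time, right=video_counts, on="user_id")
print(merged)
```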

To go even further into emulating SQL joins, the how parameter allows you to select the type of SQL-style join you want to perform: inner, outer, left, or right. To learn more about SQL joins, see the W3Schools tutorial.
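To see how the how parameter changes the result, the sketch below merges two hypothetical DataFrames whose user IDs only partly overlap.

```python
import pandas as pd

# Users 1-3 have watch time; users 2-4 have video counts (hypothetical data).
left = pd.DataFrame({"user_id": [1, 2, 3], "minutes_watched": [120, 45, 300]})
right = pd.DataFrame({"user_id": [2, 3, 4], "videos_seen": [4, 25, 7]})

# inner: only IDs present in both DataFrames.
inner = pd.merge(left, right, on="user_id", how="inner")
# outer: every ID from either DataFrame; missing values become NaN.
outer = pd.merge(left, right, on="user_id", how="outer")
# left / right: keep every ID from the left or the right DataFrame.
left_join = pd.merge(left, right, on="user_id", how="left")
right_join = pd.merge(left, right, on="user_id", how="right")

print(len(inner), len(outer), len(left_join), len(right_join))
```

If how is omitted, merge performs an inner join.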

 

(3) Reshaping DataFrames

There are several ways to reshape and restructure Pandas DataFrames. These range from simple and easy to powerful and complex. Let’s check out the three most common ones. For all of the following examples, we’ll be using this Dataset of superheroes!

Transpose

The easiest of them all. Transposing swaps a DataFrame’s rows with its columns. If you have 5000 rows and 10 columns, and then transpose your DataFrame, you’ll end up with 10 rows and 5000 columns.
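A quick sketch of transposing follows. It uses a tiny made-up table rather than the superheroes dataset, and the hero labels and column names are hypothetical.

```python
import pandas as pd

# 3 rows (heroes) and 2 columns (attributes), hypothetical data.
df = pd.DataFrame(
    {"strength": [80, 95, 60], "speed": [70, 40, 90]},
    index=["hero_a", "hero_b", "hero_c"],
)

# .T swaps rows and columns: 2 rows (attributes) and 3 columns (heroes).
transposed = df.T
print(df.shape, transposed.shape)  # (3, 2) (2, 3)
```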

Groupby

Groupby’s main usage is to split up DataFrames into multiple parts based on some keys. Once the DataFrame is split up into parts, you can loop through and apply some operations on each part independently.

For example, suppose we create a DataFrame of Players with corresponding Years and Points. A groupby on the player column then splits the DataFrame into multiple parts, one per player. Thus, each player gets their own group showing how many points that player scored in each year they were active.
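The sketch below follows that description with invented player names and scores, so treat the data as placeholder values.

```python
import pandas as pd

# Players with the points they scored each year (hypothetical data).
df = pd.DataFrame({
    "Player": ["Ana", "Ben", "Ana", "Ben", "Cy"],
    "Year": [2017, 2017, 2018, 2018, 2018],
    "Points": [20, 15, 25, 10, 30],
})

# Split the DataFrame into one group per player.
groups = df.groupby("Player")

# Loop through the groups and handle each part independently.
for player, group in groups:
    print(player)
    print(group[["Year", "Points"]])

# Groups can also be aggregated directly, e.g. total points per player.
total_points = groups["Points"].sum()
```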

Stacking

Stacking transforms the DataFrame into having a multi-level index, i.e., each row has multiple sub-parts. These sub-parts are created using the DataFrame’s columns, compressing them into the multi-index. Overall, stacking can be thought of as compressing columns into multi-index rows.

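This is best illustrated by an example: a DataFrame with two rows and two columns stacks into four entries, each indexed by its original row label and its original column name.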
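Here is a minimal sketch of stacking with made-up data. Note that when the DataFrame has a single level of columns, stack() returns a Series with a two-level index.

```python
import pandas as pd

# 2 rows and 2 columns (hypothetical data).
df = pd.DataFrame(
    {"strength": [80, 95], "speed": [70, 40]},
    index=["hero_a", "hero_b"],
)

# Each column name becomes the inner level of a (row label, column name) index.
stacked = df.stack()
print(stacked)

# Individual values are looked up with both index levels.
hero_a_speed = stacked.loc[("hero_a", "speed")]

# unstack() reverses the operation and restores the wide layout.
restored = stacked.unstack()
```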

 

(4) Working with time data

Python’s datetime module is a staple of the standard library. Whenever you’re dealing with anything related to real-world date and time information, it’s your go-to tool. And lucky for us, Pandas also comes with functionality for working with datetime values.

Let’s illustrate with an example. Say we create a DataFrame with 4 columns: Day, Month, Year, and data, and then sort it by year and month. The result is quite messy; we’re using up 3 columns just to store the date, when in actuality, we know that a calendar date is just one value.
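A small sketch of that starting point is below. The dates and data values are invented for illustration.

```python
import pandas as pd

# The date is spread across three separate columns (hypothetical data).
df = pd.DataFrame({
    "Day": [15, 3, 22, 9],
    "Month": [6, 1, 6, 11],
    "Year": [2018, 2019, 2017, 2018],
    "data": [1.5, 2.0, 0.7, 3.1],
})

# Sort by year first, then by month within each year.
df = df.sort_values(["Year", "Month"]).reset_index(drop=True)
print(df)
```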

We can clean things up with datetime.

Pandas conveniently comes with a function called to_datetime() that can combine multiple DataFrame columns into a single datetime column. Once the data is in that format, you can work with it as real dates: sort chronologically, filter by date range, or pull out parts such as the year or month.

To use the to_datetime() function, you’ll need to pass it all of the “date” data from the relevant columns. That’s the “Day”, “Month”, and “Year” columns. Once we have things in datetime format, we no longer need the other columns and can simply drop them.
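The following sketch shows one way to do this, using the same invented data as before. Renaming the columns to lowercase is a cautious choice made here so that they match the "year", "month", and "day" names that to_datetime() documents; the new column name Date is also an assumption.

```python
import pandas as pd

# The date is spread across three separate columns (hypothetical data).
df = pd.DataFrame({
    "Day": [15, 3, 22, 9],
    "Month": [6, 1, 6, 11],
    "Year": [2018, 2019, 2017, 2018],
    "data": [1.5, 2.0, 0.7, 3.1],
})

# Assemble one datetime column from the year, month, and day columns.
date_parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
df["Date"] = pd.to_datetime(date_parts)

# The three original columns are now redundant, so drop them.
df = df.drop(columns=["Day", "Month", "Year"])

# Sorting by the single Date column orders rows chronologically.
df = df.sort_values("Date").reset_index(drop=True)
print(df)
```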

 

(5) Mapping Items into Groups

Mapping is a neat trick that helps with organizing categorical data. Imagine, for example, that we have a huge DataFrame with thousands of rows where one of the columns has items we wish to categorize. Doing so can greatly simplify both the training of Machine Learning models and visualizing the data effectively.

Take a mini example where we have a list of foods that we want to categorize.
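As a sketch of the setup, the food names and the category dictionary below are placeholders, not the original list:

```python
import pandas as pd

# A short list of food items stored in a pandas Series (hypothetical data).
foods = pd.Series([
    "Bread", "Rice", "Steak", "Ham", "Chicken",
    "Apples", "Potatoes", "Mangoes", "Fish",
])

# The desired mapping: each category lists the items that belong to it.
groups_dict = {
    "Protein": ["Steak", "Ham", "Chicken", "Fish"],
    "Carbs": ["Bread", "Rice", "Apples", "Potatoes", "Mangoes"],
}
```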

Here we put our list into a pandas Series. We also create a dictionary showing the mapping we want, categorizing each food item as “Protein” or “Carbs.” This is a toy example, but if this series were at a large scale, say a length of 1,000,000 items, then looping through it wouldn’t be practical at all.

Instead of the basic for-loop, we can write a function that uses Pandas’s built-in .map() method to perform the mapping in an optimized way.

In the function, we first loop through our dictionary to create a new dictionary where the keys represent every possible item in the pandas Series and the values represent the new mapped item, “Protein” or “Carbs”. Then, we simply apply Pandas’s built-in map method to map all of the values in the series.
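A minimal sketch of such a function follows. The name map_to_groups and the sample data are assumptions made for illustration.

```python
import pandas as pd

def map_to_groups(series, groups):
    """Map each item in `series` to the name of the group that contains it."""
    # Invert {group: [items]} into {item: group} for direct lookup.
    item_to_group = {}
    for group, items in groups.items():
        for item in items:
            item_to_group[item] = group
    # Series.map looks up every value; items missing from the dict become NaN.
    return series.map(item_to_group)

# Hypothetical data.
foods = pd.Series([
    "Bread", "Rice", "Steak", "Ham", "Chicken",
    "Apples", "Potatoes", "Mangoes", "Fish",
])
groups_dict = {
    "Protein": ["Steak", "Ham", "Chicken", "Fish"],
    "Carbs": ["Bread", "Rice", "Apples", "Potatoes", "Mangoes"],
}

mapped = map_to_groups(foods, groups_dict)
print(mapped.tolist())
```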

Check out the output below to see the results!

['Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Carbs', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein', 'Carbs', 'Carbs', 'Protein', 'Protein', 'Protein']

 

 

Conclusion

So there you have it! Your 5 advanced features of Pandas and how to use them!

If you’re hungry for more, not to worry! There’s a whole lot more to learn about Pandas and Data Science. As recommended reading, the KDnuggets website is, of course, the best resource on the subject!
