7 Steps to Mastering Data Wrangling with Pandas and Python
Starting out on your data journey? Here’s a 7-step learning path to master data wrangling with pandas.
Are you an aspiring data analyst? If so, learning data wrangling with pandas, a powerful data analysis library, is an essential skill to add to your toolbox.
Almost all data science courses and bootcamps cover pandas in their curriculum. Though pandas is easy to learn, using it idiomatically and getting the hang of common functions and method calls takes practice.
This guide breaks learning pandas down into 7 easy steps, starting with what you are probably already familiar with and gradually exploring the powerful functionalities of pandas. From prerequisites, through various data wrangling tasks, to building a dashboard, here’s a comprehensive learning path.
Step 1: Python and SQL Fundamentals
If you’re looking to break into data analytics or data science, you first need to pick up some basic programming skills. We recommend starting with Python or R, but we’ll focus on Python in this guide.
Learn Python and Web Scraping
To refresh your Python skills you can use one of the following resources:
Python is easy to learn and to start building with. You can focus on the following topics:
- Python basics: Familiarize yourself with Python syntax, data types, control structures, built-in data structures, and basic object-oriented programming (OOP) concepts.
- Web scraping fundamentals: Learn the basics of web scraping, including HTML structure, HTTP requests, and parsing HTML content. Familiarize yourself with libraries like BeautifulSoup and requests for web scraping tasks.
- Connecting to databases: Learn how to connect Python to a database system using libraries like SQLAlchemy or psycopg2. Understand how to execute SQL queries from Python and retrieve data from databases.
While not mandatory, using Jupyter Notebooks for Python and web scraping exercises can provide an interactive environment for learning and experimenting.
SQL is an essential tool for data analysis. But how will learning SQL help you learn pandas?
Well, once you know the logic behind writing SQL queries, it's very easy to transpose those concepts to perform analogous operations on a pandas dataframe.
Learn the basics of SQL (Structured Query Language), including how to create, modify, and query relational databases. Understand SQL commands such as SELECT, INSERT, UPDATE, DELETE, and JOIN.
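To make the SQL-to-pandas mapping concrete, here is a minimal sketch that runs the same filter-and-aggregate logic both ways. The `orders` table and its values are made up for illustration, and the standard library's `sqlite3` with an in-memory database stands in for whatever database system you actually use.

```python
import sqlite3

import pandas as pd

# Hypothetical table, invented for this illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "ana", 20.0), (2, "ben", 35.5), (3, "ana", 12.5), (4, "cy", 50.0)],
)
conn.commit()

# SQL version: filter rows, then aggregate per customer
sql_rows = conn.execute(
    "SELECT customer, SUM(amount) AS total FROM orders "
    "WHERE amount > 15 GROUP BY customer ORDER BY customer"
).fetchall()

# pandas version of the same logic
orders = pd.read_sql("SELECT * FROM orders", conn)
pandas_totals = (
    orders[orders["amount"] > 15]                   # WHERE amount > 15
    .groupby("customer", as_index=False)["amount"]  # GROUP BY customer
    .sum()                                          # SUM(amount)
    .rename(columns={"amount": "total"})            # AS total
    .sort_values("customer")                        # ORDER BY customer
)
conn.close()

print(sql_rows)
print(pandas_totals)
```

Each clause of the query has a direct counterpart in the method chain, which is why SQL fluency tends to transfer quickly.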
To learn and refresh your SQL skills you can use the following resources:
By mastering the skills outlined in this step, you will have a solid foundation in Python programming, SQL querying, and web scraping. These skills serve as the building blocks for more advanced data science and analytics techniques.
Step 2: Loading Data From Various Sources
First, set up your working environment. Install pandas (and its required dependencies like NumPy). Follow best practices like using virtual environments to manage project-level installations.
As mentioned, pandas is a powerful library for data analysis in Python. Before you start working with pandas, however, you should familiarize yourself with its basic data structures: the DataFrame and the Series.
To analyze data, you should first load it from its source into a pandas dataframe. Learning to ingest data from various sources such as CSV files, Excel spreadsheets, relational databases, and more is important. Here’s an overview:
- Reading data from CSV files: Learn how to use the pd.read_csv() function to read data from Comma-Separated Values (CSV) files and load it into a DataFrame. Understand the parameters you can use to customize the import process, such as specifying the file path, delimiter, encoding, and more.
- Importing data from Excel files: Explore the pd.read_excel() function, which allows you to import data from Microsoft Excel files (.xlsx) and store it in a DataFrame. Understand how to handle multiple sheets and customize the import process.
- Loading data from JSON files: Learn to use the pd.read_json() function to load data from JSON files into a DataFrame.
- Reading data from Parquet files: Understand the pd.read_parquet() function, which enables you to import data from Parquet files, a columnar storage file format. Learn how Parquet files offer advantages for big data processing and analytics.
- Importing data from relational database tables: Learn about the pd.read_sql() function, which allows you to query data from relational databases and load it into a DataFrame. Understand how to establish a connection to a database, execute SQL queries, and fetch data directly into pandas.
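As a rough sketch of the readers listed above, the snippet below loads the same tiny, made-up dataset from CSV text, JSON text, and a SQL table. In-memory strings and an in-memory SQLite database are assumptions made so the example runs without any files. In practice you would pass a file path or a real database connection instead. pd.read_excel() and pd.read_parquet() follow the same pattern but need an optional engine installed (for example openpyxl or pyarrow), so they are left out here.

```python
import io
import sqlite3

import pandas as pd

# CSV: sep sets the delimiter (a semicolon in this made-up sample)
csv_text = "name;score\nana;90\nben;85\n"
df_csv = pd.read_csv(io.StringIO(csv_text), sep=";")

# JSON: a list of records becomes one row per record
json_text = '[{"name": "ana", "score": 90}, {"name": "ben", "score": 85}]'
df_json = pd.read_json(io.StringIO(json_text))

# SQL: run a query against a connection and get the result as a DataFrame
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (name TEXT, score INTEGER)")
conn.executemany("INSERT INTO scores VALUES (?, ?)", [("ana", 90), ("ben", 85)])
df_sql = pd.read_sql("SELECT name, score FROM scores WHERE score > 86", conn)
conn.close()

print(df_csv)
print(df_json)
print(df_sql)
```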
We’ve now learned how to load the dataset into a pandas dataframe. What’s next?
Step 3: Selecting Rows and Columns, Filtering DataFrames
Next, you should learn how to select specific rows and columns from a pandas DataFrame, as well as how to filter the data based on specific criteria. Learning these techniques is essential for data manipulation and extracting relevant information from your datasets.
Indexing and Slicing DataFrames
Understand how to select specific rows and columns based on labels or integer positions. You should learn to slice and index into DataFrames using .loc, .iloc, and boolean indexing.
- .loc: This indexer is used for label-based indexing, allowing you to select rows and columns by their labels.
- .iloc: This indexer is used for integer-based indexing, enabling you to select rows and columns by their integer positions.
- Boolean indexing: This technique involves using boolean expressions to filter data based on specific conditions.
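The three approaches above can be sketched on a small dataframe. The data and the string index labels are invented for illustration.

```python
import pandas as pd

# Made-up data with string row labels so labels and positions differ
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4, 19, 31]},
    index=["a", "b", "c"],
)

by_label = df.loc["b", "temp_c"]         # row label "b", column label "temp_c"
label_slice = df.loc["a":"b", ["city"]]  # label slices include the end label
by_position = df.iloc[0, 1]              # first row, second column
pos_slice = df.iloc[0:2]                 # position slices exclude the end
warm = df[df["temp_c"] > 10]             # boolean mask keeps rows where True

print(by_label, by_position)
print(warm)
```

Note the difference in slicing: .loc includes the end label, while .iloc follows Python's usual half-open convention.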
Selecting columns by name is a common operation. So learn how to access and retrieve specific columns using their column names. Practice using single column selection and selecting multiple columns at once.
You should be familiar with the following when filtering dataframes:
- Filtering with conditions: Understand how to filter data based on specific conditions using boolean expressions. Learn to use comparison operators (>, <, ==, etc.) to create filters that extract rows that meet certain criteria.
- Combining filters: Learn how to combine multiple filters using logical operators like '&' (and), '|' (or), and '~' (not). This will allow you to create more complex filtering conditions.
- Using isin(): Learn to use the isin() method to filter data based on whether values are present in a specified list. This is useful for extracting rows where a certain column's values match any of the provided items.
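Here is a small sketch of column selection and the filtering techniques listed above, using a made-up product table. The parentheses around each comparison matter, because '&' and '|' bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({
    "product": ["pen", "ink", "pad", "cap"],
    "price": [2.0, 8.5, 4.0, 12.0],
    "in_stock": [True, False, True, True],
})

names = df["product"]             # single column -> Series
subset = df[["product", "price"]]  # list of columns -> DataFrame

cheap = df[df["price"] < 5]                               # one condition
cheap_in_stock = df[(df["price"] < 5) & df["in_stock"]]   # and
pricey_or_out = df[(df["price"] > 10) | ~df["in_stock"]]  # or, not
picked = df[df["product"].isin(["pen", "cap"])]           # membership test

print(cheap_in_stock)
print(pricey_or_out)
print(picked)
```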
By working on the concepts outlined in this step, you’ll gain the ability to efficiently select and filter data from pandas dataframes, enabling you to extract the most relevant information.
A Quick Note on Resources
For steps 3 to 6, you can learn and practice using the following resources:
- 10 minutes to pandas - pandas user guide
- Pandas and Python for Data Analysis by Example - freeCodeCamp
- Intro to pandas - Kaggle Learn
Step 4: Exploring and Cleaning the Dataset
So far, you know how to load data into pandas dataframes, select columns, and filter dataframes. In this step, you will learn how to explore and clean your dataset using pandas.
Exploring the data helps you understand its structure, identify potential issues, and gain insights before further analysis. Cleaning the data involves handling missing values, dealing with duplicates, and ensuring data consistency:
- Data inspection: Learn how to use methods like head(), tail(), info(), and describe(), and the shape attribute to get an overview of your dataset. These provide information about the first/last rows, data types, summary statistics, and the dimensions of the dataframe.
- Handling missing data: Understand the importance of dealing with missing values in your dataset. Learn how to identify missing data using methods like isnull(), and handle it using fillna(), or imputation methods.
- Dealing with duplicates: Learn how to detect and remove duplicate rows using methods like drop_duplicates(). Duplicates can distort analysis results and should be addressed to ensure data accuracy.
- Cleaning string columns: Learn to use the .str accessor and string methods to perform string cleaning tasks like removing whitespaces, extracting and replacing substrings, splitting and joining strings, and more.
- Data type conversion: Understand how to convert data types using methods like astype(). Converting data to the appropriate types ensures that your data is represented accurately and optimizes memory usage.
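The cleaning steps above might look like the following on a deliberately messy, made-up dataframe. Filling the missing score with the median is just one possible imputation choice, picked here for illustration.

```python
import numpy as np
import pandas as pd

# Made-up messy data: stray whitespace, inconsistent casing,
# a duplicated row, numbers stored as strings, and a missing value
raw = pd.DataFrame({
    "name": [" Ana ", "ben", "ben", "Cy"],
    "age": ["31", "27", "27", "45"],
    "score": [88.0, 75.0, 75.0, np.nan],
})

# Inspection
print(raw.head())
raw.info()
print(raw.describe())
print(raw.shape)

# Count missing values per column
missing_per_column = raw.isnull().sum()

# Cleaning
clean = raw.drop_duplicates().copy()                          # drop the repeated row
clean["name"] = clean["name"].str.strip().str.title()         # tidy strings
clean["age"] = clean["age"].astype(int)                       # strings -> integers
clean["score"] = clean["score"].fillna(clean["score"].median())  # simple imputation

print(clean)
```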
In addition, you can explore your dataset using simple visualizations and perform data quality checks.
Data Exploration and Data Quality Checks
Use visualizations and statistical analysis to gain insights into your data. Learn how to create basic plots with pandas and other libraries like Matplotlib or Seaborn to visualize distributions, relationships, and patterns in your data.
Perform data quality checks to ensure data integrity. This may involve verifying that values fall within expected ranges, identifying outliers, or checking for consistency across related columns.
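As one possible sketch of such checks, the snippet below flags values outside an expected range, finds outliers with the common 1.5 × IQR rule, and checks two related columns for consistency. The column names, the 0 to 120 age range, and the choice of the IQR rule are all assumptions made for this example.

```python
import pandas as pd

# Made-up data; day numbers stand in for real dates
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 250],
    "ordered_day": [1, 3, 5, 7, 9],
    "shipped_day": [2, 4, 4, 8, 10],
})

# Range check: ages outside the assumed valid range of 0 to 120
out_of_range = df[~df["age"].between(0, 120)]

# Outlier check: values more than 1.5 * IQR beyond the quartiles
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["age"] < q1 - 1.5 * iqr) | (df["age"] > q3 + 1.5 * iqr)]

# Consistency check: an order should not ship before it was placed
inconsistent = df[df["shipped_day"] < df["ordered_day"]]

print(out_of_range)
print(outliers)
print(inconsistent)
```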
You now know how to explore and clean your dataset, leading to more accurate and reliable analysis results. Proper data exploration and cleaning are super important for any data science project, as they lay the foundation for successful data analysis and modeling.
Step 5: Transformations, GroupBy, and Aggregations
By now, you are comfortable working with pandas DataFrames and can perform basic operations like selecting rows and columns, filtering, and handling missing data.
You’ll often want to summarize data based on different criteria. To do so, you should learn how to perform data transformations, use the GroupBy functionality, and apply various aggregation methods on your dataset. This can further be broken down as follows:
- Data transformations: Learn how to modify your data using techniques such as adding or renaming columns, dropping unnecessary columns, and converting data between different formats or units.
- Apply functions: Understand how to use the apply() method to apply custom functions to your dataframe, allowing you to transform data in a more flexible and customized way.
- Reshaping data: Explore additional dataframe methods like stack(), which allow you to reshape data and make it suitable for specific analysis needs.
- GroupBy functionality: The groupby() method lets you group your data based on specific column values. This allows you to perform aggregations and analyze data on a per-group basis.
- Aggregate functions: Learn about common aggregation functions like sum, mean, count, min, and max. These functions are used with groupby() to summarize data and calculate descriptive statistics for each group.
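The bullets above can be strung together in a short sketch. The sales data, the column names, and the threshold of 8 units for a "bulk" order are all made up for illustration.

```python
import pandas as pd

# Made-up sales data
sales = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "units": [10, 4, 7, 9],
    "unit_price": [2.5, 2.5, 3.0, 3.0],
})

# Transformations: add a derived column, rename a column
sales["revenue"] = sales["units"] * sales["unit_price"]
sales = sales.rename(columns={"units": "qty"})

# apply(): run a custom function on each value of a column
sales["size"] = sales["qty"].apply(lambda q: "bulk" if q >= 8 else "small")

# groupby() with named aggregations: one row per region
summary = sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    mean_qty=("qty", "mean"),
    orders=("qty", "count"),
)

# stack(): move the column labels into an inner index level (wide -> long)
stacked = summary.stack()

print(summary)
print(stacked)
```

For simple arithmetic like the revenue column, vectorized column operations are usually preferable to apply(), which is best kept for logic that has no built-in equivalent.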
The techniques outlined in this step will help you transform, group, and aggregate your data effectively.
Step 6: Joins and Pivot Tables
Next, you can level up by learning how to perform data joins and create pivot tables using pandas. Joins allow you to combine information from multiple dataframes based on common columns, while pivot tables help you summarize and analyze data in a tabular format. Here’s what you should know:
- Merging DataFrames: Understand different types of joins, such as inner join, outer join, left join, and right join. Learn how to use the merge() function to combine dataframes based on shared columns.
- Concatenation: Learn how to concatenate dataframes vertically or horizontally using the concat() function. This is useful when combining dataframes with similar structures.
- Index manipulation: Understand how to set, reset, and rename indexes in dataframes. Proper index manipulation is essential for performing joins and creating pivot tables effectively.
- Creating pivot tables: The pivot_table() method allows you to transform your data into a summarized and cross-tabulated format. Learn how to specify the desired aggregation functions and group your data based on specific column values.
Optionally, you can explore how to create multi-level pivot tables, where you can analyze data using multiple columns as index levels. With enough practice, you’ll know how to combine data from multiple dataframes using joins and create informative pivot tables.
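A minimal sketch of these operations on two small, made-up tables is shown below. It contrasts an inner and a left join, stacks two dataframes vertically, and builds a simple pivot table.

```python
import pandas as pd

# Made-up tables sharing a customer_id column
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ana", "Ben", "Cy"]})
orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 4],
    "quarter": ["Q1", "Q2", "Q1", "Q1"],
    "amount": [20, 30, 15, 40],
})

# Inner join keeps only customers that have matching orders
inner = customers.merge(orders, on="customer_id", how="inner")

# Left join keeps every customer; unmatched order columns become NaN
left = customers.merge(orders, on="customer_id", how="left")

# Vertical concatenation of dataframes with the same columns
more_customers = pd.DataFrame({"customer_id": [4], "name": ["Dee"]})
all_customers = pd.concat([customers, more_customers], ignore_index=True)

# Pivot table: one row per customer, one column per quarter
pivot = inner.pivot_table(
    index="name", columns="quarter", values="amount",
    aggfunc="sum", fill_value=0,
)

print(left)
print(pivot)
```

Passing a list of columns as `index` (or `columns`) is how you would extend this to a multi-level pivot table.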
Step 7: Build a Data Dashboard
Now that you’ve mastered the basics of data wrangling with pandas, it's time to put your skills to the test by building a data dashboard.
Building interactive dashboards will help you hone both your data analysis and visualization skills. For this step, you need to be familiar with data visualization in Python. Data Visualization - Kaggle Learn is a comprehensive introduction.
When you’re looking for opportunities in data, you need to have a portfolio of projects—and you need to go beyond data analysis in Jupyter notebooks. Yes, you can learn and use Tableau. But you can build on the Python foundation and start building dashboards using the Python library Streamlit.
Streamlit helps you build interactive dashboards—without having to worry about writing hundreds of lines of HTML and CSS.
If you’re looking for inspiration or a resource to learn Streamlit, you can check out this free course: Build 12 Data Science Apps with Python and Streamlit for projects across stock prices, sports, and bioinformatics data. Pick a real-world dataset, analyze it, and build a data dashboard to showcase the results of your analysis.
With a solid foundation in Python, SQL, and pandas you can start applying and interviewing for data analyst roles.
We’ve already included building a data dashboard to bring it all together: from data collection to dashboard and insights. So be sure to build a portfolio of projects. When doing so, go beyond the generic and include projects that you really enjoy working on. If you are into reading or music (which most of us are), try to analyze your Goodreads and Spotify data, build out a dashboard, and improve it. Keep grinding!
Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she's working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more.