Python Pandas For Data Discovery in 7 Simple Steps
Just getting started with Python's Pandas library for data analysis? Or, ready for a quick refresher? These 7 steps will help you become familiar with its core features so you can begin exploring your data in no time.
When I first started using Python to analyze data, the first line of code I wrote was `import pandas as pd`. I was confused about what pandas was and struggled a lot with the code. Many questions were on my mind: why does everyone put `import pandas as pd` on the first line of their Python scripts? What does pandas do that is so valuable?
I believed it would help to understand its background, so, out of curiosity, I did some research through different sources: online courses, Google, teachers, and so on. Eventually I got an answer, and I'd like to share it with you in this article.
Pandas, short for Panel Data, was first released on 11 January 2008 by the well-known developer Wes McKinney, who hated the idea of researchers wasting their time. I eventually understood the importance of pandas from what he said in an interview:
“Scientists unnecessarily dealing with the drudgery of simple data manipulation tasks makes me feel terrible,”
“I tell people that it enables people to analyze and work with data who are not expert computer scientists,”
(Reference: Coding Club)
It’s time to start! Let’s get your hands dirty with some coding! It’s not difficult and is suitable for any beginner. There are 7 steps in total.
Step 1: Importing library
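The code for this step was stripped from the article; it is just the conventional one-line import:

```python
# Import the pandas library under its conventional alias "pd"
import pandas as pd

print(pd.__version__)  # verify the import worked
```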
Step 2: Reading data
Method 1: load in a text file containing tabular data
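The original snippet was not preserved, so here is a minimal sketch of Method 1. The CSV content is hypothetical and inlined to keep the example self-contained; `pd.read_csv` works the same way with a file path such as `"people.csv"`.

```python
import io
import pandas as pd

# A tiny tabular text file, inlined so the example runs anywhere;
# in practice you would pass a path like "people.csv" instead
csv_text = "name,sex,year_born\nAnna,F,1980\nBen,M,1981\n"
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```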
Method 2: create a DataFrame in Pandas from a Python dictionary
Here, I used df.head(), which returns the first 5 rows by default. Remark: pandas rows, like Python lists, are 0-indexed, so the first row is at index 0, the second at index 1, and so on.
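The article's own dataset was not preserved, so below is a hypothetical reconstruction of Method 2 that matches the facts stated later in the article (10 rows; columns name, sex and year_born; 6 females and 4 males; 2 missing birth years). All names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data; None becomes NaN in the numeric column
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

print(df.head())  # first 5 rows by default: indices 0 through 4
```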
Step 3: Understanding the data types, number of rows and columns
(10, 3)  # (rows, columns)
Number of variables: 3
Number of rows: 10
Data types are an important concept in programming: a data type classifies data items and determines what operations can be performed on them.
If you are not familiar with data types, this table may be useful for you.
From the output of df.info(), we know there are 3 columns, along with each column's type, its count of non-null values, and how much memory the DataFrame occupies.
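A sketch of the inspection calls behind this step, run on the hypothetical dataset from Step 2:

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

print(df.shape)   # (10, 3) -> (rows, columns)
print(df.dtypes)  # name/sex are object; year_born is float64 because of NaN
df.info()         # column types, non-null counts and memory usage
```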
Step 4: Observing categorical data
The table above highlights the number of unique values in each column, which lets you determine which columns may be categorical. For example, sex has only 2 unique values (which makes sense: M/F). It is less likely that name and year_born are categorical variables, because their numbers of unique values are quite large.
6 females and 4 males
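A sketch of the calls that produce those counts, again on the hypothetical dataset from Step 2:

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

print(df.nunique())              # unique values per column; sex has only 2
print(df["sex"].value_counts())  # how often each category occurs: F 6, M 4
```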
Step 5: Exploring data
To combine conditions with & (AND), | (OR) and ~ (NOT), you have to wrap each condition in parentheses ("(" and ")"), because these operators bind more tightly than comparisons.
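For example, on the hypothetical dataset from Step 2 (the filter names below are invented for illustration):

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

# Parentheses around every condition are required: & and |
# bind more tightly than == and > in Python
women_after_82 = df[(df["sex"] == "F") & (df["year_born"] > 1982)]
men_or_1970s = df[(df["sex"] == "M") | (df["year_born"] < 1980)]
not_men = df[~(df["sex"] == "M")]
print(not_men)
```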
Next, I’ll use loc to do the data selection.
other aggregations: min(), max(),sum(), mean(), std()
From the above examples, you should know how to use iloc and loc. iloc is short for "integer location": it gives access to the DataFrame in matrix-style [row, column] notation using integer positions. loc is label-based: you specify rows and columns by their row and column labels (names).
From my experience, people easily mix up the usage of loc and iloc, so I prefer to stick to one of them: loc.
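Putting the two side by side on the hypothetical dataset from Step 2:

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

# iloc: integer positions in [row, column] "matrix" notation
print(df.iloc[0, 0])      # first row, first column
# loc: labels; here the row labels happen to be the default integer index
print(df.loc[0, "name"])  # same cell, addressed by label

# loc combines nicely with boolean masks and aggregations
print(df.loc[df["sex"] == "F", "year_born"].mean())
# the other aggregations work the same way: .min(), .max(), .sum(), .std()
```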
Step 6: Finding the missing values
There are 2 missing values, both in the 'year_born' column.
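The usual one-liner for this step, run on the hypothetical dataset from Step 2:

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

print(df.isnull().sum())  # missing values per column: 2, all in year_born
```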
Step 7: Handling missing values
When inplace=True is passed, the DataFrame is modified in place rather than a modified copy being returned.
The year_born of Julie and Albert becomes 1982.25 (replaced by the column mean).
The rows at index 4 and 9 are dropped.
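Both options, sketched on the hypothetical dataset from Step 2 (whose year_born mean over the 8 known values is 1982.25, matching the article):

```python
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

# Option 1: fill the gaps with the column mean (1982.25 on this data);
# passing inplace=True instead would modify df itself
filled = df.fillna(df["year_born"].mean())

# Option 2: drop every row that contains a missing value
dropped = df.dropna()  # removes the rows at index 4 and 9 here
print(dropped.shape)
```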
Step 8: Visualising data
Matplotlib is a 2D plotting library for Python. You can easily generate plots, histograms, power spectra, bar charts, scatter plots, etc., with just a few lines of code. The example here plots a histogram. The %matplotlib inline magic makes plot outputs appear and be stored within the notebook; it is not related to how the pandas hist() method itself works.
The year_born distribution where sex == 'F'.
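A sketch of that histogram on the hypothetical dataset from Step 2 (the Agg backend and output filename are choices made here so the example runs outside a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; in a notebook use %matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample data matching the article's description
df = pd.DataFrame({
    "name": ["Anna", "Ben", "Clara", "David", "Julie",
             "Emma", "Frank", "Grace", "Helen", "Albert"],
    "sex": ["F", "M", "F", "M", "F", "F", "M", "F", "F", "M"],
    "year_born": [1980, 1981, 1982, 1983, None,
                  1984, 1985, 1979, 1984, None],
})

# Histogram of year_born for the female rows (NaN values are dropped)
ax = df.loc[df["sex"] == "F", "year_born"].hist()
ax.set_title("year_born where sex == 'F'")
plt.savefig("year_born_hist.png")  # or plt.show() in an interactive session
```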
Great! We have now gone through all the steps of data discovery using the pandas library. Let me sum up the functions I have used:
Bonus: Let me introduce the fastest way to do exploratory data analysis (EDA) with only two lines of code: pandas_profiling
- Dataset info: Number of variables, Number of observations, Missing cells, Duplicate rows, Total size in memory and Average record size in memory
- Variables: Missing values and their percentages, Distinct count, Unique
- Missing Values: ‘Matrix’ and ‘Count’
- Sample: First 10 rows and Last 10 rows
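A sketch of those two lines, assuming the pandas-profiling package is installed (in recent releases it has been renamed ydata-profiling, with the same ProfileReport class); the tiny DataFrame and output filename are invented for illustration:

```python
import pandas as pd
from pandas_profiling import ProfileReport  # pip install pandas-profiling

df = pd.DataFrame({"name": ["Anna", "Ben"], "sex": ["F", "M"]})
ProfileReport(df).to_file("report.html")  # one call generates the full EDA report
```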
Original. Reposted with permission.