Feature Engineering of DateTime Variables for Data Science, Machine Learning
Learn how to make more meaningful features from DateTime type variables to be used by Machine Learning Models.
By Samarth Agrawal, Data Scientist at Toyota
Feature Engineering of DateTime Variables. Image by Author.
INTRODUCTION
DateTime fields require feature engineering to turn raw timestamps into insightful information that our Machine Learning models can use. This post is divided into 3 parts plus a Bonus section towards the end; we will use a combination of built-in pandas and NumPy functions, as well as our own functions, to extract useful features.
- Part 1 — Extract Date / Time Components
- Part 2 — Create Boolean Flags
- Part 3 — Calculate Date / Time Differences
- Bonus — Feature Engineering in 2 lines of code using fast_ml
BACKGROUND
Whenever I have worked on e-commerce related data, the dataset has, in one way or another, contained DateTime columns.
- User registration date-time
- User login date-time
- Transaction date-time
- Disputed transaction date-time
- … and many more
At the outset, a date field gives us nothing more than a specific point on a timeline. But these DateTime fields are potential treasure troves of data, and immensely powerful for uncovering patterns if used rightly.
As a Data Scientist, your job is to bring insight to the table, and for that, you need to ask the right questions. For example:
- Ques 1 — When do you see most carts getting created?
- Ques 2 — When do you see most carts getting abandoned?
- Ques 3 — When do you see the most fraudulent transactions?
- Ques 4 — When do the maximum users subscribe?
- Ques 5 — When are certain items purchased most often?
- Ques 6 — How many days/hours after registration does a user make their first order?
- Ques 7 — After how many days of inactivity does a customer never return to your site?
- … etc
Now, to answer these questions, you go back to the data and engineer these DateTime fields. A whole lot of patterns can then be discovered.
Part 1 of the post provides the feature engineering steps to answer questions 1, 2 & 3
- Ans 1 — When do you see most carts getting created? The first week of the month
- Ans 2 — When do you see most carts getting abandoned? Wednesday-Afternoon
- Ans 3 — When do you see the most fraudulent transactions? Friday, late night
Part 2 of the post provides the feature engineering steps to answer questions 4 & 5
- Ans 4 — When do the maximum users subscribe? At the start of the year
- Ans 5 — When are certain items purchased most often? At the start of the month
Part 3 of the post provides the feature engineering steps to answer questions 6 & 7
- Ans 6 — How many days/hours after registration does a user make their first order? Within 2 hours
- Ans 7 — After how many days of inactivity does a customer never return to your site? After 14 days of inactivity
I have used an e-commerce example because that is where I have personally found a lot of use cases, but the scope of extracting information is by no means limited to it. We will see in this post how behaviors learned by asking the right questions, i.e., doing the right feature engineering, have proved useful in multiple industries across a variety of problems.
LET’S GET STARTED
Load the dataset
import pandas as pd

df = pd.read_csv('/kaggle/input/loan-data/loan.csv',
                 parse_dates=['date_issued', 'date_last_payment'])
Pandas provides a very simple yet powerful way to deal with DateTime-related variables: parsing them as dates while loading. You can pass all the date-time related variables as a list in the parse_dates parameter.
Let's say you are not aware of the DateTime variables upfront, and after investigating the data you find that some of the variables are date-times. Instead of reloading the data, pandas provides another useful function, to_datetime, to convert the data type to DateTime.
df['date_issued'] = pd.to_datetime(df['date_issued'],
errors = 'coerce')
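As a quick illustration on a small made-up Series (the loan dataset itself is not needed here), errors='coerce' turns any unparseable value into NaT instead of raising an error:

```python
import pandas as pd

# A small made-up Series with one malformed entry
s = pd.Series(['2013-10-26', '2014-01-01', 'not a date'])

parsed = pd.to_datetime(s, errors='coerce')

# The malformed entry becomes NaT (pandas' "Not a Time")
print(parsed.isna().tolist())  # [False, False, True]
```

This is usually safer than the default errors='raise' when loading messy real-world data, because a single bad row will not abort the whole conversion.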
Part 1. Extract Date / Time Components
As illustrated in the example above, we can extract the components of the date-time part (year, quarter, month, day, day_of_week, day_of_year, week_of_year, time, hour, minute, second, day_part) from the given date-time variable. The list below shows several such components that can be extracted using pandas' built-in functions.
Syntax:
We can extract all these components using the .dt accessor. Read more about the date accessors here.
Extracted components from datetime variable issued_date. Image by Author
Below is the code as shown in the image. All the other components can be extracted in a similar way.
#1
df['date_issued:year'] = df['date_issued'].dt.year
#2
df['date_issued:month'] = df['date_issued'].dt.month
#3
df['date_issued:day_of_week'] = df['date_issued'].dt.dayofweek
#4
df['date_issued:week_of_year'] = df['date_issued'].dt.isocalendar().week
#5
df['date_issued:hour'] = df['date_issued'].dt.hour
Note:
- Monday : day_of_week = 0
- Tuesday : day_of_week = 1
- …
- Sunday : day_of_week = 6
Creating Day Part:
#day_part function
def day_part(hour):
    if hour in [4,5]:
        return "dawn"
    elif hour in [6,7]:
        return "early morning"
    elif hour in [8,9,10]:
        return "late morning"
    elif hour in [11,12,13]:
        return "noon"
    elif hour in [14,15,16]:
        return "afternoon"
    elif hour in [17,18,19]:
        return "evening"
    elif hour in [20,21,22]:
        return "night"
    elif hour in [23,0,1,2,3]:
        return "midnight"

#Run function with apply method
df['date_issued:day_part'] = df['date_issued:hour'].apply(day_part)

df.head()
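If you prefer a lookup table over an if/elif chain, the same bucketing can be built once as a dict and applied with .map. A sketch on made-up hours (note that hours run 0–23, so the midnight bucket covers [23, 0, 1, 2, 3]):

```python
import pandas as pd

# Build an hour -> day-part lookup once (same buckets as the day_part function)
parts = {
    "dawn": [4, 5], "early morning": [6, 7], "late morning": [8, 9, 10],
    "noon": [11, 12, 13], "afternoon": [14, 15, 16], "evening": [17, 18, 19],
    "night": [20, 21, 22], "midnight": [23, 0, 1, 2, 3],
}
hour_to_part = {h: name for name, hours in parts.items() for h in hours}

# Made-up hours for illustration
hours = pd.Series([9, 15, 23, 2])
print(hours.map(hour_to_part).tolist())
# ['late morning', 'afternoon', 'midnight', 'midnight']
```

.map with a dict is vectorized by pandas, so it tends to be faster than .apply with a Python function on large frames.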
Creating day part for datetime variable issued_date. Image by Author
Part 2. Create Boolean Flags
As illustrated in the example above, we can extract a lot of boolean flags (is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end, is_weekend) from the given date-time variable. The list below shows several such flags that can be extracted using pandas' built-in functions, as well as by creating some functions of our own.
Syntax:
Again, we can use the .dt accessor to extract a lot of these boolean flags.
Extracted boolean flags from datetime variable issued_date. Image by Author
#1
df['date_issued:is_year_start'] = df['date_issued'].dt.is_year_start
#2
df['date_issued:is_quarter_start'] = df['date_issued'].dt.is_quarter_start
#3
df['date_issued:is_month_start'] = df['date_issued'].dt.is_month_start
#4
df['date_issued:is_month_end'] = df['date_issued'].dt.is_month_end
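To see these flags in action without the loan dataset, here is a sketch on a few made-up dates:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2013-01-01', '2013-06-30', '2013-10-26']))

print(dates.dt.is_year_start.tolist())   # [True, False, False]
print(dates.dt.is_quarter_end.tolist())  # [False, True, False]  (30th June ends Q2)
print(dates.dt.is_month_start.tolist())  # [True, False, False]
```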
Creating Weekend Flag:
Creating weekend flag for datetime variable issued_date. Image by Author
If we check the calendar, we will see that 26th of Oct, 2013 was a Saturday — a weekend.
import numpy as np

df['date_issued:is_weekend'] = np.where(df['date_issued:day_of_week'].isin([5,6]), 1, 0)
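You can verify the claim above directly: pandas gives dayofweek == 5 for a Saturday, so the weekend flag fires for 26th Oct 2013. A sketch on standalone made-up timestamps:

```python
import numpy as np
import pandas as pd

# Friday, Saturday, Sunday
days = pd.Series(pd.to_datetime(['2013-10-25', '2013-10-26', '2013-10-27']))

dow = days.dt.dayofweek                       # Mon=0 ... Fri=4, Sat=5, Sun=6
is_weekend = np.where(dow.isin([5, 6]), 1, 0)
print(is_weekend.tolist())                    # [0, 1, 1]
```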
Part 3. Calculate Date / Time Differences
Often your questions/analysis will be relative to another point of reference. Like,
- How many days/hours after registration does a user make their first order? registration_date & first_order_date
- In how many days/hours was the customer's complaint resolved? complaint_date & resolution_date
- How recently, as of today, did the customer order from your site? today & last_order_date
- …etc
In our example dataset, we have two columns date_last_payment & date_issued. Let’s see what happens when we just take a difference of these 2 columns.
Calculating time differences in pandas. Image by Author.
Pandas by default provides the difference in days. Notice the dtype: timedelta64[ns].
From the numpy documentation:
“Because NumPy doesn’t have a physical quantities system in its core, the timedelta64 data type was created to complement datetime64”
Now, if we just want the numeric part and not the entire string 947 days, we can do that using the .dt.days accessor.
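On made-up timestamps (67 days apart here, for illustration), the .dt accessor strips the timedelta down to an integer day count:

```python
import pandas as pd

issued = pd.Series(pd.to_datetime(['2013-10-26']))
last_payment = pd.Series(pd.to_datetime(['2014-01-01']))

delta = last_payment - issued   # dtype: timedelta64[ns]
print(delta.dt.days.tolist())   # [67]
```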
Calculating time differences in pandas using .dt accessor. Image by Author.
Unfortunately we can’t get the months in similar fashion.
Calculating time differences in pandas using .dt accessor is not scalable. Image by Author.
Here, timedelta64 from NumPy becomes very useful.
Syntax:
In order to get the number of months between the date the loan was issued and the date the last payment was made, we write this:
Calculating time differences in pandas using NumPy timedelta64. Image by Author.
(df['date_last_payment'] - df['date_issued'])/np.timedelta64(1, 'M')
timedelta64 can take the following parameters for calculating the difference between 2 dates:
- ‘D’ → for Days
- ‘W’ → for Weeks
- ‘M’ → for Months
- ‘Y’ → for Years
- ‘h’ → for Hours
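A sketch of a few of these units on made-up dates. One caveat: recent pandas versions reject the ambiguous 'M' and 'Y' units when dividing a timedelta, so months are shown here at month precision with plain NumPy datetime64 values:

```python
import numpy as np
import pandas as pd

issued = pd.Series(pd.to_datetime(['2013-10-26']))
last_payment = pd.Series(pd.to_datetime(['2014-01-01']))
delta = last_payment - issued

print((delta / np.timedelta64(1, 'D')).tolist())  # [67.0]  days
print((delta / np.timedelta64(1, 'h')).tolist())  # [1608.0]  hours (67 * 24)

# Month differences computed at month precision with plain NumPy
months = (np.datetime64('2014-01') - np.datetime64('2013-10')) / np.timedelta64(1, 'M')
print(months)  # 3.0
```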
BONUS!
You can use fast_ml to create all these datetime features
First, install the fast_ml package:
!pip install fast_ml --upgrade
And then, from the feature_engineering module, import the method for engineering date-time features:
from fast_ml.feature_engineering import FeatureEngineering_DateTime
Now, this works in exactly the same way as other transformers and preprocessors from sklearn:
- Instantiate
- Fit
- Transform
#Instantiate
dt_fe = FeatureEngineering_DateTime()

#Fit
dt_fe.fit(df, datetime_variables=['date_issued'],
          prefix='date_issued:')

#Transform
df = dt_fe.transform(df)
df.head()
Feature Engineering of Datetime variables using Fast_ml. Image by Author.
Not all the columns are visible in the screenshot, so let's look at the columns of the dataset:
df.columns

---Output---
Index(['customer_id', 'disbursed_amount', 'interest', 'market', 'employment', 'time_employed', 'householder', 'income', 'date_issued', 'target', 'loan_purpose', 'number_open_accounts', 'date_last_payment', 'number_credit_lines_12',
(Notice from here ------->)
'date_issued:year', 'date_issued:quarter', 'date_issued:month', 'date_issued:day', 'date_issued:day_of_week', 'date_issued:day_of_year', 'date_issued:weekofyear', 'date_issued:is_month_end', 'date_issued:is_month_start', 'date_issued:is_quarter_end', 'date_issued:is_quarter_start', 'date_issued:is_year_end', 'date_issued:is_year_start', 'date_issued:time', 'date_issued:hour', 'date_issued:minute', 'date_issued:second', 'date_issued:is_weekend', 'date_issued:day_part'],
dtype='object')
Thanks for reading!!
- If you enjoyed this, follow me on medium for more.
- Your claps are a huge encouragement and help in writing more and writing better.
- Interested in collaborating? Let’s connect on Linkedin.
- Please feel free to write your thoughts/suggestions/feedback.
- Kaggle link
- Fast_ml link
Notebook is available at the following location with fully functional code:
Bio: Samarth Agrawal is a Data Scientist at Toyota, and a Data Science practitioner and communicator.
Original. Reposted with permission.