Submit a blog to KDnuggets -- Top Blogs Win A Reward

Topics: AI | Data Science | Data Visualization | Deep Learning | Machine Learning | NLP | Python | R | Statistics

KDnuggets Home » News » 2021 » Apr » Tutorials, Overviews » Feature Engineering of DateTime Variables for Data Science, Machine Learning ( 21:n17 )

Feature Engineering of DateTime Variables for Data Science, Machine Learning


Learn how to make more meaningful features from DateTime type variables to be used by Machine Learning Models.



By Samarth Agrawal, Data Scientist at Toyota



Feature Engineering of DateTime Variables. Image by Author.

 

INTRODUCTION

 
DateTime fields require Feature Engineering to turn them from data to insightful information that can be used by our Machine Learning Models. This post is divided into 3 parts and a Bonus section towards the end, we will use a combination of inbuilt pandas and NumPy functions as well as our functions to extract useful features.

  • Part 1 — Extract Date / Time Components
  • Part 2 — Create Boolean Flags
  • Part 3 — Calculate Date / Time Differences
  • Bonus — Feature Engineering in 2 lines of code using fast_ml

 

BACKGROUND

 
Whenever I have worked on e-commerce related data, in some way or the other dataset contains DateTime columns.

  • User registration date-time
  • User login date-time
  • Transaction date-time
  • Disputed transaction date-time
  • … and many more

At the outset, this date field gives us nothing more than a specific point on a timeline. But these DateTime fields are potential treasure troves of data. These fields are immensely powerful ‘if used rightly’ for uncovering patterns.

As a Data Scientist, your job is to bring the insight to the table, and for that, you are required to ask the right questions. For Ex.

  • Ques 1 — When do you see most carts getting created?
  • Ques 2 — When do you see most carts getting abandoned?
  • Ques 3 — When do you see the most fraudulent transactions?
  • Ques 4 — When do the maximum users subscribe?
  • Ques 5 — When are certain items purchased most often?
  • Ques 6 — After how many days/hours after registration user makes the first order?
  • Ques 7 — After how many days of inactivity customer never returns to your site?
  • … etc

Now, To answer these questions, you get back to data to engineer these DateTime fields. And then a whole lot of patterns can be discovered.

Part 1 of the post will provide you feature engineering steps to answer questions like 1, 2 & 3

  • Ans 1 — When do you see most carts getting created? The first week of the month
  • Ans 2 — When do you see most carts getting abandoned? Wednesday-Afternoon
  • Ans 3 — When do you see the most fraudulent transactions? Friday-Late nite

Part 2 of the post will provide you feature engineering steps to answer questions like in 4 & 5

  • Ans 4 — When do the maximum users subscribe? At the start of the year
  • Ans 5 — When are certain items purchased most often? At the start of the month

Part 3 of the post will provide you feature engineering steps to answer questions like in 6 & 7

  • Ans 6 — After how many days/hours after registration user makes the first order? Within 2 hours
  • Ans 7 — After how many days of inactivity customer never returns to your site? After 14 days of inactivity

I have used an example of e-commerce data where I have personally found a lot of use cases but by no means the scope of extracting information is related to just that. We will see in this post how some of the behaviors that can be learned by asking the right question ie. doing the right feature engineering has proved useful in multiple industries across a variety of problems.

 

LET’S GET STARTED

 

Load the dataset

 

import pandas as pddf = pd.read_csv('/kaggle/input/loan-data/loan.csv', 
                 parse_dates = ['date_issued', 'date_last_payment'])


Pandas provide a very simple yet very powerful way to deal with DateTime-related variables by parsing them as dates. You can pass as a list all the variables that are date-time related in the parameter parse_dates.

Let’s say you are not aware of the DateTime variables upfront and after investigating the data you found that some of the variables are date-time. So instead of reloading the data, pandas provide another useful function to_datetime to convert the data type to DateTime.

df['date_issued'] = pd.to_datetime(df['date_issued'], 
                                   errors = 'coerce')


 

Part 1. Extract Date / Time Component

 
As illustrated in the example above, we can extract the component of the date-time part (year, quarter, month, day, day_of_week, day_of_year, week_of_year, time, hour, minute, second, day_part) from the given date-time variable. The below list provides several of such components that can be extracted using pandas inbuilt functions.

 

Syntax:

 
We can extract all these components using .dt accessor. Read more about the date accessors here



Extracted components from datetime variable issued_date. Image by Author

 

Below is the code as shown in the image. All the other components can also be extracted in a similar way

#1
df[‘date_issued:year’] = df[‘date_issued’].dt.year#2
df[‘date_issued:month’] = df[‘date_issued’].dt.month#3
df[‘date_issued:day_of_week’] = df[‘date_issued’].dt.day_of_week#4
df[‘date_issued:week_of_year’] = df[‘date_issued’].dt.week_of_year#5
df[‘date_issued:hour’] = df[‘date_issued’].dt.hour


Note:

  • For Monday : day_of_week = 0,
  • Tuesday : day_of_week=1,
  • Sunday : day_of_week=6

 

Creating Day Part:

 

#day_part function
def day_part(hour):
    if hour in [4,5]:
        return "dawn"
    elif hour in [6,7]:
        return "early morning"
    elif hour in [8,9,10]:
        return "late morning"
    elif hour in [11,12,13]:
        return "noon"
    elif hour in [14,15,16]:
        return "afternoon"
    elif hour in [17, 18,19]:
        return "evening"
    elif hour in [20, 21, 22]:
        return "night"
    elif hour in [23,24,1,2,3]:
        return "midnight"

#Run function with apply method
df['date_issued:day_part'] = df['date_issued:hour'].apply(day_part)df.head()




Creating day part for datetime variable issued_date. Image by Author

 

Part 2. Create Boolean Flags

 
As illustrated in the example above, we can extract a lot of boolean Flags (is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end, is_weekend) from the given date-time variable. The below list provides several of such components that can be extracted using pandas inbuilt functions as well as by creating some of our functions.

 

Syntax:

 
Again, we can use .dt accessor to extract a lot of these boolean flags.



Extracted boolean flags from datetime variable issued_date. Image by Author

 

#1
df['date_issued:is_year_start'] = df['date_issued'].dt.is_year_start#2
df['date_issued:is_quarter_start'] = df['date_issued'].dt.is_quarter_start#3
df['date_issued:is_month_start'] = df['date_issued'].dt.is_month_start#4
df['date_issued:is_month_end'] = df['date_issued'].dt.is_month_end


 

Creating Weekend Flag:

 



Creating weekend flag for datetime variable issued_date. Image by Author

 

If we check the calendar, we will see that 26th of Oct, 2013 was a Saturday — a weekend.

df['date_issued:is_weekend'] = np.where(df['date_issued:day_of_week'].isin([5,6]), 1,0)


 

Part 3. Calculate Date / Time Differences

 
Often your questions/analysis will be relative to another point of reference. Like,

  1. After how many days/hours after registration user makes the first order? registration_date & first_order_date
  2. In how many days/hours customer’s complaint was resolved? complain_date & resolution_date
  3. From today, how recently customer ordered from your site? today & last_order_date
  4. …etc

In our example dataset, we have two columns date_last_payment & date_issued. Let’s see what happens when we just take a difference of these 2 columns.



Calculating time differences in pandas. Image by Author.

 

Pandas by default provide the difference in ‘days’. Notice the dtype: timedelta64[ns].

From the numpy documentation:


“Because NumPy doesn’t have a physical quantities system in its core, the timedelta64 data type was created to complement datetime64”


Now, if we just want the numeric part and not the entire string 947 days, we can do that by using the .dt accessor.



Calculating time differences in pandas using .dt accessor. Image by Author.

 

Unfortunately we can’t get the months in similar fashion.



Calculating time differences in pandas using .dt accessor is not scalable. Image by Author.

 

Here the timedelta64 from NumPy becomes very useful.

 

Syntax:

 
In order to get the number of months between date loan was issued and date last payment was done, we will write this



Calculating time differences in pandas using NumPy timedelta64. Image by Author.

 

(df['date_last_payment'] - df['date_issued'])/np.timedelta64(1, 'M')


timedelta64 can take following parameters for calculating the difference between 2 dates:

  • ‘D’ → for Days
  • ‘W’ → for Weeks
  • ‘M’ → for Months
  • ‘Y’ → for Years
  • ‘h’ → for Hours

 

BONUS!

 
You can use fast_ml to create all these datetime features

First, install fast_ml package

!pip install fast_ml — upgrade


And then, from the feature_engineering module import the method for engineering date time features

from fast_ml.feature_engineering import FeatureEngineering_DateTime


Now, this works in the exact same way as other transformers, preprocessors from sklearn.

  • Instantiate
  • Fit
  • Transform
#Instantiate
dt_fe = FeatureEngineering_DateTime()#Fit
dt_fe.fit(df, datetime_variables=['date_issued'], 
          prefix = 'date_issued:')#Transform
df = dt_fe.transform(df)
df.head()




Feature Engineering of Datetime variables using Fast_ml. Image by Author.

 

All the columns are not visible in the screenshot. Let’s just look at the columns of the dataset

df.columns---Output---
Index(['customer_id', 'disbursed_amount', 'interest', 'market', 'employment', 'time_employed', 'householder', 'income', 'date_issued', 'target', 'loan_purpose', 'number_open_accounts', 'date_last_payment', 'number_credit_lines_12',(Notice from here ------->)
'date_issued:year', 'date_issued:quarter', 'date_issued:month', 'date_issued:day', 'date_issued:day_of_week', 'date_issued:day_of_year', 'date_issued:weekofyear', 'date_issued:is_month_end', 'date_issued:is_month_start', 'date_issued:is_quarter_end', 'date_issued:is_quarter_start', 'date_issued:is_year_end', 'date_issued:is_year_start', 'date_issued:time', 'date_issued:hour', 'date_issued:minute', 'date_issued:second', 'date_issued:is_weekend', 'date_issued:day_part'],
dtype='object')


 

Thanks for reading!!

 

  • If you enjoyed this, follow me on medium for more.
  • Your claps are a huge encouragement and help in writing more and writing better.
  • Interested in collaborating? Let’s connect on Linkedin.
  • Please feel free to write your thoughts/suggestions/feedback.
  • Kaggle link
  • Fast_ml link

Notebook is available at the following location with fully functional code:

Feature Engineering of DateTime Variables
Explore and run machine learning code with Kaggle Notebooks | Using data from loan_data

 
Bio: Samarth Agrawal is a Data Scientist at Toyota, and a Data Science practitioner and communicator.

Original. Reposted with permission.

Related:


Sign Up

By subscribing you accept KDnuggets Privacy Policy