Date Processing and Feature Engineering in Python

Have a look at some code to streamline the parsing and processing of dates in Python, including the engineering of some useful and common features.



Figure: Photo by Sonja Langford on Unsplash

 

Maybe, like me, you deal with dates a lot when processing data in Python. Maybe, also like me, you get frustrated with dealing with dates in Python, and find you consult the documentation far too often to do the same things over and over again.

Like anyone who codes and finds themselves doing the same thing more than a handful of times, I wanted to make my life easier by automating some common date processing tasks, along with some simple and frequent feature engineering, so that everything I routinely need for a given date could be done with a single function call. I could then select which features I was interested in afterwards.

This date processing is accomplished with a single Python function, which accepts a single date string formatted as 'YYYY-MM-DD' (because that's how dates are formatted), and which returns a dictionary consisting of (currently) 18 key/value feature pairs. Some of these keys are very straightforward (e.g. the parsed four-digit year) while others are engineered (e.g. whether or not the date is a public holiday). If you find this code at all useful, you should be able to figure out how to alter or extend it to suit your needs. For some ideas on additional date/time related features you may want to generate, check out this article.

Most of the functionality comes from Python's datetime module, and much of it relies on the strftime() method. The real benefit, however, is that there is a standard, automated approach to the same repetitive queries.
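As a quick illustration of the strftime() directives the function relies on, here is a minimal sketch using the same example date that is processed later in this article. The variable names are just for illustration, and the month name assumes the default English locale.

```python
import datetime

# Example date; the same one processed later in the article
d = datetime.date(2021, 7, 20)

year_long = d.strftime('%Y')     # four-digit year
year_short = d.strftime('%y')    # two-digit year
month_name = d.strftime('%B')    # full month name (locale dependent)
day_of_year = d.strftime('%j')   # zero-padded day of the year
week_of_year = d.strftime('%W')  # week of the year, weeks starting on Monday
day_of_week = d.strftime('%w')   # weekday as a number, Sunday is 0

print(year_long, year_short, month_name, day_of_year, week_of_year, day_of_week)
```

Every value comes back as a string, which is why the dictionary returned later holds strings such as '07' and '201' rather than integers.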

The only non-standard library used is holidays, a "fast, efficient Python library for generating country, province and state specific sets of holidays on the fly." While the library can accommodate a whole host of national and sub-national holidays, I have used the US national holidays for this example. With a quick glance at the project's documentation and the code below, you will very easily determine how to change this if needed.

So, let's first take a look at the process_date() function. The comments should provide insight into what is going on, should you need it.

import datetime, re, sys, holidays

def process_date(input_str: str) -> dict:
    """Processes and engineers simple features for date strings

    Parameters:
      input_str (str): Date string of format '2021-07-14'

    Returns:
      dict: Dictionary of processed date features
    """

    # Validate date string input; fullmatch() also rejects trailing characters
    regex = re.compile(r'\d{4}-\d{2}-\d{2}')
    if not regex.fullmatch(input_str):
        print("Invalid date format")
        sys.exit(1)

    # Process date features
    my_date = datetime.datetime.strptime(input_str, '%Y-%m-%d').date()
    now = datetime.datetime.now().date()
    date_feats = {}

    date_feats['date'] = input_str
    date_feats['year'] = my_date.strftime('%Y')
    date_feats['year_s'] = my_date.strftime('%y')
    date_feats['month_num'] = my_date.strftime('%m')
    date_feats['month_text_l'] = my_date.strftime('%B')
    date_feats['month_text_s'] = my_date.strftime('%b')
    date_feats['dom'] = my_date.strftime('%d')
    date_feats['doy'] = my_date.strftime('%j')
    date_feats['woy'] = my_date.strftime('%W')

    # Fixing day of week to start on Mon (1), end on Sun (7)
    dow = my_date.strftime('%w')
    if dow == '0':
        dow = '7'  # keep dow a string so the comparisons below match Sunday
    date_feats['dow_num'] = dow

    if dow == '1':
        date_feats['dow_text_l'] = 'Monday'
        date_feats['dow_text_s'] = 'Mon'
    if dow == '2':
        date_feats['dow_text_l'] = 'Tuesday'
        date_feats['dow_text_s'] = 'Tue'
    if dow == '3':
        date_feats['dow_text_l'] = 'Wednesday'
        date_feats['dow_text_s'] = 'Wed'
    if dow == '4':
        date_feats['dow_text_l'] = 'Thursday'
        date_feats['dow_text_s'] = 'Thu'
    if dow == '5':
        date_feats['dow_text_l'] = 'Friday'
        date_feats['dow_text_s'] = 'Fri'
    if dow == '6':
        date_feats['dow_text_l'] = 'Saturday'
        date_feats['dow_text_s'] = 'Sat'
    if dow == '7':
        date_feats['dow_text_l'] = 'Sunday'
        date_feats['dow_text_s'] = 'Sun'

    if int(dow) > 5:
        date_feats['is_weekday'] = False
        date_feats['is_weekend'] = True
    else:
        date_feats['is_weekday'] = True
        date_feats['is_weekend'] = False

    # Check date in relation to holidays
    us_holidays = holidays.UnitedStates()
    date_feats['is_holiday'] = input_str in us_holidays
    date_feats['is_day_before_holiday'] = my_date + datetime.timedelta(days=1) in us_holidays
    date_feats['is_day_after_holiday'] = my_date - datetime.timedelta(days=1) in us_holidays

    # Days from today
    date_feats['days_from_today'] = (my_date - now).days

    return date_feats
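
The validation step above prints a message and exits the interpreter, which can be inconvenient when the function is called from a larger program or a notebook. As one possible alternative, and not what the function above does, the following sketch raises a ValueError instead and leaves it to the caller to decide what happens next. The helper name parse_date_strict and the constant DATE_PATTERN are hypothetical.

```python
import datetime
import re

# Compile once; fullmatch() below requires the whole string to fit the pattern
DATE_PATTERN = re.compile(r'\d{4}-\d{2}-\d{2}')

def parse_date_strict(input_str: str) -> datetime.date:
    """Hypothetical helper: parse 'YYYY-MM-DD' or raise ValueError."""
    if not DATE_PATTERN.fullmatch(input_str):
        raise ValueError(f"Invalid date format: {input_str!r}")
    # strptime() raises ValueError itself for impossible dates such as month 13
    return datetime.datetime.strptime(input_str, '%Y-%m-%d').date()

print(parse_date_strict('2021-07-14'))
```

A caller can then wrap the call in try/except ValueError and skip or log bad rows rather than stopping the whole run.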


A few points to note:

  • The strftime() directive %w numbers days of the week starting on Sunday (0) and ending on Saturday (6); for me, and my processing, weeks start on Monday and end on Sunday, and I don't need a day 0 (as opposed to starting the week on day 1), so this needed to be changed
  • A weekday/weekend feature was easy to create
  • Holiday-related features were easy to engineer using the holidays library, and performing simple date addition and subtraction; again, substituting other national or sub-national holidays (or adding to the existing) would be easy to do
  • A days_from_today feature was created with another line or two of simple date math; negative numbers are the number of days a given date was before today, while positive numbers are days from today until the given date
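
For reference, the standard library also offers date.isoweekday(), which already numbers Monday as 1 and Sunday as 7. The sketch below is an alternative to the approach in the function above, not a description of it. The helper names weekday_features and days_from are hypothetical, and the reference date is passed in explicitly so that the result does not depend on when the code is run.

```python
import datetime

def weekday_features(d: datetime.date) -> dict:
    """Hypothetical helper: ISO day-of-week number plus weekday/weekend flags."""
    dow = d.isoweekday()  # Monday is 1, Sunday is 7
    return {'dow_num': dow, 'is_weekday': dow <= 5, 'is_weekend': dow > 5}

def days_from(d: datetime.date, reference: datetime.date) -> int:
    """Hypothetical helper: negative if d is before reference, positive if after."""
    return (d - reference).days

sunday = datetime.date(2024, 11, 3)
print(weekday_features(sunday))
print(days_from(datetime.date(2021, 7, 20), datetime.date(2021, 7, 14)))
```

Note that dow_num is an integer here, whereas the function above stores it as a string.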

I don't personally need, for example, an is_end_of_month feature, but you should be able to see how this could be added to the above code with relative ease at this point. Give some customization a try for yourself.
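
As a starting point for that kind of customization, here is one way an end-of-month check might look. This is a sketch using the standard library's calendar module; the function name is_end_of_month is taken from the feature name above, but the implementation is only an assumption about how you might want it to behave.

```python
import calendar
import datetime

def is_end_of_month(d: datetime.date) -> bool:
    """Hypothetical helper: True if d is the last day of its month."""
    # monthrange() returns (weekday of the first day, number of days in month)
    last_day = calendar.monthrange(d.year, d.month)[1]
    return d.day == last_day

print(is_end_of_month(datetime.date(2020, 2, 29)))  # leap-year February
print(is_end_of_month(datetime.date(2021, 7, 20)))
```

Inside process_date() the result could be stored with a line such as date_feats['is_end_of_month'] = is_end_of_month(my_date).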

Now let's test it out. We will process one date and print out what is returned, the full dictionary of key-value feature pairs.

import pprint
my_date = process_date('2021-07-20')
pprint.pprint(my_date)


{'date': '2021-07-20',
 'days_from_today': 6,
 'dom': '20',
 'dow_num': '2',
 'dow_text_l': 'Tuesday',
 'dow_text_s': 'Tue',
 'doy': '201',
 'is_day_after_holiday': False,
 'is_day_before_holiday': False,
 'is_holiday': False,
 'is_weekday': True,
 'is_weekend': False,
 'month_num': '07',
 'month_text_l': 'July',
 'month_text_s': 'Jul',
 'woy': '29',
 'year': '2021',
 'year_s': '21'}


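Here you can see the full list of feature keys and corresponding values. Note that days_from_today depends on the day the code is run; the value above comes from a run on 2021-07-14. In a normal situation I won't need to print out the entire dictionary, but will instead get the values of a particular key or set of keys.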
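
Pulling out a particular key, or a set of keys, is plain dictionary access. The sketch below uses a hand-copied subset of the dictionary printed above so that it runs on its own; the helper name select_features is hypothetical.

```python
# A few of the key/value pairs returned for '2021-07-20' in the output above
date_feats = {'date': '2021-07-20', 'dom': '20', 'dow_text_s': 'Tue',
              'is_holiday': False, 'is_weekend': False, 'year': '2021'}

def select_features(feats: dict, keys: list) -> dict:
    """Hypothetical helper: keep only the requested keys, in the order given."""
    return {k: feats[k] for k in keys}

# A single value
print(date_feats['dow_text_s'])

# A subset of features
print(select_features(date_feats, ['year', 'dom', 'is_weekend']))
```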

We can demonstrate how this might work in practice with the code below. We will create a list of dates and process them one by one, ultimately creating a pandas data frame from a selection of the processed date features and printing it to the screen.

import pandas as pd

dates = ['2021-01-01', '2020-04-04', '1993-05-11', '2002-07-19', '2024-11-03', '2050-12-25']

# Feature keys to keep, in column order
keys = ['date', 'year', 'month_num', 'month_text_s', 'dom', 'doy', 'woy',
        'is_weekend', 'is_holiday', 'days_from_today']

# DataFrame.append() was removed in pandas 2.0, so collect the rows first
# and build the data frame in one step
rows = []
for d in dates:
    my_date = process_date(d)
    rows.append({k: my_date[k] for k in keys})

df = pd.DataFrame(rows, columns=keys)

df.rename(columns={'month_text_s': 'month',
                   'dom': 'day_of_month',
                   'doy': 'day_of_year',
                   'woy': 'week_of_year'}, inplace=True)

df.set_index('date', inplace=True)
print(df)


            year month_num month day_of_month day_of_year week_of_year  is_weekend  is_holiday  days_from_today
date
2021-01-01  2021        01   Jan           01         001           00       False        True             -194
2020-04-04  2020        04   Apr           04         095           13        True       False             -466
1993-05-11  1993        05   May           11         131           19       False       False           -10291
2002-07-19  2002        07   Jul           19         200           28       False       False            -6935
2024-11-03  2024        11   Nov           03         308           44        True       False             1208
2050-12-25  2050        12   Dec           25         359           51        True        True            10756


And this data frame hopefully gives you a better idea of how this functionality could be useful in practice.
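
If the dates already live in a pandas column, some of the simpler features can also be computed for the whole column at once with the .dt accessor. This is a different technique from the process_date() function above, shown only as a sketch for comparison; it covers a handful of features and leaves out the holiday logic entirely.

```python
import pandas as pd

dates = ['2021-01-01', '2020-04-04', '1993-05-11']
df = pd.DataFrame({'date': pd.to_datetime(dates, format='%Y-%m-%d')})

# The .dt accessor exposes date components for the whole column at once
df['year'] = df['date'].dt.year
df['month_num'] = df['date'].dt.month
df['day_of_year'] = df['date'].dt.dayofyear
# dayofweek numbers Monday as 0 and Sunday as 6, so 5 and 6 are the weekend
df['is_weekend'] = df['date'].dt.dayofweek >= 5

print(df)
```

The values here are integers and booleans rather than the zero-padded strings that strftime() produces.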

Good luck, and happy data processing.

 