Building a Tractable Feature Engineering Pipeline for Multivariate Time Series
A time series feature engineering pipeline chains different transformations, such as imputation and window aggregation, across a sequence of stages. This article demonstrates how to build a pipeline that derives multivariate time series features which can then be easily tracked and validated.
Image by Christophe Dion on Unsplash
Accurate time series forecasting is crucial for business problems such as predicting the evolution of material properties in manufacturing and sales forecasting. Modern forecasting techniques include the use of machine learning algorithms like XGBoost to build regression models on tabular data to predict the future. Tabular data allows regression models to forecast by leveraging non-target factors, such as the daily crowd size of a retail store, which correlates with actual sales.
Feature engineering is essential for enriching the influential factors that enhance forecasting accuracy. However, building a time series feature engineering pipeline is not trivial because it involves different transformations at various stages, such as the aggregation of window values. Rolling-window features such as "the average of two weeks of records" are basic yet useful for forecasting, and a pipeline that supports this type of feature generation is valuable. Furthermore, the order of the transformed features can be shuffled by the transformations, which makes feature tracking challenging. It is crucial to track and validate the transformed features of a pipeline before the final model training. This GitHub repository provides an example of designing a time series pipeline that serves the mentioned needs, and this article explains some of the key steps to achieve this.
Some scikit-learn transformers provide the get_feature_names_out() method to report their output feature names after transformation. For example, the OneHotEncoder transformer has this built-in method, but the SimpleImputer transformer does not. A scikit-learn pipeline can only report output names conveniently if every underlying transformer supports get_feature_names_out(). Our solution is to design a FeatureNamesMixin mixin class, together with a Factory Method design pattern class, TransformWithFeatureNamesFactory, to equip a transformer with the get_feature_names_out() method.
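A minimal sketch of this idea is shown below. The class and parameter names follow the article's terminology, but the exact structure is my own assumption; see the repository for the actual implementation.

```python
import numpy as np
from sklearn.impute import SimpleImputer


class FeatureNamesMixin:
    """Mixin that reports user-supplied output names via get_feature_names_out()."""

    def get_feature_names_out(self, input_features=None):
        return np.asarray(self.names, dtype=object)


class TransformWithFeatureNamesFactory:
    """Factory Method: wrap any transformer class so it accepts a `names`
    parameter and exposes get_feature_names_out()."""

    def get_transformer(self, base_cls):
        class NamedTransformer(FeatureNamesMixin, base_cls):
            def __init__(self, names=(), **kwargs):
                self.names = names
                super().__init__(**kwargs)

        NamedTransformer.__name__ = f"Named{base_cls.__name__}"
        return NamedTransformer


# Example: a SimpleImputer that knows its output names.
NamedSimpleImputer = TransformWithFeatureNamesFactory().get_transformer(SimpleImputer)
imputer = NamedSimpleImputer(
    names=["Pre-MedianImputer__Global_intensity"], strategy="median"
)
filled = imputer.fit_transform(np.array([[1.0], [np.nan], [3.0]]))
```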
We can then create a customized SimpleImputer and use it as part of a scikit-learn Pipeline's steps or a ColumnTransformer, as shown below.
This only requires initializing the transformer class with the desired output names via the names parameter.
Any other transformer can be customized similarly to support the retrieval of output names, provided we introduce get_feature_names_out(). One example is the TsfreshRollingMixin class, which leverages the roll_time_series() utility function from the TSFresh library to extract rolling windows of a time series. To track the derived rolling-window features, we simply assign them to the derived_names attribute. The TsfreshRollingMixin class also has helper functions such as prepare_df() and get_combined() to facilitate casting NumPy arrays into pandas.DataFrame objects and combining the derived features with the input features. Interested readers are welcome to study the repository for the implementation details. Together with extract_features() from TSFresh, we can design a TSFreshRollingTransformer class that lets us specify the relevant TSFresh parameters to derive the desired time series features, as shown below.
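The repository builds on TSFresh's roll_time_series() and extract_features(); as a dependency-free illustration of what one such rolling-window feature looks like, here is a pandas-only sketch. The function name and the "(window N)" naming convention mirror the article's description, but the code itself is illustrative, not the repository's implementation.

```python
import pandas as pd


def rolling_maximum(df, column, window):
    """Derive a rolling-window maximum and name it the way the article's
    TSFreshRollingTransformer marks features: '<column>__maximum(window N)'.
    Pandas-only stand-in for roll_time_series() + extract_features()."""
    derived_name = f"{column}__maximum(window {window})"
    out = df[[column]].copy()
    out[derived_name] = df[column].rolling(window, min_periods=1).max()
    # The derived names would be tracked via the mixin's derived_names attribute.
    return out, [derived_name]


df = pd.DataFrame({"Global_intensity": [1.0, 3.0, 2.0, 5.0]})
features, derived_names = rolling_maximum(df, "Global_intensity", window=2)
```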
We mark the derived features of TSFreshRollingTransformer with names like (window #) to indicate the aggregation window size.
In addition to the TSFresh-derived time series features, the RollingLagsTransformer class shows an example of how to extract the lag value from a given rolling window using the TsfreshRollingMixin class and track the output features with (lag #) markers indicating the lag order. Any other transformation performed on a rolling window can follow this example to design transformers that are compatible with tractable pipelines.
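As with the rolling aggregation above, a lag feature and its "(lag N)" marker can be sketched in a few lines of pandas. The function name and naming scheme are illustrative assumptions mirroring the article's description, not the repository's code.

```python
import pandas as pd


def rolling_lag(df, column, lag):
    """Extract the lag-`lag` value of a series and tag the feature name
    with '(lag N)', mirroring the article's lag-order marker."""
    derived_name = f"{column}(lag {lag})"
    return df[column].shift(lag).rename(derived_name)


df = pd.DataFrame({"Global_intensity": [1.0, 2.0, 3.0]})
lagged = rolling_lag(df, "Global_intensity", lag=1)
```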
This script demonstrates the building of time series pipelines using the above transformers to derive features at different stages. For instance, a feature extracted from the first pipeline is named Pre-MedianImputer__Global_intensity, indicating that the raw feature "Global_intensity" was imputed by the SimpleImputer using the median strategy. Subsequently, a rolling-based feature named TSFreshRolling__Pre-MedianImputer--Global_intensity__maximum(window 30) refers to the maximum of the imputed "Global_intensity" within a window of 30 steps. Finally, the rolling-window features are imputed and then standard-scaled, and the name ImputeAndScaler__TSFreshRolling__Pre-MedianImputer--Global_intensity__median(window 30) indicates that the raw feature "Global_intensity" has gone through all the previously mentioned transformations. Based on these markers, we can conveniently track the transformed names provided by the pipelines and validate the transformed outputs.
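Because the stage markers are joined with the double-underscore separator (as ColumnTransformer does when prefixing step names), tracking a feature back to its raw column reduces to simple string handling. A sketch, assuming "__" joins pipeline stages:

```python
def trace_stages(feature_name):
    """Split a transformed feature name into its pipeline stages,
    assuming stages are joined with '__' as in ColumnTransformer prefixes."""
    return feature_name.split("__")


stages = trace_stages(
    "ImputeAndScaler__TSFreshRolling__"
    "Pre-MedianImputer--Global_intensity__median(window 30)"
)
# First stage applied last in naming; last element is the window aggregation.
```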
We complement this demonstration with an example of building a linear regression model using the Darts package for forecasting. Once the pipelines are ready, it is straightforward to load the input dataset and derive the features accordingly. The following chart shows the forecasting result of a linear regression model trained on the features derived from the pipelines. We hope this article provides readers with a basic template for conveniently building a time series pipeline for their forecasting use cases.
Forecasting with a linear regression model trained on the engineered features from the designed pipeline. Image by the author.
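The repository uses Darts for this final step; since it amounts to fitting a linear regressor on the derived feature table, a minimal scikit-learn stand-in on synthetic data (illustrative only, not the repository's Darts code) looks like this:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the pipeline's derived feature table and target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))  # e.g. imputed, rolling-window, and lag features
true_coefs = np.array([0.5, -0.2, 0.1])
y = X @ true_coefs + rng.normal(scale=0.01, size=100)

# Fit the regressor on the feature table and produce forecasts.
model = LinearRegression().fit(X, y)
preds = model.predict(X)
```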
Jing Qiang Goh is a time series enthusiast.