AutoML for Temporal Relational Data: A New Frontier
While AutoML started out as an automation approach to develop optimal machine learning pipelines, extensions of AutoML to Data Science embedded products can now enable the processing of much more, including temporal relational data.
By Flytxt, Marketing Automation and Marketing AI solution provider for b2c enterprises.
Building and maintaining real-world machine learning systems requires data scientists and domain specialists, who are in limited supply. Automated Machine Learning (AutoML) has emerged as a promising research area because it can progressively automate key steps in constructing and maintaining machine learning workflows. It frees human experts to focus on the complex, non-repetitive, and creative aspects of the learning problem. Recent AutoML advances include sophisticated feature synthesis through the automatic discovery of meaningful inter-table relationships in temporal relational databases (e.g., Deep Feature Synthesis), automatic model adaptation to handle concept drift (e.g., AutoGBT), and automatic design of Deep Learning models (e.g., Neural Architecture Search), as depicted in Fig 1. These advances significantly boost the practical utility of AutoML systems by improving the productivity of data scientists and by enabling non-experts to solve real-world data science problems from diverse domains.
Fig 1. AutoML Evolution.
AutoML for temporal relational databases:
Datasets for many machine learning applications, such as online advertising, recommender systems, and automated customer engagement, span multiple related tables, with timestamps indicating when events occurred. In conventional approaches, domain experts combine the tables manually, deriving meaningful features through cumbersome trial and error. AutoML for temporal relational data automates this feature synthesis: it discovers important relationships across tables and temporally joins them on the relevant key fields.
Enabling real-world AutoML use cases on temporal relational data means automatically generating useful temporal information and efficiently merging features across multiple subtables, without domain information and without causing a data leak. In addition, the best learning model and set of hyperparameters must be chosen automatically under resource constraints, so that the solution is generic enough and adheres to the time and memory budget.
Interestingly, this year’s KDD Cup had an AutoML challenge based on this theme, inviting AI/ML researchers and practitioners across the world to advance the state of the art in AutoML for temporal relational databases.
Our workflow consists of preprocessing, automatic feature synthesis across relational tables, model learning, and prediction. Preprocessing involves feature transformation for skew correction, augmentation with squared and cubed features, and frequency encoding of categorical features. Features are synthesized automatically using a temporal join of aggregated metrics from subtables. Instances from the majority class are downsampled to maintain a ratio of 1:3. The CatBoost implementation of Gradient Boosting Decision Trees (GBDT) is used as the learning algorithm, and cross-validation is used for parameter tuning to decide the optimal number of trees. Fig 2 depicts our workflow at a high level.
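The downsampling step can be illustrated with a minimal sketch. It assumes the 1:3 ratio means one minority instance for every three majority instances, which the text does not state explicitly; the function name and its parameters are hypothetical and not taken from our repository.

```python
import random

def downsample_majority(rows, labels, ratio=3, seed=0):
    """Keep every minority row and at most ratio * n_minority majority rows.

    Assumption: binary labels (0/1) and a minority:majority target of 1:ratio.
    """
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    # Whichever class has fewer instances is treated as the minority.
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep_n = min(len(majority), ratio * len(minority))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = sorted(minority + rng.sample(majority, keep_n))
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy data: 10 positives and 90 negatives.
toy_labels = [1] * 10 + [0] * 90
toy_rows = list(range(100))
sampled_rows, sampled_labels = downsample_majority(toy_rows, toy_labels)
```

Sampling indices rather than rows keeps the original row order, which matters when the rows are sorted by timestamp.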
Fig 2. Our Model Pipeline.
Temporal Data Aggregation:
Because temporal relational data spans multiple tables, feature extraction depends on finding the important associations among tables and then aggregating the data well. To extract the right feature representation, aggregation operations such as mean and sum are used for numerical features, whereas count and mode are used for categorical features. Each aggregation metric must be computed over an appropriate temporal window, which is discovered using cross-validation.
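As a rough illustration of windowed aggregation, the sketch below summarizes one entity's events over a fixed look-back window and excludes anything at or after the prediction timestamp, which is one simple way to avoid a data leak. The event layout, the field names, and the function `aggregate_window` are assumptions made for this example; in practice the window length would be chosen by cross-validation rather than fixed by hand.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import mean

def aggregate_window(events, key, cutoff, window, num_field, cat_field):
    """Aggregate one entity's events that fall in [cutoff - window, cutoff)."""
    start = cutoff - window
    # Strictly-before-cutoff filter: future events never reach the features.
    recent = [e for e in events
              if e["key"] == key and start <= e["ts"] < cutoff]
    nums = [e[num_field] for e in recent]
    cats = [e[cat_field] for e in recent]
    return {
        "count": len(recent),
        "sum": sum(nums),
        "mean": mean(nums) if nums else None,
        "mode": Counter(cats).most_common(1)[0][0] if cats else None,
    }

# Toy event table (made-up values for illustration only).
events = [
    {"key": "u1", "ts": datetime(2019, 1, 1, 10), "amount": 10.0, "channel": "web"},
    {"key": "u1", "ts": datetime(2019, 1, 3, 9), "amount": 30.0, "channel": "app"},
    {"key": "u1", "ts": datetime(2019, 1, 4, 12), "amount": 20.0, "channel": "app"},
    {"key": "u1", "ts": datetime(2019, 1, 6, 8), "amount": 99.0, "channel": "web"},
    {"key": "u2", "ts": datetime(2019, 1, 4, 0), "amount": 5.0, "channel": "web"},
]

features = aggregate_window(
    events, key="u1", cutoff=datetime(2019, 1, 5),
    window=timedelta(days=3), num_field="amount", cat_field="channel",
)
# features == {"count": 2, "sum": 50.0, "mean": 25.0, "mode": "app"}
```

The event on 6 January is ignored because it lies after the cutoff, and the one on 1 January is ignored because it falls outside the three-day window.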
Joining multiple database tables yields highly skewed features. Our feature preprocessing step therefore involves skewness correction along with feature transformation and augmentation. To enrich the feature space, feature augmentation adds squared and cubed variants of numerical features, as well as sine- and cosine-transformed variants of cyclic DateTime features (e.g., month, hour, and minute). Categorical features are frequency encoded to augment the feature space further.
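These transformations are simple to express in code. The sketch below is illustrative only: the text does not name the skew-correction transform, so a signed log1p is assumed here, and all function names are hypothetical.

```python
import math
from collections import Counter

def augment_numeric(x):
    """Return skew-corrected, squared, and cubed variants of one value.

    Assumption: signed log1p stands in for the unspecified skew correction.
    """
    log = math.copysign(math.log1p(abs(x)), x)
    return {"log1p": log, "sq": x * x, "cube": x ** 3}

def cyclic_encode(value, period):
    """Map a cyclic value (e.g. hour with period 24) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def frequency_encode(values):
    """Replace each category by its relative frequency in the column."""
    counts = Counter(values)
    n = len(values)
    return [counts[v] / n for v in values]

numeric = augment_numeric(2.0)
midnight = cyclic_encode(0, 24)
late_evening = cyclic_encode(23, 24)
noon = cyclic_encode(12, 24)
encoded = frequency_encode(["web", "app", "web", "web"])
# encoded == [0.75, 0.25, 0.75, 0.75]
```

The sine and cosine pair places hour 23 next to hour 0, whereas the raw integer encoding would put them at opposite ends of the range.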
Experimenting with several linear and non-linear models can be expensive in terms of computation and memory. We limit our model portfolio to the CatBoost implementation of gradient boosting decision trees because of its scalability and its robustness in handling categorical features. Hyperparameters such as the number of trees are tuned using cross-validation to avoid overfitting.
This solution extends our existing AutoML research portfolio to use cases that involve learning from temporal relational databases, and it can be accessed from our GitHub repository.
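The selection step behind cross-validated tuning of the tree count can be sketched independently of any library. The example assumes the validation loss of each fold has been recorded after every boosting round; in practice those curves would come from the GBDT library's cross-validation run. The function name and the loss values are made up for illustration.

```python
from statistics import mean

def best_tree_count(fold_losses):
    """Pick the number of trees that minimizes mean validation loss.

    fold_losses[f][t] is the validation loss of fold f after t + 1 trees.
    """
    n_rounds = min(len(curve) for curve in fold_losses)
    mean_curve = [mean(curve[t] for curve in fold_losses)
                  for t in range(n_rounds)]
    best_round = min(range(n_rounds), key=mean_curve.__getitem__)
    return best_round + 1, mean_curve[best_round]

# Hypothetical loss curves for three folds over five boosting rounds.
toy_curves = [
    [0.70, 0.60, 0.55, 0.57, 0.60],
    [0.72, 0.62, 0.58, 0.56, 0.61],
    [0.69, 0.61, 0.54, 0.58, 0.62],
]
n_trees, cv_loss = best_tree_count(toy_curves)
```

Averaging across folds before taking the minimum keeps a single lucky fold from dictating the tree count, and stopping where the mean loss bottoms out is what guards against overfitting.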
Industries are increasingly focused on deriving value from AI quickly and on shortening the cycle from prototyping to production deployment of machine learning models. AutoML has emerged as a key enabler, both by lowering the entry barrier to AI and by progressively automating AI workflows. The AutoML community is increasingly focusing on real-world use cases that involve learning from structured and unstructured data, temporal relational databases, and data streams affected by concept drift. Though AutoML originally focused on the automatic construction of optimal machine learning pipelines, its scope is widening to cover the automatic maintenance of such pipelines over time, which increases model autonomy. AutoML advances, along with the availability of powerful computational infrastructure, should advance the fusion of human and machine intelligence, freeing human experts to focus more on the complex, non-repetitive, and creative aspects of the learning problem and so arrive at better solutions.
Bio: Flytxt is a Dutch company with more than 10 years of experience providing intelligent marketing technology for b2c enterprises, mostly telcos. Flytxt's products harness analytics, artificial intelligence, and automation to maximize the value of every interaction across customers’ digital journeys.