The Data Science Machine, or ‘How To Engineer Feature Engineering’

MIT researchers have developed what they refer to as the Data Science Machine, which combines automated feature engineering with an end-to-end data science pipeline in a system that has beaten nearly 70% of human teams in data science competitions. Is this a game-changer?

The Analytical Engine

Recent research by MIT Master's student Max Kanter has led to the implementation of what he refers to as the 'Data Science Machine.' A paper on the Data Science Machine (DSM) and its underlying innovation, the Deep Feature Synthesis algorithm, written by Kanter and Kalyan Veeramachaneni, his thesis supervisor at CSAIL, is set to be presented at the IEEE International Conference on Data Science and Advanced Analytics next week. Their paper, 'Deep Feature Synthesis: Towards Automating Data Science Endeavors,' is available online now.

The DSM is concisely described by Kanter & Veeramachaneni as "an automated system for generating predictive models from raw data," combining the authors' novel feature engineering approach with an end-to-end data science pipeline. The DSM has, thus far, managed to beat 68.9% of the teams in the data science competitions it has been entered into. Perhaps most noteworthy, the submissions attaining this success rate are generally completed in under 12 hours, as opposed to the months for which human teams can labor.

The DSM is premised on the observation that data science competition problems generally have the following properties in common: they are structured and relational, they model human interaction with a complex system, and there is an attempt to predict some aspect of human behavior.

Deep Feature Synthesis

As with any data science problem, features must first be identified from existing variables, or created by leveraging existing variables. While conceding that feature engineering has seen significant recent advances for non-relational data such as text and images, Kanter & Veeramachaneni note that it is still the task in the data science pipeline that relies most heavily on human intervention, and that it can be difficult and time-consuming even for seasoned data scientists. It is also the task that must most closely replicate the effectiveness of a human being if it is to be truly automated.

Deep Feature Synthesis (DFS), the DSM's feature engineering algorithm, is strictly for relational datasets, and is used to automate the identification and generation of insight-eliciting features. DFS takes relational tables as input, and is able to process the various types of data held within such a structure. To be successful, DFS aims to think like a data scientist, turning insightful questions into input features. The algorithm walks the relationships between tables, applying feature functions as it goes and building up a final feature step by step. As it performs this walk, DFS stacks the calculations of the mathematical functions to a particular depth, which is where the "deep" in its name comes from.

Depending on the input data types, a number of mathematical functions are applied at 2 distinct levels in the DSM: entity and relational. Entity-level features focus on conversion and translation functions, such as changing data representations, rounding numbers, and breaking existing generalized attributes out into more numerous, specific attributes. Relational-level features are concerned with the relationships between entities in tables (think of your primary and foreign keys). These feature functions are then able to extract related data from other tables to associate with a given feature (for example, finding the max item price or item count associated with an order), data which could potentially be exploited as a useful feature to feed into a model.
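To make the distinction concrete, below is a minimal sketch of DFS-style feature construction over a toy customers/orders/items schema in pandas. All table and column names here are hypothetical, and this hand-rolled version only illustrates the entity-level and relational-level functions described above; it is not the authors' implementation.

import pandas as pd

# Toy relational schema (hypothetical names): customers -> orders -> items.
customers = pd.DataFrame({"customer_id": [1, 2]})
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "order_time": pd.to_datetime(
        ["2015-01-03 09:30", "2015-02-14 18:05", "2015-03-01 12:00"]),
})
items = pd.DataFrame({
    "order_id": [10, 10, 11, 12, 12],
    "price": [9.99, 24.50, 5.00, 13.25, 2.75],
})

# Entity-level feature: a conversion/translation function applied within
# a single table, e.g. extracting the weekday from a timestamp.
orders["order_weekday"] = orders["order_time"].dt.weekday

# Relational-level features (depth 1): aggregate a child table along a
# foreign-key relationship, e.g. max item price and item count per order.
order_feats = items.groupby("order_id")["price"].agg(
    max_item_price="max", item_count="count").reset_index()
orders = orders.merge(order_feats, on="order_id")

# Depth 2: stack another aggregation along the next relationship, e.g.
# each customer's mean per-order max item price -- this stacking of
# functions is what puts the "deep" in Deep Feature Synthesis.
customer_feats = orders.groupby("customer_id").agg(
    mean_order_max_price=("max_item_price", "mean"),
    total_items=("item_count", "sum")).reset_index()
customers = customers.merge(customer_feats, on="customer_id")
print(customers)

Each merged column is a candidate feature; DFS enumerates many such function stacks automatically rather than relying on a human to dream them up one at a time.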
Machine Learning Pathway

To start off the DSM's machine learning pathway, one of the input features is chosen to model; this is referred to as the target value, and it is used to form the prediction problem. Appropriate features, known as predictors, are selected via metadata to help in the prediction process. The DSM then creates a pathway for data preprocessing, feature selection, dimensionality reduction, modeling, and evaluation, all of which is parameterized and available for re-use if necessary (a minimal sketch of such a pipeline follows the Discussion section below). Parameter optimization is accomplished using a Copula Process, and an attempt is made to reduce the number of features by observing correlation. The reduced set of features is then tested on sample data, recombining features in different ways to optimize the accuracy of the predictions they yield. Through this autotuning, which the authors argue is absolutely critical to its performance, the DSM was able to increase its score in all three of its competitions.

Discussion

What this all seems to suggest, essentially, is this: the DSM uses intelligent walking of relational database relationships to build and establish candidate features, narrows this feature set down by looking for correlated values, and uses combinatorics in what amounts to brute-force feature engineering, applying iterative feature subsets to sample data and recombining them until the best possible solution is found.

To measure the DSM's performance, it was entered in competitions at KDD Cup 2014, IJCAI, and KDD Cup 2015, where, as mentioned, it outperformed more than 2/3 of the human competitors. Kanter & Veeramachaneni claim that even in its worst performance (IJCAI), the DSM still managed to frame the prediction problem in similar terms to human competitors, evidenced by the fact that it pursued similar avenues of data modeling. In this same competition, it finished within approximately 0.04 AUC of the contest winner, suggesting that the DSM captured what could be considered the major aspects of the competition dataset.

Kanter & Veeramachaneni argue that, while it cannot currently compete with the highest-performing human data scientists, the DSM nevertheless has a role alongside them. Even though a number of humans beat the DSM in each of its competitions, it was able to outperform the majority of them with considerably less effort (less than 12 hours versus months, in some cases). They suggest that, in light of this, it can be used for setting benchmarks as well as for fostering creativity: by front-loading feature engineering and generating sets of potentially top-performing features, the DSM could allow humans to move on to rethinking the problem within hours, effectively starting from the DSM's solution and working forward from that point.

It should be noted that, while the DSM is impressive, it is hardly the first system aiming to automate machine learning. Other examples include the many systems that automatically build models to bid on advertising, or KXEN Model Factory (now part of SAP), which offered automated model building as early as 2010. It is also clear that the DSM is not useful for all types of data; it is a system focused solely on the exploitation of relational datasets, and it has yet to be shown that it can be effective on relational datasets that do not conform to the previously-identified data science competition problem pattern.
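As promised above, here is a minimal sketch of a parameterized machine learning pathway: preprocessing, feature selection, dimensionality reduction, and modeling, tuned end to end. The assumptions are worth flagging: it uses scikit-learn on synthetic data, univariate selection stands in for the DSM's correlation-based feature reduction, and plain random search stands in for its Copula Process optimizer.

from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a table of DFS-generated candidate features.
X, y = make_classification(n_samples=500, n_features=40, random_state=0)

# Preprocessing -> feature selection -> dimensionality reduction -> model,
# with every stage named and parameterized so the pathway can be re-used.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif)),   # stands in for correlation-based reduction
    ("reduce", TruncatedSVD()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Random search over the whole pathway; the DSM instead tunes these
# knobs with a Copula Process, but the structure is the same.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "select__k": randint(10, 40),
        "reduce__n_components": randint(2, 10),
        "model__n_estimators": randint(50, 300),
    },
    n_iter=20, scoring="roc_auc", random_state=0)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))

The point of the sketch is structural: because every stage is a named, parameterized step, the entire pathway can be re-tuned and re-run automatically, which is the property the DSM's autotuning depends on.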
The DSM has already been spun off into a startup called FeatureLab, touted as "Insights with an Interface," with Kanter as its CEO. The website states "Do more with your data, without more data scientists," and claims that it is the "best solution for companies looking to increase their data science resources." These are both bold claims, especially in light of the fact that none of the individual pieces of the DSM can really be considered a breakthrough. It is entirely possible that FeatureLab gets lost in a cloud of "business intelligence" service platforms. But Big Data is not going anywhere, and feature engineering has been one of the hottest topics in machine learning over the past 12 months. It just may be that the DSM's particular combination of technologies, arriving at what may end up being the right time, leads to a new way of thinking about data science. Margo Seltzer, a Harvard computer science professor, has stated in reference to the DSM, "I think what they've done is going to become the standard quickly - very quickly." If this is the case, FeatureLab stands to be well-positioned.

You can read more about Kanter & Veeramachaneni's Data Science Machine here.

Bio: Matthew Mayo is a computer science graduate student currently working on his thesis, which involves parallelizing machine learning algorithms. He is also a student of data mining, a data enthusiast, and an aspiring machine learning scientist.