Can we trust AutoML to go on full autopilot?

We put an AutoML tool to the test on a real-world problem, and the results are surprising. Even with automatic machine learning, you still need expert data scientists.

By Norman Niemer, Chief Data Scientist, UBS, David (Yingxiang) Chen*, Saad Naqvi*, Zeana Kaynat*, Yongcheng (Riley) Zhu*, *Columbia University

The Promise of AutoML

Data scientists are in short supply, and Automatic Machine Learning (AutoML) promises to alleviate this problem. H2O Driverless AI employs the techniques of expert data scientists in an easy to use application that helps scale your data science efforts. It lets “everyone develop trusted machine learning models.”

The Experiment

A Chief Data Scientist and four Columbia University students with varying levels of experience put the technology to the test on a real-word problem predicting stock price movements (a challenging task!).

The data was 5 years of stocks in the Russell 1000 index on a monthly frequency. The features were 80 stock-specific factors like price momentum, P/B, EPS growth, ROE. The target was 12-month forward stock returns (beta-adjusted alpha to be specific). The goal was to predict the stock returns using the stock features. The success metrics were mean absolute error and rank correlation between predicted and actual returns.

The special feature of the data is that it is panel data in which there are high cross-sectional correlation observations on one day. Furthermore, there is a high autocorrelation between observations across time because data is provided monthly, but returns are 12 months out, so 11/12 months are the same between 2 subsequent observations.

Naive Baseline

Before evaluating a model, you need to have a naive baseline as discussed in Top 10 Statistics Mistakes Made by Data Scientists. The first baseline is predicting a 0 return, which is close to the median, meaning there is no way of predicting returns. The second baseline is an average of all the feature ranks meaning all features are equally important in forecasting returns. The last baseline is the return from a month ago - given the nature of the dataset, this includes forward-looking information and is a good baseline for assessing any overtraining and look-ahead bias.

Success Metrics for Naive Baseline
model MAE Correl
0 return forecast 0.145 NaN
factor equal-weight NaN 0.009
return 1 month ago 0.222 0.238


First hurdle: Technical Setup

Before we got to modeling, the first hurdle was to get everyone set up with DAI. H2O recommends installing Driverless AI on modern data center hardware with GPUs, CUDA support, and 64GB RAM. We followed the instructions and installed the software on Google Cloud Platform with the recommended settings.

The result: we burned through $200 of GCP credits in less than a week! We soon realized that such high computation power is not required as our dataset was just 1GB in size. We installed Driverless AI on a GCP machine with much lesser specifications, and it also worked on a local laptop machine with just 8GB RAM. Overall, the installation of Driverless AI required some amount of technical expertise that not everyone on the team had.

Second hurdle: Data Preprocessing

The next hurdle was to get the data ready for analysis, and AutoML does not help that. We had to preprocess and combine data from five different data sources, which was a substantial effort. We used data workflow library d6tflow and data sharing library d6tpipe from Top 10 Coding Mistakes Made by Data Scientists to help us with the pre-modeling steps.

The result: The preprocessing d6tflow DAG is shown below. It takes stock return data from Bloomberg and stock factors data from WRDS, combines, cleans and normalized both data sources to make it suitable for machine learning in DAI which requires clean data all in one place.

└─--[TaskFactorComposite-{'idx': 'RIY Index', 'dt_start': '2011-01-01', 'dt_end': '2018-09-01'} (PENDING)]
   └─--[TaskFactorsIdx-{'idx': 'RIY Index', 'dt_start': '2011-01-01', 'dt_end': '2018-09-01'} (PENDING)]
      └─--[TaskFwdRtn-{'idx': 'RIY Index', 'dt_start': '2011-01-01', 'dt_end': '2018-09-01'} (PENDING)]
         └─--[TaskBbgHistory-{'idx': 'RIY Index', 'dt_start': '2011-01-01', 'dt_end': '2018-09-01'} (PENDING)]
            └─--[TaskBbgMembers-{'idx': 'RAY Index'} (PENDING)]


Running fully automatic - too good to be true?

At the beginning of our experiment, we worked directly with the driverless AI interface, uploaded datasets, selected target variable, adjusted the knobs for performance, interpretability, etc. We must say, the UI is neat, and it kind of resembles Tony Stark’s JARVIS. While the total data science newbies struggled with choosing the right settings, the team members with at least an intermediate level of data science knowledge easily knew what to do.

Once the launch button is hit, true to its name, the training was driverless. We watched the progress as it churned out hundreds of models and created thousands of new features and finally created a neat PDF report of the experiment results.

The result: MAE of 4% and correlation 96%. WOW! The machine really is better than the human, blowing the naive model out of the water. But it even beat the naive model with look-ahead bias - suspicious!

Adding realistic test sets manually

Following the out-sample testing advice for panel data in Top 10 Statistics Mistakes Made by Data Scientists, we built our own train/validation/test sets. DAI did not seem to have any functionality to do that, so we had to do it manually. We split the data by time and also did rolling forward tests. The skill bar to do this had risen to intermediate.

The result: MAE of 6% and correlation 90%. Still way higher than the naive model with forward-looking bias - suspicious again!

What does this thing do anyway?

At this point, we were not too sure what was driving this exceptional performance. Clearly it was too good to be true. We got a report with all the technical details and visuals. But how did it build models? What went on behind the closed curtains? It is difficult to understand what it tried. What parameters, what features does it engineer? What features engineered that subsequently dropped out? Not only the newbies got lost here but the pros too.

By carefully analyzing the output, we figured out what the problem was: DAI automatically adds lagged variables, and out of fold means. Given the nature of our dataset, it being panel data with overlapping target variables, those features caused the look-ahead bias and substantially inflated results.

The graph below illustrates the particular issue in more detail. Assume there are only two stocks: A and B. We want to forecast yearly returns from input data which updates monthly. In training, it is easy to forecast A2 from A1 because of the large overlap. But when running the model live, we don’t have the same input data, as illustrated by A13 and A14. So we have to be careful with prior observations in training. Also if B is highly correlated with A, it is easy to forecast, but again that data is not available when the model runs live in production, and therefore the exceptional “test” performance would not hold up.

These issues also trip up human data scientists as illustrated in more detail with examples in Top 10 Statistics Mistakes Made by Data Scientists, see #7+8.

Going fully manual to avoid look-ahead bias

We had to dig deep into the DAI documentation and use the DAI python client to turn off those features and manually run every experiment to perform a roll-forward analysis with non-overlapping periods. With quite some effort, we built a fully automated machine learning workflow using d6tflow optimized for DAI. The skill bar to do this was now a professional.

The result: MAE of 23% and rank correlation 4%. Much more reasonable! And still beating the zero skill and equal-weight model.


First, DAI claims it does not overfit, but it did! Driverless AI showed inflated test performance that we would have never achieved running the model in production real-time. This serves as a case study where AutoML systems can give overly optimistic results out if the sample is not carefully analyzed. This could be due to the unique nature of our dataset, but if it happened to us, it could happen to others. Therefore, it is crucially important to understand the characteristics of your data and the type of predictions you make.

Second, the whole data pipeline involved a lot more than just AutoML model training. We needed to prepare data, manually control DAI via Python, and extract outputs to compare with naive models. We still had to know how to use pandas, d6tpipe, d6tflow, and how to build realistic test and training sets.

Third, despite transparency for the ultimate model, it was still unclear what features DAI generated and how it trained models, so it would be difficult for a novice to explain the methodology and output with confidence.

In summary, AutoML systems like H2O Driverless still need an educated data scientist to use, control, interpret, and explain the machine learning system. We liken it to flying airplanes: just because an airplane has an autopilot, that does not mean any odd passenger is able to safely operate the airplane and put the trust of 200 lives in their hands without having adequate training.

Original. Reposted with permission.

Bio: Norman Niemer is the Chief Data Scientist at a large asset manager where he delivers data-driven investment insights. He holds a MS Financial Engineering from Columbia University and a BS in Banking and Finance from Cass Business School (London).