Stop training more models, start deploying them

We are hardly living up to the promises of AI in healthcare. It’s not because of our training, it’s because of our deployment.

By Maurits Kaptein, Tilburg University


(photo credit: Michael Pardo)


The rumours that AI (and ML) will revolutionise healthcare have been around for a while [1]. And yes, we have seen some amazing uses of AI in healthcare [see, e.g., 2,3]. But, in my personal experience, the majority of the models trained in healthcare never make it to practice. Let’s see why (or, scroll down and see how we solve it).


Note: The statement “the majority of the models trained in … never make it to practice” is probably true across disciplines. Healthcare happens to be the one I am sure about.


Developing models in healthcare

Over the last decade, I have developed AI methods allowing computers to learn “what works for whom”. I have worked on Bandit problems [e.g., 4] with applications in online marketing [5], thus learning “which product to show to whom”. More recently I became interested in “which treatment works for which patient?”; partly using the same data science methods, but this time with a positive impact.

For example, together with Lingjie Shen, I spent the last two years collaborating with the Dutch Cancer Registry (NCR). The great folks at NCR maintain detailed records containing the progression of cancer cases in the Netherlands. The registry contains the background characteristics of patients, their treatment (e.g., chemotherapy yes or no) and the outcome (2- or 5-year mortality). It takes a lot of curation to get this data from the various hospitals, but the NCR, in the end, provides a clean and well-documented collection of cases. We worked with an excerpt from the registry containing over 50,000 patients suffering from colon cancer whose tumor was surgically removed. After removal of the tumor there is debate on whether or not to administer chemotherapy; while most randomized controlled trials (RCTs) show a positive effect, this seems to vary widely between patients and tumor types [6].

We set out to study whether we could learn which patients should receive adjuvant chemotherapy. A simple approach would be to fit a flexible supervised learning model to predict the 5-year survival rates. So we did, using Bayesian Additive Regression Trees [7]. However, clearly, this does not suffice: the NCR data describes real-life cases as they are treated in the hospital, so the treatment assignment could well be severely confounded. We might be mixing up causes and effects. To counter this, we examined various methods of controlling for this confounding. We ended up comparing our estimates to estimates from a large RCT. This allowed us to validate our “correction model”.

In the end, these steps allowed for “imputing the counterfactuals” [8]. We created a dataset containing the outcomes that we expected under treatment and under no-treatment for each and every patient. This allowed for generating personalized treatment rules [9]: we could finally say which treatment works for whom.
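To make the last step concrete, here is a minimal sketch of how a personalized treatment rule can be read off a dataset of imputed counterfactuals. It is an illustration only, not the rule we used in the project: the patients, the survival probabilities, the `ImputedPatient` class, and the `min_benefit` threshold are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class ImputedPatient:
    """One patient with both potential outcomes filled in (hypothetical data)."""
    patient_id: str
    survival_if_treated: float    # predicted 5-year survival with chemotherapy
    survival_if_untreated: float  # predicted 5-year survival without chemotherapy


def treatment_effect(patient: ImputedPatient) -> float:
    """Estimated individual benefit of treatment: difference of the two outcomes."""
    return patient.survival_if_treated - patient.survival_if_untreated


def recommend(patient: ImputedPatient, min_benefit: float = 0.0) -> str:
    """Recommend chemotherapy only if the estimated benefit exceeds min_benefit."""
    if treatment_effect(patient) > min_benefit:
        return "chemotherapy"
    return "no chemotherapy"


# Made-up imputed outcomes for three patients.
patients = [
    ImputedPatient("A", survival_if_treated=0.82, survival_if_untreated=0.74),
    ImputedPatient("B", survival_if_treated=0.61, survival_if_untreated=0.63),
    ImputedPatient("C", survival_if_treated=0.90, survival_if_untreated=0.89),
]

# Require at least a 2 percentage point gain before recommending treatment.
recommendations = {p.patient_id: recommend(p, min_benefit=0.02) for p in patients}
print(recommendations)
```

The threshold is a design choice: chemotherapy has side effects, so one might only want to recommend it when the expected gain in survival is large enough to matter.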

The work described above took over a year and required coordination between our team, the NCR, and a number of participating oncologists. Eventually, the project gave us an AI (or what-shall-we-call-it) model that, given a feature vector, determines the optimal treatment choice. That awesome result, however, brings up the logical next question: how do we make sure this model is actually used by healthcare professionals?


Models never leave their notebook

To find out, I decided to talk to those who had experienced the problem before. So, about a year ago, I started conversations with the great people at NCR: “How do you deploy the models that you, and the researchers that use your data, train?” Cutting a long answer about public APIs, regulation, privacy, and cultural and organizational hurdles short: they hardly do. It proved to be extremely hard to deploy models.

And, it was not just NCR; I talked to Data Scientists at various health insurers. Same problem. They train models and validate them, but it’s hard to deploy them into healthcare practice. I talked to Dutch scientific funding sources: yes, they fund the training of models for various healthcare applications routinely. But, actually deploying these models is challenging, and most projects don’t make it past “the notebook stage”: a nice demonstration of the validity and usefulness of the model, but no impact in practice.

The bottom line here is that it was not just me, a researcher, who was failing to move trained models to production. It's omnipresent and it is hurting our ability to truly improve the lives of patients.


Training and deployment are different

So, we have a problem on our hands: how do we move models to production? Now, the answer to this question is going to be complicated; it will take more than just a technological solution. But, I happen to find the technological problem(s) interesting, so let’s focus on those for now.

In my view, a major hurdle in deploying models is caused by the vastly different requirements that we have for model training and model deployment. Let’s look at a few:

  • Data collection and curation: During training, we like to work with curated datasets that are exported from the primary process. This is clear in the NCR example: the data involved has been surgically removed (no pun intended) from the electronic patient records in various hospitals. In production, however, we are faced with the data as it is collected. We need to be able to deal with the (raw) data from individual patients.
  • Experimenting and exploring: During training, we like tools that allow us to quickly explore and pre-process our data. We like to easily draw up plots, create new feature vectors, and visualize our results. In production we don’t need these tools; we just need to make sure that the inferences are generated quickly and reliably.
  • Memory usage and computational performance: During training, we like to have access to our whole dataset to train our models. We like to make sure that model parameters are quickly learned (using e.g. GPUs). We need big storage, and fast machines to crank away at our problems, but we often have a few minutes to spare. In production, however, we only need to evaluate a single example at a time, but this evaluation needs to be done quickly (leading to low latency and low computational costs). In production, we need lean and efficient code.
  • Portability: During training, we really only need things to work on our own machine. In deployment things are very different; we would like a vast number of consumers to use our models, likely using a whole range of different systems. Thus, we should either provide access to our models using standardized interfaces, or we should make sure our models can run anywhere, anytime. The latter is hugely important in a health context as often we would like to send models to the data source instead of sending data to some external server.

Hence, while most data scientists understandably happily chop away using their Jupyter notebooks (or various other tools), whatever happens on their local machine is all too often ill-suited to run in the hospital. Effectively, we need a method to bridge the gap from training requirements to deployment requirements.


Current deployment solutions fail

Obviously I am not the only one identifying this problem; in recent years a number of potential solutions have been suggested. Most of the solutions fall into one of the following three classes, each of which has its drawbacks.

  1. Containerize as is: A relatively simple approach to deploying models, from the point of view of the data scientist, is to just take the (e.g.,) `python` notebook code, wrap it all into a (Docker) container, and expose a REST API. If you don’t really think about it, this sounds pretty good. It isn’t really though: containers are often extremely bloated and, combining this with Python interfacing with the underlying C code for most models, this approach leads to deployments that are orders of magnitude larger (memory-wise) and orders of magnitude slower than they need to be.
  2. Rebuild from scratch: An alternative to just containerizing what you have is to rebuild the trained model from scratch into a stand-alone, highly performant application. Redo the model in C or Rust, and the strong typing and memory management will make it small and blazingly fast. You might still need to containerize the executable or recompile for various targets, but in essence, you end up with a small and fast model that can be run everywhere. Rebuilding, however, consumes a lot of valuable development time.
  3. Store and retrieve: An alternative I encountered to containerizing or rebuilding models for deployment, is storing the inferences generated by a model for multiple points in covariate space into a database which serves as a lookup table in production. This is super fast, but error-prone and virtually impossible for larger covariate spaces.
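To see why the third option breaks down, consider the following minimal sketch of a store-and-retrieve setup. The `predict` function is a made-up stand-in for a fitted model, and the covariates and grid values are hypothetical; the point is only how the table behaves.

```python
from itertools import product


def predict(features):
    """Stand-in for a fitted model: made-up survival score from (age, treated)."""
    age, treated = features
    return 0.9 - 0.004 * age + 0.05 * treated


def build_lookup_table(grids):
    """Precompute the model's inference for every combination of grid values."""
    return {point: predict(point) for point in product(*grids)}


def table_size(levels_per_covariate, n_covariates):
    """Number of rows needed when every covariate is discretized equally."""
    return levels_per_covariate ** n_covariates


# Two covariates: three age levels and a binary treatment indicator.
table = build_lookup_table([(40, 60, 80), (0, 1)])

print(len(table))          # six precomputed inferences
print(table[(60, 1)])      # a grid point is served instantly
print((65, 1) in table)    # an off-grid patient simply has no entry

# With 20 covariates at 10 levels each, the table needs 10**20 rows.
print(table_size(10, 20))
```

Lookups are fast, but any patient who does not sit exactly on the grid needs rounding or interpolation (a source of errors), and the number of rows grows exponentially with the number of covariates.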

Effectively, it would be great if we could somehow hit the sweet spot between approaches 1 and 2 above: can we allow data scientists to easily convert a fitted model into a small and efficient application that can be run anywhere?


Fast deployment using WebAssembly

Automatically rebuilding a model in such a way that it can be run in extremely small and efficient containers might sound impossible, but come to think of it, it isn’t. We solved the issue using WebAssembly, the runtimes provided by our friends at wasmer, and, admittedly, quite a lot of fiddling around with the C code underlying (e.g.,) sklearn and various compiler directives. All of this resulted in a fully automatic way of compiling (or transpiling really) a stored model object to WebAssembly using a single line of code. The result is a blazingly fast, super small “executable” that can be run in pretty much any environment; we can run the task on a server and disclose it using a simple API, but we can even run it in a browser or on the edge. The WebAssembly (or .wasm) executables can be shipped to the hospital and run within their systems, directly on the electronic patient record (EPD) data, without the data ever leaving the hospital.
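The intuition behind transpiling can be shown with a toy example. The sketch below is not our actual toolchain: it fits a one-variable linear regression by ordinary least squares in plain Python (standing in for a fitted sklearn model) and then emits a tiny, dependency-free C function with the learned parameters baked in. The function names `fit_simple_regression` and `emit_c_source` and the data are made up for this illustration.

```python
def fit_simple_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def emit_c_source(intercept, slope):
    """Generate a standalone C function that hard-codes the fitted parameters."""
    return (
        "double predict(double x) {\n"
        f"    return {intercept!r} + {slope!r} * x;\n"
        "}\n"
    )


# Made-up training data that lies exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

intercept, slope = fit_simple_regression(xs, ys)
c_source = emit_c_source(intercept, slope)
print(c_source)
```

Once the model is reduced to a few lines of C like this, none of the training stack is needed at inference time, and a compiler with a WebAssembly target (for instance Emscripten or clang) can turn the source into a small .wasm module. Real models are of course more involved than a single slope and intercept, but the principle is the same.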


A simple example

To illustrate, let’s look at some code for a simple linear regression model. The following code fits a simple regression model using sklearn, and subsequently uploads it to Scailable using the sclblpy package:

Within seconds, the model is transpiled. To make things super easy, we even host the resulting .wasm executable on our servers. This particular model upload is available under id 7d4f8549-a637-11ea-88a1-9600004e79cc, and can be tested here.

Or, alternatively, a simple cURL call like this:


curl --location --request POST '' --header 'Content-Type: application/x-www-form-urlencoded' --data-raw '{"input":{"content-type":"json","location":"embedded","data":"{\"input\": [[10]]}"},"output":{"content-type":"json","location":"echo"},"control":1,"properties":{"language":"WASM"}}'

immediately provides the resulting model inference.
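Note that the request body in the cURL call nests one JSON document inside another: the feature vector is itself serialized to a string and placed in the `data` field. The sketch below builds that same body in Python using only the standard library. The field names are taken from the cURL call above; the helper name `build_request_body` is made up, and sending the request is left out because the endpoint URL is not shown here.

```python
import json


def build_request_body(feature_rows):
    """Build the JSON body from the cURL example for the given feature rows."""
    # The inner payload is serialized first, so it ends up as an escaped string.
    inner = json.dumps({"input": feature_rows})
    body = {
        "input": {
            "content-type": "json",
            "location": "embedded",
            "data": inner,
        },
        "output": {"content-type": "json", "location": "echo"},
        "control": 1,
        "properties": {"language": "WASM"},
    }
    return json.dumps(body)


# Same single-example feature vector as in the cURL call.
body = build_request_body([[10]])
print(body)
```

Building the body programmatically avoids hand-escaping the inner quotes, which is the easiest part of the request to get wrong.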


Give it a try!

We think that our approach to putting models into production can bridge the gap from training to deployment in healthcare and beyond. It is easy, and the result is performant and portable. We however do need people to try it, give feedback, and help us improve. So, if you want to give it a try, get your own account at

Any comments are highly appreciated!



Let’s wrap this up. I think we are developing useful models faster than we are using them. I think model deployment is an issue in healthcare (and beyond). I think it is, in part, caused by the vastly different requirements that we have when training models versus deploying them. I think automatically transpiling models to WebAssembly solves this: it is easy, it is highly performant, and the resulting executable is portable. We are looking forward to seeing your feedback.







It's good to note my own involvement here: I am a professor of Data Science at the Jheronimus Academy of Data Science and one of the cofounders of Scailable. Thus, no doubt, I have a vested interest in Scailable; I have an interest in making it grow such that we can finally bring AI to production and deliver on its promises. The opinions expressed here are my own.

Bio: Prof. Dr. Maurits Kaptein is a professor of data science at Tilburg University, The Netherlands, and one of the co-founders of Scailable. Maurits has worked on various topics, from multi-armed bandits to Bayesian Additive Regression Tree (BART) models to efficient methods for adaptive clinical trials. After all that’s done, it’s time to go surfing.

Original. Reposted with permission.