Deploy your PyTorch model to Production

This tutorial aims to teach you how to deploy your recently trained model in PyTorch as an API using Python.

By Nicolás Metallo, Audatex

Following the last article about Training a Choripan Classifier with PyTorch and Google Colab, we will now look at some steps you can take if you want to deploy your recently trained model as an API. The discussion on how best to do this is currently ongoing and will most likely continue until PyTorch releases its official 1.0 version. You can find more information in the Forums, the PyTorch Documentation and Forums, and their respective GitHub repositories.


Saving and Loading Models


It’s recommended that you take a look at the PyTorch Documentation, as it’s a great place to start, but in short, there are two ways to serialize and restore a model: saving and loading only the weights (the state_dict), or saving and loading the entire model (architecture and weights). With the weights-only option, you first need to create a model instance that defines the architecture; otherwise you will end up with an OrderedDict containing just the weight values. Both options work for inference and for resuming a model's training from a previous checkpoint.


1. Using torch.save() and torch.load()


This save/load process uses the most intuitive syntax and involves the least amount of code. Saving a model this way saves the entire module using Python’s pickle module. The disadvantage of this approach is that the serialized data is bound to the specific classes and the exact directory structure used when the model was saved. This is because pickle does not save the model class itself; rather, it saves a path to the file containing the class, which is used at load time. Because of this, your code can break in various ways when used in other projects or after refactors.

Save model

torch.save(model, PATH)

Sometimes pickle is not able to serialize some model creation functions (e.g. resnext_50_32x4d, which is found in previous versions of Fastai), so you need to use dill instead. Here's the fix:

import dill
torch.save(model, PATH, pickle_module=dill)

You can read more about the limitations of pickle in this article. A common PyTorch convention is to save models using either a .pt or .pth file extension.
Load model

# Model class must be defined somewhere
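model = torch.load(PATH) # PyTorch 2.6+ defaults to weights_only=True; pass weights_only=False to load a full model from a file you trust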
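To make the pickle limitation concrete, here is a small standard-library illustration (no PyTorch involved; TinyModel is a hypothetical stand-in for your model class). It shows that a pickle stores only a reference to the class by name, so renaming the class breaks loading, and that some objects, such as lambdas, cannot be pickled at all, which is the kind of case where dill helps.

```python
import pickle


class TinyModel:
    """Hypothetical stand-in for a model class (not a real nn.Module)."""

    def __init__(self):
        self.weights = [0.1, 0.2, 0.3]


def can_pickle(obj) -> bool:
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Pickle records a reference to the class (module + name), not its code.
payload = pickle.dumps(TinyModel())
stores_class_name = b"TinyModel" in payload

# Simulate a refactor that renames the class: the old payload no longer loads.
RenamedModel = TinyModel
del TinyModel
try:
    pickle.loads(payload)
    loads_after_refactor = True
except (AttributeError, pickle.UnpicklingError):
    loads_after_refactor = False

print(stores_class_name, loads_after_refactor, can_pickle(lambda x: x))
```

The same mechanism applies when torch.save() pickles a whole model: the class must still be importable under its original module path when you call torch.load().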

2. Using state_dict
In PyTorch, the learnable parameters (e.g. weights and biases) of a torch.nn.Module model are contained in the model’s parameters (accessed with model.parameters()). A state_dict is simply a Python dictionary object that maps each parameter name (for example, fc1.weight) to its tensor. Note that only layers with learnable parameters (convolutional layers, linear layers, etc.) and registered buffers (such as BatchNorm’s running_mean) have entries in the model’s state_dict.
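A quick way to see this is to print the state_dict of a small model. The sketch below uses a hypothetical conv/batch-norm/linear stack (not the Choripan model) and lists every key with its tensor shape.

```python
import torch.nn as nn

# Hypothetical model: conv -> batch norm -> ReLU -> flatten -> linear.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3),  # a 32x32 input gives 30x30 feature maps
    nn.BatchNorm2d(8),
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 30 * 30, 2),
)

state = model.state_dict()
for name, tensor in state.items():
    print(name, tuple(tensor.shape))
```

Only layers 0 (Conv2d), 1 (BatchNorm2d) and 4 (Linear) appear. ReLU and Flatten have nothing to store, while BatchNorm2d contributes its running_mean, running_var and num_batches_tracked buffers in addition to its weight and bias.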

We will need to re-initialize our model in the same way it was originally defined and created when we saved the weights, making sure the variables, classes, and functions that go into creating the model are available, whether through module imports or directly within the same script/file. One potential advantage of this method is that you can use updated scripts to load the old model as long as the parameters are the same, and it’s also the approach recommended by the official documentation. Remember also that load_state_dict() takes a dictionary object, not a path to a saved object, so you can't load using model.load_state_dict(PATH); deserialize the file first with torch.load(PATH).

Save model

torch.save(model.state_dict(), PATH)

Load model

model = TheModelClass(*args, **kwargs) # Model class must be defined somewhere
model.load_state_dict(torch.load(PATH)) # load_state_dict() expects a dict, not a path
model.eval() # run if you only want to use it for inference

You run model.eval() after loading because models usually have BatchNorm and Dropout layers, which are in training mode by default when constructed. If you want to resume training instead, skip model.eval() (or call model.train() to switch back to training mode).
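Putting the state_dict steps together, here is a minimal round trip under stated assumptions: TheModelClass is a hypothetical small network with a Dropout layer, and the file goes to a temporary directory rather than your real PATH. After loading the dictionary into a fresh instance and switching both models to eval mode, they produce identical outputs.

```python
import os
import tempfile

import torch
import torch.nn as nn


class TheModelClass(nn.Module):
    """Hypothetical model with Dropout, so train/eval mode matters."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 16)
        self.dropout = nn.Dropout(p=0.5)
        self.fc2 = nn.Linear(16, 2)

    def forward(self, x):
        return self.fc2(self.dropout(torch.relu(self.fc1(x))))


torch.manual_seed(0)
original = TheModelClass()
path = os.path.join(tempfile.mkdtemp(), "model.pth")

# Save only the learnable parameters and buffers, not the class itself.
torch.save(original.state_dict(), path)

# Re-create the architecture first, then load the dictionary returned by torch.load.
restored = TheModelClass()
restored.load_state_dict(torch.load(path))
restored.eval()  # Dropout becomes a no-op for inference
original.eval()

x = torch.randn(3, 4)
with torch.no_grad():
    same_output = torch.equal(original(x), restored(x))
print(same_output)
```

Leaving out the eval() calls would keep Dropout active and make the two outputs differ from call to call, which is exactly the inference bug the paragraph above warns about.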

As we have done our training with Fastai, we can call Learner.save and Learner.load to save and load models (more info in the documentation). These use state_dict() under the hood, so only the model parameters are saved, not the architecture. This means that you will need to run the create_cnn method to get a pre-trained model from a given architecture (the same one you used to train your model, e.g. models.resnet34) with a custom head suitable for your data. Models are saved to and loaded from the path/model_dir directory, and the .pth extension is added automatically for both operations.


Simple Deployment With Flask


After training our classifier on the free GPU available in Google Colab, we are ready to do inference on our end. We can do it locally or in the cloud, and there are many different options to choose from (AWS, Paperspace, Google Cloud, etc.). As I have some free Amazon AWS credits remaining, I’ll be using an Amazon AMI that comes with several ML libraries already installed, hosted on a t2.medium instance. Here are some simple instructions to run a Docker image on your end (the steps should be about the same on any machine, since we are not doing GPU training).

Ready-to-run Docker images

Jupyter Docker Stacks are a great way to get a notebook up and going in no time with the latest libraries. These are ready-to-run Docker images that contain Jupyter applications and interactive computing tools. Learn more about them in the Official Documentation. We will use the Jupyter Notebook Data Science Stack from this repository.

Once you have installed Docker, open the terminal, cd into your working directory, and run

$ docker run --rm -p 8888:8888 -e JUPYTER_ENABLE_LAB=yes -v "$PWD":/home/jovyan/work jupyter/datascience-notebook:e5c5a7d3e52d

This starts a Jupyter server that you can log into from your browser to work in a notebook. You can run some commands directly from there, or you can run bash (or any other command) inside the Docker container: get the container ID by typing docker ps in the terminal, and then run

$ docker exec -it {container-id} /bin/bash

Install PyTorch & Fastai

Depending on your machine configuration, you will want to run inference on either the GPU or the CPU. In our example, we are going to run everything on the CPU, so install the CPU-only build of the latest PyTorch (the installation page on the PyTorch website gives the exact command for your platform).

Now you can install Fastai with pip install fastai.

Create a Flask Application

Install the Flask library by running

pip install -U flask

We will create a folder called flask_app with two new Python files: server.py, with our code to load the model weights and run the inference server, and settings.py, which sets some basic parameters to give us more flexibility in the future. The following is an example of what flask_app/settings.py might look like. We will then import it into server.py with from settings import *.

# add your custom labels
labels = ['Not Choripan', 'Choripan']
# set your data directory
data_dir = 'data'
# set the URL where you can download your model weights
MODEL_URL = '' # example weights
# set some deployment settings
PORT = 8080

We can now go through flask_app/server.py. This first part imports the libraries and settings.

# flask_app/server.py
# import libraries
print('importing libraries...')
from flask import Flask, request, jsonify
import logging
import random
import time
from PIL import Image
import requests, os
from io import BytesIO
# import fastai stuff
from fastai import *
from fastai.vision import *
import fastai
# import settings
from settings import * # import settings.py
print('done!\nsetting up the directories and the model structure...')

In order to run single-image inference, we first need to create a new model that follows the same folder structure we used when we trained it. That’s why we are going to create new empty directories based on the labels that we set before in settings.py.

# set dir structure
def make_dirs(labels, data_dir):
    root_dir = os.getcwd()
    splits = ['train', 'valid', 'test'] # avoid shadowing the function name
    for split in splits:
        name = os.path.join(root_dir, data_dir, split)
        for each in labels:
            os.makedirs(os.path.join(name, each), exist_ok=True)
make_dirs(labels=labels, data_dir=data_dir) # labels and data_dir come from settings.py
path = Path(data_dir)

Once that path is defined, we are going to create a new Learner and download the pre-trained weights for the Choripan Classifier.

# download model weights if not already saved
path_to_model = os.path.join(data_dir, 'models', 'model.pth')
if not os.path.exists(path_to_model):
    print('done!\nmodel weights were not found, downloading them...')
    os.makedirs(os.path.join(data_dir, 'models'), exist_ok=True)
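    filename = Path(path_to_model)
    r = requests.get(MODEL_URL)
    r.raise_for_status() # stop if the download failed
    filename.write_bytes(r.content) # save the weights to data/models/model.pth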
print('done!\nloading up the saved model weights...')
fastai.defaults.device = torch.device('cpu') # run inference on cpu
empty_data = ImageDataBunch.single_from_classes(
    path, labels, tfms=get_transforms(), size=224).normalize(imagenet_stats)
learn = create_cnn(empty_data, models.resnet34)
learn = learn.load('model')
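The download step above keeps the whole file in memory and writes it straight to model.pth. If that download is interrupted, the existence check would skip a truncated file on the next start. As an alternative, here is a minimal standard-library sketch (download_if_missing is a hypothetical helper, not part of Fastai) that streams to a temporary file and only renames it into place once the download has finished.

```python
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_if_missing(url: str, dest: str) -> bool:
    """Download url to dest unless dest already exists.

    Returns True if a download happened, False if the file was already there.
    """
    if os.path.exists(dest):
        return False
    dest_dir = os.path.dirname(dest) or "."
    os.makedirs(dest_dir, exist_ok=True)
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as out, urllib.request.urlopen(url) as response:
            shutil.copyfileobj(response, out)  # stream in chunks, not all in memory
        os.replace(tmp_path, dest)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial file behind
        raise
    return True


# Usage demo with a local file:// URL standing in for MODEL_URL.
demo_dir = tempfile.mkdtemp()
demo_src = os.path.join(demo_dir, "remote_model.pth")
with open(demo_src, "wb") as f:
    f.write(b"fake weights")
demo_dest = os.path.join(demo_dir, "models", "model.pth")
first = download_if_missing(pathlib.Path(demo_src).as_uri(), demo_dest)
second = download_if_missing(pathlib.Path(demo_src).as_uri(), demo_dest)
```

In server.py this would be called as download_if_missing(MODEL_URL, path_to_model) before creating the Learner.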

There are already many great, detailed tutorials online, so I won’t explain much of how Flask works. I created a predict function that takes the image URL it receives as input, runs the image through learn.predict(img) to get the predicted class, and returns it as JSON.

print('done!\nlaunching the server...')
# set flask params
app = Flask(__name__)
@app.route('/')
def hello():
    return "Image classification example\n"
@app.route('/predict', methods=['GET'])
def predict():
    url = request.args['url']
    app.logger.info("Classifying image %s" % (url),)

    response = requests.get(url)
    img = open_image(BytesIO(response.content))
    t = time.time() # get execution time
    pred_class, pred_idx, outputs = learn.predict(img)
    dt = time.time() - t
    app.logger.info("Execution time: %0.02f seconds" % (dt))
    app.logger.info("Image %s classified as %s" % (url, pred_class))
    return jsonify(str(pred_class)) # convert the Category object to a JSON-serializable string
if __name__ == '__main__':
    app.run(host="0.0.0.0", debug=True, port=PORT) # listen on all interfaces so the instance is reachable remotely

Once that is done, we can head over to the terminal, cd into the flask_app directory, and run python server.py. We should see something like this:

* Serving Flask app "server" (lazy loading)
 * Environment: production
   WARNING: Do not use the development server in a production environment.
   Use a production WSGI server instead.
 * Debug mode: on
 * Running on (Press CTRL+C to quit)
 * Restarting with stat
importing libraries...
setting up the directories and the model structure...
loading up the saved model weights...
launching the server...
 * Debugger is active!
 * Debugger PIN: 261-786-850

That’s it! Now we can run commands like these from the terminal (I’m running an AWS instance). Let’s take this image for example.
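As a minimal sketch of a Python client, assuming the server is reachable at a hypothetical address such as http://localhost:8080 (use your instance's host and the PORT from settings.py), you only need to call the /predict route with the image URL as a query parameter. build_predict_url and classify are hypothetical helpers. Percent-encoding matters here: an image URL that contains its own ? or & would otherwise be split into several query parameters before it reaches request.args['url'].

```python
import json
import urllib.parse
import urllib.request


def build_predict_url(server: str, image_url: str) -> str:
    """Build the GET URL for the /predict route, percent-encoding the image URL."""
    query = urllib.parse.urlencode({"url": image_url})
    return f"{server.rstrip('/')}/predict?{query}"


def classify(server: str, image_url: str, timeout: float = 30.0):
    """Call the Flask API and return the decoded JSON (the predicted class)."""
    with urllib.request.urlopen(build_predict_url(server, image_url), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, classify("http://localhost:8080", image_url) would return the class name string produced by jsonify(str(pred_class)).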

And this is how it looks from the server side

[2018-11-13 16:49:32,245] INFO in server: Classifying image
[2018-11-13 16:49:33,836] INFO in server: Execution time: 1.35 seconds
[2018-11-13 16:49:33,858] INFO in server: Image classified as Choripan

Not bad! You now have your very own “Choripan/Not Choripan” API. If you want to move to the next level, please check out this tutorial from the Flask Documentation to deploy to Production and/or this other tutorial if you want to Dockerize your Flask application (you can also use docker-compose).


Other Ways to Deploy to Production


1. Image Classification Example Using Clipper


There’s a great ipynb in the ClipperTutorials GitHub repository that you can follow to learn the basics of how everything works. They provide a Docker image, or you can just run their Amazon AMI. Sadly, this only works with PyTorch 0.4.0, which makes it a real pain to use when your models have been trained with the latest preview versions of PyTorch and Fastai. It works great with the example pre-trained model, though.

 Creating a ClipperConnection

To start Clipper, you must first create a ClipperConnection object with the type of ContainerManager you want to use. In this case, you will be using the DockerContainerManager.

from clipper_admin import ClipperConnection, DockerContainerManager
clipper_conn = ClipperConnection(DockerContainerManager())

 Starting Clipper

Now that you have a ClipperConnection object, you can start a Clipper cluster.
The following command will start 3 Docker containers:

  1. The Query Frontend: The Query Frontend container listens for incoming prediction requests and schedules and routes them to the deployed models.
  2. The Management Frontend: The Management Frontend container manages and updates the cluster’s internal configuration state, such as tracking which models are deployed and which application endpoints have been registered.
  3. A Redis instance: Redis is used to persistently store Clipper’s internal configuration state. By default, Redis is started on port 6380 instead of the standard Redis default port 6379 to avoid collisions with any Redis instances that are already running.

clipper_conn.start_clipper()
clipper_addr = clipper_conn.get_query_addr()

Take a look at the containers Clipper has started.

!docker ps --filter label=ai.clipper.container.label

 Create an Application

app_name = "squeezenet-classsifier"
default_output = "default"

When you list the applications registered with Clipper, you should see the newly registered application show up.


 Load an example pre-trained PyTorch model

from torchvision import models, transforms
model = models.squeezenet1_1(pretrained=True)

PyTorch models cannot just be pickled and loaded. Instead, they must be saved using PyTorch’s native serialization API. Because of this, you cannot use the generic Python model deployer to deploy the model to Clipper. Instead, you will use the Clipper PyTorch deployer to deploy it. The Docker container will load and reconstruct the model from the serialized model checkpoint when the container is started.


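import requests

# First we define the preprocessing on the images (standard ImageNet evaluation transforms):
normalize = transforms.Normalize(
   mean=[0.485, 0.456, 0.406],
   std=[0.229, 0.224, 0.225]
)
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize
])
# Then we download the labels:
labels = {int(key):value for (key, value)
          in requests.get('').json().items()}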

 Define a predict function and add metrics

import clipper_admin.metrics as metrics
def predict_torch_model(model, imgs):
    import io
    import PIL.Image
    import torch
    import clipper_admin.metrics as metrics

    metrics.add_metric("batch_size", 'Gauge', 'Batch size passed to PyTorch predict function.')
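    metrics.report_metric('batch_size', len(imgs)) # report the batch size

    # We first prepare a batch from `imgs` (raw image bytes)
    img_tensors = []
    for img in imgs:
        img_tensor = preprocess(PIL.Image.open(io.BytesIO(img)).convert("RGB"))
        img_tensors.append(img_tensor)
    img_batch = torch.stack(img_tensors)

    # We perform a forward pass
    with torch.no_grad():
        model_output = model(img_batch)

    # Parse Result: pick the highest-scoring class for each image
    img_labels = [labels[int(out.argmax())] for out in model_output]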

    return img_labels
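To see what the batching and label-parsing steps in predict_torch_model do, here is a sketch that runs them on random tensors, using a hypothetical tiny model and a made-up three-class label map instead of SqueezeNet and the ImageNet labels.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a tiny "classifier" and a 3-class label map.
labels = {0: "cat", 1: "dog", 2: "bird"}
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 3))
model.eval()

# Pretend these are the per-image tensors produced by preprocess(...).
img_tensors = [torch.rand(3, 8, 8) for _ in range(4)]

# torch.stack adds a leading batch dimension: 4 x (3, 8, 8) -> (4, 3, 8, 8).
img_batch = torch.stack(img_tensors)

with torch.no_grad():
    model_output = model(img_batch)  # shape (4, 3): one row of scores per image

# argmax over each row picks the highest-scoring class index for that image.
img_labels = [labels[int(out.argmax())] for out in model_output]
print(tuple(img_batch.shape), tuple(model_output.shape), img_labels)
```

Batching this way is what lets Clipper pass several queued requests to the model in a single forward pass.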

Clipper must download this Docker image from the internet, so this may take a minute.

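from clipper_admin.deployers import pytorch as pytorch_deployer
pytorch_deployer.deploy_pytorch_model(
    clipper_conn,
    name="pytorch-model",
    version=1,
    input_type="bytes", # raw image bytes, sent base64-encoded in the JSON query
    func=predict_torch_model, # predict function wrapper
    pytorch_model=model, # pass model to function
)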

Now link the generated pytorch-model to the application squeezenet-classsifier we created before.

clipper_conn.link_model_to_app(app_name="squeezenet-classsifier", model_name="pytorch-model")

That’s it!

How to query the API with Requests

import requests
import json
import base64

clipper_addr = 'localhost:1337'

for img in ['img1.jpg', 'img2.jpg', 'img3.jpg']: # example with local images
    with open(img, "rb") as f:
        req_json = json.dumps({
            "input": base64.b64encode(f.read()).decode() # bytes to unicode
        })

    response = requests.post(
        "http://%s/%s/predict" % (clipper_addr, 'squeezenet-classsifier'),
        headers={"Content-type": "application/json"},
        data=req_json)
    print(response.json())


Stopping Clipper
If you run into issues and want to completely stop Clipper, you can do this by calling ClipperConnection.stop_all().

clipper_conn.stop_all()

When you list all the Docker containers a final time, you should see that all of the Clipper containers have been stopped.

!docker ps --filter label=ai.clipper.container.label


2. Using Now from Zeit


Another option under discussion in the Forums is to use the Now service from Zeit. You can follow this guide in the Documentation. I have tried this but have not gotten accurate results (probably because of normalization). It seems very promising, though.

You will need to run these commands only one time. The first installs Now’s CLI (Command Line Interface).

sudo apt install npm # if not already installed
sudo npm install -g now

And the next downloads a starter pack for model deployment based on Lesson 2.

Upload your trained model file

Upload your trained model file (for example stage-2.pth) to a cloud service like Google Drive or Dropbox, and copy the download link for the file. Note: the download link is the one that starts the file download directly, and it is normally different from the share link, which presents you with a page to download the file from.
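If you only have a share link, it can often be converted into a direct-download link. The helpers below are a sketch based on two URL conventions that are an assumption about how these services behave at the time of writing: Dropbox serves the file directly when the link has dl=1, and Google Drive accepts uc?export=download&id=<file id>. Both services may change this, and Google Drive may show a confirmation page instead of the file for large downloads.

```python
import re
from urllib.parse import parse_qs, urlencode, urlparse, urlunparse


def dropbox_direct_link(share_url: str) -> str:
    """Force dl=1 on a Dropbox share link, keeping any other query parameters."""
    parts = urlparse(share_url)
    query = parse_qs(parts.query)
    query["dl"] = ["1"]
    return urlunparse(parts._replace(query=urlencode(query, doseq=True)))


def google_drive_direct_link(share_url: str) -> str:
    """Map .../file/d/<FILE_ID>/view?... to a uc?export=download URL (assumed format)."""
    match = re.search(r"/file/d/([^/?#]+)", share_url)
    if match is None:
        raise ValueError(f"no Google Drive file id found in {share_url!r}")
    return "https://drive.google.com/uc?export=download&id=" + match.group(1)
```

Whichever helper you use, open the resulting link in a browser once to confirm it downloads the .pth file rather than an HTML page.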

Customize the app for your model

  1. Open up the file inside the app directory and update the model_file_url variable with the url copied above
  2. In the same file, update the line classes = ['black', 'grizzly', 'teddys'] with the classes you are expecting from your model

On the terminal, make sure you are in the zeit directory, then type:

now

The first time you run this, it will prompt for your email address and create your Now account for you. After your account is created, run it again to deploy your project.

Every time you deploy with now, it’ll create a unique Deployment URL for the app, which is shown while you are deploying it.


3. Using Torch Script and PyTorch C++ API

These are some of the most recent changes that will come with the official 1.0 version of PyTorch. You can follow the instructions in this article from the Documentation, or check out the last chapter of Udacity's Intro to Deep Learning with PyTorch course, where they go through this step by step. This is just an overview of the main steps.

import torch
import torchvision
# An instance of your model.
model = torchvision.models.resnet18()
# An example input you would normally provide to your model's forward() method.
example = torch.rand(1, 3, 224, 224)
# Use torch.jit.trace to generate a torch.jit.ScriptModule via tracing.
traced_script_module = torch.jit.trace(model, example)
# Save the model
traced_script_module.save("model.pt")
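Before moving to C++, you can check the saved file from Python: torch.jit.load restores a traced module without needing the original Python class definition. This minimal sketch uses a small hypothetical model instead of resnet18 to keep it fast, and writes to a temporary directory.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Small hypothetical model (stands in for resnet18 to keep the example quick).
model = nn.Sequential(
    nn.Conv2d(3, 4, kernel_size=3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(4, 2),
)
model.eval()

# Tracing runs the example input through the model and records the operations.
example = torch.rand(1, 3, 224, 224)
traced = torch.jit.trace(model, example)

path = os.path.join(tempfile.mkdtemp(), "model.pt")
traced.save(path)

# Load it back; no Python class is needed, which is what makes C++ loading possible.
loaded = torch.jit.load(path)

# For this model the recorded ops do not hard-code the batch size.
x = torch.rand(2, 3, 224, 224)
with torch.no_grad():
    matches = torch.allclose(model(x), loaded(x))
print(matches)
```

Keep in mind that tracing only records the path taken by the example input, so models with data-dependent control flow need scripting rather than tracing.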

Build a minimal C++ application

  • Follow these steps and build example-app.cpp and CMakeLists.txt.
  • Install Anaconda and get CMake to run on your machine. You can install it through their binaries or, if you are running macOS, type brew install cmake (install Homebrew through these instructions). If you have problems with CMake, remember to download the Xcode Command Line Tools as a .dmg directly from here (macOS 10.14 in my case).
  • Install Caffe2 from here and run conda install pytorch-nightly-cpu -c pytorch
  • Build the application

PyTorch, Libtorch, C++, and NodeJS

Look at



I tried my best to summarize some of the options available to deploy your recently trained PyTorch models. I hope this is useful, and I look forward to reading your comments.

Bio: Nicolás Metallo is an award-winning entrepreneur with nearly 10 years of professional experience. He graduated from New York University with an MSc in Technology Management & Innovation and he works as a management consultant and freelance deep learning engineer. Nicolas is also the co-founder of INVIP Labs Inc., a social enterprise that helps blind and low vision individuals understand their environment better through computer vision. His particular interest in data science is focused on providing cities with data to make them more connected, efficient, resilient, vibrant, and prosperous.

Original. Reposted with permission.