Deploying a pretrained GPT-2 model on AWS

This post attempts to summarize my recent detour into NLP, describing how I exposed a Huggingface pre-trained Language Model (LM) on an AWS-based web application.

By Francesco Pochetti, Senior Data Scientist


In the Deep Learning (DL) world, I have always preferred Computer Vision (CV) to the rest. I find working with images a lot more fulfilling than anything else as I can inspect what gets in and out of my models by simply (literally) looking at it.

Driven by this personal taste, I tend to purposely ignore some other equally important fields where neural networks shine, and focus almost only on CV. You might argue that there is nothing wrong with investing time in what one likes. Nevertheless, every now and then it is important to take a step back and catch up with what is happening in the Machine Learning (ML) community. This is what I recently did with Natural Language Processing (NLP). NLP has been evolving at breakneck speed in the last couple of years and I felt I had almost completely lost touch with the field. It was time to sync up.

First, let’s get a sneak peek at what the service I built looks like. You can try it out yourself or watch the quick demo below. As you can see, it all boils down to prompting the network with some text and having it type the rest for us. There are a couple of interesting parameters (some more self-explanatory than others) a user can tweak to affect the quality of the result. We will dive deep into those later on.

This app is definitely not the first one offering NLP enthusiasts the possibility to experiment with LMs online. Write With Transformer by Huggingface and Talk To Transformer by Adam King are great examples of this technology. Nevertheless, I thought this could be a nice exercise to brush up on my rusty NLP knowledge and combine it with some fun with AWS.


Deploying with Lambda, EC2, and DynamoDB


The AWS architecture I implemented is relatively straightforward. By clicking on the “Let the machine takeover!” button, users submit a POST request to an API Gateway endpoint. This event triggers a Lambda function, which takes care of the following:

  1. It parses the arguments (a deep dive into those in a subsequent section) sent over by the frontend (via API Gateway) in JSON format. Those include:
    • prompt: the piece of text used to trigger the model.
    • num_samples: the number of text samples to be generated starting from the prompt.
    • length: the number of tokens (roughly, words) the model generates (per sample) starting from the prompt.
    • temperature: how much randomness to inject into the text generation process.
    • top_p: the cumulative probability to use when selecting the most likely words to come next at each step of the model. I.e., sort the candidate next tokens by descending likelihood, calculate the cumulative sum of probabilities starting from the top and cut the list when that number hits top_p.
    • top_k: the number of words to consider as potential candidates for the next token, selected among the top ones by probability.
  2. It generates a random number (dynamoid) between 1 and 1M. This is needed to create an entry (associated with that ID) in the huggingface DynamoDB table, and store the text the LM generates.
  3. It checks whether the c5.xlarge EC2 instance I have already set up is stopped or up and running. If it is stopped, it spins the machine up and waits for it to be operational. Otherwise, it moves on.
  4. It executes a script on the EC2 instance. This is achieved via an SSM agent, which forwards the list of commands to the machine. Here is a great guide to setting up the agent to successfully interface with EC2 (mainly a matter of IAM permissions). Those commands include:
    • Instruct the system to shut down the EC2 after 30 minutes
    • Activate the pytorch_p36 anaconda environment
    • Execute a python script running the actual text generation task. I will get into the details of what this step entails from a DL standpoint in a later section. Once the model is done with the ML magic, python pushes the result to the huggingface DynamoDB table, under the dynamoid ID.
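To make the four steps above more concrete, here is a minimal sketch of how such a Lambda could be organized. It is not the actual function: the default values, the generate.py script name, its command-line flags and the exact shell commands are all assumptions of mine, and I use top_k for the candidate-count parameter. The boto3 calls (describe_instances, start_instances, the instance_running waiter and send_command with the AWS-RunShellScript document) do exist, but how they are wired together here is just one plausible arrangement.

```python
import json
import random
import shlex

# Hypothetical defaults; the real frontend may send different values.
DEFAULTS = {"num_samples": 1, "length": 50, "temperature": 1.0, "top_p": 0.9, "top_k": 0}


def parse_event(event):
    """Step 1: extract the generation parameters from the API Gateway event."""
    body = event.get("body", event)
    if isinstance(body, str):  # proxy integrations deliver the payload as a JSON string
        body = json.loads(body)
    params = {"prompt": str(body["prompt"])}
    for name, default in DEFAULTS.items():
        # Cast each value to the type of its default (int or float).
        params[name] = type(default)(body.get(name, default))
    return params


def new_dynamoid():
    """Step 2: random ID between 1 and 1M, used as the DynamoDB key."""
    return random.randint(1, 1_000_000)


def build_commands(params, dynamoid, script="generate.py"):
    """Step 4: shell commands for the instance (script name and flags are made up)."""
    flags = " ".join(f"--{k} {shlex.quote(str(v))}" for k, v in params.items())
    return [
        "shutdown -h +30",              # power the machine off after 30 minutes
        "source activate pytorch_p36",  # activate the anaconda environment
        f"python {script} --dynamoid {dynamoid} {flags}",
    ]


def run_on_ec2(instance_id, commands):
    """Steps 3-4: start the instance if it is stopped, then send the commands via SSM."""
    import boto3  # imported lazily so the helpers above work without AWS access

    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    state = reservations[0]["Instances"][0]["State"]["Name"]
    if state == "stopped":
        ec2.start_instances(InstanceIds=[instance_id])
        # The SSM agent may need a little longer than this waiter to come up.
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    boto3.client("ssm").send_command(
        InstanceIds=[instance_id],
        DocumentName="AWS-RunShellScript",
        Parameters={"commands": commands},
    )
```

Keeping the parsing and command-building helpers free of AWS calls makes them easy to try locally before wiring them into a real function.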

In the meantime, Lambda keeps running. Given that it does not know whether step 4 has ended successfully or not, it starts polling the huggingface DynamoDB table, in an attempt to access the dynamoid entry. This is, of course, a way of indirectly checking when EC2 has completed its job. Once DynamoDB returns the entry, Lambda forwards it to the frontend, closing the API Gateway loop.
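The waiting logic can be sketched as a plain polling loop. This is an illustration rather than the code running in my Lambda: the function name, the timeout and the polling interval are assumptions, and fetch stands for whatever reads the dynamoid entry from the table and returns None while it is still missing.

```python
import time


def wait_for_entry(fetch, dynamoid, timeout_s=25.0, interval_s=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(dynamoid) until it returns an item or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        item = fetch(dynamoid)
        if item is not None:
            return item  # EC2 has pushed the generated text
        if clock() >= deadline:
            raise TimeoutError(f"no entry for dynamoid {dynamoid} after {timeout_s}s")
        sleep(interval_s)
```

In a real function, fetch could wrap the get_item call of a boto3 DynamoDB Table and return the "Item" field of the response when present. The name of the key attribute depends on how the table was created.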


Why not go entirely serverless?

This seems like a lot of overhead for a process that consists of just running inference on an LM, without even the need for a GPU. Why can’t we get rid of EC2 and DynamoDB, and execute the whole thing in Lambda alone?

The problem is that we don’t have enough ephemeral disk space on Lambda to store the pre-trained LM weights. Remember, we are taking advantage of a pre-trained network. This means we have to store its parameters somewhere and load them into the model to make it operational. The smallest huggingface pre-trained LM (distilGPT2) carries 300 MB of weights. Lambda offers 512 MB of disk space in total. As we also have to make the relevant python libraries available in Lambda (torch, numpy, transformers, etc.), we quickly run out of space. So, contrary to what I had envisaged, the issue does not come from loading non-base python packages. Even more esoteric and heavy libraries such as torch and transformers can be successfully imported by leveraging the flexibility of Lambda Layers (nice tutorials here and here). Those consist of zipped folders containing the package’s source code. Building them literally amounts to:

  1. Install the library of choice on a Linux machine (yet another EC2!), making sure that the python version you are using is the same as your Lambda’s runtime.
  2. Grab the folders created during the installation and zip them.
  3. Upload the zip file to Lambda directly, or S3 in case the size exceeds the allowed limit. Congrats, you have created a layer.
  4. Attach the layer to your Lambda function.
  5. Now you can import the python package in Lambda. FYI, what Lambda does at runtime is simply unzip the folder and append it to the path.

For the transformers library, the above steps are exactly what I did to get it to work. Point 2, in this case, translates into looking for the transformers folder within the site-packages directory of the anaconda environment I was working in.
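The zipping step (point 2) can also be scripted. Below is a small sketch, assuming the package has already been installed somewhere on disk. The function name and arguments are mine. What the sketch relies on is that Lambda extracts layers under /opt and that, for python runtimes, the python/ folder inside a layer ends up on the import path, which is why everything is stored under a top-level python/ directory.

```python
import os
import zipfile


def build_layer_zip(site_packages, package, out_path):
    """Zip site_packages/package under a top-level python/ folder."""
    src = os.path.join(site_packages, package)
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src):
            for name in files:
                full = os.path.join(root, name)
                # Path relative to site-packages, e.g. transformers/__init__.py
                rel = os.path.relpath(full, site_packages)
                zf.write(full, os.path.join("python", rel))
    return out_path
```

Calling build_layer_zip on the site-packages directory of the environment, with "transformers" as the package, would produce the archive to upload in point 3.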

For torch, which is, by the way, a dependency of transformers, it gets a little trickier. You cannot use the same strategy, as the pytorch+dependencies bundle exceeds the hard limit of 250 MB for the uncompressed layer. Luckily enough, Matt McClean provides a publicly available pytorch layer, in which he implements a nice trick to get around the size issue. You just need to create a new layer with the appropriate ARN (arn:aws:lambda:<YOUR REGION>:934676248949:layer:pytorchv1-py36:2) and add the following snippet at the very top of your Lambda function.

# this import statement is needed if you want to use the AWS Lambda Layer called "pytorch-v1-py36"
# it unzips all of the pytorch & dependency packages when the script is loaded to avoid the 250 MB unpacked limit in AWS Lambda
try:
    import unzip_requirements
except ImportError:
    pass


This is a dummy Lambda function I created to test Lambda Layers. As you can see, I attached both the pytorch and the transformers (I named it huggingface) layers and managed to import all the relevant python packages within the Lambda runtime.



Generating text with Huggingface transformers: a deep dive

The purpose of this section is to step through the inference code and reconstruct what happens under the hood at prediction time. This script is, specifically, what gets invoked by Lambda and executed on the c5.xlarge EC2 instance. Let’s check out, line by line, its most important parts (my script is a simplified version of the code in the official huggingface repo).

Line 71 sets the CPU as the device where the model is going to run. No GPUs available on c5 instances.

Line 72 initializes the GPT2LMHeadModel and the GPT2Tokenizer. The former is the actual network, the latter the object containing information about the vocabulary, how to encode text into numbers and go the other way around during the decoding phase.

Lines 73-74 read the model’s weights and the vocabulary from disk and load them into the tokenizer and model objects instantiated before. Beware that for weights and vocabulary to be available on disk you need to save them beforehand. If you run the script for the first time, they won’t be there. In this case, instead of passing a directory (.) to from_pretrained, you need to provide the name of the model you intend to load (gpt2 in our case). Huggingface takes care of downloading the necessary files from S3. If you want to persist those files (as we do) you have to invoke save_pretrained (lines 78-79) with a path of choice, and the method will do what you think it does. The next time you run the script, lines 73-74 will no longer download from S3, but load from disk instead.

Lines 75-76 instruct the model to run on the chosen device (CPU) and set the network to evaluation mode. This is a way to inform the model that it will only be used for inference; therefore, training-specific behavior (such as dropout) is switched off.
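The load-or-download logic described above can be condensed into a few lines. This is a sketch rather than a copy of my script: the helper names are made up, and checking for config.json (one of the files save_pretrained writes for the model) is just one simple way of telling whether a local copy exists. The from_pretrained, save_pretrained, to and eval calls are the regular transformers and torch ones.

```python
import os


def pick_source(model_dir=".", model_name="gpt2"):
    """Return the local directory if a saved model is there, else the model name."""
    has_local_copy = os.path.isfile(os.path.join(model_dir, "config.json"))
    return model_dir if has_local_copy else model_name


def load_model(model_dir=".", model_name="gpt2"):
    """Load tokenizer and model on CPU, persisting them on the first run."""
    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    source = pick_source(model_dir, model_name)
    tokenizer = GPT2Tokenizer.from_pretrained(source)
    model = GPT2LMHeadModel.from_pretrained(source)
    if source == model_name:
        # First run: files were downloaded, so save them for next time.
        tokenizer.save_pretrained(model_dir)
        model.save_pretrained(model_dir)
    device = torch.device("cpu")  # no GPUs on c5 instances
    model.to(device)
    model.eval()  # inference only: disables dropout
    return tokenizer, model, device
```

The heavy imports live inside load_model so that pick_source can be exercised on a machine without torch installed.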

Lines 82-90 parse the arguments provided to python via the command line (prompt, dynamoid, temperature, etc.).

Line 93: the tokenizer splits prompt into tokens and encodes them into their numerical values (context_tokens).

Line 94 invokes sample_sequence, the function in charge of generating num_samples text samples of length length. Let’s see what happens inside the function. Jumping to line 46.

Lines 47 and 48 first turn the context (a.k.a. context_tokens) list into a tensor, then add the batch dimension at the front (unsqueeze(0)) and finally replicate the contents of the tensor num_samples times. This is because we ask the model to generate multiple pieces of text all starting from the same prompt, so we just create copies of it.

Line 49 assigns context to a new variable, generated.

Line 51: the actual inference loop. We enter the loop length times, each time to generate a new word given the previous ones and each time appending the model output to the chain to feed it back into the network. This is where things get interesting.

Line 53 defines the model inputs. No surprise in finding that it is just referring to generated.

Line 55 invokes the model on the inputs and spits out outputs. Those come in the form of logits, each one associated with a token in the vocabulary. To turn them into probabilities we need to pass them through a softmax activation function.

Line 56 divides logits by temperature, a number generally between 0 and 1. Its value can actually be arbitrarily high, even though, as we’ll shortly realize, it would not make much sense. So, what is temperature? Previously, I defined it as the parameter responsible for how much randomness to inject into the sampling process. The reason becomes more transparent if we pass logits through softmax and turn them into probabilities. As we know, this step consists of exponentiating logits and normalizing by their sum. The effect of this operation is twofold: it turns activations into a probability distribution, and it makes the largest logit stand out from the crowd even more than it already did.

When temperature equals 1, logits are left untouched and nothing changes: the most likely word keeps dominating. If we try setting temperature to 0.1 instead, we are multiplying logits by 10. The gaps between logits widen tenfold and softmax happily rewards the largest one even more, assigning it a probability close to 1 and squeezing all the other words to almost 0. The closer temperature gets to 0, the more we turn the sampling process into a no-brainer for the model: the most likely word has a probability of ~1 and all the rest ~0. We therefore end up with a very predictable model, super confident of itself and with little randomness involved. In the extreme case of temperature equaling 0, we just perform greedy sampling, picking the single most probable token. This is what happens in lines 64-65.

On the contrary, in the extreme case of temperature being unreasonably high (say 100), we would be dividing logits by a large number, flattening out the differences across words. The effect of applying softmax to logits scaled this way is that we end up with probabilities very close to each other, with no predominant one. Put differently, the model becomes very confused about what to choose next.

Below, you can check a visual representation of the effect of temperature on the probability distribution of a dummy vocabulary of 4 words (excel file available here).

To convince yourself of what we just saw, you can play around with this parameter yourself and check how it affects results.
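If you prefer numbers to charts, the effect of temperature is easy to reproduce in a few lines of plain python. The logits below are made up for a dummy vocabulary of 4 words, and softmax_with_temperature is a name I picked for this illustration. It is not a function from the script.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.1]  # made-up logits for a 4-word vocabulary
for t in (0.1, 1.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

With the lowest temperature almost all the probability mass lands on the first word, while with the highest one the four probabilities end up nearly identical.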

temperature heavily interacts with two additional and equally important parameters: top_k and top_p. They both take the stage in the following line.

Line 63 invokes the top_k_top_p_filtering function, which, as the name suggests, filters logits applying two strategies (either one of the two, or both at the same time):

  1. top_k: sort words in the vocabulary by descending logit value and keep the top k as potential candidates for the next token. If top_k<=0 this filter is not applied – line 25.
  2. top_p (a.k.a. nucleus filtering): sort words in the vocabulary by descending logit, apply softmax to turn logits into probabilities and keep the smallest set of top N tokens whose cumulative probability reaches top_p. If we set top_p=0.9 and the network believes the most likely token has a 95% probability of occurring, then that will be the only token retained (i.e. N equals 1, as 95% > 90%). On the contrary, if the LM is less sure about its predictions, the top words will likely be associated with similar (low) probabilities, hence more than one needs to be kept to reach a cumulative probability of 90%. The idea here is that if the model is super confident about a specific token being next, then we just pick it. If, instead, the confidence level is low, it makes little sense to always select the most probable word, as the second or third tokens will show very similar likelihoods to the top one. If top_p<=0 this filter is not applied – line 30.
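The two filters above can be sketched in plain python on a list of logits. This is a simplified stand-in for top_k_top_p_filtering, not the function from the script, which works on tensors. Discarded tokens get a logit of minus infinity, so that softmax later assigns them zero probability, and I assume the token that pushes the cumulative probability up to top_p is itself kept.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


def top_k_top_p_filter(logits, top_k=0, top_p=0.0):
    """Mask out (set to -inf) every logit rejected by the top_k / top_p filters."""
    filtered = list(logits)
    # Token indices sorted by descending logit.
    order = sorted(range(len(filtered)), key=lambda i: filtered[i], reverse=True)
    if top_k > 0:
        for i in order[top_k:]:
            filtered[i] = -math.inf
    if top_p > 0.0:
        probs = softmax(filtered)
        cumulative, keep = 0.0, set()
        for i in order:
            keep.add(i)
            cumulative += probs[i]
            if cumulative >= top_p:  # nucleus is complete
                break
        for i in range(len(filtered)):
            if i not in keep:
                filtered[i] = -math.inf
    return filtered
```

Feeding it logits whose softmax is (0.95, 0.03, 0.02) with top_p=0.9 leaves only the first token alive, which mirrors the 95% example above.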

Note that the above two filters make sense only if temperature is not set to 0. If it is (temperature=0), as already mentioned above, the script performs greedy sampling, i.e. it always selects the top word by probability (line 64). Therefore, pre-filtering a list of tokens has no effect on the final result.

On the contrary, when temperature is different from 0, the inference logic consists of randomly picking a word from the list of tokens pre-selected by top_k_top_p_filtering (line 67). This seems odd. Why would we NOT choose the most probable word all the time, and pick one at random instead?

Jeremy Howard provides a great answer to this question here. He touches upon greedy sampling, beam search (a slightly more advanced technique than greedy), then top_k and nucleus sampling. As always, you could not ask for more clarity.

If you are still sticking around, here is my summary of the answer to the above question. The reason why it is a bad idea to stick to a greedy strategy is that human speech is rich, complex and, in most cases, rather unpredictable. Take a look at the following chart from this paper, the one proposing the concept of nucleus filtering (i.e. top_p). On the X-axis, we have the sequence of tokens in a text. On the Y-axis, the probability associated with each one of those tokens. The orange line shows the probability profile of a beam search sampling strategy, i.e. not strictly greedy but quite close, whereas the blue one reflects what a human-generated text looks like.


Figure 2


As the paper puts it:

“Figure 2 shows that the natural distribution of human text exhibits considerable fluctuations in the per-token perplexity while the distribution of machine text produced from maximum likelihood decoding leads to unnaturally flat and high per-token probability.”



The truth is that when we (humans) produce text, very often we end up employing unpredictable words, depending on out-of-scope contexts. A greedy strategy does not mimic this behavior of human speech at all. top_k and top_p sampling (partially) fix this issue by adding randomness to the process, randomly picking a token among the top ones.

Line 68 concatenates the token selected by the previous logic with the preceding text, and feeds the result into the inference loop again (generated = torch.cat((generated, next_token), dim=1)).
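To tie the whole walkthrough together, here is a toy re-implementation of the inference loop in plain python. It is only a sketch: toy_model is a made-up stand-in for GPT-2 that returns one logit per word of a 5-word vocabulary, samples are processed one at a time instead of as a batch, and filter_fn is a hypothetical hook where a top_k/top_p filter could be plugged in. The order of operations (scale by temperature, filter, then pick greedily or at random, then append) follows the description above.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(model, context, length, num_samples=1,
                    temperature=1.0, filter_fn=None, rng=random):
    # One copy of the prompt per requested sample.
    generated = [list(context) for _ in range(num_samples)]
    for _ in range(length):
        for seq in generated:
            logits = model(seq)
            # temperature == 0 means greedy decoding, so skip the division.
            scale = temperature if temperature > 0 else 1.0
            logits = [x / scale for x in logits]
            if filter_fn is not None:
                logits = filter_fn(logits)
            if temperature == 0:
                next_token = max(range(len(logits)), key=logits.__getitem__)
            else:
                weights = softmax(logits)
                next_token = rng.choices(range(len(logits)), weights=weights, k=1)[0]
            seq.append(next_token)  # feed the new token back in at the next step
    return generated


def toy_model(tokens, vocab_size=5):
    """Made-up model: prefers the token right after the last one (cyclically)."""
    favourite = (tokens[-1] + 1) % vocab_size
    return [3.0 if i == favourite else 0.0 for i in range(vocab_size)]
```

With temperature set to 0, sample_sequence(toy_model, [0], length=3) always walks through the vocabulary in order, whereas with temperature equal to 1 it wanders off every now and then.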


What is GPT-2, really?

By far the best resource to answer this question is Jay Alammar’s blog. This is the specific post diving into the OpenAI GPT-2 architecture. A truly outstanding essay. I won’t even try putting together anything close to that level of depth. I will just shamelessly steal a couple of visualizations from Jay’s post and provide a quick overview of the model.

GPT-2 is a stack of transformer-style decoders, each one composed of a self-attention and a feed-forward layer.

The idea originates from the Attention Is All You Need paper, the one introducing the concept of a transformer to address seq2seq problems such as machine translation. The original transformer is composed of a stack of encoders (NNs encoding the input) connected to a stack of decoders (NNs decoding the encoders’ intermediate output into the final result). Strikingly enough, the NNs I am referring to are not recurrent, but just plain feed-forward ones with an attention mechanism.


Transformer architecture (credit to


GPT-2 ditches the encoder stack and lets decoders do the entire job. So, the various GPT-2 models just differ in the number of decoder layers stacked on top of each other. In my app, I have used the smallest one.


Different available GPT-2 models (credit to


As mentioned before, each decoder is composed of a self-attention layer followed by a feed-forward neural network. Input tokens are looked up in an embedding matrix, then in a positional encoding matrix (we are not using a recurrent architecture, so we need a way to provide positional info to the model) and then passed bottom-up through the stack.


A GPT-2 architecture (credit to


As for the attention mechanism, in a nutshell it consists of defining weights to build context around words, for instance, to help the model capture the relationship between a noun and a pronoun referring to it later in the sentence. Again, Jay’s posts are the best way to deep-dive into that.
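For the curious, those weights can be illustrated with a bare-bones version of the scaled dot-product attention from the Attention Is All You Need paper. Take this as a heavily simplified sketch: it has a single head, it takes queries, keys and values as given (in the real model they are learned projections of the token representations), and the function name is mine. Each position only looks at itself and at the positions before it, as happens in decoder-style self-attention.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask."""
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)  # how much context each earlier word contributes
        out = [
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs
```

The first position can only attend to itself, so its output is simply its own value vector. Later positions receive a weighted average of the value vectors seen so far.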

That’s it! Have fun with NLP!

Bio: Francesco Pochetti is an AWS Machine Learning Hero and Senior Data Scientist at Mash, in Luxembourg. Before building default scoring models in fintech he used to optimize taxi rides for Bolt, in Estonia.

Original. Reposted with permission.