Intro to RNN: Character-Level Text Generation With PyTorch

Train and deploy a PyTorch model in Amazon SageMaker

Eduardo Muñoz
Better Programming

--

Photo by Clint Adair on Unsplash

Today, we’ll continue our journey through the fascinating world of natural language processing (NLP) by introducing the operation and use of recurrent neural networks to generate text from a small initial text. This type of problem is known as language modeling and is used when we want to predict the next word or character in an input sequence of words or characters.

But in language-modeling problems, it isn’t only the presence of words that matters but also their order — i.e., where they appear in the text sequence. In other words, the context surrounding each word becomes a fundamental piece of information for predicting the next one.

And in this scenario, the traditional NLP methods, based on word frequencies and probabilities, aren’t very effective because they rest on the premise that words are independent of one another.

Here is where RNNs become a fundamental tool: their ability to remember the different parts of a series of inputs means they can take the previous parts of a sentence into account to interpret context.

Brief Description of RNN

In summary, in a vanilla neural network, the output of a layer is a function or transformation of its input applying some learnable weights.

In contrast, an RNN takes into account not only the input but also the context, or previous state, of the network itself. As we progress in the forward pass through the network, it builds a representation of its state that aims to collect the information obtained in previous steps; this is called the hidden state.

Stanford CS230 Deep Learning course

“Here, for each timestep t, we have an activation a<t> and an output y<t>. And we have one set of weights to transform the input to a hidden-layer representation, a second set of weights to bring information from the previous hidden state into the next timestep, and a third one to control how much information from the actual state is transmitted to the output.”

[3] “Introduction to recurrent neural networks” by Jeremy Jordan

RNN operations by Stanford CS-230 Deep Learning course

Each element of the sequence contributes to the current state: the input and the previous hidden state update the value of the hidden state, for an arbitrarily long sequence of observations. RNNs can remember previous entries, but this capacity is restricted in time or steps — this was one of the first challenges to solve with these networks.
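
To make the recurrence concrete, here is a minimal, self-contained sketch (not the article’s model) of how the hidden state could be updated step by step in PyTorch, using the three sets of weights described above; all tensor names and sizes are purely illustrative.

import torch

# Illustrative sizes, not taken from the article
input_size, hidden_size, output_size = 10, 16, 10

W_xh = torch.randn(hidden_size, input_size) * 0.01   # input -> hidden
W_hh = torch.randn(hidden_size, hidden_size) * 0.01  # previous hidden -> hidden
W_hy = torch.randn(output_size, hidden_size) * 0.01  # hidden -> output

def rnn_step(x_t, h_prev):
    # a<t> = tanh(W_xh · x<t> + W_hh · a<t-1>)
    h_t = torch.tanh(W_xh @ x_t + W_hh @ h_prev)
    # y<t> = W_hy · a<t>
    y_t = W_hy @ h_t
    return h_t, y_t

# The same three weight matrices are reused at every timestep of the sequence
h = torch.zeros(hidden_size)
for x_t in torch.randn(5, input_size):  # a toy sequence of 5 steps
    h, y = rnn_step(x_t, h)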

“The longer the input series is, the more the network “forgets”. Irrelevant data is accumulated over time and it blocks out the relevant data needed for the network to make accurate predictions about the pattern of the text. This is referred to as the vanishing gradient problem.” — Wikipedia

You can dive deeper into that problem at this link. This is a common problem with very deep neural networks. In the field of NLP and RNNs, some advanced architectures have been developed to solve this problem, like LSTMs and GRUs.

Long Short-Term Memory (LSTM)

LSTM networks seek to preserve relevant information from much earlier steps, for which they contain multiple gates that control how much information to keep or delete from the input and the previous states:

From “Designing neural network based decoders for surface codes” by Savvas Varsamopoulos

W is the recurrent connection between the previous hidden layer and the current hidden layer. U is the weight matrix that connects the inputs to the hidden layer, and C is a candidate hidden state that’s computed based on the current input and the previous hidden state. C is the internal memory of the unit.

  • Forget gate: How much information from the past should be considered now?
  • Input gate + cell gate: Should we add information to the state from the input and how much?
  • Output gate: How much information from the cell state should we output at this step?
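
A minimal usage sketch of those gates, relying on PyTorch’s built-in nn.LSTMCell instead of writing the gate equations by hand (all sizes are made up for illustration):

import torch
import torch.nn as nn

# Toy sizes, purely illustrative
lstm_cell = nn.LSTMCell(input_size=8, hidden_size=16)

x_t = torch.randn(1, 8)      # current input
h_prev = torch.zeros(1, 16)  # previous short-term (hidden) state
c_prev = torch.zeros(1, 16)  # previous long-term (cell) state

# Internally, the cell applies the forget, input and output gates to decide
# what to drop from c_prev, what to add from x_t, and how much of the new
# cell state to expose as the new hidden state h_t.
h_t, c_t = lstm_cell(x_t, (h_prev, c_prev))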

“In a similar way, an LSTM works as follows:

• It keeps track not just of short term memory, but also of long term memory

• In every step of the sequence, the long and short term memory in the step get merged

• From this, we get a new long term memory, short term memory, and prediction”

Peter Foy, “An Introduction to Recurrent Neural Networks & LSTMs”

Create and Deploy an ML Model in Amazon SageMaker

First, we enumerate the steps in the general outline for SageMaker projects using a notebook instance (Amazon’s notebooks describe these steps):

  1. Download or otherwise retrieve the data.
  2. Process/prepare the data.
  3. Upload the processed data to S3.
  4. Train a chosen model.
  5. Test the trained model (typically using a batch transform job).
  6. Deploy the trained model.
  7. Use the deployed model.

For this project, you’ll be following the steps in the general outline with one modification: instead of a batch transform job, we’ll test the model using the deployed endpoint.

The source code is publicly available in my GitHub repository; this is the link to the full notebook. Here we’ll only show the most relevant sections.

Download and prepare the data set

Steps 1 and 2 aren’t specific to the SageMaker tool; they’re essentially the same regardless of the platform. So we’re not going to discuss them; we’ll just mention the source of our data set.

First, we’ll define the sentences that we want our model to output when fed with the first word or the first few characters. Our data set is a text file containing Shakespeare’s plays, from which we’ll extract sequences of characters to use as the input to our model. Then our model will learn how to complete sentences as Shakespeare would. This data set can be downloaded from Karpathy’s GitHub account.

Then, we only need to lowercase the text and create the corresponding dictionaries: char2int to transform the characters to integers and int2char for the reverse process.
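
A minimal sketch of that step (the file path and variable names are assumptions, not necessarily the ones used in the notebook):

# Load the raw text and lowercase it (hypothetical local path)
with open('data/input.txt', 'r') as f:
    text = f.read().lower()

# Build the two dictionaries from the set of unique characters
chars = sorted(set(text))
int2char = {i: ch for i, ch in enumerate(chars)}
char2int = {ch: i for i, ch in int2char.items()}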

Encode the text and create the input and target data sets

Now we can encode our text, replacing every character with its integer value from the dictionary. When we have our data set unified and prepared, we should do a quick check to see an example of the data our model will be trained on. This is generally a good idea, as it allows you to see how each of the further processing steps affects the text, and it also ensures that the data has been loaded correctly.

As we’re going to predict the next character in the sequence at each time step, we’ll have to divide each sentence into:

  • Input data: The last input character should be excluded, as it doesn’t need to be fed into the model (it’s the target label for the last input character)
  • Target/ground-truth label: This is one timestep ahead of the input data, as this will be the correct answer for the model at each timestep corresponding to the input data.
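
A simple sketch of how the encoded text could be split into inputs and one-step-shifted targets (the window size and variable names are assumptions; text and char2int come from the preprocessing step above):

import numpy as np

# Encode every character with its integer id from char2int
encoded = np.array([char2int[ch] for ch in text])

seq_length = 100  # assumed window size
inputs, targets = [], []
for i in range(len(encoded) - seq_length):
    window = encoded[i:i + seq_length + 1]
    inputs.append(window[:-1])   # everything except the last character
    targets.append(window[1:])   # the same window shifted one step ahead

inputs, targets = np.array(inputs), np.array(targets)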

Upload the Data to Amazon S3

Now, we’ll need to upload the training dataset to S3 in order for our training code to access it. In fact, we’ll save it locally, and it’ll be uploaded to S3 later on when running the training.
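
A sketch of that upload with the SageMaker Python SDK (the S3 key prefix is an assumption):

import sagemaker

sagemaker_session = sagemaker.Session()

# Upload the whole local data/ directory to the session's default bucket
input_data = sagemaker_session.upload_data(
    path='data',                      # local directory saved earlier
    key_prefix='sagemaker/char-rnn'   # hypothetical S3 prefix
)
print(input_data)  # s3://<default-bucket>/sagemaker/char-rnn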

Note: The cell above uploads the entire contents of our data directory. This includes the char_dict.pkl (char2int) and int_dict.pkl (int2char) files. This is fortunate, as we’ll need them later on when we create an endpoint that accepts an arbitrary input text. For now, we’ll just take note of the fact that they reside in the data directory (and so also in the S3 training bucket) and that we’ll need to make sure they get saved in the model directory.

From “Infinitely scalable machine learning with Amazon SageMaker” by Werner Vogels

Build and Train the PyTorch Model

A model in the SageMaker framework comprises three objects:

  • Model artifacts
  • Training code
  • Inference code

Each of these interacts with the others.

We’ll start by implementing our own neural network in PyTorch along with a training script. For the purposes of this project, we need to provide the model object implementation in the model.py file, inside of the train folder.

The model is very simple with just a couple of layers:

  • An LSTM layer, acting as an encoder
  • A dropout layer to reduce overfitting
  • A decoder: a fully connected (dense) layer that returns the probability of each character being the next one
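
A hedged sketch of what such a model could look like in model.py (hyperparameter names and defaults are assumptions; the actual file in the repository may differ):

import torch.nn as nn

class CharRNN(nn.Module):
    """Character-level model: LSTM encoder -> dropout -> linear decoder."""

    def __init__(self, vocab_size, hidden_dim=256, n_layers=2, drop_prob=0.5):
        super().__init__()
        self.lstm = nn.LSTM(vocab_size, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        self.dropout = nn.Dropout(drop_prob)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        # x: one-hot encoded batch of shape (batch, seq_len, vocab_size)
        out, hidden = self.lstm(x, hidden)
        out = self.dropout(out)
        # Flatten the outputs so the linear layer scores every timestep
        out = self.fc(out.contiguous().view(-1, out.size(2)))
        return out, hidden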

Train the Model on SageMaker

When a PyTorch model is constructed in SageMaker, an entry point must be specified. This is the Python file that’ll be executed when the model is trained. Inside of the train directory is a file called train.py that contains most of the necessary code to train our model.

Note: The train_main() function must be pasted into the train/train.py file where required.

The way that SageMaker passes hyperparameters to the training script is by arguments. These arguments can then be parsed and used in the training script. To see how this is done, take a look at the provided train/train.py file.
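
The pattern usually looks something like this (the specific hyperparameter names below are illustrative, not necessarily the ones defined in train/train.py):

import argparse
import os

parser = argparse.ArgumentParser()

# Hyperparameters arrive as command-line arguments
parser.add_argument('--epochs', type=int, default=50)
parser.add_argument('--batch-size', type=int, default=128)
parser.add_argument('--hidden-dim', type=int, default=256)

# Directories that SageMaker exposes through environment variables
parser.add_argument('--model-dir', type=str, default=os.environ.get('SM_MODEL_DIR'))
parser.add_argument('--data-dir', type=str, default=os.environ.get('SM_CHANNEL_TRAINING'))

args = parser.parse_args()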

In summary, the main function in the train.py file executes the steps:

  • Load the datasets
  • Create the batch data generator
  • Create or restore the model from a previous execution
  • Train and evaluate the model
  • Save the model and dictionaries for inference

Main train algo
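
A simplified sketch of what that training loop could look like (the loss function, optimizer, and saving details are assumptions; see train/train.py in the repository for the actual code):

import torch
import torch.nn as nn

def train_main(model, train_loader, epochs, lr, model_dir, device):
    # train_loader is assumed to yield one-hot encoded inputs and integer targets
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    model.to(device)
    model.train()

    for epoch in range(epochs):
        hidden = None
        for inputs, targets in train_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            optimizer.zero_grad()
            output, hidden = model(inputs, hidden)
            # Detach the hidden state so gradients don't flow across batches
            hidden = tuple(h.detach() for h in hidden)
            loss = criterion(output, targets.reshape(-1))
            loss.backward()
            optimizer.step()

    # Persist the trained weights so they can be loaded at inference time
    torch.save(model.state_dict(), f'{model_dir}/model.pth')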

Once we have our train.py file, we’re ready to create a training job in SageMaker. First, we need to set which type of instance will run our training:

  • Local: We don’t launch a real compute instance, just a container where our scripts will run. This scenario is very useful for testing that the training script works, because it’s faster to run a container than a compute instance. But once we confirm that everything is working, we must change the instance type to a real training instance.
  • ml.m4.4xlarge: This is a CPU instance.
  • ml.p2.xlarge: A GPU instance to use when training on a big volume of data.
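
Putting it together, the training job could be launched roughly like this (the framework version and hyperparameters are assumptions, and some argument names changed in later versions of the SageMaker SDK, e.g. train_instance_type became instance_type):

from sagemaker import get_execution_role
from sagemaker.pytorch import PyTorch

role = get_execution_role()

estimator = PyTorch(
    entry_point='train.py',
    source_dir='train',
    role=role,
    framework_version='1.1.0',          # assumed PyTorch version
    train_instance_count=1,
    train_instance_type='ml.p2.xlarge', # or 'local' while debugging
    hyperparameters={'epochs': 50, 'hidden-dim': 256},
)

# input_data is the S3 URI returned by upload_data() earlier
estimator.fit({'training': input_data})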

At this point, SageMaker launches a compute instance where our training code is executed. This usually takes hours or days depending on the data and model complexity (in our case it takes about 45-60 minutes). You can follow the training progress on Amazon CloudWatch through whatever the script prints out. At the end, the model artifacts are stored in S3, and they’ll be loaded during the deployment step.

Photo by Andrew Buchanan on Unsplash

Define the Inference Algorithm

Now it’s time to create some custom inference code so we can send the model an initial string that hasn’t been processed and have it generate the characters that follow.

By default, the estimator we created, when deployed, will use the entry script and directory that we provided when creating the model. However, since we wish to accept a string as our input and our model expects a processed text, we need to write some custom inference code.

We’ll store the code for inference in the serve directory. Provided in this directory is the model.py file that we used to construct our model, a utils.py file that contains the one-hot-encode and encode_text preprocessing functions we used during the initial data processing, and predict.py, the file that’ll contain our custom inference code. Note also that requirements.txt is present, which will tell SageMaker what Python libraries are required by our custom inference code.

When deploying a PyTorch model in SageMaker, you’re expected to provide four functions that the SageMaker inference container will use.

  • model_fn: This function is the same function that we used in the training script, and it tells SageMaker how to load our model. This function must be called model_fn() and takes as its only parameter a path to the directory where the model artifacts are stored. This function must also be present in the Python file which we specified as the entry point. It also reads the saved dictionaries because they should be used during the inference process.
  • input_fn: This function receives the raw serialized input that has been sent to the model's endpoint, and its job is to deserialize and make the input available for the inference code. Later, we’ll mention what our input_fn function is doing.
  • output_fn: This function takes the output of the inference code, and its job is to serialize this output and return it to the caller of the model's endpoint.
  • predict_fn: The heart of the inference script, this is where the actual prediction is done and is the function that you’ll need to complete.

An extensive explanation can be found in the Amazon documentation.

For the simple example we’re constructing during this project, the input_fn and output_fn methods are relatively straightforward. We’re required to accept a string as input, composed of the desired length of the output and the initial string. And we expect to return a single string as output: the newly generated text. You might imagine, though, that in a more complex application the input or output may be image data or some other binary data that’d require some effort to serialize.
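
For reference, a minimal sketch of what those two functions could look like (the content type and the “length-text” payload format are assumptions based on how the endpoint is called later):

def input_fn(serialized_input_data, content_type):
    # The endpoint is called with a plain-text payload: "<length>-<initial string>"
    if content_type == 'text/plain':
        data = serialized_input_data
        return data.decode('utf-8') if isinstance(data, bytes) else data
    raise ValueError('Unsupported content type: ' + content_type)

def output_fn(prediction_output, accept):
    # The generated text is returned to the caller as a plain string
    return str(prediction_output)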

Finally, we must build a predict_fn method that’ll receive the input string, encode it (char2int), one-hot encode it, and send it to the model. Every output will be decoded (int2char) and appended to the final output string.

Make sure you save the completed file as predict.py in the serve directory.

In short, the inference process consists of processing and encoding the input string, initializing the state of the model, executing a forward pass of the model for each character, and updating the state of the model. The output of each iteration is the probability of each character being the next one. We sample from those probabilities to extract the next character, which we append to the output text string.
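
A simplified sketch of that predict_fn (it assumes model_fn returns the model together with the two dictionaries, and that one_hot_encode comes from utils.py with this signature; the real file in the serve directory is more complete):

import numpy as np
import torch
import torch.nn.functional as F

def predict_fn(input_data, model_info):
    # model_info is assumed to be {'model': ..., 'char2int': ..., 'int2char': ...}
    model, char2int, int2char = model_info['model'], model_info['char2int'], model_info['int2char']
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device).eval()

    out_len, text = input_data.split('-', 1)   # "<length>-<initial string>"
    chars = list(text.lower())
    hidden = None

    def step(ch, hidden):
        x = one_hot_encode(np.array([[char2int[ch]]]), len(char2int))  # utils.py helper (assumed)
        return model(torch.from_numpy(x).float().to(device), hidden)

    with torch.no_grad():
        # First feed the initial string through the network to build up its state
        for ch in chars:
            output, hidden = step(ch, hidden)
        # Then sample one character at a time and feed it back in
        for _ in range(int(out_len) - len(chars)):
            probs = F.softmax(output, dim=1).cpu().numpy().squeeze()
            chars.append(int2char[np.random.choice(len(int2char), p=probs)])
            output, hidden = step(chars[-1], hidden)

    return ''.join(chars)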

Deploy the Model for Inference

Now that the custom inference code has been written, we’ll create and deploy our model. To begin with, we need to construct a new PyTorchModel object pointing to the model artifacts created during training and also pointing to the inference code we wish to use. Then we can call the deploy method to launch the deployment container.

Note: The default behavior for a deployed PyTorch model is to assume that any input passed to the predictor is a numpy array. In our case, we want to send a string, so we need to construct a simple wrapper around the RealTimePredictor class to accommodate simple strings. In a more complicated situation, you may want to provide a serialization object, for example if you wanted to send image data.

Now, we can deploy our trained model:
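
A sketch of that wrapper and the deployment call (class names follow version 1.x of the SageMaker SDK; the instance type is an assumption):

from sagemaker.predictor import RealTimePredictor
from sagemaker.pytorch import PyTorchModel

class StringPredictor(RealTimePredictor):
    # Send and receive plain text instead of the default numpy payload
    def __init__(self, endpoint_name, sagemaker_session):
        super().__init__(endpoint_name, sagemaker_session, content_type='text/plain')

model = PyTorchModel(
    model_data=estimator.model_data,   # S3 artifacts from the training job above
    role=role,
    entry_point='predict.py',
    source_dir='serve',
    framework_version='1.1.0',         # assumed, should match training
    predictor_cls=StringPredictor,
)

predictor = model.deploy(initial_instance_count=1, instance_type='ml.m4.xlarge')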

Note: When deploying a model, you’re asking SageMaker to launch a compute instance that’ll wait for data to be sent to it. As a result, this compute instance will continue to run until you shut it down. This is important to know since the cost of a deployed endpoint depends on how long it has been running for.

In other words, if you are no longer using a deployed endpoint, shut it down!

And the time for testing our model has arrived — it’s so simple:

init_text = sentences[963:1148]
test_text = str(len(init_text))+'-'+init_text
new_text = predictor.predict(test_text).decode('utf-8')
print(new_text)
Text: he did content to say it was for his country he did it to please his mother and to be partly proud; which he is, even till the altitude of his virtue. what he cannot help in his nature,
Init text: he did content to say it was for his country he did it to
Text predicted: he did content to say it was for his country he did it to please his mother and to be partly proud which he is even till the altitude of his virtue what he cannot help in his nature of

As we can observe, the predicted text is practically the same as the original text, which means that our network is able to generate the text that it has received in its training stage — its memory is working fine!

Finally, when the service isn’t going to be consumed anymore, you must shut it down.

predictor.delete_endpoint()


A data scientist and machine learning practitioner involved in NLP tasks and advances. Experienced project management lead. Learning every day.