Deploy ML Models with AWS Lambda and Ephemeral Storage
So you are a machine learning engineer and want a simple and potentially scalable way to deploy your large machine learning model? In this post, I will present a relatively straightforward solution that leverages Lambda’s recently introduced configurable ephemeral storage.
Rest assured that you won’t have to navigate the AWS console and manually click all the resources into existence. Instead, we will use both the AWS command line interface (CLI) and a modicum of infrastructure-as-code via the AWS Cloud Development Kit (CDK). The latter makes it possible to define our services and their relationships without manually crafting complex CloudFormation templates.
The infrastructure looks pretty modest: an S3 bucket holding the model, a container image in ECR, and a Lambda function that exposes the model through a Function URL.
Historically, Lambda hasn’t been the go-to option for ML-Ops due to its limited storage for large models. However, this particular constraint has since been alleviated, as Lambda now allows up to 10 GB of ephemeral storage mounted at /tmp, which we can leverage for downloading and caching our model. Moreover, we can easily expose the Lambda by attaching a Function URL that acts as an API.
The deployment comprises the following steps:
- ML Model: Create a machine learning model and upload it to an S3 bucket
- Lambda Function: Create the Lambda function code for inference
- Docker Image: Dockerize the inference code and upload the image to the Elastic Container Registry (ECR)
- Infrastructure: Create resources with CDK
Let’s get hands-on with a deep learning toy example. We will build and deploy a named entity recognizer (NER) using Flair. Flair provides a simple interface for using state-of-the-art models from Huggingface to solve a wide range of NLP tasks.
Prerequisites
You need to have the following tools and frameworks installed:
- Python ≥ 3.8
- Docker
- AWS account with a 12-digit ID (e.g. 012345678901)
- AWS CLI, the command-line interface for AWS services
- AWS CDK CLI, the command-line interface for the AWS cloud development kit (CDK)
Setting up the Environment
Clone the repository, and install the dependencies:
git clone https://github.com/as-ideas/deep-lambda.git
cd deep-lambda
pip install -r requirements.txt
The repository structure is as follows:
deep-lambda/
|-- app.py
|-- tagger.py
|-- infrastructure/
|-- ...
We only need two Python files: tagger.py, which contains the deep learning code, and app.py, which defines the AWS Lambda function. The CDK code for our infrastructure is defined in infrastructure/lambda_stack.py.
1. ML Model
First off, let's write a simple NER tagger using Flair:
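The actual script is part of the repository; below is a minimal sketch of what tagger.py could look like (the exact model identifier and print formatting are assumptions):
# tagger.py (sketch): load a pre-trained NER model via Flair, run it once, save it to /tmp
from flair.data import Sentence
from flair.models import SequenceTagger

# Download a pre-trained 4-class English NER model from Huggingface
tagger = SequenceTagger.load("flair/ner-english")

# Tag an example sentence and print the detected entity spans
sentence = Sentence("George Washington went to Washington.")
tagger.predict(sentence)
for span in sentence.get_spans("ner"):
    print(span)

# Save the model so it can be uploaded to S3 afterwards
tagger.save("/tmp/my_ner_tagger.pt")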
Running the above code downloads a pre-trained NER model from Huggingface, saves it to /tmp/my_ner_tagger.pt, and outputs the following result:
python tagger.py
Span [1,2]: "George Washington" [− Labels: PER (0.9985)]
Span [5]: "Washington" [− Labels: LOC (0.9706)]
Simple enough. From now on let’s pretend that the saved model is some customized NER model that we want to deploy (here you can read how to fine-tune a pre-trained model with Flair).
In order to deploy our model we need it to reside within the AWS cloud. Therefore, we upload the model to an S3 bucket via the command line interface (replace the bucket name and region with your own; bucket names are globally unique):
aws s3api create-bucket --bucket deep-lambda --region eu-central-1 --create-bucket-configuration LocationConstraint=eu-central-1
aws s3 cp /tmp/my_ner_tagger.pt s3://deep-lambda/ --region eu-central-1
2. Lambda Function
To run our code as an AWS Lambda function, we have to define a Lambda handler that serves as the entry point for the function. The handler is responsible for receiving an event and generating a corresponding response.
The handler invokes our tagging logic, which requires the model to be loaded first. We solve this by downloading and caching the model at the module level, leveraging Lambda’s ephemeral storage on /tmp:
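The handler code lives in app.py in the repository; here is a minimal sketch of the idea (bucket name, object key and response format are assumptions):
# app.py (sketch): download the model once per execution environment, then serve requests
import os
import boto3
from flair.data import Sentence
from flair.models import SequenceTagger

MODEL_PATH = "/tmp/my_ner_tagger.pt"

# Module-level code runs only at cold start, so the downloaded model is cached
# on the ephemeral storage for all subsequent (warm) invocations.
if not os.path.exists(MODEL_PATH):
    boto3.client("s3").download_file("deep-lambda", "my_ner_tagger.pt", MODEL_PATH)

tagger = SequenceTagger.load(MODEL_PATH)

def lambda_handler(event, context):
    # The Function URL passes the raw POST payload in event["body"]
    sentence = Sentence(event.get("body", ""))
    tagger.predict(sentence)
    return {
        "statusCode": 200,
        "body": "\n".join(str(span) for span in sentence.get_spans("ner")),
    }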
Note that the model is downloaded to /tmp, which by default is mounted to the ephemeral storage, so it is possible to consume models as large as several GB.
3. Docker Image
Our Lambda will need to access some dependencies (e.g. Flair) that we will bake into a docker image.
A simple Dockerfile can be built from a Python lambda base image:
FROM public.ecr.aws/lambda/python:3.8 as base
FROM base
COPY requirements.txt .
RUN pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
COPY app.py "${LAMBDA_TASK_ROOT}"
COPY tagger.py "${LAMBDA_TASK_ROOT}"
ENV PYTHONPATH="${LAMBDA_TASK_ROOT}"
CMD ["app.lambda_handler"]
That’s pretty standard. We just need to make sure that all files are copied to the Lambda-native location LAMBDA_TASK_ROOT, which is the working directory of the function and usually resolves to /var/task inside the image.
Let’s build and tag the docker image (replace the 12-digit code with your own AWS ID):
docker build -t deep-lambda .
docker tag deep-lambda:latest 012345678901.dkr.ecr.eu-central-1.amazonaws.com/deep-lambda:latest
We now upload the image to the elastic container registry (ECR) via the command line interface:
aws ecr create-repository --repository-name deep-lambda
aws ecr get-login-password --region eu-central-1 | docker login --username AWS --password-stdin 012345678901.dkr.ecr.eu-central-1.amazonaws.com
docker push 012345678901.dkr.ecr.eu-central-1.amazonaws.com/deep-lambda:latest
That’s it. We successfully uploaded both our model and code to the AWS ecosystem. Now it’s time to provide the cloud infrastructure via CDK!
4. Infrastructure
Let’s write a simple CDK stack that defines the lambda function with access to the model bucket:
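The full stack is defined in infrastructure/lambda_stack.py; the following sketch illustrates the idea (resource names, memory size and timeout are assumptions):
# lambda_stack.py (sketch): Docker-based Lambda with ephemeral storage and a Function URL,
# instantiated in the CDK app as DeepLambdaStack(app, "deep-lambda-stack")
from aws_cdk import Duration, Size, Stack
from aws_cdk import aws_ecr as ecr
from aws_cdk import aws_lambda as _lambda
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DeepLambdaStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        repository = ecr.Repository.from_repository_name(self, "Repo", "deep-lambda")
        model_bucket = s3.Bucket.from_bucket_name(self, "ModelBucket", "deep-lambda")

        fn = _lambda.DockerImageFunction(
            self, "DeepLambda",
            code=_lambda.DockerImageCode.from_ecr(repository),  # defaults to the "latest" tag
            memory_size=4096,
            ephemeral_storage_size=Size.gibibytes(10),
            timeout=Duration.minutes(5),
            # Let the Transformers library cache tokenizers on the ephemeral storage
            environment={"PYTORCH_TRANSFORMERS_CACHE": "/tmp"},
        )

        # The function needs read access to download the model from the bucket
        model_bucket.grant_read(fn)

        # Expose the function via a public Function URL
        fn.add_function_url(auth_type=_lambda.FunctionUrlAuthType.NONE)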
It is important to provide enough memory_size to the Lambda as well as a large enough ephemeral_storage_size. Moreover, we need to point the PYTORCH_TRANSFORMERS_CACHE directory to /tmp so that the Transformers library caches the model tokenizers on the ephemeral storage.
Now we are ready to deploy our function:
cd infrastructure
pip install -r requirements.txt
cdk synth
cdk deploy deep-lambda-stack
That’s it: the Lambda should be up and running. Let’s quickly check it out in the AWS console:

On the bottom right you can find the Function URL. Let’s use it to invoke our Lambda with a simple request via curl:
curl -X POST -H "Content-Type: text/plain" -d "I went to New York last easter." https://rrpj3itxliq4754rbwscjg7p3i0geucq.lambda-url.eu-central-1.on.aws/
Span[3:5]: "New York" → LOC (0.9988)
Great! You can now fire more requests to the endpoint. It is important to keep in mind that the initial request may take a bit longer due to the cold start process, as the lambda initializes and retrieves the model from the bucket. However, subsequent requests should execute swiftly.
Deploying a new model is as simple as replacing the model file in the S3 bucket and forcing the Lambda to re-initialize (which can be done via the console or CLI), since the model is only downloaded when a fresh execution environment starts. Updating the code requires pushing a new image to the ECR using the AWS commands from earlier and then deploying the updated image to the Lambda function.
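For illustration, the model swap can also be scripted with boto3 (the function name and object key are assumptions; a no-op configuration update forces Lambda to create fresh execution environments):
# Sketch: replace the model in S3 and force the Lambda to reload it on the next cold start
import boto3

s3 = boto3.client("s3")
s3.upload_file("/tmp/my_new_ner_tagger.pt", "deep-lambda", "my_ner_tagger.pt")

# Updating the configuration makes Lambda spin up new execution environments,
# so the next invocation downloads the fresh model from the bucket.
boto3.client("lambda").update_function_configuration(
    FunctionName="deep-lambda", Description="model refresh"
)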
Limitations and Making the Solution Scalable
The presented solution is quite useful if you want to quickly deploy your model for a showcase or testing, but it is probably not robust enough to be used in a production system. I will address the biggest limitations below:
Limitation: Manual deployment, Solution: Add CI/CD Pipeline
Consider adding an AWS Codepipeline that automatically triggers upon code changes, pushes new images to the ECR, and redeploys your Lambda function.
Limitation: Security, Solution: Add API Gateway
If you need to control your API exposure, e.g. throttle the maximum number of requests or block certain IP addresses, you may want to put an API Gateway in front of your Lambda.
Limitation: Scalability, Solution: Add Queues
If you want to process large amounts of data, it is not advisable to keep long-running requests open against the Lambda function. A more robust solution would be to connect the function to a message queue (SQS) and store the results in another queue or an S3 bucket. This approach is highly scalable and straightforward to monitor, making it a good choice for many ML use cases.
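As an illustration (not part of the repository), a queue-driven handler could look roughly like this, assuming the model is already cached on /tmp as in app.py and a result bucket name is passed via a hypothetical environment variable:
# Sketch of a queue-driven handler: read texts from SQS records, write results to S3
import json
import os
import boto3
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("/tmp/my_ner_tagger.pt")  # cached at module level as before
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Each SQS record carries one text to tag; results land in a bucket
    # instead of being returned to a waiting HTTP client.
    for record in event["Records"]:
        sentence = Sentence(record["body"])
        tagger.predict(sentence)
        spans = [str(span) for span in sentence.get_spans("ner")]
        s3.put_object(
            Bucket=os.environ["RESULT_BUCKET"],  # hypothetical environment variable
            Key="results/" + record["messageId"] + ".json",
            Body=json.dumps(spans),
        )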