Private LLMs on Your Local Machine and in the Cloud With LangChain, GPT4All, and Cerebrium

Sami Maameri
Published in Better Programming · May 29, 2023

Photo by Thomas Dewey on Unsplash

The idea of private LLMs resonates with us for sure. The appeal is that we can query and pass information to LLMs without our data or responses going through third parties—safe, secure, and total control of our data. Operating our own LLMs could have cost benefits as well.

Fortunately for us, there is a lot of activity in the world of training open source LLMs for people to use. Some well-known examples include Meta’s LLaMA series, EleutherAI’s Pythia series, Berkeley AI Research’s OpenLLaMA model, and MosaicML’s MPT models.

To give one example of the idea’s popularity, a GitHub repo called PrivateGPT, which lets you query your documents locally using an LLM, has over 24K stars.

The space is buzzing with activity, for sure. And there is a definite appeal for businesses who would like to process the masses of data without having to move it all through a third party.

All of this is good news for developers who like to build stuff! There are a lot of open source LLMs we can use to create private chatbots.

GPT4All is one of these popular open source LLMs. It’s also fully licensed for commercial use, so you can integrate it into a commercial product without worries. This is unlike other models, such as those based on Meta’s Llama, which are restricted to non-commercial, research use only.

In this article, we will use GPT4All and LangChain to create a chatbot on our local machine, then deploy a private GPT4All model to the cloud with Cerebrium, and finally interact with it from our application, again using LangChain.

But first, let’s learn a little more about GPT4All, and instruction tuning, one of the things that makes it such a great chatbot-style model.

Contents

  • GPT4All and instruction tuning
  • Using GPT4All’s chatbot UI application locally
  • Interacting with GPT4All locally using LangChain
  • Interacting with GPT4All in the cloud using LangChain and Cerebrium

GPT4All

A free-to-use, locally running, privacy-aware chatbot. No GPU or internet required.

That’s what the GPT4All website starts with. Pretty cool, right? It goes on to mention the following:

GPT4All is an ecosystem to train and deploy powerful and customized large language models that run locally on consumer grade CPUs.

Great, this means we can use it on our computers and expect it to work at a reasonable speed. No GPUs needed. Score! The base model is only around 3.5 GB, so again something we can work with on normal computers. So far, so good.

The goal is simple — be the best instruction tuned assistant-style language model that any person or enterprise can freely use, distribute and build on.

It has a commercially friendly license, which means you can even make money from it, which is pretty cool. Not all open source LLMs share that kind of license, so you can build a product on top of this one without worrying about licensing issues.

It mentions it wants to be the “best instruction-tuned assistant-style” language model. If you are anything like me, you are wondering what that means. What is instruction tuning? Let’s dig in a little.

Instruction Tuning

Large language models (LLMs) are trained on large textual datasets. They are mostly trained so that, given a string of text, they can statistically predict the next sequence of words. That is very different from being trained to respond well to user questions (“assistant style”).

However, after training on a sufficiently large set of data, LLMs start to form abilities, such as producing much more sophisticated responses, that would not be predicted from their performance on smaller datasets. These are known as emergent abilities, and they enable some large LLMs to act as very convincing conversationalists.

So, the idea is that if we keep growing the size of the data set that these models are trained on, we should start to get better and better chatbot-style capabilities over time.

It was found, however, that making language models bigger does not inherently make them better at following a user’s intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users' intent to provide useful answers to questions.

In 2022, another way of creating well-performing chatbot-style LLMs was popularised: fine-tuning a model on a set of question-and-answer style prompts, similar to how users would actually interact with it. Using this method, we can take a base model trained on a much smaller amount of data, fine-tune it with question-and-answer, instruction-style data, and get performance that is on a par with, or sometimes even better than, a model trained on massive amounts of data.

Let’s look at the GPT4All model as a concrete example to try and make this a bit clearer.

If we check out the GPT4All-J-v1.0 model on Hugging Face, it mentions it has been fine-tuned from GPT-J. GPT-J is a model from EleutherAI with six billion parameters, which is tiny compared to ChatGPT’s 175 billion.

Let’s look at the types of data that GPT-J and GPT4All-J are trained on and compare their differences.

As mentioned on its Hugging Face page, GPT-J was trained on the Pile, an 825 GB dataset, again from EleutherAI.

If we look at a dataset preview, it is essentially just chunks of information that the model is trained on. Based on this training, it can guess the next words in a text string using statistical methods. However, that alone does not give it great Q&A-style abilities.

GPT-J dataset

Now, if we look at the dataset that GPT4All was trained on, we see it is a much more question-and-answer format. The total size of the GPT4All dataset is under 1 GB, which is much smaller than the initial 825 GB the base GPT-J model was trained on.

The GPT4All dataset uses question-and-answer style data
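
To make the contrast concrete, here is a rough sketch of what question-and-answer style training records tend to look like. The examples below are made up for illustration; they are not actual rows from the GPT4All dataset.

# Hypothetical instruction-tuning records (illustrative only, not real GPT4All data).
# Each record pairs an instruction-style prompt with the desired response,
# and the base model is fine-tuned on text rendered from these pairs.
instruction_examples = [
    {
        "prompt": "Where is Paris?",
        "response": "Paris is the capital of France, located in Western Europe.",
    },
    {
        "prompt": "Summarise the plot of Romeo and Juliet in one sentence.",
        "response": "Two young lovers from feuding families secretly marry, and the fallout ends in tragedy.",
    },
]

# During fine-tuning, each pair is typically rendered into a single training
# string using a fixed template, for example:
template = "### Human:\n{prompt}\n### Assistant:\n{response}"
for example in instruction_examples:
    print(template.format(**example))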

So GPT-J is being used as the pretrained model. We are fine-tuning that model with a set of Q&A-style prompts (instruction tuning) using a much smaller dataset than the initial one, and the outcome, GPT4All, is a much more capable Q&A-style chatbot.

The training of GPT4All from GPT-J

To really shine a light on the performance improvements, we can visit the GPT-J page and read some of the information and warnings on the limitations of the model. Here’s an example:

Out-of-scope use

GPT-J-6B is not intended for deployment without fine-tuning, supervision, and/or moderation. It is not in itself a product and cannot be used for human-facing interactions. For example, the model may generate harmful or offensive text. Please evaluate the risks associated with your particular use case.

GPT-J-6B has not been fine-tuned for downstream contexts in which language models are commonly deployed, such as writing genre prose or commercial chatbots. This means GPT-J-6B will not respond to a given prompt the way a product like ChatGPT does.

Limitations and Biases

The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work.

So, from a base model that was never designed to work well as a chatbot or question-and-answer model, we fine-tune it with a relatively small set of question-and-answer style prompts, and it suddenly becomes a much more capable chatbot.

And that is the power of instruction fine tuning.

Getting Started With the GPT4All Chatbot UI Locally

GPT4All is pretty easy to get set up. To start with, you don’t even need any coding. They have a pretty nice website where you can download their UI application for Mac, Windows, or Ubuntu.

Once you download the application and open it, it will ask you to select which LLM model you would like to download. They have different model variations with varying capability levels and features. You can read the features of each model in the description.

ggml-gpt4all-j-v1.3-groovy.bin

Current best commercially licensable model based on GPT-J and trained by Nomic AI on the latest curated GPT4All dataset.

Let’s try out the one which is commercially licensable. It’s 3.53 GB, so it will take some time to download. The idea here with this UI application is that you have different types of models you can work with. The application is a chat application that can interact with different types of models.

Once downloaded, you can start interacting.

Out of the box, the ggml-gpt4all-j-v1.3-groovy model responds strangely, giving very abrupt, one-word-type answers. I had to update the prompt template to get it to work better. Even on an instruction-tuned LLM, you still need good prompt templates for it to work well 😄.

To update the prompt, click on the gear icon in the top right, then update the Prompt Template in the Generation tab. This is what I set it to, to start getting some decent results.

You are a friendly chatbot assistant. Reply in a friendly and conversational
style. Don't make the answers too long unless specifically asked to elaborate
on the question.
### Human:
%1
### Assistant:

Once that was done, the chat started looking better.

Overall, it works pretty well, as far as I could tell. Nice interface. The speed was not bad either. It’s pretty cool: you are interacting with a local LLM, all on your computer, and the exchange of data is totally private. My computer is an Intel Mac with 32 GB of RAM, and the speed was pretty decent, though my computer fans were definitely going into high-speed mode 🙂.

Still, running an LLM on a normal consumer-grade CPU with no GPUs involved is pretty cool.

Building Locally With LangChain and GPT4All

We are hackers, though, right? We don’t want ready-made UIs! We want to build it ourselves! LangChain to the rescue! :)

LangChain really has the ability to interact with many different sources; it is quite impressive. They have a GPT4All class we can use to interact with the GPT4All model easily.

If you want to download the project source code directly, you can clone it using the command below instead of following the steps that come next. Make sure to follow the readme to get your Cerebrium API key set up properly in the .env file.

git clone https://github.com/smaameri/private-llm.git

So, to get started, let’s set up our project directory, files, and virtual environment. We will also create a /models directory to store our LLM models in.

mkdir private-llm
cd private-llm
touch local-llm.py
mkdir models
# let's create a virtual environment to install all packages locally only
python3 -m venv .venv
. .venv/bin/activate

Now, we want to add our GPT4All model file to the models directory we created so that we can use it in our script. Copy the model file from where you downloaded it when setting up the GPT4All UI application into the models directory of our project. If you did not set up the UI application, you can still go to the GPT4All website and directly download just the model.

Again, make sure to store the downloaded model inside the models directory of our project folder.

Now, let's start coding!

The script to get it running locally is actually very simple. Install the following dependencies:

pip install langchain gpt4all

Add the below code to local-llm.py. Notice that when setting up the GPT4All class, we are pointing it to the location of our stored model. And that’s it. We can start interacting with the LLM in just three lines of code!

from langchain.llms import GPT4All

llm = GPT4All(model='./models/ggml-gpt4all-j-v1.3-groovy.bin')

llm("A red apple is ")

Now, let's run the script and see the output.

python3 local-llm.py

At first, we see it load up the LLM from our model file, and then it prints its response to our prompt.

(.venv) ➜  private-llm-test python3 local-llm.py
Found model file.
gptj_model_load: loading model from './models/ggml-gpt4all-j-v1.3-groovy.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx = 2048
gptj_model_load: n_embd = 4096
gptj_model_load: n_head = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot = 64
gptj_model_load: f16 = 2
gptj_model_load: ggml ctx size = 5401.45 MB
gptj_model_load: kv self size = 896.00 MB
gptj_model_load: ................................... done
gptj_model_load: model size = 3609.38 MB / num tensors = 285
Paris, France

You will notice the answer is very blunt and not chatbot-style. The model acts more as a string completion model than a chatbot assistant. Feel free to explore how it works by changing the prompt and seeing how it responds to different inputs.

We want it to act more like a Q&A chatbot, and we need to give it a better prompt. Again, even instruction-tuned LLMs need good prompts.

We can create a prompt and pass it into our LLM like so:

llm("""
You are a friendly chatbot assistant that responds in a conversational
manner to users questions. Keep the answers short, unless specifically
asked by the user to elaborate on something.

Question: Where is Paris?

Answer:""")

This would get tedious if we needed to pass in the full prompt every time we wanted to ask a question. To overcome this, LangChain has an LLMChain class we can use, which accepts an llm and a prompt in its constructor.

llm_chain = LLMChain(prompt=prompt, llm=llm)

Let’s use that now. We will create a new file, called local-llm-chain.py, and put in the following code. It sets up the PromptTemplate and GPT4All LLM, and passes them both in as parameters to our LLMChain.

touch local-llm-chain.py

from langchain import PromptTemplate, LLMChain
from langchain.llms import GPT4All
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler

template = """
You are a friendly chatbot assistant that responds in a conversational
manner to users questions. Keep the answers short, unless specifically
asked by the user to elaborate on something.

Question: {question}

Answer:"""
prompt = PromptTemplate(template=template, input_variables=["question"])

llm = GPT4All(
    model='./models/ggml-gpt4all-j-v1.3-groovy.bin',
    callbacks=[StreamingStdOutCallbackHandler()]
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

query = input("Prompt: ")
llm_chain(query)

Run the script

python3 local-llm-chain.py

And you should be prompted in the terminal for an input. Add your answer, and you should see your output streamed back.

Prompt: Where is Paris?
The city of Paris can be found at latitude 48 degrees 15 minutes North
and longitude 4 degrees 10

Let’s be honest. That is not the best answer. Maybe some more prompt engineering would help? I’ll leave that with you. I would have expected the LLM to perform a bit better, but it seems it needs some tweaking to get it working well.

In the prompt, I had to tell the bot to keep the answers short. Otherwise, the chatbot tended to go off on tangents and long rants about things only semi-related to our original question.
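
If you want a starting point for that prompt tweaking, here is one possible direction: a stricter template that asks for a single short, factual sentence. This is just a sketch of the kind of prompt engineering you could experiment with, not a template from the GPT4All project.

from langchain import PromptTemplate

# A stricter prompt template to experiment with (illustrative only).
strict_template = """
You are a helpful assistant. Answer the question below in one short,
factual sentence. Do not add extra commentary or follow-up questions.

Question: {question}

Answer:"""

strict_prompt = PromptTemplate(template=strict_template, input_variables=["question"])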

Cloud Time

Next, let’s get our private model into the cloud and start interacting with it.

I started investigating different ways to do this, where the model and application are bundled inside the same project, just like the local project we just built. There didn’t seem to be any easy or out-of-the-box way to do this. I was looking for something super simple, like a Streamlit app you could deploy with your application code and model all in one.

That’s when I realised bundling our application code and model together is likely not the way to go. What we want to do is deploy our model as a separate service and then interact with it from our application. That also makes sense because each host can be optimised for its needs. For example, our LLM can be deployed onto a server with GPU resources so it can run fast, while our application can be deployed onto a normal CPU server.

This would also allow us to scale them separately as needed. If our model gets too many requests, we can scale it separately. And if we see our applications need more resources, we can scale them on their own, which would be cheaper, of course.

Over the long term, our application might do lots of things besides talk to the LLM. For example, it might have a login system, a profile page, a billing page, and other features you typically find in an application. The LLM may be only a small part of the system as a whole.

Also, what if we wanted to interact with multiple LLMs, each one optimised for a different task? This seems to be a common concept around building agents these days. With this architecture, our LLMs deployment and main applications are separate, and we can add/remove resources as needed — without affecting the other parts of our setup.
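
As a tiny illustration of that idea, the application could hold a map of task names to separately hosted LLMs and pick one per request. The endpoint URLs below are placeholders, not real deployments, and the CEREBRIUMAI_API_KEY environment variable would still need to be set, as shown later in this article.

from langchain.llms import CerebriumAI

# Hypothetical endpoints, one hosted model per task (placeholders only).
llms = {
    "chat": CerebriumAI(endpoint_url="https://run.cerebrium.ai/<chat-model>/predict"),
    "summarise": CerebriumAI(endpoint_url="https://run.cerebrium.ai/<summary-model>/predict"),
}

def ask(task: str, prompt: str) -> str:
    # Route the prompt to whichever hosted model handles this task.
    return llms[task](prompt)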

After some searching around and trying a few different options, I found Cerebrium to be the easiest way to deploy a GPT4All model to the cloud, and it has a free option ($10 credit). And what do you know, LangChain has a Cerebrium integration! So, we are all good to go.

The first thing to do is register and log into the Cerebrium website. Once you have done that, on login, it will ask you to create a project. I already have a list of projects in mind.

Click “Create New Project,” and let's call our project GPT4All (original, right?)

Once that is done, it will take you to the Dashboard section. You want to click on the “Pre-Built Models” on the left-hand menu.

This will take you to a list of prebuilt models you can deploy. This page is pretty cool, in my opinion. You can deploy various models, including Dreambooth, which uses Stable Diffusion for text-to-image generation, Whisper Large for speech-to-text, Img2text Laion for image-to-text, and quite a few more.

Loads to play around with here. We will try to control ourselves, stay focused, and deploy just the GPT4All model, which is what we came here for 🤓. That being said, feel free to play around with some of these other models. How we will deploy our GPT4All model and connect to it from our application would probably be similar for any of these.

OK, so click to deploy the GPT4All model.

That should take you back to the model's page, where you can see some of the usage stats for your model. Of course, it’s all at zero because we haven’t used it yet. Once we start using the model, we will see some numbers increase.

If you click on the “API Keys” option in the left-hand menu, you should see your public and private keys. We will need to use the public key inside our LangChain application.

But that’s it. All done! Our GPT4All model is now in the cloud and ready for us to interact with. And we can already start interacting with the model! In the example code tab, it shows you how you can interact with your chatbot using curl (i.e., over HTTPS).

curl --location --request POST 'https://run.cerebrium.ai/gpt4-all-webhook/predict' \
--header 'Authorization: public-<YOUR_PUBLIC_KEY>' \
--header 'Content-Type: application/json' \
--data-raw '{"prompt":"Where is Paris?"}'
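
If you would rather make the same call from Python, a minimal sketch using the requests library would look like the following. The URL, headers, and body mirror the curl example above; you still need to drop in your own public key.

import requests

# Same request as the curl example above, using the requests library.
url = "https://run.cerebrium.ai/gpt4-all-webhook/predict"
headers = {
    "Authorization": "public-<YOUR_PUBLIC_KEY>",  # your Cerebrium public key
    "Content-Type": "application/json",
}
payload = {"prompt": "Where is Paris?"}

response = requests.post(url, headers=headers, json=payload)
print(response.json())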

OK, so let’s get back to our LangChain application. Let’s create a new file called cloud-llm.py with the following command:

touch cloud-llm.py

We need to install the cerebrium package. This command will handle that for us:

pip install cerebrium

Now, again, in just a few lines of code, we are all done. First, set up your CEREBRIUMAI_API_KEY using the public key from the Cerebrium dashboard. Then use the CerebriumAI class to create a LangChain LLM. You also need to pass the endpoint_url into the CerebriumAI class. You can find the endpoint URL in the “Example Code” tab on your model dashboard page on Cerebrium.

Then we can immediately start passing prompts to the LLM and getting replies. Notice the max_length parameter in the CerebriumAI constructor. We set it to 100 tokens here, which limits the length of the responses we get back.

import os
from langchain.llms import CerebriumAI

os.environ["CEREBRIUMAI_API_KEY"] = "public-"

llm = CerebriumAI(
    endpoint_url="https://run.cerebrium.ai/gpt4-all-webhook/predict",
    max_length=100
)

template = """Question: Where is France?
Answer: """

print(llm(template))

Run the script to see the output:

python3 cloud-llm.py
France is a country located in Western Europe. It is bordered by Belgium,
Luxembourg, Germany, Switzerland, Italy, Monaco, and Andorra. What are some
notable landmarks or attractions in France that tourists often visit? Some
notable landmarks or attractions in France that tourists often visit include
the Eiffel Tower in Paris, the Palace of Versailles outside of Paris

Again, with no prompt template, it goes off on a bit of a tangent.

Let’s look at the definitions of the GPT4All and CerebriumAI classes in LangChain. You will notice they both extend the LLM class.

class GPT4All(LLM):
class CerebriumAI(LLM):

When working with LangChain, I find looking at the source code is always a good idea. This will help you get a better idea of how the code works under the hood. You can clone the LangChain library onto your local machine and then browse the source code with PyCharm, or whatever your favourite Python IDE is.

git clone https://github.com/hwchase17/langchain
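
As a rough guide to what you will find in there: at the time of writing, a custom LangChain LLM mainly needs to implement a _call method that takes a prompt and returns the generated text, plus an _llm_type property. The toy class below is a sketch of that interface, not a copy of the GPT4All or CerebriumAI implementations.

from typing import List, Optional

from langchain.llms.base import LLM

class EchoLLM(LLM):
    """A toy LLM wrapper, just to show the interface that GPT4All and CerebriumAI build on."""

    @property
    def _llm_type(self) -> str:
        return "echo"

    def _call(self, prompt: str, stop: Optional[List[str]] = None) -> str:
        # A real implementation would call a local model or a remote API here.
        return f"You said: {prompt}"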

All right, so let’s make our chatbot a little more advanced. We will use an LLMChain to pass in a fixed prompt to it and also add a while loop so we can continuously interact with the LLM from our terminal. Here’s what that code looks like:

import os
from langchain import PromptTemplate, LLMChain
from langchain.llms import CerebriumAI

os.environ["CEREBRIUMAI_API_KEY"] = "public-"

template = """
You are a friendly chatbot assistant that responds in a conversational
manner to users questions. Keep the answers short, unless specifically
asked by the user to elaborate on something.
Question: {question}

Answer:"""

prompt = PromptTemplate(template=template, input_variables=["question"])
llm = CerebriumAI(
    endpoint_url="https://run.cerebrium.ai/gpt4-all-webhook/predict",
    max_length=100
)
llm_chain = LLMChain(prompt=prompt, llm=llm)

green = "\033[0;32m"
white = "\033[0;39m"

while True:
    query = input(f"{green}Prompt: ")
    if query == "exit" or query == "quit" or query == "q":
        print('Exiting')
        break
    if query == '':
        continue
    response = llm_chain(query)
    print(f"{white}Answer: " + response['text'])

Done. We now have a chatbot-style interface to interact with. It uses a LangChain application on our local machine and uses our own privately hosted LLM in the cloud. Run the script to start interacting with the LLM. Press q to exit the script at any time.

The model is still on Cerebrium, so not totally private, but the only other real way to have it private and in the cloud is to host your own servers, which is a story for another day.

Summary

So that’s it! We built a chatbot using our own private LLM locally and on the cloud. And it wasn’t even that hard. Pretty cool, right? Hopefully, this project helps get you started using open source LLMs. There are quite a few out there, and new ones are always coming out.

The quality of these models and the tuning techniques will also continue to improve. I was surprised that a chatbot-style prompt is still needed to get the model to behave as expected, but I guess that is just a requirement.

The responses also tended to go off on tangents, which tweaking the prompt helped with. The answers also sometimes seemed technical and did not feel like natural conversation. I thought the LLM would respond better out of the box, but some prompt engineering is required to overcome these quirks.

If you are building an application to parse private or business documentation, that could definitely be one of the use cases where a private LLM is more appealing. I did write a detailed article on building a document reader chatbot, so you could combine the concepts from here and there to build your own private document reader chatbot. It includes ways to get a chat history working within your chat also.

I hope this was useful. Cheers!

If you enjoyed the article, and would like to stay up to date on future articles I release about building things with LangChain and AI tools, do hit the notification button so you can receive an email when they do come out.

April 2024 update: Am working on a LangChain course for web devs to help you get started building apps around Generative AI, Chatbots, Retrieval Augmented Generation (RAG) and Agents. If you liked my writing style, and the content sounds interesting, you can sign up here
