Building a Question-Answer Bot With Langchain, Vicuna, and Sentence Transformers

A Q/A bot with open source

Paolo Rechia · Published in Better Programming · Apr 26, 2023

My last story about LangChain and Vicuna attracted more interest than I expected, so I decided to follow up on the topic and explore it a bit further. One question I kept seeing asked in the community is how to use embeddings with LLaMA models.

I decided to give it a try and share my experience building a question/answer bot using only open source tools.

To run the final example, you need a reasonably powerful computer: either a GPU with at least 10 GB of VRAM, or at least 32 GB of RAM to keep the model in memory and run inference on the CPU.

Note for CPU users: there are lighter model versions for CPU out there; I just never tried them.

What you can expect in this text:

  1. How to extract embeddings from Vicuna or any LLaMA-based model
  2. Extracting chunks from a text file into Chroma
  3. How to use the Sentence Transformers library to extract embeddings
  4. Comparing the Vicuna embeddings against the Sentence Transformer in a simple test
  5. Using our best embeddings to build a bot that answers questions about Germany, using Wikitext as the source of truth.

If you need more code examples throughout this exercise, you can use my repository, where you can find the complete source code for everything discussed here: https://github.com/paolorechia/learn-langchain. I plan to add an easy way to install and run the code samples in the coming days.

(Update 01.05.2023: I added support for using the popular text-generation-webui as the backend, which makes it significantly easier to install and try the examples from this article.)

How to extract embeddings from Vicuna or any LLaMA-based model

Spoiler: these embeddings are not good, but I wanted to share my experience. Perhaps the community will find a better way of leveraging embeddings from LLaMA models.

Here, I assume you can load a Vicuna model locally somehow. If you can't, you might want to skim over this step.
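If you don't have a working setup yet, the rough shape of loading a LLaMA-based model with Hugging Face Transformers is sketched below; the weights path is a placeholder for wherever your Vicuna checkpoint lives, not something specific to this article:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "path/to/your/vicuna-weights"  # placeholder: point at your local checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,  # use torch.float32 for CPU-only inference
    device_map="auto",          # requires the accelerate package
)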

Inspecting the LLaMA source code in Hugging Face Transformers, we see some functions to extract embeddings:

class LlamaForCausalLM(LlamaPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.model = LlamaModel(config)

        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

        # Initialize weights and apply final processing
        self.post_init()

    def get_input_embeddings(self):
        return self.model.embed_tokens

    def set_input_embeddings(self, value):
        self.model.embed_tokens = value

    def get_output_embeddings(self):
        return self.lm_head

    def set_output_embeddings(self, new_embeddings):
        self.lm_head = new_embeddings

    (...)

Here we see we can take either input or output embeddings. I went with input embeddings initially, and wrote the following function:

import torch


def get_embeddings(model, tokenizer, prompt):
    # Tokenize the prompt and look up the input embedding of each token
    input_ids = tokenizer(prompt).input_ids
    input_embeddings = model.get_input_embeddings()
    embeddings = input_embeddings(torch.LongTensor([input_ids]))
    # Average the token embeddings into a single vector for the whole prompt
    mean = torch.mean(embeddings[0], 0).cpu().detach()
    return mean

Here, we're essentially tokenizing the input, extracting the input embedding for each token, and averaging them into a single vector. I then served this behind an HTTP server, though that is not really a hard requirement.

This is a FastAPI endpoint definition:

@app.post("/embedding")
def embeddings(prompt_request: EmbeddingRequest):
params = {"prompt": prompt_request.prompt}
print("Received prompt: ", params["prompt"])
output = get_embeddings(model, tokenizer, params["prompt"])
return {"response": [float(x) for x in output]}

Extracting chunks from a text file into Chroma

This part is fairly simple: first, open your file as plain text.

with open("germany.txt") as f:
book = f.read()

For the plain text, I used the wikitext source of the Wikipedia article about Germany.

Then you can pretty much just copy an example from the LangChain documentation to load the file and convert it to embeddings.

from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(book)
docsearch = Chroma.from_texts(
    texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]
)

However, we don't have the embeddings yet. A quick glance into the Chroma.from_texts function leads us to a function called add_texts, and finally to this block of code:

if self._embedding_function is not None:
    embeddings = self._embedding_function.embed_documents(list(texts))

We can then take a peek at the OpenAIEmbeddings class and see how this method is implemented:

def embed_documents(
    self, texts: List[str], chunk_size: Optional[int] = 0
) -> List[List[float]]:
    """Call out to OpenAI's embedding endpoint for embedding search docs.

    Args:
        texts: The list of texts to embed.
        chunk_size: The chunk size of embeddings. If None, will use the chunk size
            specified by the class.

    Returns:
        List of embeddings, one for each text.
    """
    # handle batches of large input text
    (...)
    # Actual code in here...

Another function that seemed relevant:

def embed_query(self, text: str) -> List[float]:
    """Call out to OpenAI's embedding endpoint for embedding query text.

    Args:
        text: The text to embed.

    Returns:
        Embedding for the text.
    """
    # Actual code in here...

Alright, so we don't care what's inside these functions; we only care about their public interface. Let's mimic it in our own embedding class:

from typing import List, Optional

import requests
from pydantic import BaseModel

# Base class for LangChain embedding wrappers (path as of the LangChain version used here)
from langchain.embeddings.base import Embeddings


class VicunaEmbeddings(BaseModel, Embeddings):
    def _call(self, prompt: str) -> str:
        p = prompt.strip()
        print("Sending prompt ", p)
        response = requests.post(
            "http://127.0.0.1:8000/embedding",
            json={
                "prompt": p,
            },
        )
        response.raise_for_status()
        return response.json()["response"]

    def embed_documents(
        self, texts: List[str], chunk_size: Optional[int] = 0
    ) -> List[List[float]]:
        """Call out to Vicuna's server embedding endpoint for embedding search docs.

        Args:
            texts: The list of texts to embed.
            chunk_size: The chunk size of embeddings. If None, will use the chunk size
                specified by the class.

        Returns:
            List of embeddings, one for each text.
        """
        results = []
        for text in texts:
            response = self.embed_query(text)
            results.append(response)
        return results

    def embed_query(self, text) -> List[float]:
        """Call out to Vicuna's server embedding endpoint for embedding query text.

        Args:
            text: The text to embed.

        Returns:
            Embedding for the text.
        """
        embedding = self._call(text)
        return embedding

As you can see, I’m just delegating the hard work to our server. If you have a hard time implementing the full service yourself, remember you can always refer to the full implementation on GitHub.

Now you should be able to glue it into a LangChain app. Here's a complete runnable script that imports our previous class:

from langchain_app.models.vicuna_embeddings import VicunaEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma


embeddings = VicunaEmbeddings()

with open("germany.txt") as f:
    book = f.read()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(book)
docsearch = Chroma.from_texts(
    texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]
)

while True:
    query = input("Type your search: ")
    docs = docsearch.similarity_search_with_score(query, k=1)
    for doc in docs:
        print(doc)

How to use the Sentence Transformers library to extract embeddings

The Sentence Transformers library focuses on building embeddings for similarity search. It also offers tight integration with Hugging Face, making it exceptionally easy to use. Check the all-MiniLM-L6-v2 model card, for instance, for a very short usage example.
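For reference, using the library directly, outside LangChain, boils down to something like the sketch below. This follows the standard sentence-transformers usage rather than code from this article:

from sentence_transformers import SentenceTransformer

# Load a small, CPU-friendly embedding model
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["What is Germany?", "How is the climate in Germany?"]
embeddings = model.encode(sentences)  # one vector per sentence
print(embeddings.shape)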

So all we need to do is load the model and pass a list of strings to encode. It's that simple! How can we connect it to LangChain? In the latest version, it's already integrated.

This makes our required code as simple as two lines (note that you need the sentence-transformers library installed first):

from langchain.embeddings import SentenceTransformerEmbeddings 
embeddings = SentenceTransformerEmbeddings(model="all-MiniLM-L6-v2")

Our new code version, using sentence transformer embeddings instead:

from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings

embeddings = SentenceTransformerEmbeddings(model="all-MiniLM-L6-v2")

with open("germany.txt") as f:
    book = f.read()

text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_text(book)
docsearch = Chroma.from_texts(
    texts, embeddings, metadatas=[{"source": str(i)} for i in range(len(texts))]
)

while True:
    query = input("Type your search: ")
    docs = docsearch.similarity_search_with_score(query, k=1)
    for doc in docs:
        print(doc)

You should be able to run this example even with hardware limitations: since the model all-MiniLM-L6-v2 is very lightweight, we can run it directly on the CPU.

Comparing the Vicuna embeddings against the Sentence Transformer in a simple test

So using the scripts above, I tested both embeddings. Let’s look first at the Vicuna embeddings I created:

Type your search: What is Germany?

(Document(page_content='{{clear}}\n\n=== Law ===\n\n{{Main|Law of Germany|Judiciary of Germany|Law enforcement in Germany}}', metadata={'source': '56'}), 0.25790396332740784)

That doesn't look too promising; let's try another one:

Type your search: How is climate in Germany?

(Document(page_content='=== Infrastructure ===\n\n{{Main|Transport in Germany|Energy in Germany|Telecommunications in Germany|Water supply and sanitation in Germany}}\n\n[[File:ICE 3 Oberhaider-Wald-Tunnel.jpg|thumb|right|An [[ICE 3]] on the [[Cologne–Frankfurt high-speed rail line]]]]', metadata={'source': '72'}), 0.19269950687885284)

Sadly, the results look completely unrelated to our original queries. Let's compare with the sentence transformer model:

Type your search: What is Germany?

(Document(page_content="Germany has been described as a [[great power]] with [[Economy of Germany|a strong economy]]; it has the [[List of sovereign states in Europe by GDP (nominal)|largest economy in Europe]], the world's [[List of countries by GDP (nominal)|fourth-largest economy by nominal GDP]] and the [[List of countries by GDP (PPP)|fifth-largest by PPP]]. As a global power in industrial, [[Science and technology in Germany|scientific and technological]] sectors, it is both the world's [[List of countries by exports|third-largest exporter]] and [[List of countries by imports|importer]]. As a [[developed country]] it [[Social security in Germany|offers social security]], [[Healthcare in Germany|a universal health care system]] and [[Higher education in Germany|a tuition-free university education]]. Germany is a member of the [[United Nations]], the European Union, [[NATO]], the [[Council of Europe]], the [[G7]], the [[G20]] and the [[OECD]]. It has the [[List of World Heritage sites in Germany|third-greatest number]] of [[UNESCO World Heritage Site]]s.", metadata={'source': '4'}), 0.7956284284591675)

Wow, much better :)

Type your search: How is climate in Germany?

(Document(page_content='=== Climate ===\nMost of Germany has a [[temperate]] climate, ranging from [[Oceanic climate|oceanic]] in the north and west to [[Continental climate|continental]] in the east and southeast. Winters range from the cold in the Southern Alps to cool and are generally overcast with limited precipitation, while summers can vary from hot and dry to cool and rainy. The northern regions have prevailing westerly winds that bring in moist air from the North Sea, moderating the temperature and increasing precipitation. Conversely, the southeast regions have more extreme temperatures.<ref>{{cite web|url=https://www.britannica.com/place/Germany/Climate|website=Encyclopedia Britannica|title=Germany: Climate|accessdate=23 March 2020|archiveurl=https://web.archive.org/web/20200323124307/https://www.britannica.com/place/Germany/Climate|archivedate=23 March 2020|url-status=live}}</ref>', metadata={'source': '43'}), 0.602766215801239)

So unsurprisingly, the search functionality works pretty well with sentence transformers! These models are tailored to this specific use case, after all.

You can see more query examples here.

You may see cases where the search does not work well because it returns only a small chunk containing the topic title; a possible workaround is increasing the number of returned documents, as shown below.
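For example, asking Chroma for the top three chunks instead of one gives the model more context to work with:

# Return the three closest chunks instead of only the single best match
docs = docsearch.similarity_search_with_score(query, k=3)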

Using our best embeddings to build a bot that answers questions about Germany, using Wikitext as the source of truth

As you probably noticed in the previous example, we're returning pretty raw wikitext with a lot of tags, which is very hard to read. Let's fix that by creating a full program that feeds these results to the Vicuna LLM.

We covered how to set up a local Vicuna LLM API in the previous article; you can look it up there. Currently, the hardest part is probably installing all the dependencies; I'll eventually add a proper installation script to make things easier. Assuming you are able to set up the server, you can run a quantized version like this:

export USE_7B_MODEL=true && export USE_4BIT=true && uvicorn servers.vicuna_server:app

Once you have the LLM server running, we can proceed to writing the final program and executing it.

First, we'll create a tool that searches the embeddings we built:

from pydantic import BaseModel, Field
from langchain.agents import Tool


class SearchInEmbeddings(BaseModel):
    query: str = Field()


def search(search_input: SearchInEmbeddings):
    # The agent passes the query in as a plain string; we forward it to Chroma
    docs = docsearch.similarity_search_with_score(search_input, k=1)
    return docs


tools = [
    Tool(
        name="Search",
        func=search,
        description="useful for when you need to answer questions about Germany",
    )
]

We'll then initialize an agent with memory. Memory is not really necessary for this example; the important pieces are passing in the tools and using the correct agent type.

print("Initializing VicunaLLMClient")
memory = ConversationBufferMemory(memory_key="chat_history")
llm = VicunaLLM()
agent = initialize_agent(
tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True, memory=memory
)

And now our final loop:

while True:
    query = input("Type your question: ")
    agent.run(input=query)

Let’s give it a try:

Type your question: What is the Climate like in Germany?


> Entering new AgentExecutor chain...
I should look it up
Action: Search
Action Input: "climate of Germany"
Observation: [(Document(page_content='=== Climate ===\nMost of Germany has a [[temperate]] climate, ranging from [[Oceanic climate|oceanic]] in the north and west to [[Continental climate|continental]] in the east and southeast. Winters range from the cold in the Southern Alps to cool and are generally overcast with limited precipitation, while summers can vary from hot and dry to cool and rainy. The northern regions have prevailing westerly winds that bring in moist air from the North Sea, moderating the temperature and increasing precipitation. Conversely, the southeast regions have more extreme temperatures.<ref>{{cite web|url=https://www.britannica.com/place/Germany/Climate|website=Encyclopedia Britannica|title=Germany: Climate|accessdate=23 March 2020|archiveurl=https://web.archive.org/web/20200323124307/https://www.britannica.com/place/Germany/Climate|archivedate=23 March 2020|url-status=live}}</ref>', metadata={'source': '43'}), 0.46628236770629883)]
Thought:I now know the climate of Germany
Final Answer: Germany has a temperate climate, ranging from oceanic in the north and west to continental in the east and southeast. Winters range from cold in the Southern Alps to cool and are generally overcast with limited precipitation, while summers can vary from hot and dry to cool and rainy. The northern regions have prevailing westerly winds that bring in moist air from the North Sea, moderating the temperature and increasing precipitation. Conversely, the southeast regions have more extreme temperatures.

> Finished chain.

Here’s the final answer in an easier-to-read format:

Germany has a temperate climate, ranging from oceanic in the north and west to continental in the east and southeast. Winters range from cold in the Southern Alps to cool and are generally overcast with limited precipitation, while summers can vary from hot and dry to cool and rainy. The northern regions have prevailing westerly winds that bring in moist air from the North Sea, moderating the temperature and increasing precipitation. Conversely, the southeast regions have more extreme temperatures.

I think it looks pretty good!

I decided to give it a more challenging example (full chain output here); let's see how it goes. I manually extracted the answers from the output, but if you want to automate that, you can add a simple parser that pulls out the "Final Answer" line, as sketched right below.
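A minimal sketch of such a parser (the helper name is my own, not part of LangChain):

def extract_final_answer(chain_output: str) -> str:
    # Take everything after the last "Final Answer:" marker in the verbose agent output
    marker = "Final Answer:"
    if marker not in chain_output:
        return ""
    return chain_output.rsplit(marker, 1)[-1].strip()

Here are the question/answer pairs I extracted: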

What are the states in Germany?

The states in Germany are the sixteen constituent states, which are collectively referred to as Länder. Each state has its own constitution and is largely autonomous in regard to its internal organization. There are 401 districts at a municipal level, consisting of 294 rural districts and 107 urban districts.

Can you tell me the name of each state in Germany?

The name of each state in Germany is: [[Bavaria]], [[Berlin]], [[Brandenburg]], [[Hamburg]], [[Hesse]], [[Mecklenburg-Vorpommern]], [[Lower Saxony]], [[North Rhine-Westphalia]], [[Rhineland-Palatinate]], [[Saarland]], [[Saxony]], [[Saxony-Anhalt]], [[Thuringia]], [[Bremen]], [[Dortmund]], [[Duisburg]], [[Essen]], [[Gelsenkirchen]], [[Hamburg]], [[Munich]], [[Nuremberg]], [[Stuttgart]], [[Hanover]], [[Berlin]], [[Leipzig]], [[Erfurt]], [[Schleswig-Holstein]], [[Thuringia]], [[Hamburg]], [[Bremen]]

We see two problems: first, it returned the names inside [[ ]] brackets. This is not a big deal; it can probably be fixed with further prompting or preprocessing.
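For illustration, here's a minimal preprocessing sketch (my own, not from the article's repository) that strips wikilink markup with a regular expression before text is embedded or shown to the user:

import re

def strip_wikilinks(text: str) -> str:
    # [[Target|label]] -> label, [[Target]] -> Target
    return re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]", r"\1", text)

print(strip_wikilinks("[[Bavaria]], [[Lower Saxony|Niedersachsen]]"))
# prints: Bavaria, Niedersachsen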

The second problem is that it also confused cities with states.

This was a very challenging question, which shows how limited LLMs can be at understanding the data, especially when it's not well preprocessed beforehand.

That’s it! I hope you enjoyed this article and that it helps you with your goals.


Written by Paolo Rechia, Software Developer / Data Engineer. Connect with me on LinkedIn: https://www.linkedin.com/in/paolo-rechia/