Harnessing Retrieval Augmented Generation With Langchain

Implementing RAG using Langchain

Amogh Agastya
Better Programming


Image by Author, generated using Adobe Firefly

Retrieval Augmented Generation (RAG) is more than just a buzzword in the AI developer community; it’s a groundbreaking approach that’s rapidly gaining traction in organizations and enterprises of all sizes.

As we delve deeper into the capabilities of Large Language Models (LLMs), uncovering new applications along the way, the value and appeal of RAG are becoming increasingly clear. And for good reason!

Despite its recent surge in popularity, the foundations of Retrieval-Augmented Generation were laid in 2020, when Facebook AI Research (FAIR) popularized this innovative approach in their seminal paper, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”.

This paradigm shift has since catalyzed significant advancements in Natural Language Processing (NLP), offering a unique methodology for tackling knowledge-intensive and domain-specific tasks with LLMs.

What is it?

Retrieval Augmented Generation (RAG) is a new generative paradigm that fuses Large Language Models and traditional Information Retrieval (IR) techniques, enabling seamless AI interactions leveraging your custom data.

Source: Twilix

History of Retrieval Augmentation

The concept of retrieval augmentation in the context of language models was first introduced by Google in their paper — REALM: Retrieval-Augmented Language Model Pre-Training. In it, they explored using document retrieval to improve the pre-training of language models, which otherwise store all of their world knowledge purely in their parameters (so-called parametric knowledge).

To understand how standard language models encode world knowledge in their parameters, one should first review how these models are pre-trained. Since the invention of BERT, the fill-in-the-blank task, called masked language modeling, has been widely used for pre-training language representation models.

Given a text with certain words masked, the task is to fill in the missing words.

The authors of REALM used a retriever to augment this pre-training process.

The critical intuition of REALM is that a retrieval system should improve the model’s ability to fill in missing words. The release of REALM helped drive interest in developing end-to-end retrieval-augmented generation models, as demonstrated by Facebook AI research.

In the subsequent RAG paper, the researchers similarly combined a retriever module (a dense passage retrieval system) with a generative transformer model (BART in this case), jointly fine-tuning the two to achieve state-of-the-art results on Open-Domain Question Answering (ODQA) benchmarks.

Source: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

However, the concept and methodology were still within the realm of fine-tuning (excuse the pun!), where retrieval was used in the pre-training or fine-tuning pipeline of language models.

A lot has changed in NLP since then! Transformer-based Language Models have grown larger, and more capable, while generative AI tools like ChatGPT have become commonplace. The barrier to entry for NLP and Data Science has lowered significantly, and thanks to new productivity tools, it’s easier than ever to experiment with LLMs and implement these complex architectures.

Fast forward to now, and we have a thriving NLP & AI research ecosystem where brilliant minds all across the world are solving the most challenging problems in the field of LLMs. One such recent development in this context is the paper RETRO — Improving language models by retrieving from trillions of tokens by DeepMind.

Source: RETRO Deepmind

RETRO, which stands for “Retrieval-Enhanced Transformer,” demonstrated that the latest batch of language models can be much smaller yet achieve GPT-3-like performance by being able to query an external database or search the web for information.

A key indication is that building larger models is not the only way to improve performance!

The input query is first matched with a vector database to retrieve the closest matching results.
The retrieved results are added to the input of the language model to enhance the generated text. Source — The Illustrated Retrieval Transformer by Jay Alammar

Another exciting new development has been FLARE — Active Retrieval Augmented Generation. Forward-Looking Active REtrieval augmented generation (FLARE) is a novel retrieval-augmented generation method that leverages the prediction of upcoming sentences to anticipate future content, which is then used as a query to retrieve relevant documents.

Source: Active Retrieval Augmented Generation

Architecture

Implementing the RAG architecture has gotten much simpler now, but the core concept remains essentially the same.

We use a retrieval system to fetch information relevant to the input prompt and then augment the LLM’s generated output with it. This technique allows us to bypass fine-tuning: we can simply expose the model to external data (non-parametric) instead of having to retrain it on our domain-specific data.

Source: Twitter — Jerry Liu

In a nutshell, the system comprises three key components: a knowledge-base index, a retriever that fetches the indexed documents, and an LLM (ideally instruct-tuned) such as LLaMA or ChatGPT. The generative model is then prompted to yield useful responses, with the retrieved knowledge added as context within the prompt.
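In (heavily simplified) Python, the flow looks something like the sketch below. Note that index, embed, and llm here are stand-ins for whatever vector database, embedding model, and generative model you choose, not a specific library’s API.

# A minimal RAG loop: retrieve the closest chunks, then prompt the LLM with them.
# `index`, `embed`, and `llm` are placeholders for your own components.

def retrieve(query: str, index, embed, k: int = 4) -> list[str]:
    """Fetch the k chunks whose embeddings are closest to the query embedding."""
    query_vector = embed(query)
    return index.search(query_vector, top_k=k)

def generate(query: str, context_chunks: list[str], llm) -> str:
    """Augment the prompt with the retrieved context before calling the LLM."""
    context = "\n\n".join(context_chunks)
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return llm(prompt)

def rag_answer(query: str, index, embed, llm) -> str:
    return generate(query, retrieve(query, index, embed), llm)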

Why should you care?

Retrieval Augmented Generation has several pertinent advantages over just using off-the-shelf generative models, including:

  1. Easy Knowledge Acquisition: RAG methods can easily acquire knowledge from external data sources, such as Wikipedia or internal documents, to improve the performance of LLMs within domain-specific tasks.
  2. Minimal Training Cost: The only training needed is the indexing of your knowledge base. No fine-tuning necessary. This reduced resource requirement enables comparable performance while utilizing fewer computational and data resources. This is especially valuable for organizations or researchers with limited training infrastructure or budget constraints.
  3. Multiple Sources of Knowledge: With RAG, one can make use of multiple sources of knowledge, including those that are baked into the model parameters as well as information contained within many different knowledge bases, allowing it to outperform other state-of-the-art models in tasks like question-answering.

Source: Retrieval-augmented Generation across Heterogeneous Knowledge

  4. Strong Scalability: Using performant vector databases, we can easily scale RAG to large datasets and handle complex queries, making it useful for commercial applications.
  5. Improved Performance & Reduced Hallucination: RAG generates more accurate and contextually informed content by leveraging retrieval techniques, reducing the likelihood of generating incorrect or fabricated information. This makes it a powerful approach for various tasks, ensuring reliable and factual generation of content.
  6. Overcome Context-Window Limit: All language models have a fixed length of tokens they can process at once, known as the context window. Using Retrieval Augmentation, we can overcome this fixed text constraint, allowing the model to incorporate data from larger document collections, providing a broader context for generating more informed and contextually rich output.
  7. Return Sources: As a cherry on top, RAG also offers explainability, which is essential for building trust in LLMs. Unlike a black-box LLM, RAG allows users to read the sources it retrieved and judge their relevance and credibility for themselves. By surfacing the sources used to generate the text, it provides transparency and accountability, making LLMs more trustworthy and explainable.

Source: Building Scalable, Explainable, and Adaptive NLP Models with Retrieval, Standford AI Lab

Implementation

The spirit of open-source software and research warrants special recognition here, as it has spawned a host of incredible new libraries that handle the majority of the heavy lifting for implementation.

Notable among these are the Transformers library by Hugging Face, which has become the de facto standard for open-source NLP; the bitsandbytes library which democratizes LLM training and inference; and of course Langchain, the most popular Python library for streamlining your LLM workflow.

Why Langchain?

It seems like everyone and their neighbor is using Langchain these days, but what makes it so popular, and why has it become the go-to choice? The answer lies in abstraction. Much as Python became the most favored programming language in AI thanks to its easy syntax and buzzing ecosystem, Langchain has caught on by hiding the lower-level plumbing behind a simple interface.

An illustration of Langchain’s Composability. Source: Reddit

‘Langchain’ is a portmanteau of the words ‘Language’ (representing Language Models) and ‘Chain’ (referring to chaining steps together, in the spirit of chain-of-thought prompting). In essence, it’s a framework that allows us to elicit reasoning from language models. It offers many built-in wrappers and a diverse set of utilities, simplifying much of the groundwork for us developers.

Source: Twitter — Lance Martin

Basically, Langchain provides a high-level interface for working with Large Language Models, enabling us to swiftly build applications without getting mired in lower-level implementation details.

It features a modular, declarative design and provides a host of utility and helper functions that we can easily plug and play to build our LLM-powered applications.

Use-Cases

So we’ve heard much about RAG and Langchain, but what can we build with them?

  • Generative Search: Generative Search is the latest search framework that employs LLMs and Retrieval Augmented Generation to revolutionize how users interact with search engines. Your favorite chat-search apps like Bing Chat, You.com, and Perplexity all use RAG under the hood.
  • Chat with your Data: You’ve probably seen a slew of new startups and products that enable you to ‘chat with your documents.’ Using RAG, we can transform static content into dynamic knowledge sources, making information retrieval effortlessly engaging.
  • Customer Service Chatbots: I think it’s safe to say that we’ve all had at least one terrible experience with a customer support chatbot. With the rise of RAG, I predict we’ll start to see a new generation of chatbots that provide accurate, personalized, and context-aware assistance by tapping into a vast knowledge base of relevant information, fostering brand loyalty and delivering exceptional customer service.

Augmented Generation

To get a sense of how RAG works, let’s first have a look at Augmented Generation, as it underpins the approach.

Augmented Generation simply means adding external information to the input prompt fed into the LLM, thereby augmenting the generated response. A simple example of using a context-augmented prompt with Langchain is as follows —

from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

# Load the document as a string
context = '''A phenotype refers to the observable physical
properties of an organism, including its appearance, development, and behavior.
It is determined by both the organism's genotype, which is the set of genes
it carries, and environmental influences upon these genes.'''

# Create the Prompt Template for base qa_chain
qa_template = """Context information is below.
---------------------
{context}
---------------------
Given the context information and not prior knowledge,
answer the question: {question}
Answer:
"""
PROMPT = PromptTemplate(
    template=qa_template, input_variables=["context", "question"]
)
chain = LLMChain(llm=OpenAI(temperature=0), prompt=PROMPT)
query = "What's a phenotype?"
chain({"context": context, "question": query}, return_only_outputs=True)

Here, we insert some additional context within the prompt, before asking the model to answer our question.

By designing the prompt for the LLM in this way (an instruction, followed by the {context} and then the {question}), we can guide the model to stick to the specific context we provide when answering the question. This is easily achieved by calling an LLMChain in Langchain.

Output —

A phenotype refers to the observable physical properties of an organism,
including its appearance, development, and behavior.
The interplay between genetic and environmental influences determines the
characteristics and features that are manifested in an organism's phenotype.

Retrieval + Augmented Generation

In the above example, the included context of phenotypes was just a paragraph long and only about 50 tokens, well below ChatGPT’s default context window of 4096 tokens.

But what if our information source is much longer than that? What do we do when we want to feed an entire corpus of text like a 10-page document or a 500-page book, far beyond the maximum context length? Retrieval is the solution!

The field of Information Retrieval is evolving at a rapid pace, as we move from keyword-based retrieval to neural network-based techniques. By leveraging Dense Vector Embeddings, in combination with traditional keyword-matching techniques, we have entered an era of Hybrid Search that empowers us with an expanded capacity to search semantically and deliver more accurate and relevant results than ever before.

Source: Langchain Auto-Evaluator

Using a vector database, we can chunk and index a knowledge source of any size ahead of time.

This allows us to dynamically retrieve and inject only the most relevant chunks from our vast knowledge base into the prompt during inference, enabling us to overcome the context limit and also contextually control the output generated.

Conversational Retrieval Augmented Generation

Retrieval Augmented Generation is cool and all, but can I chat with it? That’s the real question. In today’s chatbot-driven era, users have grown accustomed to and now expect a conversational interface for any AI interaction.

Incorporating a conversational assistant powered by RAG fosters a seamless and intuitive user experience, facilitating natural and dynamic AI interactions that enhance engagement and overall satisfaction.

It’s increasingly clear that Conversational AI is the new UI.

So to demonstrate, let’s build a Conversational RAGbot! For this tutorial, I chose to build a Prompt Engineering Assistant — SPARK⚡️

SPARK stands for Smart Prompt Assistant and Resource Knowledgebase, here to help you with all your prompting needs! Whether it’s understanding key prompt engineering concepts or implementing the best practices, SPARK is here to help you craft your perfect prompt.

Source: SPARK Prompt Assistant

As the adoption of Large Generative Models continues to spread, and as more organizations begin to realize the problems that AI can solve for them, the value and necessity of Prompt Engineering become increasingly clear.

But instead of having to hire someone else, we can choose to augment ourselves with assistants like SPARK. Let’s get started.

Step 1: Collect Your Data 🗂️

The first step is to identify the key sources of data you want to leverage. For this use case, I needed high-quality content related to prompt engineering and fortunately, there are some great resources online.

In fact, SPARK was inspired by the GPT Best Practices blog and the amazing OpenAI Cookbook, which link out to several high-quality prompting guides, so let’s use the knowledge distilled within those sources to augment our Assistant.

Data Sources —

Step 2: Load your Data 📥

Source: Langchain Docs — Data Connection

To index our knowledge base, we first need to load the data. Since we have a curated list of URLs, we can use one of Langchain’s many built-in data loaders: WebBaseLoader. We could also use the SitemapLoader, but some websites don’t have a public sitemap, in which case the simplest approach is to load the data from a list of webpage URLs.

from langchain.document_loaders import WebBaseLoader
urls = ["https://platform.openai.com/docs/guides/gpt-best-practices/",
"https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
"https://github.com/brexhq/prompt-engineering",
...]
loader = WebBaseLoader(urls)
data = loader.load()
data

This isn’t the most elegant solution if the list of URLs is very long, but it lets us skip many of the web-scraping challenges you might run into with other loading methods.

import tiktoken

def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

After loading the data, we can utilize the tiktoken package to count the total tokens in our corpus. Token counting helps us gauge the size of the dataset, which is useful for estimating embedding costs and choosing sensible chunk sizes.

For SPARK, the knowledge base was compiled from 174 unique URLs, resulting in a total of ~200,000 tokens of useful prompting information.
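That total comes straight from the helper above. For example (assuming data is the list of Documents returned by the loader in Step 2):

# Sum the token counts of every loaded document to gauge corpus size
total_tokens = sum(num_tokens_from_string(doc.page_content, "cl100k_base") for doc in data)
print(f"Total tokens in corpus: ~{total_tokens}")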

Step 3: Chunk your Data ➗

Once the data is scraped and loaded, we move on to chunking & cleaning the data.

It’s important to chunk the data as we want to embed a meaningful length of context within our vector index. Embedding just a word or two is too little information to match relevant vectors, and embedding entire pages would be too long to fit within the context window of the prompt. Try to strike the right balance for your use case and dataset.

A helpful resource on data loading & chunking with Langchain. Source: James Briggs

There are many text splitters that Langchain supports. I decided to split by token, as that would make it easier for me to directly manage the context length, choosing a chunk size of about 500 tokens for my use case. We also set a small overlap length so that text continuity is preserved between our chunks.

from langchain.text_splitter import TokenTextSplitter
text_splitter = TokenTextSplitter(chunk_size=500, chunk_overlap=25)
docs = text_splitter.split_documents(data)
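Before indexing, it’s worth a quick sanity check that the splitter produced chunks close to the 500-token target (a throwaway check, reusing the token counter from earlier):

# Inspect the split: number of chunks and the size of the first one
print(f"{len(data)} documents were split into {len(docs)} chunks")
print(num_tokens_from_string(docs[0].page_content, "cl100k_base"), "tokens in the first chunk")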

Step 4: Index your Data 📂

Once our data is chunked, it’s time to build our knowledge-base index. In general, the term “index” refers to a data structure that is used to optimize the retrieval of information from a larger collection of data.

By creating an index, you are essentially creating a map or a reference that allows you to quickly locate vectors based on certain criteria, such as similarity or distance.

Just like an index in a book helps you find specific information by referring to page numbers or keywords, a vector index helps you locate and retrieve relevant vectors from a vector database.

Simply put — Text + Embeddings = Vector Index. Source: Pinecone

For this demo, I used Pinecone as the vector database, and Cohere for the Embeddings. As a prerequisite, get your API keys and log into your Pinecone account to create a new index.

You’ll be required to input the dimensionality of your index, which must match the size of your embeddings. I used Cohere’s ‘embed-english-light-v2.0’ model, which produces 1024-dimensional embeddings. Once the index is created, we can upsert our document vectors to it.

from langchain.vectorstores import Pinecone
import pinecone
from langchain.embeddings import CohereEmbeddings

embeddings = CohereEmbeddings(model='embed-english-light-v2.0', cohere_api_key='YOUR KEY')

# initialize pinecone
pinecone.init(
    api_key='YOUR API KEY',  # find at app.pinecone.io
    environment='us-west1-gcp',  # next to api key in console
)

index_name = "spark"

docsearch = Pinecone.from_documents(docs, embeddings, index_name=index_name)
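Note that the index must exist before the upsert above. If you’d rather not create it in the Pinecone console, it can also be created programmatically (a sketch using the classic pinecone-client API; the dimension must match your embedding model):

# Create the index if it doesn't exist yet: 1024 dimensions to match
# Cohere's embed-english-light-v2.0, with cosine similarity as the metric
if index_name not in pinecone.list_indexes():
    pinecone.create_index(name=index_name, dimension=1024, metric="cosine")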

Step 5: Build a Retriever 🔍

Once our vector store is indexed, it’s time to define our retriever. The retriever is the module that fetches the relevant documents from the vector database, governed by its search algorithm.

Retrieval is a fundamental task in NLP, particularly in question-answering systems, search engines, and information retrieval applications. The goal of a retriever is to efficiently search through a large corpus of documents and identify the most relevant ones that are likely to contain the information sought by a user.

Source: Sentence Bert

Some popular search algorithms include BM-25, similarity search, MMR (Maximal Marginal Relevance), HNSW (Hierarchical Navigable Small World), LSH (Locality Sensitive Hashing), and other learning-to-rank techniques that leverage ML algorithms to optimize retrieval performance.

The Rerank endpoint acts as the last stage re-ranker of a search flow. Source: Cohere Rerank.

For this demo, I experimented with a base retriever using cosine similarity as the metric, plus a second stage that post-processes the retrieved results with Cohere’s Rerank endpoint. Langchain supports this easily with just a couple of lines of code.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain.vectorstores import Pinecone
# load index
docsearch = Pinecone.from_existing_index(index_name, embeddings)
# initialize base retriever
retriever = docsearch.as_retriever(search_kwargs={"k": 4})
# Set up cohere's reranker
compressor = CohereRerank()
reranker = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever
)
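Before wiring the retriever into a chain, it helps to query it directly and eyeball what comes back (the query string here is just an example):

# Retrieve, rerank, and print the top matching chunks for a test query
results = reranker.get_relevant_documents("How do I write a good system prompt?")
for doc in results:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])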

Step 6: Create a Conversational Retrieval Chain ⛓️

Now that we have all the components in place, we can build the Conversational Retrieval Chain. This chain builds on top of RetrievalQAChain to include a chat history component to facilitate conversational interactions.

Source: Langchain Blog— ChatGPT Over Your Data

It first combines the chat history and the question into a standalone question; then looks up relevant documents from the retriever, and finally passes those documents and the question to a question-answering chain to return a response.

This two-stage chaining overcomes the single-topic limitations I faced in my previous semantic-search article, now enabling true conversational search. Langchain provides some default prompts, but let’s customize them for our assistant SPARK. I used Azure’s OpenAI search Demo as inspiration for my base prompts.

Standalone Question Generator Prompt

Below is a summary of the conversation so far, and a new question asked by the user that needs to be answered by searching in a knowledge base.
Generate a search query based on the conversation and the new question.

Chat History:
{chat_history}

Question:
{question}

Search query:

Answer Generator Prompt

<SPARK Assistant Persona>
<Instructions>
Important:
Answer with the facts listed in the list of sources below. If there isn't enough information below, say you don't know.
If asking a clarifying question to the user would help, ask the question.
ALWAYS return a "SOURCES" part in your answer, except for small-talk conversations.

Question: {question}
Sources:
---------------------
{summaries}
---------------------

Chat History:
{chat_history}

Once our custom prompts are defined, we can initialize the Conversational Retrieval Chain. This utilizes Langchain’s memory-management modules; I chose ConversationTokenBufferMemory, which keeps a buffer of recent interactions in memory and uses token length to determine when to flush past interactions.

from langchain.chains import LLMChain, ConversationalRetrievalChain
from langchain.chains.qa_with_sources import load_qa_with_sources_chain
from langchain.memory import ConversationTokenBufferMemory

memory = ConversationTokenBufferMemory(llm=llm, memory_key="chat_history", return_messages=True,
                                       input_key='question', max_token_limit=1000)
question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT, verbose=True)
answer_chain = load_qa_with_sources_chain(llm, chain_type="stuff", verbose=True, prompt=chat_prompt)

chain = ConversationalRetrievalChain(
    retriever=reranker,
    question_generator=question_generator,
    combine_docs_chain=answer_chain,
    verbose=True,
    memory=memory,
    rephrase_question=False
)
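With memory attached, the chain only needs the new question on each turn; the chat history is tracked for us. A quick smoke test might look like this (the questions are just examples):

# Ask an initial question, then a follow-up that relies on the chat history
first = chain({"question": "What is few-shot prompting?"})
print(first["answer"])

follow_up = chain({"question": "Can you show me an example of it?"})
print(follow_up["answer"])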

Step 7: Build the CUI using Chainlit 🎨

Now that our chain is set up, we need a Conversational User Interface (CUI) to test and interact with our assistant. Enter Chainlit: The next generation of Streamlit for Langchain-powered conversational interfaces!

Chainlit is an amazing new open-source Python package that makes it incredibly fast to build and share LLM apps. It lets you create beautiful ChatGPT-like UIs on top of any Python code in minutes! Some of the key features include intermediary steps visualization, element management (images, text, carousel, etc.), and cloud deployment.

To get started with Chainlit, first install the library using pip—

pip3 install --upgrade chainlit

Next, we just need to wrap our previously built code with some Chainlit decorators. Chainlit integrates natively with Langchain, so we can use @on_chat_start to react to the user websocket connection event and @on_message to process and send the response from our Conversational Retrieval Chain.

...
import chainlit as cl
import pinecone
from chainlit import user_session
from prompts import load_query_gen_prompt, load_spark_prompt
from chainlit import on_message, on_chat_start

# Load Assistant Prompt
spark = load_spark_prompt()

# Load Query Generator Prompt
query_gen_prompt = load_query_gen_prompt()
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(query_gen_prompt)

# Initialize Pinecone Index
pinecone.init(
    api_key=os.environ.get("PINECONE_API_KEY"),
    environment='us-west1-gcp',
)
index_name = "spark"

@on_chat_start
def init():
    # Configure ChatGPT as the llm, along with memory and embeddings
    llm = ChatOpenAI(temperature=0.7, verbose=True, openai_api_key=os.environ.get("OPENAI_API_KEY"),
                     streaming=True, callbacks=[context_callback])
    memory = ConversationTokenBufferMemory(llm=llm, memory_key="chat_history", return_messages=True,
                                           input_key='question', max_token_limit=1000)
    embeddings = CohereEmbeddings(model='embed-english-light-v2.0',
                                  cohere_api_key=os.environ.get("COHERE_API_KEY"))

    # Load retriever from existing pinecone db
    docsearch = Pinecone.from_existing_index(
        index_name=index_name, embedding=embeddings
    )
    retriever = docsearch.as_retriever(search_kwargs={"k": 4})

    # Construct the chat prompt
    messages = [SystemMessagePromptTemplate.from_template(spark)]
    messages.append(HumanMessagePromptTemplate.from_template("{question}"))
    prompt = ChatPromptTemplate.from_messages(messages)

    # Load the query generator chain
    question_generator = LLMChain(llm=llm, prompt=CONDENSE_QUESTION_PROMPT, verbose=True)

    # Load the answer generator chain
    doc_chain = load_qa_with_sources_chain(llm, chain_type="stuff", verbose=True, prompt=prompt)

    # Final conversational retrieval chain
    chain = ConversationalRetrievalChain(
        retriever=retriever,
        question_generator=question_generator,
        combine_docs_chain=doc_chain,
        verbose=True,
        memory=memory,
        rephrase_question=False
    )

    # Set chain as a user session variable
    cl.user_session.set("conversation_chain", chain)


@on_message
async def main(message: str):
    # Read chain from user session variable
    chain = cl.user_session.get("conversation_chain")

    # Run the chain asynchronously with an async callback
    res = await chain.arun({"question": message}, callbacks=[cl.AsyncLangchainCallbackHandler()])

    # Send the answer and the text elements to the UI
    await cl.Message(content=res).send()

The full project code and indexing notebook can be found here. Start the Chainlit app using chainlit run app/spark.py, and voilà, our very own Prompt Assistant is alive! 🤖

SPARK in action ⚡

Step 8: Deploy & Share 🚀

The final step in building our RAG-powered assistant is to deploy and share it with the rest of the world 🌐

We can use Docker to containerize our Python web application and deploy it on any cloud provider of choice. You can find many great Chainlit Deployment tutorials online, as well as detailed guides on the chainlit docs, chainlit cookbook, and community page.

For this open-source project, I decided to use Huggingface Spaces for free cloud hosting and to try out their new Docker Spaces feature, which lets us deploy any Docker container and embed it anywhere! This is a game-changer for quickly deploying and sharing AI/ML applications.

For deployment, I used poetry as my package manager and modified the Chainlit Dockerfile Template to work with Huggingface Docker Spaces —

# The builder image, used to build the virtual environment
FROM python:3.11-slim-buster as builder

RUN apt-get update && apt-get install -y git

RUN pip install poetry==1.4.2

ENV POETRY_NO_INTERACTION=1 \
    POETRY_VIRTUALENVS_IN_PROJECT=1 \
    POETRY_VIRTUALENVS_CREATE=1 \
    POETRY_CACHE_DIR=/tmp/poetry_cache

WORKDIR /app

COPY pyproject.toml poetry.lock ./

RUN poetry install --without dev --no-root && rm -rf $POETRY_CACHE_DIR

# The runtime image, used to just run the code provided its virtual environment
FROM python:3.11-slim-buster as runtime

RUN useradd -m -u 1000 user

USER user

ENV HOME=/home/user \
    PATH="/home/user/.local/bin:$PATH" \
    VIRTUAL_ENV=/app/.venv \
    LISTEN_PORT=8000 \
    HOST=0.0.0.0

WORKDIR $HOME/app

COPY --from=builder --chown=user ${VIRTUAL_ENV} ${VIRTUAL_ENV}

COPY --chown=user ./app ./app
COPY --chown=user ./.chainlit ./.chainlit
COPY --chown=user chainlit.md ./

EXPOSE $LISTEN_PORT

RUN pip install -r app/requirements.txt

CMD ["chainlit", "run", "app/spark.py"]

The Dockerfile consists of two stages: builder and runtime. The builder sets up the virtual environment and installs dependencies using Poetry, while the runtime executes the code within the virtual environment. This enables containerization, facilitating easy deployment and scalability for chainlit-powered applications.

Results:

  • SPARK accurately answers from its vast prompting knowledge base.
  • Chainlit’s intermediate-step visualization is a handy feature for seeing how the chain works under the hood.
  • SPARK always cites the sources used to answer the question, improving trust & transparency.
  • SPARK is useful for learning new concepts and is also able to show prompt examples.

SPARK provides accurate and insightful answers to queries related to prompting. It can also act as a guide to learning the fundamental concepts of prompt design and engineering.

Feel free to explore SPARK here and experiment with its features to unleash the full potential of the Prompt Assistant. Hope it helps you on your prompting journey!

Conclusion

In conclusion, Retrieval Augmented Generation represents a significant breakthrough in the field of Natural Language Processing and Generative AI. RAG’s ability to seamlessly integrate large language models with traditional information retrieval techniques unlocks new possibilities for AI-powered applications, improving knowledge retrieval, content generation, and user experience.

Through the example of SPARK — Prompt Assistant, we see how Langchain and RAG can be combined to create intelligent assistants that facilitate natural, dynamic, and valuable AI interactions.

Embracing RAG can lead to improved AI experiences, better customer support, and more reliable and trustworthy language applications.

As these technologies continue to advance, we can expect to see more innovative and transformative use cases emerge that enhance our ability to access knowledge more effectively, propelling AI into new frontiers!

Resources & References

See you at the next one. Happy prompting!

Follow me on LinkedIn and feel free to reach out if you have any queries.
