Build a Chatbot on Your CSV Data With LangChain and OpenAI

Chat with your CSV files with a memory chatbot🤖 | Made with Langchain🦜 and OpenAI🧠

Yvann
Better Programming


Image made with Stable Diffusion

In this article, we’ll see how to build a simple chatbot🤖 with memory that can answer your questions about your own CSV data.

Hi everyone! In the past few weeks, I have been experimenting with the fascinating potential of large language models to create all sorts of things, and it’s time to share what I’ve learned!

We’ll use LangChain🦜 to link gpt-3.5 to our data and Streamlit to create a user interface for our chatbot.

Unlike ChatGPT, which offers limited context on our data (we can only provide it a maximum of 4,096 tokens), our chatbot will be able to process CSV data and manage a large database thanks to embeddings and a vectorstore.
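To get a feel for that limit, you can count tokens with tiktoken (one of the libraries we install below). A minimal sketch, using a row from the sample file we load later:

import tiktoken

# count how many tokens a piece of text consumes in the prompt
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
row = "venue_name: McGinnis Sisters\nvenue_type: Market\nphone: 412-858-7000"
print(len(enc.encode(row)))  # a single CSV row costs only a few dozen tokens

Since each row is tiny compared to the 4,096-token window, we only need to send the model the handful of rows relevant to a question, which is exactly what the vectorstore will let us do.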

A diagram of the process used to create a chatbot on your data, from LangChain Blog

The code

Now let’s get practical! We’ll build our chatbot on CSV data with very little Python code.

Disclaimer: This code is a simplified version of the chatbot I created, and it is not optimized to reduce OpenAI API costs. For a more performant and optimized chatbot, feel free to check out my GitHub project yvann-hub/Robby-chatbot or just test the app at Robby-chatbot.com 🚀.

  • First, we’ll install the necessary libraries:

pip install streamlit streamlit_chat langchain openai faiss-cpu tiktoken
  • Import the libraries needed for our chatbot:
import streamlit as st
from streamlit_chat import message
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import FAISS
import tempfile
  • We ask the user to enter their OpenAI API key and upload the CSV file on which the chatbot will be based.
  • To test the chatbot at a lower cost, you can use this lightweight CSV file: fishfry-locations.csv
user_api_key = st.sidebar.text_input(
    label="#### Your OpenAI API key 👇",
    placeholder="Paste your openAI API key, sk-",
    type="password")

uploaded_file = st.sidebar.file_uploader("upload", type="csv")
  • If a CSV file is uploaded by the user, we load it using the CSVLoader class from LangChain.
if uploaded_file:
    # use tempfile because CSVLoader only accepts a file_path
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tmp_file.write(uploaded_file.getvalue())
        tmp_file_path = tmp_file.name

    loader = CSVLoader(file_path=tmp_file_path, encoding="utf-8",
                       csv_args={'delimiter': ','})
    data = loader.load()
  • The LangChain CSVLoader class splits a CSV file into one document per row. We can see this by displaying the content of data:
st.write(data)
0:"Document(page_content='venue_name: McGinnis Sisters\nvenue_type: Market\nvenue_address: 4311 Northern Pike, Monroeville, PA\nwebsite: http://www.mcginnis-sisters.com/\nmenu_url: \nmenu_text: \nphone: 412-858-7000\nemail: \nalcohol: \nlunch: True', metadata={'source': 'C:\\Users\\UTILIS~1\\AppData\\Local\\Temp\\tmp6_24nxby', 'row': 0})"
1:"Document(page_content='venue_name: Holy Cross (Reilly Center)\nvenue_type: Church\nvenue_address: 7100 West Ridge Road, Fairview PA\nwebsite: \nmenu_url: \nmenu_text: Fried pollack, fried shrimp, or combo. Adult $10, Child $5. Includes baked potato, homemade coleslaw, roll, butter, dessert, and beverage. Mac and cheese $5.\nphone: 814-474-2605\nemail: \nalcohol: \nlunch: ', metadata={'source': 'C:\\Users\\UTILIS~1\\AppData\\Local\\Temp\\tmp6_24nxby', 'row': 1})"
  • Splitting the CSV file into rows lets us feed it to our vectorstore (FAISS) through OpenAI embeddings.
  • Embeddings turn each row produced by CSVLoader into a vector, and those vectors form an index over the content of each row of the file.
  • In practice, when the user asks a question, a similarity search is run against the vectorstore, and the best-matching row(s) are handed to the LLM, which rephrases their content into a formatted answer for the user (see the short retrieval sketch after the next code block).
  • I recommend deepening your understanding of vectorstores and embeddings for a better grasp of what follows.
    # pass the key entered in the sidebar so the call doesn't rely on an env variable
    embeddings = OpenAIEmbeddings(openai_api_key=user_api_key)
    vectorstore = FAISS.from_documents(data, embeddings)
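To see what that retrieval step returns, you can temporarily drop a similarity search into the script. A minimal sketch (the query string and k=2 are just example values I chose):

    # illustrative only: fetch the rows whose embeddings are closest
    # to an example question, before any LLM rephrasing happens
    docs = vectorstore.similarity_search("Which venues serve fried shrimp?", k=2)
    for doc in docs:
        st.write(doc.page_content)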
  • We then create the ConversationalRetrievalChain, providing it with the desired chat model gpt-3.5-turbo (or gpt-4) and the FAISS vectorstore that holds our file transformed into vectors by OpenAIEmbeddings().
  • This chain gives us a chatbot with memory while relying on the vectorstore to find relevant information in our document (a quick check follows the code below).
    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(temperature=0.0, model_name='gpt-3.5-turbo',
                       openai_api_key=user_api_key),
        retriever=vectorstore.as_retriever())
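Before wiring the chain to the UI, you can check that it really uses the conversation history by calling it directly with a handmade chat_history. A quick sketch with made-up questions (not part of the final app):

    # first turn: no history yet
    first = chain({"question": "Which venues are churches?",
                   "chat_history": []})
    # follow-up turn: "they" can only be resolved thanks to the history
    follow_up = chain({"question": "What do they serve?",
                       "chat_history": [("Which venues are churches?",
                                         first["answer"])]})
    st.write(follow_up["answer"])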
  • This function passes the user’s question and the conversation history to the ConversationalRetrievalChain to generate the chatbot’s response.
  • st.session_state['history'] stores the conversation history while the user is on the Streamlit site.

If you want to add improvements to this chatbot you can check my GitHub 👀

    def conversational_chat(query):
        # send the question plus the running history to the chain
        result = chain({"question": query,
                        "chat_history": st.session_state['history']})
        st.session_state['history'].append((query, result["answer"]))

        return result["answer"]
  • We initialize the chatbot session by creating st.session_state['history'] and the first messages displayed in the chat.
  • ['generated'] holds the chatbot’s responses.
  • ['past'] holds the messages sent by the user.
  • Containers are not essential but help improve the UI by placing the user’s question area below the chat messages.
    if 'history' not in st.session_state:
        st.session_state['history'] = []

    if 'generated' not in st.session_state:
        st.session_state['generated'] = ["Hello ! Ask me anything about " + uploaded_file.name + " 🤗"]

    if 'past' not in st.session_state:
        st.session_state['past'] = ["Hey ! 👋"]

    # container for the chat history
    response_container = st.container()
    # container for the user's text input
    container = st.container()
  • Now that the session state and containers are configured, we can set up the UI that lets the user type and send a question, which is passed as an argument to our conversational_chat function.
    with container:
        with st.form(key='my_form', clear_on_submit=True):

            user_input = st.text_input("Query:", placeholder="Talk about your csv data here (:", key='input')
            submit_button = st.form_submit_button(label='Send')

        if submit_button and user_input:
            output = conversational_chat(user_input)

            st.session_state['past'].append(user_input)
            st.session_state['generated'].append(output)
  • This last part displays the user’s and the chatbot’s messages on the Streamlit site using the streamlit_chat module.
    if st.session_state['generated']:
        with response_container:
            for i in range(len(st.session_state['generated'])):
                message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
                message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
  • All that’s left is to launch the script:
streamlit run name_of_your_chatbot.py #run with the name of your file
The result after launching the last command

Et voilà! You now have a beautiful chatbot running with LangChain, OpenAI, and Streamlit, capable of answering your questions based on your CSV file!

I hope this article helps you create nice things. Do not hesitate to contact me on Twitter or at barbot.yvann@gmail.com if you need anything. 💬

You can also find the full project on my GitHub.
