Build a Chatbot on Your CSV Data With LangChain and OpenAI

Chat with your CSV files with a memory chatbot🤖 | Made with Langchain🦜 and OpenAI🧠

Yvann
Better Programming


Image made with Stable Diffusion

In this article, we’ll see how to build a simple chatbot🤖 with memory that can answer your questions about your own CSV data.

Hi everyone! In the past few weeks, I have been experimenting with the fascinating potential of large language models to create all sorts of things, and it’s time to share what I’ve learned!

We’ll use LangChain🦜 to link gpt-3.5 to our data and Streamlit to create a user interface for our chatbot.

Unlike ChatGPT, which offers limited context on our data (we can only provide it a maximum of 4,096 tokens), our chatbot will be able to process CSV data and manage a large database thanks to embeddings and a vectorstore.
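To get a feel for that limit, you can count tokens with tiktoken (one of the libraries we install below). A minimal sketch, using a row from the sample file we load later:

import tiktoken

# count how many tokens a piece of text consumes in the prompt
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
row = "venue_name: McGinnis Sisters\nvenue_type: Market\nphone: 412-858-7000"
print(len(enc.encode(row)))  # a single CSV row costs only a few dozen tokens

Since each row is tiny compared to the 4,096-token window, we only need to send the model the handful of rows relevant to a question, which is exactly what the vectorstore will let us do.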

A diagram of the process used to create a chatbot on your data, from LangChain Blog

The code

Now let’s get practical! We’ll build our chatbot on CSV data with very little Python code.

Disclaimer: This code is a simplified version of the chatbot I created, and it is not optimized to reduce OpenAI API costs. For a more performant and optimized chatbot, feel free to check out my GitHub project yvann-hub/Robby-chatbot or just test the app at Robby-chatbot.com 🚀.

  • First, we’ll install the necessary libraries:

pip install streamlit streamlit_chat langchain openai faiss-cpu tiktoken
  • Import the libraries needed for our chatbot:
import streamlit as st
from streamlit_chat import message
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders.csv_loader import CSVLoader
from langchain.vectorstores import FAISS
import tempfile
  • We ask the user to enter their OpenAI API key and upload the CSV file on which the chatbot will be based.
  • To test the chatbot at a lower cost, you can use this lightweight CSV file: fishfry-locations.csv
user_api_key = st.sidebar.text_input(
    label="#### Your OpenAI API key 👇",
    placeholder="Paste your openAI API key, sk-",
    type="password")

uploaded_file = st.sidebar.file_uploader("upload", type="csv")
  • If a CSV file is uploaded by the user, we load it using the CSVLoader class from LangChain.
if uploaded_file:
    # use tempfile because CSVLoader only accepts a file_path
    with tempfile.NamedTemporaryFile(delete=False) as tmp_file:
        tmp_file.write(uploaded_file.getvalue())
        tmp_file_path = tmp_file.name

    loader = CSVLoader(file_path=tmp_file_path, encoding="utf-8",
                       csv_args={'delimiter': ','})
    data = loader.load()
  • The LangChain CSVLoader class splits a CSV file into one document per row. We can see this by displaying the content of data:
st.write(data)
0:"Document(page_content='venue_name: McGinnis Sisters\nvenue_type: Market\nvenue_address: 4311 Northern Pike, Monroeville, PA\nwebsite: http://www.mcginnis-sisters.com/\nmenu_url: \nmenu_text: \nphone: 412-858-7000\nemail: \nalcohol: \nlunch: True', metadata={'source': 'C:\\Users\\UTILIS~1\\AppData\\Local\\Temp\\tmp6_24nxby', 'row': 0})"
1:"Document(page_content='venue_name: Holy Cross (Reilly Center)\nvenue_type: Church\nvenue_address: 7100 West Ridge Road, Fairview PA\nwebsite: \nmenu_url: \nmenu_text: Fried pollack, fried shrimp, or combo. Adult $10, Child $5. Includes baked potato, homemade coleslaw, roll, butter, dessert, and beverage. Mac and cheese $5.\nphone: 814-474-2605\nemail: \nalcohol: \nlunch: ', metadata={'source': 'C:\\Users\\UTILIS~1\\AppData\\Local\\Temp\\tmp6_24nxby', 'row': 1})"
  • Splitting the CSV file into rows lets us feed it to our vectorstore (FAISS) through OpenAI embeddings.
  • Embeddings turn each row produced by CSVLoader into a vector, and those vectors form an index over the content of each row of the file.
  • In practice, when the user asks a question, a similarity search is run against the vectorstore, and the best-matching row(s) are handed to the LLM, which rephrases their content into a formatted answer for the user (see the short retrieval sketch after the next code block).
  • I recommend deepening your understanding of vectorstores and embeddings for a better grasp of what follows.
    # pass the key entered in the sidebar so the call doesn't rely on an env variable
    embeddings = OpenAIEmbeddings(openai_api_key=user_api_key)
    vectorstore = FAISS.from_documents(data, embeddings)
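To see what that retrieval step returns, you can temporarily drop a similarity search into the script. A minimal sketch (the query string and k=2 are just example values I chose):

    # illustrative only: fetch the rows whose embeddings are closest
    # to an example question, before any LLM rephrasing happens
    docs = vectorstore.similarity_search("Which venues serve fried shrimp?", k=2)
    for doc in docs:
        st.write(doc.page_content)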
  • We then create the ConversationalRetrievalChain, providing it with the desired chat model gpt-3.5-turbo (or gpt-4) and the FAISS vectorstore that holds our file transformed into vectors by OpenAIEmbeddings().
  • This chain gives us a chatbot with memory while relying on the vectorstore to find relevant information in our document (a quick check follows the code below).
    chain = ConversationalRetrievalChain.from_llm(
        llm=ChatOpenAI(temperature=0.0, model_name='gpt-3.5-turbo',
                       openai_api_key=user_api_key),
        retriever=vectorstore.as_retriever())
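Before wiring the chain to the UI, you can check that it really uses the conversation history by calling it directly with a handmade chat_history. A quick sketch with made-up questions (not part of the final app):

    # first turn: no history yet
    first = chain({"question": "Which venues are churches?",
                   "chat_history": []})
    # follow-up turn: "they" can only be resolved thanks to the history
    follow_up = chain({"question": "What do they serve?",
                       "chat_history": [("Which venues are churches?",
                                         first["answer"])]})
    st.write(follow_up["answer"])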
  • This function passes the user’s question and the conversation history to the ConversationalRetrievalChain to generate the chatbot’s response.
  • st.session_state['history'] stores the conversation history while the user is on the Streamlit site.

If you want to add improvements to this chatbot you can check my GitHub 👀

    def conversational_chat(query):
        # send the question plus the running history to the chain
        result = chain({"question": query,
                        "chat_history": st.session_state['history']})
        st.session_state['history'].append((query, result["answer"]))

        return result["answer"]
  • We initialize the chatbot session by creating st.session_state['history'] and the first messages displayed in the chat.
  • ['generated'] holds the chatbot’s responses.
  • ['past'] holds the messages sent by the user.
  • Containers are not essential but help improve the UI by placing the user’s question area below the chat messages.
    if 'history' not in st.session_state:
        st.session_state['history'] = []

    if 'generated' not in st.session_state:
        st.session_state['generated'] = ["Hello ! Ask me anything about " + uploaded_file.name + " 🤗"]

    if 'past' not in st.session_state:
        st.session_state['past'] = ["Hey ! 👋"]

    # container for the chat history
    response_container = st.container()
    # container for the user's text input
    container = st.container()
  • Now that the session state and containers are configured, we can set up the UI that lets the user type and send a question, which is passed as an argument to our conversational_chat function.
    with container:
        with st.form(key='my_form', clear_on_submit=True):

            user_input = st.text_input("Query:", placeholder="Talk about your csv data here (:", key='input')
            submit_button = st.form_submit_button(label='Send')

        if submit_button and user_input:
            output = conversational_chat(user_input)

            st.session_state['past'].append(user_input)
            st.session_state['generated'].append(output)
  • This last part displays the user’s and the chatbot’s messages on the Streamlit site using the streamlit_chat module.
    if st.session_state['generated']:
        with response_container:
            for i in range(len(st.session_state['generated'])):
                message(st.session_state["past"][i], is_user=True, key=str(i) + '_user', avatar_style="big-smile")
                message(st.session_state["generated"][i], key=str(i), avatar_style="thumbs")
  • All that’s left is to launch the script:
streamlit run name_of_your_chatbot.py #run with the name of your file
The result after launching the last command

Et voilà! You now have a beautiful chatbot running with LangChain, OpenAI, and Streamlit, capable of answering your questions based on your CSV file!

I hope this article helps you create nice things. Do not hesitate to contact me on Twitter or at barbot.yvann@gmail.com if you need anything. 💬

You can also find the full project on my GitHub.
