
Fine-tuning My First WizardLM LoRA

Paolo Rechia
Published in Better Programming · 16 min read · May 13, 2023

Photo by Diane Picchiottino on Unsplash

Previously, I wrote about creating AI Agents with Langchain and a local LLM such as Vicuna. If you’re unfamiliar with the topic and are interested in learning more, I recommend that you read my previous article to get started.

Today, I’m taking this idea a couple of steps further.

First, we’ll use a much more powerful model with Langchain’s Zero Shot ReAct tooling: the WizardLM 7B model.

Second, we’ll use a couple of prompts with an LLM to generate a dataset that can be used to fine-tune any language model to understand how to use the Langchain Python REPL tool. In this example, we’ll fine-tune WizardLM itself using my fork of the alpaca-lora codebase.

Why are we doing this? Unfortunately, most models are not good at using the more complex tools in the Langchain library, and we’d like to improve that. Our end goal is to have local LLMs running efficiently with Langchain tools, without needing as much prompting as we currently do.

In summary, these are the sections of this article:

  1. About the WizardLM
  2. Generating the list of tasks
  3. Logging the prompts/outputs
  4. Executing the tasks
  5. Consolidating the dataset
  6. Fine-tuning the LoRA
  7. The result and next steps

The WizardLM model

In my experience, WizardLM 7B performs much better at coding tasks, produces fewer syntax errors, and understands how to use the Langchain tools much better than the standard Vicuna models.

I personally had better results using the unquantized version, which you can find here.

There are also several quantized versions out there, as well as versions for running the model on CPU; they’re not hard to find (in fact, the same user on Hugging Face uploads several of these variants: https://huggingface.co/TheBloke).

To use this model, there are a few options:

  1. reuse our previous code for the Vicuna model. Since it’s the same architecture, we’re only changing the weights; you can find the code here.
  2. use the text-generation-webui and enable its API

For more info on how to set up your local environment, please read the README in my repo linked above.
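For example, if you go with the second option, a minimal client call could look roughly like the sketch below. This assumes the webui’s legacy blocking API on localhost port 5000; the exact endpoint, port, and payload fields may differ between versions, so treat it only as an illustration.

import requests

# Assumes text-generation-webui was started with its API enabled
response = requests.post(
    "http://localhost:5000/api/v1/generate",
    json={"prompt": "Write a haiku about LoRAs.", "max_new_tokens": 200},
)
print(response.json()["results"][0]["text"])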

NOTE: if you want the prompt logging that we’ll see next, you have to use my server, since that’s where I implemented it.

Generating the list of tasks

I’ll assume here that you have managed to run either the text-generation-webui or my sample llama server, or that you’ve figured out your own solution. You could also probably use simpler/smaller models for this task.

Once you have an LLM that you can call somehow, we’ll write a short script. First, we write a few tasks ourselves as example input for the model, and make sure the temperature is set high enough to get variance in the model’s outputs. Then we just create an infinite loop and append the results to a file. Here’s what my script looked like:

from langchain_app.models.llama_http_llm import build_llama_base_llm

output = None
while True:
    # Use a high temperature to get variance in the generated tasks
    params = {"temperature": 1.3, "max_new_tokens": 1024, "stop": []}
    llm = build_llama_base_llm(parameters=params)
    output = llm._call("""
You are given a list of tasks. Please extend it with new unique tasks:

1. "Print hello world to the terminal",
2. "Fetch a Chuck Norris joke from this endpoint https://api.chucknorris.io/jokes/random",
3. "Parse this HTML page https://api.chucknorris.io/ and find all the API endpoints ",
4. "Generate 10 unique cat jokes and store them in a CSV file with two columns, punch line and joke finisher",
5. "Connect to a Postgres database and return the existing databases names. Use the following credentials: \n\nhost localhost\nport 7036\nuser admin\npassword admin",
6. "List the existing files in the current directory",
7. "Find out your existing working directory",
8. "Fix the syntax error of this code snippet:\ndef myfunc():\n\tprint("hello",
9. "Find the keys of the JSON payload stored in the variable response_json",
10. "Extract the key called 'address' from the JSON stored in the variable json_ and store into a variable called address",
11. "Create a joke about AI bots and save it in a local text file",
12. "Create an unit test for the following snippet of code:\ndef sum_2(x, y):\n\treturn x + y",
13. "Create random data and plot it using matplotlib and store the result as a .PNG image",
14. "Download a CSV file about suicide from the webpage https://catalog.data.gov/dataset/?res_format=CSV and plot a bar chart comparing the suicide numbers of male vs. female",
15. "Design a Todo list system. Write the explanation in a file called 'todo_list_system_design.txt'",
16. "Search for the source code called 'example.py' in the directory, inspect the file, write unit tests for it and execute them to make sure everything is correct.",
17. "Write a data pipeline that ingests data from the Crime Data from 2020 to present from https://catalog.data.gov/dataset/?res_format=CSV. Use the requests and pandas, save the csv to the local disk. Create a directory if necessary, give an appropriate name"
"""
    )
    with open("generated_tasks.txt", "a") as fp:
        fp.write(output)

You’ll get both good and bad output. For instance, I got this back, which seemed useful to me:

These are all great tasks that will keep you busy for a while!
Note: Some tasks may require additional parameters, such as file names, API keys or database credentials.
The new tasks are:

11. "Create a joke about Submarine and save it in a local text file",
12. "Create an utillity function to read the content of a file in a specific line and return it, save it to a variable",
13. "Create a pipeline that extracts data from a response stored in the variable response and save it to a CSV file named training_data.csv , using pandas",
14. "Create another pipeline that extracts data from a response stored in the variable response and save it to a CSV file named testing_data.csv using pandas",
15. "Create a report about the verified dog breeds in the world using the pygame library, displays as splash screen with actual sodk wupps",
16. "Convert a zip file to a torrent file and vice versa using Python and store both in the folder downloads",
17. "Create a function that determines the size of a file or directory using Python and save it to a variable",
18. "Create a script that checks for updated dependencies in package.json and restart the server if necessary",
19. "Scrape data from the website www.ebay.com and save it to a CSV file, using Beautiful Soup and pandas",
20. "Create a function that counts the occurrences of a word in a given text, using Python and save it to a variable"
Was not able to complete all tasks even after retrying for a few days. Please advice.t

But I also got a lot of junk in between, which means we’ll have to clean it up later. I let this script run for about 8 hours and was happy with the amount of content it generated.

To filter, I used two simple snippets. First a bash script:

cat generated_tasks.txt \
| tr -s ' ' | \
grep -oE '\s*[0-9]+\.[A-Za-z, ]+[A-Za-z, ]+\.' | \
awk 'length >= 50' | \
sed -e 's/[0-9\. ]*//' > filtered_generated.txt

In plain English, this will:

  1. squeeze repeated spaces into a single space
  2. keep only lines that start with a number and a dot
  3. keep only lines that are at least 50 characters long
  4. remove the numbers and dot from the start of the line
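If you’d rather stay in Python, a rough equivalent of that filter could look like the sketch below (the regex only approximates the one in the bash pipeline above, so the results may differ slightly):

import re

with open("generated_tasks.txt", "r") as fp:
    text = re.sub(r" +", " ", fp.read())  # squeeze repeated spaces

filtered = []
for match in re.finditer(r"\s*[0-9]+\.[A-Za-z, ]+\.", text):
    line = match.group(0)
    if len(line) >= 50:
        # strip the leading numbering, dots, and spaces from the task
        filtered.append(re.sub(r"^[0-9. ]*", "", line).strip())

with open("filtered_generated.txt", "w") as fp:
    fp.write("\n".join(filtered) + "\n")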

This is quite restrictive and does cut out a lot of good content. In fact, the example output above was also cut off! But that was not a problem; I still got 570 lines of output as a result, which was sufficient for my experiment.

However, I did notice there was duplication, so I wrote a simple Python script:

import json

with open("filtered_generated.txt", "r") as fp:
    tasks = fp.readlines()

with open("dedup_generated_tasks.json", "w") as fp:
    json.dump(list(set(tasks)), fp, indent=4)

And now we have a nice list of tasks in JSON format:

[
    "Create a function that folds a list of strings into a single string.\n",
    "Write data canvas for Google Maps using JavaScript and store the result as an image file.\n",
    "Creating a joke about AI bots is a fun task for practicing creating and writing humor.\n",
    "Create a web application using Flask and Tasken, that allows users to sell their old clothes online.\n",
    "Write a function that reminds you of a task every hour.\n",
    ...
]

You can see the full generated list here.

Logging the prompts/outputs

OK, so we want to generate pairs of inputs and outputs using our prompted WizardLM.

Here’s the prompt I had previously written:

"""

For instance:

Question: Find out how much 2 plus 2 is.
Thought: I must use the Python shell to calculate 2 + 2
Action: Python REPL
Action Input:
2 + 2
Observation: 4

Thought: I now know the answer
Final Answer: 4

Example 2:
Question: You have a variable age in your scope. If it's greater or equal than 21, say OK. Else, say Nay.
Thought: I should write an if/else block in the Python shell.
Action: Python REPL
Action Input:
if age >= 21:
    print("OK") # this line has four spaces at the beginning
else:
    print("Nay") # this line has four spaces at the beginning

Observation: OK
Thought: I have executed the task successfully.
Final Answer: I have executed the task successfully.

Example 3:

Question: Write and execute a script that sleeps for 2 seconds and prints 'Hello, World'
Thought: I should import the sleep function.
Action: Python REPL
Action Input:
from time import sleep
Observation:

Thought: I should call the sleep function passing 2 as parameter
Action: Python REPL
Action Input:
sleep(2)
Observation:

Thought: I should use the 'print' function to print 'Hello, World'
Action: Python REPL
Action Input:
print('Hello, World')
Observation:

Thought: I now finished the script
Final Answer: I executed the following script successfully:

from time import sleep
sleep(2)
print('Hello, World')


Additional Hints:
1. If an error thrown along the way, try to understand what happened and retry with a new code version that fixes the error.
2. DO NOT IGNORE ERRORS.
3. If an object does not have an attribute, call dir(object) to debug it.
4. SUPER IMPORTANT: ALWAYS respect the indentation in Python. Loops demand an indentation. For example:

for i in range(10):
    print(i) # this line has four spaces at the beginning

Same for ifs:

if True:
    print("hello") # this line has four spaces at the beginning

An error will be thrown because of the indentation, something like... "expected an indented block after 'for' statement on line..."

To fix, make sure to indent the lines!

5. Do not use \ in variable names, otherwise you'll see the syntax error "unexpected character after line continuation character..."
6. If the variable is not defined, use vars() to see the defined variables.
7. Do not repeat the same statement twice without a new reason.
8. NEVER print the HTML directly.

Now begin for real!

Question: {}
"""

Alright, it’s quite long. But it has a nice hint at the end: "Now begin for real!". However, we don’t want the original prompt to appear in the dataset we’re generating.

So the first thing we should make sure of is that we cut the prompt out of what we’re logging:

Here’s a simple idea:

def log(self, input_str, prefix="input"):
    filename = os.path.join(self._dir, f"{prefix}_{self.input_step}")
    with open(filename, "w") as fp:
        if prefix == "input":
            input_str = input_str.split("Now begin for real!\n")[1]
        fp.write(input_str)

Then, when we log the input, we save something like this instead:


Question: Write a script to generate and delivery a gag joke to the user based on their current mood and mentioned fruits.


Thought:

Much leaner! And with this, we hope to train the model so that eventually it does not need that prompt anymore.

Of course, we want to log all the steps executed by Langchain, so I wrote this singleton class:

class PromptLogger:
    _instances = {}

    @staticmethod
    def get(session):
        if session not in PromptLogger._instances:
            PromptLogger._instances[session] = PromptLogger(session)
        return PromptLogger._instances[session]

    def __init__(self, session) -> None:
        self.input_step = 0
        self.output_step = 0
        self.session = session
        self._dir = f"logged_prompts/session_{session}/"
        try:
            os.makedirs(self._dir)
        except FileExistsError:
            pass

    def log(self, input_str, prefix="input"):
        filename = os.path.join(self._dir, f"{prefix}_{self.input_step}")
        with open(filename, "w") as fp:
            if prefix == "input":
                input_str = input_str.split("Now begin for real!\n")[1]
            fp.write(input_str)

        if prefix == "input":
            self.input_step += 1
        elif prefix == "output":
            self.output_step += 1
        else:
            raise ValueError("Invalid prefix")

The idea is that we can track sessions across different requests, generating several pairs of input/output files for each chain. After an execution, each session directory contains a sequence of numbered input_* and output_* files.

If we peek at input_2 for example, we see:


Question: Create a function that folds a list of strings into a single string.


Thought:I should define a function that takes a list of strings and returns a single string with all the strings concatenated with a space in between.
Action: Python REPL
Action Input:
def fold_list(lst):
    return ' '.join(lst)

Observation:
Thought:I have defined the function
Action: Python REPL
Action Input:
fold_list([1, 2, 3])

Observation: sequence item 0: expected str instance, int found
Thought:

And we find the new action taken by the model inside output_3:

I should check if the first element of the list is not a string
Action: Python REPL
Action Input:
fold_list([1, 2, 'hello'])

For this to work, our client code also has to pass a session ID indicating that a new Langchain chain has started; we then just read this from the request.

In the client:

params = {
    "temperature": 0,
    "max_new_tokens": 2048,
    "stop": ["Observation:"],
    "logging_session": f"medium_size_dataset{idx+offset}"  # dynamic number
}

llm = build_llama_base_llm(parameters=params)

In the server:

if prompt_request.logging_session is not None:
    prompt_logger = PromptLogger.get(prompt_request.logging_session)
    prompt_logger.log(prompt_request.prompt, prefix="input")
    prompt_logger.log(output, prefix="output")

Please see the full server code here if you’re interested.

Executing the tasks

This part is actually pretty simple! Now that we’ve figured out how to log our input/output pairs, we can just write a short script that reads our generated tasks and calls our server with a unique session ID for each task.

import json

from langchain.agents import AgentType, Tool, initialize_agent
from langchain.tools.python.tool import PythonAstREPLTool  # import path may vary by langchain version

from langchain_app.models.llama_http_llm import build_llama_base_llm

# prompt_template holds the long ReAct prompt from the previous section
offset = 376
with open("task_generation/dedup_generated_tasks.json", "r") as fp:
    tasks = json.load(fp)
tasks = tasks[offset:]


for idx, task in enumerate(tasks):
    params = {
        "temperature": 0,
        "max_new_tokens": 2048,
        "stop": ["Observation:"],
        "logging_session": f"medium_size_dataset{idx+offset}",
    }

    llm = build_llama_base_llm(parameters=params)
    python_tool = PythonAstREPLTool()

    tools = [
        Tool(
            name="Python REPL",
            func=python_tool,
            description="useful for when you need to execute Python code",
        ),
    ]
    agent = initialize_agent(
        tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
    )
    first_task = tasks[idx]
    try:
        agent.run(prompt_template.format(first_task))
    except Exception:
        pass

I added an offset so that I could edit it and resume processing the task list at a later time.

I also had to catch an exception, as sometimes the LLM fails to generate valid output and Langchain raises a parse exception.

Unfortunately, I noticed the following problems while executing:

  1. The model tries to install packages and fails
  2. The Langchain agent tooling sometimes does not return the stdout correctly as part of the Observation (the stdout comes before the Observation), which confuses the LLM
  3. The model tends to ask for user input, blocking the flow
  4. The model sometimes exits the chain by calling sys.exit()
  5. The model often gets stuck in an infinite loop

These problems significantly degrade the generated dataset, and they also required me to restart the execution of the tasks quite a few times (hence the offset).

Consolidating the dataset

We now want to make the logged prompts easy to ingest into a training pipeline. We’ll open the logged input/output pairs, zip them together, and then consolidate them. To keep things simple, each pair will be saved into its own text file.

For instance, merging an input with an output will result in something like this:

#####PROMPT: 
Question: Also, tools like requests or wget can be used to download and save the CSV file.


Thought:#####OUTPUT: I should use the requests library to download a CSV file from a website.
Action: Python REPL
Action Input:
import requests
url = 'https://example.com/file.csv'
response = requests.get(url)
with open(url, 'w') as file:
    file.write(response.content)

This shows us very clearly what the prompt is and what the expected output is.

So here’s a script to make this conversion:

import os

dataset_folder = "medium_size_generated_tasks"
# -1 means no limit on the number of actions per task
max_actions_per_task = -1

if __name__ == "__main__":
    try:
        os.makedirs(dataset_folder)
    except FileExistsError:
        pass
    dir_ = "logged_prompts/"
    sessions = os.listdir(dir_)
    datapoints = 0
    for session in sessions:
        session_dir = os.path.join(dir_, session)
        logs_files = os.listdir(session_dir)
        inputs_step_tuple = [log.split("_") for log in logs_files if "input" in log]
        outputs_step_tuple = [log.split("_") for log in logs_files if "output" in log]
        inputs_step_tuple.sort(key=lambda x: x[1])
        outputs_step_tuple.sort(key=lambda x: x[1])
        i = 0
        for input_tuple, output_tuple in zip(inputs_step_tuple, outputs_step_tuple):
            input_filename = input_tuple[0] + "_" + input_tuple[1]
            output_filename = output_tuple[0] + "_" + output_tuple[1]
            input_ = os.path.join(session_dir, input_filename)
            output_ = os.path.join(session_dir, output_filename)
            with open(input_, "r") as fp:
                prompt = fp.read()
            with open(output_, "r") as fp:
                output = fp.read()
            datapoint_filename = os.path.join(dataset_folder, f"{datapoints}.txt")
            with open(datapoint_filename, "w") as fp:
                fp.write(f"#####PROMPT: {prompt}")
                fp.write(f"#####OUTPUT: {output}")
            datapoints += 1
            i += 1
            if i == max_actions_per_task:
                break

If we want to convert these files into a single JSON, we can then add a few extra steps:

import json
import os

dataset_list = []
dir_ = "medium_size_generated_tasks"
files_ = os.listdir(dir_)
for f in files_:
    filename = os.path.join(dir_, f)
    print(filename)
    with open(filename, "r") as fp:
        txt = fp.read()
    prompt = txt.split("#####PROMPT:")[1].split("#####OUTPUT:")[0].strip()
    output = txt.split("#####OUTPUT:")[1].strip()
    dataset_list.append({
        "prompt": prompt,
        "output": output,
    })

with open("data.json", "w") as fp:
    json.dump(dataset_list, fp, indent=4)

I’ve uploaded the text files I generated to Hugging Face.

Beware, the resulting dataset was quite dirty! This was the first iteration, so we’d most likely need additional steps to control the data quality.
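As a first pass, something like the sketch below could help. The heuristics and the data_cleaned.json file name are my own assumptions based on the failure modes listed earlier; they are not part of the original pipeline.

import json

# Drop pairs whose output shows known failure modes, plus exact duplicates
BAD_MARKERS = ("pip install", "input(", "sys.exit(")

with open("data.json", "r") as fp:
    dataset = json.load(fp)

seen = set()
cleaned = []
for item in dataset:
    pair = (item["prompt"], item["output"])
    if pair in seen:
        continue
    if any(marker in item["output"] for marker in BAD_MARKERS):
        continue
    seen.add(pair)
    cleaned.append(item)

with open("data_cleaned.json", "w") as fp:
    json.dump(cleaned, fp, indent=4)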

Fine-tuning the LoRA

The easiest way for me was to fork the alpaca-lora repository and modify its fine-tuning script.

As a result, here’s the file: https://github.com/paolorechia/vicuna-react-lora/blob/main/finetune_wizard_react.py

It’s worth going over some of the modifications I made. First, I’m fine-tuning WizardLM itself, as mentioned in the introduction.

# Wizard
model_path = "TheBloke/wizardLM-7B-HF"

model = LlamaForCausalLM.from_pretrained(
    model_path,
    load_in_8bit=True,
    device_map="auto",
)
tokenizer = LlamaTokenizer.from_pretrained(
    model_path,
    add_eos_token=True
)

I’m then injecting into the prompt the ReAct prelude that Langchain adds to all prompts:

react_prompt_prelude = """
Received prompt: Answer the following questions as best you can. You have access to the following tools:

Python REPL: A Python shell. Use this to execute python commands. Input should be a valid python command. If you want to see the output of a value, you should print it out with `print(...)`.
Search: useful for when you need to ask with search

Use the following format:

Question: the input question you must answer
Thought: you should always think about what to do
Action: the action to take, should be one of [Python REPL, Search]
Action Input: the input to the action
Observation: the result of the action
... (this Thought/Action/Action Input/Observation can repeat N times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question

Begin!

"""

And then we just rebuild everything:

def generate_prompt(data_point):
    return react_prompt_prelude + data_point["prompt"] + data_point["output"] + "\n\nObservation:"

As you might notice here, we end up concatenating the prompt with the output, which means that some of the logic of splitting input/output was not strictly required for fine-tuning this LoRA! However, having the dataset in this format makes it easier to use it for other models.
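For context, here’s a rough sketch of how each dataset entry could then be turned into a tokenized training example in the alpaca-lora style; the exact tokenization code in my fork differs in details, so treat this only as an approximation that uses the tokenizer defined above and the CUTOFF_LEN constant listed in the training parameters below:

def tokenize(data_point):
    # Concatenate the prelude, prompt, and output, then tokenize with truncation
    full_prompt = generate_prompt(data_point)
    result = tokenizer(
        full_prompt,
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    )
    return {
        "input_ids": result["input_ids"],
        "attention_mask": result["attention_mask"],
    }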

Here are the parameters I used for training:

MICRO_BATCH_SIZE = 4
BATCH_SIZE = 64
GRADIENT_ACCUMULATION_STEPS = BATCH_SIZE // MICRO_BATCH_SIZE
EPOCHS = 10
LEARNING_RATE = 3e-4
CUTOFF_LEN = 256
LORA_R = 8
LORA_ALPHA = 16
LORA_DROPOUT = 0.05
VAL_SET_SIZE = 1000
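To give an idea of where these constants end up, here’s a hedged sketch of how they typically plug into peft’s LoraConfig and the transformers TrainingArguments in an alpaca-lora style script; the actual script in my fork is the source of truth and may differ:

from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    target_modules=["q_proj", "v_proj"],  # common choice for LLaMA-family models
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="lora-wizard-react",  # assumed name, matching the path used later
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    num_train_epochs=EPOCHS,
    learning_rate=LEARNING_RATE,
    fp16=True,
)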

Of course, it’s also good to note where the fine-tuning script saves its output, and you need to install the requirements to execute it. Once that’s done, we can use the results!

One last step, however. I ran into an issue where saving the model was not working correctly, so I had to copy the binary checkpoint. The symptom was that the fine-tuning had absolutely no effect on the model. Luckily, I found the solution here.

The result and next steps

So, assuming you were able to fine-tune your model, you can now use it with my server by setting the following environment variables:

export USE_FINE_TUNED_LORA=true
export LORA_WEIGHTS=../vicuna-react-lora/lora-wizard-react # Change to your LoRA path
export MODEL_PATH=TheBloke/wizardLM-7B-HF
uvicorn servers.vicuna_server:app
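Under the hood, loading the LoRA on top of the base model boils down to something like the sketch below; this is not the actual server code (see the linked repo for that), just an approximation using the peft library:

import os
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM

model_path = os.environ["MODEL_PATH"]      # e.g. TheBloke/wizardLM-7B-HF
lora_weights = os.environ["LORA_WEIGHTS"]  # path to the fine-tuned LoRA

model = LlamaForCausalLM.from_pretrained(
    model_path, load_in_8bit=True, device_map="auto"
)
if os.environ.get("USE_FINE_TUNED_LORA") == "true":
    # Wrap the base model with the LoRA adapter weights
    model = PeftModel.from_pretrained(model, lora_weights, torch_dtype=torch.float16)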

With the LoRA loaded in memory, I ran the following script. Note that there is no prompt engineering here this time.

from langchain.agents import load_tools
from langchain.agents import initialize_agent
from langchain.agents import AgentType

from langchain_app.models.vicuna_request_llm import VicunaLLM

# First, let's load the language model we're going to use to control the agent.
llm = VicunaLLM()

# Next, let's load the Python REPL tool for the agent to use.
tools = load_tools(["python_repl"], llm=llm)

# Finally, let's initialize an agent with the tools, the language model, and the type of agent we want to use.
agent = initialize_agent(
    tools, llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True
)

# Now let's test it out!
agent.run("""Write a Python script that prints 'Hello, world!'""")

Here’s the LoRA executing the task.

(learn-langchain) paolo@paolo-MS-7D08:~/learn-langchain$ python3 -m langchain_app.agents.finetuned_hello
/home/paolo/learn-langchain/langchain_app/utils/deprecation_warning.py:4: UserWarning: 'langchain_app.models.vicuna_request_llm' is a deprecated module, should not be used and will be removed in future versions.
warnings.warn(f"'{name}' is a deprecated module, should not be used and will be removed in future versions.")


> Entering new AgentExecutor chain...
I should import the print function.
Action: Python REPL
Action Input:
import print

Observation: No module named 'print'
Thought:I should fix the import
Action: Python REPL
Action Input:
import print

Observation: No module named 'print'
Thought:I should check the documentation
Final Answer:
Action: Python REPL
Action Input:
import print


> Finished chain.

So the results are not good; the model insists that it needs to import the print function. But before judging too quickly, let’s compare with the vanilla WizardLM running the same prompt:

I should start by importing the Python standard library
Action: Import the standard library
Action Input: None
Observation: Import the standard library is not a valid tool, try another one.
Thought:I could use the print() function to output the string
Action: Use the print() function to output the string
Action Input: 'Hello, world!'
Observation: Use the print() function to output the string is not a valid tool, try another one.
Thought:I could use the input() function to get user input
Action: Use the input() function to get user input
Action Input: 'Hello, world!'

The original model completely fails to use the right format and just gets stuck in an infinite loop.

Of course, there’s a lot of work to be done to improve the quality of the training data before we have a good fine-tune, but it’s still interesting to see how the fine-tuning changed our model’s behavior, and to notice how it can appropriately use the ReAct framework with a really small prompt.

I shared the LoRA weights on Hugging Face if you want to play with it, but don’t expect good results from it!

So let’s summarize the problems we had and deduce our next steps:

  1. WizardLM’s license is not permissive, so the experiment should be repeated with a model under a more permissive license
  2. We need to allow the model to install packages on its own, so that we don’t end up with a training dataset that is just a bunch of failed installation attempts
  3. We should clean the dataset a bit more before fine-tuning, to achieve even better results

There are plenty of options to try for item (1), so it shouldn’t be a problem.

Regarding item (2), however, I’m still figuring out how to approach it, as it will probably involve writing a new custom tool for Langchain. But perhaps that will be the topic of the next article.
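Just to illustrate the idea, such a tool could look roughly like the sketch below; the name, description, and behavior are assumptions of mine, not code from the repo:

import subprocess
import sys

from langchain.agents import Tool

def install_package(package_name: str) -> str:
    # Install a pip package in the current environment and report the result
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", package_name.strip()],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return f"Installation failed: {result.stderr[-500:]}"
    return f"Successfully installed {package_name.strip()}"

pip_install_tool = Tool(
    name="Pip Install",
    func=install_package,
    description="useful for when you need to install a Python package before importing it",
)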

Last but not least, I also see a lot of potential in this approach for other tools that expect specific formats, like AutoGPT.

Hope you enjoyed this one!


Written by Paolo Rechia

Software Developer / Data Engineer - Connect with me on Linkedin: https://www.linkedin.com/in/paolo-rechia/
