The Right Way to Select the Best Prompting Strategy for Your LLM
How to find the optimal prompting strategy by treating your prompts as just another hyperparameter in your machine learning pipeline

Overview
In my last post, I introduced an idea for how to leverage GPT to augment your training dataset. For this post, I want to switch gears a bit and discuss prompt engineering.
Coming up with the right prompt is more of an art than a science. People have quickly realized that one of the best ways to leverage large language models (LLMs) is to invest time in crafting the best prompt. This makes sense, as many of these LLMs have been trained on giant corpora of text (in some cases, the majority of the open internet). In a sense, these models should already have the context needed to answer nearly all of our questions; the challenge lies in figuring out how to craft the right prompt to coax that information out.
Prompt engineering is quickly establishing itself as the next data science frontier. Someone told me that the coding language of the future is going to be English. Understanding how to interact with these models is going to be critical and will decide who extracts the most value from these advances.
Take a look at the following interaction with ChatGPT:

This is an admittedly silly example, but it illustrates just how susceptible the model is to a good prompt. In the rest of the article, I’m going to cover how to find the best prompt for your problem.
Problem Framing
One important aspect I glossed over in my previous post was how I came up with which prompt to use. If you remember, the goal of the article was to predict how likely recently discharged patients were to be readmitted to the hospital. To leverage generative AI, I asked GPT to output enhanced descriptions of each patient, which I incorporated as an additional feature in my model, and which ultimately helped improve its accuracy.
The exact prompt I used for each patient was:
Tell me what a diagnosis code of {723} is and how may it affect a {caucasian} {female} patient between the age of {50} and {60} with an admission type of {elective} that was in the hospital for {1} day and was taking {21} medications? Be precise and answer only in clear facts using fewer than 100 words.
But why did I choose this particular prompt? Would my results have been any different if I had chosen a different prompt? How am I supposed to know which prompt is the best for my problem?
These are all important questions, and, unfortunately, there's no one-size-fits-all answer. There is, however, an approach that will help: treat each of these choices as just another hyperparameter that you need to optimize as part of your model training process.
Much like carving out an internal train-test split to decide which learning rate or max depth to set when fitting an XGBoost model, you can think of choosing the optimal prompt as an additional parameter you need to search over.

We perform our hyperparameter optimization within this nested train-validation split. That means we retrain a model for every unique combination of hyperparameters and evaluate which combination leads to the best accuracy on the held-out 20% validation split.
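To make that concrete, here's a minimal sketch of the standard version of this search (no prompts involved yet), assuming scikit-learn and the xgboost package; X and y are placeholders for your feature matrix and readmission labels, and AUC is just an example metric:

from sklearn.model_selection import ParameterGrid, train_test_split
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

# Carve a 20% validation split out of the training data
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)

# Retrain a model for every combination and keep the one that scores best on the 20% split
best_score, best_params = -float("inf"), None
for params in ParameterGrid({"max_depth": [1, 3, 5, 7], "learning_rate": [0.05, 0.1]}):
    model = XGBClassifier(**params).fit(X_train, y_train)
    score = roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    if score > best_score:
        best_score, best_params = score, params
print(best_params, best_score)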
Since we’re treating our prompt as just another hyperparameter, all we need to do is come up with a list of potential values to grid search over. For example, instead of the above prompt, I could just as easily have phrased my prompt in any of the following ways:
- You are an analyst at a hospital network and you want to build a predictive model that forecasts readmission risk across your patient population. What other descriptive detail can you tell me about a {caucasian} {female} patient between the age of {50} and {60} with a diagnosis code of {723} that may be helpful to include as a feature in a machine learning model?
- What are common symptoms for {caucasian} {female} patients between the age of {50} and {60} with diagnosis codes of {723}?
- What are potential treatments for {caucasian} {female} patients between the age of {50} and {60} with diagnosis codes of {723}? Are there any complications associated with these treatments?
- Tell me about a diagnosis code of {723} and how it may affect {caucasian} {female} patients between the age of {50} and {60}, in particular.
These are just a handful of examples — the universe of potential prompts is infinite. I’m sure you can come up with even better prompts than the ones I listed above (honestly, the hardest part was typing out all those squiggly brackets 😪).
Hyperparameter (Grid) Search
Our next step is to incorporate these prompts into our hyperparameter optimization.

When building a typical scikit-learn pipeline with an XGBoost classifier, you likely specify a set of hyperparameters to optimize using something like GridSearchCV. In our case, we’re going to expand the idea of hyperparameters to the entire pipeline instead of just the classifier, including the step where we prompt GPT to give us enhanced patient descriptions (GPT data augmentation with optimal prompt selection in the above blueprint). The set of hyperparameter values that we test will look something like this:
- max_depth: [1, 3, 5, 7]
- learning_rate: [0.05, 0.1]
- missing_value_imputation: [mean, median]
- roberta_max_sequence_length: [64, 128]
- gpt_data_augmentor: [prompt_1, prompt_2, prompt_3, prompt_4, prompt_5]
When all is said and done, we will have trained 160 models with different combinations of hyperparameters.
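To sanity-check that count, here's the same grid written out as a plain Python dictionary (the parameter names simply mirror the list above, and the prompt entries stand in for whichever candidate prompts you wrote):

from itertools import product

param_grid = {
    "max_depth": [1, 3, 5, 7],
    "learning_rate": [0.05, 0.1],
    "missing_value_imputation": ["mean", "median"],
    "roberta_max_sequence_length": [64, 128],
    "gpt_data_augmentor": ["prompt_1", "prompt_2", "prompt_3", "prompt_4", "prompt_5"],
}

# 4 * 2 * 2 * 2 * 5 = 160 unique hyperparameter combinations
print(len(list(product(*param_grid.values()))))  # 160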
To do this programmatically, we first have to loop through each prompt and re-generate GPT responses across all of our patients like so:
prompts = [
    prompt_1,
    prompt_2,
    prompt_3,
    prompt_4,
    prompt_5,
]

# Generate one GPT-enhanced description column per candidate prompt
for idx, prompt in enumerate(prompts):
    df[f"enhanced_description_{idx+1}"] = df.apply(
        lambda x: get_description(x, prompt), axis=1
    )
This code will output five additional text columns in our dataset, one for each prompt.
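The get_description helper isn't shown above, but a minimal sketch might look like the following, assuming the openai Python package, an API key set in the environment, and a prompt template whose named {placeholders} match the DataFrame's column names (the model name is just an example):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_description(row, prompt):
    # Fill the prompt template's {placeholders} with this patient's values
    filled_prompt = prompt.format(**row.to_dict())
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # example model; swap in whichever GPT model you use
        messages=[{"role": "user", "content": filled_prompt}],
    )
    return response.choices[0].message.content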
Once we have our updated dataset, all we need to do is create separate feature lists, each combining the original features with one of the new description columns. It’s not your standard Bayesian hyperparameter optimization, but it gets the job done just the same!
Here’s what the different feature lists look like:
- original features + prompt_1
- original features + prompt_2
- original features + prompt_3
- original features + prompt_4
- original features + prompt_5
At this point, the hard work is pretty much done and all we need to do is train our five different models to see which one ends up with the best performance on our external validation set.
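Here's a rough sketch of that comparison loop. Everything in it (original_features, build_model, the train/validation DataFrames, the readmitted column, and AUC as the metric) is a placeholder for whatever your actual pipeline uses, and in practice you'd rerun the full hyperparameter search for each feature list rather than a single fit:

from sklearn.metrics import roc_auc_score

results = {}
for idx in range(1, 6):
    features = original_features + [f"enhanced_description_{idx}"]
    # build_model() stands in for your pipeline (text vectorization + XGBoost, in my case)
    model = build_model().fit(train_df[features], train_df["readmitted"])
    preds = model.predict_proba(valid_df[features])[:, 1]
    results[f"prompt_{idx}"] = roc_auc_score(valid_df["readmitted"], preds)

# Higher AUC is better; the winning key tells you which prompt to keep
best_prompt = max(results, key=results.get)
print(best_prompt, results[best_prompt])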
Results
I used the DataRobot modeling platform to run all of these experiments quickly. Interestingly, it was the second prompting strategy that led to the most accurate model (in this case lower scores are better since we’re measuring gamma deviance):

Our original prompting strategy led to the second most accurate results, so we weren’t too far off. Regardless, this exercise shows that prompts matter. Had we skipped this analysis and just gone with our gut, we would have missed out on a substantially better model!
Extra Credit
Until now, I’ve had to outline all of my prompts in advance. Ugghhh, how time-consuming! Why should I have to do all this work when I could be doing more productive things, like endlessly doomscrolling Instagram? Shouldn’t there be a way to automatically create these prompts?
Enter GPT…again.
I bet we can prompt GPT to come up with its own prompts (wow, meta!). I haven’t tried this yet, but the idea is to explain the problem to GPT, describe the patient data we have, and ask it to generate a prompt that an LLM could use to generate our responses. If you get a chance to try this out, let me know how it works in the comments.
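If you want a starting point, a meta-prompt along these lines might do the trick (untested, and the field list is just the one we used above):

meta_prompt = """
I'm building a machine learning model that predicts hospital readmission risk.
For each patient I know: diagnosis code, race, gender, age range, admission type,
length of stay in days, and number of medications.
Write a prompt template, with {placeholders} for those fields, that I could send to
an LLM so it generates a short, factual patient description to use as a model feature.
"""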
That’s it for this post.
Connect with me on LinkedIn for more helpful data science tips. Thanks for reading!