Improving the Performance of NumPy Code
Experiments on the runtime of NumPy code

In a recent project, we decided to use AUC-Recall@k as our primary metric for evaluating our model. We defined it as the proportion of area under an ideal Recall@k curve that our model achieved.
My first implementation used NumPy’s functions, and the performance wasn’t bad at all, but it made me wonder how much better I could have done... and how much worse.
Table of Contents
- The AUC-Recall@k Metric
- The Baseline Implementation
- Replacing the For-Loop with NumPy’s Functions
- Removing the Dataframe
- Using Numba
Updates
4.11.22: Add code gists
The AUC-Recall@k Metric
Our metric of choice calculates the proportion of the ideal area under the Recall@k curve that our model achieves.
AUC Recall@k = Area under Recall@k / Area under Ideal Recall@k
This required two computations: the ideal Recall@k curve and, of course, the Recall@k curve of our model. Let’s first define what Recall@k is.
Recall@k
This is a popular metric for evaluating recommender systems — and rankings in general — and is often computed in tandem with Precision@k.
Recall@k answers the question: Out of all the correct recommendations, how many are present in the top-k recommendations of your model?
Recall@k = # correct recommendations in the top k / # total correct
It’s clear that as k increases, so does Recall@k, ultimately reaching a value of 1 when k equals the total number of documents. Note that it may reach 1 for a smaller k as well.
Computing the Recall@k curve
Now that we have a way to compute Recall@k, we can get the curve by computing Recall@k for every k from 0 to the total number of recommendations.
Recall@k Curve = [Recall@0, Recall@1, Recall@2, Recall@3, ...]
Computing the ideal Recall@k curve
The intuition behind the ideal Recall@k curve is that in an ideal world, every relevant document would be ranked higher than every non-relevant one. The result would be a ranking order like this:
[1, 1, 1, ... 1, 0, 0 ... 0]
Every time we increase k, we’re always considering one more relevant document. And as soon as we’ve covered all the relevant documents, our Recall@k has already hit 1. After that point, it remains the same since there are no more relevant documents to affect our recall.
Ideal Recall@k = k / total relevant documents   (when k <= # relevant docs)
               = 1                              (when k > # relevant docs)
More succinctly,
Ideal Recall@k = min(1, k / total relevant documents)
And since we’re interested in the ideal Recall@k curve, we’ll evaluate this for each k.
Computing AUC-Recall@k
Now that we have our two curves — Recall@k and the ideal Recall@k — the AUC score would just be the ratio of their areas.
AUC-Recall@k = Area under Recall@k / Area under the ideal Recall@k
The Baseline Implementation
Let’s start with a rudimentary implementation to appreciate the performance improvements in subsequent versions. Note that this implementation is intentionally slow and only serves as a representation of poorly written code.
The dataframe
We’ll first create a dataframe of our model’s confidence scores and the target, where 1 represents a relevant document and 0 represents an irrelevant one.
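The original gist didn’t survive the import; here’s a minimal sketch of how that dataframe might be built (the "expected" column name appears in later code, while the "confidence" column and the raw arrays y_true and y_conf are assumed names):
import numpy as np
import pandas as pd

# y_true: relevance labels (1 = relevant, 0 = irrelevant);
# y_conf: the model's confidence score for each document.
conf_df = pd.DataFrame({"expected": y_true, "confidence": y_conf})

# Rank the documents by the model's confidence, highest first.
conf_df = conf_df.sort_values("confidence", ascending=False).reset_index(drop=True)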
Implementing the computation of the Recall@k curve
To calculate our model’s Recall@k curve, we iterate over the list of rankings and calculate the recall at each point.
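The gist for this step is also missing; a deliberately naive sketch of that loop (variable names are mine) could look like this:
# Intentionally slow: walk down the ranking row by row and
# record the recall at every cut-off k.
total_relevant = (conf_df["expected"] == 1).sum()
recall_at_k = []
relevant_so_far = 0
for _, row in conf_df.iterrows():
    if row["expected"] == 1:
        relevant_so_far += 1
    recall_at_k.append(relevant_so_far / total_relevant)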
Implementing the computation of the ideal Recall@k curve
As we defined in the previous section, the ideal Recall@k curve is a piecewise function. We’ll implement it using NumPy’s numpy.minimum function.
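Continuing the sketch above (total_relevant and conf_df come from the previous snippets):
# Ideal Recall@k = min(1, k / total relevant), evaluated at every k.
k = np.arange(1, len(conf_df) + 1)
ideal_recall_at_k = np.minimum(1.0, k / total_relevant)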
The complete implementation
Now that we have our two curves, we’ll calculate the ratio of their areas to get our final AUC-Recall@k metric. We’ll use NumPy’s numpy.trapz to compute this.
# Computing the final metric by getting the proportion of areas
return np.trapz(recall_at_k) / np.trapz(ideal_recall_at_k)
The entire implementation looks like this:
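Since the gist is missing, here’s a sketch that stitches the pieces above together (the function name is hypothetical):
def auc_recall_at_k_baseline(conf_df):
    # Rank documents by the model's confidence, highest first.
    conf_df = conf_df.sort_values("confidence", ascending=False)
    total_relevant = (conf_df["expected"] == 1).sum()

    # Recall@k curve via an explicit (intentionally slow) loop.
    recall_at_k = []
    relevant_so_far = 0
    for _, row in conf_df.iterrows():
        if row["expected"] == 1:
            relevant_so_far += 1
        recall_at_k.append(relevant_so_far / total_relevant)

    # Ideal Recall@k curve: min(1, k / total relevant) for every k.
    k = np.arange(1, len(conf_df) + 1)
    ideal_recall_at_k = np.minimum(1.0, k / total_relevant)

    # Computing the final metric by getting the proportion of areas
    return np.trapz(recall_at_k) / np.trapz(ideal_recall_at_k)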
Replacing the For-Loop with NumPy’s Functions
One of the best ways to speed up numerical code is to eliminate explicit loops: NumPy’s performance gains come from its vectorized functions, which run in optimized, compiled code rather than in the Python interpreter.
Computing the Recall@k curve
For example, to calculate the Recall@k curve for our model, we can use NumPy’s cumsum function.
Let’s first store our rankings as a NumPy array. Recall that conf_df was our dataframe containing the expected predictions and the confidence scores our model assigned to them.
ranking = conf_df["expected"].to_numpy()
The cumsum logic can be implemented like this. In our case, since 1 represents a relevant document, we can calculate the recall by simply counting the number of 1s in the first k ranks.
recall_at_k = (ranking == 1).cumsum() / (ranking == 1).sum()
Computing the ideal Recall@k Curve
The overall implementation of the ideal curve won’t change here, since we’re already using NumPy’s functions; however, we no longer have to use conf_df because we already have our rankings in the NumPy array ranking. We’ll substitute ranking into the existing implementation.
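A sketch of that substitution, reusing the min(1, k / total relevant) formula:
# Same piecewise formula, now driven by the ranking array instead of conf_df.
total_relevant = (ranking == 1).sum()
k = np.arange(1, len(ranking) + 1)
ideal_recall_at_k = np.minimum(1.0, k / total_relevant)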
The complete implementation
Our function now looks like this:
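The gist is missing here as well; assembling the snippets above, the vectorized version might look like this (function name hypothetical):
def auc_recall_at_k_np(conf_df):
    ranking = conf_df["expected"].to_numpy()
    total_relevant = (ranking == 1).sum()

    # Recall@k curve: cumulative count of relevant docs over the total.
    recall_at_k = (ranking == 1).cumsum() / total_relevant

    # Ideal Recall@k curve.
    k = np.arange(1, len(ranking) + 1)
    ideal_recall_at_k = np.minimum(1.0, k / total_relevant)

    return np.trapz(recall_at_k) / np.trapz(ideal_recall_at_k)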
Removing the Dataframe
In our previous approach, though we started with a dataframe, we ended up working with NumPy arrays by the end. Removing the dataframe should give us a decent improvement.
The main reason for creating the dataframe was simply to sort the predictions by our model’s confidence scores. This can be easily accomplished with NumPy alone: all we need to do is use the sorted indices of one array to reorder the other.
In our case, we’ll sort the confidence scores array in descending order, as before, and then use those indices to reorder our documents array.
Let’s call the documents array y_true and the confidence scores array y_conf. The sorting can be done with a combination of NumPy’s argsort and indexing.
ranking = y_true[np.argsort(y_conf)[::-1]]
The complete implementation
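Based on the timeit call shown later, the dataframe-free function presumably takes y_true and y_conf directly; a sketch:
def auc_recall_at_k_np_no_df(y_true, y_conf):
    # Sort the labels by descending confidence; no dataframe needed.
    ranking = y_true[np.argsort(y_conf)[::-1]]
    total_relevant = (ranking == 1).sum()

    recall_at_k = (ranking == 1).cumsum() / total_relevant

    k = np.arange(1, len(ranking) + 1)
    ideal_recall_at_k = np.minimum(1.0, k / total_relevant)

    return np.trapz(recall_at_k) / np.trapz(ideal_recall_at_k)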
Using Numba
While running these experiments, I was curious whether there was an easier way to get an additional boost in performance. I came across Numba, which, according to its documentation, is:
Numba is an open source JIT compiler that translates a subset of Python and NumPy code into fast machine code.
Their quickstart guide titled “A ~5-minute guide to Numba” mentions that Numba works best with loops and NumPy code but not with pandas. Since we were already using NumPy exclusively in the previous approach, it seemed a good idea to give it a go.
While I haven’t explored all that Numba has to offer, I found its jit decorator, which optimizes any function it’s applied to.
So, our final Numba approach entails simply tacking Numba’s jit decorator onto our previous function.
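A sketch of that final version, assuming the body stays within the NumPy subset Numba supports (the boolean mask is cast to floats to keep nopython mode happy):
from numba import jit

@jit(nopython=True)
def auc_recall_at_k_np_no_df_numba(y_true, y_conf):
    ranking = y_true[np.argsort(y_conf)[::-1]]

    # Cast the boolean mask to floats for Numba-friendly reductions.
    relevant = (ranking == 1).astype(np.float64)
    total_relevant = relevant.sum()

    recall_at_k = relevant.cumsum() / total_relevant

    k = np.arange(1, len(ranking) + 1)
    ideal_recall_at_k = np.minimum(1.0, k / total_relevant)

    return np.trapz(recall_at_k) / np.trapz(ideal_recall_at_k)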
The Results
We’ll compare the performance of these approaches over four dataset sizes:
- 5 documents
- 100 documents
- 10k documents
- 100k documents
The raw values
To run these experiments, I made use of the timeit magic command.
For example, on the 100k dataset, the Numba approach’s performance was measured like this:
%timeit -o auc_recall_at_k_np_no_df_numba(np.random.randint(0, 2, 100000), np.random.rand(100000))
The final raw times (in milliseconds) were:

Relative performance
Let’s first look at the relative performance of all our experiments compared to the one with the least time — Numba on five data points. In the graph below, we see that the relative performance of our experiments improves pretty significantly as we go from the baseline approach, which had a for loop, to the approach that used Numba’s performance boost.

Key observation
Once we commit to using NumPy (the second experiment onwards), we see a very significant jump in performance. As we improve further, we get diminishing returns.
On 10k and 100k documents
If we zoom into the 10k and 100k dataset sizes, the trend is easier to visualize. The baseline implementation sees the most performance deterioration as we increase our dataset’s size. On the other side of the spectrum, adding Numba’s performance boost gives us a 30–40% faster implementation, which isn’t too bad considering that all we had to do was add four letters at the top of our function.
The data labels in the plots below are the raw times (in ms):


References and Sample Code
Code
References
- NumPy: https://numpy.org/
- Numba’s five-minute guide: https://numba.readthedocs.io/en/stable/user/5minguide.html
- The timeit magic: https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-timeit
Conclusion
This article isn’t meant to be a comprehensive guide on writing better code. Rather, it’s meant to show what’s possible. While I’m sure there’s further scope for improvement here — for example, by exploring Numba further or storing repeated function calls in variables — it’s clear how much of a difference replacing loops with NumPy’s built-in functions makes.
If nothing else, I hope readers see how something so trivial can save you hours.