Benchmarking in Go: Substrings vs. Regular Expressions

Stephen Wayne · Published in Better Programming · Aug 19, 2022

In this third part of a four-part series, we’ll cover benchmarking in Go, using our text filtering tool as the test subject.

Previously, we built a Go program to remove unwanted text, and we extended it with regular expression support. Now we’ll learn about benchmarking in Go by comparing the original substring search approach against the newer regular expression pattern matching. Note that the methods used here are simple, intended to support a discussion of benchmarking rather than to prove that one method is objectively faster than the other.

Go’s standard library testing package provides extensive test support (you might have seen functions of the form TestXxx(t *testing.T)), as well as benchmarking support via the testing.B type. B manages benchmark timing and iteration counts, and provides the usual utility functions to clean up, log, or fail a test.

Using these tools, we can learn whether the tool runs faster (under these scenarios) with substring search or with pattern matching.

Setting Up the Benchmarks

Go benchmarks are functions that live in a <filename>_test.go file, where <filename> is typically the name of the file containing the functions to be tested. Both files are typically in the same package, and benchmark functions take the form of:

func BenchmarkXxx(b *testing.B) {
    // benchmark code
}
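For instance, here’s a minimal, self-contained benchmark (a toy, not one of our tool’s) that times strings.Contains; the testing package chooses b.N for us:

package main

import (
	"strings"
	"testing"
)

// Package-level sink so the compiler can't optimize the call away.
var sink bool

// BenchmarkContains is a toy benchmark showing the standard shape: the
// framework picks b.N, and the loop body is what gets timed.
func BenchmarkContains(b *testing.B) {
	line := "hello, big bright world"
	for i := 0; i < b.N; i++ {
		sink = strings.Contains(line, "bright")
	}
}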

When writing a benchmark, we typically do some code-specific setup and then enter a loop that runs the code under test b.N times (with b.N chosen by the testing package). When the benchmark finishes, we are told how many iterations ran and the average time each iteration took. From there, we can analyze the code’s performance.

We will start by setting up a []string with a few lines of text, common to both the substring and pattern-match searches. We’ll then build a config containing either a set of keys or a pattern, run lineMatches() (inside the benchmark loop) against the input text, and measure how long each iteration takes. We’ll do this with varying numbers of key phrases and varying pattern complexity (each built to match the same text) and compare the results.
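For readers following along without the repo handy, here is a rough stand-in for the config and lineMatches() described in earlier parts; the real definitions live in the project, so treat the field names and signature here as assumptions:

// Stand-ins approximating the tool's types (same _test.go file as above;
// add "regexp" to the imports). Field names are assumptions.
type config struct {
	keys    []string       // substrings to search for
	pattern *regexp.Regexp // compiled pattern, if using regex mode
}

// lineMatches reports whether a line matches the pattern (if set) or any
// of the substring keys, mirroring the behavior described earlier.
func lineMatches(c config, line string) bool {
	if c.pattern != nil {
		return c.pattern.MatchString(line)
	}
	for _, key := range c.keys {
		if strings.Contains(line, key) {
			return true
		}
	}
	return false
}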

Building the Benchmarks

Our benchmarks will be pretty simple: they’ll all use the same input text and generate a config with either keys or a pattern. We’ll make congruent substring search and pattern matching benchmarks for each query type (a sketch follows the query list below). Then we just need to run the tests and compare the results.

Our input text will be the same as example/input.txt. We’ll benchmark three queries against it:

Benchmark matching “hello” — this will match all lines in our input

Benchmark matching “big” — this will match two input lines

Benchmark matching “big,” “brig,” “bight,” and “bright” — this will match three input lines
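Using the stand-ins above, a congruent pair of benchmarks for the “big” query might look like the following. The input lines here are placeholders chosen to match the counts above, not the actual contents of example/input.txt:

// Placeholder input: "hello" appears on every line, "big" on two lines,
// and big/brig/bight/bright collectively match three lines.
var inputText = []string{
	"hello, big world",
	"hello there, big things coming",
	"hello from the bright side",
	"hello, plain line",
}

func BenchmarkSubstringBig(b *testing.B) {
	cfg := config{keys: []string{"big"}}
	for i := 0; i < b.N; i++ {
		for _, line := range inputText {
			lineMatches(cfg, line)
		}
	}
}

func BenchmarkPatternBig(b *testing.B) {
	cfg := config{pattern: regexp.MustCompile("big")}
	for i := 0; i < b.N; i++ {
		for _, line := range inputText {
			lineMatches(cfg, line)
		}
	}
}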

Running the Benchmarks

Finally, we can run all of our benchmarks by running this command:

go test -bench=.

in the root directory of the project repo.
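Each line of the resulting output has the general shape below (placeholders stand in for the measured values, which vary by machine):

BenchmarkXxx-10    <iterations>    <ns per loop>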

The first column is the name of the benchmark. The second is how many loops were used (think: the value of b.N), and the third column is the average time it took to run a loop. In our case, each loop was searching through all of inputText for either the substring or pattern matches.

Note that the -10 appended to each test name indicates the number of CPU cores used for the test (10 in my case).

We can see that, given the above tests on an Apple Silicon Mac, substring search is anywhere between 3.55x and 83.74x faster than pattern matching as implemented.

Wrapping Up

Does this mean we should always choose substring search? Not necessarily.

In software, we must select the right tool for the job. Can you drive a screw with a hammer? Technically yes, but it’s far from ideal. If you can match the text you care about with one or a few substrings, substring search will likely be faster.

But, there are plenty of benefits to using regular expressions. They will generally translate well between tools and can answer more complex queries than substrings (without significant repetition). One might also consider how much real-world time the performance difference translates to. As we can see, this will depend on query complexity and the input size.

In many cases, the performance difference may not be significant (fractions of a second to less than a minute).

Also, note that the above tests are not comprehensive, but they likely represent a variety of common queries. For example, it would be nice to see how performance scales with input size; that is left as an exercise for the reader. We also have the option to call b.ResetTimer() just before the b.N loop if the initial setup takes significant time, though there’s no need for that here. And as usual, there’s much more in the benchmarking side of the testing package to dive into!
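For completeness, here’s a sketch of that b.ResetTimer() pattern; buildExpensiveInput and search are hypothetical stand-ins, not functions from this project:

// Hypothetical stand-ins so the sketch compiles.
func buildExpensiveInput() []string { return make([]string, 1<<20) }
func search(data []string) int      { return len(data) }

func BenchmarkWithSetup(b *testing.B) {
	data := buildExpensiveInput() // costly one-time setup
	b.ResetTimer()                // exclude the setup above from the timing
	for i := 0; i < b.N; i++ {
		search(data)
	}
}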

If there are additional use cases you’d like to see benchmarked, please let me know in the comments! In the next article (the last of a four-part series), we’ll do some command line benchmarking of this tool against tools like grep (thanks for the idea, Aaron!).


Written by Stephen Wayne

Backend cloud engineer at HashiCorp. Former Electrical Engineer turned to the dark side.