Visualizing the Rates of Change in a Codebase Over Time With git-log(1)

See how your code and test coverage have changed over the life of your codebase

Noah Sussman
Better Programming


A growing plant with live and dead branches.
Photo: Jeremy Bishop

I was surprised at the reactions to a recent suggestion I made as to which metrics are useful in software quality assurance:

I thought more people knew how to view the rate of change of their test code versus their application code

Many people who saw the tweet above immediately asked how to make the measurement in the first place.

I know that measuring rate of change in Git isn’t an activity usually performed by generalist software engineers. But I had assumed a higher level of familiarity in the specialist software-testing community than apparently exists. This was especially interesting because many people were interested and intuitively drawn to the metric — they just didn’t know how to derive it.

Comparing the Rate of Change of Two Directories With git-log(1)

Typically, a software application’s source code is laid out such that a single top-level directory contains all application code (the codebase), and another top-level directory contains all of the automated test code and scaffolding (the testbase). (If your project isn’t laid out this way, keep reading and I’ll get to how to address that, or scroll to the bottom of the article if you’re impatient.)

For example, imagine a Git repository where all of the codebase is in the directory app/, and the testbase is in the directory test/.

The Git repo would look like this:

A Mac OS Finder window showing a folder called app and a folder called test
A Git repo laid out such that the codebase and the testbase are in separate top-level directories

Converting git-log(1) Output to a Time Series

The first thing to know here is that git-log(1) can take filesystem paths as parameters. So you’ll get different output from the command git log app versus the command git log test (if issued while standing in the Git repo root as pictured in the screenshot above).

However, the default output of git-log(1) is not a time series.

Below is a script that converts the output of git-log(1) to a time series. This script will produce a time series for the app directory and another time series for the test directory:
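A minimal sketch of such a script might look like the following. It assumes the app/ and test/ layout described above and a three-week window; the names SINCE, count_per_day, and commits_per_day are made up for illustration, and any line numbers mentioned elsewhere in this article may not match this sketch.

```shell
#!/bin/sh
# Sketch: commits per day for each top-level directory.
# Assumes it is run from the repository root and that app/ and test/ exist.

SINCE="three weeks ago"   # time window; any date expression git accepts

# Collapse a stream of YYYY-MM-DD dates (one per commit) into "count date" lines.
count_per_day() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Print the time series for the paths given as arguments.
# %ad is the author date, formatted as YYYY-MM-DD by --date=short.
commits_per_day() {
  git log --since="$SINCE" --date=short --format='%ad' -- "$@" | count_per_day
}

for dir in app test; do
  [ -d "$dir" ] || continue   # skip directories this repo does not have
  echo "$dir/"
  commits_per_day "$dir"
done
```

Run from the repository root, this prints a header line for each directory followed by one "count date" line for every day that had commits.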

So the output would look something like the following:

app/
1 2020-02-27
1 2020-03-03
2 2020-03-06
3 2020-03-11
test/
1 2020-01-13
1 2020-02-12
1 2020-02-25
1 2020-03-05

A line like 3 2020-03-11 means that on March 11, 2020, three commits touching that directory were made over the course of that whole day.

You can read down the left-hand side of the list to see how many commits were made on each day, from the earliest commit in the time window all the way up to the commit date of the current HEAD of the Git repository’s default branch. Days without any commits do not appear in the list.

Drawing Inferences From the Time Series

One thing that can immediately be inferred from the data above is that the codebase and the testbase are committed to on different days. In a real project this pattern would be a high-signal observation since, ideally, application code and its supporting test code should change together as functionality is added and modified.
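To make that inference mechanical, you could list the days on which the codebase changed but the testbase did not. The sketch below assumes each time series has been saved to a file in the "count date" format shown above; the function name days_only_in_first and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: list dates that appear in the first time series but not the second.
# Each input file holds "count date" lines; other lines are ignored.
# Usage (hypothetical file names): days_only_in_first app.txt test.txt

days_only_in_first() {
  first_dates=$(mktemp)
  second_dates=$(mktemp)
  # Keep only the date column; comm(1) requires sorted input.
  awk 'NF == 2 { print $2 }' "$1" | sort -u > "$first_dates"
  awk 'NF == 2 { print $2 }' "$2" | sort -u > "$second_dates"
  # -23 hides lines unique to the second file and lines common to both.
  comm -23 "$first_dates" "$second_dates"
  rm -f "$first_dates" "$second_dates"
}
```

With the sample data above, all four app/ dates would be printed, because none of them also appears in the test/ series.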

The shell script above is configured to only look at the last three weeks’ data. In a decade of continuous-integration practice, I’ve found three weeks (21 days) is a sufficiently wide temporal window in which trends in Git history can be observed without introducing a lot of noise from old and irrelevant source code changes.

You can change the time window from three weeks to anything you want by editing the --since parameter on line 8 of the shell script above.

Comparing the Rate of Change of Two Arbitrary Filesets

At the beginning of this article, I acknowledged that not everyone’s project is arranged into a codebase and testbase top-level directory structure.

Fortunately, no matter how a codebase and testbase are laid out, you can always visualize the Git history of both distinct sets of files (both filesets) by using find(1) to filter your source tree. This works because git-log(1) can take more than one path as an argument.

So it’s just a matter of using find(1) to make two collections of paths: one for the codebase and one for the testbase.

For instance, if your test files are stored side by side with your code files, then you can extract the paths of just your test files with find(1), as shown in the shell script below. In this example, I assume that all test filenames end with Test.php, so all test files are named something like exampleTest.php.

Here’s how to find all files ending in Test.php, regardless of where that file is located within the Git repository’s filesystem:
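A minimal sketch of that step, and of handing the result to git-log(1), might look like this. It assumes the Test.php naming convention described above, that no path contains whitespace (the list is expanded unquoted), and that the list fits on one command line; the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: commits per day across every *Test.php file in the repository.
# Assumes paths contain no whitespace, because $files is expanded unquoted.

# List test files under the given directory (default: current directory).
find_test_files() {
  find "${1:-.}" -type f -name '*Test.php'
}

# Collapse one-date-per-commit input into "count date" lines.
count_per_day() {
  sort | uniq -c | awk '{ print $1, $2 }'
}

# Print the testbase time series for the last three weeks.
testbase_history() {
  files=$(find_test_files "${1:-.}")
  [ -n "$files" ] || return 0   # no test files found, nothing to report
  # Word splitting of $files is intentional: one argument per path.
  git log --since="three weeks ago" --date=short --format='%ad' -- $files |
    count_per_day
}
```

Calling testbase_history from the repository root prints the time series for the testbase only.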

This will produce output similar to that shown above, except in this case only the history of the testbase is shown:

1 2020-01-13
3 2020-02-12
1 2020-02-25
1 2020-03-05

As before, this time series shows the number of commits per day that touched any Test.php file over the past three weeks. You can change the time window to whatever you want by changing the --since parameter on line 4 to some date other than three weeks ago.

Now, let’s move on to printing out the history of the codebase so it can be compared with the history of the testbase.

If all your application files have a common extension, say .php, you can collect the paths of all your codebase files (all of the files that could potentially be under test) with find(1) as follows:

find . -name "*.php" -not -name "*Test.php"

This will print out a list of all the files that end in .php except for those that end in Test.php, which will be omitted from the list.
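Feeding that list to git-log(1) could look like the sketch below. As with the testbase, it assumes paths without whitespace and a file list short enough for one command line; the function names find_code_files and codebase_history are hypothetical.

```shell
#!/bin/sh
# Sketch: commits per day across application files (*.php minus *Test.php).
# Assumes paths contain no whitespace, because $files is expanded unquoted.

# List application files under the given directory (default: current directory).
find_code_files() {
  find "${1:-.}" -type f -name '*.php' -not -name '*Test.php'
}

# Print the codebase time series for the last three weeks.
codebase_history() {
  files=$(find_code_files "${1:-.}")
  [ -n "$files" ] || return 0   # no application files found
  git log --since="three weeks ago" --date=short --format='%ad' -- $files |
    sort | uniq -c | awk '{ print $1, $2 }'
}
```

Calling codebase_history from the repository root prints one "count date" line for each day on which application files were committed.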

Like the script we used to look at testbase files, this script would produce a time series showing days when commits were made to codebase files, something like the following:

1 2020-01-13
1 2020-02-12
1 2020-02-25
1 2020-03-05

No matter how your codebase is laid out, you can use find(1) to collect the paths of your codebase and your testbase files so you can pass those paths into git-log(1) and compare the two histories!
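To compare the two histories at a glance, the two time series can be merged into a single table keyed by date. The sketch below assumes each series has been saved to its own file in the "count date" format used throughout this article; merge_series and the file names in the usage comment are hypothetical.

```shell
#!/bin/sh
# Sketch: merge two "count date" files into "date code_commits test_commits".
# Usage (hypothetical file names): merge_series codebase.txt testbase.txt

merge_series() {
  awk '
    # Skip headers and blank lines.
    NF != 2 { next }
    # First file: codebase counts.
    FILENAME == ARGV[1] { code[$2] = $1; seen[$2] = 1; next }
    # Second file: testbase counts.
    { tests[$2] = $1; seen[$2] = 1 }
    END {
      # A date missing from one series prints as 0 for that series.
      for (d in seen) printf "%s %d %d\n", d, code[d], tests[d]
    }
  ' "$1" "$2" | sort
}
```

Each output line holds a date, the number of codebase commits, and the number of testbase commits, so days where one side is zero stand out.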
