How Homebrew Serves 52M Packages per Month

Container registries are general-purpose data stores, so let’s learn from the Homebrew case study

Published in Better Programming · 6 min read · Feb 15, 2023

It is hard to imagine a developer’s macOS machine without Homebrew, an open-source package manager for installing software that Apple does not provide. It has also gained support for Linux and for Windows (via the Windows Subsystem for Linux).

Initially, Homebrew worked by downloading software from upstream and building it on your own computer, following the instructions specified in formulæ written in the Ruby programming language (see the terminology).

As its popularity grew, Homebrew gained two important features. First, it began managing casks, native distributions of macOS apps by different vendors, such as Sublime Text or Firefox. Second, it started offering pre-built software in the form of bottles, allowing users to download roughly the same software from a trusted Internet source instead of heating the air by building it. Bottles save precious compilation time at the cost of storing and serving the binaries. How exactly does this work for Homebrew?

Let us consider the example of SQLite, an embedded relational database. If we look at the corresponding Homebrew formula, sqlite.rb, written in a domain-specific language in Ruby, we notice the build instructions in the install method and several hexadecimal identifiers in the bottle block.

Homebrew uses these build instructions to produce bottles and uploads them to the Internet, so that you can skip the expensive build step on your computer by downloading the pre-built binaries for your platform.

If one runs a command like brew install sqlite on any system for which the bottle is available, Homebrew will download that bottle and install its contents to the system.

In the last month, Homebrew served more than 52M package installations, and the real number might be higher because some users disable analytics. Even if each bottle took just one mebibyte, that would amount to roughly 50 TiB of traffic per month, which is prohibitively expensive on most hosting providers. Homebrew is a non-profit project that accepts donations from sponsors, but there’s little chance that donations would cover the many-thousand-dollar monthly bill that a service such as Amazon S3 would charge for this amount of traffic.
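The back-of-the-envelope estimate above can be reproduced in the shell. This is only a sketch: it assumes, as the text does, one mebibyte per bottle, while real bottles vary widely in size. The variable names are made up for illustration.

```shell
# Rough monthly traffic estimate: installations x assumed bottle size.
installs=52000000   # installations per month, the analytics figure quoted above
bottle_mib=1        # assumption: 1 MiB per bottle (real bottles differ a lot)

# awk does the floating-point division that plain shell arithmetic lacks.
traffic_tib=$(awk -v n="$installs" -v s="$bottle_mib" \
  'BEGIN { printf "%.1f", n * s / 1024 / 1024 }')

echo "Estimated traffic: ${traffic_tib} TiB per month"
```

With these inputs, the result is 49.6 TiB, so even the most optimistic bottle size lands on the order of 50 TiB per month.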

Fortunately, the Homebrew team managed to find a hosting partner. Initially, they used the now-defunct platform called Bintray. Later, Homebrew enabled tapping bottles from GitHub Releases. This is a pretty common practice used by many open-source projects, such as Gensim (see the gensim-data repository as an example).

Unfortunately, there is no standard way to attach structured metadata to release files, so one has to rely on ad-hoc means to distinguish operating systems and architectures for the same package version. Now, they have finally moved to GitHub Packages. Yes, the same GitHub Packages that is used for storing Maven artefacts and Docker images.

A container registry is a good fit for storing files of different sizes along with their versions and metadata. GitHub has a generous free tier for open-source projects, allowing unlimited traffic. So, when tapping a bottle in Homebrew, you are downloading pre-built binaries from GitHub Packages, aka GitHub Container Registry.

But how exactly does Homebrew resolve which binaries to download? The bottle block in the formula contains SHA-256 identifiers, aka digests, of the platform-specific archives (see sqlite.rb again as an example). At the time of writing, the latest version of SQLite in Homebrew was 3.40.1.

Since GitHub Packages is a container registry just like Docker Hub and many others, it is possible to reconstruct the generic download link because we know the image name (homebrew/core/sqlite) and its version (3.40.1). Thus, the bottles are located at ghcr.io/homebrew/core/sqlite:3.40.1. For instance, the binaries for x86_64 Linux have the digest 8d1bae…85bb06.
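To make the naming scheme concrete, here is a small sketch that composes the image reference and a blob URL from their parts. The helper names image_ref and blob_url are made up for this example and are not Homebrew commands; the URL follows the /v2/<name>/blobs/<digest> layout of the registry HTTP API. It deliberately ignores any special cases Homebrew may apply when mapping formula names and versions to tags.

```shell
# Compose registry locations for a bottle in homebrew/core.
# image_ref and blob_url are hypothetical helpers for illustration only.
REGISTRY="ghcr.io"
NAMESPACE="homebrew/core"

image_ref() {  # usage: image_ref NAME VERSION
  printf '%s/%s/%s:%s\n' "$REGISTRY" "$NAMESPACE" "$1" "$2"
}

blob_url() {   # usage: blob_url NAME SHA256_HEX
  printf 'https://%s/v2/%s/%s/blobs/sha256:%s\n' "$REGISTRY" "$NAMESPACE" "$1" "$2"
}

image_ref sqlite 3.40.1
blob_url sqlite 8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06
```

The first call prints ghcr.io/homebrew/core/sqlite:3.40.1, and the second prints the full HTTPS address of the x86_64 Linux bottle.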

The only missing piece for accessing these binaries is the authentication token for the registry. The default value of this token is hard-coded in Homebrew as QQ==. However, we can request a dedicated token with a single relatively straightforward request.

TOKEN=$(curl "https://ghcr.io/token?scope=repository:homebrew/core/sqlite:pull" | jq -r .token)
# or just
TOKEN="QQ=="
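It may be reassuring to check that the default token is not a secret. A quick sketch, assuming a base64 utility that understands --decode (both GNU coreutils and macOS do):

```shell
# Decode the hard-coded default token to see what it actually contains.
default_token="QQ=="
decoded=$(printf '%s' "$default_token" | base64 --decode)
echo "Decoded default token: ${decoded}"
```

The decoded value is the single letter A, which suggests that the default is merely a placeholder accepted for anonymous pulls of public images. That is my reading of the value, not a documented guarantee.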

Regardless of how we got the authentication token, combining these three items (the package name, the digest, and the token) allows us to directly grab SQLite 3.40.1 for x86_64 Linux as a binary large object (blob).

curl -I \
  -H "Authorization: Bearer $TOKEN" \
  "https://ghcr.io/v2/homebrew/core/sqlite/blobs/sha256:8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06"

Don’t forget to enable the --location option in cURL (-L) to follow redirects when downloading the file; my current example just fetches the file headers (-I). So it is relatively easy to download such blobs as Homebrew bottles or regular container images from registries like GitHub Packages without any specialized tools.
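Since a blob is addressed by the SHA-256 of its content, a downloaded file can be checked against the digest it was requested by. The sketch below assumes either sha256sum (GNU coreutils) or shasum (macOS) is installed; verify_sha256 is a hypothetical helper name, not part of cURL or Homebrew.

```shell
# Compare a file's SHA-256 with an expected hexadecimal digest.
verify_sha256() {  # usage: verify_sha256 FILE EXPECTED_HEX
  local actual
  if command -v sha256sum >/dev/null 2>&1; then
    actual=$(sha256sum "$1" | cut -d ' ' -f 1)      # GNU coreutils
  else
    actual=$(shasum -a 256 "$1" | cut -d ' ' -f 1)  # macOS fallback
  fi
  [ "$actual" = "$2" ]
}

# Intended usage (needs network access and $TOKEN, so it is left commented out):
#   DIGEST="8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06"
#   curl -L -H "Authorization: Bearer $TOKEN" -o sqlite.bottle.tar.gz \
#     "https://ghcr.io/v2/homebrew/core/sqlite/blobs/sha256:$DIGEST"
#   verify_sha256 sqlite.bottle.tar.gz "$DIGEST" && echo "digest OK"
```

The function returns a non-zero exit status on a mismatch, so it can be chained with && or used in an if statement.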

We could have stopped at using the digests already known from the Homebrew formulæ, but let us investigate where these identifiers come from. We know the image identifier, ghcr.io/homebrew/core/sqlite:3.40.1, so let’s use the Open Container Initiative (OCI) image specifications to recover the SHA-256 identifiers ourselves.

OCI Image Format Specification: Media Types (Source: https://github.com/opencontainers/image-spec/blob/main/img/media-types.png)

The top-level entity is the image index, which contains information about the operating system flavors in JSON format, along with some other metadata fields. Let us fetch it first.

curl \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.oci.image.index.v1+json" \
  "https://ghcr.io/v2/homebrew/core/sqlite/manifests/3.40.1"

The index refers to multiple image manifests. Each manifest corresponds to a particular platform, identified by the CPU architecture, operating system, and version. As all the metadata are machine-readable (and somewhat human-readable, too), we can easily spot two things.

First, there is an annotation field, sh.brew.bottle.digest, which contains the same SHA-256 digest as specified in the Homebrew formula; it was put there deliberately during the build process. Second, we don’t see any files here, and the manifest digest for x86_64 Linux is different: ff58c2…8c22da. We now need to retrieve the manifest of the image we want.
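Instead of reading the index by eye, the manifest digest for a platform can be selected with jq. The JSON below is a hand-trimmed illustration assembled from the values quoted in this article, not the registry's verbatim response, and the platform values (linux and amd64) are my assumption about how x86_64 Linux is labeled. manifest_digest_for is a hypothetical helper name.

```shell
# Pick the manifest digest for one platform out of an OCI image index.
# Illustrative sample only: a real index lists many platforms and more fields.
index_json=$(cat <<'EOF'
{
  "schemaVersion": 2,
  "manifests": [
    {
      "mediaType": "application/vnd.oci.image.manifest.v1+json",
      "digest": "sha256:ff58c21da5e58b82bae6e19207a8ec01e398a31512081e9b6560b2dec88c22da",
      "platform": { "architecture": "amd64", "os": "linux" },
      "annotations": {
        "sh.brew.bottle.digest": "8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06"
      }
    }
  ]
}
EOF
)

manifest_digest_for() {  # usage: manifest_digest_for OS ARCH < index.json
  jq -r --arg os "$1" --arg arch "$2" \
    '.manifests[] | select(.platform.os == $os and .platform.architecture == $arch) | .digest'
}

printf '%s\n' "$index_json" | manifest_digest_for linux amd64
```

For the sample above, this prints the sha256:ff58c2…8c22da manifest digest; a platform that is not listed yields empty output.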

curl \
  -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.oci.image.manifest.v1+json" \
  "https://ghcr.io/v2/homebrew/core/sqlite/manifests/sha256:ff58c21da5e58b82bae6e19207a8ec01e398a31512081e9b6560b2dec88c22da"

An image is composed of a set of layers. Each layer is stored, and downloaded, as a separate file. In our case, the image has only one layer, titled sqlite--3.40.1.x86_64_linux.bottle.tar.gz, and its digest finally matches the one specified in the formula! That means we can query the container registry and get the correct files without relying on hand-crafted links. So we have recovered the same link that we used a few paragraphs above.
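The layer lookup can be scripted in the same spirit. The manifest below is again a hand-trimmed illustration built from the values quoted in the text, with the config entry and size fields left out; I am assuming that the layer title is stored in the standard org.opencontainers.image.title annotation. layer_titles_and_digests is a hypothetical helper name.

```shell
# List each layer's file title and digest from an OCI image manifest.
# Illustrative sample only, not the registry's verbatim response.
manifest_json=$(cat <<'EOF'
{
  "schemaVersion": 2,
  "layers": [
    {
      "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
      "digest": "sha256:8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06",
      "annotations": {
        "org.opencontainers.image.title": "sqlite--3.40.1.x86_64_linux.bottle.tar.gz"
      }
    }
  ]
}
EOF
)

layer_titles_and_digests() {  # reads a manifest on stdin, prints TITLE<TAB>DIGEST
  jq -r '.layers[] | [.annotations["org.opencontainers.image.title"], .digest] | @tsv'
}

printf '%s\n' "$manifest_json" | layer_titles_and_digests
```

Each output line pairs a file name with the digest to request from the blobs endpoint, which is all that is needed to build the download link.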

curl -I \
  -H "Authorization: Bearer $TOKEN" \
  "https://ghcr.io/v2/homebrew/core/sqlite/blobs/sha256:8d1baebd808a5cdb47c3fedbefd4de5cf7983700c41191432f3a9bed4885bb06"

We performed all these manipulations manually with cURL, which is educational but error-prone and somewhat unreliable.

Container registries are steadily evolving into general-purpose blob storage, and OCI Registry As Storage (ORAS), a project recently funded by the Cloud Native Computing Foundation (CNCF), builds on this idea. It allows pushing and pulling images and fetching their metadata just like we did with cURL, but it is maintained under the CNCF umbrella and hopefully handles errors better.

ORAS, OCI Registry as Storage (Source: https://oras.land/)

ORAS is implemented in the Go programming language. It replaces our cURL invocations for the index and the manifest, respectively, with two oras manifest fetch commands that return the same JSON representations.

oras manifest fetch "ghcr.io/homebrew/core/sqlite:3.40.1"
oras manifest fetch "ghcr.io/homebrew/core/sqlite:3.40.1@sha256:ff58c21da5e58b82bae6e19207a8ec01e398a31512081e9b6560b2dec88c22da"

ORAS can also download the entire image to the current directory.

oras pull "ghcr.io/homebrew/core/sqlite:3.40.1"

It is also possible to upload data with ORAS via oras push, but our token must have the corresponding permissions for the upstream container registry. It is not very difficult, and the core idea is the same; see the documentation.

Container registries are a popular way to distribute large files. You can store more than just software packages—machine learning parameters, training datasets, and much more—using the well-supported internet infrastructure you already know.

Written by Dmitry Ustalov
