Member-only story
Memory Bandwidth Optimized Parallel Radix Sort in Metal for Apple M1 and Beyond
A sort shader in Metal for use on primitive data types

Parallel sorting is a well-trodden-yet-still-challenging-to-implement primitive on massively parallel architectures. Among the challenges are irregular memory access patterns, multiple passes over the data, intermediate storage requirements, and branching, leading to thread divergence. In this story, we will develop a sort shader in Metal for use on primitive data types.
The natural algorithm choice for sorting these data is a radix sort as its overall time complexity is O(k•n)
rather than O(n•log(n))
in comparison-based sorts, where n
is the number of values, and k
is the bits required to represent the largest value. Radix sort is a form of counting sort, and consequently, this work builds on our efficient scan and reduce primitives to split and scatter the values.
Brief Background
Many variations on radix-sort for concurrent systems have been discussed in literature. The commonality amongst these variations is a counting step whereby the frequency of a digit at a particular position in a sequence of values is determined, followed by a partitioning step where values are split into buckets based on that digit. These steps are repeated for each position in the value. For example, a radix-sort in base-2 yields digits of 0 and 1.
The number of 0's are counted so that the 1’s can be moved to positions after the 0’s. The process is repeated for each subsequent position in the number. For a 32-bit value, this would mean 32 passes over the data, with each pass consisting of the counting step and the move step. An astute reader will of course recognize that the partitioning by digit could occur in k
passes over one bit or one pass over k
bits. In the latter case, the count would no longer be a value but a histogram of the frequency of each digit.

A quick aside on elementary vocabulary — positional notation expresses numbers by digits arranged sequentially by place value. The sum of the product of the digit…