djbsort: Internals

djbsort includes infrastructure adapted from lib25519, libmceliece, etc. for selecting implementations and compilers. The djbsort distribution includes rules (which are automatically derived from benchmarks reported by djbsort-speed) for which implementations to prioritize depending on CPU features, and then at run time each CPU selects the top-priority implementation that it supports. For the current djbsort software, this boils down to the following:

The compiled library for Intel/AMD CPUs includes an AVX2 implementation with one compiler, plus a portable fallback. The AVX2 implementation is automatically selected at program startup on CPUs that support AVX2.
The compiled library for other CPUs includes simply a portable fallback.

To expand the tests to cover all combinations of implementations and compilers during development and optimization, configure with --no-trim.

Choices of sorting networks

The AVX2, SSE4.2, and NEON implementations of int32 sorting and int64 sorting are automatically generated by autogen/sort, a sorting-network compiler. The sorting networks used are mostly bitonic to support a hypercube data flow, allowing the hypercube to be sliced and rotated to reduce data movement in vectorized code. However, faster odd-even stages are used in some code segments for which the odd-even data flow does not cause complications.

Odd-even sorting is often stated only for powers of 2, and bitonic sorting is usually stated only for powers of 2. Knuth's reorganization of odd-even sorting as "merge exchange" easily handles arbitrary sizes. More importantly for the current version of djbsort, the formulation of bitonic sorting from 1998 Lang easily handles arbitrary sizes:

Merging 1*0*1* into 0*1*: If a bit array of size n (at least 2) matches regexp 1*0*1* (a V shape), and p is the largest power of 2 strictly below n, then one stage of distance-p bitonic merging (n−p comparators) has the same effect as distance-p bitonic merging (p comparators) on a hypothetical power-of-2 array of length 2p having the same contents right-padded with 1*. Merging can then continue recursively with regular power-of-2 bitonic merging on the left p elements, and not-necessarily-power-of-2 bitonic merging on the right n−p elements.
Recursive sorting (which happens first): As in the power-of-2 case, one can set up 1*0*1* by recursively sorting the left and right sides of the array (so now left is 0*1* and right is 0*1*) and then reversing the left side (so now left is 1*0* and right is 0*1*). The split of sides here does not have to match the p,n−p split used in the merging. Lang suggests a split into floor(n/2),ceil(n/2).

For each n in turn, one can calculate the optimal left-right split for the recursive sorting. It turns out that p,n−p uses more comparators than necessary, especially when n is only slightly larger than p. Using floor(n/2),ceil(n/2) also turns out to be suboptimal (perhaps surprisingly, uneven splits produce fewer comparators than even splits when n is a power of 2). To simplify vectorization and control flow, djbsort limits the choices of sizes for the left side, specifically taking a left side of size either p or 3p/4.

Reverse sorting

Instead of sorting the left side and then doing an extra pass through the data to reverse the result, one can call a sort-downwards function for the left side, at the expense of duplicating code; djbsort does this for some small functions. An alternative is to xor -1 into the array before and after sorting, integrating this into the sorting functions to avoid extra passes; larger functions in djbsort follow this approach to reduce code size.

Similarly, djbsort handles uint and float sorting (and down sorting) with small wrappers on top of int sorting. These wrappers use only one pass before sorting and only one pass after sorting.

Handling odd tails of arrays

An int32[64] array is easily loaded into eight 256-bit AVX2 vectors, each holding an int32[8]. But what about an int32[61] array? There might be other program data immediately after the array, so one cannot simply load and store a 256-bit vector for positions 56,...,63 as if the array had 3 more elements.

One solution (for array lengths above 8) is to use overlapping loads and stores, applying a vector operation to positions 53,...,60 along with 0,...,7 and 8,...,15 and so on through 48,...,55. This was used in earlier versions of djbsort and works well with, e.g., a min-max computation of two vectors, but it constrains sorting-network layouts.

Another solution is to copy the int32[61] array into an int32[64] array padded with maximum int32 values, sort, and copy back. This approach is used in djbsort for some small array sizes, but is unacceptable for in-place software handling large array sizes.

Yet another solution is to copy just the last 5 entries of the array into an int32[8] array padded with maximum int32 values, or the last 13 entries of the array into an int32[16] array padded with maximum int32 values, etc. This has less copying and, as a bonus, allows other sorting stages to assume that array lengths are multiples of 8, 16, etc.; the disadvantage is that the array tail has to be tracked separately through each sorting subroutine.

A conceptually simpler solution comes from the AVX2 VPMASKMOV instruction, which can load just int32[5] into an int32[8] vector (and store int32[5] later), given a mask that designates the first 5 out of 8 entries. It is easy to convert 61 into such a mask, and to blend the resulting vector with a vector of maximum int32 values.

Some vectorized sorting software uses VPMASKMOV, but djbsort does not, for two reasons. First, AMD's official documentation says that a CPU "may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written". (This AMD statement is surprising: Intel's AVX2 documentation guarantees that faults "will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0"; AMD claims to support AVX2; tests on a variety of AMD CPUs that claim AVX2 support shows each CPU seeming to do what Intel says. But is the AMD statement so surprising that it's clearly wrong?) Second, loads with VPMASKMOV are very slow on AMD's AVX2 CPUs before Zen 2, and stores with VPMASKMOV are very slow even on newer AMD AVX2 CPUs.

The sorting-network compiler in djbsort understands two options internally. One option generates VPMASKMOV; the other generates a slightly longer sequence of instructions that adequately simulates VPMASKMOV for the case of loading and storing an array tail. Currently djbsort uses the slightly longer sequence of instructions. This marginally slows down Intel CPUs, but speeds up AMD CPUs and avoids correctness questions on AMD CPUs; djbsort also uses a similar instruction sequence for ARM CPUs. It would also be possible to automatically select the VPMASKMOV code for Intel CPUs and the non-VPMASKMOV code for AMD CPUs, at the expense of having both options in the compiled library.

Version: This is version 2026.02.09 of the "Internals" web page.