djbsort: Verification

The sortverif toolkit in djbsort, given an array size n, automatically verifies that djbsort_int32 correctly sorts all 2^(32n) possible length-n int32 arrays and that djbsort_int64 correctly sorts all 2^(64n) possible length-n int64 arrays. The time to run the tools gradually increases with n but is affordable for, e.g., n=1024.

This page explains how the sortverif toolkit works and how to run it. This is a long page organized as follows:

The structure of the verification tools
How to prove a min-max program
The rewrite rules for minmax
How to install angr and verify djbsort
Verifying multiple implementations
Verifying cross-compiled implementations
Developing new rules for minmax
Some verification timings

The structure of the verification tools

There are four main tools inside sortverif:

How to prove a min-max program

How does verification prove that a min-max program correctly sorts all inputs? Let's start with a standard observation about min-max programs, called the "0-1 principle": a min-max program correctly sorts all arrays if and only if it correctly sorts all 0-1 arrays. A 0-1 array means an array where each entry is either 0 or 1: for example, the 0-1 arrays of length 3 are 0,0,0 and 0,0,1 and 0,1,0 and 0,1,1 and 1,0,0 and 1,0,1 and 1,1,0 and 1,1,1.

The point is that if a min-max program mis-sorts an array, let's say producing an output ...,27,18,... with 27 and 18 in the wrong order, then the same min-max program will also mis-sort a 0-1 array that has 1 exactly where the original array has 27 or larger, producing an output ...,1,0,... with 1 and 0 in the wrong order.

The 0-1 principle is normally formulated for networks, meaning min-max programs consisting of comparators that overwrite a,b with min(a,b),max(a,b); but it's also true for more general min-max data flows. If a min-max program works for all 0-1 arrays, then, for any input array, the output of the same min-max program will have the correct number of occurrences of integers 27 or larger (the same number of occurrences as in the input array) and will have those after the other integers. Now replace "27 or larger" with "larger than 27" to see that the output will have the correct number of occurrences of exactly 27, and the correct positioning of those occurrences.
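As a concrete illustration of the principle (not part of sortverif), here is a minimal Python sketch using a standard 5-comparator sorting network for 4 inputs: checking all 2^4 0-1 arrays coincides with checking arbitrary input orderings.

```python
from itertools import product, permutations

# A small comparator network (the standard 5-comparator sorting network
# for 4 inputs); each pair (i, j) overwrites a[i], a[j] with min, max.
NETWORK = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]

def run(network, a):
    a = list(a)
    for i, j in network:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

# Checking all 2^4 = 16 0-1 arrays...
assert all(run(NETWORK, a) == sorted(a) for a in product((0, 1), repeat=4))
# ...matches checking arbitrary inputs, e.g. all orderings of 0,1,2,3:
assert all(run(NETWORK, p) == sorted(p) for p in permutations(range(4)))
print("0-1 check and full check agree")
```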

The 0-1 principle reduces the verification cost for int32[n] sorting from 2^(32n) to 2^n. A standard variant of the principle reduces the cost to polynomial in n. This variant says that a min-max program correctly merges all sorted arrays of length n−r with all sorted arrays of length r if and only if it correctly merges all sorted 0-1 arrays of length n−r with all sorted 0-1 arrays of length r. Notice that there are only r+1 sorted 0-1 arrays of length r: for example, for r = 3, these arrays are 0,0,0 and 0,0,1 and 0,1,1 and 1,1,1.

The most important step in the decompose tool is analyzing the data flow in a min-max program to compute a decomposition of the program into presumed 2-way-merging programs. The tool does not require the merge to be a merge of sorted inputs on the left with sorted inputs on the right; it allows arbitrary ordering or interleaving of the two lists of inputs. (In the current version of djbsort, left and right sides are often interleaved for vectorizability. Also, the left side is often in reverse order, as in many presentations of bitonic sorting.)

The tool then verifies each presumed 2-way-merging program by testing the program on a linear-size list of inputs. For example, a min-max program correctly merges all sequences of 3 inputs with all sequences of 5 inputs if and only if it correctly merges 0 1 2 with 3 4 5 6 7; 0 1 7 with 2 3 4 5 6; 0 6 7 with 1 2 3 4 5; and 5 6 7 with 0 1 2 3 4.

(Linear size refers to the number of inputs here being n. Each input fits into ceil(log2(n)) bits. Programmers using these techniques to verify, e.g., int8[300] sorting should note that these test inputs need int16 instead. The decompose tool uses Python's integers, which do not have size limits other than total memory.)
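For illustration, here is a Python sketch (not the decompose tool itself) of this test-input pattern, using Batcher's odd-even merge network for two sorted 4-element halves as a stand-in merging program:

```python
from itertools import product

# Stand-in example (not from djbsort): Batcher's odd-even merge for two
# sorted 4-element halves at positions 0-3 and 4-7.
MERGE_4_4 = [(0, 4), (1, 5), (2, 6), (3, 7), (2, 4), (3, 5), (1, 2), (3, 4), (5, 6)]

def run(network, a):
    a = list(a)
    for i, j in network:
        if a[i] > a[j]:
            a[i], a[j] = a[j], a[i]
    return a

def merge_tests(r, n):
    # Linear-size test list for merging r left inputs with n-r right
    # inputs: for k = 0..r, the left side gets the r-k smallest and the
    # k largest of 0..n-1, and the right side gets the rest.
    tests = []
    for k in range(r + 1):
        left = list(range(r - k)) + list(range(n - k, n))
        right = list(range(r - k, n - k))
        tests.append(left + right)
    return tests

# merge_tests(3, 8) reproduces the four 3-with-5 tests from the text.
print(merge_tests(3, 8)[1])   # → [0, 1, 7, 2, 3, 4, 5, 6]

# Check the 4|4 network on its five tests, and cross-check against
# exhaustive merging of all pairs of sorted 0-1 halves:
assert all(run(MERGE_4_4, t) == sorted(t) for t in merge_tests(4, 8))
for i, j in product(range(5), repeat=2):
    a = [0] * (4 - i) + [1] * i + [0] * (4 - j) + [1] * j
    assert run(MERGE_4_4, a) == sorted(a)
```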

The fact that a quadratic-size list of test bits is enough for a merging network (whereas the best methods known to test a general sorting network take exponential time) is noted in Knuth's Art of Computer Programming. The fact that larger test inputs allow a linear-size list (analogously to an older speedup by Yao for general sorting networks) was pointed out in 1987 Chung–Ravikumar. However, the test set stated in that paper is too small (off by one). A correct test set appears in 2007 Even–Levi–Litman. Both of these papers consider only the balanced case, merging n/2 inputs with n/2 inputs, but the idea generalizes immediately to other cases.

This verification strategy suffices for djbsort, which handles each array size with a min-max program built as a composition of 2-way merging programs. Beyond djbsort, some sorting networks are not built as compositions of 2-way merging programs: for example, verifying the communication-efficient Schnorr–Shamir mesh sorting algorithm would require extending the tool to recognize the structure of that algorithm.

The rewrite rules for minmax

The point of the minmax tool is as follows. The programmer transforms a sequence of min-max operations into a sequence of operations in C code; the compiler transforms those operations into a sequence of machine instructions; unroll, using angr's binary analysis, transforms those machine instructions into a sequence of instructions in angr's language for arithmetic; minmax tries to transform those instructions back into the original sequence of min-max operations.

Internally, minmax follows various rewrite rules in a file rules, such as a rule

invert(unsignedle(S,T)) -> unsignedlt(T,S)

saying that complementing the 1-bit result of S<=T is the same as the 1-bit result of T<S, and more complicated rules such as

xor(and(signedrshift(xor(BMINUSA=sub(B,A),and(xor(B,BMINUSA),AXORB=xor(A,B))),Bits(32,const31)),AXORB),A) -> signedmin(A,B)

recognizing one of the portable ways to express min(A,B) using arithmetic operations.
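The flavor of this rule matching can be sketched in a few lines of Python (a toy illustration, not the actual minmax implementation): terms are nested tuples, and capitalized strings in patterns act as variables like the S and T above.

```python
# Match a pattern against a term, binding pattern variables in env.
def match(pattern, term, env):
    if isinstance(pattern, str):              # variable: bind or compare
        if pattern in env:
            return env[pattern] == term
        env[pattern] = term
        return True
    if not isinstance(term, tuple) or len(pattern) != len(term):
        return False
    if pattern[0] != term[0]:                 # operator names must agree
        return False
    return all(match(p, t, env) for p, t in zip(pattern[1:], term[1:]))

# Apply a (lhs, rhs) rule to a term, or return the term unchanged.
def rewrite(rule, term):
    lhs, rhs = rule
    env = {}
    if not match(lhs, term, env):
        return term
    def subst(p):
        if isinstance(p, str):
            return env[p]
        return (p[0],) + tuple(subst(x) for x in p[1:])
    return subst(rhs)

# The rule invert(unsignedle(S,T)) -> unsignedlt(T,S):
RULE = (("invert", ("unsignedle", "S", "T")), ("unsignedlt", "T", "S"))

term = ("invert", ("unsignedle", ("var", "v7"), ("var", "v8")))
print(rewrite(RULE, term))   # → ('unsignedlt', ('var', 'v8'), ('var', 'v7'))
```

The real rules file also supports named subterms (such as BMINUSA=sub(B,A)) and bit-width patterns, which this toy sketch omits.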

Each minmax run generates a random input for the program it is analyzing, and tests that each rewrite being applied produces correct results for that input. It would be easy to scale up the number of inputs tested for each rewrite. However, this would not catch rewrite bugs that apply only to rare inputs. For higher assurance, there is a script checkrules (usage after angr installation: ./checkrules < rules) that uses an SMT solver (via angr's SMT-solving interface) to check that the rules are correct for various sizes of operands. SMT solvers are designed to reason about how all possible inputs behave, not just random inputs; for example,

echo 'equal(S,Concat(Bits(64,const31),Bits(64,const5))) -> bit0' | ./checkrules

rapidly prints out a specific 128-bit value of S disproving this rule.
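For intuition, the same kind of rule checking can be imitated in pure Python by exhaustive enumeration over small operand widths (a toy stand-in for the SMT-based checkrules, not part of sortverif):

```python
from itertools import product

def unsignedle(s, t):
    return 1 if s <= t else 0

def unsignedlt(s, t):
    return 1 if s < t else 0

def invert(bit):
    return bit ^ 1

# Compare lhs and rhs on all pairs of w-bit operands; return a
# counterexample if one exists, else None.
def check_rule(lhs, rhs, w):
    for s, t in product(range(1 << w), repeat=2):
        if lhs(s, t) != rhs(s, t):
            return (s, t)
    return None

# invert(unsignedle(S,T)) -> unsignedlt(T,S) holds at every tested width:
for w in (1, 2, 4, 8):
    assert check_rule(lambda s, t: invert(unsignedle(s, t)),
                      lambda s, t: unsignedlt(t, s), w) is None

# A wrong variant (arguments not swapped) is caught immediately:
bad = check_rule(lambda s, t: invert(unsignedle(s, t)),
                 lambda s, t: unsignedlt(s, t), 1)
print("counterexample:", bad)   # → counterexample: (0, 1)
```

Unlike an SMT solver, exhaustive enumeration is limited to small widths; checkrules reasons about full-width operands.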

How to install angr and verify djbsort

The sortverif toolkit has been tested only under Debian and Ubuntu. The toolkit needs the following system packages: build-essential, clang, python3, python3-venv, python3-pyparsing, and time.

(Here's everything together: apt install build-essential clang python3 python3-venv python3-pyparsing time.) The remaining steps are carried out as an unprivileged user.

Download and unpack djbsort. (The sortverif toolkit is included in the djbsort source distribution. It is not installed as part of a djbsort binary package, and is not incorporated into the djbsort-fulltest utility.)

Install djbsort as a home-directory installation if it isn't already installed on the system. (The toolkit tests machine code, not source code.)

Inside the unpacked djbsort distribution, switch into the sortverif directory and install angr, along with an angr patch to add support for more instructions:

cd sortverif
python3 -m venv angrenv
. angrenv/bin/activate
pip install angr==9.2.102 pycparser==2.23
for dir in angrenv/lib/python*
do
  (cd $dir; patch -p1 < ../../../angr-patch)
done

If you come back to this later in a new shell, you will have to switch to the sortverif directory again and run . angrenv/bin/activate again.

Run the verifymany tool to verify sorting for (e.g.) array lengths 0 through 16:

seq 0 16 | time ./verifymany &

The verifymany tool puts its outputs into a results directory. Check for success lines and fail lines in a file results/overview.

Actually, results is a symbolic link to a directory whose name matches results-*, where the * includes a timestamp. If you run verifymany again then the symbolic link will point to the new directory while the old results will still be in the previous directory.

By default, verifymany will parallelize across all hardware threads: normally one thread per CPU core, times the number of hyperthreads per core. This can be a problem if there is not a correspondingly large amount of RAM. You can instead run

seq 0 16 | time env THREADS=3 ./verifymany &

and then verifymany will run at most 3 verifications at once. A few components of angr are further parallelized, so the load average can be somewhat larger than 3; verifymany does not pin each verification to an OS thread.

Parallel computation will tend to have slightly lower latency if you try array lengths in the opposite order, changing seq 0 16 to seq 16 -1 0. You can also test a specific length using, e.g., echo 768.

Verifying multiple implementations

sortverif automatically analyzes every sorting implementation compiled into the library, not just the implementations automatically selected for the current machine. Concretely, on Intel/AMD CPUs, the verification tools automatically analyze both the fast AVX2 implementation and the portable fallback, whether or not the current machine supports AVX2.

If you configure djbsort with --no-trim, install, and then run sortverif, then further options will be compiled into the library and will be analyzed by sortverif. For example, sortverif will analyze the AVX2 implementation compiled with gcc and the AVX2 implementation compiled with clang, if both gcc and clang are installed when djbsort is configured and compiled.

Verifying cross-compiled implementations

For conventional tests, running native binaries is faster than running emulated binaries, although running an emulated binary on a big server can still outperform the same binary running natively on a smaller device. The sortverif analysis is different: binaries are emulated in all cases, so you might as well run the analysis for all architectures on a big server.

For concreteness, the following instructions focus on an Intel/AMD server running Debian or Ubuntu to verify djbsort cross-compiled for one architecture, 32-bit ARM. There are various ARM-specific parts here: the name arm32; Debian's armel name for little-endian arm32; gcc's arm-linux-gnueabi name for arm32 with a particular ABI; and more specific compiler options -mthumb -march=armv7-a -mtune=cortex-a7 -mgeneral-regs-only.

As root, install an ARM cross-compiler, libcpucycles for ARM, and an emulator, along with the system packages generally needed for sortverif:

dpkg --add-architecture armel
apt update
apt install \
  build-essential clang python3 python3-venv python3-pyparsing time \
  binutils-arm-linux-gnueabi crossbuild-essential-armel libc6:armel libcpucycles-dev:armel \
  binfmt-support qemu-user-static
ln -s /usr/bin/clang /usr/bin/arm-linux-gnueabi-clang

If your Debian or Ubuntu is too old to include libcpucycles-dev, do the above without libcpucycles-dev:armel and then install libcpucycles as follows:

cd /root
wget -m https://cpucycles.cr.yp.to/libcpucycles-latest-version.txt
version=$(cat cpucycles.cr.yp.to/libcpucycles-latest-version.txt)
wget -m https://cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
tar -xzf cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
cd libcpucycles-$version
echo arm-linux-gnueabi-gcc -Wall -fPIC -fwrapv -O2 > compilers/default
./configure --host=arm32 --prefix=/usr/lib/arm-linux-gnueabi && make -j8
mkdir -p /usr/include /usr/lib/arm-linux-gnueabi
cp build/0/package/lib/libcpucycles.so.1 /usr/lib/arm-linux-gnueabi/
ln -s libcpucycles.so.1 /usr/lib/arm-linux-gnueabi/libcpucycles.so
cp build/0/package/include/cpucycles.h /usr/include/

Now make an account specifically to verify sorting code compiled for 32-bit ARM:

adduser --disabled-password --gecos arm32sort arm32sort
su - arm32sort

All further steps are within this account. (If you want to verify sorting code for multiple architectures, make one account per architecture.)

Set up environment variables:

export LD_LIBRARY_PATH="$HOME/lib"
export LIBRARY_PATH="$HOME/lib"
export CPATH="$HOME/include"
export MANPATH="$HOME/man"
export PATH="$HOME/bin:$PATH"
export CROSS="arm-linux-gnueabi-"
export CC="arm-linux-gnueabi-gcc"
export CFLAGS="-L$HOME/lib"

Download, unpack, and install djbsort, configuring with --host=arm32 and two choices of cross-compilers:

cd
wget -m https://sorting.cr.yp.to/djbsort-latest-version.txt
version=$(cat sorting.cr.yp.to/djbsort-latest-version.txt)
wget -m https://sorting.cr.yp.to/djbsort-$version.tar.gz
tar -xzf sorting.cr.yp.to/djbsort-$version.tar.gz
cd djbsort-$version
( echo arm-linux-gnueabi-gcc -Wall -fPIC -fwrapv -O2 -mthumb -march=armv7-a -mtune=cortex-a7 -mgeneral-regs-only
  echo arm-linux-gnueabi-clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mthumb -mcpu=cortex-a7 -mgeneral-regs-only
) > compilers/default
./configure --host=arm32 --no-trim --prefix=$HOME && make -j8 install

As a test at this point, running djbsort-speed should print various lines including djbsort arch arm32 near the top, and should continue with various lines mentioning arm-linux-gnueabi-gcc and arm-linux-gnueabi-clang, along with many cycle counts (higher cycle counts than running on a real ARM device; djbsort-speed is running in emulation under qemu).

Next, switch into the sortverif directory and install angr:

cd sortverif
python3 -m venv angrenv
. angrenv/bin/activate
pip install angr==9.2.102 pycparser==2.23
for dir in angrenv/lib/python*
do
  (cd $dir; patch -p1 < ../../../angr-patch)
done

Finally, run verifymany:

seq 0 16 | time env THREADS=3 ./verifymany &

Results are in results/overview as in the non-cross-compiled case.

Developing new rules for minmax

A change to the machine code for sorting—for example, because of a change to the target architecture or compiler options, a change to the compiler version, or a change to a different compiler—could break verification. This could be because the new machine code is wrong (try introducing a bug into the djbsort code and then see what sortverif does), but it could also be because the binary-analysis tools do not understand the instructions used.

Here is an example from the minmax development, where machine code for a portable implementation called bitonic2 used min-max idioms that were not recognized by a development version of minmax, but adding various lines to rules recognized the idioms.

The first sortverif experiments on 32-bit ARM (via cross-compilation) successfully verified int32 sorting but failed for int64 sorting with bitonic2. There was a line in results/overview saying minmaxfailed/int64/impl0/2 bitonic2 for gcc, and similarly for clang. Other files in the results directory included results/minmaxfailed/int64/impl0/2 saying "exit code 1"; results/minmaxstderr/int64/impl0/2 saying Exception: resulting operations ['Concat', 'Extract', 'ZeroExt', 'and', 'constant', 'signedrshift', 'sub', 'unsignedle', 'xor'] are not just signedmin, signedmax; and results/minmaxstdout/int64/impl0/2 with the following contents:

v1 = x_0_64
v3 = Extract(v1,63,32)
v2 = x_1_64
v4 = Extract(v2,63,32)
v5 = xor(v3,v4)
v6 = sub(v4,v3)
v7 = Extract(v1,31,0)
v8 = Extract(v2,31,0)
v12 = unsignedle(v7,v8)
v13 = ZeroExt(v12,31)
v14 = constant(32,1)
v15 = xor(v13,v14)
v16 = sub(v6,v15)
v17 = xor(v4,v16)
v18 = and(v17,v5)
v19 = xor(v16,v18)
v20 = constant(32,31)
v21 = signedrshift(v19,v20)
v22 = and(v5,v21)
v23 = xor(v3,v22)
v24 = xor(v7,v8)
v25 = and(v24,v21)
v26 = xor(v7,v25)
v27 = Concat(v23,v26)
y_0_64 = v27
v28 = xor(v4,v22)
v29 = xor(v8,v25)
v30 = Concat(v28,v29)
y_1_64 = v30

This is correct code using 32-bit arithmetic to compute the min and max of two 64-bit inputs, but that version of minmax did not see how to simplify the code down into just saying min and max.

The original C code already had several operations to replace a and b with their min and max, namely r = b^a; z = b-a; z ^= r&(z^b); z >>= 63; z &= r; a ^= z; b ^= z. The code then expanded because each of the variables here is an int64 variable. On a 32-bit CPU, an int64 operation typically turns into multiple 32-bit instructions: for example, a 64-bit r = b^a turns into two 32-bit xors, and a 64-bit z = b-a turns into a 32-bit subtraction followed by a 32-bit subtract-with-borrow instruction. In context, the 32-bit subtraction matters only for the borrow result, since z >>= 63 depends only on the top bit of z. The subtraction-with-borrow instruction is decomposed by angr into simpler instructions in the language used by angr for arithmetic: for example, the 32-bit borrow is expressed as xor(ZeroExt(C,31),constant(32,1)) where C is a comparison bit, ZeroExt(C,31) pads it to 32 bits, and constant(32,1) is the 32-bit constant with value 1.
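The original idiom can be checked directly in Python, masking to 64 bits after each step to imitate C's wraparound arithmetic (an illustration, not part of the toolkit):

```python
import random

M64 = (1 << 64) - 1

def sign_extend64(x):
    # Interpret a 64-bit value as a signed int64.
    return x - (1 << 64) if x >> 63 else x

def minmax64(a, b):
    # r = b^a; z = b-a; z ^= r&(z^b); z >>= 63; z &= r; a ^= z; b ^= z,
    # with Python's unbounded integers masked to 64 bits.
    r = (b ^ a) & M64
    z = (b - a) & M64
    z ^= r & (z ^ b)
    z = M64 if z >> 63 else 0    # arithmetic shift right by 63
    z &= r
    return a ^ z, b ^ z          # (signed min, signed max)

random.seed(1)
for _ in range(1000):
    a, b = random.getrandbits(64), random.getrandbits(64)
    lo, hi = minmax64(a, b)
    sa, sb = sign_extend64(a), sign_extend64(b)
    assert (sign_extend64(lo), sign_extend64(hi)) == (min(sa, sb), max(sa, sb))
print("idiom verified on 1000 random int64 pairs")
```

The z ^= r&(z^b) step corrects the sign of b-a for overflow, so the arithmetic shift produces all-ones exactly when b < a as signed integers.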

Adding the following lines to rules simplified the computation:

xor(ZeroExt(C,31),Bits(32,0star1)) -> ZeroExt(invert(C),31)

This turns the 32-bit xor into a 1-bit inversion. Bits(32,...) is a pattern matching a 32-bit quantity ..., and 0star1 is a pattern matching a string of 0s followed by a 1, i.e., the integer 1. (Bit strings in angr are big-endian.)

invert(unsignedle(S,T)) -> unsignedlt(T,S)

The comparison bit is computed as unsignedle(S,T), meaning S<=T, where S and T are the bottom 32 bits of the 64-bit inputs. Inverting this comparison bit gives the same result as T<S.

sub(sub(Extract(Bits(64,B),63,32),Extract(Bits(64,A),63,32)),ZeroExt(unsignedlt(Extract(B,31,0),Extract(A,31,0)),31)) -> Extract(sub(B,A),63,32)

This recognizes what the compiler produces for the top 32-bit word of B-A.
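The claim behind this rule is easy to check numerically (again a pure-Python illustration): the top word of the 64-bit difference equals the top-word difference minus the borrow out of the bottom-word subtraction.

```python
import random

M32 = (1 << 32) - 1
M64 = (1 << 64) - 1

def top_word_of_difference(a, b):
    # Top 32-bit word of the 64-bit difference B-A, computed word by
    # word: top-word subtraction minus the borrow from the bottom-word
    # subtraction, where the borrow is unsignedlt(Bbot, Abot).
    atop, abot = a >> 32, a & M32
    btop, bbot = b >> 32, b & M32
    borrow = 1 if bbot < abot else 0
    return (btop - atop - borrow) & M32

random.seed(1)
for _ in range(1000):
    a, b = random.getrandbits(64), random.getrandbits(64)
    assert top_word_of_difference(a, b) == ((b - a) & M64) >> 32
print("borrow decomposition matches 64-bit subtraction")
```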

and(signedrshift(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),and(xor(BTOP=Extract(B,63,32),BMINUSATOP),BXORATOP=xor(ATOP=Extract(A,63,32),BTOP))),Bits(32,const31)),BXORATOP) -> xor(ATOP,Extract(signedmin(A,B),63,32))
and(signedrshift(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),and(xor(BTOP=Extract(B,63,32),BMINUSATOP),xor(Extract(A,63,32),BTOP))),Bits(32,const31)),xor(ABOT=Extract(A,31,0),Extract(B,31,0))) -> xor(ABOT,Extract(signedmin(A,B),31,0))

This recognizes what the compiler produces for the top and bottom 32-bit words of r = b^a; z = b-a; z ^= r&(z^b); z >>= 63; z &= r, and rewrites these as the corresponding words of a xor'ed with the desired min.

xor(Extract(B,31,0),xor(Extract(A,31,0),Extract(signedmin(A,B),31,0))) -> Extract(signedmax(A,B),31,0)
xor(Extract(B,63,32),xor(Extract(A,63,32),Extract(signedmin(A,B),63,32))) -> Extract(signedmax(A,B),63,32)

This recognizes the top and bottom words of a computation of the max as the min xor'ed with the two inputs.

These extra rules changed results/minmaxstdout/int64/impl0/2 to simply

v1 = x_0_64
v2 = x_1_64
v27 = signedmin(v1,v2)
y_0_64 = v27
v30 = signedmax(v1,v2)
y_1_64 = v30

and produced success/int64/impl0/2. But there was still a failure for impl1/2, which used clang instead of gcc. The following extra rules fixed that:

and(signedrshift(xor(and(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),BTOP=Extract(B,63,32)),BXORATOP=xor(BTOP,ATOP=Extract(A,63,32))),BMINUSATOP),Bits(32,const31)),BXORATOP) -> xor(ATOP,Extract(signedmin(A,B),63,32))
and(signedrshift(xor(and(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),BTOP=Extract(B,63,32)),xor(BTOP,ATOP=Extract(A,63,32))),BMINUSATOP),Bits(32,const31)),xor(Extract(B,31,0),ABOT=Extract(A,31,0))) -> xor(ABOT,Extract(signedmin(A,B),31,0))

With these added rules, verification finished for all of the int64[2] sorting implementations in the library for 32-bit ARM. Larger array sizes then needed a few more rules, but the most complicated work was for int64[2].

Some verification timings

Each timing here is from a dual AMD EPYC 7742 running at 2.245GHz with overclocking disabled. The OS is Debian 12, with gcc 12.2.0 and clang 14.0.6.

Verifying 4 implementations (int32[n] AVX2, int32[n] portable fallback, int64[n] AVX2, int64[n] portable fallback) at various sizes (each run here covers all sizes down through 0 so there is some redundancy):

$ seq 64 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 amd64 will test avx2 portable4
for int64 20260210 amd64 will test avx2 portable4
results: 260
successes: 260
3134.56user 194.30system 0:22.16elapsed 15018%CPU (0avgtext+0avgdata 184720maxresident)k
0inputs+44304outputs (2334major+21639744minor)pagefaults 0swaps
$ seq 256 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 amd64 will test avx2 portable4
for int64 20260210 amd64 will test avx2 portable4
results: 1028
successes: 1028
32752.60user 863.10system 3:37.34elapsed 15466%CPU (0avgtext+0avgdata 552948maxresident)k
0inputs+907224outputs (3879major+214016122minor)pagefaults 0swaps
$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 amd64 will test avx2 portable4
for int64 20260210 amd64 will test avx2 portable4
results: 4100
successes: 4100
568367.05user 9276.42system 1:11:27elapsed 13474%CPU (0avgtext+0avgdata 6478188maxresident)k
32inputs+22838544outputs (5324major+3143164874minor)pagefaults 0swaps

The timings with djbsort-20260127 were 3614.18user 191.98system 0:38.63elapsed 9852%CPU, 46587.32user 844.62system 5:32.73elapsed 14255%CPU, and 969333.74user 10324.09system 2:03:56elapsed 13174%CPU. Another run with djbsort-20260127 verified 4 implementations at some selected sizes (beware that size 8192 used more than 512GB):

$ echo 512 653 761 768 857 953 960 1013 1024 1248 1277 1280 1547 1664 2048 4096 8192 | time env THREADS=128 ./verifymany
results: 68
successes: 68
147586.26user 5486.63system 19:23:27elapsed 219%CPU (0avgtext+0avgdata 486243736maxresident)k
2012837248inputs+1944920outputs (44842378major+2862319742minor)pagefaults 0swaps

Verification after mv skipcompilers/* compilers and compilation with --no-trim (3 portable implementations times 6 compilers, plus 2 SSE4.2 implementations times 4 compilers, plus 2 AVX2 implementations times 2 compilers):

$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 amd64 will test avx2 portable3 portable4 sse42
for int64 20260210 amd64 will test avx2 portable4 sse42
results: 30750
successes: 30750
4696261.22user 67190.04system 9:51:07elapsed 13430%CPU (0avgtext+0avgdata 8849296maxresident)k

Verification for 64-bit ARM after compilation with --no-trim (5 implementations times 2 compilers, namely aarch64-linux-gnu-gcc -Wall -fPIC -fwrapv -O2 -mtune=cortex-a53 and aarch64-linux-gnu-clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mtune=cortex-a53):

$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 arm64 will test neon portable3 portable4
for int64 20260210 arm64 will test neon portable4
results: 10250
successes: 10250
1636296.03user 13152.07system 3:25:06elapsed 13403%CPU (0avgtext+0avgdata 7010192maxresident)k
1104inputs+49819976outputs (5193major+7109222786minor)pagefaults 0swaps

Verification for 32-bit ARM in Thumb mode after compilation with --no-trim (3 implementations times 2 compilers, namely arm-linux-gnueabi-gcc -Wall -fPIC -fwrapv -O2 -mthumb -march=armv7-a -mtune=cortex-a7 -mgeneral-regs-only and arm-linux-gnueabi-clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mthumb -mcpu=cortex-a7 -mgeneral-regs-only):

$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
for int32 20260210 arm32 will test portable3 portable4
for int64 20260210 arm32 will test portable4
results: 6150
successes: 6150
3992755.43user 38340.85system 8:39:05elapsed 12942%CPU (0avgtext+0avgdata 14852796maxresident)k
16inputs+68334320outputs (10753major+17163689996minor)pagefaults 0swaps

Observed single-verification timings varied across verification tasks. Verification of AVX2 sorting code compiled with gcc took 59 seconds for int64[256], 137 seconds for int64[512], and 335 seconds for int64[1024]. The AVX2 code for int32 sorting has fewer instructions than the AVX2 code for int64 sorting and is faster to verify: 36 seconds for int32[256], 72 seconds for int32[512], and 169 seconds for int32[1024]. Verification time depends on machine-code details that depend on the compiler: verification of AVX2 sorting code compiled with clang took 36 seconds for int32[256], 71 seconds for int32[512], 136 seconds for int32[1024], 61 seconds for int64[256], 112 seconds for int64[512], and 288 seconds for int64[1024].


Version: This is version 2026.02.10 of the "Verification" web page.