The sortverif toolkit in djbsort,
given an array size n,
automatically verifies that `djbsort_int32` correctly sorts all 2^32n^ possible length-n `int32` arrays
and that `djbsort_int64` correctly sorts all 2^64n^ possible length-n `int64` arrays.
The [time](#timings) to run the tools
gradually increases with n
but is affordable for, e.g., n=1024.
The sortverif interface is a tool called `verifymany`,
which runs the whole toolkit on a range of array sizes.
This page explains
how the tools work and how to run them.
This is a long page organized as follows:
* [The structure of the verification tools](#structure):
quick overview of what the four main tools do.
* [How to prove a min-max program](#minmax):
understanding how verification of a min-max program
can efficiently handle all possible inputs.
* [The rewrite rules for `minmax`](#rules):
dealing with the fact that
analysis of compiled binaries does not end up simply saying "min" and "max".
* [How to install angr and verify djbsort](#run):
main usage instructions for sortverif.
* [Verifying multiple implementations](#impls):
how sortverif deals with multiple sorting implementations in the library.
* [Verifying cross-compiled implementations](#cross):
additional instructions for using a big server
to verify sorting code compiled for ARM.
* [Developing new rules for `minmax`](#newrules):
what to do if a new compiler produces complaints from the binary analysis.
* [Some verification timings](#timings):
concrete examples of verification feasibility for various sizes.
## The structure of the verification tools {#structure}
There are four main tools inside sortverif:
* `unroll`: An unroller
that, given the compiled djbsort library,
produces a fully unrolled program for arrays of the specified size.
This converter is a small Python script built on top of the
[angr](https://angr.io/) symbolic-execution tool introduced in
[2016 Shoshitaishvili–Wang–Salls–Stephens–Polino–Dutcher–Grosen–Feng–Hauser–Kruegel–Vigna](https://www.cs.ucsb.edu/~vigna/publications/2016_SP_angrSoK.pdf).
This tool, in turn, builds upon the VEX library inside
[Valgrind](https://valgrind.org).
* `minmax`: A peephole optimizer
that converts the fully unrolled program
to a sequence of "min" and "max" operations.
There is more information [below](#rules) on how this conversion works.
* `decompose`: A verifier
that proves that the min-max program is a sorting program.
There is more information [below](#minmax) on how this works.
* `verifymany`: A top-level driver
that runs `unroll`, `minmax`, and `decompose` for each size.
There is more information [below](#run) on how to run this tool.
## How to prove a min-max program {#minmax}
How does verification prove that a min-max program correctly sorts _all_ inputs?
Let's start with a standard observation about min-max programs,
called the "0-1 principle":
a min-max program correctly sorts all arrays
if and only if it correctly sorts all 0-1 arrays.
A 0-1 array means an array where each entry is either 0 or 1:
for example, the 0-1 arrays of length 3 are 0,0,0 and 0,0,1 and 0,1,0 and 0,1,1 and 1,0,0 and 1,0,1 and 1,1,0 and 1,1,1.
The point is that if a min-max program mis-sorts an array,
let's say producing an output ...,27,18,... with 27 and 18 in the wrong order,
then the same min-max program will also mis-sort a 0-1 array that has 1 exactly where the original array has 27 or larger,
producing an output ...,1,0,... with 1 and 0 in the wrong order.
The 0-1 principle is normally formulated for networks,
meaning min-max programs consisting of comparators that overwrite a,b with min(a,b),max(a,b);
but it's also true for more general min-max data flows.
If a min-max program works for all 0-1 arrays,
then the output of the same min-max program will also have the correct number of occurrences of 27 or larger
(the same number of occurrences as in the original array)
and will have those after the other integers.
Now replace "27 or larger" with "larger than 27"
to see that the output will have the correct number of occurrences of exactly 27,
and the correct positioning of those occurrences.
The 0-1 principle reduces the verification cost for `int32[n]` sorting from 2^32n^ to 2^n^.
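As a concrete illustration (a standalone Python sketch written for this page, not part of the sortverif code), one can check a small comparator network against all 0-1 arrays, and then confirm by brute force that, as the principle promises, the network sorts arbitrary arrays:

```python
# Check the 0-1 principle on a small min-max program. A "program" here is
# a list of comparator pairs (i, j), each replacing a[i], a[j] with their
# min and max. This is an illustration only, not sortverif code.
from itertools import product

def run(program, array):
    a = list(array)
    for i, j in program:
        a[i], a[j] = min(a[i], a[j]), max(a[i], a[j])
    return a

def sorts_all_01(program, n):
    return all(run(program, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))

# A classic 3-input sorting network:
net3 = [(0, 1), (1, 2), (0, 1)]
assert sorts_all_01(net3, 3)
# By the 0-1 principle, net3 therefore sorts all length-3 arrays;
# spot-check this by brute force over a small value range:
assert all(run(net3, a) == sorted(a) for a in product(range(4), repeat=3))
```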
A standard variant of the principle reduces the cost to polynomial in n.
This variant says that a min-max program correctly _merges_ all _sorted_ arrays of length n−r
with all _sorted_ arrays of length r
if and only if it correctly merges all sorted 0-1 arrays of length n−r
with all sorted 0-1 arrays of length r.
Notice that there are only r+1 sorted 0-1 arrays of length r:
for example, for r = 3, these arrays are 0,0,0 and 0,0,1 and 0,1,1 and 1,1,1.
The most important step in the `decompose` tool
is analyzing the data flow in a min-max program
to compute a decomposition of the program into presumed 2-way-merging programs.
The tool does not require the merge to be a merge of sorted inputs on the left
with sorted inputs on the right;
it allows arbitrary ordering or interleaving of the two lists of inputs.
(In the current version of djbsort,
left and right sides are often interleaved for vectorizability.
Also, the left side is often in reverse order,
as in many presentations of bitonic sorting.)
The tool then verifies each presumed 2-way-merging program
by testing the program on a linear-size list of inputs.
For example,
a min-max program correctly merges all sorted sequences of 3 inputs
with all sorted sequences of 5 inputs if and only if it correctly merges
0 1 2 with 3 4 5 6 7;
0 1 7 with 2 3 4 5 6;
0 6 7 with 1 2 3 4 5;
and
5 6 7 with 0 1 2 3 4.
(Linear size refers to the number of inputs here being n.
Each input fits into ceil(log~2~n) bits.
Programmers using these techniques to verify, e.g., `int8[300]` sorting
should note that these test inputs need `int16` instead.
The `decompose` tool uses Python's integers, which do not have size limits other than total memory.)
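The pattern of this test list can be sketched as follows. Here `merge_tests` is a hypothetical helper written for this page, with the pattern inferred from the 3-with-5 example; see the papers cited below for the precise general constructions:

```python
# Sketch of the linear-size test list described above: the kth test gives
# the left side the k smallest and r-k largest of the n values, with the
# remaining values, in order, on the right side. Illustration only.
def merge_tests(r, s):
    n = r + s
    for k in range(r, -1, -1):
        left = list(range(k)) + list(range(n - (r - k), n))
        right = [v for v in range(n) if v not in left]
        yield left, right

# Reproduces the 3-with-5 example from the text:
tests = list(merge_tests(3, 5))
assert tests[0] == ([0, 1, 2], [3, 4, 5, 6, 7])
assert tests[1] == ([0, 1, 7], [2, 3, 4, 5, 6])
assert tests[2] == ([0, 6, 7], [1, 2, 3, 4, 5])
assert tests[3] == ([5, 6, 7], [0, 1, 2, 3, 4])
```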
The fact that a quadratic-size list of test bits
is enough for a merging network
(whereas the best methods known to test a general sorting network
take exponential time)
is noted in Knuth's Art of Computer Programming.
The fact that larger test inputs allow a _linear-size_ list
(analogously to an older speedup by Yao
for general sorting networks)
was pointed out in
[1987 Chung–Ravikumar](https://www.sciencedirect.com/science/article/pii/0012365X9090173F).
However, the test set stated in that paper is too small
(off by one).
A correct test set appears in
[2007 Even–Levi–Litman](https://web.archive.org/web/20250103112116/https://www.eng.tau.ac.il/~guy/Papers/conclusive.pdf).
Both of these papers
consider only the balanced case,
merging n/2 inputs with n/2 inputs,
but the idea generalizes immediately to other cases.
This verification strategy suffices for djbsort,
which handles each array size with a min-max program built as a composition of 2-way merging programs.
Beyond djbsort,
some sorting networks are not built as compositions of 2-way merging programs:
for example, verifying the communication-efficient
Schnorr–Shamir mesh sorting algorithm
would require extending the tool to recognize
the structure of that algorithm.
## The rewrite rules for `minmax` {#rules}
The point of the `minmax` tool is as follows.
The programmer transforms a sequence of min-max operations
into a sequence of operations in C code;
the compiler transforms those operations
into a sequence of machine instructions;
`unroll`, using angr's binary analysis,
transforms those machine instructions
into a sequence of instructions in angr's language for arithmetic;
`minmax` tries to transform those instructions back into the original sequence of min-max operations.
Internally,
`minmax` follows various rewrite rules in a file `rules`,
such as a rule
invert(unsignedle(S,T)) -> unsignedlt(T,S)
saying that complementing the 1-bit result of `S<=T` is the same as the 1-bit result of `T<S`,
and more complicated rules such as
xor(and(signedrshift(xor(BMINUSA=sub(B,A),and(xor(B,BMINUSA),AXORB=xor(A,B))),Bits(32,const31)),AXORB),A) -> signedmin(A,B)
recognizing one of the portable ways to express min(A,B) using arithmetic operations.
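As an illustration (not sortverif code), the arithmetic pattern matched by that rule can be replayed in Python, emulating 32-bit two's-complement operations:

```python
# Emulate the 32-bit "min via arithmetic" idiom matched by the rule above:
# z = b-a; z ^= (a^b) & (z^b); the sign of z then selects a or b.
M32 = 0xFFFFFFFF

def s32(x):                        # reinterpret the low 32 bits as signed
    x &= M32
    return x - (1 << 32) if x >> 31 else x

def sar31(x):                      # arithmetic right shift of a 32-bit word by 31
    return M32 if x >> 31 else 0

def portable_min32(a, b):
    r = (a ^ b) & M32              # AXORB
    z = (b - a) & M32              # BMINUSA
    z ^= r & (z ^ b) & M32         # overflow-safe sign computation for b-a
    mask = sar31(z)                # all-ones exactly when b < a (signed)
    return s32(a ^ (mask & r))     # selects b when b < a, else a

# Agrees with min over signed values, including overflow-prone pairs:
vals = [-2**31, -2**31 + 1, -1, 0, 1, 2**31 - 2, 2**31 - 1]
assert all(portable_min32(a, b) == min(a, b) for a in vals for b in vals)
```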
Each `minmax` run generates a random input for the program it is analyzing,
and tests that each rewrite being applied produces correct results for that input.
It would be easy to scale up the number of inputs tested for each rewrite.
However, this would not catch rewrite bugs that apply only to rare inputs.
For higher assurance,
there is a script `checkrules`
(usage after [angr installation](#run): `./checkrules < rules`)
that uses an SMT solver (via angr's SMT-solving interface)
to check that the rules are correct for various sizes of operands.
SMT solvers are designed to reason about how all possible inputs behave,
not just random inputs;
for example,
echo 'equal(S,Concat(Bits(64,const31),Bits(64,const5))) -> bit0' | ./checkrules
rapidly prints out a specific 128-bit value of `S` disproving this rule.
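For intuition about what such a check establishes, the same kind of question can be answered by brute force when the operands are tiny. This sketch (an illustration only; `checkrules` itself uses SMT rather than enumeration) verifies the `invert(unsignedle(S,T)) -> unsignedlt(T,S)` rule for all pairs of 4-bit operands:

```python
# Brute-force analogue of an SMT check, for 4-bit operands:
# invert(unsignedle(S,T)) must equal unsignedlt(T,S) for every S, T.
W = 4
ok = all(
    (1 ^ (1 if s <= t else 0)) == (1 if t < s else 0)
    for s in range(1 << W)
    for t in range(1 << W)
)
assert ok
```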
## How to install angr and verify djbsort {#run}
The sortverif toolkit has been tested only under Debian and Ubuntu.
The toolkit needs the following system packages:
* gcc and other compiler tools: `apt install build-essential`
* clang: `apt install clang`
* Python 3: `apt install python3`
* support for Python virtual environments: `apt install python3-venv`
* a parsing library: `apt install python3-pyparsing`
* `/usr/bin/time`, not required but better than sh builtins for timings: `apt install time`
(Here's everything together: `apt install build-essential clang python3 python3-venv python3-pyparsing time`.)
The remaining steps are carried out as an unprivileged user.
[Download and unpack](download.html) djbsort.
(The sortverif toolkit is included in the djbsort source distribution.
It is not installed as part of a djbsort binary package,
and is not incorporated into the `djbsort-fulltest` utility.)
[Install](install.html) djbsort
as a home-directory installation
if it isn't already installed on the system.
(The toolkit tests machine code, not source code.)
Inside the unpacked djbsort distribution,
switch into the `sortverif` directory
and install angr:
cd sortverif
python3 -m venv angrenv
. angrenv/bin/activate
pip install angr==9.2.102 pycparser==2.23
If you come back to this later in a new shell,
you will have to switch to the `sortverif` directory again
and run `. angrenv/bin/activate` again.
Run the `verifymany` tool
to verify sorting for (e.g.) array lengths 0 through 16:
seq 0 16 | time ./verifymany &
The `verifymany` tool puts its outputs into a `results` directory.
Check for `success` lines and `fail` lines in a file `results/overview`.
Actually, `results` is a symbolic link to a directory whose name matches `results-*`,
where the `*` includes a timestamp.
If you run `verifymany` again then the symbolic link will point to the new directory
while the old results will still be in the previous directory.
By default,
`verifymany` will parallelize across all OS threads.
The OS normally sets up one thread per CPU core
(times the number of CPU hyperthreads per core).
This can be a problem if there is not a correspondingly large amount of RAM.
You can instead run
seq 0 16 | time env THREADS=3 ./verifymany &
and then `verifymany` will run at most 3 verifications at once.
A few components of angr are further parallelized,
so the load average can be somewhat larger than 3;
`verifymany` does not pin each verification to an OS thread.
Parallel computation will tend to have slightly lower latency
if you try array lengths in the opposite order,
changing `seq 0 16` to `seq 16 -1 0`.
You can also test a specific length using, e.g., `echo 768`.
An experiment with 128 threads on a big server
for array sizes through 2048
found total memory usage fitting into about 1TB.
A single `int64[2048]` verification grew to 30GB
but the average memory usage was lower.
## Verifying multiple implementations {#impls}
sortverif automatically analyzes
every sorting implementation compiled into the library,
not just the implementations automatically selected for the current machine.
Concretely, on Intel/AMD CPUs, the verification tools automatically
analyze both the fast AVX2 implementation and the portable fallback,
whether or not the current machine supports AVX2.
If you configure djbsort with `--no-trim`, install, and then run sortverif,
then further options will be compiled into the library
and will be analyzed by sortverif.
For example, sortverif will analyze the AVX2 implementation compiled with gcc
and the AVX2 implementation compiled with clang,
if both gcc and clang are installed when djbsort is configured and compiled.
## Verifying cross-compiled implementations {#cross}
For conventional tests,
running native binaries is faster than running emulated binaries,
although running an emulated binary on a big server
can still outperform the same binary running natively on a smaller device.
The sortverif analysis is different:
binaries are emulated in all cases,
so you might as well run the analysis for all architectures on a big server.
For concreteness,
the following instructions focus on an Intel/AMD server running Debian or Ubuntu
to verify djbsort cross-compiled for one architecture, 32-bit ARM.
There are various ARM-specific parts here:
the name `arm32`;
Debian's `armel` name for little-endian `arm32`;
gcc's `arm-linux-gnueabi` name for `arm32` with a particular ABI;
and more specific compiler options `-mthumb -march=armv7-a -mtune=cortex-a7 -mgeneral-regs-only`.
As root,
install an ARM cross-compiler, libcpucycles for ARM, and an emulator,
along with the system packages generally needed for sortverif:
dpkg --add-architecture armel
apt update
apt install \
build-essential clang python3 python3-venv python3-pyparsing time \
binutils-arm-linux-gnueabi crossbuild-essential-armel libc6:armel libcpucycles-dev:armel \
binfmt-support qemu-user-static
ln -s /usr/bin/clang /usr/bin/arm-linux-gnueabi-clang
If your Debian or Ubuntu is too old to include `libcpucycles-dev`,
do the above without `libcpucycles-dev:armel`
and then install libcpucycles as follows:
cd /root
wget -m https://cpucycles.cr.yp.to/libcpucycles-latest-version.txt
version=$(cat cpucycles.cr.yp.to/libcpucycles-latest-version.txt)
wget -m https://cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
tar -xzf cpucycles.cr.yp.to/libcpucycles-$version.tar.gz
cd libcpucycles-$version
echo arm-linux-gnueabi-gcc -Wall -fPIC -fwrapv -O2 > compilers/default
./configure --host=arm32 --prefix=/usr/lib/arm-linux-gnueabi && make -j8
mkdir -p /usr/include /usr/lib/arm-linux-gnueabi
cp build/0/package/lib/libcpucycles.so.1 /usr/lib/arm-linux-gnueabi/
ln -s libcpucycles.so.1 /usr/lib/arm-linux-gnueabi/libcpucycles.so
cp build/0/package/include/cpucycles.h /usr/include/
Now make an account specifically to verify sorting code compiled for 32-bit ARM:
adduser --disabled-password --gecos arm32sort arm32sort
su - arm32sort
All further steps are within this account.
(If you want to verify sorting code for multiple architectures, make one account per architecture.)
Set up environment variables:
export LD_LIBRARY_PATH="$HOME/lib"
export LIBRARY_PATH="$HOME/lib"
export CPATH="$HOME/include"
export MANPATH="$HOME/man"
export PATH="$HOME/bin:$PATH"
export CROSS="arm-linux-gnueabi-"
export CC="arm-linux-gnueabi-gcc"
export CFLAGS="-L$HOME/lib"
Download, unpack, and install djbsort,
configuring with `--host=arm32` and two choices of cross-compilers:
cd
wget -m https://sorting.cr.yp.to/djbsort-latest-version.txt
version=$(cat sorting.cr.yp.to/djbsort-latest-version.txt)
wget -m https://sorting.cr.yp.to/djbsort-$version.tar.gz
tar -xzf sorting.cr.yp.to/djbsort-$version.tar.gz
cd djbsort-$version
( echo arm-linux-gnueabi-gcc -Wall -fPIC -fwrapv -O2 -mthumb -march=armv7-a -mtune=cortex-a7 -mgeneral-regs-only
echo arm-linux-gnueabi-clang -Wall -fPIC -fwrapv -Qunused-arguments -O2 -mthumb -mcpu=cortex-a7 -mgeneral-regs-only
) > compilers/default
./configure --host=arm32 --no-trim --prefix=$HOME && make -j8 install
As a test at this point,
running `djbsort-speed` should print various lines including `djbsort arch arm32` near the top,
and should continue with various lines mentioning `arm-linux-gnueabi-gcc` and `arm-linux-gnueabi-clang`,
along with many cycle counts
(higher cycle counts than running on a real ARM device;
`djbsort-speed` is running in emulation under `qemu`).
Next, switch into the `sortverif` directory and install angr:
cd sortverif
python3 -m venv angrenv
. angrenv/bin/activate
pip install angr==9.2.102 pycparser==2.23
Finally, run `verifymany`:
seq 0 16 | time env THREADS=3 ./verifymany &
Results are in `results/overview` as in the non-cross-compiled case.
## Developing new rules for `minmax` {#newrules}
A change to the machine code for sorting—for example,
because of a change to the target architecture or compiler options,
a change to the compiler version, or a change to a different compiler—could break verification.
This could be because the new machine code is wrong
(try introducing a bug into the djbsort code and then see what sortverif does),
but it could also be because the binary-analysis tools do not understand the instructions used.
Here is an example from the `minmax` development,
where machine code for a portable implementation called `bitonic2`
used min-max idioms
that were not recognized by a development version of `minmax`,
but adding various lines to `rules` recognized the idioms.
The first sortverif experiments on 32-bit ARM (via cross-compilation)
successfully verified `int32` sorting but failed for `int64` sorting with `bitonic2`.
There was a line in `results/overview` saying
`minmaxfailed/int64/impl0/2 bitonic2` for `gcc`,
and similarly for `clang`.
Other files in the `results` directory
included `results/minmaxfailed/int64/impl0/2` saying "exit code 1";
`results/minmaxstderr/int64/impl0/2` saying
`Exception: resulting operations ['Concat', 'Extract', 'ZeroExt', 'and', 'constant', 'signedrshift', 'sub', 'unsignedle', 'xor'] are not just signedmin, signedmax`;
and `results/minmaxstdout/int64/impl0/2` with the following contents:
v1 = x_0_64
v3 = Extract(v1,63,32)
v2 = x_1_64
v4 = Extract(v2,63,32)
v5 = xor(v3,v4)
v6 = sub(v4,v3)
v7 = Extract(v1,31,0)
v8 = Extract(v2,31,0)
v12 = unsignedle(v7,v8)
v13 = ZeroExt(v12,31)
v14 = constant(32,1)
v15 = xor(v13,v14)
v16 = sub(v6,v15)
v17 = xor(v4,v16)
v18 = and(v17,v5)
v19 = xor(v16,v18)
v20 = constant(32,31)
v21 = signedrshift(v19,v20)
v22 = and(v5,v21)
v23 = xor(v3,v22)
v24 = xor(v7,v8)
v25 = and(v24,v21)
v26 = xor(v7,v25)
v27 = Concat(v23,v26)
y_0_64 = v27
v28 = xor(v4,v22)
v29 = xor(v8,v25)
v30 = Concat(v28,v29)
y_1_64 = v30
This is correct code using 32-bit arithmetic
to compute the min and max of two 64-bit inputs,
but that version of `minmax`
did not see how to simplify the code down into just saying min and max.
The original C code already had several operations to replace `a` and `b` with their min and max,
namely `r = b^a; z = b-a; z ^= r&(z^b); z >>= 63; z &= r; a ^= z; b ^= z`.
The code then expanded because each of the variables here is an `int64` variable.
On a 32-bit CPU, an `int64` operation typically turns into multiple 32-bit instructions:
for example, a 64-bit `r = b^a` turns into two 32-bit xors,
and a 64-bit `z = b-a` turns into a 32-bit subtraction followed by a 32-bit subtract-with-borrow instruction.
In context, the 32-bit subtraction matters only for the borrow result,
since `z >>= 63` depends only on the top bit of `z`.
The subtraction-with-borrow instruction
is decomposed by angr into simpler instructions in the language used by angr for arithmetic:
for example, the 32-bit borrow is expressed as
`xor(ZeroExt(C,31),constant(32,1))`
where `C` is a comparison bit, `ZeroExt(C,31)` pads it to 32 bits,
and `constant(32,1)` is the 32-bit constant with value 1.
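As a sketch (a Python emulation written for this page, not angr output), the word-level decomposition described above looks like this:

```python
# Emulate 64-bit B-A computed from 32-bit words with a borrow, matching
# the subtract/subtract-with-borrow pattern described above.
M32 = 0xFFFFFFFF
M64 = (1 << 64) - 1

def sub64_via_words(a, b):            # returns (b - a) mod 2^64
    a_lo, a_hi = a & M32, (a >> 32) & M32
    b_lo, b_hi = b & M32, (b >> 32) & M32
    lo = (b_lo - a_lo) & M32
    borrow = 1 if b_lo < a_lo else 0  # unsignedlt(B_lo, A_lo), i.e.
                                      # invert(unsignedle(A_lo, B_lo))
    hi = (b_hi - a_hi - borrow) & M32
    return (hi << 32) | lo

pairs = [(0, 0), (1, 5), (5, 1), (M64, 0), (0, M64), (1 << 32, 1)]
assert all(sub64_via_words(a, b) == (b - a) & M64 for a, b in pairs)
```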
Adding the following lines to `rules` simplified the computation:
xor(ZeroExt(C,31),Bits(32,0star1)) -> ZeroExt(invert(C),31)
This turns the 32-bit xor into a 1-bit inversion.
`Bits(32,...)` is a pattern matching a 32-bit quantity `...`,
and `0star1` is a pattern matching a string of 0s followed by a 1, i.e., the integer 1.
(Bit strings in angr are big-endian.)
invert(unsignedle(S,T)) -> unsignedlt(T,S)
The comparison bit is computed as `unsignedle(S,T)`, meaning `S<=T`,
where `S` and `T` are the bottom 32 bits of the 64-bit inputs.
Inverting this comparison bit gives the same result as `T<S`.
sub(sub(Extract(Bits(64,B),63,32),Extract(Bits(64,A),63,32)),ZeroExt(unsignedlt(Extract(B,31,0),Extract(A,31,0)),31)) -> Extract(sub(B,A),63,32)
This recognizes what the compiler produces for the top 32-bit word of `B-A`.
and(signedrshift(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),and(xor(BTOP=Extract(B,63,32),BMINUSATOP),BXORATOP=xor(ATOP=Extract(A,63,32),BTOP))),Bits(32,const31)),BXORATOP) -> xor(ATOP,Extract(signedmin(A,B),63,32))
and(signedrshift(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),and(xor(BTOP=Extract(B,63,32),BMINUSATOP),xor(Extract(A,63,32),BTOP))),Bits(32,const31)),xor(ABOT=Extract(A,31,0),Extract(B,31,0))) -> xor(ABOT,Extract(signedmin(A,B),31,0))
This recognizes what the compiler produces for the top and bottom 32-bit words of
`r = b^a; z = b-a; z ^= r&(z^b); z >>= 63; z &= r`,
and rewrites these as the corresponding words of `a` xor'ed with the desired min.
xor(Extract(B,31,0),xor(Extract(A,31,0),Extract(signedmin(A,B),31,0))) -> Extract(signedmax(A,B),31,0)
xor(Extract(B,63,32),xor(Extract(A,63,32),Extract(signedmin(A,B),63,32))) -> Extract(signedmax(A,B),63,32)
This recognizes the top and bottom words of a computation of the max
as the min xor'ed with the two inputs.
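The identity behind these two rules: since min(a,b) is one of a and b, xor'ing it with both inputs cancels the min and leaves the max. A quick Python check:

```python
# min(a,b) ^ a ^ b == max(a,b): xor with both inputs cancels the min.
for a in range(-8, 8):
    for b in range(-8, 8):
        assert min(a, b) ^ a ^ b == max(a, b)
```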
These extra rules changed `results/minmaxstdout/int64/impl0/2` to simply
v1 = x_0_64
v2 = x_1_64
v27 = signedmin(v1,v2)
y_0_64 = v27
v30 = signedmax(v1,v2)
y_1_64 = v30
and produced `success/int64/impl0/2`.
But there was still a failure for `impl1/2`, which used `clang` instead of `gcc`.
The following extra rules fixed that:
and(signedrshift(xor(and(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),BTOP=Extract(B,63,32)),BXORATOP=xor(BTOP,ATOP=Extract(A,63,32))),BMINUSATOP),Bits(32,const31)),BXORATOP) -> xor(ATOP,Extract(signedmin(A,B),63,32))
and(signedrshift(xor(and(xor(BMINUSATOP=Extract(sub(Bits(64,B),Bits(64,A)),63,32),BTOP=Extract(B,63,32)),xor(BTOP,ATOP=Extract(A,63,32))),BMINUSATOP),Bits(32,const31)),xor(Extract(B,31,0),ABOT=Extract(A,31,0))) -> xor(ABOT,Extract(signedmin(A,B),31,0))
With these added rules,
verification finished for all of the `int64[2]` sorting implementations in the library for 32-bit ARM.
Larger array sizes then needed a few more rules,
but the most complicated work was for `int64[2]`.
## Some verification timings {#timings}
Dual AMD EPYC 7742 running at 2.245GHz (overclocking disabled)
verifying 4 implementations
(`int32[n]` AVX2, `int32[n]` portable fallback, `int64[n]` AVX2, `int64[n]` portable fallback)
at various sizes
(each experiment here covers all sizes down through 0 so there is some redundancy):
$ seq 4 -1 0 | time env THREADS=128 ./verifymany
results: 20
successes: 20
156.21user 10.55system 0:10.57elapsed 1577%CPU (0avgtext+0avgdata 153052maxresident)k
552inputs+2216outputs (108major+1128869minor)pagefaults 0swaps
$ seq 8 -1 0 | time env THREADS=128 ./verifymany
results: 36
successes: 36
244.73user 16.19system 0:10.34elapsed 2522%CPU (0avgtext+0avgdata 153008maxresident)k
8inputs+3160outputs (249major+1924771minor)pagefaults 0swaps
$ seq 16 -1 0 | time env THREADS=128 ./verifymany
results: 68
successes: 68
434.31user 36.47system 0:12.76elapsed 3689%CPU (0avgtext+0avgdata 153132maxresident)k
0inputs+5472outputs (1373major+3722489minor)pagefaults 0swaps
$ seq 32 -1 0 | time env THREADS=128 ./verifymany
results: 132
successes: 132
988.40user 109.17system 0:18.51elapsed 5927%CPU (0avgtext+0avgdata 158428maxresident)k
8inputs+13152outputs (2318major+8387876minor)pagefaults 0swaps
$ seq 64 -1 0 | time env THREADS=128 ./verifymany
results: 260
successes: 260
3614.18user 191.98system 0:38.63elapsed 9852%CPU (0avgtext+0avgdata 186716maxresident)k
8inputs+44976outputs (1862major+22207438minor)pagefaults 0swaps
$ seq 128 -1 0 | time env THREADS=128 ./verifymany
results: 516
successes: 516
12481.45user 365.95system 1:32.14elapsed 13942%CPU (0avgtext+0avgdata 265092maxresident)k
0inputs+192728outputs (1726major+65655150minor)pagefaults 0swaps
$ seq 256 -1 0 | time env THREADS=128 ./verifymany
results: 1028
successes: 1028
46587.32user 844.62system 5:32.73elapsed 14255%CPU (0avgtext+0avgdata 565696maxresident)k
552inputs+922880outputs (2830major+219868703minor)pagefaults 0swaps
$ seq 512 -1 0 | time env THREADS=128 ./verifymany
results: 2052
successes: 2052
201194.95user 2555.92system 24:53.86elapsed 13639%CPU (0avgtext+0avgdata 1790720maxresident)k
280inputs+4609744outputs (3153major+829241677minor)pagefaults 0swaps
$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
results: 4100
successes: 4100
969333.74user 10324.09system 2:03:56elapsed 13174%CPU (0avgtext+0avgdata 6598792maxresident)k
336inputs+23097840outputs (3830major+3554238496minor)pagefaults 0swaps
Verifying 4 implementations at some selected sizes
(beware that size 8192 used more than 512GB):
$ echo 512 653 761 768 857 953 960 1013 1024 1248 1277 1280 1547 1664 2048 4096 8192 | time env THREADS=128 ./verifymany
results: 68
successes: 68
147586.26user 5486.63system 19:23:27elapsed 219%CPU (0avgtext+0avgdata 486243736maxresident)k
2012837248inputs+1944920outputs (44842378major+2862319742minor)pagefaults 0swaps
Verifying 16 implementations after compilation with `--no-trim`:
$ seq 4 -1 0 | time env THREADS=128 ./verifymany
results: 80
successes: 80
475.04user 47.17system 0:14.62elapsed 3570%CPU (0avgtext+0avgdata 154904maxresident)k
0inputs+5520outputs (2202major+4183646minor)pagefaults 0swaps
$ seq 8 -1 0 | time env THREADS=128 ./verifymany
results: 144
successes: 144
931.09user 110.29system 0:17.89elapsed 5818%CPU (0avgtext+0avgdata 154912maxresident)k
280inputs+9232outputs (1763major+7784589minor)pagefaults 0swaps
$ seq 16 -1 0 | time env THREADS=128 ./verifymany
results: 272
successes: 272
2170.06user 188.34system 0:24.06elapsed 9801%CPU (0avgtext+0avgdata 154860maxresident)k
8inputs+17688outputs (1959major+15245422minor)pagefaults 0swaps
$ seq 32 -1 0 | time env THREADS=128 ./verifymany
results: 528
successes: 528
5559.28user 329.05system 0:40.21elapsed 14640%CPU (0avgtext+0avgdata 158752maxresident)k
0inputs+44056outputs (2181major+34434657minor)pagefaults 0swaps
$ seq 64 -1 0 | time env THREADS=128 ./verifymany
results: 1040
successes: 1040
15827.23user 641.78system 1:34.97elapsed 17340%CPU (0avgtext+0avgdata 191036maxresident)k
304inputs+147944outputs (2880major+90343814minor)pagefaults 0swaps
$ seq 128 -1 0 | time env THREADS=128 ./verifymany
results: 2064
successes: 2064
48796.34user 1338.16system 4:58.60elapsed 16789%CPU (0avgtext+0avgdata 277844maxresident)k
16inputs+626704outputs (2586major+265003719minor)pagefaults 0swaps
$ seq 256 -1 0 | time env THREADS=128 ./verifymany
results: 4112
successes: 4112
173636.20user 3195.65system 19:31.36elapsed 15096%CPU (0avgtext+0avgdata 629124maxresident)k
688inputs+2979424outputs (4411major+865383836minor)pagefaults 0swaps
$ seq 512 -1 0 | time env THREADS=128 ./verifymany
results: 8208
successes: 8208
724217.82user 9306.65system 1:27:59elapsed 13893%CPU (0avgtext+0avgdata 2061384maxresident)k
416inputs+14986112outputs (5656major+3071113112minor)pagefaults 0swaps
$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
results: 16400
successes: 16400
3370271.42user 34556.82system 7:07:51elapsed 13263%CPU (0avgtext+0avgdata 7679840maxresident)k
48inputs+75308976outputs (20281major+12057248973minor)pagefaults 0swaps
Verifying 6 implementations compiled for 32-bit ARM in Thumb mode with `--no-trim`:
$ seq 4 -1 0 | time env THREADS=128 ./verifymany
results: 30
successes: 30
193.53user 14.32system 0:11.67elapsed 1779%CPU (0avgtext+0avgdata 152620maxresident)k
0inputs+2688outputs (390major+1638987minor)pagefaults 0swaps
$ seq 8 -1 0 | time env THREADS=128 ./verifymany
results: 54
successes: 54
391.66user 28.02system 0:12.05elapsed 3481%CPU (0avgtext+0avgdata 152680maxresident)k
0inputs+4176outputs (867major+3122156minor)pagefaults 0swaps
$ seq 16 -1 0 | time env THREADS=128 ./verifymany
results: 102
successes: 102
833.59user 72.34system 0:19.61elapsed 4618%CPU (0avgtext+0avgdata 153912maxresident)k
0inputs+8304outputs (2666major+7427107minor)pagefaults 0swaps
$ seq 32 -1 0 | time env THREADS=128 ./verifymany
results: 198
successes: 198
3110.18user 158.26system 0:40.86elapsed 7997%CPU (0avgtext+0avgdata 164744maxresident)k
0inputs+24232outputs (2390major+18622979minor)pagefaults 0swaps
$ seq 64 -1 0 | time env THREADS=128 ./verifymany
results: 390
successes: 390
12439.13user 284.02system 1:47.89elapsed 11792%CPU (0avgtext+0avgdata 203588maxresident)k
0inputs+99768outputs (2604major+51671692minor)pagefaults 0swaps
$ seq 128 -1 0 | time env THREADS=128 ./verifymany
results: 774
successes: 774
55395.70user 618.29system 6:47.86elapsed 13733%CPU (0avgtext+0avgdata 355768maxresident)k
16inputs+491656outputs (2423major+163326008minor)pagefaults 0swaps
$ seq 256 -1 0 | time env THREADS=128 ./verifymany
results: 1542
successes: 1542
277933.18user 1680.16system 35:13.78elapsed 13228%CPU (0avgtext+0avgdata 968752maxresident)k
0inputs+2555208outputs (2417major+547038882minor)pagefaults 0swaps
Verifying 3 implementations compiled for 64-bit ARM:
$ seq 4 -1 0 | time env THREADS=128 ./verifymany
results: 15
successes: 15
129.26user 7.08system 0:08.00elapsed 1703%CPU (0avgtext+0avgdata 152692maxresident)k
0inputs+1944outputs (99major+823002minor)pagefaults 0swaps
$ seq 8 -1 0 | time env THREADS=128 ./verifymany
results: 27
successes: 27
172.91user 12.71system 0:08.77elapsed 2115%CPU (0avgtext+0avgdata 152728maxresident)k
0inputs+2624outputs (225major+1432485minor)pagefaults 0swaps
$ seq 16 -1 0 | time env THREADS=128 ./verifymany
results: 51
successes: 51
333.68user 24.84system 0:10.28elapsed 3484%CPU (0avgtext+0avgdata 152636maxresident)k
0inputs+4056outputs (1303major+2837773minor)pagefaults 0swaps
$ seq 32 -1 0 | time env THREADS=128 ./verifymany
results: 99
successes: 99
702.32user 69.33system 0:13.59elapsed 5675%CPU (0avgtext+0avgdata 152732maxresident)k
0inputs+8168outputs (2366major+6279701minor)pagefaults 0swaps
$ seq 64 -1 0 | time env THREADS=128 ./verifymany
results: 195
successes: 195
2330.41user 151.58system 0:23.17elapsed 10711%CPU (0avgtext+0avgdata 160168maxresident)k
0inputs+23168outputs (2434major+17144186minor)pagefaults 0swaps
$ seq 128 -1 0 | time env THREADS=128 ./verifymany
results: 387
successes: 387
7882.11user 282.80system 0:55.66elapsed 14667%CPU (0avgtext+0avgdata 185904maxresident)k
0inputs+91112outputs (2825major+50307004minor)pagefaults 0swaps
$ seq 256 -1 0 | time env THREADS=128 ./verifymany
results: 771
successes: 771
29035.16user 626.00system 3:20.71elapsed 14777%CPU (0avgtext+0avgdata 286408maxresident)k
0inputs+422768outputs (1988major+160600259minor)pagefaults 0swaps
$ seq 512 -1 0 | time env THREADS=128 ./verifymany
results: 1539
successes: 1539
118897.65user 1635.60system 14:26.97elapsed 13902%CPU (0avgtext+0avgdata 661284maxresident)k
8inputs+2145088outputs (3189major+528936595minor)pagefaults 0swaps
$ seq 1024 -1 0 | time env THREADS=128 ./verifymany
results: 3075
successes: 3075
541368.61user 5339.49system 1:08:24elapsed 13321%CPU (0avgtext+0avgdata 2227168maxresident)k
8inputs+10831952outputs (3671major+1840178657minor)pagefaults 0swaps
Observed single-verification timings varied across verification tasks.
Verification of AVX2 sorting code compiled with gcc
took
132 seconds for `int64[256]`,
495 seconds for `int64[512]`,
and
1105 seconds for `int64[1024]`.
The AVX2 code for `int32` sorting
has fewer instructions than the AVX2 code for `int64` sorting
and is faster to verify:
36 seconds for `int32[256]`,
60 seconds for `int32[512]`,
and
207 seconds for `int32[1024]`.
Verification time depends on machine-code details that depend on the compiler:
verification of AVX2 sorting code compiled with clang
took
32 seconds for `int32[256]`,
58 seconds for `int32[512]`,
180 seconds for `int32[1024]`,
134 seconds for `int64[256]`,
357 seconds for `int64[512]`,
and
1333 seconds for `int64[1024]`.