-rw-r--r-- 10209 djbsort-20260127/doc/download.md rawTo download and unpack the latest version of djbsort:

    wget -m https://sorting.cr.yp.to/djbsort-latest-version.txt
    version=$(cat sorting.cr.yp.to/djbsort-latest-version.txt)
    wget -m https://sorting.cr.yp.to/djbsort-$version.tar.gz
    tar -xzf sorting.cr.yp.to/djbsort-$version.tar.gz
    cd djbsort-$version

Then [install](install.html)
and [test](test.html).

## Archives and changelog (reverse chronological) {#changelog}

[`djbsort-20260127.tar.gz`](djbsort-20260127.tar.gz) [browse](djbsort-20260127.html)

**Algorithm.**
The vectorizable sorting algorithm has been redesigned from the ground up for higher speed and improved generality.
Both `int32/avx2` and the new `int64/avx2` are now automatically generated by `autogen/sort`.

**C API.**
There are new `uint64` and `float64` wrappers similar to the previous `uint32` and `float32` wrappers.
There are also new reverse-sorting wrappers
`int32down`, `uint32down`, `float32down`,
`int64down`, `uint64down`, `float64down`.

Each wrapper now has an `avx2useint` implementation to supplement the `useint` implementation.
All of these implementations are now automatically generated by `autogen/useint`.

All external symbols are now within the `djbsort` namespace
(meaning C symbols beginning `djbsort_`, not C++ symbols `djbsort::`).

**Packaging.**
The infrastructure is now a variant of the packaging infrastructure
from [lib25519](https://lib25519.cr.yp.to)
and [libmceliece](https://lib.mceliece.org).
This provides
cross-compilation,
namespace tests,
instruction-set tests,
tests for branches in the binaries,
automatic run-time selection of implementations,
etc.

**Internals.**
The software now uses
[cryptoint](https://cr.yp.to/papers.html#cryptoint)
to provide efficient comparators (`minmax`) on many CPUs
while at the same time protecting against current compilers converting arithmetic into branches.

**Benchmarking.**
Cycle counting now relies on
[libcpucycles](https://cpucycles.cr.yp.to),
both for sortbench and for the installed `djbsort-speed`.

An updated version of the sortbench package is now included in the djbsort package
and supersedes the separately released sortbench package.
The updated version compares the aspas, djbsort, far, herf, stdsort, vqsort, vxsort, and x86simdsort libraries
for `int32` and `int64` at many different array sizes;
systematically varies the data permutations, the physical tagging (running each benchmark program 8 times), and the array alignment;
and plots [stabilized quartiles](https://cr.yp.to/papers.html#rsrst).

**Verification.**
The `verifymany` driver has been rewritten,
and now analyzes all of the `int32` and `int64` implementations compiled into the library.

For `minmax`:
Speedups.
Less memory usage.
Computations for concrete inputs are now integrated,
both for constant propagation and for sanity checks.
More peephole optimizations for the code being analyzed,
moving further towards allowing a simpler symbolic-execution backend.
Almost all rewrite rules are now handled systematically with a DSL.
A separate `checkrules` script uses an SMT solver to verify rewrite rules expressed in the DSL.

For `unroll`:
Various tweaks,
such as disabling some angr peephole optimizations
and modifying angr's CPUID to report AVX2.

---

[`sortbench-20240116.tar.gz`](sortbench-20240116.tar.gz) [browse](sortbench-20240116.html)

Intermediate release of sortbench as a separate package.

---

[`djbsort-20190516.tar.gz`](djbsort-20190516.tar.gz) [browse](djbsort-20190516.html)

**Benchmarking.**
Speed tests now call cpucycles() before setting resource limits.
This is important on platforms where cpucycles() needs to read files.

**Verification.**
Support for SignExt and several more peephole optimizations,
working towards support for simpler symbolic-execution backend.
Various updates to work with angr8 and python3.

---

[`djbsort-20180729.tar.gz`](djbsort-20180729.tar.gz) [browse](djbsort-20180729.html)

**Algorithm.**
Rewrite of the core `int32/avx2` implementation
for (1) higher speed and (2) reduced memory consumption.
Stack allocation is now at most a few kilobytes,
even for gigantic arrays.

Internally,
the sorting algorithm
is now mostly bitonic to simplify indexing,
although odd-even speedups are still applied when convenient.
Lanes are complemented
to take the down-up decision out of the inner loops.

As in previous djbsort versions,
data is sorted first in vector lanes
and then transposed for final merges,
reducing the overall number of vector permutations.
Unlike previous versions,
transposition is done in-place.
The transposition in this version
is bit-reversal on the outer 6 bits
(bottom 3 bits and the top 3 bits),
but leaves intermediate bits alone.
Non-power-of-2 array sizes
are handled by an extra, more traditional, merge step.

Sizes 2, 3, 4, 5, 6, 7, 8, 16, 32 are now special-cased.
Non-power-of-2 sizes below 256
are padded to the next power of 2.

Portable implementations:
The out-of-place
`int32/portable1` and `int32/portable2` implementations
are now gone;
the in-place
`int32/portable3` and `int32/portable4` implementations
remain.

**C API.**
`float32_sort` is now supported.
The arithmetic in the reduction from `float32` to `int32`
is `int32` 31-bit right shift, `uint32` 1-bit right shift, xor;
this is slightly more efficient
than the reduction from `float32` to `uint32`
from [2001 Herf](refs.html#2001-herf).

**Compiling.**
Tests now have more variation (without much slowdown):
the `uint32` test cases
now deviate from `int32` in more than the sign;
`float32` uses floating-point numbers that aren't integers;
`int32` does more loops for small cases,
and some larger cases.

**Internals.**
API for 2-input sorting is now `MINMAX` macro
operating on two inputs in place.

Better inline assembly from Jason Donenfeld for 2-input sorting:
more flexibility in compiler's register allocation.

The package version number is now automatically copied to `version.c`
as the implementation version number
for implementations that don't provide `version.c`.

**Verification.**
`minmax` now supports more peephole optimizations
for complemented bitonic sorting and for padding:
`xor(s,xor(s,t))` ⇒ `t`;
`xor(-1,s)` ⇒ `invert(s)`;
`Reverse(Reverse(s))` ⇒ `s`;
`signedmin(invert(s),invert(t))` ⇒ `invert(signedmax(s,t))`;
`signedmax(invert(s),invert(t))` ⇒ `invert(signedmin(s,t))`;
`invert(s)[high:low]` ⇒ `invert(s[high:low])`;
`s[bits-1:0]` ⇒ `s`;
`s[high:low][high2:low2]` ⇒ `s[high2+low:low2+low]`;
`Concat(...)[high:low]` ⇒ `...[high-pos:low-pos]` when possible;
`Reverse(s)[high:low]` ⇒ `Reverse(s[...])` when possible;
eliminate `signedmin`/`signedmax`
when one input is the minimum or maximum constant.

`verifymany` now includes the implementation version number
on `verified` lines.

---

[`djbsort-20180717.tar.gz`](djbsort-20180717.tar.gz) [browse](djbsort-20180717.html)

**C API.**
`uint32_sort` is now supported, joining `int32_sort`.
(Internally, `uint32_sort` simply flips top bits and calls `int32_sort`.
Inlining the `int32_sort` code and integrating the flips into the initial and final passes
would be noticeably faster.
Adapting the `int32` code to handle `uint32` directly, without flips,
would be noticeably faster on platforms that have all relevant `uint32` instructions.
However, the separate flip has the virtue of minimizing the code size for the library.)

The `.h` files should work from C++ now. (Not tested yet.)

**Compiling and benchmarking.**
`./do` now finishes
by running `int32-speed`.

`int32-speed` now prints
cycle-count overhead;
cycle counts for some intermediate sizes;
and cycles per byte.

The default compiler list is now revamped,
and shorter.

**Internals.**
There is now a unified internal API for 2-input sorting.
This API has the following interchangeable implementations:
`int32_minmax.c` (portable);
`int32_minmax_x86.c` (using `cmovg` in assembly);
presumably more later.
These implementations are now shared by the higher-level sorting code.

**Verification.**
`verifymany` now prints a "`verified`" line for each successful verification.

Verification is now flexible enough
to handle the `portable` implementations,
at least compiled on amd64
with typical compiler options.

Internally,
Z3 is no longer asked
to simplify symbolic expressions.
All necessary simplifications
are handled by peephole optimizations
(in djbsort's `minmax`,
or in patches from djbsort to angr's claripy).
Added peephole optimizations in `minmax`:
`If(c,constbit1,constbit0)` ⇒ `c`;
`xor(c,constbit1)` ⇒ `invert(c)`;
`equal(c,bit0)` ⇒ `invert(c)`;
`invert(invert(c))` ⇒ `c`.

More operations supported in input DSL
for `minmax` and `tryinput`:
`xor`; `or`; `and`;
`add`; `sub`; `mul`;
`equal`;
`signedlt`;
`signedrshift`;
any number of inputs to concatenation.
Reduced redundancy in `minmax` input grammar.
Some work on cleaning up DSL syntax.

---

[`djbsort-20180710.tar.gz`](djbsort-20180710.tar.gz) [browse](djbsort-20180710.html)

Original release.

---

## Archive of angr patches

To simplify the `unroll` and `minmax` tools,
djbsort contributed some patches to angr back in 2018:

* [`angrpatch1.txt`](angrpatch1.txt):
  Support the AVX2 `vpunpck` instructions.
  Otherwise angr was crashing on djbsort.

* [`angrpatch2.txt`](angrpatch2.txt):
  Superseded by `angrpatch3.txt`.

* [`angrpatch3.txt`](angrpatch3.txt):
  Various peephole optimizations.
  Specifically:
  C idiom for signedmin;
  C idiom for signedmax;
  `reverse(concat(reverse(x),reverse(y)))`
  ⇒ `concat(y,x)` if lengths are multiples of 8;
  `reverse(concat(reverse(x),0))`
  ⇒ `concat(0,x)` if lengths are multiples of 8;
  same for any combinations of reversals and zeros;
  `reverse(extract(reverse(x),high,low))`
  ⇒ `extract(x,xbits-1-low,xbits-1-high)`
  if `xbits` and `high+1` and `low` are
  multiples of 8;
  `concat(a,b) ^ concat(c,d)`
  ⇒ `concat(a^c,b^d)` if sizes match;
  `concat(a,b) & concat(c,d)`
  ⇒ `concat(a&c,b&d)` if sizes match;
  `(SignExt(n,v)>>s)[vbits-1:0]`
  ⇒ `v>>(s[vbits-1:0])`
  if `n+vbits<2**(vbits-1)`.

* [`angrpatch4.txt`](angrpatch4.txt):
  Portion of `angrpatch3.txt`
  beyond claripy commit `7826167ecb4f965acb55f7b0d3a4677588242c28`.