To download and unpack the latest version of djbsort:
wget -m https://sorting.cr.yp.to/djbsort-latest-version.txt
version=$(cat sorting.cr.yp.to/djbsort-latest-version.txt)
wget -m https://sorting.cr.yp.to/djbsort-$version.tar.gz
tar -xzf sorting.cr.yp.to/djbsort-$version.tar.gz
cd djbsort-$version
Archives and changelog (reverse chronological)
djbsort-20260210.tar.gz browse
Algorithm.
autogen/sort now handles 128-bit vectors,
in particular producing neon for arm64 and sse42 for amd64.
The permutation machinery now supports a few further abstractions.
Some additional bitonic stages are replaced by odd-even stages for fewer than 8 words/vector, slightly reducing operation counts.
Packaging. There are now more compile-time options for tradeoffs between the size of the compiled library and the number of instruction sets supported.
Benchmarking.
sortbench now covers sid1607 (for arrays of suitable size and alignment) and radixwrapper.
sortbench now sets test options to speed up vqsort compilation,
and now skips benchmarks on re-runs unless the skipbench files are removed.
sortbench now uses g++ rather than clang++ for libraries where that seems to help.
To run more quietly,
sortbench now saves x86simdsort compilation results in a file.
Also, sortbench now avoids a warning from current versions of clang.
Testing.
djbsort-test now works around a gcc bug
that was breaking the floating-point tests for Intel/AMD platforms in 32-bit mode:
if gcc has done FLD of a floating-point number
then it "optimizes" copy-integer-from-the-same-location into FST,
incorrectly assuming that FLD followed by FST is always a copy.
This assumption is wrong for signaling NaNs,
which are covered by the tests in djbsort-test
(and apparently not adequately covered in the gcc tests).
The sorting was always working correctly (even for signaling NaNs)
and should be immune to compiler bugs in handling floating-point numbers,
but the tests were checking against how the compiler handles floating-point numbers.
Verification.
rules has more entries to handle various idioms produced by the new neon and sse42 code.
minmax is now simpler and faster, and uses less memory.
The most important improvement is that minmax now applies rewrites to each instruction as soon as the instruction appears,
and then locks the rewritten instruction into place,
while recording intermediate results to fast-forward subsequent rewrites.
Some smaller improvements:
each minmax run applies a move-to-front heuristic on the rewrite rules to try;
calcbits is now removed in favor of extracting the same information from calcconcrete.
The two initial calls to outsim in verifymany are now parallelized.
This is not noticeable for large runs
but saves some seconds of latency for small runs.
verifymany now prints a summary of the implementations that it will test.
djbsort-20260127.tar.gz browse
Algorithm.
The vectorizable sorting algorithm has been redesigned from the ground up for higher speed and improved generality.
Both int32/avx2 and the new int64/avx2 are now automatically generated by autogen/sort.
C API.
There are new uint64 and float64 wrappers similar to the previous uint32 and float32 wrappers.
There are also new reverse-sorting wrappers
int32down, uint32down, float32down,
int64down, uint64down, float64down.
Each wrapper now has an avx2useint implementation to supplement the useint implementation.
All of these implementations are now automatically generated by autogen/useint.
All external symbols are now within the djbsort namespace
(meaning C symbols beginning djbsort_, not C++ symbols djbsort::).
Packaging. The infrastructure is now a variant of the packaging infrastructure from lib25519 and libmceliece. This provides cross-compilation, namespace tests, instruction-set tests, tests for branches in the binaries, automatic run-time selection of implementations, etc.
Internals.
The software now uses
cryptoint
to provide efficient comparators (minmax) on many CPUs
while at the same time protecting against current compilers converting arithmetic into branches.
Benchmarking.
Cycle counting now relies on
libcpucycles,
both for sortbench and for the installed djbsort-speed.
An updated version of the sortbench package is now included in the djbsort package
and supersedes the separately released sortbench package.
The updated version compares the aspas, djbsort, far, herf, stdsort, vqsort, vxsort, and x86simdsort libraries
for int32 and int64 at many different array sizes;
systematically varies the data permutations, the physical tagging (running each benchmark program 8 times), and the array alignment;
and plots stabilized quartiles.
Verification.
The verifymany driver has been rewritten,
and now analyzes all of the int32 and int64 implementations compiled into the library.
For minmax:
Speedups.
Less memory usage.
Computations for concrete inputs are now integrated,
both for constant propagation and for sanity checks.
More peephole optimizations for the code being analyzed,
moving further towards allowing a simpler symbolic-execution backend.
Almost all rewrite rules are now handled systematically with a DSL.
A separate checkrules script uses an SMT solver to verify rewrite rules expressed in the DSL.
For unroll:
Various tweaks,
such as disabling some angr peephole optimizations
and modifying angr's CPUID to report AVX2.
sortbench-20240116.tar.gz browse
Intermediate release of sortbench as a separate package.
djbsort-20190516.tar.gz browse
Benchmarking. Speed tests now call cpucycles() before setting resource limits. This is important on platforms where cpucycles() needs to read files.
Verification. Support for SignExt and several more peephole optimizations, working towards support for simpler symbolic-execution backend. Various updates to work with angr8 and python3.
djbsort-20180729.tar.gz browse
Algorithm.
Rewrite of the core int32/avx2 implementation
for (1) higher speed and (2) reduced memory consumption.
Stack allocation is now at most a few kilobytes,
even for gigantic arrays.
Internally, the sorting algorithm is now mostly bitonic to simplify indexing, although odd-even speedups are still applied when convenient. Lanes are complemented to take the down-up decision out of the inner loops.
As in previous djbsort versions, data is sorted first in vector lanes and then transposed for final merges, reducing the overall number of vector permutations. Unlike previous versions, transposition is done in-place. The transposition in this version is bit-reversal on the outer 6 bits (bottom 3 bits and the top 3 bits), but leaves intermediate bits alone. Non-power-of-2 array sizes are handled by an extra, more traditional, merge step.
Sizes 2, 3, 4, 5, 6, 7, 8, 16, 32 are now special-cased. Non-power-of-2 sizes below 256 are padded to the next power of 2.
Portable implementations:
The out-of-place
int32/portable1 and int32/portable2 implementations
are now gone;
the in-place
int32/portable3 and int32/portable4 implementations
remain.
C API.
float32_sort is now supported.
The arithmetic in the reduction from float32 to int32
is int32 31-bit right shift, uint32 1-bit right shift, xor;
this is slightly more efficient
than the reduction from float32 to uint32
from 2001 Herf.
Compiling.
Tests now have more variation (without much slowdown):
the uint32 test cases
now deviate from int32 in more than the sign;
float32 uses floating-point numbers that aren't integers;
int32 does more loops for small cases,
and some larger cases.
Internals.
API for 2-input sorting is now MINMAX macro
operating on two inputs in place.
Better inline assembly from Jason Donenfeld for 2-input sorting: more flexibility in compiler's register allocation.
The package version number is now automatically copied to version.c
as the implementation version number
for implementations that don't provide version.c.
Verification.
minmax now supports more peephole optimizations
for complemented bitonic sorting and for padding:
xor(s,xor(s,t)) ⇒ t;
xor(-1,s) ⇒ invert(s);
Reverse(Reverse(s)) ⇒ s;
signedmin(invert(s),invert(t)) ⇒ invert(signedmax(s,t));
signedmax(invert(s),invert(t)) ⇒ invert(signedmin(s,t));
invert(s)[high:low] ⇒ invert(s[high:low]);
s[bits-1:0] ⇒ s;
s[high:low][high2:low2] ⇒ s[high2+low:low2+low];
Concat(...)[high:low] ⇒ ...[high-pos:low-pos] when possible;
Reverse(s)[high:low] ⇒ Reverse(s[...]) when possible;
eliminate signedmin/signedmax
when one input is the minimum or maximum constant.
verifymany now includes the implementation version number
on verified lines.
djbsort-20180717.tar.gz browse
C API.
uint32_sort is now supported, joining int32_sort.
(Internally, uint32_sort simply flips top bits and calls int32_sort.
Inlining the int32_sort code and integrating the flips into the initial and final passes
would be noticeably faster.
Adapting the int32 code to handle uint32 directly, without flips,
would be noticeably faster on platforms that have all relevant uint32 instructions.
However, the separate flip has the virtue of minimizing the code size for the library.)
The .h files should work from C++ now. (Not tested yet.)
Compiling and benchmarking.
./do now finishes
by running int32-speed.
int32-speed now prints
cycle-count overhead;
cycle counts for some intermediate sizes;
and cycles per byte.
The default compiler list is now revamped, and shorter.
Internals.
There is now a unified internal API for 2-input sorting.
This API has the following interchangeable implementations:
int32_minmax.c (portable);
int32_minmax_x86.c (using cmovg in assembly);
presumably more later.
These implementations are now shared by the higher-level sorting code.
Verification.
verifymany now prints a "verified" line for each successful verification.
Verification is now flexible enough
to handle the portable implementations,
at least compiled on amd64
with typical compiler options.
Internally,
Z3 is no longer asked
to simplify symbolic expressions.
All necessary simplifications
are handled by peephole optimizations
(in djbsort's minmax,
or in patches from djbsort to angr's claripy).
Added peephole optimizations in minmax:
If(c,constbit1,constbit0) ⇒ c;
xor(c,constbit1) ⇒ invert(c);
equal(c,bit0) ⇒ invert(c);
invert(invert(c)) ⇒ c.
More operations supported in input DSL
for minmax and tryinput:
xor; or; and;
add; sub; mul;
equal;
signedlt;
signedrshift;
any number of inputs to concatenation.
Reduced redundancy in minmax input grammar.
Some work on cleaning up DSL syntax.
djbsort-20180710.tar.gz browse
Original release.
Archive of angr patches
To simplify the unroll and minmax tools,
djbsort contributed some patches to angr back in 2018:
-
angrpatch1.txt: Support the AVX2vpunpckinstructions. Otherwise angr was crashing on djbsort. -
angrpatch2.txt: Superseded byangrpatch3.txt. -
angrpatch3.txt: Various peephole optimizations. Specifically: C idiom for signedmin; C idiom for signedmax;reverse(concat(reverse(x),reverse(y)))⇒concat(y,x)if lengths are multiples of 8;reverse(concat(reverse(x),0))⇒concat(0,x)if lengths are multiples of 8; same for any combinations of reversals and zeros;reverse(extract(reverse(x),high,low))⇒extract(x,xbits-1-low,xbits-1-high)ifxbitsandhigh+1andloware multiples of 8;concat(a,b) ^ concat(c,d)⇒concat(a^c,b^d)if sizes match;concat(a,b) & concat(c,d)⇒concat(a&c,b&d)if sizes match;(SignExt(n,v)>>s)[vbits-1:0]⇒v>>(s[vbits-1:0])ifn+vbits<2**(vbits-1). -
angrpatch4.txt: Portion ofangrpatch3.txtbeyond claripy commit7826167ecb4f965acb55f7b0d3a4677588242c28.
Version: This is version 2026.02.10 of the "Download" web page.