Version 20190516: browse djbsort-20190516.tar.gz
Benchmarking
Speed tests now call cpucycles() before setting resource limits. This is important on platforms where cpucycles() needs to read files.
Verification
Support for SignExt and several more peephole optimizations, working towards support for simpler symbolic-execution backend. Various updates to work with angr8 and python3.
Version 20180729: browse djbsort-20180729.tar.gz
Algorithm
Rewrite of the core int32/avx2
implementation
for (1) higher speed and (2) reduced memory consumption.
Stack allocation is now at most a few kilobytes,
even for gigantic arrays.
Internally, the sorting algorithm is now mostly bitonic to simplify indexing, although odd-even speedups are still applied when convenient. Lanes are complemented to take the down-up decision out of the inner loops.
As in previous djbsort versions, data is sorted first in vector lanes and then transposed for final merges, reducing the overall number of vector permutations. Unlike previous versions, transposition is done in-place. The transposition in this version is bit-reversal on the outer 6 bits (bottom 3 bits and the top 3 bits), but leaves intermediate bits alone. Non-power-of-2 array sizes are handled by an extra, more traditional, merge step.
Sizes 2, 3, 4, 5, 6, 7, 8, 16, 32 are now special-cased. Non-power-of-2 sizes below 256 are padded to the next power of 2.
Portable implementations:
The out-of-place
int32/portable1
and int32/portable2
implementations
are now gone;
the in-place
int32/portable3
and int32/portable4
implementations
remain.
C API
float32_sort
is now supported.
The arithmetic in the reduction from float32
to int32
is int32
31-bit right shift, uint32
1-bit right shift, xor;
this is slightly more efficient
than the reduction from float32
to uint32
from 2001 Herf.
Compiling
Tests now have more variation (without much slowdown):
the uint32
test cases
now deviate from int32
in more than the sign;
float32
uses floating-point numbers that aren't integers;
int32
does more loops for small cases,
and some larger cases.
Internals
API for 2-input sorting is now MINMAX
macro
operating on two inputs in place.
Better inline assembly from Jason Donenfeld for 2-input sorting: more flexibility in compiler's register allocation.
The package version number is now automatically copied to version.c
as the implementation version number
for implementations that don't provide version.c
.
Verification
minmax
now supports more peephole optimizations
for complemented bitonic sorting and for padding:
xor(s,xor(s,t))
⇒ t
;
xor(-1,s)
⇒ invert(s)
;
Reverse(Reverse(s))
⇒ s
;
signedmin(invert(s),invert(t))
⇒ invert(signedmax(s,t))
;
signedmax(invert(s),invert(t))
⇒ invert(signedmin(s,t))
;
invert(s)[high:low]
⇒ invert(s[high:low])
;
s[bits-1:0]
⇒ s
;
s[high:low][high2:low2]
⇒ s[high2+low:low2+low]
;
Concat(...)[high:low]
⇒ ...[high-pos:low-pos]
when possible;
Reverse(s)[high:low]
⇒ Reverse(s[...])
when possible;
eliminate signedmin
/signedmax
when one input is the minimum or maximum constant.
verifymany
now includes the implementation version number
on verified
lines.
Version 20180717: browse djbsort-20180717.tar.gz
C API
uint32_sort
is now supported, joining int32_sort
.
(Internally, uint32_sort
simply flips top bits and calls int32_sort
.
Inlining the int32_sort
code and integrating the flips into the initial and final passes
would be noticeably faster.
Adapting the int32
code to handle uint32
directly, without flips,
would be noticeably faster on platforms that have all relevant uint32
instructions.
However, the separate flip has the virtue of minimizing the code size for the library.)
The .h
files should work from C++ now. (Not tested yet.)
Compiling and benchmarking
./do
now finishes
by running int32-speed
.
int32-speed
now prints
cycle-count overhead;
cycle counts for some intermediate sizes;
and cycles per byte.
The default compiler list is now revamped, and shorter.
Internals
There is now a unified internal API for 2-input sorting.
This API has the following interchangeable implementations:
int32_minmax.c
(portable);
int32_minmax_x86.c
(using cmovg
in assembly);
presumably more later.
These implementations are now shared by the higher-level sorting code.
Verification
verifymany
now prints a "verified
" line for each successful verification.
Verification is now flexible enough
to handle the portable
implementations,
at least compiled on amd64
with typical compiler options.
Internally,
Z3 is no longer asked
to simplify symbolic expressions.
All necessary simplifications
are handled by peephole optimizations
(in djbsort's minmax
,
or in patches from djbsort to angr's claripy).
Added peephole optimizations in minmax
:
If(c,constbit1,constbit0)
⇒ c
;
xor(c,constbit1)
⇒ invert(c)
;
equal(c,bit0)
⇒ invert(c)
;
invert(invert(c))
⇒ c
.
More operations supported in input DSL
for minmax
and tryinput
:
xor
; or
; and
;
add
; sub
; mul
;
equal
;
signedlt
;
signedrshift
;
any number of inputs to concatenation.
Reduced redundancy in minmax
input grammar.
Some work on cleaning up DSL syntax.
Version 20180710: browse djbsort-20180710.tar.gz
Original release.
Version: This is version 2019.05.16 of the "Changes" web page.