-rw-r--r-- 6981 djbsort-20260127/doc/comparison.md raw
This page compares djbsort to some other numeric-array sorting libraries.
The libraries selected here for comparison are the same as djbsort's main competitors on the [speed page](speed.html):
"the [vxsort](https://github.com/damageboy/vxsort-cpp) library
[used in Microsoft's .NET](https://github.com/dotnet/runtime/pull/37159)
starting in 2020;
the [vqsort/highway](https://github.com/simd-sorting/fast-and-robust) library
[introduced by Google](https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html);
the [far](https://github.com/simd-sorting/fast-and-robust) library, which is the (sometimes faster) parent of vqsort;
and the [x86simdsort](https://github.com/intel/x86-simd-sort.git) library
[introduced by Intel](https://www.phoronix.com/news/Intel-AVX-512-Quicksort-Numpy)".
One can do better on some of these comparison axes when speed is not an issue:
for example, one can simultaneously (1) use less code and (2) sort a wider range of data types.
## Interface
Data types:
* djbsort: int32, uint32, float32, int64, uint64, float64
* far: int32
* vqsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64, uint128
* vxsort: int16, uint16, int32, uint32, float32, int64, uint64, float64
* x86simdsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64
Array lengths allowed by API on 64-bit CPUs:
* djbsort: up to 2^63^ (`void djbsort_int32(int32_t *,long long)`)
* far: up to 2^31^ (`void quicksort(int *arr, int n)`)
* vqsort: up to 2^64^ (`void VQSort(int32_t* HWY_RESTRICT keys, size_t n, SortAscending)`)
* vxsort: up to 2^64^ (`void sort(T* left, T* right, T left_hint = std::numeric_limits<T>::min(), T right_hint = std::numeric_limits<T>::max())`)
* x86simdsort: up to 2^64^ (`void qsort(T *arr, size_t arrsize, bool hasnan = false, bool descending = false)`)
API is CPU-independent, with implementation automatically selected based on run-time CPU:
* djbsort: yes (chooses between AVX2 and portable fallback)
* far: no (provides AVX2 code; caller handles fallback)
* vqsort: yes (chooses between AVX-512, AVX2, ARM NEON, more options, and portable fallback)
* vxsort: no (provides AVX-512 code and AVX2 code; caller handles selection and fallback)
* x86simdsort: yes (chooses between AVX-512, AVX2, portable fallback)
## Correctness and security
[Memory safety](security.html#memory) and protection against [other bugs](security.html#bugs):
* djbsort: [conventional tests](test.html); [tests under valgrind](test.html#dataflow); [verification](verif.html)
* far: conventional tests
* vqsort: conventional tests
* vxsort: conventional tests
* x86simdsort: conventional tests
Protection against [timing leaks](security.html#timing):
* djbsort: yes (constant cycle counts when array size is constant)
* far: no
* vqsort: no
* vxsort: no
* x86simdsort: no
Protection against [denial of service](security.html#dos):
* djbsort: clear (changing array contents does not increase cycle counts)
* far: unclear (chooses pivot from 72 pseudorandom elements but deterministically; recognizes constant subarrays; falls back to radix exchange)
* vqsort: unclear (chooses pivot from 576 bytes using per-thread call to OS RNG; falls back to heapsort)
* vxsort: unclear (chooses deterministic median-of-three pivot; falls back to heapsort)
* x86simdsort: unclear (similar to the far library; falls back to `std::sort`)
More research is required to quantify the cost of the far, vqsort, vxsort, and x86simdsort libraries
when array contents are chosen adversarially.
Note that a library that allows a larger relative slowdown
may still provide better protection against denial of service
if it is faster to begin with.
## Cycle counts for random data
CPU cycles/byte to sort `int32[16]` on AMD Zen 3
(the three numbers are stabilized quartiles of many measurements _excluding_ 31 cycles of cycle-counting overhead):
* djbsort: 0.33, 0.34, 0.35
* far: 1.14, 1.14, 1.14
* vqsort: 0.92, 0.92, 0.92
* vxsort: 1.85, 1.93, 2.09
* x86simdsort: 0.91, 1.00, 1.01
CPU cycles/byte to sort `int32[1024]` on AMD Zen 3
(the three numbers are stabilized quartiles of many measurements):
* djbsort: 1.14, 1.15, 1.16
* far: 1.85, 1.90, 1.95
* vqsort: 2.34, 2.46, 2.56
* vxsort: 2.61, 2.63, 2.64
* x86simdsort: 2.61, 2.78, 2.96
CPU cycles/byte to sort `int32[65536]` on AMD Zen 3
(the three numbers are stabilized quartiles of many measurements):
* djbsort: 3.27, 3.28, 3.31
* far: 3.28, 3.30, 3.31
* vqsort: 3.51, 3.55, 3.60
* vxsort: 3.67, 3.71, 3.76
* x86simdsort: 4.15, 4.18, 4.21
See the separate [speed page](speed.html)
for a more comprehensive comparison of cycle counts.
## Code size
Compiled AVX2 code size for a simple `int32.cc` that defines `void sortint32(int32_t *y,int n)` by calling the library
(text+data+bss for `int32.o`, plus text+data+bss for the relevant `*.o` files in the library):
* djbsort: 56+0+0 for `int32.o`
plus 25050+0+0 for `int32_avx2_C0-sort.o`
plus 1772+0+0 for portable fallback and dispatch (`dispatch_auto-sort_int32.o`, `int32_portable4_C3-sort.o`, `compilers*.o`)
* far: 24387+0+0 for `int32.o`
plus 0+0+0 (far is header-only)
* vqsort: 277+8+0 for `int32.o`
plus 266420+1224+24 for `vqsort_i32a.cc.o` plus some small support routines for dispatch etc.
* vxsort: 23483+0+0 for `int32.o`
plus 4096+0+0 for `avx2_masks.cpp.o`
* x86simdsort: 60+0+0 for `int32.o`
plus 3951712+8+0 for `x86simdsort-avx2.cpp.o` (all data types in the library are in a single `.o` file) plus some small support routines
An application with many calls to a big-header library such as far or vxsort
will incur many copies of the first number,
but can avoid this by instead calling an intermediate function such as `sortint32`,
so the overall code size includes only one copy of that number.
Compiled AVX2 code size for a simple `all32.cc` that defines `sortint32`, `sortuint32`, and `sortfloat32`
(text+data+bss for `all32.o`, plus text+data+bss for the relevant `*.o` files in the library):
* djbsort: 128+0+0 for `all32.o`
plus 25050+0+0 for `int32_avx2_C0-sort.o`
plus 445+0+0 for `uint32_avx2useint32_C0-sort.o`
plus 493+0+0 for `float32_avx2useint32_C0-sort.o`
plus 3960+0+0 for portable fallback and dispatch (`dispatch_auto-sort_int32.o`, `dispatch_auto-sort_uint32.o`, `dispatch_auto-sort_float32.o`, `int32_portable4_C3-sort.o`, `uint32_useint32_C3-sort.o`, `float32_useint32_C3-sort.o`, `compilers*.o`)
* far: not applicable
* vqsort: 677+8+0 for `all32.o` plus 266420+1224+24 for `vqsort_i32a.cc.o` plus 279990+1224+24 for `vqsort_u32a.cc.o` plus 244433+1224+24 for `vqsort_f32a.cc.o` plus some small support routines
* vxsort: 53557+0+0 for `all32.o` plus 4096+0+0 for `avx2_masks.cpp.o`
* x86simdsort: 132+0+0 for `all32.o` plus 3951712+8+0 for `x86simdsort-avx2.cpp.o` plus some small support routines
The `uint64` and `float64` functions in djbsort are similarly small wrappers around `int64`,
which is separate code from `int32`.