-rw-r--r-- 6981 djbsort-20260127/doc/comparison.md raw
This page compares djbsort to some other numeric-array sorting libraries. The libraries selected here for comparison are the same as djbsort's main competitors on the [speed page](speed.html): "the [vxsort](https://github.com/damageboy/vxsort-cpp) library [used in Microsoft's .NET](https://github.com/dotnet/runtime/pull/37159) starting in 2020; the [vqsort/highway](https://github.com/simd-sorting/fast-and-robust) library [introduced by Google](https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html); the [far](https://github.com/simd-sorting/fast-and-robust) library, which is the (sometimes faster) parent of vqsort; and the [x86simdsort](https://github.com/intel/x86-simd-sort.git) library [introduced by Intel](https://www.phoronix.com/news/Intel-AVX-512-Quicksort-Numpy)". One can do better on some of these comparison axes when speed is not an issue: for example, one can simultaneously (1) use less code and (2) sort a wider range of data types. ## Interface Data types: * djbsort: int32, uint32, float32, int64, uint64, float64 * far: int32 * vqsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64, uint128 * vxsort: int16, uint16, int32, uint32, float32, int64, uint64, float64 * x86simdsort: int16, uint16, float16, int32, uint32, float32, int64, uint64, float64 Array lengths allowed by API on 64-bit CPUs: * djbsort: up to 2^63^ (`void djbsort_int32(int32_t *,long long)`) * far: up to 2^31^ (`void quicksort(int *arr, int n)`) * vqsort: up to 2^64^ (`void VQSort(int32_t* HWY_RESTRICT keys, size_t n, SortAscending)`) * vxsort: up to 2^64^ (`void sort(T* left, T* right, T left_hint = std::numeric_limits<T>::min(), T right_hint = std::numeric_limits<T>::max())`) * x86simdsort: up to 2^64^ (`void qsort(T *arr, size_t arrsize, bool hasnan = false, bool descending = false)`) API is CPU-independent, with implementation automatically selected based on run-time CPU: * djbsort: yes (chooses between AVX2 and portable fallback) * far: no (provides AVX2 code; caller handles fallback) * vqsort: yes (chooses between AVX-512, AVX2, ARM NEON, more options, and portable fallback) * vxsort: no (provides AVX-512 code and AVX2 code; caller handles selection and fallback) * x86simdsort: yes (chooses between AVX-512, AVX2, portable fallback) ## Correctness and security [Memory safety](security.html#memory) and protection against [other bugs](security.html#bugs): * djbsort: [conventional tests](test.html); [tests under valgrind](test.html#dataflow); [verification](verif.html) * far: conventional tests * vqsort: conventional tests * vxsort: conventional tests * x86simdsort: conventional tests Protection against [timing leaks](security.html#timing): * djbsort: yes (constant cycle counts when array size is constant) * far: no * vqsort: no * vxsort: no * x86simdsort: no Protection against [denial of service](security.html#dos): * djbsort: clear (changing array contents does not increase cycle counts) * far: unclear (chooses pivot from 72 pseudorandom elements but deterministically; recognizes constant subarrays; falls back to radix exchange) * vqsort: unclear (chooses pivot from 576 bytes using per-thread call to OS RNG; falls back to heapsort) * vxsort: unclear (chooses deterministic median-of-three pivot; falls back to heapsort) * x86simdsort: unclear (similar to the far library; falls back to `std::sort`) More research is required to quantify the cost of the far, vqsort, vxsort, and x86simdsort libraries when array contents are chosen adversarially. Note that a library that allows a larger relative slowdown may still provide better protection against denial of service if it is faster to begin with. ## Cycle counts for random data CPU cycles/byte to sort `int32[16]` on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements _excluding_ 31 cycles of cycle-counting overhead): * djbsort: 0.33, 0.34, 0.35 * far: 1.14, 1.14, 1.14 * vqsort: 0.92, 0.92, 0.92 * vxsort: 1.85, 1.93, 2.09 * x86simdsort: 0.91, 1.00, 1.01 CPU cycles/byte to sort `int32[1024]` on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements): * djbsort: 1.14, 1.15, 1.16 * far: 1.85, 1.90, 1.95 * vqsort: 2.34, 2.46, 2.56 * vxsort: 2.61, 2.63, 2.64 * x86simdsort: 2.61, 2.78, 2.96 CPU cycles/byte to sort `int32[65536]` on AMD Zen 3 (the three numbers are stabilized quartiles of many measurements): * djbsort: 3.27, 3.28, 3.31 * far: 3.28, 3.30, 3.31 * vqsort: 3.51, 3.55, 3.60 * vxsort: 3.67, 3.71, 3.76 * x86simdsort: 4.15, 4.18, 4.21 See the separate [speed page](speed.html) for a more comprehensive comparison of cycle counts. ## Code size Compiled AVX2 code size for a simple `int32.cc` that defines `void sortint32(int32_t *y,int n)` by calling the library (text+data+bss for `int32.o`, plus text+data+bss for the relevant `*.o` files in the library): * djbsort: 56+0+0 for `int32.o` plus 25050+0+0 for `int32_avx2_C0-sort.o` plus 1772+0+0 for portable fallback and dispatch (`dispatch_auto-sort_int32.o`, `int32_portable4_C3-sort.o`, `compilers*.o`) * far: 24387+0+0 for `int32.o` plus 0+0+0 (far is header-only) * vqsort: 277+8+0 for `int32.o` plus 266420+1224+24 for `vqsort_i32a.cc.o` plus some small support routines for dispatch etc. * vxsort: 23483+0+0 for `int32.o` plus 4096+0+0 for `avx2_masks.cpp.o` * x86simdsort: 60+0+0 for `int32.o` plus 3951712+8+0 for `x86simdsort-avx2.cpp.o` (all data types in the library are in a single `.o` file) plus some small support routines An application with many calls to a big-header library such as far or vxsort will incur many copies of the first number, but can avoid this by instead calling an intermediate function such as `sortint32`, so the overall code size includes only one copy of that number. Compiled AVX2 code size for a simple `all32.cc` that defines `sortint32`, `sortuint32`, and `sortfloat32` (text+data+bss for `all32.o`, plus text+data+bss for the relevant `*.o` files in the library): * djbsort: 128+0+0 for `all32.o` plus 25050+0+0 for `int32_avx2_C0-sort.o` plus 445+0+0 for `uint32_avx2useint32_C0-sort.o` plus 493+0+0 for `float32_avx2useint32_C0-sort.o` plus 3960+0+0 for portable fallback and dispatch (`dispatch_auto-sort_int32.o`, `dispatch_auto-sort_uint32.o`, `dispatch_auto-sort_float32.o`, `int32_portable4_C3-sort.o`, `uint32_useint32_C3-sort.o`, `float32_useint32_C3-sort.o`, `compilers*.o`) * far: not applicable * vqsort: 677+8+0 for `all32.o` plus 266420+1224+24 for `vqsort_i32a.cc.o` plus 279990+1224+24 for `vqsort_u32a.cc.o` plus 244433+1224+24 for `vqsort_f32a.cc.o` plus some small support routines * vxsort: 53557+0+0 for `all32.o` plus 4096+0+0 for `avx2_masks.cpp.o` * x86simdsort: 132+0+0 for `all32.o` plus 3951712+8+0 for `x86simdsort-avx2.cpp.o` plus some small support routines The `uint64` and `float64` functions in djbsort are similarly small wrappers around `int64`, which is separate code from `int32`.