<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<style type="text/css">
html{overflow-y:scroll;background-color:#012345}
body{font-family:"Noto Sans","Droid Sans","DejaVu Sans","Arial",sans-serif;line-height:1.5}
tt,code{background-color:#f0f0f0;font-family:"Noto Sans Mono","Droid Sans Mono","DejaVu Sans Mono","Courier New",monospace,sans-serif;font-size:1em;}
pre{margin-left:3em}
p,ul,ol,blockquote,pre{font-size:1.0em;line-height:1.6}
li p{font-size:1.0em}
blockquote p{font-size:1.0em}
h1{font-size:1.5em}
h2{font-size:1.3em}
h3{font-size:1.0em}
h1 a{text-decoration:none}
table{border-collapse:collapse}
th,td{border:1px solid black}
table a{text-decoration:none}
table tr{font-size:1.0em;line-height:1.5}
tbody tr:nth-child(2n+1){background-color:#f0ffff}
tbody tr:nth-child(2n+2){background-color:#fffff0}
#headline{display:block;margin:0;padding:0;color:#ffffff;background-color:#012345}
#headline .text{font-weight:bold;font-size:1.0em}
#headline input{display:none}
#nav ul{margin:0;padding:0}
#nav li{list-style-type:none;margin:0;padding:0}
.navtop{padding-bottom:0.5em;font-weight:bold;font-size:1.0em}
.navtop{background-color:#012345;color:#ffffff}
#nav .here{background-color:#012345;color:#ffffff}
#nav .away{background-color:#012345;color:#ffffff}
#nav .away a{text-decoration:none;display:block;color:#ffffff}
#nav .away a:hover,.away a:active{text-decoration:underline}
#hidemenu{visibility:hidden;display:none;overflow:hidden;position:fixed;top:0;left:0;height:100%;width:100%}
.main{padding:5px}
.main{background-color:#ffffff}
.pagetitle{font-size:1.4em;font-weight:bold}
@media only screen and (min-width:512px) {
.navtop{padding-top:5px}
#headline{top:0;margin:0;width:160px;height:100%;position:fixed;overflow:auto}
#headline .noselect{display:none}
#headline #nav{visibility:visible;display:block;width:auto;height:auto}
.main{margin-left:170px}
#headline #hidemenu{visibility:hidden}
}
@media not screen and (min-width:512px) {
#headline .noselect{-webkit-user-select:none;-ms-user-select:none;user-select:none;}
#headline #nav #navbot{visibility:hidden;position:fixed;top:0;left:-70%;z-index:2;transition:0.2s;margin:0;padding:0}
#headline input:checked ~ #nav #navbot{height:100%;position:fixed;top:0;left:0;visibility:visible;display:block;box-sizing:border-box;-moz-box-sizing:border-box;-webkit-box-sizing:border-box;vertical-align:center;font-size:1.0em;width:70%;overflow:auto}
#headline input:checked ~ #hidemenu{visibility:visible;display:block;background:black;opacity:0.3;z-index:1}
}
</style>
<title>
djbsort: Speed</title>
</head>
<body>
<label id=headline>
<input type=checkbox />
<nav id=nav>
<div class=navtop>
<span class=noselect>≡</span>
djbsort</div>
<ul id=navbot>
<li class=away><a href=index.html>Intro</a>
</li><li class=away><a href=download.html>Download</a>
</li><li class=away><a href=install.html>Install</a>
</li><li class=away><a href=test.html>Test</a>
</li><li class=away><a href=api.html>API</a>
</li><li class=here>Speed
</li><li class=away><a href=security.html>Security</a>
</li><li class=away><a href=verif.html>Verification</a>
</li><li class=away><a href=internals.html>Internals</a>
</li><li class=away><a href=license.html>License</a>
</li><li class=away><a href=refs.html>References</a>
</li><li class=away><a href=comparison.html>Comparison</a>
</li></ul></nav>
<div id=hidemenu></div>
</label>
<div class=main>
<div class=pagetitle>djbsort: Speed</div>
<p><img src="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot32.png"></a></p>
<p>(For higher resolution:
<a href="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot32.svg">SVG</a>,
<a href="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot32.pdf">PDF</a>.)</p>
<p>This graph compares sorting performance for various <code>int32</code> array sizes
on one core of an AMD Ryzen 5 PRO 5650G
(2021 CPU launch; Zen 3 microarchitecture)
with overclocking disabled, running Debian 12, gcc 12.2.0, and clang 14.0.6,
using the following libraries:
djbsort;
the <a href="https://github.com/damageboy/vxsort-cpp">vxsort</a> library
<a href="https://github.com/dotnet/runtime/pull/37159">used in Microsoft's .NET</a>
starting in 2020;
the <a href="https://github.com/simd-sorting/fast-and-robust">vqsort/highway</a> library
<a href="https://opensource.googleblog.com/2022/06/Vectorized%20and%20performance%20portable%20Quicksort.html">introduced by Google</a>;
the <a href="https://github.com/simd-sorting/fast-and-robust">far</a> library, which is the (sometimes faster) parent of vqsort;
and the <a href="https://github.com/intel/x86-simd-sort.git">x86simdsort</a> library
<a href="https://www.phoronix.com/news/Intel-AVX-512-Quicksort-Numpy">introduced by Intel</a>.
See also the separate
<a href="comparison.html">comparison page</a>
for a comparison of library features beyond speed.</p>
<p>The far, vqsort, vxsort, and x86simdsort libraries
use vectorized sorting networks for small lengths
and vectorized quicksort for large lengths.
Increasing the length cutoff
and calling djbsort for the base case
would improve the performance of those libraries
not just for small lengths but also for large lengths.
The point is that larger-length sorting uses smaller-length sorting as a subroutine.
(Beware that combining code in this way would not achieve the
<a href="security.html">security</a> and
<a href="verif.html">verification</a> of djbsort.)</p>
<p>There are also many slower sorting libraries.
The graph includes three of those for comparison:
"stdsort", which is <code>std::sort</code> from the installed C++ standard library;
"herf", which is radix-sorting code (with radix 2048) posted by Michael Herf in
<a href="http://stereopsis.com/radix.html">2001</a>
plus a simplification to sort <code>int32</code> rather than <code>float32</code>;
and
<a href="https://github.com/vtsynergy/aspas_sort">"aspas"</a>,
from a 2018 paper
"A framework for the automatic vectorization of parallel sort on x86-based processors".</p>
<p>The herf and aspas results are shown only for array sizes at least 64 and 16 respectively,
since it is clear that <code>std::sort</code> is preferable for smaller arrays.
Note that Lukas Bergdoll reported in 2023
that "~50% of the time spent in <code>slice::sort</code>" in "clean compiling 60 crates"
was spent sorting arrays of
<a href="https://github.com/Voultapher/sort-research-rs/blob/main/writeup/intel_avx512/text.md">length at most 20</a>.
It would be interesting to see more reports
on how the sorting time spent by applications
is distributed across array sizes and data types.</p>
<h2>How were the numbers collected?</h2>
<p>The graph relies on data from
a small C++ benchmarking program,
<a href="#sortbench">sortbench</a>,
used for all of the libraries.
The program is run 8 times for each library
(which should catch any frequent VIPT effects).
Each run tries many array sizes,
including powers of 2 but also many more sizes
(which should catch slowdowns for inconvenient sizes).
For each array size, each run</p>
<ul>
<li>generates an array with random contents;</li>
<li>fills 64 arrays with random permutations of the first array;</li>
<li>sorts each of those 64 arrays in turn with the tested library, checking the cycle counter after each sort; and</li>
<li>checks that all of the supposedly sorted arrays match the output of <code>std::sort</code>.</li>
</ul>
<p>The 64 arrays are spaced by an odd number of integers, so most of them are not aligned to cache lines.
(Applications might specifically align arrays to try to save time,
so it would be useful to separately benchmark the aligned case.)</p>
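<p>For concreteness, here is a simplified sketch of the per-size loop just described, written for this page rather than extracted from sortbench. The <code>sort_under_test</code> function is a hypothetical stand-in for whichever library is being benchmarked, <code>cpucycles</code> comes from libcpucycles, and the particular odd stride is just one way to implement the spacing mentioned above.</p>
<pre><code>// Sketch: one measurement round for a single array size n.
#include &lt;stdint.h&gt;
#include &lt;algorithm&gt;
#include &lt;cassert&gt;
#include &lt;random&gt;
#include &lt;vector&gt;

extern "C" long long cpucycles(void);          // libcpucycles; link with -lcpucycles
void sort_under_test(int32_t *x, long long n); // hypothetical: the library under test

std::vector&lt;long long&gt; measure(long long n, std::mt19937_64 &amp;rng) {
  const int REPS = 64;
  long long stride = n + (n % 2 ? 2 : 1);      // odd spacing, so most arrays are unaligned

  std::vector&lt;int32_t&gt; reference(n);           // one array with random contents
  for (auto &amp;v : reference) v = (int32_t) rng();

  std::vector&lt;int32_t&gt; buf(stride * REPS);     // 64 random permutations of it
  for (int r = 0; r &lt; REPS; ++r) {
    int32_t *a = &amp;buf[r * stride];
    std::copy(reference.begin(), reference.end(), a);
    std::shuffle(a, a + n, rng);
  }

  std::vector&lt;int32_t&gt; expected = reference;   // the answer to check against
  std::sort(expected.begin(), expected.end());

  std::vector&lt;long long&gt; cycles(REPS);
  long long before = cpucycles();
  for (int r = 0; r &lt; REPS; ++r) {             // sort each array in turn, reading
    sort_under_test(&amp;buf[r * stride], n);      // the cycle counter after each sort
    long long after = cpucycles();
    cycles[r] = after - before;
    before = after;
  }

  for (int r = 0; r &lt; REPS; ++r)               // verify against std::sort
    assert(std::equal(expected.begin(), expected.end(), &amp;buf[r * stride]));
  return cycles;
}
</code></pre>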
<p>For each library at each size,
the graph shows
<a href="https://cr.yp.to/papers.html#rsrst">stabilized quartiles</a>
of the resulting 512 cycle-count measurements.
The three quartiles are marked as horizontal line, small X, horizontal line.
Measurements can vary (producing noticeable vertical distances between the horizontal lines)
because of branch-prediction effects and other pipeline effects,
or because of some sorting algorithms handling different permutations at different speeds.
The benchmarking program does not try to find "bad" inputs.</p>
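<p>The stabilized quartiles are defined in the paper linked above. As a plain illustration only (ordinary quartiles, not the stabilized variant), three order statistics of the measurements can be extracted as follows:</p>
<pre><code>// Ordinary (non-stabilized) quartiles of a nonempty list of cycle counts.
#include &lt;algorithm&gt;
#include &lt;cstddef&gt;
#include &lt;vector&gt;

struct Quartiles { long long q1, q2, q3; };

Quartiles quartiles(std::vector&lt;long long&gt; v) { // by value: reordered freely
  auto pick = [&amp;v](size_t k) {
    std::nth_element(v.begin(), v.begin() + k, v.end());
    return v[k];
  };
  size_t n = v.size(); // e.g. 512 measurements per library per size
  return { pick(n / 4), pick(n / 2), pick((3 * n) / 4) };
}
</code></pre>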
<p>The program collects cycle counts using
<a href="https://cpucycles.cr.yp.to">libcpucycles</a>,
which on the machine mentioned above ends up using RDPMC to read an on-core cycle counter.
The plotting script subtracts the observed cycle-counting overhead
from every cycle count in the graph before drawing the graph.</p>
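<p>The exact overhead measurement is not reproduced here, but one simple way to estimate it, sketched below, is to time back-to-back counter reads and take the minimum observed difference:</p>
<pre><code>// Sketch: estimate the cost of one cpucycles() call.
extern "C" long long cpucycles(void); // libcpucycles; link with -lcpucycles

long long counting_overhead(void) {
  long long best = -1;
  for (int i = 0; i &lt; 1000; ++i) {
    long long t0 = cpucycles();
    long long t1 = cpucycles(); // back-to-back read
    if (best &lt; 0 || t1 - t0 &lt; best) best = t1 - t0;
  }
  return best; // minimum difference approximates the counting overhead
}
</code></pre>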
<p>Since CPUs typically support overclocking and typically have overclocking enabled by default
(at the expense of
<a href="https://blog.cr.yp.to/20230609-turboboost.html">security problems</a>
and decreased hardware longevity),
it would also be useful to measure power consumption at various clock frequencies,
and to measure time consumption when overclocking is enabled
with specified numbers of cores active simultaneously.
Enabling overclocking can produce different CPU frequencies
for different libraries, since libraries differ in power consumption,
and this changes the relative timings of those libraries.</p>
<h2>What about <code>int64</code> arrays?</h2>
<p><img src="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot64.png"></a></p>
<p>(For higher resolution:
<a href="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot64.svg">SVG</a>,
<a href="https://sorting.cr.yp.to/sortbench/20260127-cezanne/plot64.pdf">PDF</a>.)</p>
<p>There are two reasons
that the speedup of vectorized libraries over <code>std::sort</code> for <code>int64</code> arrays
is smaller than the speedup over <code>std::sort</code> for <code>int32</code> arrays.
First, a 256-bit vector instruction carries out 8 <code>int32</code> operations
but only 4 <code>int64</code> operations.
Second,
the AVX2 instruction set has <code>min</code> and <code>max</code> instructions for <code>int32x8</code> but not for <code>int64x4</code>.</p>
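<p>To illustrate the second point, here is a small sketch (not code from any of the benchmarked libraries) of 256-bit minimum computations with AVX2 intrinsics: eight <code>int32</code> minima take one instruction, while four <code>int64</code> minima have to be built from a comparison and a blend.</p>
<pre><code>// Compile with -mavx2. int32x8 minima map to a single vpminsd instruction;
// int64x4 minima are emulated with vpcmpgtq followed by vpblendvb.
#include &lt;immintrin.h&gt;

static inline __m256i min_int32x8(__m256i a, __m256i b) {
  return _mm256_min_epi32(a, b);             // eight int32 minima, one instruction
}

static inline __m256i min_int64x4(__m256i a, __m256i b) {
  __m256i a_gt_b = _mm256_cmpgt_epi64(a, b); // lanes where a &gt; b
  return _mm256_blendv_epi8(a, b, a_gt_b);   // take b in those lanes, a elsewhere
}
</code></pre>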
<h2>What about other CPUs?</h2>
<p>Here's one core of an Intel Xeon E3-1220 v5
(2015 CPU launch; Skylake microarchitecture)
with Ubuntu 24.04, gcc 13.3.0, clang 18.1.3.</p>
<p><img src="https://sorting.cr.yp.to/sortbench/20260127-samba/plot32.png"></a></p>
<p>(For higher resolution:
<a href="https://sorting.cr.yp.to/sortbench/20260127-samba/plot32.svg">SVG</a>,
<a href="https://sorting.cr.yp.to/sortbench/20260127-samba/plot32.pdf">PDF</a>.)</p>
<p><img src="https://sorting.cr.yp.to/sortbench/20260127-samba/plot64.png"></a></p>
<p>(For higher resolution:
<a href="https://sorting.cr.yp.to/sortbench/20260127-samba/plot64.svg">SVG</a>,
<a href="https://sorting.cr.yp.to/sortbench/20260127-samba/plot64.pdf">PDF</a>.)</p>
<p>It would be useful to benchmark more CPUs.
Some libraries obtain a speed boost from AVX-512
(although benchmarking vxsort with AVX-512 would require changing some lines in the sortbench program).
Also, vqsort supports ARM NEON etc.</p>
<h2 id="sortbench">Running sortbench</h2>
<p>Currently the sortbench tool supports only CPUs with AVX2,
and has been tested only under Debian and Ubuntu.
The tool needs the following system packages:</p>
<ul>
<li>gcc and other compiler tools: <code>apt install build-essential</code></li>
<li>clang: <code>apt install clang</code></li>
<li>for Ubuntu 22.04, missing C++ libraries: <code>apt install g++-12</code></li>
<li>cmake: <code>apt install cmake</code></li>
<li>meson: <code>apt install meson</code></li>
<li>Python 3: <code>apt install python3</code></li>
<li>matplotlib: <code>apt install python3-matplotlib</code></li>
<li>libcpucycles: <code>apt install libcpucycles-dev</code> (or install directly from <a href="https://cpucycles.cr.yp.to">the source</a>)</li>
</ul>
<p>(Combined for Ubuntu 22.04:
<code>apt install build-essential clang g++-12 cmake meson python3 python3-matplotlib</code> and install libcpucycles from source.
Combined for Debian 12:
<code>apt install build-essential clang cmake meson python3 python3-matplotlib</code> and install libcpucycles from source.
Combined for Ubuntu 24.04 or Debian 13:
<code>apt install build-essential clang cmake meson python3 python3-matplotlib libcpucycles-dev</code>.)</p>
<p>The sortbench tool is included in the djbsort source distribution.
It is not
<a href="install.html">installed</a> as part of a djbsort binary package
(and is not incorporated into the <code>djbsort-speed</code> utility).
So the next step is to
<a href="download.html">download and unpack</a> djbsort
as an unprivileged user.</p>
<p>You still need a network connection after this for running the sortbench tool:
the tool automatically downloads copies of the aspas, far, vqsort, vxsort, and x86simdsort libraries.
The tool assumes you have already installed djbsort (as a home-directory installation or a system installation)
and stdsort (as part of the standard C++ library).
The tool includes a copy of the (very small) herf library.</p>
<p>Last step:</p>
<pre><code>cd sortbench
./do
</code></pre>
<p>Results should end up in
<code>plot32.pdf</code>, <code>plot32.svg</code>, <code>plot32.png</code>,
<code>plot64.pdf</code>, <code>plot64.svg</code>, and <code>plot64.png</code>
in the <code>sortbench</code> directory.</p>
<p>The tool runs benchmarks on a single core
(while allowing whatever number of cores are used for compilation of the libraries).
The benchmarks typically take 10 to 20 minutes, depending on the core speed.
More time is taken by compilation, especially for vqsort,
but it would not be surprising for <code>./do</code> to finish in under an hour total.</p>
<p>If you re-run <code>./do</code> then the libraries will not be recompiled
(unless you do <code>rm */skiprebuild</code> first)
but the benchmarks will be re-run.
This is useful if you have enabled
<a href="https://cpucycles.cr.yp.to/counters.html">more reliable cycle counters</a>
in the meantime.</p>
<p>One known failure mode in <code>./do</code> is vqsort compilation running out of RAM.
If this happens, try <code>taskset -c 0 ./do</code> to limit everything to core 0.
If that still isn't enough, try adding some swap space.
(As root, to add 16GB of swap space until reboot:
<code>mkdir /root/swap; chmod 700 /root/swap; dd if=/dev/zero of=/root/swap/tmp1 bs=16777216 count=1024; chmod 600 /root/swap/tmp1; mkswap /root/swap/tmp1; swapon /root/swap/tmp1</code>.)</p>
<p>For CPUs with heterogeneous cores (such as P-cores and E-cores in many recent Intel CPUs),
you should benchmark each type of core separately.
For example,</p>
<pre><code>taskset -c 0-3 ./do
mkdir threads0-3
mv plot*.* threads0-3
taskset -c 4-7 ./do
mkdir threads4-7
mv plot*.* threads4-7
</code></pre>
<p>will limit benchmarks to OS threads 0–3 first, then to OS threads 4–7.
There is no standard mechanism to figure out the partition of threads into core types, but
<code>grep . /sys/devices/system/cpu/cpu*/cache/index*/size</code>
usually makes the patterns clear.</p><hr><font size=1><b>Version:</b>
This is version 2026.01.27 of the "Speed" web page.
</font>
</div>
</body>
</html>