<div class=pagetitle>djbsort: Internals</div>

<p>djbsort includes infrastructure adapted from lib25519, libmceliece, etc. for selecting implementations and compilers. The djbsort distribution includes rules (which are automatically derived from benchmarks reported by <code>djbsort-speed</code>) for which implementations to prioritize depending on CPU features, and then at run time each CPU selects the top-priority implementation that it supports. For the current djbsort software, this boils down to the following:</p>

<ul>
<li> <p>The compiled library for Intel/AMD CPUs includes an AVX2 implementation with one compiler, plus a portable fallback. The AVX2 implementation is automatically selected at program startup on CPUs that support AVX2.</p> </li>
<li> <p>The compiled library for other CPUs includes simply a portable fallback.</p> </li>
</ul>

<p>To expand the tests to cover all combinations of implementations and compilers during development and optimization, configure with <code>--no-trim</code>.</p>
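<p>As an illustration of this run-time selection, here is a minimal sketch of checking a CPU feature once at program startup and then calling through a function pointer. This is not djbsort's actual dispatch code (which is generated along with the benchmark-derived priority rules); it assumes GCC or Clang on x86, and the implementation names are hypothetical:</p>

<pre>
/* Sketch only (assuming GCC or Clang on x86): select an implementation once
   at program startup via a CPU-feature check, then call it through a
   function pointer. The sort_int32* names are placeholders, not djbsort's
   API, and this is not djbsort's actual generated dispatch code. */
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

extern void sort_int32_avx2(int32_t *x, size_t n);     /* hypothetical AVX2 code */
extern void sort_int32_portable(int32_t *x, size_t n); /* hypothetical fallback */

static void (*sort_int32_impl)(int32_t *, size_t) = sort_int32_portable;

__attribute__((constructor))   /* run automatically at program startup */
static void sort_int32_select(void)
{
  __builtin_cpu_init();
  if (__builtin_cpu_supports("avx2"))
    sort_int32_impl = sort_int32_avx2;   /* top-priority implementation */
}

void sort_int32(int32_t *x, size_t n)
{
  sort_int32_impl(x, n);
}
</pre>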
<h2>Choices of sorting networks</h2>

<p>The AVX2 implementations of <code>int32</code> sorting and <code>int64</code> sorting are automatically generated by <code>autogen/sort</code>, a sorting-network compiler. The sorting networks used are mostly bitonic to support a hypercube data flow, allowing the hypercube to be sliced and rotated to reduce data movement in vectorized code. However, faster odd-even stages are used in some code segments for which the odd-even data flow does not cause complications.</p>

<p>Odd-even sorting and bitonic sorting are usually stated only for powers of 2. Knuth's reorganization of odd-even sorting as "merge exchange" easily handles arbitrary sizes. More importantly for the current version of djbsort, the formulation of bitonic sorting from <a href="https://hwlang.de/algorithmen/sortieren/bitonic/oddn.htm">1998 Lang</a> easily handles arbitrary sizes:</p>

<ul>
<li> <p>Merging <code>1*0*1*</code> into <code>0*1*</code>: If a bit array of size n (at least 2) matches regexp <code>1*0*1*</code> (a V shape), and p is the largest power of 2 strictly below n, then one stage of distance-p bitonic merging (n−p comparators) has the same effect as distance-p bitonic merging (p comparators) on a hypothetical power-of-2 array of length 2p having the same contents right-padded with <code>1*</code>. Merging can then continue recursively with regular power-of-2 bitonic merging on the left p elements, and not-necessarily-power-of-2 bitonic merging on the right n−p elements.</p> </li>
<li> <p>Recursive sorting (which happens first): As in the power-of-2 case, one can set up <code>1*0*1*</code> by recursively sorting the left and right sides of the array (so now left is <code>0*1*</code> and right is <code>0*1*</code>) and then reversing the left side (so now left is <code>1*0*</code> and right is <code>0*1*</code>). The split of sides here does not have to match the p,n−p split used in the merging. Lang suggests a split into floor(n/2),ceil(n/2).</p> </li>
</ul>

<p>For each n in turn, one can calculate the optimal left-right split for the recursive sorting. It turns out that p,n−p uses more comparators than necessary, especially when n is only slightly larger than p. Using floor(n/2),ceil(n/2) also turns out to be suboptimal (perhaps surprisingly, uneven splits produce fewer comparators than even splits when n is a power of 2). To simplify vectorization and control flow, djbsort limits the choices of sizes for the left side, specifically taking a left side of size either p or 3p/4.</p>
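<p>For reference, here is a scalar C sketch of this arbitrary-size bitonic scheme in Lang's formulation. It only illustrates the comparator structure: it is not djbsort's generated code, it uses Lang's floor(n/2),ceil(n/2) split rather than the restricted p or 3p/4 split described above, and it uses a branching compare-exchange where constant-time code would use min/max.</p>

<pre>
/* Scalar sketch of arbitrary-size bitonic sorting in Lang's formulation.
   This shows the comparator structure only: it is not djbsort's generated
   code, it uses Lang's floor(n/2),ceil(n/2) split rather than the p or 3p/4
   split described above, and it branches instead of using min/max. */
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* One comparator: put x[i],x[j] in ascending order if up, else descending. */
static void cmpswap(int32_t *x, size_t i, size_t j, int up)
{
  int wrongorder = up ? (x[i] &gt; x[j]) : (x[i] &lt; x[j]);
  if (wrongorder) { int32_t t = x[i]; x[i] = x[j]; x[j] = t; }
}

/* Largest power of 2 strictly below n; requires n &gt;= 2. */
static size_t half_power(size_t n)
{
  size_t p = 1;
  while (2 * p &lt; n) p *= 2;
  return p;
}

/* Merge a V-shaped (1*0*1*) segment x[lo],...,x[lo+n-1]: one stage of
   n-p comparators at distance p, then recurse on both parts. */
static void bitonic_merge(int32_t *x, size_t lo, size_t n, int up)
{
  if (n &gt; 1) {
    size_t p = half_power(n);
    for (size_t i = lo; i + p &lt; lo + n; ++i) cmpswap(x, i, i + p, up);
    bitonic_merge(x, lo, p, up);         /* power-of-2 left part */
    bitonic_merge(x, lo + p, n - p, up); /* arbitrary-size right part */
  }
}

/* Sort x[lo],...,x[lo+n-1]: sort the left side the other way, so the
   whole segment becomes V-shaped, then merge. */
static void bitonic_sort(int32_t *x, size_t lo, size_t n, int up)
{
  if (n &gt; 1) {
    size_t m = n / 2;                    /* Lang's floor(n/2),ceil(n/2) split */
    bitonic_sort(x, lo, m, !up);
    bitonic_sort(x, lo + m, n - m, up);
    bitonic_merge(x, lo, n, up);
  }
}
</pre>

<p>Calling <code>bitonic_sort(x,0,n,1)</code> sorts n elements upwards for any n; counting the <code>cmpswap</code> calls under different split choices is one way to compare split strategies.</p>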
<h2>Reverse sorting</h2>

<p>Instead of sorting the left side and then doing an extra pass through the data to reverse the result, one can call a sort-downwards function for the left side, at the expense of duplicating code; djbsort does this for some small functions. An alternative is to xor <code>-1</code> into the array before and after sorting, integrating this into the sorting functions to avoid extra passes; larger functions in djbsort follow this approach to reduce code size.</p>

<p>Similarly, djbsort handles <code>uint</code> and <code>float</code> sorting (and <code>down</code> sorting) with small wrappers on top of <code>int</code> sorting. These wrappers use only one pass before sorting and only one pass after sorting.</p>
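<p>As a sketch of this wrapper pattern (the function names below are placeholders, not djbsort's API), a <code>uint32</code> wrapper can flip the top bit of each entry before and after an upward <code>int32</code> sort, and a downward <code>int32</code> wrapper can xor <code>-1</code> before and after. A <code>float</code> wrapper similarly needs an order-preserving bit-level map between floating-point values and integers; it is omitted here.</p>

<pre>
/* Sketch only: uint32 sorting and downward int32 sorting as
   one-pass-before/one-pass-after wrappers around an upward int32 sort.
   The function names are placeholders, not djbsort's API. */
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

extern void sort_int32(int32_t *x, size_t n); /* assumed upward int32 sort */

/* Flipping the top bit is an order-preserving map from uint32 order to
   int32 order: 0 becomes INT32_MIN, UINT32_MAX becomes INT32_MAX. */
void sort_uint32(uint32_t *x, size_t n)
{
  for (size_t i = 0; i &lt; n; ++i) x[i] ^= 0x80000000u; /* pass before */
  sort_int32((int32_t *) x, n);
  for (size_t i = 0; i &lt; n; ++i) x[i] ^= 0x80000000u; /* pass after */
}

/* Xor with -1 reverses int32 order, so complementing before and after an
   upward sort produces a downward sort. */
void sort_int32_down(int32_t *x, size_t n)
{
  for (size_t i = 0; i &lt; n; ++i) x[i] ^= -1; /* pass before */
  sort_int32(x, n);
  for (size_t i = 0; i &lt; n; ++i) x[i] ^= -1; /* pass after */
}
</pre>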
<h2>Handling odd tails of arrays</h2>

<p>An <code>int32[64]</code> array is easily loaded into eight 256-bit AVX2 vectors, each holding an <code>int32[8]</code>. But what about an <code>int32[61]</code> array? There might be other program data immediately after the array, so one cannot simply load and store a 256-bit vector for positions 56,...,63 as if the array had 3 more elements.</p>

<p>One solution (for array lengths above 8) is to use overlapping loads and stores, applying a vector operation to positions 53,...,60 along with 0,...,7 and 8,...,15 and so on through 48,...,55. This was used in earlier versions of djbsort and works well with, e.g., a min-max computation of two vectors, but it constrains sorting-network layouts.</p>

<p>Another solution is to copy the <code>int32[61]</code> array into an <code>int32[64]</code> array padded with maximum <code>int32</code> values, sort, and copy back. This approach is used in djbsort for some small array sizes, but is unacceptable for in-place software handling large array sizes.</p>

<p>Yet another solution is to copy just the last 5 entries of the array into an <code>int32[8]</code> array padded with maximum <code>int32</code> values, or the last 13 entries of the array into an <code>int32[16]</code> array padded with maximum <code>int32</code> values, etc. This has less copying and, as a bonus, allows other sorting stages to assume that array lengths are multiples of 8, 16, etc.; the disadvantage is that the array tail has to be tracked separately through each sorting subroutine.</p>

<p>A conceptually simpler solution comes from the AVX2 <code>VPMASKMOV</code> instruction, which can load just <code>int32[5]</code> into an <code>int32[8]</code> vector (and store <code>int32[5]</code> later), given a mask that designates the first 5 out of 8 entries. It is easy to convert 61 into such a mask, and to blend the resulting vector with a vector of maximum <code>int32</code> values.</p>

<p>Some vectorized sorting software uses <code>VPMASKMOV</code>, but djbsort does not, for two reasons. First, AMD's official documentation says that a CPU "may signal a data breakpoint or a page fault for doublewords that are zero-masked and not actually written". (This AMD statement is surprising: Intel's AVX2 documentation guarantees that faults "will not occur due to referencing any memory location if the corresponding mask bit for that memory location is 0"; AMD claims to support AVX2; and tests on a variety of AMD CPUs that claim AVX2 support show each CPU seeming to do what Intel says. But is the AMD statement <em>so</em> surprising that it's clearly wrong?) Second, loads with <code>VPMASKMOV</code> are very slow on AMD's AVX2 CPUs before Zen 2, and stores with <code>VPMASKMOV</code> are very slow even on newer AMD AVX2 CPUs.</p>

<p>The sorting-network compiler in djbsort understands two options internally. One option generates <code>VPMASKMOV</code>; the other generates a slightly longer sequence of instructions that adequately simulates <code>VPMASKMOV</code> for the case of loading and storing an array tail. Currently djbsort uses the slightly longer sequence. This marginally slows down Intel CPUs, but speeds up AMD CPUs and avoids correctness questions on AMD CPUs. It would also be possible to automatically select the <code>VPMASKMOV</code> code for Intel CPUs and the non-<code>VPMASKMOV</code> code for AMD CPUs, at the expense of having both options in the compiled library.</p>
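<p>For concreteness, here is a sketch of the mask-and-blend idea written with the <code>VPMASKMOV</code> intrinsics, i.e., the approach that djbsort deliberately avoids; the slightly longer VPMASKMOV-free sequence that djbsort actually generates is not reproduced here. The helper names are hypothetical.</p>

<pre>
/* Sketch only: tail handling via the VPMASKMOV intrinsics plus a blend with
   INT32_MAX padding, illustrating the mask-and-blend idea above. djbsort
   itself generates a VPMASKMOV-free replacement sequence. Requires AVX2. */
#include &lt;immintrin.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Mask whose 32-bit lane i is all-ones exactly when i &lt; t (1 &lt;= t &lt;= 7). */
static __m256i tail_mask(size_t t)
{
  __m256i lane = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  return _mm256_cmpgt_epi32(_mm256_set1_epi32((int32_t) t), lane);
}

/* Load x[0],...,x[t-1] into a vector whose remaining lanes are INT32_MAX. */
static __m256i load_tail(const int32_t *x, size_t t)
{
  __m256i mask = tail_mask(t);
  __m256i v = _mm256_maskload_epi32((const int *) x, mask); /* masked lanes load as 0 */
  return _mm256_blendv_epi8(_mm256_set1_epi32(INT32_MAX), v, mask);
}

/* Store the first t lanes of v back to x[0],...,x[t-1]; lanes past the end
   of the array are not written. */
static void store_tail(int32_t *x, size_t t, __m256i v)
{
  _mm256_maskstore_epi32((int *) x, tail_mask(t), v);
}
</pre>

<p>For an <code>int32[61]</code> array, the last 5 entries would be handled as <code>load_tail(x + 56, 5)</code> and later <code>store_tail(x + 56, 5, v)</code>, with positions 0,...,55 handled by full vectors.</p>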