# A Primer on Floating-Point Arithmetic and IEEE 754
This repository addresses usnistgov/discuss#8 by exploring
floating-point rounding errors through the lens of the associative
property of addition. The programs herein demonstrate that deviations
from mathematical precision are due to the standard representation of
floating-point numbers in a binary data type, in compliance with
IEEE 754. The evidence presented here does not support any
role of the CPU, instruction sets, or dynamic execution in the
deviations.
*tl;dr:* If strict adherence to mathematical law is required, use a high-precision math library such as GNU MPFR instead of the built-in data types. If performance is needed, and some precision can be sacrificed in the interests of accuracy, build your C programs with "unsafe" math optimizations: `gcc -funsafe-math-optimizations`.
This primer was written by Trevor Keller <trevor.keller@nist.gov>.
It is intended for educational purposes only.
Please cite this work:
Trevor Keller, "A primer on floating-point arithmetic and IEEE 754."
National Institute of Standards and Technology (2018):
DOI 10.5281/zenodo.6524704.
## Table of Contents

- [Dependencies](#dependencies)
- [Basic Premise](#basic-premise)
- [Three Term Addition](#three-term-addition)
- [Shuffled Summation](#shuffled-summation)
- [Out-of-Order Execution](#out-of-order-execution)
- [Conclusions](#conclusions)
- [Further Reading](#further-reading)
## Dependencies
To build the C and C++ programs in this repository, you will need:
- C and C++ compilers with standard libraries, e.g. gcc
- Make build tool
- MPFR library and headers, version 4 or greater
On Debian derivatives (Ubuntu, Mint), these are all packaged:

```
apt install gcc libmpfr6 libmpfr-dev make
```

## Basic Premise
Addition of real numbers is [associative][add]: inserting parentheses must not change the computed sum. For example, $(a + b) + c = a + (b + c)$.

The two's complement representation of integers naturally supports associativity within the representable range of $[-2^{n-1}, 2^{n-1} - 1]$ for an $n$-bit representation. However, fractional real numbers incur
a round-off error since infinite digits cannot practically be stored.
The chief exception to this general rule is that real fractions
formed by the sum of powers of two are exactly represented, as a
natural consequence of the data format.
It should be noted that two's complement is not the internal
representation used for floating point numbers, but can be used to
visualize floating point representations.
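
To make the round-off concrete, here is a small standalone C snippet (not part of this repository's code) that prints one fraction built from powers of two and one that is not, with enough digits to expose the rounding:

```c
/* Standalone illustration: sums of powers of two are stored exactly,
 * while 0.1 has no finite binary expansion and must be rounded. */
#include <stdio.h>

int main(void)
{
    float exact  = 0.375f; /* 1/4 + 1/8: a sum of powers of two */
    float approx = 0.1f;   /* infinite repeating binary fraction */

    printf("%.17f\n", (double) exact);  /* 0.37500000000000000 */
    printf("%.17f\n", (double) approx); /* 0.10000000149011612 */
    return 0;
}
```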
## Three Term Addition

When addition is associative, we expect the sum of three terms to be the same regardless of grouping: $(a + b) + c = a + (b + c)$. Due to the binary representation of floating-point numbers (specified by IEEE 754), or due to out-of-order execution on some CPUs, this will not always be the case for floating-point approximations of real numbers.
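
As a minimal standalone sketch of this failure (independent of the repository's addition.c, whose choice of terms may differ), consider one large term and two small ones, where the grouping decides whether the small contributions survive rounding:

```c
/* Standalone illustration: with 32-bit floats, grouping changes the sum.
 * Each small term alone is below half an ulp of 1.0f and rounds away,
 * but their sum is large enough to survive. Compile with default
 * (IEEE-compliant) settings, i.e. without -funsafe-math-optimizations. */
#include <stdio.h>

int main(void)
{
    float a = 1.0f;
    float b = 4.0e-8f; /* ulp(1.0f) = 2^-23, about 1.2e-7 */
    float c = 4.0e-8f;

    float left  = (a + b) + c; /* 1.000000000: b, then c, round away */
    float right = a + (b + c); /* 1.000000119: b + c exceeds half an ulp */

    printf("(a+b)+c = %.9f\na+(b+c) = %.9f\nequal = %d\n",
           (double) left, (double) right, left == right);
    return 0;
}
```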
## Results of Three-Term Addition

The code has three variants:

- `std` (standard representation)
- `mpf` (GNU MPFR representation)
- `unsafe` (`std` with unsafe optimizations)
Use the included Makefile or the commands in the following sections
to compile the executables.
Each program will output the value of each expression in binary and decimal form. When the expressions agree, a 1 is printed in the last column; otherwise, it will be 0. The binary form of a floating-point number is similar to binary integer notation, with each bit after the radix point representing a successively more negative power of two. Therefore, for example, $0.101_2 = 2^{-1} + 2^{-3} = 0.625_{10}$.
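
One way to produce such binary output (a sketch only; not necessarily how addition.c implements it) is to peel bits off by repeated doubling, which is exact in binary floating point:

```c
/* Sketch: print the binary expansion of a float in [0, 2) by repeated
 * doubling. Doubling and subtracting the integer bit are both exact in
 * double precision, so the loop terminates once the float's finite
 * significand is exhausted. */
#include <stdio.h>

static void print_bin(float f)
{
    double x = f;          /* float-to-double conversion is exact */
    printf("%d.", (int) x);
    x -= (int) x;
    while (x > 0.0) {
        x *= 2.0;          /* shift the next bit left of the point */
        int bit = (int) x;
        printf("%d", bit);
        x -= bit;
    }
    printf("\n");
}

int main(void)
{
    print_bin(1.0f / 3.0f); /* 0.0101010101010101010101011 */
    print_bin(0.125f);      /* 0.001 */
    return 0;
}
```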
### Built-in floating-point representation

```
$ make std
gcc -O3 -Wall addition.c -o std -lm && ./std
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
### Multi-precision floating-point representation

Using precision identical to the built-in float, we have the same result:

```
$ make spf
gcc -O3 -Wall -D_SINGLE_ -pedantic -include "mpfr.h" addition.c -o spf -lm -lmpfr && ./spf
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333328247 | 0.0101010101010101 | 0.01010101010101 | 0.333312988 | 0.333328247 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.199996948 | 0.0011001100110011 | 0.001100110011001 | 0.199981689 | 0.199996948 | 0 |
| 0.166664124 | 0.00101010101010101 | 0.001010101010101 | 0.166656494 | 0.166664124 | 0 |
| 0.142856598 | 0.001001001001001001 | 0.001001001001001 | 0.142852783 | 0.142856598 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The MPFR library allows us to increase the precision from double
(64-bit) to quadruple (128-bit):
```
$ make mpf
gcc -O3 -Wall -pedantic -include "mpfr.h" addition.c -o mpf -lm -lmpfr && ./mpf
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333313 | 0.010101010101010101010101 | 0.010101010101010101010101 | 0.333333313 | 0.333333313 | 1 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.199999988 | 0.001100110011001100110011 | 0.001100110011001100110011 | 0.199999988 | 0.199999988 | 1 |
| 0.166666657 | 0.0010101010101010101010101 | 0.0010101010101010101010101 | 0.166666657 | 0.166666657 | 1 |
| 0.142857134 | 0.001001001001001001001001 | 0.001001001001001001001001 | 0.142857134 | 0.142857134 | 1 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The increased precision is implemented in software, and libraries
like this allow the programmer to choose an appropriate level of
precision, the rounding scheme, and deterministic orders of
operations. However, they are necessarily slower than numerical
formats supported in hardware (like double, single, and
half-precision floats).
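
For illustration, a minimal MPFR sketch (not the repository's code; the operands 1/3, 1/5, and 1/7 are chosen arbitrarily) sets a 113-bit significand, comparable to IEEE 754 quadruple precision, and rounds each operation to nearest:

```c
/* Sketch: three-term addition at a programmer-chosen precision, with an
 * explicit rounding mode and a deterministic order of operations.
 * Build with: gcc demo.c -o demo -lmpfr -lm */
#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    mpfr_t a, b, c, lhs, rhs;
    mpfr_inits2(113, a, b, c, lhs, rhs, (mpfr_ptr) 0); /* 113-bit significand */

    mpfr_set_ui(a, 1, MPFR_RNDN);  mpfr_div_ui(a, a, 3, MPFR_RNDN); /* a = 1/3 */
    mpfr_set_ui(b, 1, MPFR_RNDN);  mpfr_div_ui(b, b, 5, MPFR_RNDN); /* b = 1/5 */
    mpfr_set_ui(c, 1, MPFR_RNDN);  mpfr_div_ui(c, c, 7, MPFR_RNDN); /* c = 1/7 */

    mpfr_add(lhs, a, b, MPFR_RNDN);   /* lhs = (a + b) + c */
    mpfr_add(lhs, lhs, c, MPFR_RNDN);

    mpfr_add(rhs, b, c, MPFR_RNDN);   /* rhs = a + (b + c) */
    mpfr_add(rhs, a, rhs, MPFR_RNDN);

    mpfr_printf("(a+b)+c = %.33Rf\na+(b+c) = %.33Rf\nequal = %d\n",
                lhs, rhs, mpfr_equal_p(lhs, rhs));

    mpfr_clears(a, b, c, lhs, rhs, (mpfr_ptr) 0);
    return 0;
}
```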
### Unsafe floating-point representation

Per *Using the GNU Compiler Collection (GCC)*, §3.10, `-funsafe-math-optimizations` enables optimizations that:

- assume both arguments and results are valid
- violate IEEE and ANSI standards
- change the floating-point unit control word
- cause programs that rely on exact implementation of IEEE or ISO rules to fail
- may yield faster code for programs that do not require the guarantees of these specifications

In other words, it is ill-advised for production code. However, in this test case it produces sums that appear associative:
```
$ make unsafe
gcc -O3 -Wall -funsafe-math-optimizations addition.c -o unsafe -lm && ./unsafe
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.0101010101010101010101011 | 0.333333343 | 0.333333343 | 1 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.00110011001100110011001101 | 0.200000003 | 0.200000003 | 1 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101011 | 0.166666672 | 0.166666672 | 1 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100100101 | 0.142857149 | 0.142857149 | 1 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
## Discussion of Three-Term Addition

As @JRMatey-NIST noted, this is a straightforward byproduct of the binary representation of floating-point numbers. Exact representation is possible for integer powers of 2 and their sums; any other number incurs rounding error. This is good to know, especially for codes that frequently increment large values by small amounts.
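
A standalone sketch of that pitfall (not from this repository): at 1.0e7f, the spacing between adjacent floats is 1.0, so a sub-half-ulp increment is lost on every iteration.

```c
/* Standalone illustration: repeatedly adding a small increment to a
 * large float silently discards every update. */
#include <stdio.h>

int main(void)
{
    float big = 1.0e7f;      /* exactly representable; ulp(big) = 1.0 */
    for (int i = 0; i < 1000000; i++)
        big += 0.25f;        /* 0.25 < half an ulp: rounds back down */

    printf("%.1f\n", big);   /* prints 10000000.0, not 10250000.0 */
    return 0;
}
```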
Since many scientific computing applications model diffusive
processes (heat transfer, mass diffusion, etc.), the effect is
expected to be small: the perturbations caused by round-off error
will be smoothed out by the stable numerical scheme without any
additional effort on the part of the programmer.
## Shuffled Summation

Ideally, the sequence of decimals (powers of 10) should sum to the same value, 10,000, regardless of the order in which the terms are added. However, due to the same floating-point representation problem, variations arise from the order of summation. As a demonstration, this program will generate a vector of 64 numbers (powers of 10); then, for each of 1 million trials, the same vector gets shuffled before summing. The histogram of values is then reported. For additional details, see the original thread.
Similarly, the sequence of binaries (powers of 2) should always sum to 128. The program will generate a vector of 65 numbers (powers of 2) and, for each of 1 million trials, shuffle the vector before summing. Due to the exact representation of powers of two, only one result (128) is expected for all million shuffles.
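
The following C sketch mirrors the procedure (the repository's summation.cpp is C++11, and its vector contents differ; the values below are purely hypothetical):

```c
/* Sketch: fill a vector, then shuffle and re-sum it several times.
 * With inexact terms of mixed magnitude, the printed sums can differ. */
#include <stdio.h>
#include <stdlib.h>

#define N 64

static void shuffle(float *v, int n)
{
    for (int i = n - 1; i > 0; i--) { /* Fisher-Yates; modulo bias ignored */
        int j = rand() % (i + 1);
        float t = v[i];
        v[i] = v[j];
        v[j] = t;
    }
}

int main(void)
{
    float v[N];
    for (int i = 0; i < N; i++)
        v[i] = 1.0f / (float)(i + 1); /* hypothetical contents */

    srand(42);
    for (int trial = 0; trial < 5; trial++) {
        shuffle(v, N);
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += v[i];
        printf("%.9f\n", s); /* sums may differ between shuffles */
    }
    return 0;
}
```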
## Results of Shuffled Summation

There are two variants, `shuffle` and `shuffle10`, which can be built using the Makefile or the command listed below.

### Floating-point representation of decimal sequence $\sum 10^n$

```
$ make shuffle10
g++ -O3 -Wall -pedantic -std=c++11 -DDECIMAL summation.cpp -o shuffle && ./shuffle
```

| Value | Proportion |
|---|---|
| 9999.99414062500000000000000000 | 0.033300001 % |
| 9999.99511718750000000000000000 | 0.624599993 % |
| 9999.99609375000000000000000000 | 4.240699768 % |
| 9999.99707031250000000000000000 | 15.360699654 % |
| 9999.99804687500000000000000000 | 34.903400421 % |
| 9999.99902343750000000000000000 | 33.413898468 % |
| 10000.00000000000000000000000000 | 11.187700272 % |
| 10000.00097656250000000000000000 | 0.235699996 % |
### Floating-point representation of binary sequence $\sum 2^n$
| Value | Proportion |
|---|---|
| 128.00000000000000000000000000 | 100.000000000 % |
## Discussion of Shuffled Summation

The sequence composed exclusively of powers of 2 is represented exactly; i.e., 1 million repetitions produce the same result, exactly 128, every time. The sequence of powers of 10 is approximate, with the exact result, 10,000, computed in only 11 % of the million repetitions.
## Out-of-Order Execution
Out-of-order execution is a common method to avoid processor
pipeline stalls: instructions are fetched in the order provided by
the compiler, but executed in a way that minimizes wasted processor
cycles. While modern CPUs implement out-of-order execution with
in-order completion, a few architectures feature out-of-order
completion of the compiler-supplied instructions, e.g. Intel's
P6 (Pentium Pro/II/III), Core, and
Silvermont (Atom, Knights Landing). Since the sequence of
mathematical operations would be non-deterministic, these processors
could exhibit poor reproducibility of floating-point arithmetic.
## Results of Out-of-Order Execution
Variants of the shuffled summation and
three-term addition codes can be used to
compare results between Intel Xeon (Sandy Bridge) and
Knights Landing (Silvermont) processors. Since the KNL
hardware is novel, the Intel C++ Compiler is preferred over GCC; it sets `-fp-model fast=1` by default, which is equivalent to `-Ofast` in GCC and must be disabled to provide a valid comparison using standards-compliant arithmetic. The variants, `phi` and `shufflePhi`, can be built using the Makefile or the commands below.
```
$ make phi
icc -O3 -Wall -xmic-avx512 -fp-model precise addition.c -o phi && ./phi
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
| 0.111111112 | 0.000111000111000111000111001 | 0.000111000111000111001 | 0.111111164 | 0.111111112 | 0 |
| 0.100000001 | 0.000110011001100110011001101 | 0.00011001100110011001101 | 0.100000024 | 0.100000001 | 0 |
| 0.090909094 | 0.0001011101000101110100011 | 0.00010111010001011101001 | 0.090909123 | 0.090909094 | 0 |
| 0.083333336 | 0.000101010101010101010101011 | 0.00010101010101010101011 | 0.083333373 | 0.083333336 | 0 |
| 0.076923080 | 0.000100111011000100111011001 | 0.0001001110110001001111 | 0.076923132 | 0.076923080 | 0 |
| 0.071428575 | 0.000100100100100100100100101 | 0.0001001001001001001001 | 0.071428537 | 0.071428575 | 0 |
| 0.066666670 | 0.000100010001000100010001001 | 0.00010001000100010001001 | 0.066666722 | 0.066666670 | 0 |
| 0.062500000 | 0.0001 | 0.0001 | 0.062500000 | 0.062500000 | 1 |
```
$ make shufflePhi
icc -O3 -Wall -pedantic -std=c++11 -DDECIMAL -xmic-avx512 -fp-model strict summation.cpp -o shufflePhi && ./shufflePhi
```

| Value | Proportion |
|---|---|
| 9999.99414062500000000000000000 | 0.033300001 % |
| 9999.99511718750000000000000000 | 0.624599993 % |
| 9999.99609375000000000000000000 | 4.240699768 % |
| 9999.99707031250000000000000000 | 15.360699654 % |
| 9999.99804687500000000000000000 | 34.903400421 % |
| 9999.99902343750000000000000000 | 33.413898468 % |
| 10000.00000000000000000000000000 | 11.187700272 % |
| 10000.00097656250000000000000000 | 0.235699996 % |
## Discussion of Out-of-Order Execution

The results of shuffled summation, with trailing zeros truncated, show no difference between the microarchitectures:

| Value | Xeon E5-1650 | Xeon Phi 7210 |
|---|---|---|
| 9999.9941406250 | 0.033300000 % | 0.033300001 % |
| 9999.9951171875 | 0.624600000 % | 0.624599993 % |
| 9999.9960937500 | 4.240700000 % | 4.240699768 % |
| 9999.9970703125 | 15.360700000 % | 15.360699654 % |
| 9999.9980468750 | 34.903400000 % | 34.903400421 % |
| 9999.9990234375 | 33.413900000 % | 33.413898468 % |
| 10000.0000000000 | 11.187700000 % | 11.187700272 % |
| 10000.0009765625 | 0.235700000 % | 0.235699996 % |
## Conclusions
For both test cases — associativity of
summations equal to 10,000 (decimal) and 128 (binary) — the programs
in this repository demonstrate that deviations from mathematically
expected results arise strictly due to the binary representation of
floating-point numbers, not computer hardware or CPU instruction
sets. While this inadequacy of IEEE 754 is nothing new, the
source code in this repository does provide simple, repeatable
examples of the phenomenon, and may be of use as a teaching aid.
## Further Reading

The following resources helped me to gain understanding.
- David Goldberg (1991): *What Every Computer Scientist Should Know About Floating-Point Arithmetic*. Concise summary of how the standard floating-point representation works, and pitfalls to avoid.
- Donald Knuth (1997): *The Art of Computer Programming*, Vol. 2, Ed. 3, §4.2.2: Accuracy of Floating Point Arithmetic. Philosophical review of floating-point representations with discussion of the compromises made to arrive at a working standard.
- Erik Cheever (2001): *Representation of Numbers*. Summary, with exercises, of how the standard represents real numbers.
- Rick Regan (2012): *Why 0.1 Does Not Exist In Floating-Point*. Excellent illustration of why round-off error is a necessary evil in the standard representation.
- James Demmel and Hong Diep Nguyen (2013): *Numerical Reproducibility and Accuracy at Exascale*. Short paper summarizing reproducibility of summation algorithms.
- James Demmel, Hong Diep Nguyen, and Peter Ahrens (2015): *Cost of Floating-Point Reproducibility*. Detailed slides discussing sources of non-reproducibility and the computational cost of correcting the deficiencies.
- Jeff Arnold (2017): *An Introduction to Floating-Point Arithmetic and Computation*. Thorough slide deck highlighting problem areas for scientific computing.
- Thomas Risse (2017): *Better is the Enemy of Good: Unums — An Alternative to IEEE 754 Floats and Doubles*. Concise discussion of an alternative floating-point representation.
- James Matey (2018): *Re: Floating-Point Repeatability Issues on Modern Processors*. Insightful comment on the GitHub Issue, with helpful references.