# A Primer on Floating-Point Arithmetic and IEEE 754
This repository addresses usnistgov/discuss#8 by exploring
floating-point rounding errors through the lens of the associative
property of addition. The programs herein demonstrate that deviations
from mathematical precision are due to the standard representation of
floating-point numbers in a binary data type, in compliance with
IEEE 754. The evidence presented here does not support any
role of the CPU, instruction sets, or dynamic execution in the
deviations.
*tl;dr:* If strict adherence to mathematical law is required, use a high-precision math library such as GNU MPFR instead of the built-in data types. If performance is needed, and some precision can be sacrificed in the interests of accuracy, build your C programs with "unsafe" math optimizations: `gcc -funsafe-math-optimizations`.
This primer was written by Trevor Keller <trevor.keller@nist.gov>.
It is intended for educational purposes only.
Please cite this work:
Trevor Keller, "A primer on floating-point arithmetic and IEEE 754."
National Institute of Standards and Technology (2018):
DOI 10.5281/zenodo.6524704.
## Table of Contents

- [Dependencies](#dependencies)
- [Basic Premise](#basic-premise)
- [Three Term Addition](#three-term-addition)
- [Shuffled Summation](#shuffled-summation)
- [Out-of-Order Execution](#out-of-order-execution)
- [Conclusions](#conclusions)
- [Further Reading](#further-reading)
## Dependencies
To build the C and C++ programs in this repository, you will need:
- C and C++ compilers with standard libraries, e.g. gcc
- Make build tool
- MPFR library and headers, version 4 or greater
On Debian derivatives (Ubuntu, Mint), these are all packaged:

```
apt install gcc libmpfr6 libmpfr-dev make
```

## Basic Premise
Addition of real numbers is [associative][add]: inserting parentheses must not change the computed sum. For example, $(a + b) + c = a + (b + c)$.

The two's complement representation of integers naturally supports associativity within the representable range of $[-2^{n-1}, 2^{n-1} - 1]$ for an $n$-bit representation. However, fractional real numbers incur
a round-off error since infinite digits cannot practically be stored.
The chief exception to this general rule is that real fractions
formed by the sum of powers of two are exactly represented, as a
natural consequence of the data format.
It should be noted that two's complement is not the internal
representation used for floating point numbers, but can be used to
visualize floating point representations.
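
To make the round-off concrete, here is a small standalone C snippet (not part of this repository's code) that prints one fraction built from powers of two and one that is not, with enough digits to expose the rounding:

```c
/* Standalone illustration: sums of powers of two are stored exactly,
 * while 0.1 has no finite binary expansion and must be rounded. */
#include <stdio.h>

int main(void)
{
    float exact  = 0.375f; /* 1/4 + 1/8: a sum of powers of two */
    float approx = 0.1f;   /* infinite repeating binary fraction */

    printf("%.17f\n", (double) exact);  /* 0.37500000000000000 */
    printf("%.17f\n", (double) approx); /* 0.10000000149011612 */
    return 0;
}
```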
## Three Term Addition

When addition is associative, we expect the sum of three terms to be the same regardless of grouping: $(a + b) + c = a + (b + c)$. Due to the binary representation of floating-point numbers (specified by IEEE 754), or due to out-of-order execution on some CPUs, this will not always be the case for floating-point approximations of real numbers.
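
As a minimal standalone sketch of this failure (independent of the repository's addition.c, whose choice of terms may differ), consider one large term and two small ones, where the grouping decides whether the small contributions survive rounding:

```c
/* Standalone illustration: with 32-bit floats, grouping changes the sum.
 * Each small term alone is below half an ulp of 1.0f and rounds away,
 * but their sum is large enough to survive. Compile with default
 * (IEEE-compliant) settings, i.e. without -funsafe-math-optimizations. */
#include <stdio.h>

int main(void)
{
    float a = 1.0f;
    float b = 4.0e-8f; /* ulp(1.0f) = 2^-23, about 1.2e-7 */
    float c = 4.0e-8f;

    float left  = (a + b) + c; /* 1.000000000: b, then c, round away */
    float right = a + (b + c); /* 1.000000119: b + c exceeds half an ulp */

    printf("(a+b)+c = %.9f\na+(b+c) = %.9f\nequal = %d\n",
           (double) left, (double) right, left == right);
    return 0;
}
```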
## Results of Three-Term Addition

The code has three variants:

- `std` (standard representation)
- `mpf` (GNU MPFR representation)
- `unsafe` (`std` with unsafe optimizations)
Use the included Makefile or the commands in the following sections
to compile the executables.
Each program will output the value of each expression in binary and decimal form. When the expressions agree, a 1 is printed in the last column; otherwise, it will be 0. The binary form of a floating-point number is similar to binary integer notation, with each bit after the radix point representing a successively more negative power of two. Therefore, for example, $0.101_2 = 2^{-1} + 2^{-3} = 0.625_{10}$.
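
One way to produce such binary output (a sketch only; not necessarily how addition.c implements it) is to peel bits off by repeated doubling, which is exact in binary floating point:

```c
/* Sketch: print the binary expansion of a float in [0, 2) by repeated
 * doubling. Doubling and subtracting the integer bit are both exact in
 * double precision, so the loop terminates once the float's finite
 * significand is exhausted. */
#include <stdio.h>

static void print_bin(float f)
{
    double x = f;          /* float-to-double conversion is exact */
    printf("%d.", (int) x);
    x -= (int) x;
    while (x > 0.0) {
        x *= 2.0;          /* shift the next bit left of the point */
        int bit = (int) x;
        printf("%d", bit);
        x -= bit;
    }
    printf("\n");
}

int main(void)
{
    print_bin(1.0f / 3.0f); /* 0.0101010101010101010101011 */
    print_bin(0.125f);      /* 0.001 */
    return 0;
}
```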
### Built-in floating-point representation

```
$ make std
gcc -O3 -Wall addition.c -o std -lm && ./std
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
### Multi-precision floating-point representation

Using precision identical to the built-in float, we have the same result:

```
$ make spf
gcc -O3 -Wall -D_SINGLE_ -pedantic -include "mpfr.h" addition.c -o spf -lm -lmpfr && ./spf
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333328247 | 0.0101010101010101 | 0.01010101010101 | 0.333312988 | 0.333328247 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.199996948 | 0.0011001100110011 | 0.001100110011001 | 0.199981689 | 0.199996948 | 0 |
| 0.166664124 | 0.00101010101010101 | 0.001010101010101 | 0.166656494 | 0.166664124 | 0 |
| 0.142856598 | 0.001001001001001001 | 0.001001001001001 | 0.142852783 | 0.142856598 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The MPFR library allows us to increase the precision from double
(64-bit) to quadruple (128-bit):
```
$ make mpf
gcc -O3 -Wall -pedantic -include "mpfr.h" addition.c -o mpf -lm -lmpfr && ./mpf
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333313 | 0.010101010101010101010101 | 0.010101010101010101010101 | 0.333333313 | 0.333333313 | 1 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.199999988 | 0.001100110011001100110011 | 0.001100110011001100110011 | 0.199999988 | 0.199999988 | 1 |
| 0.166666657 | 0.0010101010101010101010101 | 0.0010101010101010101010101 | 0.166666657 | 0.166666657 | 1 |
| 0.142857134 | 0.001001001001001001001001 | 0.001001001001001001001001 | 0.142857134 | 0.142857134 | 1 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
The increased precision is implemented in software, and libraries
like this allow the programmer to choose an appropriate level of
precision, the rounding scheme, and deterministic orders of
operations. However, they are necessarily slower than numerical
formats supported in hardware (like double, single, and
half-precision floats).
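
For illustration, a minimal MPFR sketch (not the repository's code; the operands 1/3, 1/5, and 1/7 are chosen arbitrarily) sets a 113-bit significand, comparable to IEEE 754 quadruple precision, and rounds each operation to nearest:

```c
/* Sketch: three-term addition at a programmer-chosen precision, with an
 * explicit rounding mode and a deterministic order of operations.
 * Build with: gcc demo.c -o demo -lmpfr -lm */
#include <stdio.h>
#include <mpfr.h>

int main(void)
{
    mpfr_t a, b, c, lhs, rhs;
    mpfr_inits2(113, a, b, c, lhs, rhs, (mpfr_ptr) 0); /* 113-bit significand */

    mpfr_set_ui(a, 1, MPFR_RNDN);  mpfr_div_ui(a, a, 3, MPFR_RNDN); /* a = 1/3 */
    mpfr_set_ui(b, 1, MPFR_RNDN);  mpfr_div_ui(b, b, 5, MPFR_RNDN); /* b = 1/5 */
    mpfr_set_ui(c, 1, MPFR_RNDN);  mpfr_div_ui(c, c, 7, MPFR_RNDN); /* c = 1/7 */

    mpfr_add(lhs, a, b, MPFR_RNDN);   /* lhs = (a + b) + c */
    mpfr_add(lhs, lhs, c, MPFR_RNDN);

    mpfr_add(rhs, b, c, MPFR_RNDN);   /* rhs = a + (b + c) */
    mpfr_add(rhs, a, rhs, MPFR_RNDN);

    mpfr_printf("(a+b)+c = %.33Rf\na+(b+c) = %.33Rf\nequal = %d\n",
                lhs, rhs, mpfr_equal_p(lhs, rhs));

    mpfr_clears(a, b, c, lhs, rhs, (mpfr_ptr) 0);
    return 0;
}
```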
### Unsafe floating-point representation

Per *Using the GNU Compiler Collection (GCC)*, §3.10, `-funsafe-math-optimizations` enables optimizations that:

- assume both arguments and results are valid
- violate IEEE and ANSI standards
- change the floating-point unit control word
- cause programs that rely on exact implementation of IEEE or ISO rules to fail
- may yield faster code for programs that do not require the guarantees of these specifications

In other words, it is ill-advised for production code. However, in this test case it produces sums that appear associative:
```
$ make unsafe
gcc -O3 -Wall -funsafe-math-optimizations addition.c -o unsafe -lm && ./unsafe
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.0101010101010101010101011 | 0.333333343 | 0.333333343 | 1 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.00110011001100110011001101 | 0.200000003 | 0.200000003 | 1 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101011 | 0.166666672 | 0.166666672 | 1 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100100101 | 0.142857149 | 0.142857149 | 1 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
## Discussion of Three-Term Addition

As @JRMatey-NIST noted, this is a straightforward byproduct of the binary representation of floating-point numbers. Exact representation is possible for integer powers of 2 and their sums; any other number incurs rounding error. This is good to know, especially for codes that frequently increment large values by small amounts.
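
A standalone sketch of that pitfall (not from this repository): at 1.0e7f, the spacing between adjacent floats is 1.0, so a sub-half-ulp increment is lost on every iteration.

```c
/* Standalone illustration: repeatedly adding a small increment to a
 * large float silently discards every update. */
#include <stdio.h>

int main(void)
{
    float big = 1.0e7f;      /* exactly representable; ulp(big) = 1.0 */
    for (int i = 0; i < 1000000; i++)
        big += 0.25f;        /* 0.25 < half an ulp: rounds back down */

    printf("%.1f\n", big);   /* prints 10000000.0, not 10250000.0 */
    return 0;
}
```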
Since many scientific computing applications model diffusive
processes (heat transfer, mass diffusion, etc.), the effect is
expected to be small: the perturbations caused by round-off error
will be smoothed out by the stable numerical scheme without any
additional effort on the part of the programmer.
## Shuffled Summation

Ideally, the sequence of decimals (powers of 10) should sum to the same value, 10,000, regardless of the order in which the terms are added. However, due to the same floating-point representation problem, variations arise from the order of summation. As a demonstration, this program will generate a vector of 64 numbers (powers of 10); then, for each of 1 million trials, the same vector gets shuffled before summing. The histogram of values is then reported. For additional details, see the original thread.
Similarly, the sequence of binaries (powers of 2) should always sum to 128. The program will generate a vector of 65 numbers (powers of 2) and, for each of 1 million trials, shuffle the vector before summing. Due to the exact representation of powers of two, only one result (128) is expected for all million shuffles.
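
The following C sketch mirrors the procedure (the repository's summation.cpp is C++11, and its vector contents differ; the values below are purely hypothetical):

```c
/* Sketch: fill a vector, then shuffle and re-sum it several times.
 * With inexact terms of mixed magnitude, the printed sums can differ. */
#include <stdio.h>
#include <stdlib.h>

#define N 64

static void shuffle(float *v, int n)
{
    for (int i = n - 1; i > 0; i--) { /* Fisher-Yates; modulo bias ignored */
        int j = rand() % (i + 1);
        float t = v[i];
        v[i] = v[j];
        v[j] = t;
    }
}

int main(void)
{
    float v[N];
    for (int i = 0; i < N; i++)
        v[i] = 1.0f / (float)(i + 1); /* hypothetical contents */

    srand(42);
    for (int trial = 0; trial < 5; trial++) {
        shuffle(v, N);
        float s = 0.0f;
        for (int i = 0; i < N; i++)
            s += v[i];
        printf("%.9f\n", s); /* sums may differ between shuffles */
    }
    return 0;
}
```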
## Results of Shuffled Summation

There are two variants, `shuffle` and `shuffle10`, which can be built using the Makefile or the command listed below.

### Floating-point representation of decimal sequence $\sum 10^n$

```
$ make shuffle10
g++ -O3 -Wall -pedantic -std=c++11 -DDECIMAL summation.cpp -o shuffle && ./shuffle
```

| Value | Proportion |
|---|---|
| 9999.99414062500000000000000000 | 0.033300001 % |
| 9999.99511718750000000000000000 | 0.624599993 % |
| 9999.99609375000000000000000000 | 4.240699768 % |
| 9999.99707031250000000000000000 | 15.360699654 % |
| 9999.99804687500000000000000000 | 34.903400421 % |
| 9999.99902343750000000000000000 | 33.413898468 % |
| 10000.00000000000000000000000000 | 11.187700272 % |
| 10000.00097656250000000000000000 | 0.235699996 % |
### Floating-point representation of binary sequence $\sum 2^n$
| Value | Proportion |
|---|---|
| 128.00000000000000000000000000 | 100.000000000 % |
## Discussion of Shuffled Summation

The sequence composed exclusively of powers of 2 is represented exactly; i.e., 1 million repetitions produce the same result, exactly 128, every time. The sequence of powers of 10 is approximate, with the exact result, 10,000, computed in only 11 % of the million repetitions.
## Out-of-Order Execution
Out-of-order execution is a common method to avoid processor
pipeline stalls: instructions are fetched in the order provided by
the compiler, but executed in a way that minimizes wasted processor
cycles. While modern CPUs implement out-of-order execution with
in-order completion, a few architectures feature out-of-order
completion of the compiler-supplied instructions, e.g. Intel's
P6 (Pentium Pro/II/III), Core, and
Silvermont (Atom, Knights Landing). Since the sequence of
mathematical operations would be non-deterministic, these processors
could exhibit poor reproducibility of floating-point arithmetic.
## Results of Out-of-Order Execution
Variants of the shuffled summation and
three-term addition codes can be used to
compare results between Intel Xeon (Sandy Bridge) and
Knights Landing (Silvermont) processors. Since the KNL
hardware is novel, the Intel C++ Compiler is preferred over GCC; it sets `-fp-model fast=1` by default, which is equivalent to `-Ofast` in GCC and must be disabled to provide a valid comparison using standards-compliant arithmetic. The variants, `phi` and `shufflePhi`, can be built using the Makefile or the commands below.
```
$ make phi
icc -O3 -Wall -xmic-avx512 -fp-model precise addition.c -o phi && ./phi
```

| a | bin(a) | bin(a+b+c) | (a+b)+c | a+(b+c) | equal |
|---|---|---|---|---|---|
| 1.000000000 | 1.0 | 1.0 | 1.000000000 | 1.000000000 | 1 |
| 0.500000000 | 0.1 | 0.1 | 0.500000000 | 0.500000000 | 1 |
| 0.333333343 | 0.0101010101010101010101011 | 0.01010101010101010101011 | 0.333333373 | 0.333333343 | 0 |
| 0.250000000 | 0.01 | 0.01 | 0.250000000 | 0.250000000 | 1 |
| 0.200000003 | 0.00110011001100110011001101 | 0.0011001100110011001101 | 0.200000048 | 0.200000003 | 0 |
| 0.166666672 | 0.00101010101010101010101011 | 0.00101010101010101010101 | 0.166666627 | 0.166666672 | 0 |
| 0.142857149 | 0.00100100100100100100100101 | 0.00100100100100100100101 | 0.142857194 | 0.142857149 | 0 |
| 0.125000000 | 0.001 | 0.001 | 0.125000000 | 0.125000000 | 1 |
| 0.111111112 | 0.000111000111000111000111001 | 0.000111000111000111001 | 0.111111164 | 0.111111112 | 0 |
| 0.100000001 | 0.000110011001100110011001101 | 0.00011001100110011001101 | 0.100000024 | 0.100000001 | 0 |
| 0.090909094 | 0.0001011101000101110100011 | 0.00010111010001011101001 | 0.090909123 | 0.090909094 | 0 |
| 0.083333336 | 0.000101010101010101010101011 | 0.00010101010101010101011 | 0.083333373 | 0.083333336 | 0 |
| 0.076923080 | 0.000100111011000100111011001 | 0.0001001110110001001111 | 0.076923132 | 0.076923080 | 0 |
| 0.071428575 | 0.000100100100100100100100101 | 0.0001001001001001001001 | 0.071428537 | 0.071428575 | 0 |
| 0.066666670 | 0.000100010001000100010001001 | 0.00010001000100010001001 | 0.066666722 | 0.066666670 | 0 |
| 0.062500000 | 0.0001 | 0.0001 | 0.062500000 | 0.062500000 | 1 |
```
$ make shufflePhi
icc -O3 -Wall -pedantic -std=c++11 -DDECIMAL -xmic-avx512 -fp-model strict summation.cpp -o shufflePhi && ./shufflePhi
```

| Value | Proportion |
|---|---|
| 9999.99414062500000000000000000 | 0.033300001 % |
| 9999.99511718750000000000000000 | 0.624599993 % |
| 9999.99609375000000000000000000 | 4.240699768 % |
| 9999.99707031250000000000000000 | 15.360699654 % |
| 9999.99804687500000000000000000 | 34.903400421 % |
| 9999.99902343750000000000000000 | 33.413898468 % |
| 10000.00000000000000000000000000 | 11.187700272 % |
| 10000.00097656250000000000000000 | 0.235699996 % |
## Discussion of Out-of-Order Execution

The results of shuffled summation, with trailing zeros truncated, show no difference between the microarchitectures:

| Value | Xeon E5-1650 | Xeon Phi 7210 |
|---|---|---|
| 9999.9941406250 | 0.033300000 % | 0.033300001 % |
| 9999.9951171875 | 0.624600000 % | 0.624599993 % |
| 9999.9960937500 | 4.240700000 % | 4.240699768 % |
| 9999.9970703125 | 15.360700000 % | 15.360699654 % |
| 9999.9980468750 | 34.903400000 % | 34.903400421 % |
| 9999.9990234375 | 33.413900000 % | 33.413898468 % |
| 10000.0000000000 | 11.187700000 % | 11.187700272 % |
| 10000.0009765625 | 0.235700000 % | 0.235699996 % |
## Conclusions
For both test cases — associativity of
summations equal to 10,000 (decimal) and 128 (binary) — the programs
in this repository demonstrate that deviations from mathematically
expected results arise strictly due to the binary representation of
floating-point numbers, not computer hardware or CPU instruction
sets. While this inadequacy of IEEE 754 is nothing new, the
source code in this repository does provide simple, repeatable
examples of the phenomenon, and may be of use as a teaching aid.
## Further Reading

The following resources helped me to gain understanding.
- David Goldberg (1991): *What Every Computer Scientist Should Know About Floating-Point Arithmetic*. Concise summary of how the standard floating-point representation works, and pitfalls to avoid.
- Donald Knuth (1997): *The Art of Computer Programming*, Vol. 2, Ed. 3, §4.2.2: Accuracy of Floating Point Arithmetic. Philosophical review of floating-point representations with discussion of the compromises made to arrive at a working standard.
- Erik Cheever (2001): *Representation of Numbers*. Summary, with exercises, of how the standard represents real numbers.
- Rick Regan (2012): *Why 0.1 Does Not Exist In Floating-Point*. Excellent illustration of why round-off error is a necessary evil in the standard representation.
- James Demmel and Hong Diep Nguyen (2013): *Numerical Reproducibility and Accuracy at Exascale*. Short paper summarizing reproducibility of summation algorithms.
- James Demmel, Hong Diep Nguyen, and Peter Ahrens (2015): *Cost of Floating-Point Reproducibility*. Detailed slides discussing sources of non-reproducibility and the computational cost of correcting the deficiencies.
- Jeff Arnold (2017): *An Introduction to Floating-Point Arithmetic and Computation*. Thorough slide deck highlighting problem areas for scientific computing.
- Thomas Risse (2017): *Better is the Enemy of Good: Unums — An Alternative to IEEE 754 Floats and Doubles*. Concise discussion of an alternative floating-point representation.
- James Matey (2018): *Re: Floating-Point Repeatability Issues on Modern Processors*. Insightful comment on the GitHub Issue, with helpful references.