IEEE Standard 754 Floating-Point (2024)

Steve Hollasch • Last update 2018-08-24

IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based PC's, Macintoshes, and most Unix platforms. This article gives a brief overview of IEEE floating point and its representation. Discussion of arithmetic implementation may be found in the book mentioned at the bottom of this article.

What Are Floating Point Numbers?

There are several ways to represent real numbers on computers. Fixed point places a radix point somewhere in the middle of the digits, and is equivalent to using integers that represent portions of some unit. For example, one might represent 1/100ths of a unit; if you have four decimal digits, you could represent 10.82, or 00.01. Another approach is to use rationals, and represent every number as the ratio of two integers.

Floating-point representation – the most common solution – uses scientific notation to encode numbers, with a base number and an exponent. For example, 123.456 could be represented as 1.23456 × 10². In hexadecimal, the number 123.abc might be represented as 1.23abc × 16². In binary, the number 10100.110 could be represented as 1.0100110 × 2⁴.

Floating-point solves a number of representation problems. Fixed-point has a fixed window of representation, which limits it from representing both very large and very small numbers. Also, fixed-point is prone to a loss of precision when two large numbers are divided.

Floating-point, on the other hand, employs a sort of "sliding window" of precision appropriate to the scale of the number. This allows it to represent numbers from 1,000,000,000,000 to 0.0000000000000001 with ease, while maximizing precision (the number of digits) at both ends of the scale.

Storage Layout

IEEE floating point numbers have three basic components: the sign, the exponent, and the mantissa. The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.

The following table shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits for each field is shown, followed by the bit ranges in square brackets. Bit 00 is the least-significant bit.

Floating Point Components
	Sign	Exponent	Fraction
Single Precision	1 [31]	8 [30–23]	23 [22–00]
Double Precision	1 [63]	11 [62–52]	52 [51–00]
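As a rough illustration of the table above, here is a small Python sketch that pulls the three fields out of a value. It assumes Python's standard struct module and a platform whose float and double types are IEEE 754; the function names float32_fields and float64_fields are made up for this example.

```python
import struct

def float32_fields(x):
    """Return (sign, stored exponent, fraction) of x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # bit 31
    exponent = (bits >> 23) & 0xFF   # bits 30-23 (stored value, bias included)
    fraction = bits & 0x7FFFFF       # bits 22-00
    return sign, exponent, fraction

def float64_fields(x):
    """Return (sign, stored exponent, fraction) of a double-precision value."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # bit 63
    exponent = (bits >> 52) & 0x7FF      # bits 62-52
    fraction = bits & ((1 << 52) - 1)    # bits 51-00
    return sign, exponent, fraction

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
print(float64_fields(1.0))    # (0, 1023, 0)
```

The big-endian format codes (">f", ">I") are used only so that the packed bytes can be reread as one integer; they say nothing about how the host machine stores the value.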

The Sign Bit

The sign bit is as simple as it gets: 0 denotes a positive number, and 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.

The Exponent

The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, to express an exponent of zero, 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200−127), or 73. For reasons discussed later, exponents of −127 (all 0s) and +128 (all 1s) are reserved for special numbers.

Double precision has an 11-bit exponent field, with a bias of 1023.
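The bias arithmetic can be checked with a few lines of Python. This is only a sketch: the names SINGLE_BIAS, DOUBLE_BIAS, stored_exponent_single, and true_exponent are invented for the example, and it assumes the struct module packs "f" as an IEEE single.

```python
import struct

SINGLE_BIAS = 127
DOUBLE_BIAS = 1023

def stored_exponent_single(x):
    """Read the 8-bit exponent field of x rounded to single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

def true_exponent(stored, bias):
    """Undo the bias to recover the actual exponent."""
    return stored - bias

print(stored_exponent_single(1.0))        # 127, i.e. an exponent of zero
print(stored_exponent_single(2.0 ** 73))  # 200
print(true_exponent(200, SINGLE_BIAS))    # 73
print(true_exponent(1023, DOUBLE_BIAS))   # 0
```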

The Mantissa

The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit (left of the radix point) and the fraction bits (to the right of the radix point).

To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number 50 can be represented as any of these:

0.050 × 10³
.5000 × 10²
5.000 × 10¹
50.00 × 10⁰
5000. × 10⁻²

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, 50 is represented as 5.000 × 10¹.

A nice little optimization is now available to us in base two, since binary has only one possible non-zero digit: 1. Thus, we can just assume a leading digit of 1, and don't need to store it in the floating-point representation. As a result, a 32-bit floating-point value effectively has 24 bits of mantissa: 23 explicit fraction bits plus one implicit leading bit of 1.
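One way to see the 24 bits of mantissa at work is to look at integers near 2²⁴. The sketch below assumes that Python's struct module rounds a double to the nearest single when packing with "f" (which is what happens on IEEE hardware); to_float32 is a helper name invented here.

```python
import struct

def to_float32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack(">f", struct.pack(">f", x))[0]

print(to_float32(16777215.0) == 16777215.0)  # True: 2^24 - 1 fits in 24 bits
print(to_float32(16777216.0) == 16777216.0)  # True: 2^24 is 1.0 x 2^24
print(to_float32(16777217.0))                # 16777216.0: 2^24 + 1 needs 25 bits
```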

Putting it All Together

So, to sum up:

The sign bit is 0 for positive, 1 for negative.
The exponent base is two.
The exponent field contains 127 plus the true exponent for single-precision, or 1023 plus the true exponent for double precision.
The first bit of the mantissa is typically assumed to be 1, yielding a full mantissa of 1.f, where f is the field of fraction bits.
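The four points above translate fairly directly into code. The following is a minimal sketch for normalized single-precision values only (the special exponent values are covered later); decode_normalized_single is a name made up for this example, and the input is the raw 32-bit pattern as an integer.

```python
def decode_normalized_single(bits):
    """Compute (-1)^s x 1.f x 2^(e-127) from a raw 32-bit pattern."""
    sign = bits >> 31
    stored = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 1 <= stored <= 254:
        raise ValueError("exponent fields of all 0s or all 1s are special")
    mantissa = 1 + fraction / 2**23      # implicit leading 1, then .f
    return (-1) ** sign * mantissa * 2.0 ** (stored - 127)

print(decode_normalized_single(0x3F800000))  # 1.0
print(decode_normalized_single(0xC0200000))  # -2.5
```

Every step here is exact in double precision, so the result should match what the hardware would give for the same bit pattern.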

Ranges of Floating-Point Numbers

Let's consider single-precision floats for a second. We're taking essentially a 32-bit number and reinterpreting the fields to cover a much broader range. Something has to give, and that something is precision. For example, regular 32-bit integers, with all precision centered around zero, can precisely store integers with 32 bits of resolution. Single-precision floating-point, on the other hand, is unable to match this resolution with its 24 bits. It does, however, approximate this value by effectively truncating from the lower end and rounding to the nearest representable value (up, in this case). For example:

  11110000 11001100 10101010 10101111   // 32-bit integer
= +1.1110000 11001100 10101011 × 2³¹    // Single-precision float
= 11110000 11001100 10101011 00000000   // Actual float value

This approximates the 32-bit value, but doesn't yield an exact representation. On the other hand, besides the ability to represent fractional components (which integers lack completely), the floating-point value can represent numbers around 2¹²⁷, compared to 32-bit integers' maximum value around 2³².
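The integer example above can be reproduced in Python. This sketch assumes the struct module's "f" format rounds to the nearest single-precision value; the variable names are just for illustration.

```python
import struct

n = 0b11110000_11001100_10101010_10101111        # the 32-bit integer from the example
packed = struct.pack(">f", float(n))             # round to single precision
(bits,) = struct.unpack(">I", packed)
as_float = struct.unpack(">f", packed)[0]

print(((bits >> 23) & 0xFF) - 127)     # 31, the true exponent
print(f"{bits & 0x7FFFFF:023b}")       # 11100001100110010101011, the fraction f
print(f"{int(as_float):032b}")         # 11110000110011001010101100000000
print(int(as_float) - n)               # 81, the rounding error
```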

The range of positive floating point numbers can be split into normalized numbers (which preserve the full precision of the mantissa), and denormalized numbers (which assume a leading digit of 0, discussed later), which use only a portion of the fraction's precision.

Floating Point Range
	Denormalized	Normalized	Approximate Decimal
Single Precision	± 2⁻¹⁴⁹ to (1−2⁻²³)×2⁻¹²⁶	± 2⁻¹²⁶ to (2−2⁻²³)×2¹²⁷	± ≈10^−44.85 to ≈10^38.53
Double Precision	± 2⁻¹⁰⁷⁴ to (1−2⁻⁵²)×2⁻¹⁰²²	± 2⁻¹⁰²² to (2−2⁻⁵²)×2¹⁰²³	± ≈10^−323.3 to ≈10^308.3

Since every floating-point number has a corresponding, negated value (by toggling the sign bit), the ranges above are symmetric around zero.
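The single-precision limits in the table can be checked by building the extreme bit patterns directly. A sketch, assuming the struct module and IEEE singles; single_from_bits is a helper invented for this example.

```python
import math
import struct

def single_from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_denormal = single_from_bits(0x00000001)  # exponent 0, fraction 00...01
largest_denormal = single_from_bits(0x007FFFFF)   # exponent 0, fraction 11...11
smallest_normal = single_from_bits(0x00800000)    # exponent 1, fraction 00...00
largest_normal = single_from_bits(0x7F7FFFFF)     # exponent 254, fraction 11...11

print(smallest_denormal == 2.0 ** -149)                     # True
print(largest_denormal == (1 - 2.0 ** -23) * 2.0 ** -126)   # True
print(smallest_normal == 2.0 ** -126)                       # True
print(largest_normal == (2 - 2.0 ** -23) * 2.0 ** 127)      # True
print(round(math.log10(smallest_denormal), 2))              # -44.85
print(round(math.log10(largest_normal), 2))                 # 38.53
```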

There are five distinct numerical ranges that single-precision floating-point numbers are not able to represent with the scheme presented so far:

Negative numbers less than −(2−2⁻²³) × 2¹²⁷ (negative overflow)
Negative numbers greater than −2⁻¹⁴⁹ (negative underflow)
Zero
Positive numbers less than 2⁻¹⁴⁹ (positive underflow)
Positive numbers greater than (2−2⁻²³) × 2¹²⁷ (positive overflow)

Overflow means that values have grown too large for the representation, much in the same way that you can overflow integers. Underflow is a less serious problem because it just denotes a loss of precision: the result is too small to represent, and is closely approximated by zero.

Here's a table of the total effective range of finite IEEE floating-point numbers:

Effective Floating-Point Range
	Binary	Decimal
Single	± (2−2⁻²³) × 2¹²⁷	≈ ± 10^38.53
Double	± (2−2⁻⁵²) × 2¹⁰²³	≈ ± 10^308.25

Note that the extreme values occur (regardless of sign) when the exponent is at the maximum value for finite numbers (2¹²⁷ for single-precision, 2¹⁰²³ for double), and the mantissa is filled with 1s (including the normalizing 1 bit).

Special Values

IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.

Denormalized

If the exponent is all 0s, then the value is a denormalized number, which now has an assumed leading 0 before the binary point. Thus, this represents a number (−1)^s × 0.f × 2⁻¹²⁶, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (−1)^s × 0.f × 2⁻¹⁰²².

As denormalized numbers get smaller, they gradually lose precision as the left bits of the fraction become zeros. At the smallest non-zero denormalized value (only the least-significant fraction bit is one), a 32-bit floating-point number has but a single bit of precision, compared to the standard 24 bits for normalized values.
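Here is a sketch of the denormalized formula for single precision, together with a count of how many bits of precision remain. Both function names are made up for this example, and the input is the raw 32-bit pattern as an integer.

```python
def decode_denormalized_single(bits):
    """Compute (-1)^s x 0.f x 2^-126 from a pattern whose exponent field is all 0s."""
    sign = bits >> 31
    stored = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if stored != 0:
        raise ValueError("exponent field must be all 0s")
    return (-1) ** sign * (fraction / 2**23) * 2.0 ** -126

def precision_bits(bits):
    """Bits from the highest set fraction bit down to the least-significant bit."""
    return (bits & 0x7FFFFF).bit_length()

print(decode_denormalized_single(0x00000001) == 2.0 ** -149)  # True
print(precision_bits(0x007FFFFF))  # 23
print(precision_bits(0x00000100))  # 9
print(precision_bits(0x00000001))  # 1
```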

Zero

You can think of zero as a denormalized number (an implicit leading 0 bit) with all 0 fraction bits. Note that −0 and +0 are distinct values, though they both compare as equal.
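The two zeros are easy to poke at from Python. This sketch assumes the standard struct and math modules; the variable names are only illustrative.

```python
import math
import struct

pos_zero = 0.0
neg_zero = -0.0

print(pos_zero == neg_zero)                # True: they compare as equal
print(struct.pack(">f", pos_zero).hex())   # 00000000
print(struct.pack(">f", neg_zero).hex())   # 80000000: only the sign bit is set
print(math.copysign(1.0, neg_zero))        # -1.0: the sign is still observable
```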

Infinity

The values +∞ and −∞ are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.
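As a quick check of the infinity encoding, the sketch below builds both infinities from their single-precision bit patterns and shows a double-precision overflow continuing as +∞. It assumes IEEE hardware and Python's struct, math, and sys modules; single_from_bits is a helper name invented here.

```python
import math
import struct
import sys

def single_from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_inf = single_from_bits(0x7F800000)   # sign 0, exponent all 1s, fraction all 0s
neg_inf = single_from_bits(0xFF800000)   # same, with the sign bit set

print(pos_inf, neg_inf)                      # inf -inf
print(math.isinf(pos_inf))                   # True
print(sys.float_info.max * 2 == pos_inf)     # True: overflow yields +inf
```

Note that Python floats are doubles, so the overflow on the last line happens at the double-precision limit rather than the single-precision one.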

Not A Number

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaN's are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).

A QNaN is a NaN with the most significant fraction bit set. QNaN's propagate freely through most arithmetic operations. These values are generated from an operation when the result is not mathematically defined.

An SNaN is a NaN with the most significant fraction bit clear. It can be used to signal an exception when used in operations. SNaN's can be handy to assign to uninitialized variables to trap premature usage.

Semantically, QNaN's denote indeterminate operations, while SNaN's denote invalid operations.

Special Operations

Operations on special numbers are well-defined by IEEE. In the simplest case, any operation with a NaN yields a NaN result. Other operations are as follows:

Special Arithmetic Results
Operation	Result
n ÷ ±∞	0
±∞ × ±∞	±∞
±nonZero ÷ ±0	±∞
±nonZeroFinite × ±∞	±∞
∞ + ∞, ∞ − −∞	+∞
−∞ − ∞, −∞ + −∞	−∞
∞ − ∞, −∞ + ∞	NaN
±0 ÷ ±0	NaN
±∞ ÷ ±∞	NaN
±∞ × 0	NaN
NaN == NaN	false
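Most rows of this table can be tried out in Python, which uses IEEE doubles for its float type on typical platforms. One caveat: Python raises ZeroDivisionError for floating-point division by zero instead of returning the IEEE result, so the ÷ 0 rows are left out of this sketch.

```python
import math

inf = math.inf
nan = math.nan

print(1.0 / inf)               # 0.0
print(inf * -inf)              # -inf
print(3.0 * -inf)              # -inf
print(inf + inf)               # inf
print(-inf - inf)              # -inf
print(math.isnan(inf - inf))   # True
print(math.isnan(inf / inf))   # True
print(math.isnan(inf * 0.0))   # True
print(nan == nan)              # False
```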

Summary

To sum up, the following are the corresponding values for a given representation:

Float Values (b = bias)
Sign	Exponent (e)	Fraction (f)	Value
0	00⋯00	00⋯00	+0
0	00⋯00	00⋯01 to 11⋯11	Positive Denormalized Real 0.f × 2^(−b+1)
0	00⋯01 to 11⋯10	XX⋯XX	Positive Normalized Real 1.f × 2^(e−b)
0	11⋯11	00⋯00	+∞
0	11⋯11	00⋯01 to 01⋯11	SNaN
0	11⋯11	1X⋯XX	QNaN
1	00⋯00	00⋯00	−0
1	00⋯00	00⋯01 to 11⋯11	Negative Denormalized Real −0.f × 2^(−b+1)
1	00⋯01 to 11⋯10	XX⋯XX	Negative Normalized Real −1.f × 2^(e−b)
1	11⋯11	00⋯00	−∞
1	11⋯11	00⋯01 to 01⋯11	SNaN
1	11⋯11	1X⋯XX	QNaN
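The summary table reads almost like a decision procedure, so here is one way it might be written down. This is only a sketch: classify is a name invented for the example, it takes the raw bit pattern as an integer, and the field widths default to single precision (pass 11 and 52 for double).

```python
def classify(bits, exp_bits=8, frac_bits=23):
    """Return (sign, kind) for a raw bit pattern, following the summary table."""
    sign = "-" if (bits >> (exp_bits + frac_bits)) & 1 else "+"
    all_ones = (1 << exp_bits) - 1
    e = (bits >> frac_bits) & all_ones
    f = bits & ((1 << frac_bits) - 1)
    quiet_bit = 1 << (frac_bits - 1)       # most significant fraction bit
    if e == 0:
        kind = "zero" if f == 0 else "denormalized"
    elif e == all_ones:
        if f == 0:
            kind = "infinity"
        else:
            kind = "QNaN" if f & quiet_bit else "SNaN"
    else:
        kind = "normalized"
    return sign, kind

print(classify(0x80000000))                   # ('-', 'zero')
print(classify(0x00000001))                   # ('+', 'denormalized')
print(classify(0x3F800000))                   # ('+', 'normalized')
print(classify(0xFF800000))                   # ('-', 'infinity')
print(classify(0x7F800001))                   # ('+', 'SNaN')
print(classify(0x7FF8000000000000, 11, 52))   # ('+', 'QNaN')
```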

References

A lot of this stuff was observed from small programs I wrote to go back and forth between hex and floating point (printf-style), and to examine the results of various operations. The bulk of this material, however, was lifted from Stallings' book.

Computer Organization and Architecture, William Stallings, pp. 222–234, Macmillan Publishing Company, ISBN 0-02-415480-6.
IEEE Computer Society (1985), IEEE Standard for Binary Floating-Point Arithmetic, IEEE Std 754-1985.
Intel Architecture Software Developer's Manual, Volume 1: Basic Architecture, (a PDF document downloaded from intel.com).