In the last post, we looked at how integral numbers are represented in binary in computer memory.
In this post we’re going to look at the representation of floating point numbers. These are used to approximate real numbers, i.e. numbers that can have a fractional component.
The representation of integral numbers is easy to understand. To recap, the signed integral primitive data types in Java (byte, short, int, long) are stored in positional notation as binary numbers (0’s and 1’s). Each bit from the right represents an increasing power of two, except for the most significant bit in the leftmost position, which is the sign bit.
Representation
Floating point numbers are represented very differently in memory, and are a lot harder to understand and use correctly.
The term floating point refers to the fact that the decimal point of the number can “float” around. It can be placed between any of the significant digits of the number. The position of the decimal point is determined by the exponent (multiplier).
For example, the floating point number 123.456 can be thought of as 123456 multiplied by 10 to the power of -3 (i.e., divided by 1000). The value 123456 is the significand (or mantissa), as it represents the significant digits in the number. 10 is the base (or radix), and -3 is the exponent.
Floating point is similar to scientific notation. In scientific notation, a number is scaled by a power of 10, so that it lies within a specific range. The range of the number is between 1 and 10, with the decimal point (or radix point) appearing immediately after the first digit. The scaling factor as a power of ten is then specified separately at the end of the number.
So the floating point number 123.456 can be written in scientific notation as 123456E-3. It’s more commonly written as 1.23456E2. The decimal point is usually placed after the first digit.
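The decomposition above can be checked with a small sketch. It borrows BigDecimal (from java.math) purely as a convenient way to split a decimal literal into its digits and scale. The class and method names are made up for this illustration.

```java
import java.math.BigDecimal;
import java.util.Locale;

class SignificandDemo {
    // The significant digits of a decimal literal, e.g. "123.456" -> "123456".
    static String significand(String decimal) {
        return new BigDecimal(decimal).unscaledValue().toString();
    }

    // The base-10 exponent that goes with those digits, e.g. "123.456" -> -3.
    // BigDecimal stores a "scale" (digits after the point), so we negate it.
    static int exponent(String decimal) {
        return -new BigDecimal(decimal).scale();
    }

    // Normalised scientific notation with five digits after the point.
    static String scientific(double value) {
        return String.format(Locale.ROOT, "%.5E", value);
    }

    public static void main(String[] args) {
        System.out.println(significand("123.456") + " x 10^" + exponent("123.456"));
        System.out.println(scientific(123.456)); // prints 1.23456E+02
    }
}
```

Note that Java's formatter always writes the exponent with a sign and at least two digits, so 1.23456E2 shows up as 1.23456E+02.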
Most computers use base two for their floating point systems. This means that the mantissa and exponent are stored in binary, and the exponent is a power of two (binary floating point). Base ten (decimal floating point) is occasionally used in some computers.
Mantissas and Exponents
The primitive floating point data types in Java have three components in their representation: a mantissa/significand, an exponent and a sign bit.
- The mantissa/significand represents the actual number. This is a fixed precision integer using the same positional notation as integral types. The length of the significand determines the precision of the actual number.
- The exponent represents a multiplier of the mantissa to provide a wider range of values than if there was only the significand component. The exponent is a signed integer value that is used to scale the magnitude of the actual number.
- The sign bit represents the sign of the entire number, i.e., positive or negative.
Integral data types have strictly limited ranges, but floating point numbers can be used for both very small and very large real values. They can just as easily represent distances between galaxies as distances between atoms. This is the work of the exponent part of the floating point number.
The precision at which we can represent a value is determined by the number of digits in the mantissa. The range of magnitudes we can represent is determined by the number of digits in the exponent.
Java has two primitive floating point types, float and double. The sizes of their significands and exponents differ, as given by the following table:
Type | Size | Sign bit | Mantissa | Exponent |
---|---|---|---|---|
float | 32 bits | 1 bit | 23 bits | 8 bits |
double | 64 bits | 1 bit | 52 bits | 11 bits |
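To see the three fields of the table in an actual float, here is a minimal sketch that pulls them apart with Float.floatToIntBits and some bit masks. The class and method names are hypothetical. Two details that the post has not covered are assumed from IEEE 754: the exponent is stored with a bias of 127, and the leading 1 of a normal number's mantissa is implied rather than stored.

```java
class FloatBits {
    // Sign: the single most significant bit (0 = positive, 1 = negative).
    static int signBit(float f) {
        return Float.floatToIntBits(f) >>> 31;
    }

    // Exponent: the next 8 bits, stored with a bias of 127.
    static int exponentBits(float f) {
        return (Float.floatToIntBits(f) >>> 23) & 0xFF;
    }

    // Mantissa: the lowest 23 bits (the leading 1 is implied, not stored).
    static int mantissaBits(float f) {
        return Float.floatToIntBits(f) & 0x7FFFFF;
    }

    public static void main(String[] args) {
        float f = -6.5f; // -1.625 x 2^2
        System.out.println("sign     = " + signBit(f));
        System.out.println("exponent = " + exponentBits(f)
                + " (unbiased " + (exponentBits(f) - 127) + ")");
        System.out.println("mantissa = 0x" + Integer.toHexString(mantissaBits(f)));
    }
}
```

For -6.5f the stored exponent is 129 (2 after removing the bias) and the mantissa bits are 0x500000, i.e. the binary fraction .101 that follows the implied leading 1.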
Real numbers can have an infinite number of digits in their fractional part. When we try to represent a real number in a limited number of mantissa bits, we inevitably run into precision errors. It’s often impossible to represent a real number exactly in a computer system. This is referred to as a quantization error. Hundreds, if not thousands, of technical articles and papers have been devoted to this topic.
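The classic illustration of this error is adding 0.1 and 0.2. The sketch below (class and method names are just for this example) also uses the BigDecimal(double) constructor to reveal the exact value that is really stored when we write 0.1.

```java
import java.math.BigDecimal;

class RoundingDemo {
    // Neither 0.1 nor 0.2 has an exact binary representation,
    // so their sum is not exactly 0.3 either.
    static double sum() {
        return 0.1 + 0.2;
    }

    // The exact decimal expansion of the double nearest to the given value.
    static String exact(double d) {
        return new BigDecimal(d).toString();
    }

    public static void main(String[] args) {
        System.out.println(sum());        // prints 0.30000000000000004
        System.out.println(sum() == 0.3); // prints false
        System.out.println(exact(0.1));   // a long decimal slightly above 0.1
    }
}
```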
The smallest and largest magnitudes that floats and doubles can represent differ, as given by the following table:
Type | Size | Precision | Minimum Value | Maximum Value |
---|---|---|---|---|
float | 4 bytes | 6 – 7 digits | ±1.4E-45 | ±3.40282347E+38 |
double | 8 bytes | 15 digits | ±4.9E-324 | ±1.7976931348623157E+308 |
The downside of this very wide range is that the numbers that can be represented are not uniformly spaced. The difference between two consecutive representable numbers depends on the value of the exponent.
If the lowest bit changes from a zero to a one in an integral number, the actual number also changes by one. If the lowest bit changes from a zero to a one in a floating point number, the size of the change depends on the exponent: the value moves by one unit in the last place (ulp), which is tiny for small numbers and large for big ones.
It would be inaccurate to say that it’s anyone’s guess what the actual value changes to, but it’s not that far from the truth. LOL! Play around with some of the floating point bit-twiddling sites mentioned at the end of this post to see for yourself.
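We can also measure the spacing directly in Java. The following sketch (hypothetical class and method names) flips the lowest bit of a positive, finite float by adding one to its raw bit pattern, and compares the resulting gap with Math.ulp, the standard library's "unit in the last place" method.

```java
class SpacingDemo {
    // The next representable float above a positive, finite f,
    // obtained by adding one to the raw bit pattern.
    static float nextByBitFlip(float f) {
        return Float.intBitsToFloat(Float.floatToIntBits(f) + 1);
    }

    // The distance from f to its next representable neighbour.
    static float gap(float f) {
        return nextByBitFlip(f) - f;
    }

    public static void main(String[] args) {
        float[] samples = {1.0f, 1000.0f, 1.0e10f};
        for (float f : samples) {
            System.out.println(f + " -> gap " + gap(f) + ", Math.ulp " + Math.ulp(f));
        }
    }
}
```

Around 1.0f the gap is 2 to the power of -23, but around 1.0e10f it is already 1024. Whole numbers in between simply cannot be stored in a float of that size.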
The IEEE 754 standard
The IEEE 754 standard defines five basic formats that are named after their base and the number of encoding bits. There are three basic binary floating-point formats (32, 64 or 128 bits) and two basic decimal floating-point formats (encoded with 64 or 128 bits). There are also a number of interchange formats.
The Java floating point types conform to the IEEE 754 standard. The float and double data types correspond to the IEEE 754 single precision binary32 and double precision binary64 formats respectively.
The more useful type for most calculations is double. The limited precision of float is often inadequate. We generally only use float if we need to save memory space and/or we know that high precision calculations are not needed.
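A quick way to feel the difference in precision is to look at whole numbers. A float has 24 significant bits (23 stored plus the implied leading one), so above 16,777,216 (2 to the power of 24) it can no longer tell neighbouring integers apart. The sketch below uses made-up class and method names.

```java
class FloatVsDouble {
    // True if adding 1 to the float value leaves it unchanged.
    static boolean floatAbsorbsOne(float f) {
        return f + 1f == f;
    }

    // The same check carried out in double precision.
    static boolean doubleAbsorbsOne(double d) {
        return d + 1d == d;
    }

    public static void main(String[] args) {
        System.out.println(floatAbsorbsOne(16_777_216f));  // prints true
        System.out.println(doubleAbsorbsOne(16_777_216d)); // prints false
        // Widening 0.1f to double exposes the float's rounding error.
        System.out.println((double) 0.1f);                 // prints 0.10000000149011612
    }
}
```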
When a double doesn’t have the required range or precision, Java provides the “big number” classes: BigDecimal and BigInteger (in the java.math package). These allow arbitrary precision calculations, i.e. an arbitrarily large mantissa (scaled by a 32-bit exponent) for BigDecimal, and any integral range for BigInteger. More on arbitrary precision arithmetic in a later post.
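As a small taste before that later post, here is a sketch of both classes in action (the class and method names are only for illustration). Note that the BigDecimal values are built from strings, so that the decimal digits are taken literally instead of going through a double first.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

class BigNumbers {
    // Decimal arithmetic without binary rounding: 0.1 + 0.2 is exactly 0.3.
    static BigDecimal decimalSum() {
        return new BigDecimal("0.1").add(new BigDecimal("0.2"));
    }

    // One more than the largest long, which no primitive integral type can hold.
    static BigInteger pastLong() {
        return BigInteger.valueOf(Long.MAX_VALUE).add(BigInteger.ONE);
    }

    public static void main(String[] args) {
        System.out.println(decimalSum()); // prints 0.3
        System.out.println(pastLong());   // prints 9223372036854775808
    }
}
```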
Class Wrappers
As we already know, Java has a split data type system. It has the fundamental data types that we’ve been looking at, and it has object reference types. If we want to use a primitive type as an object, we have to wrap it in an object. There are a number of class wrappers that we use to encapsulate (wrap) a single field of a fundamental data type in an object. These classes provide methods for converting the wrapped field to and from a String, as well as other useful constants and methods for dealing with that particular data type.
The floating point class wrappers are named after their primitive counterparts, i.e., Float and Double. Note the uppercase first letter! They represent classes.
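The String conversions mentioned above might look like the following sketch. The class and helper method names are hypothetical; Double.parseDouble, Double.toString, Double.valueOf and doubleValue are the standard wrapper methods.

```java
class WrapperConversions {
    // String -> primitive double.
    static double parse(String text) {
        return Double.parseDouble(text);
    }

    // Primitive double -> String.
    static String format(double value) {
        return Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(parse("1.5e3")); // prints 1500.0
        System.out.println(format(2.5));    // prints 2.5

        // Wrapping a primitive in an object, and unwrapping it again.
        Double boxed = Double.valueOf(2.5);
        double unboxed = boxed.doubleValue();
        System.out.println(boxed + " " + unboxed);
    }
}
```

Float offers the equivalent methods, e.g. Float.parseFloat and Float.valueOf.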
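Constants include BYTES, SIZE, TYPE, MIN_VALUE and MAX_VALUE. These are similar to the constants of the integral wrapper classes shown in the last post, with one difference: for Float and Double, MIN_VALUE is the smallest positive value, not the most negative one.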
The following example shows how to use these constants:
    /**
     * A simple program to print the names, sizes and ranges of the
     * floating point data types using the appropriate class wrappers.
     * We're using printf() here for nicely formatted output.
     *
     * We can use \u2502 instead of the | (pipe sign) for vertical lines.
     *
     * Run with -Djava.locale.providers=SPI to get locale specific
     * thousands separators.
     *
     */
    public class WrapperTest {
        public static void main(String[] args) {
            final String title = "| %-7s | %4s | %15s | %15s |%n";
            final String fmt   = "| %-7s | %4d | %15e | %15e |%n";
            System.out.println();
            System.out.printf(title, "TYPE", "SIZE", "MINIMUM", "MAXIMUM");
            System.out.printf(fmt, Float.TYPE, Float.SIZE,
                              Float.MIN_VALUE, Float.MAX_VALUE);
            System.out.printf(fmt, Double.TYPE, Double.SIZE,
                              Double.MIN_VALUE, Double.MAX_VALUE);
        }
    } // end of class
There are some extra, more specialised constants. These are MIN_EXPONENT, MAX_EXPONENT, MIN_NORMAL, NEGATIVE_INFINITY, POSITIVE_INFINITY and NaN (not a number). We’ll look at some of these concepts in later posts. Explanations of these can be found in the Java API documentation and the IEEE 754 standard.
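Until then, here is a brief sketch of how a few of these special values turn up in practice. The class name and the divide helper are made up for this example; the constants and Double.isNaN come from the standard wrapper classes.

```java
class SpecialValues {
    // Floating point division never throws; it yields infinities or NaN instead.
    static double divide(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(divide(1.0, 0.0));  // prints Infinity
        System.out.println(divide(-1.0, 0.0)); // prints -Infinity

        double nan = divide(0.0, 0.0);
        System.out.println(nan);               // prints NaN
        System.out.println(nan == nan);        // prints false: NaN equals nothing
        System.out.println(Double.isNaN(nan)); // prints true

        System.out.println(Float.MIN_EXPONENT + " to " + Float.MAX_EXPONENT);
        System.out.println(Double.MIN_EXPONENT + " to " + Double.MAX_EXPONENT);
        System.out.println(Double.MIN_NORMAL);
    }
}
```

Because NaN is never equal to anything, not even itself, we should always test for it with Double.isNaN (or Float.isNaN) and never with ==.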
Further Reading and Signing Off
Lots more on floating point arithmetic to come!
The Floating-Point Guide is a simple, easy to read site devoted to explaining what every programmer should know about floating point arithmetic.
Or you can read a vastly more technical article about What Every Computer Scientist Should Know About Floating-Point Arithmetic.
For more on floating point numbers in general, see the Wikipedia page.
For more on IEEE 754, see the Wikipedia page on the topic.
You can see, visualise and change any of the bits in floating point numbers at a few websites: https://evanw.github.io/float-toy/, https://float.exposed/0x400921fbc0000000 and https://bartaz.github.io/ieee754-visualization/.