It’s been a while since I posted last. Life, and death, get in the way of many things…

To recap, the last few posts have highlighted some of the issues we need to be aware of when working with floating point numbers. Some of the posts have been somewhat theoretical. How do these theoretical aspects affect us in everyday real-life programming?

Let’s think about a common everyday scenario: how do we choose a `float` data type over a `double` to represent a value? And once we’ve chosen one over the other, how will it affect the speed of our application? Is a `float` faster than a `double`, or vice versa?

## Range and Precision

Obviously, the very first thing that we need to know is what values we want to work with. That is the primary driver for our decision. For this we must know the *range* of `float`s vs `double`s. I’ve covered that in the blog post Floating Point Numbers in Java.

So we don’t have to flip to that post right now, I’ll include the table here:

| Type | Size | Precision | Minimum Value | Maximum Value |
|---|---|---|---|---|
| `float` | 4 bytes | 6 – 7 digits | ±1.4E-45 | ±3.40282347E+38 |
| `double` | 8 bytes | 15 digits | ±4.9E-324 | ±1.7976931348623157E+308 |

We also need to decide on the level of *precision* we need. The more useful type for most calculations is `double`. It is a **lot** more accurate than a `float`! The limited precision of `float` is often not good enough for many calculations. We generally only use `float` when we need to save memory space and/or we know that high-precision calculations aren’t needed.

Bjarne Stroustrup, the inventor of C++, suggests using `double`s over `float`s if we’re uncertain or don’t know any better:

*“The exact meaning of single-, double-, and extended-precision is implementation-defined. Choosing the right precision for a problem where the choice matters requires significant understanding of floating-point computation. If you don’t have that understanding, get advice, take the time to learn, or use double and hope for the best.”*

## Speed

Once we’ve made those decisions, we’d like to know how they will affect the speed of our application.

The size of a `float` is 4 bytes, while the size of a `double` is 8 bytes. So our first assumption is that `float`s should be faster because they are smaller.

Unfortunately that assumption isn’t necessarily correct. The answer depends on a lot of different factors, like:

- What is the native hardware? Does it have a floating point unit (FPU) that supports both `float` and `double` operations? For example, if we’re using an Intel CPU, does it only have legacy x87 FPU support, or does it have the modern SSE instruction set?
- Does the hardware implement both `float`s and `double`s, or only one or the other? Or neither?
- Are floating point operations emulated in software?
- What is the application doing, especially where it is retrieving its data from?
- Are we working on huge sets of data? Can the data be cached internally, cached externally in RAM, or must it be retrieved from disk?
- What compiler settings are we using?
- Are we using `float` or `double` versions of any maths libraries?
- What operating system are we using?
- The list goes on…

Thinking in terms of memory usage, `float` values take half as much memory as `double` values. If we’re dealing with very large datasets, this can be a very important factor. If we’re doing a lot of data access, we need to think carefully about memory and cache usage. Taking up twice the memory for each `double` value puts a heavier load on the internal caches, and more memory bandwidth is needed to fill those caches from RAM. Because `float` values are smaller, we might have fewer cache misses. If using `double` means we have to cache to disk instead of to RAM, then the speed difference can be huge.

The only way to find out is by benchmarking our particular solution. Small changes in instructions and memory usage can have a significant impact. For large amounts of data, there’ll probably be an advantage in using single-precision `float`s. This obviously assumes that we don’t need the extra range or precision of `double`s.

## Application Code Gotchas

Not directly related to `float`s and `double`s are some real-world problems (and their solutions) that we need to be aware of.

### Promotion

Given the following code:

```
float a = 1.0F, b = 2.0F;   /* initialised so the snippet is well-defined */
foo(a * 3.14 + b);
```

We must be aware that the compiler will promote the values in the `a` and `b` variables to `double`, because `3.14` is a `double` literal. We can avoid that by writing `3.14F`, which helps the compiler generate efficient assembler code/byte code that keeps the values as `float`, if that’s what we want.

### Libraries

The `float` versions of many C library functions, like `logf(float)` and `sinf(float)`, will be faster than `log(double)` and `sin(double)`, because they work with fewer bits of precision. They can use polynomial approximations with fewer terms to reach full precision for `float` than for `double`.

### Division vs Multiplication

Every FPU performs multiplications much faster than divisions. Multiplication can be done in parallel, while division can’t, so division is *always* slower than multiplication.

Many CPUs can perform a multiplication in one or two clock cycles, but division always takes longer. Division can sometimes take 24 clock cycles or more.

Why is division so much slower? Multiplication can be done with many simultaneous additions. Division requires iterative subtraction that can’t be performed simultaneously. Some FPUs speed up division by performing a reciprocal approximation and multiplying by that value. For example, instead of dividing by 2, it multiplies by 0.5. Depending on the values, it might not be quite as accurate, but is generally much faster.

So instead of

`double d = 420.0 / 2.0;`

we can write

`double d = 420.0 * 0.5;`

## Further Reading and Signing Off

Lots more on floating point arithmetic to come!

There are a few pages on Stack Overflow that make for very interesting reading, but lead down all sorts of rabbit holes. Try this one or this one.

There’s also a very interesting page about PhysX87 (a real-time physics engine/library) here.

Was this useful? Please share your comments on the blog post, and as always, stay safe and keep learning!