Floating Point

tags : Math, Fixed Point

The term floating point refers to the fact that the number’s radix point can “float” anywhere to the left, right, or between the significant digits of the number. This position is indicated by the exponent, so floating-point can be considered a form of scientific notation.

Introduction

Floating-point numbers are essentially scientific notation: a significand scaled by a base raised to an exponent.

Advantage

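Speed: Floating-point performance is commonly measured in FLOPS (floating-point operations per second). The cycle counts below come from the linked source and may be wrong for current hardware.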

# int
Integer add: 1 cycle
32-bit integer multiply: 10 cycles
64-bit integer multiply: 20 cycles
32-bit integer divide: 69 cycles
64-bit integer divide: 133 cycles
# float
Floating point add: 4 cycles
Floating point multiply: 7 cycles
Double precision multiply: 8 cycles
Floating point divide: 23 cycles
Double precision divide: 36 cycles
 
source: http://www.phys.ufl.edu/~coldwell/MultiplePrecision/fpvsintmult.htm

Efficiency: Can represent very large and very small numbers without needing a large amount of space.

How

Without FPU

A floating-point unit (FPU) is the part of a processor specifically designed to carry out operations on floating-point numbers. Without one, floating-point operations have to be emulated in software.

With FPU

Bases

In practice, most floating-point formats use base 2. Base 10 (decimal floating point) is also common, and other bases exist.
The base determines which fractions can be represented exactly:
- 1/5 cannot be represented exactly as a floating-point number using a binary base
- 1/5 can be represented exactly using a decimal base (0.2, or 2×10⁻¹)
- 1/3 cannot be represented exactly using a binary or decimal base
- 1/3 can be represented exactly using base 3 (0.1, or 1×3⁻¹)
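
The effect of the base is easy to observe in any binary floating-point implementation. The following is a minimal sketch using JavaScript numbers (binary64); the variable names are only illustrative.

```javascript
// 1/5 has no finite binary expansion, so the stored value is only close to 0.2.
const oneFifth = 0.2;
console.log(oneFifth.toFixed(20)); // shows digits that the default formatting hides

// Sums of inexact values expose the representation error.
const sumMatches = 0.1 + 0.2 === 0.3;
console.log(sumMatches); // false

// Fractions whose denominator is a power of two are exact in base 2.
const exactSum = 0.5 + 0.25 === 0.75;
console.log(exactSum); // true
```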

Anatomy

A floating-point format is specified by

Base (radix): b

Precision: p (the number of digits in the significand)

Exponent range: emin to emax, with emin = 1 − emax for all IEEE 754 formats

Significand

A signed (+/−) digit string of a given length in a given base (radix).

This digit string is referred to as the significand, mantissa, or coefficient.
The radix point is assumed to sit at a fixed position within the significand, often just after the most significant digit.
The length of the significand determines the precision to which numbers can be represented.
E.g. for single precision (32-bit), b=2 and p=24: the significand is a string of 24 bits, of which 23 are stored and the leading bit is implicit. So values are precise to 24 bits.

Exponent

A signed integer exponent (also referred to as the characteristic, or scale).
It modifies the magnitude of the number.
- Magnitude means absolute value: e.g. −5 is smaller than 3 but greater in magnitude (|−5| = 5 > 3)
The exponent shifts the radix point in the significand, which changes the magnitude of the number.
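
To make the sign, exponent, and significand concrete, here is a rough sketch that pulls the three fields out of a JavaScript number (binary64: 1 sign bit, 11 exponent bits with a bias of 1023, 52 stored significand bits). `decodeDouble` is a hypothetical helper written for this note, and the unbiased exponent it returns is only meaningful for normal numbers (not zero, subnormals, infinity, or NaN).

```javascript
// Split a binary64 value into its sign, exponent, and fraction fields.
function decodeDouble(x) {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);             // written big-endian by default
  const bits = view.getBigUint64(0); // the same 8 bytes read as one 64-bit integer

  const sign = Number(bits >> 63n);                      // 1 bit
  const biasedExponent = Number((bits >> 52n) & 0x7ffn); // 11 bits
  const fraction = bits & ((1n << 52n) - 1n);            // 52 stored significand bits

  // Assumes a normal number: value = (-1)^sign × (1 + fraction / 2^52) × 2^exponent
  return { sign, exponent: biasedExponent - 1023, fraction };
}

console.log(decodeDouble(6));    // 6 is +1.5 × 2^2
console.log(decodeDouble(-0.5)); // -0.5 is -1.0 × 2^-1
```

For 6 the fraction field holds 2^51, which encodes the ".5" of 1.5 because the leading 1 is implicit.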

Rounding and Error

Associativity and Commutativity

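Floating-point addition and multiplication are commutative (a+b equals b+a) but not associative: (a+b)+c can differ from a+(b+c), because each intermediate result is rounded. The session below evaluates left to right, so it compares (0.1+0.2)+0.3 with (0.3+0.2)+0.1.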

octave:1> x=0.1+0.2+0.3
x = 0.60000
octave:2> y=0.3+0.2+0.1
y = 0.60000
octave:3> x==y
ans = 0
octave:4> x-y
ans = 1.1102e-16
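
The same behaviour can be reproduced with JavaScript numbers. This is only a small sketch with made-up variable names; it separates the grouping effect from operand order.

```javascript
const a = 0.1, b = 0.2, c = 0.3;

// Same operands, different grouping: the intermediate roundings differ.
const leftFirst = (a + b) + c;
const rightFirst = a + (b + c);
console.log(leftFirst === rightFirst); // false

// Swapping the two operands of a single addition does not change the result.
const swappedEqual = a + b === b + a;
console.log(swappedEqual); // true
```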

Deterministic?

Floating-point operations are deterministic, but partially implementation-defined.
Some instructions don’t guarantee the maximum possible precision, and implementations can differ in the least significant bits of their results.
There are minor differences between hardware (x87 vs SSE is the most famous one, but there are others). Changing the compiler, its version, or its options may produce subtly different results (the most obvious example is the -ffast-math flag). An even bigger problem is the implementation of non-primitive functions (e.g. trigonometric ones). Usually your program will use the implementation from a system or vendor library, and these libraries may differ in their underlying implementations.

Precision (based on IEEE 754)

16-bit: Half (binary16)

32-bit: Single (binary32)

64-bit: Double (binary64)

Can represent about 15 to 17 significant decimal digits, enough to describe any position in the solar system with roughly millimeter precision.

Extended precision

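The x86 extended precision format is an 80-bit format.
long double in C is implementation-defined: it is the 80-bit x87 format with GCC and Clang on x86 Linux, but the same as double with MSVC.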

128-bit: Quadruple (binary128)

Most mainstream CPUs, including x86, have no hardware support for this format.
GCC's __float128 is therefore implemented in software on those CPUs, and is much slower than double.
Many space trajectory calculations use quadruple precision arithmetic, and most galactic evolution research considers statistical distributions rather than trying to say something precise about the future state of the galaxy.

Arbitrary precision

Arbitrary-precision arithmetic
Implementations of much larger numeric types (with a storage size that usually is not a power of two), using special software.
The dc and bc programs are arbitrary-precision calculators.
JavaScript uses arbitrary precision for BigInt.
IEEE 754 does not require correctly rounded mathematical functions; it only recommends them. So accuracy varies from one mathematical library to another.

When deciding what to use

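Narrower formats are usually cheaper: a float32 takes half the memory and bandwidth of a float64, and a SIMD register of a given width holds twice as many of them. The actual speedup depends on the hardware, so the claim that four float32 multiplications cost the same as one float64 multiplication should not be assumed everywhere. In practice, people choose the smallest data size that is good enough to work.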
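
What "good enough" costs in precision can be checked directly. The sketch below assumes JavaScript, where `Math.fround` rounds a double to the nearest float32 value and typed arrays expose the storage size of each format.

```javascript
// Math.fround rounds a double to the nearest float32 value.
const asFloat32 = Math.fround(0.1);
console.log(asFloat32 === 0.1);         // false: float32 keeps fewer significand bits
console.log(Math.abs(asFloat32 - 0.1)); // size of the extra rounding error

// Integers above 2^24 are no longer all representable in float32.
const rounded = Math.fround(16777217);  // 2^24 + 1
console.log(rounded);                   // 16777216

// Storage per element: 4 bytes for float32, 8 bytes for float64.
console.log(Float32Array.BYTES_PER_ELEMENT, Float64Array.BYTES_PER_ELEMENT); // 4 8
```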

Floating Point and Processors

No FPU

What happens when certain precision is not supported by the CPU?

When the CPU does not support a format in hardware, its operations are emulated in software. Software quad precision, for example, may speed up slightly, but it is still implemented in scalar integer arithmetic.

x87 co-processor

Languages usage

Javascript

See Javascript. JS has just two numeric types:

`number`

Uses the IEEE 754 double-precision 64-bit binary format (binary64).

Safely represents integers between −(2⁵³ − 1) and 2⁵³ − 1 without loss of precision.
Number.MAX_VALUE : Largest number possible to represent
Number.MAX_SAFE_INTEGER : Largest integer to be used safely in calculations.
Number.MIN_SAFE_INTEGER : Smallest integer to be used safely in calculations.
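
A short sketch of what "safe" means here, using only the built-in constants named above plus `Number.isSafeInteger`; the variable `limit` is just a local name for this example.

```javascript
const limit = Number.MAX_SAFE_INTEGER;
console.log(limit === 2 ** 53 - 1);           // true
console.log(Number.isSafeInteger(limit));     // true
console.log(Number.isSafeInteger(limit + 1)); // false

// Past the limit, distinct integers collapse onto the same double.
const collapsed = 2 ** 53 === 2 ** 53 + 1;
console.log(collapsed);                       // true

// MAX_VALUE is a far larger magnitude, but integers near it are not exact.
console.log(Number.MAX_VALUE > limit);        // true
```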

`BigInt`

Numeric primitive in JavaScript that can represent integers with arbitrary precision.
Makes it possible to correctly perform integer arithmetic without overflowing.
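
For comparison with `number`, a minimal sketch of BigInt arithmetic past 2⁵³. The names `big` and `mixedError` are illustrative only.

```javascript
const big = 2n ** 53n;          // BigInt literal: note the n suffix
const next = big + 1n;
console.log(next);              // 9007199254740993n
console.log(next === big);      // false: no rounding takes place

// BigInt and number cannot be mixed implicitly in arithmetic.
let mixedError = null;
try {
  big + 1;
} catch (err) {
  mixedError = err;
}
console.log(mixedError instanceof TypeError); // true
```

Converting explicitly with `BigInt(1)` or `Number(big)` avoids the error, at the cost of possible precision loss in the second case.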

Other notes

IEEE 754 also defines explicit encodings for values that are not ordinary numbers, such as infinity and NaN (not a number).
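
These special values are easy to produce in JavaScript, whose `number` type is binary64. A small sketch, with `notANumber` as an illustrative name:

```javascript
console.log(1 / 0);   // Infinity
console.log(-1 / 0);  // -Infinity
console.log(0 / 0);   // NaN

const notANumber = 0 / 0;
console.log(notANumber === notANumber); // false: NaN compares unequal to everything
console.log(Number.isNaN(notANumber));  // true

// Zero is signed too: 0 and -0 compare equal but are distinct encodings.
console.log(0 === -0);         // true
console.log(Object.is(0, -0)); // false
```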
