Summary
Representation of real numbers for more complex calculations
Fixed point
- fixed number of bits allocated to the integer and fraction parts
- limited range
- limited accuracy
Floating point (IEEE 754)
- larger range
- single-precision (32 bits): 1-bit sign, 8-bit exponent in excess-127 and 23-bit mantissa
- double-precision (64 bits): 1-bit sign, 11-bit exponent in excess-1023 and 52-bit mantissa
- excess-127 is used for the 8-bit exponent instead of excess-128 to support more +ve exponents
Application
Decimal to binary fixed point
- divide by 2 for integer part
- multiply by 2 for fraction part
Binary fixed point to floating point (single-precision)
Floating point to decimal
Complements of fixed points
- ignore the binary point, take the complement as for an integer, then keep the point in the same position
Scientific notation
```c
float AVOGADRO = 6.022E23;
float GRAV = 6.6743e-11; // both E and e are fine
double COSMOLOGICAL = 1.089E-52;
```