Fixed & floating point


Summary

Representing real numbers (values with a fractional part) in binary, for calculations that integers alone cannot handle

Fixed point

  • fixed number of bits allocated to the integer and fraction parts
  • limited range
  • limited accuracy
[Figure: integer part | fraction part, with the assumed binary point between them]
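
A minimal sketch of the idea in C, assuming a hypothetical unsigned 8.8 layout (8 integer bits, 8 fraction bits). The names `fixed8_8`, `fixed_from_double` and `fixed_to_double` are made up for illustration. Storing a value amounts to scaling it by 2^8 and keeping the integer result, which is where both limits in the list above come from.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unsigned 8.8 format: 8 integer bits, 8 fraction bits.
   The binary point is assumed to sit between bit 8 and bit 7. */
#define FRAC_BITS 8

typedef uint16_t fixed8_8;

/* Encode by scaling with 2^FRAC_BITS; bits below the last fraction bit are truncated. */
fixed8_8 fixed_from_double(double x) {
    assert(x >= 0.0 && x < 256.0);   /* limited range: only 8 integer bits */
    return (fixed8_8)(x * (1 << FRAC_BITS));
}

/* Decode by undoing the scaling. */
double fixed_to_double(fixed8_8 f) {
    return (double)f / (1 << FRAC_BITS);
}
```

With this layout 5.75 is stored exactly as 0x05C0, while 0.1 is stored as 25/256 = 0.09765625, which shows the limited accuracy.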

Floating point (IEEE 754)

  • larger range
  • single precision (32 bits): 1 sign bit, 8-bit exponent in excess-127, 23-bit mantissa
  • double precision (64 bits): 1 sign bit, 11-bit exponent in excess-1023, 52-bit mantissa
[Figure: sign | exponent | mantissa]

excess-127 is used on 8 bits instead of excess-128 to support more +ve exponents
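
The field layout can be inspected with a short C sketch. It assumes that `float` on the target platform is an IEEE 754 single-precision value, which is common but not guaranteed by the C standard. The struct `float_fields` and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, stored in excess-127 */
    uint32_t mantissa;  /* 23 bits, the leading 1 is implicit and not stored */
} float_fields;

/* Split a float into its three fields (assumes IEEE 754 single precision). */
float_fields split_float(float x) {
    uint32_t bits;
    assert(sizeof x == sizeof bits);
    memcpy(&bits, &x, sizeof bits);      /* copy out the raw bit pattern */

    float_fields f;
    f.sign = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}

/* Undo the excess-127 bias to get the actual power of two. */
int true_exponent(float_fields f) {
    return (int)f.exponent - 127;
}
```

For -6.5 (binary -1.101 × 2^2) this gives sign 1, stored exponent 129 and mantissa 0x500000.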

Application

Decimal to binary fixed point

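  • integer part: divide by 2 repeatedly and read the remainders from last to first
  • fraction part: multiply by 2 repeatedly and read the integer carries from first to last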
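
The two rules above can be sketched in C as follows. The function name `to_binary_fixed` and its parameters are hypothetical; the integer and fraction parts are passed separately to keep the example short, and the caller chooses how many fraction bits to produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write int_part + frac as a binary fixed-point string such as "101.11". */
void to_binary_fixed(uint32_t int_part, double frac, int frac_bits,
                     char *out, size_t out_size) {
    assert(frac >= 0.0 && frac < 1.0 && frac_bits >= 0);
    /* Room for up to 32 integer digits, the point, the fraction and the NUL. */
    assert(out_size >= 32 + 1 + (size_t)frac_bits + 1);

    /* Integer part: divide by 2, collecting remainders (least significant first). */
    char rev[32];
    int n = 0;
    do {
        rev[n++] = (char)('0' + int_part % 2);
        int_part /= 2;
    } while (int_part != 0);

    /* The remainders are read back in reverse order. */
    size_t pos = 0;
    while (n > 0) {
        out[pos++] = rev[--n];
    }
    out[pos++] = '.';

    /* Fraction part: multiply by 2, the integer carry is the next bit. */
    for (int i = 0; i < frac_bits; i++) {
        frac *= 2.0;
        if (frac >= 1.0) {
            out[pos++] = '1';
            frac -= 1.0;
        } else {
            out[pos++] = '0';
        }
    }
    out[pos] = '\0';
}
```

Calling `to_binary_fixed(5, 0.75, 2, buf, sizeof buf)` with a 64-byte `buf` yields "101.11".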

Binary fixed point to floating point (single precision)
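
One possible way to carry out this conversion, sketched in C under the assumption of a hypothetical unsigned 8.8 fixed-point input: normalise to 1.xxx × 2^e, add the bias of 127 to the exponent, and store the bits after the leading 1 as the mantissa. The function name is made up, and because the input has only 16 bits no rounding is needed.

```c
#include <assert.h>
#include <stdint.h>

/* Pack an unsigned 8.8 fixed-point value into a single-precision bit pattern. */
uint32_t fixed8_8_to_float_bits(uint16_t fixed) {
    if (fixed == 0) {
        return 0;                         /* +0.0 is all zero bits */
    }

    /* 1. Normalise: find the position of the leading 1. */
    int msb = 15;
    while (((fixed >> msb) & 1u) == 0) {
        msb--;
    }
    assert(msb >= 0 && msb <= 15);

    /* 2. The binary point sits 8 bits from the right, so the exponent is msb - 8. */
    int exponent = msb - 8;
    uint32_t biased = (uint32_t)(exponent + 127);   /* excess-127 */

    /* 3. Drop the leading 1 and left-align the rest in the 23-bit mantissa. */
    uint32_t below = (uint32_t)fixed & ((1u << msb) - 1u);
    uint32_t mantissa = below << (23 - msb);

    /* Sign bit is 0 because the input format is unsigned. */
    return (biased << 23) | mantissa;
}
```

For 101.11 (5.75, stored as 0x05C0) the normalised form is 1.0111 × 2^2, so the result is 0x40B80000.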

Floating point to decimal
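
Going the other way, a normalised single-precision pattern has the value (-1)^sign × 1.mantissa × 2^(exponent - 127). The following C sketch evaluates that formula directly; the function name `decode_single` is hypothetical, and zero, denormals, infinities and NaN are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Evaluate a normalised single-precision bit pattern as a decimal value. */
double decode_single(uint32_t bits) {
    uint32_t sign = bits >> 31;
    uint32_t biased = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;
    assert(biased != 0 && biased != 255);   /* normalised numbers only */

    /* Significand 1.mantissa: the stored 23 bits are a fraction of 2^23. */
    double value = 1.0 + (double)mantissa / 8388608.0;

    /* Remove the excess-127 bias, then scale by that power of two. */
    int exponent = (int)biased - 127;
    for (; exponent > 0; exponent--) {
        value *= 2.0;
    }
    for (; exponent < 0; exponent++) {
        value /= 2.0;
    }
    return sign ? -value : value;
}
```

For example, 0x40B80000 decodes to 5.75 and 0xC0D00000 decodes to -6.5.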

Complements of fixed points

  • ignore the binary point: take the complement of the bit pattern as if it were an integer
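
A small C sketch of this rule, assuming a hypothetical signed 4.4 format (4 integer bits including the sign, 4 fraction bits) and two's complement. Both function names are made up for illustration. The complement is taken on the raw 8-bit pattern, and the binary point is only put back when the result is interpreted.

```c
#include <assert.h>
#include <stdint.h>

/* Two's complement of a 4.4 pattern: invert every bit and add 1,
   exactly as for an 8-bit integer. */
uint8_t twos_complement_4_4(uint8_t bits) {
    assert(bits != 0x80);   /* -8.0 has no positive counterpart in 4.4 */
    return (uint8_t)(~bits + 1u);
}

/* Interpret a 4.4 pattern: read it as a signed integer, then divide by 2^4. */
double fixed4_4_to_double(uint8_t bits) {
    int value = bits >= 0x80 ? (int)bits - 256 : (int)bits;
    return value / 16.0;
}
```

With 0010.1000 (2.5, pattern 0x28) the complement is 1101.1000 (pattern 0xD8), which reads back as -2.5.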

Scientific notation

```c
float AVOGADRO = 6.022E23;
float GRAV = 6.6743e-11; // both E and e are fine

double COSMOLOGICAL = 1.089E-52;
```