Floating-Point Tuning

Unroll Inner Loops to Hide FP Latency

In the following dot-product example, two points are illustrated. If the inner loop indices are small then the inner loop overhead makes it optimal to unroll the inner loop instead. In addition, unrolling inner loops hides floating point latency. A more advanced notion of micro level optimization is the measure of the relative rate of operations and the number of data access in a compute step. More precisely it is rate of Floating Multiply Add to data access ratio in a compute step. The higher this rate, the better.

...
do i=1,n,k
  s1 = s1 + x(i)*y(i)
  s2 = s2 + x(i+1)*y(i+1)
  s3 = s3 + x(i+2)*y(i+2)
  s4 = s4 + x(i+3)*y(i+3)
  ...
  sk = sk + x(i+k)*y(i+k)
end do
dotp = s1 + s2 + s3 + s4 + ... + sk

Avoid Divide Operations

The following example illustrates a very common step, since a floating point divide is more expensive than a multiply. If the divide step is inside a loop, it is better to substitute that step by a divide outside of the loop and a multiply inside the loop, provided no dependencies exist.

a=...
do i=1,n
  x(i)=x(i)/a
end do
 Becomes 
a=...
ainv=1.0/a
do i=1,n
  x(i)=x(i)*ainv
end do