### Square Roots

With the shift-and-add implementations of sines, cosines, exponentials, and logarithms, the runtime was much improved. However, another function was now taking a lot of the time—the square-root function. This had been naïvely implemented using a Newton-Raphson approximation, and was taking a very long time to converge in some cases. What was needed was a systematic method of calculating a square root that had a bounded iteration count.

Yet again, you can use powers of two to your benefit. If *y=x* then *y*^{2}*=x*. If you break *y* down by extracting the highest power of 2, so *y=2 ^{n}+r*, then

*y*

^{2}

*=2*

^{2n}+r*2^{(n+1)}+r^{2}. Consequently, you can find

*n*from the highest even power of two in

*x*, then it's just a matter of finding

*r*, which you can do by trial and error, hunting for each bit in turn.

If you start with *a=2 ^{n}*, and

*z=x-a*

^{2}, then you can iterate through the remaining powers of two less than

*n*to find the result. In each iteration you try

*y=a+2*, for decreasing powers of

^{m}+r1*m*. Now, you know that

*y*

^{2}

*=a*

^{2}

*+2*a*2*, and the ideal is that

^{m}+2^{2m}+r2*r1*(and thus

*r2*) is zero, so

*y=x*. However, in each iteration you've got

*z=x-a*

^{2}, so you need to compare

*z*against

*2*a*2*. Since we're going in decreasing values of

^{m}+2^{2m}*m*, if

*z>=2*a*2*, then you know that

^{m}+2^{2m}*2*is a component of

^{m}*y*, since

*r2*will always be positive. Because a given power of 2 can only be present once, you can now move to the next smaller value of

*m*, updating the new value of

*a*and

*z*. Since you've got a fixed-point value, once you run out of bits you're done, so there is a hard limit on the number of iterations.

### Multiply and Divide

After investing the effort in optimizing the complex operations, plain old multiply and divide were now top of the heap. These functions are more than just simple integer operations, because they need to take care of the fixed-point offset. Not only that, but the target platform is a 32-bit platform, so the compiler-supplied 64-bit integer operations are multiple instructions anyway, thus it is important to ensure that every instruction does something useful. For multiplication, this means splitting the top 32 bits and the bottom 32 bits—there's no point doing unnecessary 32-bit times 64-bit multiplications just for completeness. Doing the split like this also lets the appropriate bit shifts be incorporated directly without having to worry about loss of precision.

Division is more complicated, as to get the final answer correct to the available precision, you need more than 64 bits in some cases. To circumvent this problem, when the numerator is more than the denominator, then the denominator is scaled up until it is at least as big as half the numerator. This then makes for division of values of similar size, so the shift-and-add method used for the actual calculation is most effective, and doesn't lose any precision—basically, each bit is calculated in turn, starting with the most significant bit of the result.

### Conclusion

Using fixed-point math gave a considerable performance boost to this application, and I anticipate that there are many other applications that could benefit from the techniques described here. It is important to check that the required range of values for an application can comfortably be represented within the chosen fixed-point representation.

When trying to optimize code, it's important to profile both before and after an optimization. Unfortunately, on the embedded target there was no profiler available, and the profile on the PC was radically different due to the different CPU. This left only one sensible option—write a profiler. Actually writing a full-blown profiler that can integrate with the compiler and get full line-by-line performance data is a huge task, and well outside the scope of this project, so I opted for the next best thing—a code-based profiler. Because the application is written in C++, I could make use of the magic of macros and deterministic destruction to get timing data from every block of code. It works really simply: Replace the opening brace of every block of code to be tested with "{INSTRUMENT_BLOCK;". In traced builds, INSTRUMENT_BLOCK expands to an instance of a class which records profiling data. In nontraced builds, it expands to an empty string. This then allows the recording of call-count and execution time for every block of code in the application, which can then be dumped to a file on exit. Profiling of individual blocks can be adjusted by adding and removing the INSTRUMENT_BLOCK from the start; individual lines can be profiled by enclosing them in their own block, and adding the INSTRUMENT_BLOCK invocation to that.
Though the profiling does make the application run slower (about 10 times slower in fact, which made test runs VERY tedious, often taking over half an hour per run), the results can generally by adjusted to account for the overhead of the profiling, and it does highlight the parts of the application that are most in need of optimization—the functions which consume the most total runtime. Some of these are functions that are run millions of times, and others are run a few times but take a long time for each execution.
—A.W. |

*Anthony is founder and managing director of Just Software Solutions. He is also maintainer of the Boost thread library and a member of the BSI C++ Standards Panel. He can be contacted at anthony@justsoftwaresolutions.co.uk.*