Performance
Of course, there are other high-productivity environments (such as Matlab) for prototyping numerical applications. However, VSIPL++ is designed to provide high performance in addition to a convenient programming style. Because VSIPL++ can be used on workstations, supercomputers, or embedded devices, it's easy to prototype an algorithm on a workstation, then move it to the target device.
The VSIPL++ API lets you manually specify the layout of data. For example, it is usually more efficient to arrange a matrix in row-major order if performing computations along the rows, but more efficient to arrange the matrix in column-major order if performing computations along the columns. VSIPL++ programmers can pick whichever arrangement is most convenient.
Sourcery VSIPL++ uses several additional techniques to obtain good performance. Sourcery VSIPL++ can dispatch operations (like FFTs) to math libraries that have been carefully tuned for the target system. For example, on Intel processors, the Intel Performance Primitives (IPP; http://www.intel.com/cd/software/products/asmo-na/eng/perflib/ipp/index.htm) provide handwritten code for FFTs. If no library is available for a particular operation, Sourcery VSIPL++ falls back to generic routines.
The generic routines for some computations (like the *= operator used to perform element-wise multiplication in the example) use expression templates to perform loop fusion and eliminate temporaries. In the *= case, this line of code:
tmp *= filters.row(i);
is transformed into code like this:
for (length_type j=0; j<N; ++j) tmp[j] *= filters[j];
This technique is even more effective on code such as:
Vector<T> A, B, C; A = B + cos(C);
which is transformed into code like:
for (length_type i=0; i<A.size(0); ++i) A[i][j]=B[i][j]+cos(C[i][j]);
which can be compiled to very efficient code.
Some compilers (the GNU C++ compiler, for instance) have extensions that can be used to explicitly request that Single-Instruction Multiple-Data (SIMD) units on the processor be used to perform several computations at once. Sourcery VSIPL++ uses these extensions when they are available.
The net result of these techniques is that the Sourcery VSIPL++ performance for an application is usually within a hair's breadth of the performance attained by directly using the underlying low-level math libraries. So, VSIPL++ users get productivity and portability, without sacrificing performance.
On occasion, Sourcery VSIPL++ is able to achieve better performance than the underlying math libraries, due to the use of loop fusion. For example, in the cosine example, using a library like IPP, you would have to perform the addition and cosine operations separately. As a result, you would get code that looks more like this:
for (length_type i = 0; i < A.size(0); ++i) A[i][j] = B[i][j]; for (length_type i = 0; i < A.size(0); ++i) A[i][j] += cos(C[i][j]);
Because there are two separate loops, there is more overhead and an inferior cache access pattern. The loop fusion used by Sourcery VSIPL++ collapses these two loops into a single loop.