The authors are members of the Intel Software and Services Group.
Microprocessor design and manufacturing process innovations continue to improve software application performance through both implicit and explicit mechanisms. Beyond raw performance, many applications of interest, such as those running on mobile devices, now also face the challenge of reducing power consumption. Multicore and many-core processors are one design evolution that can address both performance and power efficiency. In addition to multiple cores, such processors typically include per-core vector units that provide an additional level of parallelism.
The benefits of these architectures can be fully realized only by writing parallelized and vectorized code. Existing approaches to parallelization include the Windows and POSIX thread APIs, MPI, and the OpenMP shared-memory threaded programming model. Vectorization can be accomplished by using vector intrinsics or by relying on auto-vectorization in the compiler. However, combining threads and vector instructions under most existing programming models requires a great deal of expertise from the programmer and often yields code that underperforms or that is overly tied to a specific processor architecture.
CPU threading APIs provide a generic programming model for multicore parallelization, but applications using this model still need fine-tuning of activities such as task spawning, data distribution, and synchronization to extract the best performance. Even so, threading by itself does not provide access to per-core vector parallelism. On the other hand, GPU-derived programming models such as OpenCL provide a separate compiler and runtime to extract application parallelism and can target vectorization as well as core parallelism. However, these programming models are still fairly low level and expect applications written in them to be tuned directly for specific architectures.
To address these problems, Intel is introducing a suite of programming models, the Intel Parallel Building Blocks, that can target both vector and core parallelism in a general, scalable, and architecture-independent fashion. These models are intended to support future scaling so that code written today will be able to harness both today's and tomorrow's processors. Intel Array Building Blocks is one of these models, supporting data parallelism in a compiler-independent fashion. Array Building Blocks provides an abstract, scalable API based on the composition of structured data-parallel constructs. It is independent of machine architecture and allows users to focus on developing scalable parallel algorithms rather than becoming experts in particular machine-dependent parallel mechanisms. In this article, we present some of the features of the Array Building Blocks programming model and provide code examples.
Array Building Blocks: An Overview
The goal of Array Building Blocks is to define a programming model that efficiently and portably targets software for multicore and many-core architectures. The design philosophy of Array Building Blocks is to get application developers to "think parallel" while hiding nuances of the underlying execution layer such as hardware threads, cores, and the vector ISA. The programmer can then express parallelism in an architecture-independent fashion.
Array Building Blocks provides a dynamic execution engine comprising three major services:
- Threading Runtime dynamically adapts to the underlying architecture. The threading runtime (TRT) provides a fine-grained model for data and task parallel threading. TRT also handles complex fine-grained synchronization patterns.
- Memory Manager segregates the Array Building Blocks memory/vector space. It has a set of lock-free memory interfaces as well as a garbage collector. The memory manager is responsible for allocation and data formatting and, in conjunction with TRT, for partitioning data for parallel operations.
- Just-in-time Compiler/Dynamic Engine constructs an intermediate representation (IR) of the computations, performs optimizations and generates the code that is to be executed. Compilation occurs only if required; otherwise code is pulled from the code cache. The Array Building Blocks compiler has three phases: high-level (HLO), low-level (LLO) and Converged Vector ISA code generation. Converged Vector Intrinsics (CVI) is an abstracted and generalized IA32/Intel 64 vector ISA.
The HLO phase performs architecture independent code optimizations to reduce threading overhead, memory usage and redundant computation. The LLO phase does runtime-dependent optimizations. These optimizations include 1) Generation of parallel kernels using the threading runtime, 2) Translation of optimized kernels into vector code, and 3) Generation of architecture-independent CVI code. The CVI code is not bound to a particular generation of Intel vector instructions (such as SSE), thus ensuring forward-scaling and architecture independence of the overall stack. CVI code is then generated for the particular ISA version in the target machine.
Array Building Blocks offers programmers the ability to selectively target portions of their C/C++ programs to rewrite in Array Building Blocks. It allows programmers to apply a rich set of operators to a very expressive set of types that includes 1D, 2D and 3D dense containers, nested containers, and (in the future) indexed and sparse containers. It also provides safety, both by isolating its data objects from the rest of the C/C++ program and also by designing away conflicting concurrent accesses to shared objects. The isolation eliminates the need for locks, precludes data races, and obviates the need to create complex parallel data structures.
Array Building Blocks goes beyond replacing simple loops with array operations. There are several different facets of Array Building Blocks:
- A Programming Model: To express data parallelism with sequential semantics, Array Building Blocks allows operations to be expressed at an aggregate collection level instead of at an individual element level.
- A Language: Array Building Blocks adds new types/operations and mimics C/C++ control flow. Array Building Blocks adds to the C/C++ language through header files and a runtime library.
- An Abstract Machine: The Array Building Blocks high-level interface abstracts the details of the underlying machine, while the dynamic compiler and runtime reduce the need to code at the hardware level to extract good performance. Details such as hardware threads, vector widths, the ISA, the memory model, and cache sizes are hidden from the programmer, making Array Building Blocks code portable and easy to write.