John Gross is chief engineer and Jeremy Orme chief architect at Connective Logic Systems. They can be contacted at firstname.lastname@example.org and email@example.com, respectively.
This multi-part article introduces Blueprint, a technology developed Connective Logic Systems (where we work) that provides an alternative approach to multi-core development. The Blueprint toolset is built on the premise that we actually have no problem understanding complex concurrency -- until we project it into the one-dimensional world of text. For this reason, a visual representation is much easier to work with than its textual equivalent. There are some cases where text is obviously more intuitive -- such as parallelizing data-parallel loops. In these cases, technologies such as Microsoft's Task Parallel Library (TPL) and Intel's Threaded Building Blocks, can be intuitive and productive. For more background information, see Multi-Core OO: Part 1.
In this article series, we have considered how to develop applications that make best use of parallel hardware consisting of perhaps an unknown number of processors in an unknown topology (this information will be "automatically" discovered at runtime). We have already moved into a world where we can't assume the processor count and it is also likely that "unknown topology" will also soon become an issue. The process for insulating applications from these uncertainties while retaining the object-oriented (OO) abstraction is:
- Separate infrastructure from processing and describe it using an intuitive visual language.
- Use an asynchronous class equivalent to encapsulate and re-use concurrent infrastructural logic.
- Use the lightweight process of accretion to map the application to multiple processes for deployment.
This final installment of the series considers the next challenge -- dynamic reallocation of hardware at runtime. We expand on the SVP abstraction and introduce the concept of "process colonies" which provide applications with a real-time scale-on-demand capability. It also provides a high-level overview of the distributed scheduler which is responsible for preemptively balancing the load across participating machines and managing their dynamic recruitment and retirement.
What is Scale-on-Demand Processing?
This abstraction augments the concept of a simple "process" with an equivalent "colony" of collaborating processes. Colonization, like Accretion, is consistent with the SVP model of the world, and so application code is unaffected by this later-stage mapping.
Figure 1 shows a simple colony containing a single master process and an arbitrary number of slave processes (the general case can support applications with any number of "slave-able" processes). The translator uses this information to create two versions of the specified process (see "Accretion" in Part 4). The "master" version will instantiate all of the specified definitive circuitry, while its "slave" equivalent will only instantiate the functionality required to execute the circuitry that has been designated as "slave-able" (that is, able to be offloaded to slave processes).
Designating a method as slave-able only involves setting one attribute. The method is subsequently displayed with a shaded interior, and non-colonized accretions and builds will ignore the setting. Colonization is completely transparent to the application developer; some of its functionality is provided by generated code, and the rest is provided by the runtime scheduler.
At runtime, each master process in the colony registers with the "registrar process", which can be hosted by the application instance's "root-master" process, or a dedicated network server process (with a fixed address). Any number of uniquely named application instances can execute on the same network (the instance name is passed to each process at runtime), but in practice, turnkey systems typically need predictable performance and so the multiple instance capability is primarily provided to let engineers share resources for development purposes. This means that slave processes can be deployed on any available machine in the network and will automatically locate and connect-to their designated master instance.
The colony editor lets developers specify each master/slave colony's resource requirements. In some cases a given accreted process may wait for a specific number, or minimum number of slaves to connect (typical for real-time applications). In other cases, the application may wish to start execution with whatever is currently available (typical for off-line applications).
The runtime's distributed scheduler transparently manages a number of aspects of application execution;
Slaves can be recruited and/or retired at any stage of execution, and the runtime will automatically ensure that slaves do not "exit" until all of their allocated tasks are complete, and any definitive data has been redistributed to adjacent slaves (or the master itself).
The scheduler will dynamically distribute the load across each participating slave. To support distributed real-time applications, the scheduler addresses prioritization and preemption first, load balance second, and bandwidth optimization third.
Network scope preemption is difficult to implement in a topology independent manner, but is actually essential for applications like military sonar, and/or oil and gas survey systems, financial modeling can also be extremely time critical, because the ability to prioritize tasks at network scope can save vital seconds.
Consider a signal processing system that needs to perform a determined number of FFTs at 1Hz, and assume that the network turn around time for this is 0.5 of a second. Now assume that in parallel with this, it also performs another set of smaller FFTs at 20 Hz, and that the turnaround time for this is 0.02 of a second.
Most standard mechanisms like scatter/gather do not execute preemptively, and so for at least 0.5 of a second, the higher frequency calculations would be stalled. This could be addressed by executing two processes on the same machine, but context switching a large process at 20Hz would not have been practical for most of the Blueprint real-time systems undertaken to date. This then leaves little choice but to map particular processes to particular machines/cores and thus lose the obvious benefit of optimal turnaround time for high-priority processes (where every CPU/core in the system is recruited).
In Blueprint, applications network scope preemption is transparently achieved by keeping track of each slave's prioritized job list. The master scheduler will never send a priority "N" job to a target that is fully loaded at priority "N" (or above), if there is another target that has spare capacity at that priority. This is essentially a distributed equivalent of SMP thread scheduling, and each slave scheduler is responsible for ensuring that it always has enough worker threads to provide preemption for each of its CPU cores.
The task manager in Figure 3 show a sonar demonstrator with its master process running on a dual core laptop, and a single slave process executing on an 8-way desktop. There is no limit to the number of slave processes that could be recruited if required.
The runtime will automatically balance the load (within the constraints imposed by preemption), and the more and/or faster cores that a machine provides, the more work it will be given; this allows applications to utilize resources in a whatever/whenever (scale-on-demand) sense, rather than depending on dedicated clusters. If a slave machine becomes busy/not-so-busy at any point (e.g., its virus checker is running/finished), the master scheduler will detect this and automatically adjust the slave machine's workload.
To minimize communication bandwidth (within the above constraints), the scheduler adopts a "jobs-to-data", rather than "data-to-jobs" strategy. For a job to be executed by a given slave process, the latter must have a copy of each of the job's input arguments, as well as an up to date definitive copy of the job's state (if required). All else being equal, the scheduler will therefore allocate the job to whichever slave owns (or has cached) the most job-specific input/state data.
This is a three-stage process:
- If required, the first stage involves simultaneously selecting and instructing adjacent slaves to send copies of any outstanding data that are required by the target slave.
- The second stage involves the nominated slaves sending their data to the target slave.
- Finally, when all data (and state) has arrived, the target slave will schedule the job for execution and cache/flush its outputs appropriately. Cached data that is no longer referenced is automatically garbage collected.
Scheduling latency is minimized by "job-overlapping"; the scheduler looks ahead, predicts load, and allocates a specified number of "buffer-able" jobs to each slave; this allows processing and communication to be fully overlapped and in typical cases, hides the latency associated with the movement of data.
Parallel Recent Articles
This month's Dr. Dobb's Journal
Most Recent Premium Content
- November - Mobile Development
- August - Web Development
- May - Testing
- February - Languages
- Open Source
- Windows and .NET programming
- The Design of Messaging Middleware and 10 Tips from Tech Writers
- Parallel Array Operations in Java 8 and Android on x86: Java Native Interface and the Android Native Development Kit
- January - Mobile Development
- February - Parallel Programming
- March - Windows Programming
- April - Programming Languages
- May - Web Development
- June - Database Development
- July - Testing
- August - Debugging and Defect Management
- September - Version Control
- October - DevOps
- November- Really Big Data
- December - Design
- January - C & C++
- February - Parallel Programming
- March - Microsoft Technologies
- April - Mobile Development
- May - Database Programming
- June - Web Development
- July - Security
- August - ALM & Development Tools
- September - Cloud & Web Development
- October - JVM Languages
- November - Testing
- December - DevOps
Dr. Dobb's Journal
Dr. Dobb's Tech Digest