Graduating from Multitasking to Multiprogramming
For two decades, event-driven programming in multi-threaded environments created the cinematic illusion of parallel processing. Now that multicore chips have brought down the price of parallel processing to consumer levels, we are reminded that the programming practices which deliver efficiency in the multi-threaded emulation of parallelism are not genuine parallel algorithms.Having decided over time that our servers are workstations, and later, that workstations are personal computers, the needs of userland threading dominate much of the programming practice.
That multiprogramming != (differs from) multitasking is easy to prove. Work units in parallel algorithms are well-factored and well-packaged. Threaded tasks are slices of a processor's time, slices sized more by the general requirements of an operating environment than by the needs of the algorithm.
What one might call "pure" parallel algorithms return the same volume of work in proportion to the same execution time across the same number of processors, which is not true of multiple threads on a general-purpose workstation operating system today outside of very special programming techniques with peculiar consequences on those general-purpose workstation operating systems.
How to Parallelize
We've many of us found tools that offer access to parallelization, but how does one usefully parallelize a typical large business application? It's one thing to calculate all the prime numbers held in Zeus's left hand, but what about enterprise workflow systems?
In recent years, I designed, and my team very kindly built for me, what we hope is a highly unoriginal and sensible-as-casual-shoes utilization of multiple cores in a workflow.
Our system gathers data reported to the Server from Clients phoning home via HTTP. Many sorts of calculations take place on this data, either immediately or at fixed intervals. User-definable summary views spring into being on request. This description matches many enterprise applications.
The work was, in the older system, performed by several daemon threads and one Apache module. Some were written in C, some in Perl, some in Ruby. It was a comfortable rapid development environment with reasonable if not superb stability as deployed.
The new parallelized system is minimal processes (miniprocs), each with a local cache shadowed to disk. The processes communicate via message queues. Any component can reside on any physical machine in a network, since all communication, including control messages like Start and Shutdown are messages in to a queue. Some miniprocs are queue subscribers, others are topic subscribers.
The guiding principles are:
- All work in the system can be factored in such a way that processing of a real-world work unit becomes vector A -> B -> C -> ..
- And A - B - C - .. are of roughly the same load
- And A - B - C - .. are parallelizable under your chosen runtime environment
There you have a vector parallelized system that will scale nicely to the kind of multiprocessor server boxes, concurrent operating systems and queuing middleware that are easily obtainable today.
Mechanistic Beauty
The mechanistic beauty of of process-queue-cache pattern is that it's three snap-together parts. The only thinking you really have to do is when you draw the map of queue-connected miniprocs necessary to achieving your computational goals. The implementation is simple, mostly a series of one-function overrides of a small set of base classes.
- Process
- Does preferably One Thing and One Thing Only
- Factor into evenly sized work units and the operating system scheduler is entirely adequate for realtime.
- Simplicity is a virtue in coding as in life.
- Gets its work from an In Queue
- Which is the Out Queue of another Process or a Queue used for Dispatching to Processes
- Outputs to its Out Queue
- Which is, of course, the In Queue of some other Process or the return Queue to a Dispatcher
- Does preferably One Thing and One Thing Only
- Queue
- Connects Processes
- Three kinds
- In
- Out
- Work output of a Process
- Is effectively the In Queue of another Process
- Command
- System commands to individual Processes
- Cache
- Local persistent and transient data that belongs
- to an individual Process
- to a group of Processes
- Implement this as an in-memory cache that behaves like a
- Database
- Tuple store
The Benefits
- Easy to code
- Easy to understand
- Easy to maintain
- Distributable across multiple platforms
- Message architecture
- Locality of Repertory
- Offers small scope to typical parallelization errors
It's easy to model:
- Grab some queuing middleware
- Grab some in-memory cache middleware
- Write in a compiler language without its own scheduler
- Leverage the operating system support for multiproc scheduling
- We used Fedora 8 to model and CentOS to deploy
This parallelizes without writing a lock. All the nasty, error-prone parallelization code is in the message queue middleware. You've just built a giant chutes-and-ladders system where the little balls (work units) roll down the path and get dispatched to other chutes down other paths, with a small and carefully factored bit of work getting done at each landing.
- Local persistent and transient data that belongs

