Fault Tolerance: We Gotta Get There!!
When parallel processing is required, virtually every aspect of the software design and implementation is affected. The developer is faced with what we call the "10 challenges of concurrency":
- Software decomposition into instructions or sets of tasks that need to execute simultaneously
- Communication between two or more tasks that are executing in parallel
- Concurrently accessing or updating data by two or more instructions or tasks
- Identifying the relationships between concurrently executing pieces of tasks
- Controlling resource contention when there is a many-to-one ratio between tasks and resource
- Determining an optimum or acceptable number of units that need to execute in parallel
- Creating a test environment that simulates the parallel processing requirements and conditions
- Recreating a software exception or error in order to remove a software defect
- Documenting and communicating a software design that contains multiprocessing and multithreading
- Implementing the operating system and compiler interface for components involved in multiprocessing and multithreading
Some of the concurrency challenges have to be checked in the testing phase and accounted for in exception handlers. These challenges are:
- Incorrect/inadequate communication between two or more tasks that are executing in parallel
- Data corruption as a result of unsafe updating of data by two or more instructions or tasks
- Resource contention when there is a many-to-one ratio between tasks and resource
- An unacceptable number of units that need to execute in parallel
- Missing/incomplete documentation for communicating a software design that contains multiprocessing and multithreading
The mechanisms that synchronize communication and data or device access between concurrently executing threads or processes (for instance, mutexes and semaphores) are used to control and prevent the errors that would arise from Challenge 2. Timed mutexes can be used to control and prevent the errors that would arise from Challenge 3.
Documentation so often receives the least attention and the fewest dedicated resources, yet it is one of the most important components of a software deployment. As with everything else in parallel programming and multithreading, documentation is even more critical for these classes of applications. The testing process should verify and validate that the design documentation and the post-production documentation match!
Table 1 shows which mechanisms can be used to control and prevent some of the 5 challenges.
TYPES OF SEMAPHORES
|Mutex Semaphore||Mechanism used to implement mutual exclusion in a critical section of code.|
|Read-write Locks||Mechanism used to implement read-write access policy between tasks.|
|Multiple Condition Variable||Same as an event mutex but includes multiple events or conditions.|
|Condition Variables||Mechanism used to broadcast a signal between tasks that an event has taken place. When a task locks an event mutex, it blocks until it receives the broadcast.|
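To make the first row of the table and the timed mutex mentioned earlier more concrete, here is a minimal sketch using the standard C++ library. It is an illustration under stated assumptions, not code from the book: `SharedCounter`, `count_with_mutex`, and `try_use_resource` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <chrono>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical shared data that several tasks update concurrently.
struct SharedCounter {
    std::mutex guard;  // protects value
    long value = 0;
};

// Starts `workers` threads; each one increments the shared counter
// `increments` times. The mutex makes every increment a critical
// section, so no update is lost.
long count_with_mutex(int workers, int increments) {
    assert(workers > 0 && increments >= 0);
    SharedCounter counter;
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w) {
        pool.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i) {
                std::lock_guard<std::mutex> lock(counter.guard);
                ++counter.value;
            }
        });
    }
    for (auto& t : pool) {
        t.join();
    }
    return counter.value;
}

// Tries to acquire a contended resource, but gives up after `wait`
// instead of blocking indefinitely. Returns true only if the lock
// was obtained; the lock is released when the function returns.
bool try_use_resource(std::timed_mutex& resource,
                      std::chrono::milliseconds wait) {
    // This constructor calls resource.try_lock_for(wait).
    std::unique_lock<std::timed_mutex> lock(resource, wait);
    if (!lock.owns_lock()) {
        return false;  // timed out: the caller can report or retry
    }
    // ... work with the protected resource would go here ...
    return true;
}
```

A test could call `count_with_mutex(4, 10000)` and check that the result is exactly 40000, while a `false` return from `try_use_resource` gives the caller a place to handle the timeout rather than hang.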
The mechanisms listed in Table 1 are low-level mechanisms. Fortunately, using features of higher-level component libraries such as TBB or the standard C++ concurrent programming library will take some of the tedium away during the testing process. These issues are meant to be dealt with in Layers 2 and 3 of the PADL (Parallel Application Design Layers) analysis model.
Several words used in discussions of testing, error handling, and fault tolerance are often used incorrectly or loosely. Table 2 contains the basic definitions.
|Defect||A flaw in any aspect of software or software requirements that contributes or may potentially contribute to the occurrence of one or more failures.|
|Error||An inappropriate decision made by a software engineer/programmer that leads to a defect in the software.|
|Exception Handling||A mechanism for managing exceptions (unanticipated conditions during the execution of a program) that changes the normal flow of the execution of a program/software.|
|Failure||An unacceptable departure from the operation of a software element that occurs as a consequence of a fault.|
|Fault||A defect in the software due to human error that when executed under particular conditions causes failure.|
|Fault Tolerance||A property that allows a piece of software to survive and recover from the software failures caused by faults (defects) introduced into the software as a result of human error.|
|Reliability||The ability of the software to perform a required function under specified conditions for a stated period of time.|
Since some of the terms in Table 2 such as error, failure and fault are commonly used in many different ways, we have provided simple definitions for how they can be used. The extent to which our software is able to minimize the effects of failure is a measure of its fault tolerance. Achieving fault tolerant software is one of the primary goals of any software engineering effort. However, the distinction between fault tolerant software and well tested software is often misunderstood and blurred. Sometimes the responsibilities and activities of software verification, software validation, and exception handling are erroneously interchanged. To work towards our goal of using the C++ exception handling mechanism to help us achieve logical fault tolerant software, we must first be clear where exception handling fits in the scheme of things.
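As one possible illustration of where exception handling sits in a parallel program, the sketch below uses `std::async` and `std::future` from the standard C++ library. An exception thrown inside a task is stored in its future and rethrown by `get()`, so the calling thread can catch it and carry on. The names `process_item`, `BatchResult`, and `process_batch` are hypothetical, and the "defect" (rejecting negative input) is contrived for the example.

```cpp
#include <future>
#include <stdexcept>
#include <vector>

// Hypothetical task: throws for negative input to simulate a fault
// that is triggered only under particular conditions.
int process_item(int item) {
    if (item < 0) {
        throw std::invalid_argument("negative item");
    }
    return item * 2;
}

// Summary of one batch: the sum of successful results and the
// number of tasks that failed.
struct BatchResult {
    int total = 0;
    int failures = 0;
};

// Runs every item as its own asynchronous task, then collects the
// results. A failing task does not abort the batch.
BatchResult process_batch(const std::vector<int>& items) {
    std::vector<std::future<int>> pending;
    for (int item : items) {
        pending.push_back(
            std::async(std::launch::async, process_item, item));
    }

    BatchResult result;
    for (auto& f : pending) {
        try {
            // get() rethrows any exception raised inside the task.
            result.total += f.get();
        } catch (const std::exception&) {
            // Recover: record the failure and keep going.
            ++result.failures;
        }
    }
    return result;
}
```

For the input `{1, 2, -3, 4}`, three tasks succeed and one fails, so the batch still completes. Note that the handler only contains the failure; finding and removing the underlying defect remains the job of testing.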
For more on parallelism and fault tolerance, see "Using Erlang to Build Reliable, Fault Tolerant, Scalable Systems" and "Top 10 Challenges in Parallel Computing".
(This is an excerpt from our book "Professional Multicore Programming: Design and Implementation for C++ Developers: Chapter 10: Testing and Logical Fault Tolerance for Parallel Programs".)