It's Not My Fault ...
Failures are the result of a defect in hardware, software, or human operation. If the software is not running, then it cannot encounter defects.
Although this is an obvious statement, it is important for understanding the distinction between the responsibilities and activities of the testing phase and those of the exception handler.
The more defects found and removed during testing, the fewer defects the software encounters at runtime. Defects encountered at runtime lead to failures in the software. Failures in the software produce exceptional conditions for the software to operate under. The exceptional conditions require exception handlers. So the balancing act is between defect removal during the testing stages and defect survival during exception handling.
The problem with choosing defect survival over defect removal is that exception handling code can become so complex that it introduces defects into the software. Instead of providing a mechanism to help achieve fault tolerance, the exception handler becomes a source of failure. Defect survival reduces the software's chance of operating properly. Extensive and thorough testing removes defects, which reduces the strain on the exception handlers. It is also important to note that exception handlers do not occur as freestanding pieces of code. They occur within the context of the overall software architecture. The journey towards fault tolerance in our software begins by recognizing that:
- No amount of exception handling can rescue a flawed or inappropriate software architecture.
- The fault tolerance of a piece of software is directly related to the quality of its architecture.
- The exception handling architecture cannot replace the testing stages.
To make a discussion about exception handling clear and meaningful, it is important to understand that the exception handling architecture occurs within the context of the software architecture as a whole. This means that exceptions are identified by the PBS (Predicate Breakdown Structure) and PADL (Parallel Application Design Layers) analysis. The solution model has a PBS. When we have an unavoidable, uncontrollable, unexplainable deviation from the application architecture's PBS, we have an exception. So the exception is defined by clearly articulated architectures. If the software architecture is inappropriate, incomplete, or poorly thought out, then any attempt at after-the-fact exception handling is highly questionable. Further, if shortcuts have been taken during the testing stages (e.g. incomplete stress testing, incomplete integration testing, incomplete glass box testing, and so on), then the exception handling code will have to be perpetually added to and will become increasingly complex, ultimately detracting from the software's fault tolerance and the declarative architecture of the application. On the other hand, if the software architecture is sound and the exception handling architecture is compatible and consistent with the PBS and Layers 3, 4, and 5 of the PADL (see blog) analysis, then a high degree of fault tolerance can be achieved for our parallel programs. If we approach our goal of context failure resilience with an understanding of the roles that software application architecture and testing play, then it is obvious that we choose defect removal over defect survival. Defect removal takes place during testing.
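One way to picture "an exception is a deviation from the PBS" is the minimal C++ sketch below. It is only an illustration under stated assumptions: the names `Predicate`, `PredicateViolation`, `enforce`, and `run_checked` are hypothetical and do not come from the book, and a real PBS would be derived from the PADL analysis rather than written by hand like this. The sketch also keeps the handler deliberately small, in line with the point that complex handlers become a source of failure.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical: one entry of a PBS, a named statement that must hold.
struct Predicate {
    std::string name;
    std::function<bool()> holds;
};

// Hypothetical exception type that signals a deviation from the PBS.
class PredicateViolation : public std::runtime_error {
public:
    explicit PredicateViolation(const std::string& name)
        : std::runtime_error("PBS predicate violated: " + name) {}
};

// Check every predicate and throw on the first one that does not hold.
void enforce(const std::vector<Predicate>& pbs) {
    for (const auto& p : pbs) {
        // A predicate without a check is a defect to remove during testing,
        // not a runtime condition to survive, so it is an assert.
        assert(p.holds && "every predicate needs a check");
        if (!p.holds()) {
            throw PredicateViolation(p.name);
        }
    }
}

// Deliberately small handler: record what deviated and report failure.
// It contains no recovery logic that could introduce defects of its own.
bool run_checked(const std::vector<Predicate>& pbs, std::string& report) {
    try {
        enforce(pbs);
        return true;
    } catch (const PredicateViolation& e) {
        report = e.what();
        return false;
    }
}
```

A caller would build the predicate list once, call `run_checked` at the points where the architecture says the predicates must hold, and treat a `false` result as the exceptional condition.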
So what about parallel systems? Parallel systems require even more effort during the testing phase. So we make use of our PADL analysis and PBS in our test plan. We break up the testing goals of parallel programs into answering three fundamental questions:
- Do the design models and PBS correctly and completely represent the solution model? (assuming that the solution model solves the original problem)
- Does the implementation model map correctly to the design models and the PBS (Layers 4 and 5 from PADL)?
- Have all of the challenges to concurrency in the implementation model been addressed?
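As a rough illustration of the third question, a stress test can exercise one concurrency challenge (lost updates on shared data) before the software ever reaches runtime. The sketch below is an assumption-laden example, not the book's test plan: `SharedCounter`, `stress_counter`, and `check_no_lost_updates` are hypothetical names, and a passing run only raises confidence; it does not prove the absence of races.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical component under test: a counter guarded by a mutex.
class SharedCounter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);
        ++value_;
    }
    long value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    long value_ = 0;
};

// Hammer the counter from several threads and return the final total.
long stress_counter(int thread_count, int increments_per_thread) {
    SharedCounter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // wait for every worker before reading the total
    }
    return counter.value();
}

// Test-stage check: every increment must show up in the final total.
void check_no_lost_updates(int thread_count, int increments_per_thread) {
    const long expected =
        static_cast<long>(thread_count) * increments_per_thread;
    assert(stress_counter(thread_count, increments_per_thread) == expected);
}
```

Calling `check_no_lost_updates(8, 10000)` from a test driver runs eight threads against the counter; removing the lock in `increment` would be expected to make the check fail intermittently, which is the kind of defect better removed in testing than survived in an exception handler.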
(This is an excerpt from our book "Professional Multicore Programming: Design and Implementation for C++ Developers: Chapter 10: Testing and Logical Fault Tolerance for Parallel Programs".)

