Checks and Balances
The MPL Report includes several project-level recommendations that seem applicable to nearly any software project.
R1) For highly cost- and schedule-constrained projects, it is mandatory that sufficient systems engineering and technical expertise and the use of the institution's processes and infrastructure be applied early in the formulation phase to ensure sound decision making in baseline design selection and risk identification.
In other words, when you're expected to work smarter, you must actually think about the project, preferably before you begin. Ignoring potential problems is not a strategy.
R2) Do not permit important activities to be implemented by a single individual without appropriate peer interaction; peers working together are the first and best line of defense against errors. Require adequate engineering staffing to ensure that no one individual is single string; that is, make sure that projects are staffed in such a way as to provide appropriate checks and balances.
We've found that bouncing your coding ideas off somebody else really is the first and best line of defense against errors. That person must, of course, be competent to recognize errors and omissions, which was part of the problem in both of these mishaps.
While testing cannot ensure the absence of errors, it can demonstrate that the project meets specifications and, with any luck, also show that the specifications don't have any glaring omissions. The tests must match up with reality, rather than with an idealized model: end-to-end validation...through simulation and other analyses was potentially compromised in some areas when the tests employed to develop or validate the constituent models were not of an adequate fidelity level to ensure system robustness.
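The fidelity point can be made concrete with a small sketch. The example below is hypothetical (the function, thresholds, and sensor traces are invented for illustration, loosely inspired by a lander touchdown sensor): a test built against an idealized sensor model passes, yet it never exercises the transient that a higher-fidelity model would reveal.

```python
# Hypothetical touchdown-detection logic. All names and data here are
# invented for illustration; they are not taken from any NASA report.

def engines_should_cut(sensor_samples):
    """Spec: cut the descent engines once the leg sensor reads True."""
    return any(sensor_samples)

# Idealized test model: the sensor reads True only at actual touchdown.
# This test passes, so the logic "meets the specification."
ideal_descent = [False, False, False]   # still descending, no touchdown
assert engines_should_cut(ideal_descent) is False

# Higher-fidelity model: leg deployment produces a spurious transient
# that the idealized test never exercised. The same logic now cuts the
# engines while still in mid-air -- the test suite was fine; the model
# behind it wasn't.
realistic_descent = [False, True, False]  # transient during leg deployment
assert engines_should_cut(realistic_descent) is True
```

Both assertions pass, which is exactly the trap: the code is "verified" against the low-fidelity model, and the dangerous behavior only appears once the test inputs match reality.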
Reading these NASA mishap reports makes perfectly clear the suffocating number of reviews, cross-checks, verifications, and sign-offs required to get anything done. Many of the recommendations boil down to "be more careful" and "check more often," but it seems that, beyond a certain point, humans simply cannot and will not do more checking.
The N-Prime mishap clearly reveals that limiting point. The Report does not include the checklist itself, but the missing steps give some indication of how an intricate procedure can go wrong in real life.
L-1) The RTE decided to "assure" the cart configuration through an examination of paperwork from a prior operation rather than through physical and visual verification. The RTE made a second decision error in dismissing a comment by the Technician Supervisor concerning empty bolt holes.
L-2) The technicians, with the exception [of the] Technician Supervisor noted above, failed to notice the missing bolts, even though they were working within inches of where the bolts were supposed to be.
L-3a) The PQC [Product Quality Control] and the PA [Product Assurance] signed-off on "assure the configuration" of the TOC procedure step without personally validating the TOC configuration or, in the case of the PA, even being present at the time this step of the procedures was completed during the operation.
L-3b) The safety representative was not present as called for in the procedure. Again, this investigation determines that such a violation is routine.
These elements led the MIB [Mishap Investigation Board] to conclude that decision and skill-based errors and routine violations by the NOAA N-PRIME I&T [Integration and Test] team were manifested as a failure to adhere to procedures.