As with many technologies, the military was one of the first "industries" to bring computers into mainstream use in both its infrastructure and weapons systems. One can only imagine some of the spectacular failures that led to the development of some of the military-specific standards.
Regardless, the military was among the first to propose a rigor for the development of software used in military devices. From military applications, it was a natural evolution for software to move into civilian applications such as avionics. First used in communication, diagnostics and guidance systems, software control systems have moved into the arena of flight control, where fly-by-wire systems have now been deployed in commercial aircraft. The European Airbus A380 is a perfect example of an aircraft flown entirely by computer; there are no mechanical linkages between the pilot and the flight control surfaces.
Medical devices are another area where the safety of software plays a role in ensuring both operator and patient safety. Programmable electronic devices are deployed in everything from portable blood glucose monitors to implanted heart defibrillators. Increasingly, automobile manufacturers are adding more and more computing power to their products. The reasons range from safety concerns, to environmental, to cost.
Engine management software cleans our exhaust and controls the transmission to ensure optimal performance, while anti-lock braking software maximizes stopping power. In the late 1990s, BMW replaced the wiring harness used for controlling things like electric door locks, mirror and window controls with a simple two-wire CAN bus, and as a result, eliminated over 10 kg of wiring from the vehicle. Nowadays, modern luxury vehicles contain 80 or more programmable electronic devices.
Many automotive manufacturers are toying with the idea of X-by-wire systems (steer by wire, brake by wire). This is attractive from the standpoint of safety: with the steering column removed, so is the prospect of it impaling the driver in an accident. Furthermore, the manufacturer no longer has to maintain two versions of the vehicle, as the steering wheel and the glove box can be interchangeable. The dealer can customize the car for driving in either the US/Europe or the UK/Japan/Australia.
The use of software in the aforementioned devices improves their functionality and usefulness, but if that software fails, then in some cases the results are catastrophic. Expensive devices may be ruined, but worse, there is a potential for loss of life.
Notable Software Bugs
July 28, 1962 - Mariner I space probe. A bug in the flight control software caused the Mariner I rocket to calculate an incorrect trajectory. The rocket was destroyed by Mission Control over the Atlantic.
1982 - Soviet gas pipeline. Conspiracy theories aside, a bug in the Soviet gas pipeline software controls caused the largest non-nuclear, man-made explosion in history.
1985-1987 - Therac-25 medical accelerator. A radiation therapy device had a bug that could lead to a race condition. When that condition occurred, the patient received multiple times the recommended dosage of radiation. The failure directly caused the deaths of five patients and harmed many more.
January 15, 1990 - AT&T Network Outage. A bug in a new release of code caused AT&T's switches to crash. Over 60 thousand New Yorkers were left without phone service for nine hours.
June 4, 1996 - Ariane 5 Flight 501. A software fault (an unchecked conversion of a 64-bit floating-point value to a 16-bit signed integer overflowed) shut down the guidance system of the Ariane 5 rocket. The rocket veered off course about 40 seconds after launch and broke apart.
November 2000 - National Cancer Institute, Panama City. Operators find that they can trick the software of a radiation therapy device. Despite the legal requirement that all treatment schedules be rechecked by hand, the device delivers twice the recommended dosage. Eight patients die and 20 more will undoubtedly be permanently disabled.
May 2004 - Mercedes-Benz "Sensotronic" braking system. In one of the largest recalls in automotive history, Mercedes-Benz has to recall 680,000 cars due to a failure of its Sensotronic braking system.
It is interesting to note that in every case, these failures occurred in devices whose designers knew in advance the possibly devastating results that a software failure could cause, and made every effort to prevent them. It is also interesting to note that in the case of the National Cancer Institute in Panama, the device still caused harm even though the operator was bound by law to recalculate the settings by hand (and did not).
From a historical perspective, there are a number of accepted, if not mandated, standards that many industries must adhere to: military and avionics, aerospace, nuclear and power plants, rail and medical. Their standards provide guidance as to how software (if not the entire device) is to be designed and deployed. They vary in their rigor, guidance, application and impact on development, but their goal is the same: to produce safe and reliable devices.
As a side note, it was pointed out to me that software safety, software security and software reliability are not one and the same. As a contrived and trivial example of the difference, a fire suppression system does not have to be reliable in the sense that it always works as one would expect; the goal of safe software is that if it fails, it fails in a safe fashion. For a fire suppression system, that may mean that if the software fails, the suppression system comes on.
Both of the standards examined here treat the device as a total system: the aircraft in the case of avionics, and the medical device in the second case. For this paper, however, only the software aspects will be considered.
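The fail-safe idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `suppression_command`, the sensor callable, the threshold and the state names are all hypothetical and come from no standard. The point is that every fault path resolves to the safe state (suppression on) rather than to an undefined one.

```python
SAFE_STATE = "SUPPRESSION_ON"    # assumed safe state for this hypothetical system
IDLE_STATE = "SUPPRESSION_OFF"


def suppression_command(read_sensor, threshold=0.5):
    """Return the actuator command, failing toward the safe state.

    read_sensor is a hypothetical callable returning a smoke level in [0, 1].
    Any exception or implausible reading is treated as a fault.
    """
    try:
        level = read_sensor()
        if not isinstance(level, (int, float)) or not 0.0 <= level <= 1.0:
            return SAFE_STATE            # implausible reading: fail safe
        return SAFE_STATE if level >= threshold else IDLE_STATE
    except Exception:
        return SAFE_STATE                # software or sensor fault: fail safe


def broken_sensor():
    """Simulates a sensor whose bus has stopped responding."""
    raise IOError("sensor bus timeout")


print(suppression_command(lambda: 0.1))   # normal reading below the threshold
print(suppression_command(broken_sensor)) # fault: the system comes on
```

Note that this controller is not "reliable" in the everyday sense: a broken sensor floods the room even when there is no fire. It is safe only because the failure lands in the state judged least harmful.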
The first is the Federal Aviation Administration's (FAA) DO-178B standard. Titled "Software Considerations in Airborne Systems and Equipment Certification," the standard known as DO-178 was first published in 1982 by the Radio Technical Commission for Aeronautics (RTCA).
After two revisions, the current version B was released in 1992. The standard was developed to establish guidelines on how software is designed, maintained, implemented and used in aircraft. Basically, it specifies that every line of code be directly traceable to a requirement, that every test case be traceable to a line of code, and that every line of code have a corresponding test case.
The DO-178B standard has five levels of certification, each of which equates to the potential for harm if the system fails. The lowest is Level E and the highest is Level A. The potential for harm and the level of certification are:
* Level A: Where a software failure would cause or contribute to a catastrophic failure of the aircraft flight control systems.
* Level B: Where a software failure would cause or contribute to a hazardous/severe failure condition in the flight control systems.
* Level C: Where a software failure would cause or contribute to a major failure condition in the flight control systems.
* Level D: Where a software failure would cause or contribute to a minor failure condition in the flight control systems.
* Level E: Where a software failure would have no adverse effect on the aircraft or on pilot workload.
As an example of the various types of applications and their potential for causing harm, the in-flight entertainment system may be considered Level E, while a fly-by-wire system is considered Level A. As the potential for catastrophic failure increases, so does the diligence required to prevent it. For all levels of the standard, almost all of the following "Certification Artifacts" are required:
* Plan for Software Aspects of Certification
* Software Development Plan
* Software Verification Plan
* Software Configuration Management Plan
* Software Quality Assurance Plan
* Software Requirements Standards
* Software Design Standards
* Software Coding Standards
* Software Requirements Data
* Software Design Description
* Software Verification Cases and Procedures
* Software Life Cycle Environment Configuration Index
* Software Accomplishment Summary
The documents above describe the "Best in Practice" techniques for design, implementation, deployment and maintenance of the software during its life cycle. The records listed below prove that those practices were followed.
Records and Test Results
* Software Verification Results
* Problem Reports
* Software Configuration Records
* Software Quality Assurance Records
The most rigorous aspect of the DO-178B standard is its approach to quality assurance and testing of the code. That goal is accomplished by "Functional Analysis" of the software and by "Structural Coverage Analysis" of the software.
The goal of functional analysis is to show a one-to-one correspondence between the code that makes up the software and the requirements (traceability); basically, "this code is here because of this requirement." The functional analysis tests the software through boundary testing and other techniques, and demonstrates that it does what it is supposed to without undefined results.
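A toy traceability check makes the "this code is here because of this requirement" idea concrete. The sketch below is not how any certification tool works; the data model (requirement IDs, code unit names, test names) and the function `trace_gaps` are assumptions made purely for illustration.

```python
def trace_gaps(requirements, code_units, tests):
    """Report traceability gaps in a toy project model.

    requirements: set of requirement IDs
    code_units:   dict of code unit name -> set of requirement IDs it implements
    tests:        dict of test name -> set of code unit names it exercises
    """
    implemented = set().union(*code_units.values())
    tested = set().union(*tests.values())
    return {
        # code that exists for no stated requirement
        "untraced_code": sorted(
            unit for unit, reqs in code_units.items() if not reqs & requirements
        ),
        # requirements that no code claims to implement
        "unimplemented_requirements": sorted(requirements - implemented),
        # code that no test exercises
        "untested_code": sorted(set(code_units) - tested),
        # tests that point at code that does not exist
        "orphan_tests": sorted(
            name for name, units in tests.items() if not units & set(code_units)
        ),
    }


# Hypothetical project data
requirements = {"REQ-1", "REQ-2", "REQ-3"}
code_units = {"pitch_limit": {"REQ-1"}, "trim": {"REQ-2"}, "debug_hook": set()}
tests = {"test_pitch": {"pitch_limit"}, "test_old": {"removed_unit"}}

gaps = trace_gaps(requirements, code_units, tests)
for kind, items in gaps.items():
    print(kind, items)
```

In this example `debug_hook` is flagged because no requirement justifies it, `REQ-3` because nothing implements it, and `test_old` because it refers to code that no longer exists. A clean project would return four empty lists.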
There are three levels of structural analysis:
* Statement Coverage
* Decision Coverage
* Modified Condition/Decision Coverage
Statement coverage essentially means that each line of code has been executed at least once. Decision coverage means that each entry and exit point has been executed at least once and that every decision has taken all of its possible outcomes at least once. Modified Condition/Decision Coverage requires all of that, plus that every condition in a decision has taken all of its possible outcomes at least once and that each condition has been shown to independently affect the decision's outcome.
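The "independently affects" requirement is the part of Modified Condition/Decision Coverage that is hardest to picture, so here is a small sketch. It checks only that criterion, in its strictest reading (two test vectors that differ in exactly one condition and give different outcomes). The helper names `independence_pairs` and `achieves_mcdc`, and the example decision, are illustrative assumptions rather than anything defined by DO-178B.

```python
from itertools import product


def independence_pairs(decision, n_conditions):
    """For each condition, list the pairs of input vectors that differ only in
    that condition and produce different decision outcomes."""
    pairs = {i: [] for i in range(n_conditions)}
    for vec in product([False, True], repeat=n_conditions):
        for i in range(n_conditions):
            if vec[i]:
                continue  # visit each pair once, starting from the False side
            flipped = vec[:i] + (True,) + vec[i + 1:]
            if decision(*vec) != decision(*flipped):
                pairs[i].append((vec, flipped))
    return pairs


def achieves_mcdc(decision, n_conditions, test_vectors):
    """True if the test set shows every condition independently affecting
    the decision's outcome."""
    tests = set(test_vectors)
    pairs = independence_pairs(decision, n_conditions)
    return all(
        any(a in tests and b in tests for a, b in pairs[i])
        for i in range(n_conditions)
    )


def decision(a, b, c):
    """Hypothetical decision with three conditions."""
    return a and (b or c)


# Two tests give both outcomes (decision coverage) but not MC/DC.
decision_only = [(True, True, True), (False, False, False)]

# Four tests (conditions + 1) are enough for MC/DC on this decision.
mcdc_set = [
    (True, True, False),
    (False, True, False),   # differs from the first only in a
    (True, False, False),   # differs from the first only in b
    (True, False, True),    # differs from the third only in c
]

print(achieves_mcdc(decision, 3, decision_only))
print(achieves_mcdc(decision, 3, mcdc_set))
```

The first test set exercises both outcomes of the decision yet never isolates any single condition, which is exactly the gap that the Level A criterion is meant to close.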
The amount of structural coverage analysis depends on the level of certification that is desired and is outlined below:
* Level E: No structural coverage analysis
* Level D: 100% traceability
* Level C: Level D plus 100% statement coverage
* Level B: Level C plus 100% decision coverage
* Level A: Level B plus 100% modified condition/decision coverage
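Because each level builds on the one below it, the list above behaves like a cumulative lookup. The sketch below encodes only this article's summary, reading the coverage required at Level C as statement coverage; it is not the standard's full table of objectives, and the names `LEVEL_ORDER`, `ADDED_AT_LEVEL` and `required_analysis` are made up for the example.

```python
# Cumulative summary taken from the list above, lowest level first.
LEVEL_ORDER = ["E", "D", "C", "B", "A"]
ADDED_AT_LEVEL = {
    "E": [],
    "D": ["requirements traceability"],
    "C": ["statement coverage"],
    "B": ["decision coverage"],
    "A": ["modified condition/decision coverage"],
}


def required_analysis(level):
    """Return every analysis required at the given level, lowest level first."""
    idx = LEVEL_ORDER.index(level.upper())
    required = []
    for lvl in LEVEL_ORDER[: idx + 1]:
        required.extend(ADDED_AT_LEVEL[lvl])
    return required


for level in LEVEL_ORDER:
    print(level, required_analysis(level))
```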
The DO-178B specification spells out what, and to a large degree, how a flight system must be designed, implemented, tested and maintained.
At the other extreme in specifying safety for a device is the FDA's approach. The Food and Drug Administration's (FDA) 510(k) process requires that manufacturers notify the FDA 90 days before they plan to market a medical device. It is similar to the FAA's DO-178B in that its intent is to make sure that medical devices are designed and deployed in a manner that ensures patient and operator safety.
The FDA takes a "kinder, gentler" approach to device design. In their guidance documents, they state that it is their desire to allow developers to use a "Least Burdensome" approach. I am not implying that this particular standard is more lax than the FAA's. The FDA's approach does not constrain development to be done according to a single paradigm.
One company could use extreme programming techniques and another could use the traditional waterfall approach. As long as both companies adhere to the practices that they document and provide proof of due diligence, both approaches are fine with the FDA.
Above and beyond the FDA regulations on device development, the nature of US liability law makes it in the best interest of a medical device manufacturer to deliver very safe products.
Converging to Software Control
Historically, operator, plant and stakeholder safety depended on operator training, physical barriers, mechanical interrupts and mechanical fail safes and lockouts. As technology evolved, so did the safety systems. Electrical interrupts and lockouts replaced mechanical ones, and physical barriers were replaced by beams and light curtains. The really disruptive aspects of technology occurred when plant systems that depended on operator control and intervention started becoming "automated." The machinery began to think for itself.
There are a multitude of reasons for using programmable logic and electronics in industrial devices. In some cases, it is because the plant operation becomes so fast or complicated that a human can no longer keep up with the task. It could also be said that quality control is better: computer-based systems don't have bad days or end-of-shift fatigue. In reality, the reason for the explosion of automation can be summed up in two words: cost reduction.
Digital systems are faster, more precise and, over the long haul, less expensive than a $35 an hour laborer who has a pension. As in the BMW example given earlier, it is far more cost effective to replace a wiring harness or pneumatic actuators with a single wire or bus control system. Not only does it reduce the BOM for the system, but in most cases, the labor involved in installation is lower. In large interconnected systems such as a paper machine, the savings in material and labor to install it can make the difference between a positive ROI and a negative ROI.
One of my first jobs as an adult was working as an industrial electrician at a local paper mill. I pulled many a mile of cable that year, working with hundreds of others doing the same. At the same time the instrumentation crews bent and installed thousands of miles of pneumatic tubing. While there is still a need for the cabling required to power the thousands of motors that are used in a paper machine, most of the "one switch, one control cable" and pneumatics can be replaced with busses, each of which can support many switches and controllers.
The mill had a number of processes that were largely performed using programmable logic elements. At that time the wisdom was that automation required a redundant or isolated safety system. That way if the control portion of the bus system went nuts and started a broadcast storm that caused a process to malfunction, the safety related system could still put the machine in a safe state. This "separation of church and state" approach works pretty well, but redundancy is expensive.
Jack Ganssle said recently that the most expensive thing in the universe is software. That is true, but it is only true because the next best alternative (doing it purely with logic circuits) is prohibitively expensive.
Cultural and Philosophical
There are several cultural differences between the US and Europe as to the evolution of safe software standards and the overall acceptance of them between the two geographical regions.
Europeans in general are used to more regulation in their daily lives and European governments tend to be more supportive of standards. European states use standards and certifications as barriers to trade. The European legal system is somewhat sympathetic to companies who comply with standards groups as opposed to those who do not comply with them.
Compliance with standards tends to protect manufacturers against liability in the event that they produced an unsafe product. Furthermore, European workers are motivated to adhere to safety standards as they, as individuals, are likely to be held civilly or criminally responsible for the products they develop. In fact, it is the personal responsibility of the chief officers of the company to make every effort to ensure safe products are developed.
Some European companies take this so far as to have their officers sign a "Declaration of Conformity" to ensure that the device was produced in accordance with standards and is in compliance with national standards.
In the US, rightly or wrongly, acceptance of standards and common practices, no matter how stringent, does nothing to mitigate a manufacturer's liability in the eyes of both the law and the jury. With the exception of those committing gross negligence (for example, an inebriated pilot crashing a plane), an employee will not face civil or criminal charges as a result of an unsafe product reaching the market.
So, the only reasons for US manufacturers to choose to adhere to a standard are that they see it as a marketing tool that differentiates them from their competitors, that it is a government regulation, or that they fear litigation if a product harms someone.
Do not misunderstand the prior statement. Many US companies do have internal coding, quality and safety standards that they follow; they are motivated by the market to produce safe products so that is not the issue. It is that there is rarely an incentive for them to join and follow external standards groups.
The Tipping Point
As a product marketing manager, one aspect of my job is to keep a finger on the pulse of the embedded space. I do a lot of reading, a lot of talking and, most of all, a lot of listening. I read blogs and trade journals, and I talk to a lot of people: customers, of course, but also lost sales and what essentially amounts to cold calls at trade shows. Since I am interested both personally and professionally in industrial automation, as well as safety critical applications such as avionics, I tend to ask questions pertaining to that aspect of people's projects.
What I am finding is that strategic thinking of developers and manufacturers of home, building and industrial automation is split along geographical lines. My perception of this split started 24 months ago. IEC 61508 was mentioned during a call with our German sales office.
I had never heard of it. Neither had any of the US-based customers I normally spoke with. DO-178B and 510(k) I was familiar with. Over the next few months, the German office reported more and more interest in IEC 61508. Then interest arose in France and the UK. I received two inquiries from Japan today.
A decade ago, the International Electrotechnical Commission issued the final version of its IEC 61508 specification governing the development of electrical/electronic/programmable electronic safety-related systems.
The main thrust of IEC 61508 is to provide "guidance" for developing devices that are functionally safe. In the context of IEC 61508, functional safety is defined as: "Functional safety is part of the overall safety that depends on a system or equipment operating correctly in response to its inputs. Functional safety is achieved when every specified safety function is carried out and the level of performance required of each safety function is met."
Basically, the standard strives to ensure that safety systems perform as specified, and if they fail, they fail in a manner that is safe. One thing that needs to be (re)emphasized is that when discussing safety in this context, reliability is not implied, only that if there is a failure, that it will fail safely.
In many ways, the IEC 61508 standard is very similar to the DO-178B standard. It is very structured in its approach in developing software. Unlike the DO-178B standard, the IEC 61508 standard does allow certification of standalone software. Basically, it allows software reuse without having to go through the process of recertifying the entire portion of code that has been previously certified. Of course, all of the code that can be precertified must be code that is independent of the hardware.
Even though all hardware-specific code such as drivers must still be certified, the ability to pre-certify generic code has a dramatic impact on the expense of developing safety systems. Since estimates for developing and certifying code to these standards run upward of $100 per line of code, the ability to amortize the cost of development over multiple projects is what makes such systems economically feasible.
It also makes commercially available, pre-certified software attractive, since a software vendor's business model is to amortize its development costs over many, many sales. An added benefit is that manufacturers can add features such as USB or Ethernet connectivity at a reasonable price, where before they could not afford to certify the extra tens of thousands of lines of code.
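The amortization argument is simple arithmetic, sketched below. The $100 per line figure is the estimate quoted above; the line counts and project counts are hypothetical, as is the function name `cost_per_project`.

```python
def cost_per_project(generic_loc, specific_loc, projects, cost_per_loc=100.0):
    """Certification cost borne by each project when generic code is certified once.

    generic_loc:  hardware-independent lines, certified once and reused
    specific_loc: hardware-specific lines (drivers, for example), certified per project
    projects:     number of projects sharing the pre-certified generic code
    cost_per_loc: rough cost per line; $100 is the estimate quoted in the text
    """
    if projects < 1:
        raise ValueError("projects must be at least 1")
    shared = generic_loc * cost_per_loc / projects
    return shared + specific_loc * cost_per_loc


# Hypothetical product: 20,000 generic lines and 2,000 driver lines.
single = cost_per_project(20_000, 2_000, projects=1)
shared = cost_per_project(20_000, 2_000, projects=5)
print(f"one project:   ${single:,.0f}")
print(f"five projects: ${shared:,.0f} each")
```

With these made-up numbers the per-project cost falls from $2.2 million to $600,000 once the generic code is reused across five projects; the driver code is paid for in full every time.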
Another bright spot for manufacturers is that the standard allows developers to partition their systems into safe and non-safe feature sets. When properly implemented, by using MMU hardware, the standard allows developers to avoid the costly burden of validating the application code that runs in the partition and does not perform safety related activities. While not a trivial task in terms of the work needed to guarantee the non-safe partition can't bring down the safety related partition, the benefits to the manufacturer and end customer are immense (when the other options involve the validation process at a cost of $100s per LOC).
Another major difference between DO-178B and IEC 61508 is that at its highest level of safety, SIL 4, IEC 61508 is stricter in how that safety is achieved. Just as with DO-178B, as one works through the four levels of failure reduction, SIL 1 through SIL 4, the degree of functional and structural analysis becomes more rigorous. Unlike DO-178B, at its highest level, SIL 4, IEC 61508 calls for redundancy.
Not only does it call for the use of multiple (at least two) processors, but it also calls for two or more different types of processors (ARM vs. MIPS, for example), with the software for each processor written by a different team. For more information on hardware redundancies, see IEC 61508-2. For more information about using different implementation teams, see IEC 61508-3 section 22.214.171.124 and IEC 61508-7 Appendix B 1.5 and C 3.1 to C 3.5.
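The idea of diverse redundancy can be simulated in a single process, with the caveat that a real system would run each channel on a different processor. In the sketch below, two deliberately different implementations of the same hypothetical requirement (a proportional correction clamped to the range -1 to 1) are compared, and any disagreement or fault drives the output to an assumed safe value. None of the names or values come from IEC 61508.

```python
SAFE_OUTPUT = 0.0  # assumed safe value: command no actuator motion


def channel_a(setpoint, measured):
    """Team A's version: proportional correction clamped to [-1, 1]."""
    return max(-1.0, min(1.0, 0.5 * (setpoint - measured)))


def channel_b(setpoint, measured):
    """Team B's independently written version of the same requirement."""
    error = (setpoint - measured) / 2.0
    if error > 1.0:
        return 1.0
    if error < -1.0:
        return -1.0
    return error


def voted_output(setpoint, measured, channels=(channel_a, channel_b), tolerance=1e-6):
    """Return the agreed command, or SAFE_OUTPUT if the channels disagree or fault."""
    try:
        results = [channel(setpoint, measured) for channel in channels]
    except Exception:
        return SAFE_OUTPUT               # a crashed channel counts as a failure
    if max(results) - min(results) > tolerance:
        return SAFE_OUTPUT               # disagreement: trust neither channel
    return results[0]


def buggy_channel(setpoint, measured):
    """Simulates a channel with a coding error (wrong gain)."""
    return 0.9 * (setpoint - measured)


print(voted_output(1.0, 0.0))                                        # channels agree
print(voted_output(1.0, 0.0, channels=(channel_a, buggy_channel)))   # mismatch
```

The value of having different teams and different processors is that a single design mistake or compiler defect is unlikely to appear identically in both channels, so it shows up as a disagreement instead of a silently wrong command.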
Opportunities for Cost Reduction
Automation was first introduced to improve quality, efficiency and productivity. However, some of those gains were offset due to the need to develop safety systems to deal with automation.
That required redundant systems to monitor the automated systems. With them came the added expense of isolated busses and control systems. So expense added up, not only due to the development of the safety system, but its manufacture and installation as well.
I think in general we can say that manufacturers are developing safe devices, whether they adhere to a safety standard that was developed in house or to an open standard developed by a committee. It is infrequent that a truly catastrophic event occurs due to a software error.
That safety record has come at a relatively high cost when comparing features and functionality to device counterparts that occupy the consumer space. The question arises, which way is better: a proprietary, in-house safety standard or an open standard such as IEC 61508?
There is some data available on this question. The quantitative approach used by many safety standards reduces costs by preventing both overengineering and underengineering. Shell Global Solutions, for example, cut up to 20% from the cost of implementing safety systems. Extensive investigation showed that about 65% of safety functions are overengineered, while 10% are actually underengineered and represent a weak link in the overall safety management of the facility. Only 25% didn't require changes. (exida.com)
The question of "Can adherence to a safety standard save money?" is answered positively. Now, what about the question of whether adherence will make money? I think the answer to that question is also yes. From my small and clearly unscientific study of our current and potential customer base, I can conclude that if one does not begin to plan for utilizing the design and maintenance guidelines set forth in standards such as IEC 61508, one is effectively writing off a growing segment of the international market. Will IEC 61508 go the way of the Dodo bird and ISO9000? Only time will tell. Right now, it seems it is becoming established and that momentum is growing.
FIPS 140-2. On May 26, 2006, the Federal Information Processing Standard (FIPS) 140-2 "Security Requirements for Cryptographic Modules" took effect. The standard was developed in conjunction with the NSA and is published by the National Institute of Standards and Technology (NIST).
It describes the requirements and standards that a hardware and/or software product must meet to be purchased for government use involving Sensitive But Unclassified (SBU) information. The standard has been adopted by the Canadian Communications Security Establishment (CSE) as well as the American National Standards Institute.
In essence, FIPS 140-2 specifies the security requirements that must be met by a cryptographic module used to protect sensitive but unclassified information. The standard covers all computer and communication systems, providing four levels of increasing security: Level 1, Level 2, Level 3 and Level 4. Many of the devices requiring adherence to FIPS 140-2 are easy to identify: PCs, laptops, printers, routers, switches, basically anything attached to the network.
Others are not identified so intuitively; things like telephones, both traditional and IP-based, are covered. What about cell phones? It is possible, with the advent of combining traditional cell with VoIP-based services, that the lines are being blurred.
HIPAA. To improve the efficiency and effectiveness of the health care system, the Health Insurance Portability and Accountability Act (HIPAA) of 1996, Public Law 104-191, included "Administrative Simplification" provisions that required Health and Human Services (HHS) to adopt national standards for electronic healthcare transactions. At the same time, Congress recognized that advances in electronic technology could erode the privacy of health information.
Consequently, Congress incorporated into HIPAA provisions that mandated the adoption of Federal privacy protections for individually identifiable health information.
This new U.S. regulation gives patients greater access to their own medical records and more control over how their personally identifiable health information is used. The regulation also addresses the obligations of healthcare providers and health plans to protect health information.
There are many more software safety standards than the few that are mentioned in this paper. However, the IEC 61508 standard seems to be becoming a de facto standard, especially in areas where previously there were either no standards for the industry or no regulatory reasons to adopt one.
One of the primary reasons is that IEC 61508 is a standard that is generic in application, but comprehensive in its approach to achieving safety. Companies that previously utilized proprietary or in-house standards are adopting IEC 61508 as a marketing tool to prevent them from being shut out of markets.
Another factor that may drive North American manufacturers to adopt IEC 61508 is the 2002 Sarbanes-Oxley Act governing the behavior of corporate management. In the litigious society of North America, it will only be a matter of time before some enterprising attorney connects the Sarbanes-Oxley Act with an unfortunate software failure.
Furthermore, mandates for security in various segments of government, healthcare and finance are forcing manufacturers of infrastructure and office equipment to either bear the expense of conforming to security standards or write those markets off entirely.
Because of the rapid convergence of functionality into such things as cell phones, it is my personal belief that not only will these sorts of safety and security requirements thrive in the current areas of acceptance, but they will also grow into other areas. I also believe that it is better to adopt them now while they provide a marketable differentiation in a product that will command a premium, rather than wait until it is just an expected commodity feature of a product that commands no value.
Todd Brian is a product manager for Accelerated Technology, an Embedded Systems Division of Mentor Graphics where he is responsible for kernels and related products.
This article is excerpted from a paper of the same name presented at the Embedded Systems Conference Boston 2006. Used with permission of the Embedded Systems Conference. For more information, please visit www.embedded.com/esc/boston/
1) Garfinkel, Simson. "History's Worst Software Bugs." Wired News, Nov. 8, 2005.
2) Validated Software's FAQ Page
3) Ganssle, Jack. The Embedded Muse 124, Feb. 9, 2006.
4) IEC Web Page: Functional Safety Zone - E
5) IEC Web Page: IEC-61508