A patient undergoing radiation therapy for cancer wants to be sure that the radiation being delivered is just the amount prescribed and no more. Nuclear power plants must have systems installed to ensure that radiation leaks and accidents do not occur. Today, controlling these protection systems flawlessly depends upon computer software, which occasionally contains unforeseen "bugs." Software bugs on your computer at home are annoying, but at a nuclear power plant or during radiation therapy they can be life-threatening.
At Lawrence Livermore, Gary Johnson's Computer Safety and Reliability Group--part of the Fission Energy and Systems Safety Program--has been working with the Nuclear Regulatory Commission for several years to avoid software problems in safety systems at nuclear power plants. Livermore brings to this job decades of systems engineering experience as well as a regulatory perspective from years of working with the NRC and other regulators.
Johnson's group and the NRC developed software and computer system design guidance that the NRC uses to evaluate the design of safety-critical systems for U.S. plant retrofits. Overseas, where new nuclear power plants are being built, regulators and designers are using this state-of-the-art guidance to help assure plant safety. For the last few years, representatives from Hungary, the Czech Republic, Ukraine, Korea, Taiwan, and Japan have been calling upon Johnson and his group for assistance in setting criteria for their nuclear power plant control systems.
This software design guidance is also applicable to other computer-controlled systems that could endanger human life if they are poorly designed--medical radiation machines, aircraft flight control systems, and railroad signals, for example.

When Software Fails
Perhaps the best-documented example of the harm resulting from poorly designed software involved the Therac-25, an accelerator used in medical radiation therapy. IEEE Computer Applications in Power reported, "Between June 1985 and January 1987, six known accidents involved massive overdoses by the Therac-25--with resultant deaths and serious injuries" at treatment centers around the U.S. and in Canada.1 Between the accelerator and the patient was a turntable that could position either a window or an x-ray-mode target in the beam's path, depending on which of two modes of operation was being used. Software errors allowed the machine to deliver the intense beam intended for the x-ray-mode target while the window, rather than the target, was in the beam's path--a configuration that could, and did, result in disaster.
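The mismatch described above is the kind of condition a software interlock should refuse outright. As a minimal sketch (the mode names, turntable positions, and function are hypothetical illustrations, not the Therac-25's actual logic), a permissive check might require the turntable position to match the selected mode before the beam can fire:

```python
from enum import Enum

class Mode(Enum):
    ELECTRON = "electron"   # low-power beam; window belongs in the path
    XRAY = "xray"           # high-power beam; target must be in the path

class Turntable(Enum):
    WINDOW = "window"
    TARGET = "target"

def beam_permitted(mode: Mode, turntable: Turntable) -> bool:
    """Permit the beam only when the turntable position matches the mode."""
    required = {Mode.ELECTRON: Turntable.WINDOW, Mode.XRAY: Turntable.TARGET}
    return turntable is required[mode]
```

With a check of this shape, the dangerous combination--x-ray mode with the window in the beam's path--is rejected rather than silently accepted.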

Engineering Reliable Software
Conditions such as these, where software was assigned sole responsibility for safety systems and where a single software error or software-engineering error could have catastrophic results, are precisely what Johnson's group aimed to avoid when it helped to prepare the portion of the NRC's recently published Standard Review Plan2 regarding computer-based safety systems. Their process requires that software for nuclear power plant protection systems be written in accordance with good engineering practices. That is, software should follow a step-by-step approach of planning, defining requirements for worst-case scenarios, designing the software, and following a detailed inspection and testing program--known as verification and validation--during each step of development and installation. This process avoids the pitfalls that can occur when software designers are expected to begin writing code before the design is complete, a process akin to pouring concrete footings for a building before knowing how tall the building will be.
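One flavor of the verification-and-validation idea is that each requirement, once written, becomes something that can be checked at every later stage. A toy illustration (the requirement, threshold, and function names are invented for this sketch):

```python
# Requirement R1 (hypothetical): "If coolant flow drops below 80% of
# nominal, the trip signal shall be asserted."
def trip_on_low_flow(flow_fraction: float) -> bool:
    """Assert a trip demand when measured flow falls below the setpoint."""
    return flow_fraction < 0.80

# Verification: the requirement is exercised at and around its boundary
# before development proceeds to the next stage.
assert trip_on_low_flow(0.79)
assert not trip_on_low_flow(0.80)
assert not trip_on_low_flow(1.00)
```

The point is not the two-line function but the discipline: the requirement exists, in testable form, before any code that depends on it is written.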
Good engineering practices make for better software. But it is axiomatic in the safety-systems business that software can never be perfect. Figure 1, above, showing IBM's experience from the Space Shuttle program, illustrates the futility of attempting to produce perfect software for every need. The point of diminishing returns is reached well before perfection can be guaranteed.
When the consequences of failure are very high, safety-critical systems should always incorporate "defense-in-depth and diversity," which can be accomplished by incorporating different kinds of hardware, different design approaches, or different software programming languages. The idea is to have different kinds of systems available to accomplish the same goal; for example, a digital system is often backed up with a tried-and-true analog system. Then, if one system fails, a different system is in place to carry on. Simple redundancy, having two versions of the same system, is not enough because both could fail simultaneously as a result of the same flaw (Figure 2).

Many kinds of diversity are possible. Although there is only a limited scientific basis for determining which kinds of diversity are best or how much diversity is enough, experience has shown that an effective combination of protections is the use of different hardware and software acting on different measurements to initiate different protective actions. Based on that experience, the NRC's Standard Review Plan requires that at least two independent systems, incorporating multiple types of diversity, protect against each worst-case scenario.
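The "different measurements, different actions" idea can be sketched in a few lines. In this hypothetical example (the setpoints, measurements, and one-out-of-two logic are illustrative, not from any actual plant design), two independently implemented channels watch different process variables, and either one can initiate the protective action:

```python
def digital_trip(pressure_mpa: float) -> bool:
    """Primary channel: trip if reactor pressure exceeds its setpoint."""
    return pressure_mpa > 15.5

def diverse_trip(coolant_temp_c: float) -> bool:
    """Diverse channel: an independently implemented check on a
    different measurement, so it does not share the primary's flaws."""
    return coolant_temp_c > 330.0

def protection_demand(pressure_mpa: float, coolant_temp_c: float) -> bool:
    """One-out-of-two logic: either channel alone can demand a trip,
    so a single software error cannot defeat the protective action."""
    return digital_trip(pressure_mpa) or diverse_trip(coolant_temp_c)
```

Contrast this with simple redundancy: two copies of `digital_trip` would both miss the same event if the shared code contained the same bug.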

Making the Systems Better
The Institute of Electrical and Electronics Engineers has defined good engineering processes for the design of computer software, but there is little agreement on such specifics as notation, the preparation of specifications, or design and analysis methods. Testing is another challenge. Although testing methodologies are available, there are as yet no mathematical equations or models that can be used to evaluate a software program to determine whether it is dependable or whether it contains errors and may fail.
Johnson's group has three projects under way that tackle the software quality issue head on--developing a method for determining the reliability of high-integrity safety systems, developing techniques to test commercial software, and developing methods for establishing the requirements to which software is written.
The first project is related to the current move toward "risk-based regulation," which uses assessments of risk to determine how to regulate. Livermore is developing a methodology for determining software reliability (and conversely, the failure probability of software). Information about software reliability will be integrated with hardware reliability measurements to form a measure of overall system reliability. The result will be used in NRC probabilistic risk assessment analyses.
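How software and hardware reliability figures might be combined can be shown with elementary probability. This sketch is only an illustration of the arithmetic (the failure probabilities are invented, and the independence assumption it relies on is exactly what common-cause failures violate--which is why diversity matters):

```python
def channel_failure_prob(p_hw: float, p_sw: float) -> float:
    """A channel fails if its hardware OR its software fails,
    assuming the two failure modes are independent."""
    return 1.0 - (1.0 - p_hw) * (1.0 - p_sw)

def two_channel_failure_prob(p1: float, p2: float) -> float:
    """Both independent, diverse channels must fail for the
    protection system as a whole to fail."""
    return p1 * p2

# Illustrative numbers: per-demand failure probabilities of 1e-4
# (hardware) and 1e-3 (software) give a channel figure near 1.1e-3;
# two such independent channels together come in near 1.2e-6.
p_channel = channel_failure_prob(1e-4, 1e-3)
p_system = two_channel_failure_prob(p_channel, p_channel)
```

Estimating the software term `p_sw` is the hard part--and the subject of the Livermore methodology.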
In the second project, the group is looking for ways to "qualify" software that has not been through a rigorous development process, such as software that is frequently embedded in commercial instrumentation and control components. Demonstrating the dependability of such software is a challenge because thoroughly testing the large numbers of program states and combinations of input conditions is virtually impossible. The group is identifying techniques that limit the inputs and conditions that must be tested to qualify a software design. Considerable research has been done on software integrity, but this particular approach has not been tried before. If this project is successful, the process of introducing new control systems into nuclear power plants will be simplified.
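One common technique for shrinking the input space--offered here only as a plausible example of the general idea, not as the group's specific method--is equivalence partitioning with boundary-value tests: inputs that the software should treat identically are grouped into classes, and one representative per class plus the class boundaries are tested instead of every possible value:

```python
def setpoint_ok(temp_c: float) -> bool:
    """Hypothetical embedded function under qualification: reports
    whether a temperature lies in the valid operating band."""
    return 0.0 <= temp_c <= 350.0

# Instead of testing every representable float, test one value from
# each equivalence class plus the boundaries between classes.
test_points = [-40.0,        # below-range class
               0.0, 350.0,   # boundary values
               175.0,        # in-range class
               400.0]        # above-range class
results = [setpoint_ok(t) for t in test_points]
```

Five inputs stand in for an effectively infinite domain; the open question the project addresses is how to justify such reductions rigorously for software whose internals are not fully known.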
The aim of the third project is to clarify the project goals, or requirements, that are developed before software programs for protection systems are written. Requirements must be carefully specified, but it is almost impossible to anticipate every situation. Livermore engineers are working with Sandia National Laboratories, Albuquerque, to adapt techniques used to design control systems for the defense industry. In the process, they are developing methods for determining the fundamental goals of a computerized control system.
As computers become more powerful and software more complex, projects such as these will go far to increase confidence in the reliability of systems designed to protect workers and the public.
-- Katie Walter

Key Words: instrumentation and control systems, nuclear power plants, safety-critical systems, software engineering.

1. "An Investigation of the Therac-25 Accidents," IEEE Computer Applications in Power, July 1993, pp. 18-41.
2. U.S. Nuclear Regulatory Commission, Office of Nuclear Reactor Regulation, "Instrumentation and Controls," Chapter 7, Standard Review Plan, NUREG-0800, Rev. 4, June 1997.

For further information contact Gary Johnson (925) 423-8834 (johnson27@llnl.gov). Publications from Livermore's Computer Safety and Reliability Center are available at http://nssc.llnl.gov/FESSP/CSRC/refs.html.
