## Reliable design of concurrent software

*Gerard is a member of the Computing Sciences Research Center at Bell Labs in Murray Hill, New Jersey. He can be contacted at gerard@research.bell-labs.com.*

Whether we like it or not, we often design software either by trial and error or by duplicating and modifying a piece of code that does something similar to what we want. This works fine for small applications, but fails miserably for large design projects or for critical code (code that controls a nuclear power plant, manages your bank account, and the like).

In particular, conventional static- and dynamic-testing techniques quickly break down if they are applied to distributed-systems software. The bugs in such systems are usually triggered by unreproducible event sequences that can make debugging a nightmare. Tools such as the Spin model checker, however, can help you build reliable systems. In this article, I'll explain how Spin works, and what types of errors it can help you find.

### Model Checking

The methods an engineer uses to design a bridge have parallels in programming. The essential steps are to be explicit about the requirements for a new design and to build design prototypes that can be checked for compliance with these requirements. This may sound forbidding at first, but with the right tools, all this can become quite straightforward.

How can you build a prototype of a new program? Actually, this is more common than you might imagine. It is precisely what writers do when they write about programming. To justify a particular programming procedure, the author gives a rather abstract algorithmic description of the program, often in a pseudolanguage. All you have to do is to standardize this pseudolanguage, develop a tool that inspects the descriptions in this language, and compare them with a list of requirements. This procedure is called "model checking." If applied correctly, the method can be amazingly effective in shaking out very subtle bugs in program designs.

Model checkers such as Spin come with a specification language for quickly defining prototypes of distributed-systems code, and a simple notation for defining the requirements for the correctness of that code. There are two conditions that must be fulfilled for model checking to work -- the model (that is, the system prototype) must be finite state and it must be closed.

A system is closed if no part of the system behavior is left undefined. This is not true (in general) for sequential computer programs, since their specific behavior can depend on the inputs that they receive. To verify such a program, the input source has to be included in the model to close the system.

A system is finite state if it only has a finite number of potential states. The state of a system is simply the exhaustive set of all its data values and control points. If there are a finite number of such state descriptions, a verifier could, in principle, perform any type of verification by enumerating the reachable states and checking for the adherence to or violation of the design requirements.

### The Spin Model Checker

Spin, which was developed by Bell Labs' formal methods and verification group, is a freely available software package (http://netlib.bell-labs.com/netlib/spin/) that supports the formal verification of distributed systems. Spin has been used to trace logical design errors in distributed-systems design -- operating systems, data-communications protocols, switching systems, concurrent algorithms, railway- signaling protocols, and the like. Spin checks the logical consistency of a specification, reporting on deadlocks, unspecified receptions, incomplete flags, race conditions, and unwarranted assumptions about the relative speeds of processes.

Figure 1 is Spin's main control window. Spin is written in ANSI standard C and runs on UNIX and Windows 95. XSpin, Spin's graphical interface, is a simple Tcl/Tk application that operates independent of Spin itself.

To verify a design, a formal model is built using Promela, Spin's input language. Promela, short for "Process Meta Language," is a nondeterministic language, loosely based on Dijkstra's guarded command language notation and C.A.R. Hoare's CSP language, extended with some powerful new constructs. It contains the primitives for specifying asynchronous (buffered) message passing via channels, with arbitrary numbers of message parameters. It also allows for the specification of synchronous message-passing systems. Mixed systems, using both synchronous and asynchronous communications, are also supported.

The language can model dynamically expanding and shrinking systems: New processes and message channels can be created and deleted on the fly. Message channel identifiers can be passed from one process to another in messages.

Correctness properties can be specified as standard system or process invariants (using assertions), or as general linear temporal logic requirements (LTL), either directly in the syntax of next-time free LTL, or indirectly as Buchi Automata (expressed in Promela syntax as "Never Claims").

### A Spin Example

New and old algorithms for achieving mutual exclusion in a distributed system via software provide a rich source of examples to demonstrate the power of model-checking techniques. To get the flavor of this, look at Listing One which is one of the first algorithms suggested to solve this problem in 1966 (see "Comments on a Problem in Concurrent Programming Control," by H. Hyman, *Communications of the ACM*, 1966, Volume 9).

This algorithm was proposed as a simplification of a (correct) algorithm by the Dutch mathematician Dekker, and published by Edsger Dijkstra in 1965 (see "Solution of a Problem in Concurrent Programming Control," by E.W. Dijkstra, *Communications of the ACM*, 1965, Volume 8). In this description, the value of *j* is defined to be the opposite of *i*; that is, equal to 1-*i*. The algorithm can be executed by two processes concurrently, sharing access to the global variables. The algorithm is meant to ensure that the two processes cannot simultaneously execute the part labeled *critical section*.

It is a simple matter to turn this description into a model that has all the properties you desire -- being in a finite state and closed.

Listing Two, implemented in Promela, is close to the original, with some minor syntactical differences for specifying control-flow structures. The *if..fi* construct from Promela is based on an older notation for specifying nondeterministic guarded commands, also due to Dijkstra (see "Guarded Commands, Nondeterminacy and Formal Derivation of Programs," by E.W. Dijkstra, *Communications of the ACM*, 1975, Volume 18), that turns out to be eminently suitable for the description of high-level models of distributed software.

Unlike a normal programming language, the options in the *if..fi* selection construct need not be mutually exclusive. If more than one option is selectable, the choice between them is nondeterministic.

To check the properties of this algorithm, you have to be a little more explicit about the requirements for its correctness. I have done so in this model by adding two assertions. The first one, on line 5, merely checks that the processes have identifiers (predefined in Promela as *_pid*) that match our assumptions. If the *_pid* is either 0 or 1, you can obtain the *_pid* of the competing process with 1-*_pid*.

The second assertion, on line 16, states the actual mutual-exclusion requirement. If mutual exclusion is correctly maintained, the value of the variable *cnt* can never exceed 1.

It takes Spin a fraction of a second to establish that the first assertion is valid, but the second is not -- the mutual-exclusion property can be violated. Now all this would still be of little use unless the tool is able to tell you just exactly how the violation can occur. Spin does so by generating an execution sequence for the model that leads from the initial program state to the violation of the assertion on line 16. There are a number of different ways to explore the sequence. The default is to have it printed as a simple interleaving sequence, displayed on an ASCII terminal as in Listing Three.

The trace can also be made more informative by adding the printouts of variable values and message-channel contents (not used in this example) at every step. A front-end tool, XSpin-to-Spin can also display the sequence graphically, or it can separate the actions of the two processes and print the sequence in two columns; see Listing Four.

In any case, it takes considerably less effort to show that the design is flawed than it took the creator of the algorithm to argue the opposite.

The model built for the mutual-exclusion algorithm trivially has the two required properties that allow model-checking techniques to apply: It defines a closed system, with no undefined external stimuli, and it defines a finite-state system, with only a finite number of control states per process and a finite number of possible values per control variable used.

Figure 1 is an annotated version of the mutual-exclusion example. Figure 2 shows the automaton for one of the processes in the mutual-exclusion example, thereby illustrating how Spin works internally. Spin computes a partial product of automata that it compares against a temporal formula that contains the requirements. The automata that Spin generates behind the scenes can be made visible with XSpin. Finally, Figure 3 shows how Spin reports bugs as execution sequences. By default, these sequences are printed in ASCII. This figure shows an alternative provided by XSpin, in which sequences are shown graphically (in this case, for a leader-election protocol in a distributed ring). By hovering the cursor over the chart, the corresponding source line in the main Spin control window is highlighted.

### Conclusion

The advantage of the model-checking technique is that if the tool completes its search for bugs and comes back empty-handed, this will actually guarantee that all explicitly stated requirements are satisfied for the model, a much stronger result than can be achieved with any other debugging technique. Of course, there are also disadvantages, among them that it is not easy to extend the model-checking technique to debug models for arbitrary (and possibly infinite) system components. Another is that showing the correctness of the major design decisions captured in a model does not necessarily guarantee that no new bugs are introduced when the model is implemented. As always, there's no free lunch. Model-checking tools can only help you rule out one class of errors that can be hard, if not impossible, to find with standard static or dynamic testing techniques.

Model checkers are not restricted to checking simple state assertions. Basically, any type of requirement on valid or invalid execution of the system can be formalized in the underlying logic of the model checker. The underlying logic is linear temporal logic (LTL). The logic contains simple operators to state requirements of sequences of conditions that ought to be satisfied; that is, that once a process tries to enter its critical section, it eventually must gain access. (Not true for the example algorithm, as Spin can show with an example in a fraction of a second. To appreciate this capability, try to find an execution sequence that denies one of the processes entry to the critical section forever.)

The model checker has been applied to many problems of considerable size. While Spin has found bugs in distributed algorithms, protocol designs, and the like, it is not just for big design problems.

#### For More Information

Griffin Technologies LLC

1617 Saint Andrews Drive

Lawrence, KS 66047

785-832-2070

800-986-6578

http://www.griftech.com/

#### Listing One

1 Boolean array b(0;1) integer k, i, j, 2 comment process i, with i either 0 or 1; 3 C0: b(i) := false; 4 C1: ifk != i then begin 5 C2: if not (b(j) then go to C2; 6 else k := i; go to C1 end; 7 else critical section; 8 b(i) := true; 9 remainder of program; 10 go to C0; 11 end

#### Listing Two

1 int cnt, k, b[2]; /* all variables are initially 0 */ 2 3 active [2] proctype P() 4 { 5 assert(_pid == 0 || _pid == 1); 6 7 C0: b[_pid] = 0; 8 C1: if 9 :: (k != _pid) -> 10 C2: (b[1 - _pid]); /* wait for this to be nonzero */ 11 k = _pid; 12 goto C1 13 14 :: else -> 15 cnt = cnt+1; 16 assert(cnt == 1); /* the critical section */ 17 cnt = cnt-1; 18 19 b[_pid] = 1; 20 fi; 21 goto C0 22 }

#### Listing Three

1: proc 1 (P) line 5 "hy66" (state 1) [assert(((_pid==0)||(_pid==1)))] 2: proc 1 (P) line 7 "hy66" (state 2) [b[_pid] = 0] 3: proc 1 (P) line 9 "hy66" (state 3) [((k!=_pid))] 4: proc 0 (P) line 5 "hy66" (state 1) [assert(((_pid==0)||(_pid==1)))] 5: proc 0 (P) line 7 "hy66" (state 2) [b[_pid] = 0] 6: proc 0 (P) line 14 "hy66" (state 7) [else] 7: proc 0 (P) line 15 "hy66" (state 8) [cnt = (cnt+1)] 8: proc 0 (P) line 16 "hy66" (state 9) [assert((cnt==1))] 9: proc 0 (P) line 17 "hy66" (state 10) [cnt = (cnt-1)] 10: proc 0 (P) line 19 "hy66" (state 11) [b[_pid] = 1] 11: proc 1 (P) line 10 "hy66" (state 4) [(b[(1-_pid)])] 12: proc 0 (P) line 7 "hy66" (state 2) [b[_pid] = 0] 13: proc 0 (P) line 14 "hy66" (state 7) [else] 14: proc 1 (P) line 11 "hy66" (state 5) [k = _pid] 15: proc 1 (P) line 14 "hy66" (state 7) [else] 16: proc 1 (P) line 15 "hy66" (state 8) [cnt = (cnt+1)] 17: proc 1 (P) line 16 "hy66" (state 9) [assert((cnt==1))] 18: proc 0 (P) line 15 "hy66" (state 8) [cnt = (cnt+1)] spin: line 16 "hy66", Error: assertion violated 19: proc 0 (P) line 16 "hy66" (state 9) [assert((cnt==1))] spin: trail ends after 19 steps #processes: 2 cnt = 2 k = 1 b[0] = 0 b[1] = 0 19: proc 1 (P) line 17 "hy66" (state 10) 19: proc 0 (P) line 17 "hy66" (state 10) 2 processes created

#### Listing Four

_pid: 0 1 | |>assert(((_pid==0)||(_pid==1))) | |>b[_pid] = 0 | |>((k!=_pid)) |>assert(((_pid==0)||(_pid==1))) |>b[_pid] = 0 |>else |>cnt = (cnt+1) |>assert((cnt==1)) |>cnt = (cnt-1) |>b[_pid] = 1 | |>(b[(1-_pid)]) |>b[_pid] = 0 |>else | |>k = _pid | |>else | |>cnt = (cnt+1) | |>assert((cnt==1)) |>cnt = (cnt+1) |>assert((cnt==1)) # assertion violated

**DDJ**

*Copyright © 1997, Dr. Dobb's Journal*