Parker's Perceptions


OCT89: PROGRAMMING PARADIGMS

I pulled up in front of Dave Parker's house with a trunkful of preconceptions, the consequence of knowing too much history.

There are several reasons why research in neural networks went quiescent for over a decade, but the most effective single cause for the big chill was surely the book Perceptrons. In that book, Marvin Minsky and Seymour Papert proved that the single-level Perceptron model could not compute certain basic functions of its inputs. Then M&P went on to speculate, erroneously, that research in multi-level Perceptrons and similar models would be equally sterile. Anyone looking for funding for neural net research after Perceptrons came out met a cold response.

There are also several reasons why neural network research has been heating up of late, and why it has begun to produce some success stories, but the most effective single cause may have been the discovery -- by several people at different times -- of an algorithm called "back propagation." "Backprop" shows that Minsky and Papert were wrong about the sterility of multi-level nets. One of the independent discoverers of this algorithm is Dave Parker. A lot of money is now flowing into neural net research and development, and Parker is one of the people who turned the faucet back on.

None of that money is flowing in Parker's direction. Today, he works out of his modest Silicon Valley home, running a small graphics software company. His company, Acrobits, hasn't made much of a splash yet, although it's a young company and it seems to have a good product. My preconceptions said that Parker wasn't profiting from the current interest in neural networks -- either financially or in prestige -- as much as others were who had made less significant contributions to the field. My preconceptions said that a discoverer of the back-propagation algorithm, who was now writing graphics software for PCs, had reason to be, if not bitter, at least a little cynical.

Parker threw a bucket of clarity on my preconceptions. When I gently prodded him about why he wasn't working in neural nets, he explained patiently that, fundamentally, this hot new research area involved nothing more earth-shaking than some parallel implementations of familiar minimization algorithms. As for the value of his own back-propagation algorithm, the day is past, he explained, when you could make a living selling an algorithm. Of course, some people in neural nets are making a very good living. But Parker doesn't seem to care one way or the other. He enjoys what he's doing -- programming, running his own business, satisfying customers rather than grant funders -- and likes the feeling that he's in control of where his money is coming from. He continues to track research and developments in neural networks, but his own present involvement in the area is limited to occasional lectures and classes, where his emphasis is on demystifying the complex of algorithms, architectures, and ideas that are collectively called neural networks.

After I met Dave's wife, who is building a laser in the garage, and their two children, one walking, one crawling, we went into the office, which the contractor no doubt thought was a spare bedroom. Here Parker started pulling out neural net materials for me: articles, transparencies, a listing of a Turbo Pascal program demonstrating back propagation. Since Parker hasn't worked in the area in two years, the neural nets shelf of his bookcase was a disorganized heap of papers and journals. It reminded me of the confusing state of the literature on neural nets, and I mentioned this as we put cups of instant coffee into the microwave.

Parker's response was to give me a dose of demystification.

Parker: It is confusing. People may make up new names and use all kinds of jargon. There are whole strains of neural networks with their own jargon. Like Brain State in a Box: This is a kind of parallel steepest descent, but it has its own jargon because it was developed at one university and everybody there talks the same. You read the literature and you think: How does this relate to anything else?

Swaine: Is there some way to keep various models straight?

Parker: Yes. It's very simple if you view everything as a parallel implementation of a minimization algorithm. Learning can be viewed as a minimization problem, so you just ask, What is the minimization algorithm being implemented?

Swaine: I see. The whole point of neural nets is that they learn, so you can sort out the different models by asking how they learn. Let's try it. Hopfield Net.

Parker: OK. You ask, What minimization algorithm does it implement? The answer is, none. A Hopfield Net doesn't learn. Hopfield nets are useful; they can tell us some things about the capabilities of neural nets, but they don't learn anything.

Swaine: That was easy. Boltzmann Machine.

Parker: A Boltzmann Machine can learn. Since it learns, a Boltzmann Machine is a (parallel) implementation of some sort of minimization algorithm. The question is, what algorithm? [He picks up a pile of paper from a desk, drafts for an Acrobits ad, turns them over, and begins drawing on the back of the top sheet.] The simplest way to see it is to look at the one-dimensional case. Plot all the weights on one axis, and performance on the other. [He draws two axes and a sinusoidal curve, with two dips, one shallow and one deep.] The weights are just all of the adjustable parameters in the network; in the brain, these correspond to permeabilities and whatnot. You're plotting everything you can adjust versus how well your brain -- or your neural network -- does.

Swaine: So the algorithm is trying to find the set of weights that minimizes the performance measure -- which is probably something like number of errors, or distance from a goal, since it's something we want to minimize.

Parker: Right. One very crude algorithm is to pick a random set of weights, see how you do, remember the set, then pick another random set of weights. If this set is better, remember it; if it's worse, forget it. And one way of implementing the randomization is to add noise to the weights you've chosen.
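
Parker's crude algorithm is easy to put into code. Here is a minimal Turbo Pascal sketch of it -- mine, not Parker's, and the two-dip Performance curve is a made-up stand-in for a real network's error measure:

Program RandomSearchDemo;
Var Best, Trial, BestScore, NoiseScale: Real;
    Step: Integer;

Function Performance( W: Real ): Real;
Begin (* A made-up curve with a shallow dip and a deeper one. *)
Performance := Sqr(Sqr(W)-1.0) + 0.3*W;
End;

Begin
NoiseScale := 0.5;                 (* Width of the added noise.   *)
Best := 2.0*Random - 1.0;          (* Start from a random weight. *)
BestScore := Performance(Best);
For Step := 1 To 10000 Do
    Begin
    Trial := Best + NoiseScale*(2.0*Random-1.0); (* Add noise.    *)
    If Performance(Trial) < BestScore Then       (* Better? Keep  *)
       Begin                                     (* it; otherwise *)
       Best := Trial;                            (* forget it.    *)
       BestScore := Performance(Trial);
       End;
    End;
Writeln('Best weight ',Best:6:3,', performance ',BestScore:6:3);
End.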

Swaine: I've read that the adding of noise is called simulated annealing, a reference to a process in metallurgy. Is this a good algorithm?

Parker: Well, it's been proven that if the distribution of the noise slowly gets smaller, the probability will go to one that you will find the global minimum. [He points to the deep dip in the curve.] It takes a long time, though. This is just a random search, so a Boltzmann Machine, if you look at it as a minimization algorithm, is a parallel implementation of a random search.

Swaine: Which is about as crude as you can get.

Parker: But it has the nice property that if the noise you add gets smaller and smaller, the probability goes to one that you'll find the global minimum. The idea is that when you first start, you search all over the surface, and then when you narrow down to a certain area you gradually add less and less noise because the probability is higher that you're [near the global minimum].
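
In the random-search sketch above, that annealing schedule is a single extra line inside the loop; the shrink factor here is my arbitrary choice, and real schedules are picked with more care:

    Trial := Best + NoiseScale*(2.0*Random-1.0); (* As before, but...    *)
    NoiseScale := 0.9995*NoiseScale;             (* ...slowly shrink the *)
                                                 (* noise distribution.  *)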

Swaine: Here's an easy one: Back propagation.

Parker: Back propagation is "parallel steepest descent." Steepest descent is a well-known algorithm. Rather than picking a random place, you just keep going downhill. That won't necessarily get you out of a local minimum. [He points to the shallow dip in the curve.] But there are ways around that.
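
Steepest descent is just as easy to sketch. On the same made-up two-dip curve, the whole algorithm is one repeated update; again this is my sketch, not Parker's code, and Slope approximates the first derivative numerically:

Program SteepestDescentDemo;
Const LearningRate = 0.05;   (* Step size along the downhill direction. *)
      Delta        = 0.0001; (* Spacing for the numerical derivative.   *)
Var W   : Real;
    Step: Integer;

Function Performance( W: Real ): Real;
Begin (* The same made-up curve: a shallow dip and a deeper one. *)
Performance := Sqr(Sqr(W)-1.0) + 0.3*W;
End;

Function Slope( W: Real ): Real;
Begin (* Numerical first derivative of the performance curve. *)
Slope := (Performance(W+Delta)-Performance(W-Delta))/(2.0*Delta);
End;

Begin
W := 2.0*Random - 1.0;              (* Start from a random weight. *)
For Step := 1 To 1000 Do
    W := W - LearningRate*Slope(W); (* Always move downhill.       *)
Writeln('Settled at W = ',W:6:3,', performance ',Performance(W):6:3);
End.

Start it to the right of the hump and it settles into the shallow dip and stays there -- exactly the rut Parker describes next.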

Swaine: I'd like to hear about the ways around.

Parker: Oh, that's a neat one. [He pulls out another piece of paper and redraws the graph, this time with a diagonal line representing a third dimension at right angles to the performance and weight dimensions.] I should give credit for this: This picture was first drawn for me by mathematician Dan Asimov. This third dimension represents the inputs to the system, the things you have no control over. For instance, in your brain you supposedly have control over protein levels and that kind of stuff, but you have no control over the fact that a car is bearing down on you. The inputs are things that you have no control over during the learning process, the data that you're getting from the external world; the weights are the stuff that's inside you; and the performance is [the output] that we measure. In any particular task you're working on you're only in a small area of all the possible inputs the system could be getting. There are entirely different inputs for, say, scuba diving, versus learning physics. So if you're studying physics, as my wife is, you can look at the performance surface in that narrow band [of inputs]. Now if you're studying physics and using a steepest descent algorithm and you're here [he points to the shallow dip in the curve], you're in big trouble, because you're always going to be trying to go downhill, which will always take you to the local minimum, and you're never going to get over the hump. You're stuck in a rut. There's a human analogy: People are often said to be stuck in a rut, doing a lot of work but not getting anywhere. Now, what a human will often find helpful in that situation is to go and do something different for a while.

Swaine: Get up and go for a walk.

Parker: Yeah, go for a walk, play tennis, go hiking.... And what they're doing is going to a different subspace of their inputs. And there the performance surface may look radically different: You don't expect that a tennis pro, just by playing tennis all the time, will become a great physicist, or vice versa. So what happens when the physicist goes off and plays tennis, since the optimum tennis weights are not the optimum physics weights, is that the weights get moved in some direction. And if the tasks are uncorrelated, that direction will be random. So you go off and do something different and when you come back to your original task, if you're lucky, you'll be working on it for a while and you'll say, Hey, why didn't I think of that before?

Swaine: So you adjust the inputs, and the weights are adjusted as a consequence, rather than directly tweaking the weights randomly, as the Boltzmann Machine does.

Parker: By exploring different parts of your input space you can give your weights random pushes in different directions and hopefully be able to find the global minimum.

Swaine: Why is that approach better than a Boltzmann Machine?

Parker: They're good at different things. If you have access to the weights, and all you're doing is learning physics or doing circuit board layout, it is much quicker to just kick all the weights randomly, stay in the task subspace, do a Boltzmann Machine followed by steepest descent, and repeat. But people don't have access directly to their weights, so rather than kicking the weights directly, you can go to an uncorrelated subspace of your inputs, and then come back to work on your task again. It's a slower process, but you don't need access to your weights.
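
The first hybrid Parker mentions -- kick all the weights randomly, run steepest descent, keep the better result, repeat -- drops straight into the steepest descent sketch above. Reusing its Performance, Slope, and LearningRate (the kick size and count are my choices):

Const KickSize = 0.7;
Var   Kick, Step: Integer;
      W, Best   : Real;

Begin
Best := 2.0*Random - 1.0;
For Kick := 1 To 25 Do
    Begin
    W := Best + KickSize*(2.0*Random-1.0); (* Kick the weights randomly.  *)
    For Step := 1 To 1000 Do               (* Then run steepest descent.  *)
        W := W - LearningRate*Slope(W);
    If Performance(W) < Performance(Best)  (* Keep the better of the two. *)
       Then Best := W;
    End;
End.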

Swaine: We've all had that experience of getting up from the keyboard when we're stuck on a problem and going off to do something seemingly completely unrelated, then finding when we come back to the problem that the solution is suddenly obvious. Some people would claim, I think, that the intervening activity is not really uncorrelated, that there is some deep connection between this particular physics problem and, say, tennis.

Parker: And when it works, they'd be right. [It would be a great aid to teaching physics] if you knew that, when you were in the physics subspace and were at a certain local minimum, a good thing to do would be to go play the harpsichord. But we don't generally know that. [But that's not the only advantage of this approach over the Boltzmann Machine.] To do the Boltzmann Machine you need to keep around two sets of weights: Your best and the current. Using this approach you only need your current ones, but it does take longer.

Swaine: One begins to get the message from all these algorithms that learning simply takes a long time.

Parker: Teaching people takes years and years. So anyway, even if you use parallel steepest descent -- which isn't guaranteed to find the global minimum -- you can still find it, by doing this kind of thing.

Swaine: So what's the most powerful neural net model? What's the best we can do today?

Parker: Is there a more powerful minimization algorithm than steepest descent? Yes, there is. It's been around for a long time, and it's called "Newton's Method." Steepest descent is based on the first derivative of the performance surface: It looks at the first derivative and tries to follow it. Newton's Method uses the second derivative, so it's based on more information.

Swaine: How does that work?

Parker: A handy analogy is from skiing. Imagine that you're on a steep overhang and the obvious way to get back to the lodge is to go straight down the steep slope to the valley and then down the gentle slope of the valley to the lodge. Steepest descent is equivalent to doing nothing but sitting on your skis. You go very quickly down the slope to the valley, but then you get to the valley and you slow down, and it gets very inefficient. What expert skiers will do is cut across the slope, by putting pressure on one of their skis, and that's exactly what Newton's Method does. The parameters that control Newton's Method are equivalent to the pressure you put on your ski. If you just sit on your skis, your speed is proportional to the steepness of the slope, which is true of steepest descent. But with Newton's Method, you move at constant velocity.
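
In one dimension, Newton's Method replaces the fixed learning rate with the reciprocal of the second derivative, so the step size adapts to the local shape of the surface. A sketch on the same made-up curve, this time with analytic derivatives (in many dimensions the second derivative becomes a matrix, which is where the extra cost comes in):

Program NewtonDemo;
Var W   : Real;
    Step: Integer;

Function Slope( W: Real ): Real;
Begin (* First derivative of Sqr(Sqr(W)-1.0) + 0.3*W. *)
Slope := 4.0*W*(Sqr(W)-1.0) + 0.3;
End;

Function Curvature( W: Real ): Real;
Begin (* Second derivative of the same curve. *)
Curvature := 12.0*Sqr(W) - 4.0;
End;

Begin
W := 0.8; (* Start near a dip; Newton's Method wants positive curvature. *)
For Step := 1 To 20 Do
    W := W - Slope(W)/Curvature(W); (* Step scaled by the curvature.     *)
Writeln('Settled at W = ',W:6:3);
End.

It homes in on the nearby minimum in a handful of steps where steepest descent takes hundreds -- though, like steepest descent, it only finds the dip it starts in.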

Swaine: But at some cost. What does using Newton's Method cost you?

Parker: Newton's Method is trickier to implement and more computationally intensive.

Swaine: It sounds like there is no one best algorithm for all possible situations, which I suppose is not surprising.

Parker: Newton's Method is a superior algorithm, so if you can afford the computational overhead, it pays off for you.

Swaine: Where exactly does it pay off?

Parker: One way in which it pays off is this. Steepest descent takes a curved path, and Newton's Method takes a straight path. The best path is the straight path.

Swaine: Obviously. But you mean something more than the fact that a straight line is the shortest distance between two points.

Parker: Right. The disadvantage of the curved path is, if you follow a curved path and then stop learning halfway through, you may be no closer to where you want to go, and you've moved away from where you used to be. So you've forgotten old but useful information without having learned very much new information. Whereas if you follow a straight line and stop, you're closer to where you want to go and as close to where you used to be as you can be.

Swaine: So if you can afford it, Newton's Method is best. Sort of the high-priced algorithm. It occurs to me that, even if we can't pick an absolute best algorithm, nature has already made a choice. Too bad we don't know -- yet -- what algorithm nature picked for human learning.

Parker: Yes; you can think of almost any example of human learning as minimization. If you accept the hypothesis that learning is exactly minimization, then people must be implementing some minimization algorithm, and since Newton's Method is the more powerful of the two, one would hope that it was a form of Newton's Method.

Swaine: Really? I was finding your description of modified backprop pretty convincing in terms of my experience in learning things.

Parker: I would think it was a back-propagation form of Newton's Method.

Swaine: Nature has probably had time to try out many algorithms. But if Newton's Method is the most powerful minimization algorithm we know, and if it's actually feasible to implement in a human brain....

Parker: The performance difference is great enough that the other guys would have died out.

The listing that accompanies this month's column is Parker's demonstration of back propagation, written in Turbo Pascal. It's not intended to be an efficient implementation, and it won't teach you much if you just run it and examine the output. The code, though, is well documented and is intended to be read. Here are some aids to reading the listing.

The program simulates a neural net of four cells, or neurons (simulates because it is purely software, not a parallel implementation). Each neuron is a simple computational unit with two input channels, a channel for receiving the error signal, and one output channel. Each input is modified by an associated weight. The neuron computes its output from the inputs and the weights; in this implementation the computation is a sigmoid function, but other computations are possible.
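
In the listing, the cell computation is a single line in UpdateCellOnForwardPass: the weighted sum of the outputs from the row below is squashed into the range 0 to 1 by the sigmoid:

     Output := 1.0/(1.0+Exp(-Sum)); (* the sigmoid function *)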

The neurons are connected in layers, with the outputs of neurons in one layer feeding the inputs of neurons in the next. The original inputs are treated as the outputs of a nonexistent 0-th level, and the outputs of the top level are compared with the desired outputs to generate error signals. The error signals propagate back through the net, causing the weights in the neurons to be adjusted. The idea is that "good" paths through the net will get weighted more and more heavily and "bad" paths will get lower and lower weights, so that eventually only the "good" paths will be taken.
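
The adjustment itself is the heart of the algorithm, and in the listing it too is a single statement, in UpdateCellOnBackwardPass: each weight moves in proportion to the cell's error signal, the derivative of the sigmoid (the Output*(1.0-Output) factor), and the output of the cell below:

     Weights[J] := Weights[J] +
      LearningRate*Error*Output*(1.0-Output)*CellArray[Row-1,J].Output;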

The demo neural net (Listing One) learns a simple task. Given one of the four input patterns (0, 0), (1, 0), (0, 1), (1, 1), the net is to produce the corresponding output pattern: (0, 0), (0, 1), (0, 1), (1, 0). Described in this way, it's a pretty meaningless task, but it makes a little more sense when expressed in functional terms: the two values in the output pattern are respectively the logical AND and the logical XOR of the two input values.

So the operation of the program is:

  • Compute the inputs and corresponding desired outputs
  • Propagate the inputs forward through the net
  • Compare the actual output to the desired output and compute the error
  • Propagate the error signals back through the net and update the weights
  • Check whether the classification has been learned

The last step involves examining the sum of squares of the error signals and testing for convergence. The input and output values are not really 0s and 1s as stated above, but numbers close to 0 and 1, and the program quits when the classification is learned to within a preset degree of tolerance -- when it gets close enough to 0s and 1s.
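
In the listing, that test is the Criteria constant at work: the squared error signals of the top row are summed, and the count of converged iterations must reach four -- one for each input pattern -- before the program stops:

     ErrorSquared := 0.0;       (* Clear error squared.     *)
     For J := 1 To NumOfCols Do (* Calculate error squared. *)
         ErrorSquared := ErrorSquared + Sqr(CellArray[NumOfRows,J].Error);
     If ErrorSquared < Criteria Then
        ConvergedIterations := ConvergedIterations + 1
     Else ConvergedIterations := 0;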

Training a neural net of this sort requires that the desired output for each pass be known, so that the error signals can propagate back through the net to cause the weights to be adjusted. That's the back propagation, and what the net is learning to do is to minimize these error signals. In this demo program, the computation of error is handled automatically by having an array of the desired outputs for each set of inputs. That might make the whole enterprise seem pointless, and in the case of the demo, it is: If you examine the routine that computes the inputs and desired outputs, you'll see that we have already calculated the right answers before the network even started running.

In a real neural net implementation, computing the error could be more complicated, and in some cases might require that a person evaluate the response. Or, more interestingly, we might be able to write a routine that recognizes and characterizes errors when it sees them, even though we couldn't write an algorithm to produce correct output. In chess, for example, we can't yet write a program that plays perfectly, but we can write programs that do a perfect job of recognizing checkmate, and we can implement various position-evaluation heuristics. These are cases in which we can generate error signals to train the net even though we don't know how to solve the problem. And if our error signals are good, and if our algorithm is one that doesn't get stuck in a rut, our network, left to run uninterruptedly and to learn from its mistakes, will eventually learn to play perfect chess.

But with current hardware, probably not in our lifetimes.

Availability

All source code for articles in this issue is available on a single disk. To order, send $14.95 (Calif. residents add sales tax) to Dr. Dobb's Journal, 501 Galveston Dr., Redwood City, CA 94063, or call 800-356-2002 (from inside Calif.) or 800-533-4372 (from outside Calif.). Please specify the issue number and format (MS-DOS, Macintosh, Kaypro).

_Programming Paradigms_ by Michael Swaine

[LISTING ONE]


Program BackPropagationDemo;

Uses Crt; (* Needed in Turbo Pascal 4 and later for ClrScr and GotoXY. *)

Const NumOfRows    = 2;     (* Number of rows of cells.            *)
      NumOfCols    = 2;     (* Number of columns of cells.         *)
      LearningRate = 0.25;  (* Learning rate.                      *)
      Criteria     = 0.005; (* Convergence criteria.               *)
      Zero         = 0.05;  (* Anything below 0.05 counts as zero. *)
      One          = 0.95;  (* Anything above 0.95 counts as one.  *)

Type CellRecord = Record
                  Output : Real; (* Output of the current cell.             *)
                  Error  : Real; (* Error signal for the current cell.      *)
                  Weights: Array[0..NumOfCols] Of Real; (* Weights in cell. *)
                  End;

Var CellArray     : Array[0..NumOfRows,0..NumOfCols] Of CellRecord; (* Cells. *)
    Inputs        : Array[1..NumOfCols] Of Real; (* Input signals.           *)
    DesiredOutputs: Array[1..NumOfCols] Of Real; (* Desired output signals.  *)

Procedure CalculateInputsAndOutputs( Iteration: Integer );
Var I: Integer;
Begin (* Calculate the inputs and desired outputs for the current iteration. *)
      (* The inputs cycle through the 4 patterns (0.05,0.05), (0.95,0.05),   *)
      (* (0.05,0.95), (0.95,0.95).  The corresponding desired outputs are    *)
      (* (0.05,0.05), (0.05,0.95), (0.05,0.95), (0.95,0.05).  The first      *)
      (* desired output is the logical AND of the inputs, and the second     *)
      (* desired output is the logical XOR.                                  *)
If (Iteration Mod 2) = 1 Then Inputs[1] := One Else Inputs[1] := Zero;
If (Iteration Mod 4) > 1 Then Inputs[2] := One Else Inputs[2] := Zero;
If (Inputs[1] > 0.5) And (Inputs[2] > 0.5) Then DesiredOutputs[1] := One
Else DesiredOutputs[1] := Zero;
If (Inputs[1] > 0.5) Xor (Inputs[2] > 0.5) Then DesiredOutputs[2] := One
Else DesiredOutputs[2] := Zero;
End;

Procedure UpdateCellOnForwardPass( Row, Column: Integer );
Var J  : Integer;
    Sum: Real;
Begin (* Calculate the output of the cell at the specified row and column. *)
With CellArray[Row,Column] Do
     Begin
     Sum := 0.0; (* Clear weighted sum of inputs. *)
     For J := 0 To NumOfCols Do (* Form weighted sum of inputs. *)
         Sum := Sum + Weights[J]*CellArray[Row-1,J].Output;
     Output := 1.0/(1.0+Exp(-Sum)); (* Calculate output of cell.  This *)
                                    (* is called a sigmoid function.   *)
     Error := 0.0; (* Clear error for backward pass. *)
     End;
End;

Procedure UpdateCellOnBackwardPass( Row, Column: Integer );
Var J: Integer;
Begin (* Calculate error signals and update weights on the backward pass. *)
With CellArray[Row,Column] Do
     Begin
     For J := 1 To NumOfCols Do      (* Back propagate the error to the cells *)
         CellArray[Row-1,J].Error := (* below the current cell.               *)
             CellArray[Row-1,J].Error+Error*Output*(1.0-Output)*Weights[J];
     For J := 0 To NumOfCols Do (* Update the weights in the current cell. *)
         Weights[J] :=
          Weights[J] +
          LearningRate*Error*Output*(1.0-Output)*CellArray[Row-1,J].Output;
     End;
End;

Var I, J, K            : Integer; (* I loops over rows, J loops over columns,*)
                                  (* and K loops over weights.               *)
    ConvergedIterations: Integer; (* Network must remain converged for four  *)
                                  (* iterations (one for each input pattern).*)
    Iteration          : Integer; (* Total number of iterations so far.      *)
    ErrorSquared       : Real;    (* Error squared for current iteration.    *)

Begin
ClrScr; (* Initialize the screen. *)
Writeln('Iteration     Inputs    Desired Outputs   Actual Outputs');
Iteration := 0;            (* Start at iteration 0. *)
ConvergedIterations := 0;  (* The network hasn't converged yet. *)
For I := 1 To NumOfRows Do (* Initialize the weights to small random numbers.*)
    For J := 1 To NumOfCols Do
        For K := 0 To NumOfCols Do
            CellArray[I,J].Weights[K] := 0.2*Random-0.1;
For I := 0 To NumOfRows Do (* Initialize outputs of dummy constant cells. *)
    CellArray[I,0].Output := One;
Repeat
     CalculateInputsAndOutputs(Iteration);
     For J := 1 To NumOfCols Do (* Copy inputs to dummy input cells. *)
         CellArray[0,J].Output := Inputs[J];
     For I := 1 To NumOfRows Do (* Propagate inputs forward through network. *)
         For J := 1 To NumOfCols Do
             UpdateCellOnForwardPass(I,J);
     For J := 1 To NumOfCols Do (* Calculate error signals. *)
         CellArray[NumOfRows,J].Error :=
             DesiredOutputs[J]-CellArray[NumOfRows,J].Output;
     For I := NumOfRows Downto 1 Do (* Propagate errors backward through *)
         For J := 1 To NumOfCols Do (* network, and update weights.      *)
             UpdateCellOnBackwardPass(I,J);
     ErrorSquared := 0.0;       (* Clear error squared.     *)
     For J := 1 To NumOfCols Do (* Calculate error squared. *)
         ErrorSquared := ErrorSquared + Sqr(CellArray[NumOfRows,J].Error);
     If ErrorSquared < Criteria Then (* If network has converged, increment  *)
             ConvergedIterations := ConvergedIterations + 1 (* convergence   *)
     Else ConvergedIterations := 0;  (* count, else clear convergence count. *)
     If (Iteration Mod 100) < 4 Then (* Every 100 iterations, write out *)
        Begin                        (* information on the 4 patterns.  *)
        If (Iteration Mod 100) = 0 Then GotoXY(1,2);
        Write('  ',Iteration:5,'     '); (* Write iteration number. *)
        For J := 1 To NumOfCols Do (* Write out input pattern. *)
            Write(Inputs[J]:4:2,' ');
        Write('     ');
        For J := 1 To NumOfCols Do (* Write out desired outputs. *)
            Write(DesiredOutputs[J]:4:2,' ');
        Write('       ');
        For J := 1 To NumOfCols Do (* Write out actual outputs. *)
            Write(CellArray[NumOfRows,J].Output:4:2,' ');
        Writeln;
        End;
     Iteration := Iteration + 1; (* Increment iteration count *)
Until (ConvergedIterations = 4) Or (Iteration = 32767);
      (* Stop when the network has converged on all 4 input patterns, or when*)
      (* we are about to get integer overflow.                               *)
If ConvergedIterations <> 4 (* Write a final message. *)
Then Writeln('Network didn''t converge')
Else Writeln('Network has converged to within criteria');
End.

Copyright © 1989, Dr. Dobb's Journal

