How important is numerical accuracy?
Jack is a DDJ contributing editor. He can be contacted at firstname.lastname@example.org.
When news of the Pentium processor floating-point bug broke in the fall of 1994, the Associated Press knew who to call for background information -- Professor William Kahan, of the electrical engineering and computer sciences department at the University of California at Berkeley. Kahan, a noted mathematician and computer scientist who received his doctorate at the University of Toronto, was a consultant on the design of the original 8087 mathematical coprocessor. He is also generally credited as the architect of ANSI/IEEE Standard 754-1985, the standard for binary floating-point arithmetic. To this day, few people understand the math capabilities of Intel processors to the degree that William Kahan does.
Among Kahan's many awards and lectureships are the John von Neumann Memorial Lecture for the Society for Industrial and Applied Mathematics (1997), the ACM Turing Award (1989), the Prize for Outstanding Paper (with J. Demmel) from the SIAM Activity Group on Linear Algebra (1991), and the ACM G.E. Forsythe Memorial Award (1972). (For more information, see http://http.cs.berkeley.edu/~wkahan/.)
Over the years, Kahan has been known for his rigorous analysis of numerical computation techniques in an era of rapid change and challenge for scientific and engineering programming. DDJ contributing editor Jack Woehr recently had the opportunity to speak with Kahan in his office at UC Berkeley, where he has taught since 1969.
DDJ: In addition to being a computer scientist, mathematician, and educator, you're also a researcher. What are your current interests?
WK: One of my areas of research is exception handling. My thesis is that exceptions are not errors unless they are handled badly. Exceptions are opportunities for extra computation.
DDJ: Modern C and C++ agree with you.
WK: An exception is an event for which any policy you choose in advance will subsequently be found disadvantageous for somebody who will then, for good reason, take exception to the policy.
Now, 3×6=18. That's not exceptional. Nobody's going to sue you. But what should we do with zero division? The APL language took the approach that 0/0=1. Years later the guys who did it acknowledged, "If we knew then (in 1966), what we know now (1972), we wouldn't have done it."
DDJ: You have to let something outside the function decide whether an attempt to divide by zero is exceptional or not.
WK: It depends. Sometimes the event can be caught by the program module in which it occurs, and it can be regarded as an indication of something called "removable singularity," which the program can fix, and there's no point in telling anyone about it. An example would be the attempt to graph using the function
f(x) = (sin x)/x
in whose graph one point is missing -- that at zero. So you choose a value for f(0) and that removes the singularity, which exists only because of an accident in the way we represent the function. The singularity isn't really part of the graph; the singularity is part of the expression. The expression is an incomplete expression of what we want to do, which is
f(x) = (sin x)/x if x != 0
f(0) = 1
DDJ: What should we programmers be learning?
WK: What you should be learning are things from numerical-analysis classes, such as why the less accurate of two ways of calculating a function may be perfectly satisfactory for engineering work. But I also tell programmers that argument [such as on the phone with customers] takes time and money, and since we know a way to do it which will circumvent the necessity for argument, by all means do it that way, if you can. You can dispense with the necessity of trying to persuade somebody of something which is, in fact, true -- that two methods are just as good as one another, but which they have every right to disbelieve. There are all sorts of folks who tell you, "It's okay, this is just as good," and are wrong.
Let's calculate accuracy well out of proportion with what anybody actually needs. They don't need it, but it doesn't matter. We'll do it that way and then there won't be any argument.
DDJ: "How do we do it that way?" the programmer asked the math professor.
WK: Get it from the public domain. You can get a lot of it from 4.3 BSD UNIX, or from the library that my former students at Sun have put in the public domain, or maybe all you do is hear that somebody can do it. Why not you?
DDJ: And you reverse engineer the algorithm.
WK: There are people who find that enjoyable. They're my kind of guys!
DDJ: But at a time when three C++ compilers don't even parse the same expression identically, how do we get them to calculate the same mathematical results?
WK: That's a problem John Palmer tried to address back in 1976-77. John got a Ph.D. at Stanford in mathematical analysis, then went to work for Intel. At that time, the microprocessor was new. Everybody had his own idea of how to do arithmetic and libraries. The anarchy you've hinted at among compilers was reflected perfectly. Every compiler had its own math library, often done ineptly because the compiler writer was clever about something else, but not about how to do these functions. So you never knew how an operation would come out.
John's idea was to put all these functions that you'd normally find in a math library on a chip. This is the 8087, a chip on which we'd put the good arithmetic. John asked for the best, which was a mistake, because he didn't know what he was getting into. (Laughs.)
We wanted to have the whole library. We wanted decimal-to-binary conversion and back, we wanted the trig functions, and log functions, and exponentials, and arctans, and we also wanted the arithmetic to be reliable.
Arithmetic is reliable when it conforms to rules which, when you acquaint yourself with them, seem to be not at all surprising because you understand why they are there. There isn't, for instance, any reason nowadays you should get bad binary-to-decimal conversion other than that somebody is ignorant of the existence of good conversion in the public domain. The same thing is true of most math library functions.
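One way to probe a library's conversions is a round-trip check. The sketch below is an illustration, not a full test suite: it prints a double with 17 significant digits, reads it back, and compares. The function name is made up, and the claim that 17 digits suffice assumes IEEE 754 doubles and correctly rounded conversions in the C library.

```c
#include <stdio.h>
#include <stdlib.h>

/* Returns 1 if x survives binary -> decimal -> binary unchanged.
   Assumes IEEE 754 doubles, for which 17 significant digits are enough
   when both conversions are correctly rounded. */
int survives_round_trip(double x)
{
    char text[64];
    snprintf(text, sizeof text, "%.17g", x);   /* binary to decimal */
    return strtod(text, NULL) == x;            /* decimal back to binary */
}
```

A library whose conversions are poor will fail this check for some inputs; passing it for a handful of values proves little, so a real test would sweep many values.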
DDJ: So for those who are interested in the mathematical accuracy of their praxis, there are tools available for free that allow them to achieve this.
WK: It is remarkable how infrequently these tools are used by people who ought to know better.
DDJ: Like who?
WK: Like Microsoft. But let's get back to John Palmer.
Palmer really wanted to get the whole library on one chip. Something really unfortunate happened. We only had 40,000 transistors. So we couldn't get it all there. We got most of it in. Later, Motorola came out with the 68881 around 1981-82. They had more transistors, so they did a better job. They had more of the library in there. Instead of having just decimal-to-binary conversion for integers, which is on this 8087 chip, they had it for the whole range.
On the Pentium, they have improved the arithmetic a little, not greatly. They really can't afford to put too many instructions onto the chip that won't be used by existing software. The correction of mistakes, the filling in of gaps...unless it's going to be used by software, you may go to a lot of bother for nothing.
John managed the 8087 project. I was a consultant on that project. At least 90 percent of what I wanted got into that little chip. So when you ask, "How can people get the right stuff?" -- well, they could get the right stuff from the Intel chip with a suitable support library. Unfortunately, they might have trouble finding a suitable support library, because there are various proprietary considerations at stake. Many times, the library you get is going to be a lousy one.
DDJ: Why is that?
WK: Most numerical computation doesn't matter. I know that sounds perverse, but in my experience, looking over the shoulders of other people, at least nine-tenths of what they compute gets thrown away, much of it without ever being looked at. They have only to get a glimpse of something to realize they never should have computed it in the first place.
An awful lot of what people compute numerically is a response to a "what-if" sort of question. Long before you get much of your answer, you realize it was a silly question.
DDJ: It's less than zero, therefore it fails.
WK: People don't look at pages of numbers the way they used to. Even the graphs tend to be thrown away without being looked at. Most numerical computation doesn't matter, therefore a great deal of it can be wrong without deflecting the course of history.
Some numerical computation matters a lot. We don't usually know what it is until afterwards. We may not know until too late. How do you know the answer is wrong unless you know the right answer? And if you knew the right one, why would you compute the wrong one?
So there are some interesting questions about the incidence of wrong results. It's one of these unknowables about which you have to draw inferences indirectly. When I was younger and computers were big, huge boxes that filled a room, the number of people who used computers was many fewer than today. They would work in a small number of rooms near the big computer, each waiting for his chance to get on the computer. The computation came to the computer as a spool of punched paper tape. Somebody would finally finish their computation, usually involuntarily, either because the computation bombed, or they ran out of time. Their scheduled moment ended, and mine would begin, and they would have to clear away from the console as quickly as they could, and I would have to get in there and slip my tape in...
You knew each other, and [you knew] what everyone was doing. And I became less than welcome because I discovered that if I was interested in what someone else was computing because it looked odd to me, in one case in three the results were significantly more wrong than the guy who was computing them thought. "Gee, how does he know he can compute that by such-and-such a method?" "I didn't think we had enough digits to compute something like that..."
People weren't happy to hear that, but that was an education for me, because these weren't dumb guys. It made me reflect on my own stuff and wonder how much of that was wrong.
When we test a program, we feed it data we know the answer to and see if it comes up with the correct solution. Half the time, we find out that what we thought was the right solution was wrong. So that's educational, too.
You can cross-check your work a number of ways. If the solution should have some symmetries, you can check for the symmetries. If something ought to be conserved, maybe energy or momentum, let's see if my solution conserves the things it ought to conserve.
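As a small illustration of a symmetry check, the sketch below measures how badly a function violates f(-x) = -f(x) over a set of sample points. All names are hypothetical, and the two test functions are invented for the example.

```c
#include <math.h>
#include <stddef.h>

typedef double (*real_fn)(double);

/* Largest violation of f(-x) = -f(x) over the sample points. */
double odd_symmetry_defect(real_fn f, const double *xs, size_t n)
{
    double worst = 0.0;
    for (size_t i = 0; i < n; i++) {
        double defect = fabs(f(xs[i]) + f(-xs[i]));
        if (defect > worst)
            worst = defect;
    }
    return worst;
}

/* Hypothetical functions under test. */
double cubic(double x)         { return x * x * x - x; }         /* odd */
double cubic_shifted(double x) { return x * x * x - x + 1e-3; }  /* not odd */
```

A defect far above the rounding-error level for a function that should be odd is a hint that something upstream is wrong. The same pattern applies to a conserved quantity: compute it before and after, then compare.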
Since we lack a statistical method whereby we may evaluate the incidence of error, all the evidence is anecdotal. The typical way in which numerical error is caught tells us that the initial cost will be attempting to reconcile the irreconcilable. Someone will suspect that something is wrong, a new machine will deliver different results. You try to track down what has happened, and that can cost you hours spent on something that's largely futile because, typically, the causes are misdiagnosed. The bug you fixed is not the bug you should have fixed, and now you've got two bugs.
The time that engineers should be spending on engineering, the time that economists should be spending on economics, is sometimes time spent instead figuring out why there is something odd about a program. Because these people are not interested in computers, but rather in getting their work done, they will often stop with an incomplete and often incorrect diagnosis. They'll "do something." What they do causes a change. If the change is tolerable, then they never find what caused the original bug. But they've spent time on it.
That turns out to be the principal way in which numerical errors cause trouble. Most of these errors are actually errors in the compiler. It is faulty optimization or other mistakes that are responsible, in my experience, for the majority of errors people detect.
Let me give an example, which arose yesterday. A student was doing a computation involving a simulation of a plasma. It turns out that as you go through this computation there will be certain barriers. This may be a reflecting barrier. That means if a particle moves through this barrier, it should not really go through, it should be reflected. Others may be absorbing, others may be periodic. He has a piece of code that is roughly
float x, y, z;
x = y + z;
if (x >= j) replace (x);
y = x;
As far as we can tell, when he turns on optimization, the value of x is computed in a register with extra width. This happens on the Intel machine because the registers have extra width. It also happens on some others. The value of x that he computes is not, in fact, a float. It's in a register that's wider. The machine stores x somewhere, and in the course of doing that, converts it to a float. But the value used in the comparison is the value in the register. That x is wider. Testing whether the register x is greater than or equal to j doesn't guarantee that x, when stored in y, will be less than j. Sometimes y=j, and that should never be. My student counted on what compiler writers call "referential transparency" and was disappointed because the compiler writer saved some time in optimization by not reloading the register from the stored value.
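The effect can be reproduced without relying on any particular compiler's register allocation. In the sketch below a double stands in for the wide register, which is an assumption made for the demonstration; the function names are hypothetical. A value slightly below the barrier passes the wide comparison, yet rounds up to the barrier when stored as a float.

```c
/* Returns 1 when a wide value passes the "below the barrier" test
   but its float copy does not. A double plays the wide register here. */
int narrowing_reaches_barrier(double wide_x, float barrier)
{
    volatile float stored = (float)wide_x;   /* forces rounding to float */
    return wide_x < barrier && stored >= barrier;
}

/* One remedy: round first, then compare the value that is actually kept. */
int stored_value_below_barrier(double wide_x, float barrier)
{
    volatile float stored = (float)wide_x;
    return stored < barrier;
}
```

With a barrier of 1.0f, the double 1 - 2^-30 is below the barrier but has no float neighbor closer than 1.0f itself, so the first function reports the mismatch and the second correctly refuses the value.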
DDJ: Would application of the volatile keyword fix the problem?
WK: It's possible, but then every time the variable is referenced it will be fetched from memory. Telling him to declare something volatile could be a cure worse than the disease if it makes his program run slower in an innermost loop. The problem is not the declaration -- it's that the compiler writer failed to maintain referential transparency and allowed two different versions of the same variable to exist. If they were two different versions of the same value, we wouldn't care. But it just didn't occur to the compiler writer that he'd be changing the value by copying the register to a narrower form than the register held.
Normally, it's advantageous to have extra digits. But if you are going to carry extra digits, then there are two things you have to do. First, this register format with the extra digits has to be a type supported in your language, you have got to be able to declare intermediate variables to have the same width as your registers. The second thing is that you have to recognize that a coercion which changes the width may change the value, and therefore you shouldn't regard that as a legitimate step in optimization.
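C99 later exposed one form of this idea through the types float_t and double_t in <math.h> and the macro FLT_EVAL_METHOD in <float.h>. The sketch below is one possible use, not a prescription: the accumulator is declared in the evaluation format, and the only narrowing is the explicit cast at the end. The function names are made up.

```c
#include <float.h>   /* FLT_EVAL_METHOD */
#include <math.h>    /* float_t, double_t */
#include <stddef.h>

/* Sums floats in an accumulator at least as wide as the format the
   implementation uses to evaluate float expressions. */
float sum_in_evaluation_format(const float *v, size_t n)
{
    float_t acc = 0;             /* may be float, double, or long double */
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return (float)acc;           /* one explicit, visible narrowing */
}

/* 0: evaluate in the declared type; 1: in double; 2: in long double;
   negative values mean other or indeterminable behavior. */
int evaluation_method(void)
{
    return FLT_EVAL_METHOD;
}
```

Because the wide variable has a declared type, the compiler has no hidden second version of it to keep in step.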
DDJ: Does anyone get it right?
WK: One of the compilers I enjoyed a lot was made by Language Systems Inc. LSI has since been bought out by Fortner. That LSI Fortran compiler runs on my Mac Quadra with a 33-MHz 68040. But the compiler writer was very conscientious. In fact, all of the Standard Apple Numeric Environment (they called it "SANE") was carefully done. Even if I didn't agree with every word of it, I certainly had to respect the care that went into it and the benefits it conveyed to applications programmers. They had a numerical environment which was not only stable and reproducible, but also sensible. It did things that a reasonable person would appreciate.
But the Macintosh based on the 68000 gave way to the PowerPC, whose arithmetic is inferior.
The PowerPC Architecture was developed, by and large, for people accustomed to using IBM mainframes for number crunching. I think PowerPCs were intended for the upper end of the desktop market, as engineering workstations for people who not only have engineering expertise, but who also have experience with computers. Many of these people take pride in their ability to work their way around various anomalies and limitations.
The 68K Macintosh was intended for a much wider range of skills, supported, I think properly, by the LSI Modula-2 compiler and also its C compiler. MPW has everything you dislike about UNIX, but the compiler itself is excellent. They did understand the issues I've been complaining about. Various anomalies which I can list by the dozen did not occur, to my knowledge, in that compiler, and I used it enough to get a reasonable feel for it.
Compiler writers nowadays usually pay no attention to numerics because in the mass market as now perceived, numerics has never been a significant factor. For instance, Borland's C compiler is numerically superior to Microsoft's, but you don't see Borland triumphing over Microsoft.
Some compiler writers go to the books that taught them whatever math they barely learned and use those formulas. Those formulas are frequently quite elegant. Formulas that are numerically stable are sometimes quite inelegant. Formulas that give you greater accuracy are sometimes full of cases, ugly cases. I can appreciate the disinclination of someone teaching mathematics to teach numerically stable formulas. He wants to get an insight across. The programmer who has learned from that source may not be aware that this is a mathematically elegant but numerically unreliable formula.
DDJ: How do you protect yourself from error?
WK: Error creeps in. You have to learn the techniques of error analysis and decide what degree of error is tolerable. There are three good books: Accuracy and Stability of Numerical Algorithms, by Nicholas J. Higham (Society for Industrial and Applied Mathematics, 1996, ISBN 0-89871-355-2), Matrix Computations, Third Edition, by Gene F. Golub and Charles F. van Loan (Johns Hopkins University Press, 1996, ISBN 0-8018-5413), and The Algebraic Eigenvalue Problem, by J.H. Wilkinson M.A., D. Sc. (Clarendon Press, 1965). The last is the model which inspired the others.
But look at the thickness of these books! Look at the bibliography in this one -- 1134 citations! People generally are not going to read these books. Therefore, those of us who design the underlying systems -- the hardware, compiler, languages -- must do so knowing in advance what people are likely to do so that we can enhance the prospects that what they do will work.
When John Palmer asked me to help him design the chip that became the 8087, he said he wanted really good floating point. I said, "Well, the IBM 370 floating point isn't bad. If you design something like that, think of all the software you'll be able to run." He said, "No, no, we want good floating point."
I said, "There's the DEC VAX. If you built floating point like the DEC VAX, I don't think people would complain." He said, "We want the best floating point." And he said something else, "Besides, we have a mass market in mind."
At that time, I didn't appreciate what a mass market meant. "That must mean hundreds of thousands," I guessed.
DDJ: Instead of them being in every home in America.
WK: John Palmer tried to convince the people at Intel that the market for the 8087 was enormous. But they had their own numbers, and John couldn't persuade them that their pricing and distribution policies should be geared to a much larger market. So finally, in exasperation, he made them a sort of "put up or shut up" argument.
He said, "I'll tell you what. I'll relinquish my salary, provided you'll write down your number of how many you expect to sell, then give me a dollar for every one you sell beyond that." They didn't do it, but if they had, John Palmer would not have to think of working for a living.
The work on the 8087 became the basis for the IEEE standard, at least all the rational and algebraic operations. Transcendental functions didn't get into the standard, but all the other things did, pretty much.
The 8087 formed the basis for the 80287, 80387, and floating-point circuitry on the Pentium.
At the time the 80387 was introduced, Intel attempted to correct a serious flaw in the design which arose from the fact that it was impossible for someone in Santa Clara and someone in Israel to be awake at the same time. When John tried to explain to the Israelis what we wanted in order to cope with stack overflow and stack underflow, they misunderstood and thought that what we wanted would imply that an address would have to be kept on the coprocessor to tell where the stack underflow/overflow area was. They thought this couldn't be right, the coprocessor didn't have any address decoding capability of its own, only the ability to increment the address bus and let the next byte come in.
So they proposed another mechanism, and John, who was sleepy at the time, said, "Well, if you can get it to work, okay, but write the software first."
Well, everybody thought that someone else was writing the software, but it wasn't until the 8087 was coming off the fab line that it dawned on anyone that it might not be possible to handle 8087 stack underflow/overflow in a reasonable way. It's not impossible, just impossible to do it in a reasonable way.
The consequence was that the compiler community couldn't use the stack as a stack. They had to try to use it as a flat register set, for which it was very unsuited. Therefore, they got into the habit of loading a couple of operands into the registers, doing their thing, then dumping it to memory immediately. Leaving things in the stack turned out to be potentially dangerous, because if your compiler was going to do this automatically, your compiler had to do a lot of figuring to make sure you weren't going to get a stack overflow, since a stack overflow would be poisonously expensive to handle.
There were only two compiler writers who managed to get stack underflow/overflow to work. One of them wrote for the Alsys Ada compiler. The other wrote it for a Modula-2 compiler, which unfortunately never got finished because the principals parted in a snit. If I understand what the Microsoft C compiler does, it allows overflow to occur at run time, and then you get a run-time message, "Your expression is too complicated, go back, rewrite it in simpler pieces, and recompile."
To do this, they have to test the invalid flag, which indicates either an invalid arithmetic operation or a stack overflow, depending on the state of another bit in the condition code. The trouble is, to detect this, they clear the flag after every statement, therefore you the programmer can't use it. You can't find out if an invalid operation has occurred, because Microsoft has cleared the flag before the code returns.
DDJ: So you have to write your own libraries?
WK: It's not the libraries. The compiler generates code that blitzes the invalid flag before you see it. You disable the trap on invalid because you want to test it. The compiler reenables the trap, you do your arithmetic operation, it traps, the compiler resets the invalid flag, and if the trap wasn't caused by stack overflow, it just continues.
Suppose I'm going to write a program to solve a nonlinear equation where all you have to do is pass me the function that computes the left side and a guess or two at the arguments, the region where you want to start looking for the root of this equation. My program samples your function at various arguments, starting at the ones you suggested, and on the basis of those arguments, it works out a strategy for approaching the result. The caller doesn't have to know how the root finder works. This is the sort of thing that's behind financial calculators where you enter the payment size, current interest rate, and how long you are going to pay the mortgage, and it figures out how much you can afford to borrow, or you tell it how much you can afford to borrow, and it tells you how much the payments are.
Some equations are expressions that do not have defined results at every point. A denominator might become zero, a logarithm might be sought for a negative number. If, in the course of the equation-solving search, you get out of the range where solutions are defined, there's nothing to be found, and you continue your search, adjusting accordingly. But that's not the way invalid operations are handled in a world where they are considered to be errors which stop the computation.
DDJ: NaN is not ABEND.
WK: Exactly. HP calculators, whose hardware conforms to the IEEE 854 Standard, raise NaN, and your root finder says, "I guess I'm outside the domain of the equation; retract and try again," and your computation continues. This is possible with the Microsoft way of doing things, but since no exception is raised, you have to test each result you get back for NaN. That's simple enough with a simple array, but when it's a structure you're manipulating, you have to test each member instead of just asking a flag.
This is grievous and arises from ignorance or thoughtlessness on the part of the compiler community. But we can't blame the compiler community for doing what they do naturally, because the customers are at least as ignorant for the most part, and don't know what they're missing.
DDJ: And they're screaming for other optimizations.
DDJ: So only the people who write MathCAD get upset about this stuff.
WK: MathCAD is designed to run on a number of platforms, so their way of dealing with this is that anything that can be found on the Intel platform that can't be found on the DEC Alpha platform isn't used. Alpha has some stripped-down arithmetic and is very bad at exception handling because the Alpha has a trap-barrier mechanism which they must use to catch exceptions. The trap barrier is potentially very expensive.
If you say on an Alpha, "I want to conform to the IEEE Standard," if underflow occurs, you'll want to trap, and you'll have to set the trap barrier to catch the trap. The trap barrier is the place where the program pauses to see if an exception has occurred in a basic block between two trap barriers. Naturally, such an exception hardly ever occurs, but just in case it might have occurred, you have to wait. You're waiting all the time for an event that almost never happens.
The Alpha architecture could have been designed to treat gradual underflow like a subroutine call, a call with very limited dependency relationships so you could more or less leave the state of the machine practically unaltered, go and do what you have to do and return. But they chose to make it a full-fledged trap. Because it's a full-fledged trap, you have to use a trap barrier to catch it, you have to mark the instructions which might trap, so that when the trap barrier tells you there has been a trap, you scan the instructions that might have trapped back to the previous trap barrier, and you must make sure that code never jumps into this block, never jumps out of this block, and never reuses a register. So register residency is restricted, and you're suffering a performance penalty in any number of ways. This is because they had this bee in their bonnet that gradual underflow is bad for you, that you shouldn't do it.
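What gradual underflow buys can be shown in a few lines. The sketch assumes IEEE 754 doubles with subnormal numbers enabled, which is the usual default; the function name is made up.

```c
#include <float.h>

/* Difference of two distinct, tiny doubles. With gradual underflow the
   result is a subnormal number; with flush-to-zero it would be 0. */
double tiny_difference(void)
{
    volatile double x = 1.5 * DBL_MIN;   /* volatile: compute at run time */
    volatile double y = DBL_MIN;         /* smallest normal double */
    return x - y;                        /* 0.5 * DBL_MIN, below the normal range */
}
```

With gradual underflow, x != y guarantees x - y != 0, so a guard such as "if (x != y) z = 1 / (x - y)" stays safe from division by zero. On a machine that flushes tiny results to zero, that guarantee is lost.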
DDJ: So what should our industry be doing?
WK: I think it was Churchill who once said something about the essence of leadership being figuring out where people are going to go anyway and getting there ahead of them. The industry, in its own best interests, has to take a role that leads the customers, but not by too much. When it comes to numerical things, part of leadership is to provide an intellectually economical numerical ambiance.
"Intellectually economical" doesn't necessarily mean simple. We have to understand the various arenas in which people want to depend upon computers. We have to understand them in some ways better than the customers. The customers don't always understand the issues because they are preoccupied with what matters to them, not with the design and construction of the systems upon which they depend.
It's like being engaged in sewer repair. The art of sewer repair is to do it in such a fashion so as not to oblige others to think about it at all. If you have to think about it, the sewer repairmen haven't been doing their job.
DDJ: So we programmers are not doing our job?
WK: That's correct. It is straightforward to implement numerics better. I've written about it in a number of places. Look at my article about miscalculating the area and angles of a triangle at http://http.cs.berkeley.edu/~wkahan/Triangle.ps. There's an abstract there that points to a number of misconceptions or superstitions. It doesn't seem to me to be all that difficult for the compiler community to serve these needs.
If the industry gets far enough down a certain path, it can't turn, never mind go back! Java threatens us now in a way that the other languages didn't.
DDJ: Because it's so good? Because it's such a good idea overall that any bad ideas within it...
WK: Java is a reaction to the excesses of C++. Java has all sorts of things about it that are going to have to be fixed. The folks at Sun say, "There are so many JVMs out there that we can't change it." But the JVMs aren't all the same, there are inconsistencies, so eventually the JVM is going to be changed anyway. They have to standardize, but do so in a way that still allows innovation and development. That's a major intellectual challenge. You can't expect to get it right the first time.
I don't feel that the architects of Java are wicked so much as premature to think that they can write a universal language for every man to write everything to run everywhere. That isn't going to happen, it's going to be a turf war -- Bill Joy against Bill Gates. It's a lousy reason to write a language, to write it in order to precipitate a turf war to bring things into a territory more advantageous to Sun than the current situation.
What we have to understand is that different ambiances require different numeric facilities. The people who do the design of nuclear weapons or who solve partial differential equations with shock waves are much less dependent on the fine points of arithmetic than are the people who do statistical correlations, financial calculations, geometrical calculations, especially robotics, displays, architectural presentations, and so on. That's a pretty big market that depends more delicately upon the arithmetic. So if we were to design the arithmetic to support the algorithms which we know these people are going to use...that's what the Intel chip was designed for.
When I laid out the specs for that chip, I had in mind everything about the present ambiance except the numbers of units we eventually shipped. Supporting that ambiance was why I put that stuff in there, that's why they built it in there, and we're not getting access to it! Compilers are wasting their time optimizing floating-point expressions such as multiplication by zero, not actually doing the multiply. It's wrong! If you multiply infinity by zero, you get a NaN. There are all sorts of floating-point optimizations that are not worth the effort, even when they work. Nobody is going to write code that needs such optimizations unless they have a very dumb macro preprocessor that leaves all sorts of fragments around that it should have collected and gotten rid of. Parentheses are hard to get rid of, but zero-times-something is relatively easy to get rid of in a macro preprocessor.
But if you get rid of an expression when someone has written it in cold blood, you're changing the semantics of his program.
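A short sketch of why x * 0 cannot simply be replaced by 0 under IEEE 754 rules. The volatile zero is there so that the multiply really happens at run time; the function name is made up.

```c
#include <math.h>

/* Multiplies by a zero the compiler cannot see through, so the product
   is computed at run time under IEEE 754 rules. */
double times_zero(double x)
{
    volatile double zero = 0.0;
    return x * zero;
}
```

For a finite positive x the product is 0. For infinity it is NaN, and for a negative x it is a zero with the sign bit set, so folding the expression to a plain 0 changes the result in both of those cases.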
Similar are things like constant folding. You (the optimizer) look at an expression and decide, "Oh, this is a constant, so I can compute it at compile time." But maybe that expression was designed for a different rounding mode at run time than the one which prevails at compile time. There are reasons to change the rounding mode without mentioning it.
There's all sorts of stuff in the IEEE Standard designed to help ordinary people do things like diagnose what may be screwing them in a module they got from someone else. Perhaps there's an algorithm that is pretty good for almost all data except yours. It doesn't know you personally, you understand! Your data just happens to be the kind that that particular algorithm doesn't like. How do you find out if this is the case?
One way is to change the rounding mode, which will create different rounding errors. This identifies modules where it might be worth your time to investigate. If you didn't have this ability to change the rounding mode, you might not have any other way to identify which module among many supplied by multiple vendors was the likely candidate for further investigation.
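The diagnostic idea can be sketched as follows: rerun a suspect routine under several rounding modes and see how much the answers disagree. As an assumption, the sketch uses Python's `decimal` contexts in place of the hardware control word, with an artificially low six-digit precision. `suspect_module` is a hypothetical stand-in for code obtained from someone else, and `spread_across_rounding_modes` is likewise an invented helper.

```python
from decimal import (Decimal, localcontext,
                     ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN)

def suspect_module(big, small):
    """Hypothetical vendor routine: fine for most data, bad when big >> small."""
    return (big + small) - big

def spread_across_rounding_modes(func, *args):
    """Run func under several rounding modes; return max result minus min."""
    results = []
    for mode in (ROUND_FLOOR, ROUND_CEILING, ROUND_HALF_EVEN):
        with localcontext() as ctx:
            ctx.prec = 6          # assumed working precision for the demo
            ctx.rounding = mode
            results.append(func(*args))
    return max(results) - min(results)

benign = spread_across_rounding_modes(
    suspect_module, Decimal("1"), Decimal("0.4"))
nasty = spread_across_rounding_modes(
    suspect_module, Decimal("1000000"), Decimal("0.4"))

print("spread on benign data:", benign)
print("spread on nasty data :", nasty)
```

A spread of zero says nothing alarming about the routine for that data. A spread far larger than the expected answer (here the true value is 0.4) marks the routine as worth investigating, without needing to read its source.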
In any army, there's a soldier who doesn't get the message. DEC Alpha decided to put the rounding mode not in a control word alone, but also in the opcode. That means that when you compile, you can compile the rounding modes in such a way that they don't respond to a change of control word setting, and therefore, when you rerun the code with the changed control word, you're going to get the same results as before.
It isn't that this was a bad idea, to put the rounding mode in the opcode as well as the control word. The mistake was that they put the default rounding mode into the opcode. The default should have been left in the control word.
DDJ: How should programmers protect themselves?
WK: Programmers have to understand that compiler writers are not going to do better unless programmers say they want better compilers. There is a proposal called C9X. It's up before ANSI X3J11, the ANSI C Technical Committee. The problem with C9X is that a lot of stuff that people used to do is grandfathered in, so it looks very complicated. But if you strip out the grandfather clauses and look only at the stuff that is for IEEE Standard conformance, you'll see a reasonable attempt to give the programmer access to, firstly, the hardware features -- all three precisions if the machine has them -- and secondly, to enable programmers to write programs to exploit that extra precision if it is there and otherwise do as well as it can without. The latter is a linguistic necessity catered to by the C9X proposal. There's also an attempt to deal with exceptions.
What the programming community has to say is, "Yes, we care about these things, not because all of us want to think about them; we would just like somebody to think them through so that we don't have to."
DDJ: That's what standards are for.
WK: That's what civilizations are for. Civilizations are designed to give you the benefits of others' experience without having to relive it.
The language community should take some numerical analysts seriously enough -- it's numerical analysts who care about these support issues -- and not ask the guys designing H-bombs or supersonic wings, because these latter don't give much of a damn about floating point. Their algorithms are very robust; they can tolerate all sorts of floating point. After all, they run on Crays! Anything that runs on a Cray doesn't care how you round, because a Cray rounds in a way that beggars description.
What we want to do is look into places where there are people who really do care about the details of arithmetic, find out why they care, and see if you can benefit from their rationale. I hope you can!
There is nothing so fine as having folks who are knowledgeable. To be knowledgeable about numerical issues is great, but to tell programmers that they should be knowledgeable about numerical issues is sort of what the Catholics call "counsel of perfection" and not applicable to everyone. But programmers do have to take these issues seriously enough to realize that superficial diagnoses may not be right, that they might really want to find out what is actually happening, why things really malfunction. They may discover, if they will take the trouble to do things like look at assembly listings, that what's going wrong is not what they thought.
If it's the compiler that's causing the trouble, you have to look at assembly code. If it's the algorithm causing the trouble, it takes a different kind of diagnosis.
If things flare, as they often do in e-mail, it's possible to create an environment in which all everyone wants is for the issues to go away. The easiest way to do that is to make the wrong decision: to make everything so simple that when anything goes wrong, it's obvious. But then there are things you just can't do. If you can't do them, then they can't go wrong.
DDJ: Microsoft-style programming.
WK: You've got the picture. Things are genuinely simple when you can think correctly about what's going on without having a lot of extraneous or confusing thought to impede you. Think of Einstein's maxim, that "everything should be made as simple as possible, but no simpler."
Numerics can be a lot simpler than they are, but they are not as simple as Java thinks, and they are not as simple as Microsoft thinks.
In the case of Java, they made some hasty decisions that should be reversed. They decided not to support different floating-point semantics. They decided that everything should be exactly reproducible, but the fact is, "exactly reproducible" is useful only in certain circumstances. They denied people the advantage of better hardware when they have got it. And 99 percent of the people have that hardware now.
DDJ: So your advice on numerical issues on modern platforms is, "If you have these issues, either become interested in these issues or find someone who is interested, someone who gets satisfaction from dealing with these issues."
WK: Above all, think things through in such a way that when you understand, there is a light that illuminates your understanding and gives you confidence: You are not faking it. You really can see the rationale, the forces that mean things must be done in a certain way.
This type of analysis is what should convey a certain sense of satisfaction that people often attempt to get prematurely from aesthetic criteria. They believe that if a program or algorithm looks pretty, why then, it must be okay. If you think that beauty is the sole criterion, remember that beauty is in the eye of the beholder, and in the eyes of a bug, a rose is just fodder!
Copyright © 1997, Dr. Dobb's Journal