Dr. Dobb's | Living By the Rules

Living By the Rules

Understanding compiler rules better equips you to interpret messages the compiler sends you.

May 16, 2006
URL:http://www.drdobbs.com/cpp/living-by-the-rules/187203727

Pete is a consultant specializing in library design and implementation. He has been a member of the C++ Standards Committee since its inception, and is Project Editor for the C++ Standard. He is writing a book on the newly approved Technical Report on C++ Library Extensions to be published by Addison-Wesley. Pete can be contacted at [email protected].

BILL CLINTON GOT IN TROUBLE over a nuanced statement about context dependency, delivered in a context where nuance can be drowned in deliberate noise. The C++ Standard makes nuanced statements about valid and invalid programs, which usually don't drown in mere noise. Instead, they often disappear in a forest of perceived complexity, some essential, some the byproduct of simplifications made by teachers for the benefit of newcomers who soon outgrow them, and some from ad hoc learning that displaces detailed study. The rules for what a compiler must tell you when you try to compile invalid code are more complex than most programmers realize; at the same time, they're also much simpler. In this column, I examine those rules to understand their necessary complexity and their actual simplicity. With that knowledge, you'll be better equipped to interpret messages that you get from your compiler.

Grammars, Human and Otherwise

Most programming languages are modeled on human languages. They have a grammar that consists of syntactic rules, semantic rules, and transformations that map things in the language into things in the computer system that we're writing code for [1].

Syntactic rules are often expressed through grammar productions. You can read the productions for C++, in far greater detail than you're interested in, in the C++ Standard. They tell you what constitutes a valid statement in C++. Just as "This sentence no verb" is not a valid sentence in English, int x 3; is not a valid statement in C++. Both are missing a required element, and in both cases you can look through the grammar rules and find the one that's being violated.

Semantic constraints deal with context: A statement that is syntactically valid might not make sense in the place where it's used, typically because something is missing or ambiguous in the broader context. "C and C++ are third-generation programming languages. It can be far more expressive than assembler." Both sentences are syntactically correct, but the combination of the two doesn't make sense, because it's not possible to determine whether the "it" at the beginning of the second sentence refers to "C" or "C++." The sentences, taken together, violate the semantic rule that a singular pronoun must have exactly one antecedent to refer to. Similarly, in the code fragment void f(char); void f(double); f(3);, none of the three statements standing alone violates any syntax rules, but the combination is ambiguous, because it's not possible to determine whether the call to f(3) in the third statement refers to the first or the second version of f. The statements, taken together, violate the semantic rule that a call to an overloaded function must refer to exactly one of the overloads.

And, finally, the transformation from a valid statement in the language into a meaningful concept in the outside world is fraught with danger. "Colorless green ideas sleep furiously" violates neither syntactic rules nor any semantic rules. Nevertheless, when you map the words it uses and the abstract structure of the sentence into real-world concepts, it doesn't mean anything. Similarly, cout < "The cosine of 30 degrees is " < tan(90) < '\n'; violates no syntactic rules or any semantic rules, but its output is meaningless.

Of course, as programmers, it's our job to make sure that the output of the programs we write, whatever that output's form may be, is meaningful. So we try to write programs that don't violate any syntactic rules or any semantic rules. Having done that, we mentally apply the transformation rules that give meanings to programs, determine exactly what it is that the code we wrote is supposed to do, and check to be sure that what it does is what we want it to do. Easy enough, right? But it's not that simple: There are places where the transformation rules allow more than one meaning.

Wiggle Room

One of the primary goals of the C++ programming language, which it inherited from C, is to permit compilers to generate optimal code for their target platform. As a result, the C++ Standard doesn't spell out the exact details of things such as the size of the various integer types or the order of evaluation of arguments to functions. It leaves it to the compiler writer to decide what works best for the computer system that the compiler is targeting, and to fill in the details appropriately, within certain constraints.

For example, each of the integer types unsigned char, unsigned short, unsigned int, and unsigned long must be able to represent all of the values that can be represented by the type that precedes it in that list. In addition, each of the types must be able to represent all of the values in the minimal range specified for that type. As a result, this code snippet:

unsigned int i = 65535;
cout < ++i < '\n';

can display the value 0 or the value 65536, depending on the actual range of values supported by an unsigned int [2]. The Standard requires that an unsigned int be able to store values that are greater than or equal to 0 and less than or equal to 65535. With a compiler that supports that exact range, incrementing the value of i makes it wrap around to 0. With a compiler that supports a larger range, incrementing the value of i simply sets it to the next value, 65536 [3].

To provide some guidance for users of features that have this sort of flexibility, the C++ Standard provides two categories for requirements that can be satisfied in more than one way: implementation-defined behavior and unspecified behavior. In both cases, the implementor chooses from a usually small set of alternatives. When something is designated as implementation-defined behavior, the implementor must document which choice was made. For unspecified behavior, no documentation is required. For example, the fundamental type char must have the same representation as either signed char or unsigned char. The choice is implementation-defined, so the documentation that comes with the compiler must tell you which one it is. On the other hand, the order of evaluation of arguments to a function is unspecified. The following code can write its two output lines in either order, and the implementor has no obligation to tell you what the order will be [4]:

    int f() {
    cout < "In f\n";
    return 3;
    }

    int g() {
    cout < "In g\n";
    return 4;
    }
    int sum(int i, int j) {
    return i + j;
    }
    int main() {
    return sum(f(), g());
    }

In both cases, if you want to write portable code, you should make sure that any code whose effect depends on implementation-defined or unspecified behavior is easily identifiable and well isolated, so that it can be changed easily if the need arises.

Well-Formed Programs, Ill-Formed Programs, And Diagnostic Messages

Some of the semantic rules in the C++ Standard explicitly say that if they are violated, "no diagnostic is required." Some say that violating the rule results in undefined behavior. Some say nothing about what behavior is required. The rest of the semantic rules are referred to as diagnosable semantic rules [5].

A well-formed program is a C++ program that does not violate any syntax rules, diagnosable semantic rules, or the One Definition Rule [6]. An ill-formed program is a program that is not a well-formed program. A diagnostic message is a message that is one of an implementation-defined subset of the implementation's output messages. If a program contains a violation of any diagnosable rule [7], the C++ Standard requires the implementation to issue at least one diagnostic message. If a program does not contain a violation of any of the rules (diagnosable or otherwise), and the program isn't too big for the compiler to handle [8], the C++ Standard requires the implementation to "accept and correctly execute" the program.

Read those definitions again, paying particular attention to what they don't say. They don't say that any code shouldn't compile, nor do they say anything about error messages or warnings. In fact, a compiler that always gives the message "compile successful," regardless of any violations of language rules, meets those requirements [9]. Of course, nobody would use such an uninformative compiler if they could avoid it, but that's just a practical detail.

Practical Details

Compilers have to offer more than mere standards conformance to succeed in the marketplace. When there's an error in the code, we expect compilers to tell us something about what was actually wrong and where. We also expect that a compiler won't generate an executable file if there's nothing sensible it can do with the code we fed it. That's what most programmers mean when they say that they expect "an error," or that the compiler "shouldn't compile" the code.

On the other hand, there often is something sensible that the compiler can do with code that violates diagnosable rules in the C++ Standard. That's known as a language extension, and it's one of the reasons that the C++ Standard says so little about what compilers should do with ill-formed programs. For example, C++ code that defines a variable whose type is long long int violates a syntactic rule: There is no such type. If the Standard prohibited compilers from accepting code that violated syntactic rules, that common extension (based on C99 and almost certainly coming to C++ in its next revision) couldn't be used. As it is, the compiler must issue a diagnostic, and it is then free to do just what you expect it to do: Treat your variable as an integer with type long long int, and adjust the normal rules as appropriate.

In practice, this means that we have two typical categories of diagnostic messages: Error messages, meaning "This code violates a requirement of the C++ Standard, and I refuse to compile it," and warnings, meaning "This code violates a requirement of the C++ Standard, but I'm going to do something (probably) sensible with it, anyway." Unfortunately, most compilers also use warning messages to give advice about programming style. Compilers should not be in the business of criticizing style; there are other tools that do that. Compiler output messages should clearly distinguish between extensions and advice, either with a different kind of message or with a switch that turns off all messages that don't relate to violations of language rules. That would make it much easier to ignore their advice, and concentrate on the real coding problems [10].

The next time someone comes to you with code that they say "shouldn't compile," smile knowingly, and get on to the real problem.

Notes

[1] I'm using "semantic" here with its meaning in computer languages; in human languages, its meaning tends to incorporate both this meaning and the mapping between words and ideas.
[2] Java tried to do away with this flexibility by prescribing exact sizes for all integer and floating-point types. That seems to have worked okay for the integer types (although prematurely settling on 16 bits for its character type means that writing general-purpose character handling code is unnecessarily difficult now that Unicode doesn't fit in 16 bits), but with floating-point types it caused a major problem. Prescribing sizes and exact semantics meant that the Java runtime couldn't use the fastest available floating-point implementation on some hardware, so floating-point Java code ran significantly slower on Intel hardware than the equivalent C or C++ code. These restrictions on floating-point types have been relaxed at the request of people who write number-crunching code.
[3] If you need fully predictable behavior for code like this, the usual solution is an #if/#elseif chain that checks the range, using the macros defined in the header <climits>. In its latest revision, the C Standard provides a set of types with well-defined sizes. The C++ Standards committee's Technical Report on C++ Library Extensions also adds these types.
[4] The order sometimes changes when the code is recompiled with the same compiler and different optimization settings.
[5] Violations of semantic rules that are not diagnosable semantic rules result in undefined behavior; that is, the C++ Standard doesn't impose any requirement on what a compiler does when faced with such code. When referring to this, please don't use the abominable wording "This program invokes undefined behavior." The correct phrasing is "The behavior of this program is undefined." And please keep in mind that undefined behavior means only that the C++ Standard doesn't say what the code in question does. It does not mean that compilers are obliged to do nasty things like set fire to your hard drive. Often, the best way to write code that takes maximal advantage of the hardware it will run on is to use code constructs whose behavior is undefined, but well understood. For example, if you really need speed, instead of testing whether an integer value is greater than or equal to zero and less than some upper limit, you can convert it to an unsigned integer type with the same number of bits and test whether the result is greater than the upper limit. On most architectures, converting the value doesn't change any bits, so it doesn't require any code; negative values are simply treated as large unsigned values, which will always exceed the limit. The behavior of that code is formally undefined, but it works. Except when it doesn't. Test for it.
[6] The One Definition Rule says, in essence, that things that ought to be defined the same way must be defined the same way. For example, defining a struct named foo with a single member of type int in one translation unit and with a single member of type double in another translation unit violates the ODR. Violations of the ODR result in undefined behavior.
[7] The diagnosable rules are the syntactic rules and all diagnosable semantic rules.
[8] Some valid programs are simply too big for a compiler to handle. That doesn't make the compiler nonconforming, even if other compilers can handle the same program.
[9] Provided, of course, that its documentation lists that message as its diagnostic message.
[10]Advice is sometimes useful, especially for beginners, but I'm tired of having to write code that satisfies four different compiler writers' ideas of what constitutes good style, so that customers can compile it with warnings cranked up to the maximum. Yes, if(0 <= x) might be a mistake when x is an unsigned type; on the other hand, when I'm writing template code that works with all integral types, I don't want to write two separate versions for signed and unsigned types, and I don't want to have to write some compiler-breaking mess of metaprogramming. I just want to write the test, even though it's always true for an unsigned type. The code is legal and its meaning is well defined and clear. If the compiler writer wants to remove the test when x is unsigned, that's fine with me.