Dr. Dobb's | Computer Programming and Precise Terminology

Computer Programming and Precise Terminology

Teaching a new programming language is difficult enough without confusing the very concepts we are trying to teach

July 10, 2008
URL:http://www.drdobbs.com/jvm/computer-programming-and-precise-termino/208808373

Jack Purdum is an assistant professor of computer technology at Purdue University.

Teaching a new programming language is difficult enough without confusing the very concepts we are trying to teach. The problem of concept confusion is compounded because the art of software development is a building process. That is, the concepts introduced early in the learning process serve as the foundation for understanding more complex concepts that are introduced later on.

Define vs. Declare

One early point of confusion comes from the belief that data definitions and data declarations are the same thing. They are not. Perpetuating this confusion dooms the students' understanding of subsequent, more complex topics. As stated by Brian W. Kernighan and Dennis M. Ritchie in The C Programming Language:

It is important to distinguish between the declaration of an external variable and its definition. A declaration announces the properties of a variable (its type, size, etc.); a definition also causes storage to be allocated.

If you really understand the difference between these two concepts, some nettlesome programming concepts become duck soup to understand. More importantly, understanding the distinction between define and declare applies to any language, although I concentrate on C#, Visual Basic, and Java here.

An Example

Suppose we have the statement (in C, C++, C# or Java):


int k;			// In C, C++, C#, Java


Dim k as Integer	' Visual Basic

Ask yourself: What does the compiler do with such a simple statement? First, the compiler checks to see if the syntax is correct, which it is in this example. (Okay, I take some liberties with the actual internal workings of the compiler. However, when teaching beginning students, such abstractions and simplifications often help the students to understand what otherwise might be obscured by unnecessary details.) Next, the compiler scans its symbol table to see if a variable named k has already been defined. Table 1 is a (greatly simplified) symbol table. In the table, a variable named i has already been defined elsewhere in the program. Table 1 shows the state of the symbol table after variable i has been defined, but before variable k has been entered into the table.

Table 1: A Hypothetical Symbol Table.

Looking through the symbol table in Table 1, the compiler does not find another variable named k at the same scope level. Therefore, the compiler fills in the attribute list for the new variable as it currently understands the variable named k. The new state of the symbol table is in Table 2.

Table 2: A Hypothetical Symbol Table, after attributes for k entered.

lvalues

Note in Table 2 that we have not filled in the lvalue for k. What is an lvalue? The lvalue of a variable is the memory location where that variable is stored. (We read "lvalue" as "location value", the memory address of a variable, a usage that goes back to the old assembly language days. K&R instead tie the name to the left side of an assignment, E1 = E2, but the location reading is the one that helps students here.) At this point, Table 2 shows us that we know quite a bit about the variable named k, but we do not yet know where it is stored.

The compiler now issues a request to the operating system's memory manager and asks for 4 bytes (the storage requirements for an int data type, taken from column 3 in Table 2) of storage. The memory manager looks for 4 contiguous bytes of storage and, assuming the request can be fulfilled, the memory manager passes back to the compiler the memory address of those 4 bytes of storage. For illustration, we assume the memory manager passes back address 910,000. Table 2 then changes state to look like the symbol table in Table 3.

Table 3: A Hypothetical Symbol Table, after storage is allocated for k.

Now recall the K&R statement:

A declaration announces the properties of a variable (its type, size, etc.); a definition also causes storage to be allocated.

The first four columns in Table 3 describe the basic properties of our variables, but it is column 5, the lvalue column, that has to be filled in for us to have a data definition. Notice that a data definition includes a data declaration while a data declaration does not include a data definition. Indeed, the state of variable k in Table 2 is a data declaration for variable k. On the other hand, Table 3 completes the entry and forms a data definition for variable k because the variable has been allocated storage (i.e., its lvalue is assigned a memory address).
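A toy version of such a symbol table can make the distinction concrete for students. The following C sketch is only an illustration, not how any real compiler is built: the struct fields, the function names declare() and define(), and the convention that an lvalue of 0 means "no storage yet" are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SYMBOLS 16

/* One row of the hypothetical symbol table. */
struct symbol {
    const char *name;
    const char *type;
    int size;        /* storage requirement in bytes */
    int scope;       /* scope level */
    long lvalue;     /* assumed convention: 0 means no storage allocated */
};

static struct symbol table[MAX_SYMBOLS];
static int symbol_count = 0;

/* Find a name at the given scope level, or return NULL. */
static struct symbol *lookup(const char *name, int scope)
{
    for (int n = 0; n < symbol_count; n++) {
        if (table[n].scope == scope && strcmp(table[n].name, name) == 0)
            return &table[n];
    }
    return NULL;
}

/* Declaration: record the attributes only (the state shown in Table 2). */
struct symbol *declare(const char *name, const char *type, int size, int scope)
{
    if (lookup(name, scope) != NULL || symbol_count == MAX_SYMBOLS)
        return NULL;                 /* duplicate name, or the table is full */

    struct symbol *s = &table[symbol_count++];
    s->name = name;
    s->type = type;
    s->size = size;
    s->scope = scope;
    s->lvalue = 0;                   /* no storage yet */
    return s;
}

/* Definition: a declaration plus an address (the state shown in Table 3). */
struct symbol *define(const char *name, const char *type, int size,
                      int scope, long address)
{
    assert(address != 0);            /* 0 is reserved for "not allocated" */

    struct symbol *s = declare(name, type, size, scope);
    if (s != NULL)
        s->lvalue = address;
    return s;
}
```

Calling define("k", "int", 4, 0, 910000L) fills in every column for k. A second call for k at the same scope level returns NULL, which is the toy equivalent of a duplicate-definition complaint from the compiler.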

Now, contrast the definition of variable k we just discussed with the C statement:


extern int j;

After checking the syntax for this statement, the compiler fills in the attribute list for the variable j. The symbol table now looks like Table 4.

Table 4: A Hypothetical Symbol Table, after processing an extern variable.

In C, the keyword extern means that the variable is defined in another source file, but we would like to be able to use that variable in this source file. Because of the way extern variables work, the definition of j has already been processed when some other source file was compiled. (The fact that j was already given a home in memory is why we have the Scope column set to a different value.) Therefore, there is no need to allocate memory for variable j here. (It is the linker's responsibility to sort out where j actually lives in memory when all of the compiled source files are brought together.) Because no storage is allocated for j when the current source file is compiled, the lvalue column for j is not filled in. Therefore, the statement:


extern int j;

is a data declaration because no storage (i.e., no lvalue) was allocated for j. The statement only gives the compiler the information it needs wherever j is used in the current source file.
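In C the two kinds of statement can sit side by side. In the sketch below, the initial value 42 and the function sum_jk() are invented for the example, and the definition of j is placed in the same file only to keep the example self-contained; normally it would live in another source file.

```c
#include <assert.h>

extern int j;      /* declaration: announces name and type, allocates nothing */
extern int j;      /* a declaration may be repeated without complaint */

int k;             /* definition: storage is allocated (and zeroed at file scope) */

int j = 42;        /* the one definition of j; usually found in another file */

/* Uses both variables; each definition produced its own storage. */
int sum_jk(void)
{
    assert(&j != &k);      /* two definitions, two distinct addresses */
    return j + k;
}
```

Writing a second definition such as int j = 7; in the same file would be rejected by the compiler, while the repeated extern line is accepted. That asymmetry is exactly the declare/define distinction.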

Another common example illustrating the declare/define distinction was created when ANSI standardized the C programming language (i.e., X3J11) and allowed function prototypes. For example:


int myFunction(int a, double b);

The purpose of function prototypes is to allow the compiler to perform type-checking on the function parameters and return type. The symbol table might again change to something similar to that in Table 5.

Table 5: A Hypothetical Symbol Table, after processing a function prototype.

Once again, no memory is allocated for myFunction(), as indicated by the empty lvalue column in Table 5. Only the attributes of the function are recorded in the symbol table so the compiler can perform type-checking when the function is used. (The Attributes in column 6 of the table might represent the number of arguments for the function; column 7 might be the byte count for the first argument, etc. Symbol tables can be quite complex and tables with dozens of columns are not uncommon.) Constructs similar to function prototypes are found in other languages, like abstract classes in C++ and interfaces in C# and Java.
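A short C sketch shows a prototype doing its job. The function body and the helper callMyFunction() are hypothetical; the article only gives the prototype itself.

```c
#include <assert.h>

int myFunction(int a, double b);    /* prototype: a declaration only */

/* The call compiles because the prototype tells the compiler the
   parameter types and return type, even though no body exists yet. */
int callMyFunction(void)
{
    int result = myFunction(3, 2.5);
    assert(result == 5);            /* 3 + (int)2.5 */
    return result;
}

/* Definition: only now is code (and therefore storage) generated. */
int myFunction(int a, double b)
{
    return a + (int)b;
}
```

If the call were changed to myFunction("three", 2.5), the compiler could flag the mismatch using nothing but the prototype.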

As K&R point out, the critical distinction between data definitions and data declarations is that data definitions do cause storage to be allocated (i.e., there is an lvalue entry in the symbol table) while data declarations do not. Usually, data declarations appear in a source file for information purposes (e.g., to permit type-checking or enforce signature rules).

Sadly, programmers blur the distinction between data definitions and data declarations all the time. Indeed, most textbooks seem to be oblivious to the distinction. Microsoft's documentation uses the term "declaration" everywhere, and the term is used incorrectly most of the time. This not only makes learning programming more difficult, it robs the student of some useful learning techniques.

Data Definitions

The narrative above raises the question: Why is all of this important for beginning programmers to understand? First, if all students use the same terms in a consistent way, there is less chance for them to misunderstand the topic at hand. Most textbooks use the terms "data definition" and "data declaration" as though they are synonyms. They are not. Making the distinction early makes teaching more complex topics easier, as you will see in a moment.

Second, taking the time to explain what a symbol table is and just some of the information it contains helps students better understand error messages issued by the compiler. For example, after spending one lecture on the concepts discussed above, we have never had a student ask what a "Duplicate Definition" error message means. They immediately understand what it means and how to correct it. While most modern programming languages require a variable to be defined before it can be used in an expression, understanding the concepts behind a symbol table makes it clear to beginning students why they must define a variable before they can use it. Even though such things may be intuitively obvious to us, they are not to beginning students.

Finally, topics like value types versus reference types and pass-by-value versus pass-by-reference become much easier to explain using the concepts that we demonstrate in the remainder of this article. As someone once said: "If the only tool you have is a hammer, all your problems begin to look like a nail." The difference between value and reference variables is often a complex topic for students to comprehend, and the techniques described here can be another tool to use when teaching such topics.

We find that the following diagrams make it easier to present the concepts to the students. We start off with a simple definition of an integer variable:


int i;	// Statement 1

We can then represent this statement as in Figure 1.

Figure 1: Lvalue and rvalue after syntax parsing.

We tell the students that the lvalue and rvalue boxes have question marks because the compiler has not yet sent a request to the operating system for storage. In other words, all the compiler has done to this point is check the syntax in Statement 1 (which is okay) and check the symbol table to see if variable i is already defined at the same scope level. Figure 1 represents the state of variable i at this point in the program.

The compiler then asks the operating system's memory manager for 4 bytes of storage. (See column 3, Table 5.) Assuming the memory manager finds 4 contiguous bytes of storage, it passes back the memory address of those 4 bytes of storage (e.g., assume memory address 900,000). Our diagram now becomes like Figure 2.

Figure 2: Lvalue and rvalue after memory allocation

Note, because variable i now has an lvalue, we have a data definition for variable i.

What does the rvalue represent? The rvalue is what is stored at the lvalue. (Again, the reading we use harkens back to the old assembly language days, when the term represented the "register value" of a data item; the more common account ties the name to the right side of an assignment.) In other words, the rvalue is the current value of variable i. We have left the rvalue unknown because, at this juncture, some compilers initialize the value to 0, while others leave the 4 bytes unchanged, so the rvalue is whatever random bit pattern happens to be in memory at that (lvalue) memory address. (We always teach our students never to assume the compiler initializes a variable with a meaningful rvalue.)

Now consider the statement:


i = 10;			// Statement 2

After the compiler checks the statement for proper syntax, it processes the statement by going to the symbol table, finding the lvalue for the variable (900,000) and depositing "4 bytes with a value of 10" at that memory address. The state of variable i is transformed to reflect the state in Figure 3.

Figure 3: Lvalue and rvalue after assignment

Note that the rvalue is now 10. Also, note that, if variable i were only declared, there would be no lvalue and, hence, no way to change its value. At this point, we review the fact that an assignment statement is always concerned with moving whatever is on the right side of the assignment operator into the rvalue of whatever is on the left side. We also make the assertion that the assignment operator can only be used with variables that have been defined previously at some point in the program.
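Students can watch both values on their own machines. The C sketch below mirrors Statements 1 and 2; the function names assignTen() and printState() are made up for the example, and the address printed will differ from the 900,000 used in the figures.

```c
#include <assert.h>
#include <stdio.h>

int i;                          /* Statement 1: defines i, so i has an lvalue */

/* Statement 2, wrapped in a function so it can be called on demand. */
void assignTen(void)
{
    int *lvalue = &i;           /* where the bucket is */

    i = 10;                     /* Statement 2: change what is in the bucket */

    assert(*lvalue == 10);      /* looking in the bucket now shows 10 */
    assert(lvalue == &i);       /* the assignment did not move the bucket */
}

/* Print the two boxes of the lvalue-rvalue diagram for i. */
void printState(void)
{
    printf("lvalue of i: %p\n", (void *)&i);
    printf("rvalue of i: %d\n", i);
}
```

Calling printState() before and after assignTen() shows the rvalue changing while the lvalue stays put, which is the transition from Figure 2 to Figure 3.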

Quite honestly, some students' eyes are a little glazed over at this point, suggesting that this approach is less than intuitively obvious to some students. Fortunately, that is easily resolved.

The Bucket Analogy

Because students initially have some difficulty understanding lvalues and rvalues, we developed the "Bucket Analogy" to help them understand the concepts in a simple way. We use the Bucket Analogy immediately after the discussion of lvalues and rvalues presented above. Simply stated, an lvalue is the memory address where you can find a variable's bucket in much the same way that a street address tells you where to find a specific house. The rvalue is what you see when you look inside the bucket. And finally, the variable's data type (see column 2 in the symbol table) determines the size of the bucket. (While most buckets have their size expressed in gallons, our buckets' size is expressed in bytes.)

Using the Bucket Analogy to Explain Casts

All kinds of teaching concepts can benefit from the Bucket Analogy. For example, in C# and Java, consider the statements:


int val;
double x;

// some code...

val = x;	// Statement 20

Technically, we could tell the students: "The compiler does not like the assignment of x into val in Statement 20 because data narrowing reflects an impedance mismatch between the two variables' data types resulting in a possible loss of information." Or, we can use the Bucket Analogy and the symbol table information and simply say: "The compiler's complaining because you are trying to pour 8 bytes of information into a 4-byte bucket." That is, the 8 bytes of double data stored in x's bucket won't fit into val's 4-byte int bucket and information might be spilled and lost in the process.

We then ask them how to solve the bucket overflow problem. Perhaps they come up with:


val = (int) x;

In terms of the Bucket Analogy, you can explain the cast that data narrowing requires as an instruction to the compiler to cut the contents of the larger bucket down to fit the destination bucket, so that nothing "spills" by accident during the assignment. Whatever does not fit (here, the fractional part of x) is discarded on purpose.

When you attempt to explain data widening using the statement:


x = val;

ask the students why the compiler does not complain even though the two variables are not of the same data type. The answer is simple: "Data widening is not a problem because you are pouring 4 bytes of information into an 8-byte bucket...no information is spilled or lost in the process." (We also point out, however, that they should still use a cast to document the silent conversion being performed by the compiler.)
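The two pours can be sketched in C. Two caveats: unlike C# and Java, a C compiler accepts the narrowing assignment without a cast (at most with a warning), and the 4-byte int and 8-byte double sizes are typical rather than guaranteed. The function names are invented for the example.

```c
#include <assert.h>

/* Narrowing: an 8-byte double poured into a 4-byte int bucket.
   The cast says the loss of the fractional part is intentional. */
int pourIntoInt(double x)
{
    return (int)x;
}

/* Widening: a 4-byte int poured into an 8-byte double bucket. */
double pourIntoDouble(int val)
{
    double x = (double)val;     /* cast documents the conversion */

    assert((int)x == val);      /* nothing was spilled along the way */
    return x;
}
```

With these definitions, pourIntoInt(3.99) yields 3, not 4: the part that does not fit is dropped, not rounded.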

Once the students have grasped the basic concepts, you can go back and fill in the explanation using more technical terms if you think it is necessary.

Explaining Value Types versus Reference Types

The concepts of lvalues and rvalues in conjunction with the Bucket Analogy also make it easier to explain the difference between value types and reference types in languages that support objects. Consider the following statements (for C# or Java; in C++ the second statement would create the object itself rather than a reference):


int i;
clsPerson myFriend;

We might reflect these two statements in a symbol table like that in Table 6.

Table 6: A Hypothetical Symbol Table, value and reference types.

Using the symbol table information from Table 6, we can draw the associated lvalue-rvalue diagrams as in Figure 4.

Figure 4: Lvalue and rvalue Diagrams for i and myFriend

In this example, we assume that the two variables are instance variables being defined for use in a program. Most OOP languages initialize such variables so value types are initialized to 0 and reference types are initialized to null, as in Figure 4.

The stumbling block for many students is the distinction between a reference variable and an instance object of a class. The students probably understand the definition of variable i using the narrative associated with Figure 2. Explaining the statement:


clsPerson myFriend;

however, often takes a little more effort. From the symbol table in Table 6, we can see that we have defined a reference variable named myFriend. At this point, you would give the students the following rule:

A reference variable can only have an rvalue with one of two possible values: 1) null, or 2) a memory address.

If we look at Figure 4, we can see that myFriend does have an lvalue of 750,000, but it has an rvalue of null. This means: we have defined a reference variable named myFriend, but we have only declared a clsPerson object. The interpretation is that myFriend does exist (i.e., it is defined), but no object yet exists because the rvalue of myFriend is null (i.e., the object is declared, but not defined). At this point, we simply have information that describes an object (i.e., it can "become" a clsPerson object), but that object does not yet exist in memory. Again, thus far, we have defined a reference variable named myFriend which is a declaration for a clsPerson object. (This is the point where programmers who treat the terms definition and declaration as synonyms get into trouble when trying to explain object instantiation.)

To define a clsPerson object that we can actually use in our code, we need to "finish" the data definition for an instance of a clsPerson object. We do this with the statement:


myFriend = new clsPerson();

After the compiler checks the syntax and finds it acceptable, the compiler issues a memory request to the operating system's memory manager for enough memory to hold a clsPerson object. An object might take only a few bytes of memory or it might require several kilobytes of memory depending upon the object's complexity. Whatever the actual request is, the memory manager returns the memory address where the bytes for that object are located. Once that memory request is fulfilled, code to call the class constructor is generated and the constructor instantiates the object according to the constructor's code. The returned memory address is then stored as the rvalue of myFriend. Because the rvalue of myFriend now contains a valid memory address, variable myFriend references an object of clsPerson that we can use in our program.

Just to make things more concrete, assume a clsPerson object takes 2,500 bytes of storage and the memory manager found that many free bytes of memory at memory address 780,000. Figure 4 now becomes Figure 5.

Figure 5: Lvalue and rvalue Diagrams for i and myFriend

Note how the rvalue of myFriend has changed from null to the memory address where the 2,500 bytes of memory associated with the clsPerson object are located. In other words, we have now defined a clsPerson object that we can access through the myFriend reference variable. Also notice that when a reference variable has an rvalue that is null, it does not reference a usable object. That is, a null rvalue for a reference variable means we have declared an object (i.e., we know something about it), but the object is not yet defined (i.e., we cannot do anything with it because the object is not yet instantiated with a known memory address). Once the reference variable's null rvalue is replaced with a valid memory address, we know we have defined a class object that we can use via the reference variable named myFriend.
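C has no reference types, but a pointer behaves much like the reference variable described above: its rvalue is either NULL or a memory address. The following sketch uses that as a stand-in. The struct fields and the helper new_clsPerson(), which plays the role of new plus a constructor, are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical class layout; the article does not list clsPerson's members. */
struct clsPerson {
    char name[32];
    int age;
};

/* Stand-in for "new clsPerson()": request storage, then initialize it.
   Returns NULL if the memory request cannot be fulfilled. */
struct clsPerson *new_clsPerson(const char *name, int age)
{
    assert(name != NULL);

    struct clsPerson *p = malloc(sizeof *p);    /* the memory request */
    if (p == NULL)
        return NULL;

    /* "Constructor" work: copy the name safely and set the age. */
    strncpy(p->name, name, sizeof p->name - 1);
    p->name[sizeof p->name - 1] = '\0';
    p->age = age;
    return p;
}
```

A caller would first write struct clsPerson *myFriend = NULL; (Figure 4: the variable is defined, the object is not), then myFriend = new_clsPerson("Ann", 30); (Figure 5: the rvalue now holds an address), and eventually free(myFriend); because C, unlike C# and Java, does not reclaim the object automatically.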

Using lvalues and rvalues to Explain Argument Passing

Pass by Value

You can also use the lvalue-rvalue diagram and the Bucket Analogy to explain the concepts of "pass by value" and "pass by reference". For example, for several different languages we might find data definitions similar to those in Table 7.

Table 7: Data Definitions and Method Calling in Different Languages

Despite the minor language differences in Table 7, all of them pass the value of 10 to a method named func(). The signature for this function in the different languages might be written:


C


void func(int p)

Visual Basic


public sub func(ByVal p as Integer)

C#, Java


public void func(int p)

The lvalue-rvalue diagram is always the same, regardless of the language, so we can represent all three language variations in Figure 6, which shows how variable i looks in the calling routine.

Figure 6: Lvalue and rvalue for Variable i

Now simply tell the students that the 4-byte value 10 is copied to a memory location on a special segment of memory called a stack (you can explain what the stack is if you wish) and, once inside the function, those 4 bytes are popped off the stack into the rvalue of a variable named p. Variable p is a temporary variable and might look like Figure 7.

Figure 7: Lvalue and rvalue for Variable p

(The temporary variable p has a relatively large lvalue because temporary variables are allocated on the stack, which tends to be located in high memory.) The important thing to notice is that p is a defined variable that was just assigned the value that was passed to it from variable i. Stated differently, the rvalue of i has been copied into the rvalue of p.

Notice that the lvalues for i and p are very different, but the rvalues are now the same. This is what is meant by "pass by value". Pass by value means that the rvalue of a variable in one part of the program is passed to a method (or function) located at a different part of the program. However, since the lvalues for the two variables are different, anything done to p in the method has absolutely no direct impact on variable i. This also means that you cannot contaminate the contents of i accidentally by things you do to p. Variables i and p are located at entirely different places in memory and, hence, what we place into one bucket has no effect on the other bucket. (Refer to the lvalues in Figures 6 and 7.)
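A minimal C sketch of pass by value follows. The body of func() and the caller callByValue() are invented for the example; only the signature comes from the text above.

```c
#include <assert.h>

/* p receives a copy of the caller's rvalue in its own bucket. */
void func(int p)
{
    p = p + 5;              /* changes p's bucket only */
}

/* Returns the caller's value after the call, to show it is untouched. */
int callByValue(void)
{
    int i = 10;

    func(i);                /* the rvalue 10 is copied, not the lvalue */
    assert(i == 10);        /* i's bucket was never touched */
    return i;
}
```

Single-stepping through func() in a debugger shows p becoming 15 while i stays at 10.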

Pass by Reference

The student is now ready to understand what pass by reference means. In Table 8, we reproduce part of the lines from Table 7 that might be affected, plus their associated method signatures.

Table 8: Implementing Pass by Reference

In the calling routines, the C language needs a change: we must preface the variable with the ampersand (&, or address-of) operator. This tells the compiler to place the lvalue on the stack, rather than the rvalue as was the case with pass by value. (C# likewise requires the ref keyword in front of the argument at the call site.)

Now look at the signatures for all three methods in Table 8. For the C language, the *p tells the compiler that the function named func() is being passed a pointer to a piece of data. A pointer is nothing more than the lvalue for the data item. For Visual Basic, the ByRef keyword says the same thing to the compiler: "Send me the lvalue of the data item, not its rvalue." For C#, the ref keyword sends the same message to the compiler: "Give me an lvalue, not an rvalue". As a result, the temporary variable p looks like Figure 8.

Figure 8: Lvalue and rvalue for Variable p

Notice that the lvalue of p is the same as the lvalue of i in Figure 6. In other words, the bucket for i and the bucket for p are exactly the same bucket. Variable p, therefore, is nothing more than an alias for variable i. Anything done to p in the func() method permanently affects the value of i as defined somewhere in another part of the program. (A good way to prove this to the students is to single-step the program with a watch window that observes the values of i and p. Change p in the method, which simultaneously changes i in the calling code.) We often follow this discussion with the impact of pass by reference on data encapsulation.
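The C flavor of pass by reference can be sketched the same way. Strictly speaking, C copies the address onto the stack, so p is a pointer whose rvalue is the lvalue of i; the effect on i is the one described above. The body of func() and the caller callByReference() are assumptions made for the example.

```c
#include <assert.h>

/* p holds the lvalue of the caller's variable, so *p is that variable. */
void func(int *p)
{
    *p = *p + 5;            /* writes into the caller's bucket */
}

/* Returns the caller's value after the call, to show it has changed. */
int callByReference(void)
{
    int i = 10;

    func(&i);               /* & places the lvalue of i on the stack */
    assert(i == 15);        /* the change made through p landed in i */
    return i;
}
```

Compared with the pass-by-value version, only two things changed: the & at the call site and the * in the function. Those two characters decide whether func() gets a copy of the bucket's contents or the bucket itself.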

Conclusion

Regardless of the language you use, the distinction between declare and define is important and real. This distinction becomes clearer when you use the symbol table and lvalue-rvalue concepts to explain how the two terms are different. Further, the Bucket Analogy can be used to explain concepts that are often difficult for beginning students to understand, such as casting, value versus reference types, and pass-by-value versus pass-by-reference. A side benefit is that students are more adept at using and understanding the information made available by a debugger.