Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Using Large Arrays In Turbo C


January 1991/Using Large Arrays In Turbo C

Stuart Baird first learned to program on an IBM 1620 in 1966, going on to study computer science at the University of Illinois. Now a General Electric employee, he has long since abandoned professional programming for administration, but remains an avid recreational programmer at home, using C.

Declaration Of Large Arrays

Declaring large arrays within a function such as main() is something programmers brought up on other programming languages are used to doing without extra thought. For example,

int a[20000],b[20000];
is the C equivalent of the DIM or DIMENSION statements that a programmer would use in BASIC or FORTRAN. It's a perfectly acceptable declaration in C, and it compiles in Turbo C (Borland International's C compiler) without comment. So, it comes as a shock when a simple program using that declaration doesn't work.

Turbo C, of course, is written for the Intel 8086 and related microprocessors, which have a segmented memory architecture. These microprocessors normally address only one segment, 64K bytes of memory, at a time. The designer of a compiler that produces code to execute on such a machine must decide whether to insulate the programmer from the idiosyncrasies of the hardware or to cling closely to the architecture for greatest control and maximum speed. Given the nature of C, the latter is the natural choice. Unfortunately, that choice presents unexpected problems to the programmer using large arrays. ("Pointer Arithmetic at Memory Segment Boundaries" by Daniel and Nancy Saks, The C Users Journal, October 1989, discusses other aspects of this problem.)

One way Turbo C solves the addressing problem is by using different models of the compiler and libraries, depending on the anticipated size of the program. Turbo C's documentation disccusses the models. The small model is recommended for most programs; it allows 64K bytes of code and 64K bytes of static data. This article refers to the small model unless it specifies otherwise. Since I discuss only large data, not large code, I will also use the term small memory model to refer to any model that limits data to 64K, namely the tiny, small, and medium models. The term large memory model will refer to any model that lifts that restriction, namely the compact, large, and huge models.

In the array declarations at the beginning of this article, the Turbo C compiler assigns both arrays to the same data segment (of 64K bytes). This action is understandable, since in a small memory model a near pointer, only two bytes long, is used to address the data. It's impossible for two-byte pointers to represent more than 64K different addresses. However, if the declared arrays are large enough, they overlap, but the compiler issues no warning message. In the example, 40,000 two-byte integers cannot all fit into 64K (that is, 65,536) bytes. At some point, one of the arrays wraps around to the beginning of the data segment, so some of b's elements "share" memory locations with elements of a, which is not good. (Incidentally, you cannot assume that b is the array that wraps around, nor can you assume that you have a full 64K to work with.) Assignment to the same segment is true even in the large memory models, which seems to be a bug.

The solution must be to use dynamic memory allocation. See Listing 1.

It's clear that near pointers, which address only within 64K, are insufficient, hence the far modifiers. Alas, this doesn't work, either. The type of pointer malloc() returns in a small memory model is a near pointer. It may gladly return a non-NULL pointer, seemingly indicating memory is available, but the address space is still within 64K, and the arrays still overlap. Turbo C does remind you of the non-portable pointer assignment (that is, the conversion of malloc ()'s near pointer return value to the far pointer a) at compile time. You can eliminate the message with a cast, but the problem itself remains.

In the Compact model (one of the large memory models), the previous example does work, with or without the far modifiers, which are the default pointer type. Other reasons for not using this model will be mentioned later.

In a small memory model, I must use farmalloc() instead of malloc ():

int far *a, far *b;
if ((a = farmalloc(20000L*sizeof(int))) ... etc.
The library function farmalloc() returns a far pointer, and everything now works fine. In particular, I can continue to refer to these allocations as arrays, for example

a[n] = b[n] + 1;
which of course means the same as

*(a+n) = *(b+n) + 1;
A disadvantage of using farmalloc() is that allocations may fail if Turbo C is loaded into memory when you run the program. (The complier seems to occupy the area of memory where farmalloc() goes looking.) In that case, execution will have to be deferred until you exit Turbo C. Standalone execution forfeits Turbo C's useful run-time debugging features.

Global Data

Global data (that is, data elements declared outside of and accessible to all functions in a program) seems to be restricted to 64K in all Turbo C's memory models - even huge, which, according to the documentation, lifts restrictions on static data. Yet it is often useful for a large array to be accessible throughout a program.

Again, the solution is to allocate the large array(s) dynamically in main() as above, and keep only pointers to the data in the global area. For example, if you need a number of arrays with dimension 8,000, rather than depleting the global data area by a multiple of 8,000 bytes per array, store only the pointers (at two or four bytes apiece) in the global area. The memory allocated for the arrays will be somewhere else. See Listing 2. (The error returns from malloc() and farmalloc() are not checked in Listing 2 but should be in an actual program.) With 40,000 bytes allocated beforehand, the last array cannot possibly fit within the same 64K as the other arrays, so it needs to use farmalloc() and the modifier far (or huge).

Global items, of course, make the programmer honor-bound to use names and types consistently in all functions. Unintentional conflicts are difficult to catch. Furthermore a function that uses global data can't be lifted out and used in another program without rework. A language such as BASIC has no other choice, but C does. As an alternative to using global data, you can pass pointers as arguments to those functions that need them. Although incurring some additional overhead, this method is often regarded as "cleaner" since the function is then self-contained.

Scope Of Modifiers In Declarations

In the declaration

extern int i,j,k;
all three variables are understood to be both external and integers. However, in a Turbo C declaration

int far *p,*q,*r;
only p is a far pointer. q and r (if compiled in a small data model) have the default attribute near. If q and r are supposed to be far pointers to int as well, you must use

int far *p, far *q, far *r;
as you've already done in previous examples, or use a typedef:

typedef int far * fpi; 
fpi p,q,r;
This might seem to be a quirk of the non-standard modifiers near, far, and huge, but there is a precedent in Standard C. The indirection symbol * also binds to the variable. [So do the type qualifiers const and volatile. -pjp] In the declaration

int * p,q,r;
p is a pointer to int, while q and r are just ints.

In general, the scope of the elements in a declaration (that is, to how much of the complete declaration they apply) does not seem to be well documented in C - the documentation, if it exists, is not obvious. I couldn't find this topic specifically addressed in K&R, in Standard C, or in Turbo C's documentation. It's clear what syntax is legal, but it's not immediately clear just what the construction means, until you try it in a program. Lack of documentation is unfortunate, but the dialect resulting from having to use non-standard features (the modifiers) is the real culprit.

Arrays Greater Than 64K

Allocating an array of greater than 64K bytes using farmalloc() is easy: allocate the memory using a long integer expression (for example 100000L), assign the return value (the starting address) to a pointer, and use the pointer as an array name.

What kind of pointer? Clearly not a near pointer, which can only address within a 64K segment. It is not a far pointer, either. While far pointers are capable of addressing anywhere in memory, far pointer arithmetic does not work beyond 64K. Suppose you have a large array and pointers to it named a and b, respectively:

char far *a, far *b;
long count = 100000L;
a = farmalloc(count);
(Normally you would check for a NULL return indicating allocation failed, of course.) Suppose you try to initialize the array to blanks:

b = a;
while (count- -) *b++ = ' ';
You would find that only the first 65,536 or so characters of array a were initialized properly; the remainder would be untouched. That is because when b has a value of a+65535 and is then incremented, the next value becomes not a+65536 but a+0. In the interest of speed, far pointers are not checked for possible overflow, leading to wraparound and erroneous results such as this. Since Turbo C's documentation is clear on this point, the arithmetic error cannot be considered a bug. Less clear from the documentation is what happens if you use subscripting instead

long i;
for (i=0L; i<count; i++) a[i] = ' ';
Unfortunately, this works exactly the same as the pointer version above. The address of a[65536] is translated to the pointer a+65536, which (not checking for overflow, long arithmetic notwithstanding) becomes just a (or &a[0]); a[65537] becomes a[1], etc. All the elements beyond 65535 are therefore inaccessible. The same problem surfaces earlier with larger data types: a two-byte int array declared as far can't go beyond element 32767, for example.

The numbers above have been quoted as though exactly 65,536 bytes are available. In practice, Turbo C seems to require a few bytes of overhead. The offset portion of the allocated addresses are 8, not 0. This implies that the wraparound takes place earlier: at element 65527 or 32763, for instance.

The solution is to declare such pointers as huge rather than far. Both far and huge pointers are capable of representing the same range of addresses, and both take up the same amount of space (four bytes). The difference is that huge pointers are normalized after every arithmetic operation. Normalization yields correct results, though at some cost in speed. (far and huge pointers and normalization are discussed in Turbo C's documentation and need not be covered in detail here.)

Piecemeal Allocation

Why allocate everything at once? In some applications, it makes sense to allocate memory only as it is required. Perhaps the program can be designed to be flexible. Accumulate data and fill as much memory as is available. Then if eventually an allocation request fails, process what is there and continue on. One might wish to use this approach for a word processor or sort utility, for example.

There's a significant difference between piecemeal allocation and allocating everything at once, however. The single large allocation

a = malloc(10000);
is guaranteed to give the address of a block of 10,000 bytes of memory in a row (unless there is an error return or wraparound, of course). That block can be treated as an array. Alas, there is no such guarantee with individual small allocations. Examine the following piece of code, which allocates memory one byte at a time and stores individual characters as they are read:

char *a,*b;
int count=0;
a = b = malloc(1);
while ((*b = getchar()) != EOF) {
   b = malloc(1);
   count++;
}
So far, this ought to work (again assuming no error returns). Characters are read and stored. This subsequent code, however, would not work:

for (i = 0; i<count; i++)
   putchar (a [i ] );
This code would not work because it is a mistake to assume that the addresses that successive allocations assigned to b are contiguous. You could not blithely refer to the second character obtained by getchar() as a[1], unless somehow you had ascertained that the second value that malloc() assigns to b were indeed larger than the first by exactly sizeof(*a). In Turbo C, using either malloc() or farmalloc(), it definitely isn't.

If what you really want is one large array, piecemeal allocation is not practical.

Pointers To Pointers

Suppose you have an array a declared as a huge pointer:

char huge *a;
a = farmalloc(100000L);
Perhaps it contains a number of strings with variable lengths. If you wanted to refer to one of those strings, you would need another huge pointer to do so:

char huge *aptr;
You might want to refer to all of them as a set, for example to maintain an index:

char huge *aindex[NSTRINGS];
Suppose that the first three elements of aindex refer to strings which begin at a, a+47000L, and a+93333L. With 100,000 elements allocated, and huge pointers, these are all valid addresses. Rather than make this example completely dry and numerical, let's say that the second of these strings is Skipping rivulet and fountain.

Now suppose you want to use a pointer to refer to the various elements of aindex, perhaps just to step through the index or (in a typical application) to sort it. What kind of pointer do you need? The array aindex could be quite small, perhaps even just the three elements previously defined. So in that case, an ordinary near pointer would suffice. The task at hand is to declare that pointer. While a compiler ought to be able to parse such a declaration, it would be confusing to the ordinary reader, since it would contain both the huge and near modifiers. (In a small memory model, only the huge modifier, even though the resulting pointer type is near.) I wouldn't even attempt it without using a typedef:

typedef char huge * hugecp;
hugecp aindex[NSTRINGS];
hugecp near *pindex;
In a small memory model, the near modifier is unnecessary, but why not make things explicit for clarity?

On the other hand, you might need aindex to contain very many elements. Since a huge pointer is four bytes long, any value of NSTRINGS over 16,383 (or so) would mean you have to declare pindex as

hugecp huge *pindex;
In fact, declaring aindex as an array would no longer work, since it would be larger than 64K. You would need to allocate it dynamically:

hugecp huge *aindex;
aindex = farmalloc((long)
        NSTRINGS*sizeof(hugecp));
(Again, better check for an error return in actual practice.) If you used this method of allocating aindex even if its size were small, you would still need to declare pindex huge (or far) just to allow it to point to the correct data segment, since the far heap where aindex is allocated is somewhere different from where automatic variables are allocated.

Let us now look at the resulting pointer pindex. Suppose

pindex = &aindex[1];
What is *pindex? It is the contents of element number 1 of aindex, which is the address a+47000L. Regardless of whether pindex is near, far, or huge, *pindex is a huge pointer, referring to the string "Skipping rivulet and fountain." How about **pindex? It is the contents of address a+47000L, the character 'S'. The value of *(*pindex + 1) would be 'k', etc. The type of **pindex is char. Given all this, you'd think we could just declare pindex as follows:

char **pindex;
but that would (presumably) make both *pindex and pindex the default type of pointer, namely near in the small data models. That wouldn't do, in our example. If you declared

char huge **pindex;
then (presumably) *pindex would be a huge pointer, but pindex would still be near - or is it the other way around? This wouldn't do, either. Besides being unclear, it would probably not have the effect of making both of them huge.

In summary, declarations for pointers to pointers need extra care when large arrays are involved. Though we haven't discussed it, arrays with double subscripts require the same care.

Mixed Pointers

Turbo C allows you to mix near, far, and huge pointers in any of the memory models, with some restrictions noted in the documentation. Mixed pointers are perhaps slightly difficult to keep untangled in one's own code and functions, but with care it can be done. Using the standard library with mixed pointers, however, is much more of a problem.

Some standard functions, such as printf(), have built-in ways to deal with large pointers in a small data model. For example, to print as an address the value of an ordinary (near) pointer in a small data model, you use the format specification %p, but it's possible to print the value of a far (or huge) pointer using %Fp. The options are limited, however, and it's easy to make mistakes. One such mistake is supplying pointers of one size when printf() is expecting pointers of a different size. There is generally no compile-time warning, and the mismatch between format specification and parameter list gives unpredictable results.

Turbo C calls in separate libraries suitable for the different memory models, so that in a large data model, such as compact, the version of strcpy() contained in the library accepts far pointers, which are the default in that model. For most programs, which do not need to mix pointers, this is a boon. The programmer does not need to worry about a different set of standard functions with unique names, let alone write new ones. Most standard functions cannot cope with non-default pointers, however. If you need a capability provided by a standard function, you are forced to program around the limitation, such as by writing a new function or performing an intermediate conversion.

In the huge memory model, far pointers (not huge pointers) are still the default. The standard library in that case also uses far pointers. A function that accepts far pointers will also accept huge pointers. The format is the same, and the value of a huge pointer is just one of many far pointer values that refer to a particular address. However, the limitations of far pointers previously discussed also apply to the internal workings of such functions. The standard library functions may not work properly with large arrays. The function qsort(), for example, which works so beautifully most of the time, fails for arrays larger than 64K. You can provide it with huge pointers as arguments, but you can't make it use huge pointers internally.

Incidentally, there seems to be a small bug in printf() in the huge model. Even though far pointers are the default, an obscure linker error message complains if you try to use %p to print a far pointer's value. Use %Fp and the message goes away.

In other words, be wary of mixing near, far, and huge pointer types, especially when using the standard library. You may end up having to write functions for your particular pointer needs.

Summary

Although my experience with C on an 8086-style microprocessor is limited to Borland's Turbo C, it seems reasonable to assume that other C compilers for segmented memory micros will have similar limitations when large arrays must be used. Arrays whose sizes are large enough to cross a segment boundary require much more understanding and planning on the part of the programmer. Those of us who need to use large arrays in C look forward to the time when improved software or hardware in personal computers makes the size of an array transparent!


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.