Channels ▼
RSS

C/C++

BUILDING SOFTWARE FOR PORTABILITY

Source Code Accompanies This Article. Download It Now.


DEC88: BUILDING SOFTWARE FOR PORTABILITY

Greg Blackham is the director of Unix product development at WordPerfect Corp. He can be reached at 1555 N. Technology Wy., Orem, UT 84057.


Software developers are finding it increasingly more important to make their applications portable across any operating environments as possible. What developers hope for, of course, is that they will eventually be able to write only one highly portable version of software that will run on any computer. There would then be no need to endure the trauma of porting software to other operating systems and hardware.

To achieve maximum portability, software should be designed in a modular way. Code that must be changed for different environments should be isolated in separate modules. These environments should be isolated in separate modules. These environment-specific pieces can then be adapted to work properly for each environment.

Portability problems can be divided into three main categories: operating system issues, architecture issues, and compiler issues. All the issues discussed in this article fall into one or more of these categories. Furthermore, the techniques described here have been used to port WordPerfect 4.2 (which includes approximately 160,000 lines of C source code) to 12 different platforms with dissimilar operating systems and architecture.

Operating System Issues

Different operating systems provide different services to an application. This is a veritable Pandora's box of problems for which there are no easy solutions. In general, the best approach for providing portability among operating systems is to isolate all operating-system-dependent code in separate modules. You can then rewrite or adapt these modules for each environment and can maintain them separately. The vast majority of the application will generally be in modules that ale not operating system dependent. For these modules, you need to maintain only one copy of source.

Every operating system has its own file system. Most file systems are hierarchical but some, such as VM/CMS on the IBM 370, are flat. -Each has different, allowable file-name characters, and each allows a different number of characters in a file name. Some allow applications to assign a type to every file created. The type may mark the file as a formatted text document, a C source file, or an executable program. Developers must take care that the kinds of files an application uses are available in all desired target environments.

On multiuser systems file locking becomes an issue. If several users need to access one file, you must take precautions to prevent one user from overwriting the changes of another. Different operating systems provide very different levels of file locking, so to facilitate porting, you should isolate file-locking code in a separate module.

Each operating system provides a different set of system calls for getting characters from the keyboard and to the screen. Many systems provide ROM-based toolboxes or software library toolboxes such as the Unix curses package. Making screen I/O completely generic results in unacceptable performance on many systems --we have found that the best approach is to isolate screen I/O in one module and customize and optimize that module for each environment.

A related issue concerns character sets. Although ASCII is the most widely accepted standard, some environments still use other character sets, the most notable exception being IBM with EBCDIC. Even if ASCII were universally accepted, it is only a 7-bit standard. International and extended character set standards are under consideration, but the huge installed base of existing equipment supports a variety of nonstandard character sets. Even on variants of the same operating system, such as Unix, no standard character set can be counted on. Applications that make use of intentional, technical, or graphics characters must be written carefully to make sure they don't depend on one character set.

It is a fact of life that on some platforms an application must follow certain user interface standards if it is to be accepted by the user base. This is especially true of machines that provide a mouse-based graphical user interface. An application for the Macintosh, for example, must have a "Mac-like" user interface to be accepted. Similarly, a product for Sun workstations that does not take advantage of the mouse and windowing system is not acceptable. You must give careful consideration to user interface decisions if an application is to be ported to both graphical and text-based environments.

Each operating system has a distinct set of possible error conditions that can be encountered by an application. Many of these extend across all environments. For example, every operating system returns a "file not found" error when an "open" system call fails because the file does not exist. Each environment may also have its own unique set of messages. On multiuser or networking systems, an "open" might also fail because the user does not have rights to access the file, or because the file is in use by another user. Error-handling code must be flexible to handle the various exceptions appropriately.

Architecture Issues

Architecture issues arise from the instruction set available on the specific processor on which the application will run. Each processor has certain idiosyncrasies that you must consider when you write portable code.

Different chips store integer data in different ways. Most Intel chips, for example, store a 16-bit integer with the low-order byte first, followed by the high-order byte. This is commonly referred to as "little endian," because the little end comes first. In contrast, most Motorola chips store the high-order byte before the low-order byte ("big endian"). Still more variations are possible when 16-bit words are stored big endian but the bytes within the words are stored little endian. To assure portability, you should not write code that depends on the order in which integer data is stored.

Nor can you make assumptions about the size of data addresses or code addresses. It is especially dangerous to assume that data addresses are the same size as function addresses or that an address can be saved or typecast into an integer variable.

Most IBM PC compilers allow a variety of memory models to be used. This is because of the segmented architecture of the 8088 line of microprocessors. If one code segment is defined, then a function pointer will be a 1-bit item, whereas it must be a 32-bit object if multiple code segments are defined. Similarly, models are allowed that define one or multiple data segments. Nearly any combination is possible. For example, we had to change certain table-lookup utilities that were used for both code and data addresses when we ported them to an 80286-based environment.

Some architectures require that some types of integer data be stored at even byte addresses. On the Western Electric WE32100, for example, some operations will crash the application if the data is not properly aligned. Still other processors sport better performance if integer data is aligned on even byte boundaries, so compilers force alignment of data. For application programmers, this means it is unsafe to assume that two variables that are defined consecutively will be contiguous in memory.

It is often desirable, if not an absolute necessity, to share data files among environments. For example, word processing documents, spreadsheets, and database files should be stored in a format which is directly loadable on all machines where these applications run. This allows users on many different systems to share information conveniently. The architecture issues mentioned previously make this portability of data files more complicated to achieve. Byte order varies according to processor and architecture, and word alignment may cause identically defined data structures on two machines to be different sizes. Because of these possibilities, it is not wise to write data directly from internal data structures to data files. You must take care to assure that data files are written and read so that the data is correct for all architectures.

Some operating systems allow only certain types of files. Some do not allow ISAM files; others may not allow arbitrary length files and always fill to a full physical disk block. If a file must be portable, select a file structure that is available on all systems to which it will be ported.

Compiler Issues

Several portability issues result from the various implementations of C compilers. Some of these arise from nonstandard but useful extensions, others arise from incorrect or nonstandard implementations, and still others arise in ambiguous areas of the language specification where no "right" answer is available.

The most critical portability issue arising from compiler implementation is the size of data items. In C the default integer data type is int. Unfortunately, the size of this type varies from 2 bytes on most PC compilers to 6 bytes on certain RISC chips. Whether integer items are signed or unsigned is also dependent on the implementation of the compiler, and this can be critical when you are dealing with -bit character sets. You must be careful that the chosen data types are of an appropriate size and type to handle the data for all compilers used.

Sometimes even people who write compilers foul up, and certain implementations simply don't work according to the language specification. On most 80386-based C compilers, for example, the "=, /=, |=, and &= operators do not work with 1-byte variables. Code with these operators compiles to invalid assembly code. in cases like this, you can choose either to wait for "the Fix," which may delay the release of your product, or you can work around the compiler problem by using a different construct.

Another problem occurs when a compiler is implemented in a nonstandard way. As an example, in the C compiler for VAX/VMS, the assumption is made that all data is shareable among multiple users unless the data is explicitly declared unshared. In every other C compiler, the exact opposite is assumed. Work-arounds for this kind of problem can be difficult if you don't maintain separate source code.

More Details.

Minimizing Portability Problems

The following discussion describes the techniques we have used in the C versions to reduce portability problems to a minimum. Of necessity, this discussion is more C language specific than the previous one; however, analogous techniques are possible in nearly any high-level programming language.

The typedef operator available in C allows us to create our own data type names so that, when a data item is defined, we know its exact size. Example 1, this page, shows an excerpt from the file machine.h, which contains our machine-dependent data-type definitions. For example, a UWORD is an unsigned integer quantity that is 2 bytes wide, and a BYTE is a single-byte unsigned integer (used for 8-bit characters). All typedef declarations are in a separate include file that contains the appropriate types for each machine. Each time a port is made to a new machine, the data types must be defined appropriately for the new environment. This technique allows us to know the exact size of every data item defined, regardless of the environment.

Example 1: An excerpt from the file machine.h, which provides information on the exact size of every data item defined, regardless of the environment.

     /*   MACHINE.H                                    */
     /*   MACHINE DEPENDENT DATA TYPE DEFINITIONS      */
     /*   COPYRIGHT (C)  1987                          */
     /*                                                */

     #ifdef    IBM-PC
          typedef unsigned char    BYTE;     /* unsigned,  8 bits     */
          typedef int              WORD;     /* signed,   16 bits     */
          typedef unsigned         UWORD;    /* unsigned, 16 bits     */
          typedef lone             DWORD;    /* signed, 32 bits       */
          typedef unsigned long    UDWORD;   /* unsigned, 32 bits     */
     #endif

     #ifdef    HP
          typedef unsigned char    BYTE;     /* unsigned, 8 bits      */
          typedef short            WORD;     /* signed,  16 bits      */
          typedef unsigned short   UWORD;    /* unsigned, 16 bits     */
          typedef int              DWORD;    /* signed, 32 bits       */
          typedef unsigned int     UDWORD;   /* unsigned, 32 bits     */
     #endif

     #ifdef    ATT6386
          typedef unsigned char    BYTE;     /* unsigned, 8 bits      */
          typedef short            WORD;     /* signed, 16 bits       */
          typedef unsigned short   UWORD;    /* unsigned, 16 bits     */
          typedef int              DWORD;    /* signed, 32 bits       */
          typedef unsigned int     UDWORD;   /* unsigned, 32 bits     */
     #endif

     ...

You should take great care to assure that whenever possible data files can be shared across environments. This requires that such files be read and written in a special way. As mentioned earlier, writing a data structure directly from memory to disk creates a nonportable file because the byte order in integer data will reflect the architecture and also may include "space holder" bytes to ensure word alignment.

To deal with this situation, data files can be created one byte at a time and interpreted one byte at a time when read. If the reading and writing are done in large blocks, disk access can be minimize. Example 2, page 26, illustrates how identical source code on two different processors creates radically different files. Not only does the integer data get written in a different order, but also the files are a different size because one machine aligns on 4-byte boundaries whereas the other aligns on 2-byte boundaries. Example 3, page 26, shows how you might modify the code to produce identical data files on all machines, regardless of byte order within integer data and word alignment.

Example 2: This example program writes a data structure directly to disk

/**********************************************************************
 *                                                                    *
 *   This example program writes a data structure directly to disk    *
 *   Only 5 bytes of data are significant, but more than 5 bytes      *
 *   are written to the file because of word allignment. The          *
 *   component bytes of the integer data are also written in a        *
 *   different order depending on machine architecture.               *
 *********************************************************************/

#include <fcnt1.h>
#include "machine.h"

/*   5 byte structure definition */
/*   word alignment may occur between elements 1 and 2! */

struct {
     BYTE element1;
     UDWORD element2;
} s = ('A',              /* recognizable values for each byte */
        0x08040201);     /* each byte represented by 2 hex digits */

main()
{
     int fd;
     fd = open("data.fil", O_RDWR | O_CREAT,0666);
     write(fd,(char *)&s, sizeof(s));        /*open and write file */
     close(fd);
}

Contents of dat.fil created on HP 9000 Model 350

65,       This is the "A"
0,   This byte is inserted by the compiler for word alignment
8,   Bytes of element2 stored with the highest order byte first
4,
2,
1         Lowest order byte of element 2

Contents of data.fil created on AT&T 6386:

65,       This is the "A"
0,   These 3 bytes are inserted by the compiler for word alignment
0,   The 6386 compiler aligns on 4-byte boundaries
0,
1,   Bytes of element2 stored with the highest order byte first
2,
4,
8              Highest order byte of element2


Example 3: This version of the example program writes the same data structure to disk in a portable way.

/**********************************************************************
*    This new version of the example program writes the same data     *
*    structure to disk in aportable way. It creates a memory image    *
*    of what the disk file should look like, then writes the file     *
*    in one block. When writing the integer data the bytes are        *
*    written starting with the least significant byte.                *
 *********************************************************************/

#include <fcnt1.h>
#include "machine.h"

/*   5 byte structure definition */
/*   word alignment may occur between elements 1 and 2!     */

struct {
     BYTE element1;
     UDWORD element2;    /* 4 byte integer */
} s = { 'A',             /* recognizable values for each byte    */
          0x08040201};   /* each byte represented by 2 hex digits */

main()
{
     int fd,                            /* file descriptor */
          i;                            /* counter variable */
     BYTE array(5);                     /* output buffer    */
     UDWORD tmp;                        /* a 4-byte integer */

     array(0) = s.element1;             /* 1-byte variables are easy */
     tmp = s.element2;                  /* Don't clobber the real data */
     for (i=1; 1<4; i++) {              /* put each byte of the UDWORD */
          array(i) = (BYTE)(0xff & tmp);     /* in separately */
          tmp >>= 8;                    /* get next 8-bits */
     }
     fd = open("data.fil", O_RDWR | O_CREAT.0666);
     write(fd, array,5);                /* open and write file */
     close (fd);
}

Contents of data.fil created on all machines using this code:

65,       This is the "A"
1,   Bytes of element2 stored with the lowest order byte first
2,
4,
8              Lowest order byte of element2

Note that the C compiler takes care of getting the bytes in the correct order, no matter what the internal byte order. The code to read the file reverses the process with a similar "for" loop.


To assure maximum portability, screen output and keyboard input need to be handled in a portable way. A generic library of screen and keyboard functions should be provided. These functions will produce the same result in all environments, but can be optimized to work efficiently on each platform. On machines where it is possible (such as the PC), you should write directly to screen RAM, as this method provides the best possible performance. In environments that use 9,600-baud terminals, you can optimize a package to avoid redundant writes to the screen.

Because no standard character set exists, some internal representation should be chosen for extended/foreign characters. This allows files with these characters to be moved easily between platforms. We chose the IBM PC extended character set to ensure compatibility with our existing PC product, but any character set could be used as long as it is the same on all platforms. The screen/keyboard library handles the translation of the internal representation to the native character set.

Like screen and keyboard functions, file I/O and other operating system calls should be isolated in generic library functions. These can be customized and optimized for individual environments. This technique allows the vast majority of the source code, which does not involve operating system calls, to be environment independent.

As mentioned earlier, a variety of compiler implementations conform in varying degrees to official language specifications. Wherever possible, try to avoid all but the most standard features. In the case of C, the ANSI C standard provides functionality that was not defined in the original Kernighan & Richie specification. Not all target environments have ANSI compilers yet, so for now, avoid using ANSI extensions unless you treat them as environment-specific code and isolate them in separate modules.

Many environments provide programs that point out nonportable constructs. The most widely available tool for C is called lint; it is available on all Unix systems as well as with most PC C compilers. Similar tools are available for other languages and environments.

It is extremely helpful, whenever possible, to develop a portable application on several distinct platforms at the same time. Most of the problems discussed here were learned from sad experience, not from portability textbooks. If you choose several disparate platforms, most portability problems will surface early in development, which often allows you to make adjustments to the implementation before you create a lot of nonportable code.

If you can't afford the luxury of concurrent development on several platforms, try to do the most generic environment first. This varies depending on the programming language you select. In the case of C, the most generic environment is Unix. Almost all other C implementations provide Unix system call libraries. PC compilers, on the other hand, provide many semistandard constructs for segmenting programs and accessing DOS and the ROM BIOS. In my experience C code ports more easily from Unix to other environments than vice versa.

Summing the Parts

Creating applications that port easily to a variety of operating environments, though challenging, is not impossible. The techniques discussed in this article can improve porting productivity dramatically. Porting WordPerfect among Unix platforms may require only days or weeks; other ports may require only a few hours. By improving the portability of your source code, you can provide software products on more machines at a lower cost. Without portable code, many of these ports would be simply too expensive to consider.

A Few More Thoughts on Program Portability

by Bill Fitler

Bill Fitler is a senior staff engineer and heads the X/GEM application team at Digital Research Inc. He can be reached at 70 Garden Ct., Box DR1, Monterey, CA 93942.

Here are several more broad categories of portability problems that an application developer may choose to consider. Usually these decisions revolve around issues that should be resolved as early in the design process as practical, such as which machines and devices will the application work with, what countries will the application sell into, how sensitive will the customers be to application performance, and what other products will the application have to work with.

Device Independence

The subject of portable graphics applications is broad and difficult to handle in a small space. There are several application environment products, however, that make it easier to write graphical applications that can run with a variety of different devices, ranging from simple monochrome screens to higher resolution color devices. One such environment is Digital Research's Graphical Environment Manager (GEM), which runs on a diverse range of target systems, including Motorola-based Atari's, as well as Intel-based IBM PCs and clones.

Internalization

As the article mentioned, appropriate error analysis and reporting is a key to writing professional applications. If you plan to sell the program into markets where your users don't speak English, you need to make sure that your application handles foreign languages. One consideration mentioned was the ability to handle alternate character sets.

You should be aware that Japanese and Far Eastern languages use 16-bit character sets, which means your application should be careful not to assume that all input and output will be limited to 7- or 8-bit characters.

A more important consideration is that the program's messages (including errors) should be easily translated into foreign languages. This works best when all error messages are grouped in a single file and are NULL delimited rather than fixed length.

Performance-Sensitive Optimization

In order to compete effectively in the marketplace, your program must perform well. Modularity plays an important part here, too. Those portions of a product that are called extremely often should be isolated in their own modules. A version of the module can be kept in a higher-level language like C. However, you may want to allow for time in the development of your application to hand-code the performance sensitive modules into assembly code. Thus, you will still be able to quickly port the product to the next environment by using the high level language module, while still getting the performance you need in more competitive environments by using the assembler version. Also, by carefully designing and analyzing your program, you will probably find a relatively small amount of code requiring hand-tuning. Many ports of the Unix operating system, for example, perform well with less than five to ten percent of the operating system handcoded in assembler.

Adhering to Standards

No man is an island and neither is your program (especially in these days of open systems). Programs that use standard operating system calls or data formats wherever possible will port more easily and work well with other pie grams in a variety of environments.

The ANSI C standard run-time library defines a very useful set of I/O interfaces that can be implemented on a wide variety of platforms. Your application should use the standard calls as much as possible. If your application needs more from the operating system interface than is provided by ANSI C, POSIX extends beyond the ANSI C functions to provide a Unix-like operating system interface that will be ported to a number of Unix and non-Unix systems.

For data independence, Unix defines a model of standard input and output and provides the ability to send the standard output of one application into the standard input of another.

Using a standard character set like ASCII rather than a binary data representation facilitates this kind of interaction on Unix and non-Unix systems alike.

There are a variety of other standards that will become increasingly important across a variety of different environments. To represent documents, your application may need to handle the standard generalized markup language (SGML) format, which is specified by the American Association of Publishers. To exchange graphical vector information, you may want to use the graphical metafile format, used by GEM applications, or the computer graphics metafile (CGM) standard format. To handle image and scanned data, you will probably want to handle tagged image file format (IFF) files.

One final note. Writing portable programs is hard and can be very expensive. Knowing how to write portable programs will take you most of the distance, and using application environments and tools like C and GEM will get you even farther. Knowing how much effort to expend to write portably, though, is up to you. -B.F.



Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video