Migrating C Code to Unicode

To compete in the global software market, application software must accommodate any country's locale conventions, culture, and written language. Tim presents strategies and code for migrating existing C source code from ANSI to Unicode, independent of any operating system, compiler, or API.


August 01, 1994
URL:http://www.drdobbs.com/cpp/migrating-c-code-to-unicode/184409287

Figure 1


Copyright © 1994, Dr. Dobb's Journal

One code base, two character-encoding schemes

Timothy D. Nestved

Tim is the principal architect of Unicode and National Language support issues for software applications and third-party libraries. He can be reached at P.O. Box 540754, Orlando, FL 21854-0754 or at [email protected].


To compete in the global software market, application software must provide national language support (NLS), thereby accommodating any country's locale conventions, culture, and written language. Of the three, language support is the most costly and time-consuming to implement. One encoding scheme that supports all modern written languages of the world would be ideal--and that's precisely what Unicode provides. What this means to you is that almost any written language can be supported using Unicode, eliminating the need for specialized algorithms to process various character-encoding schemes, such as double-byte character sets (DBCS). In addition to providing characters and symbols--including East Asian ideographic characters--Unicode also provides abundant space for future expansion.

In this article, I'll describe the process of migrating existing C source code from ANSI to Unicode, independent of any operating system, compiler, or API. This process is based on work I was involved in when migrating Windows NT's built-in tape-backup program (NTBACKUP.EXE) for Conner Software. Because of limited encoding space, the NT WIN32 OEM character set doesn't support all the written languages of the Americas, Europe, Asia, and the Middle East. Consequently, Microsoft adopted Unicode as NT's character-encoding scheme, requiring that programs such as NTBACKUP.EXE also support Unicode. What this ultimately means is that one code base must support two character-encoding schemes: ANSI and Unicode. Both ANSI and Unicode applications can be created from the same source code by simply setting a compilation variable.

Character or Wide Character?

To computers, a character is a unit of encoding with an associated code or value. Therefore, the encoding size of a character is determined by the size of the character's data type, and the data type is determined by the number of unique codes required to represent all the characters for a particular encoding scheme. For ANSI, a character's size is one byte (2^8 = 256 characters); for Unicode it's two bytes (2^16 = 65,536 characters); and for other encoding schemes, anywhere from one to four bytes. A wide character is still a character, but it's a character with an encoding size greater than one byte. In the case of Unicode, a wide character's encoding size is two bytes.

Mapping at the Boundary

Mapping is the process of converting a non-Unicode character-code point (a specific character's code or value) to its Unicode equivalent, and vice versa. A boundary exists wherever mapping is required. Input, output, and displaying (rendering) are examples of mapping boundaries; see Figure 1. Mapping at the boundary is useful for resolving Unicode implementation issues, such as rendering and processing. For example, Novell's NetWare 4.0 uses Unicode because of the benefits of a universal data repertoire. User information for NetWare is entered and rendered using the default code page (cp) for the workstation (cp437 is the United States/English code page). Once entered, the information is mapped to and stored as Unicode in NetWare's new Directory Services component.

By mapping at the boundary, NetWare resolves the issue of displaying Unicode text and providing Unicode I/O routines for workstations that don't support Unicode. That is, information entered using the default code page is mapped to Unicode; before rendering any Unicode data, it is mapped back to the default code page.
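As a rough illustration of mapping at the boundary, the sketch below keeps text in a 16-bit internal form and converts only where data enters or leaves. It handles 7-bit ASCII only. The names boundaryIn() and boundaryOut(), the use of U+FFFD as a placeholder, and the '?' replacement character are assumptions made for this example; they are not NetWare's routines.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short UNICHAR;   /* 16-bit internal text unit (assumed) */

/* Input boundary: widen code-page text into the internal Unicode form.
   Only 7-bit ASCII is handled; other bytes become U+FFFD as a placeholder.
   dst must have room for one element per source byte plus a terminator. */
void boundaryIn(const char *src, UNICHAR *dst)
{
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; src++) {
        unsigned char ch = (unsigned char)*src;
        *dst++ = (ch < 0x80) ? (UNICHAR)ch : (UNICHAR)0xFFFD;
    }
    *dst = 0;
}

/* Output boundary: narrow internal text back to the code page for rendering.
   Characters with no 7-bit equivalent are replaced with '?'. */
void boundaryOut(const UNICHAR *src, char *dst)
{
    assert(src != NULL && dst != NULL);
    for (; *src != 0; src++)
        *dst++ = (*src < 0x80) ? (char)*src : '?';
    *dst = '\0';
}
```

Everything between the two calls can then work on UNICHAR data without knowing which code page the workstation uses.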

The following steps explain how to migrate existing code to Unicode without creating separate code bases. The migration process is divided into two stages: Stage one involves header-file modifications, while stage two consists of source-code modifications. For this article, all ANSI examples use cp437 and precomposed characters. That is, the letter 'ü' is a single character and not composed of multiple characters such as 'u+¨'.

Stage One: Header-File Modifications

Stage one consists entirely of header-file modifications. Proper migration allows for both ANSI and Unicode support using the same source-code base by defining generic text-data types, macros, and prototypes in header files for use in the source code. Define the symbolic constant UNICODE on the compiler's command line (for example, cl /DUNICODE for Microsoft C) to force a Unicode compilation. If the header files are properly constructed, the compilation mode is completely transparent when viewing the source code.

Step 1. The first step in stage one is to create explicit data types. Explicit text-data types are typedefed as CHAR (one byte) and UNICHAR (two bytes) for ANSI and Unicode, respectively. For completeness, explicit pointer data types P_CHAR and P_UNICHAR are also typedefed. (You may use the standard C data type wchar_t in place of unsigned short for UNICHAR, but verify that wchar_t's type is properly defined by your compiler. Earlier versions of some compilers defined wchar_t to be of type unsigned char.)

Step 2. The second step involves the creation of generic text-data types TCHAR and P_TCHAR, using the explicit typedefs created in Step 1. To support both ANSI and Unicode in a single code base, generic text-data types are essential. The compile-time identifier UNICODE determines the explicit data type to typedef as TCHAR and P_TCHAR (for example, CHAR for ANSI or UNICHAR for Unicode). If UNICODE is not defined, TCHAR is typedefed as CHAR; if UNICODE is defined, TCHAR is typedefed as UNICHAR. Converting explicit data types into generic types at compile time makes support for multiple fixed-width encoding schemes transparent and easy to configure.
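To make the generic types concrete, here is a minimal sketch of a routine written once against TCHAR. The typedefs repeat those described in Steps 1 and 2; countChar() is a hypothetical helper invented for this example and does not appear in the listings.

```c
#include <assert.h>
#include <stddef.h>

typedef char            CHAR;
typedef unsigned short  UNICHAR;

#if defined(UNICODE)
typedef UNICHAR TCHAR;            /* generic type is really Unicode */
#else
typedef CHAR    TCHAR;            /* generic type is really ANSI */
#endif
typedef TCHAR * P_TCHAR;

/* Hypothetical generic routine: counts occurrences of ch in a
   zero-terminated generic string. The same source compiles in both modes. */
int countChar(const TCHAR *pStr, TCHAR ch)
{
    int n = 0;
    assert(pStr != NULL);
    for (; *pStr != 0; pStr++)
        if (*pStr == ch)
            n++;
    return n;
}
```

Because the loop advances by one TCHAR per step, it needs no change when UNICODE is defined.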

Use an explicit data type whenever a specific text-data type is required, independent of the compilation mode. For example, mapUnic2Ansi() maps Unicode characters to ANSI using two explicit parameters independent of the compilation mode: the Unicode source parameter type P_UNICHAR and the ANSI destination parameter type P_CHAR. If the parameter P_CHAR is defined as any other type, a data-type inconsistency is generated. On the other hand, encryptString() accepts a string using the generic-parameter data type P_TCHAR, because the string type is based on the compilation mode. (Listing One includes the function prototypes mapUnic2Ansi() and encryptString().)

It is important not to confuse or interchange byte data (data buffers) with text or character data. Declare nontext data (or byte data) using the byte data types BYTE and P_BYTE. As a rule, byte data should never be processed by text routines. If strcpy() is used on byte data, replace it with either memcpy() or memmove(). The proper use of data types greatly enhances source-code readability and eliminates potential semantic errors.
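The difference matters as soon as byte data contains a zero. The following sketch, in which copyRecord() is a hypothetical name used only for illustration, copies a fixed-size record with memcpy() so that embedded zero bytes are preserved.

```c
#include <assert.h>
#include <string.h>

typedef unsigned char BYTE;

/* Copy a fixed-size record of byte data. memcpy() copies exactly 'count'
   bytes; a text routine such as strcpy() would stop at the first zero byte. */
void copyRecord(BYTE *pDst, const BYTE *pSrc, size_t count)
{
    assert(pDst != NULL && pSrc != NULL);
    memcpy(pDst, pSrc, count);
}
```

A record such as { 0x12, 0x00, 0x34, 0x56 } looks like a one-character string to strlen(), so strcpy() would silently drop the last two bytes.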

Step 3. Next, a text macro is created to convert character constants and string literals into their proper types based on the compilation mode. Most C compilers use the syntax L'c' for a wide-text character and L"string" for a wide-text string. The text macro TEXT() substitutes the proper generic literal constant syntax during compilation (for instance, TEXT("string") becomes "string" for ANSI and L"string" for Unicode). Remember, use explicit literal constants whenever the text-data type is independent of the compilation mode.
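A small sketch of the text macro in use follows. It assumes that the compiler's wide literals have type wchar_t and therefore uses wchar_t as the Unicode TCHAR here, rather than the unsigned short shown in Listing One; on some compilers wchar_t is wider than two bytes. textLength() is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

#if defined(UNICODE)
typedef wchar_t TCHAR;                 /* assumption: matches L"..." literals */
#   define TEXT(literal)  L##literal   /* 'c' -> L'c', "str" -> L"str" */
#else
typedef char TCHAR;
#   define TEXT(literal)  literal      /* literal is left unchanged */
#endif

/* Hypothetical helper: number of characters in a generic string. */
size_t textLength(const TCHAR *pStr)
{
    size_t n = 0;
    assert(pStr != NULL);
    while (pStr[n] != TEXT('\0'))
        n++;
    return n;
}
```

A call such as textLength( TEXT("hello") ) yields 5 in either compilation mode, while the storage for the literal grows with sizeof(TCHAR).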

Step 4. The final step in stage one is to create explicit and generic function prototypes. Each text function requires an ANSI and Unicode explicit prototype and a generic prototype. The explicit function-naming convention ends ANSI function names with an A and Unicode function names with a U (Microsoft uses W for wide). For example, strlenA() is ANSI and strlenU() is Unicode. The generic function-naming convention replaces str with txt (Microsoft uses wcs for wide-character string), so strlen() would become txtlen(), strncpy() would become txtncpy(), and so on. Using the string-length example again, if UNICODE is not defined, txtlen() is defined as strlenA(); if UNICODE is defined, txtlen() is defined as strlenU(). strlenA() is a redefinition of the standard-library function strlen(). (See Listing One for examples of defining ANSI, Unicode, and generic prototypes for several string functions.) All generic prototypes are resolved and defined at compile time.
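The naming convention can be sketched for the string-length case as follows. For brevity, strlenU() is written directly as a function here instead of being mapped to ucslen() as in Listing One, and it returns size_t rather than int; both choices are assumptions of this example.

```c
#include <assert.h>
#include <string.h>

typedef unsigned short UNICHAR;

/* Explicit Unicode routine: length in characters, not bytes. */
size_t strlenU(const UNICHAR *pStr)
{
    const UNICHAR *p = pStr;
    assert(pStr != NULL);
    while (*p != 0)
        p++;
    return (size_t)(p - pStr);
}

#define strlenA strlen            /* explicit ANSI name for the library call */

#if defined(UNICODE)
typedef UNICHAR TCHAR;
#   define txtlen strlenU         /* generic name (really Unicode) */
#else
typedef char TCHAR;
#   define txtlen strlenA         /* generic name (really ANSI) */
#endif
```

Source code then calls txtlen() on TCHAR strings, and uses strlenA() or strlenU() only where the string type is fixed regardless of the compilation mode.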

When compiling for ANSI, do not inadvertently create duplicate standard C-library functions. Duplicates are generally created by improperly defining the data types for function declarations that have standard library equivalents. For example, suppose a Unicode string-length routine uniStrLen(P_TCHAR pStr) is created with one text-string parameter using the generic data type P_TCHAR. When compiling for ANSI, P_TCHAR becomes P_CHAR; therefore, uniStrLen() and the standard-library string-length function strlen() are duplicates, although their specific characteristics may differ slightly. To avoid creating duplicate functions, never use generic data types for explicit functions and don't create generic prototypes using explicit data types. I also recommend that you redefine all standard-library text functions that you use with the ANSI-explicit function names (that is, strlenA()). This should eliminate all standard-library text function names from your source code. Again, this is for readability and completeness.

Stage Two: Source-Code Modifications

Stage two involves modifications to the source code and represents the major effort in the migration process. The conversion of data types, function names, and text-pointer arithmetic--plus the creation of special routines--is the major task of stage two. Avoid the temptation of adding conditional compilation directives for Unicode (such as #ifdef UNICODE) in your source code. The creation of additional macros, such as the TEXT() macro, may be helpful in eliminating unnecessary #ifdefs. (Listing Seven shows how the TEXT() macro eliminates conditional directives.)

(Asmus Freytag, Microsoft's Unicode-implementation architect, provided some conversion statistics for migrating over one million lines of Windows code to support Unicode. He found that approximately 10 percent of the code could be modified using a search-and-replace mechanism, 5 percent required an inspection of the code's intent before modifications could be made, and less than 1 percent required changes in algorithms. Freytag's statistics should be a good sample population for determining the amount of work your application may require.)

Step 1. The first step in stage two is to compile all of your source code and save the error/warning messages generated by your compiler. The errors and warnings are a result of changing the function prototypes in stage one, and they assist in completing the tasks in stage two. Compiling the source code also validates that your header files are free of any syntax errors. Most modifications required in stage two are identified using the compiler's error/warning reports.

Step 2. Next, all text function definitions and function calls in the source files must be updated with the appropriate explicit/generic function name, return data type, and parameter data types. Use the compiler's error/warning reports and the function prototypes declared during the function-prototype step in stage one to complete this step. Remember, duplicate routines may be created if text function definitions are not converted properly.

Step 3. Once a function's definition is updated, various data types within the function's body may require modification. For example, local variables may need to be modified because the function's parameter-data types have changed. Another possibility is that the return value for the routine has changed, or maybe function calls within the body of a routine require different return or parameter types. Regardless of the type of changes, function-body modifications can be made later or when a function's definition is modified. I recommend that you first make all the function-definition modifications, then recompile the source code, saving the error/warning messages that are generated. The new error/warning report will include the function body modifications that resulted from modifying the function definitions.

Step 4. Another important aspect of migrating ANSI-based source code to Unicode is pointer arithmetic. Whenever the number of actual bytes in a text string is required, simple pointer arithmetic is no longer valid. For an ANSI string, every character is one byte, so pStrPos2 - pStrPos1 is the number of bytes from pStrPos1 to pStrPos2. Since Unicode uses two bytes per character, the same subtraction on Unicode pointers yields a count of UNICHAR elements (characters), not the actual number of bytes. Pointer arithmetic may therefore need minor modifications, and each case must be inspected individually; avoid using a search-and-replace mechanism for these changes.

If two generic text variables (of type P_TCHAR) contain valid addresses within a string, the total number of bytes (byte count) is calculated using the formula ((existing calculation)*sizeof(TCHAR)). Likewise, to determine the actual number of characters given a byte count, use the formula ((existing calculation)/sizeof(TCHAR)). It is important that you never have an odd byte count when calculating the number of Unicode characters. By using sizeof(TCHAR) in the calculations, the generic text arithmetic instructions are correct, regardless of the compilation mode. It may be useful to create two new generic macros, CALC_BYTE2CHARS(exp) and CALC_CHAR2BYTES(exp), for readability and consistency (both are defined in Listing One).
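The two formulas can be sketched as follows. The macros match those in Listing One; spanBytes() and bytesToChars() are hypothetical wrappers added for this example, and the assertion in bytesToChars() is one possible way to catch an odd byte count in a Unicode build.

```c
#include <assert.h>
#include <stddef.h>

#if defined(UNICODE)
typedef unsigned short TCHAR;
#else
typedef char TCHAR;
#endif

#define CALC_CHAR2BYTES(exp) ( (exp) * sizeof(TCHAR) )  /* n chars -> n bytes */
#define CALC_BYTE2CHARS(exp) ( (exp) / sizeof(TCHAR) )  /* n bytes -> n chars */

/* Bytes occupied by the text between two pointers into the same string. */
size_t spanBytes(const TCHAR *pStart, const TCHAR *pEnd)
{
    assert(pStart != NULL && pEnd != NULL && pEnd >= pStart);
    return CALC_CHAR2BYTES((size_t)(pEnd - pStart));
}

/* Characters held in byteCount bytes; an odd count in a Unicode build
   would indicate a truncated character. */
size_t bytesToChars(size_t byteCount)
{
    assert(byteCount % sizeof(TCHAR) == 0);
    return CALC_BYTE2CHARS(byteCount);
}
```

For the seven-character token "A Token", spanBytes() gives 7 in an ANSI build and 14 in a Unicode build, while bytesToChars() of that result gives 7 in both.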

Step 5. The final step in stage two is the creation of any mapping routines that may be required at I/O and rendering boundaries. These routines are generally new code and deal strictly with converting Unicode to and from various code pages. The mapAnsi2Unic() function (Listing Two) is a basic algorithm for mapping ANSI cp437 data to Unicode.
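Listing One declares mapUnic2Ansi() but the listings do not define it. The sketch below shows one way the reverse mapping might look. It is not the author's implementation: the body, the return convention (number of replaced characters), and the const qualifiers are assumptions, and the lookup table is only an excerpt covering cp437 values 0x80 through 0x87.

```c
#include <assert.h>
#include <stddef.h>

typedef char            CHAR;
typedef unsigned short  UNICHAR;

#define ANSIEXTCHBASE 0x80

/* Excerpt only: Unicode values for cp437 0x80..0x87, as in Listing Two. */
static const UNICHAR extChExcerpt[] = {
    0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7
};

/* Map a Unicode string to cp437. Characters with no entry in the excerpt
   are replaced with rplCh. Returns the number of replacements made
   (an assumed convention). pAnsiStr needs room for one CHAR per source
   character plus the terminator. */
int mapUnic2Ansi(const UNICHAR *pUniStr, CHAR *pAnsiStr, CHAR rplCh)
{
    int replaced = 0;
    assert(pUniStr != NULL && pAnsiStr != NULL);
    for (; *pUniStr != 0; pUniStr++) {
        UNICHAR uc = *pUniStr;
        if (uc < ANSIEXTCHBASE) {
            *pAnsiStr++ = (CHAR)uc;              /* ASCII: a cast is enough */
        } else {
            size_t i;
            size_t count = sizeof extChExcerpt / sizeof extChExcerpt[0];
            for (i = 0; i < count && extChExcerpt[i] != uc; i++)
                ;                                /* linear reverse lookup */
            if (i < count) {
                *pAnsiStr++ = (CHAR)(unsigned char)(ANSIEXTCHBASE + i);
            } else {
                *pAnsiStr++ = rplCh;             /* no equivalent found */
                replaced++;
            }
        }
    }
    *pAnsiStr = '\0';
    return replaced;
}
```

With the full 128-entry table from Listing Two in place of the excerpt, the same loop would cover every cp437 extended character; a production routine would likely replace the linear search with a sorted table or hash.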

To demonstrate all of the modifications described in this article, I created a before (Listing Three) and after (Listing Four) set of listings. The code was created simply to give as many examples as possible. The source file UCS.C (Listing Two) contains several Unicode string functions and a basic mapping function to map ANSI cp437 data to Unicode. Because Unicode's first 128 characters (0x00 through 0x7F) are the same as ASCII, the mapping of standard ASCII characters is simply a data-type cast from CHAR to UNICHAR. The extended ASCII characters--those from the base value 0x80 (ANSIEXTCHBASE) to 0xFF--have different Unicode code values and require more than a simple data-type cast. Therefore, the function mapExtCh2Unic() contains an array extChArray[] for mapping ANSI-extended characters to Unicode. The ANSI-to-Unicode values were taken from the "Latin PC Code Page Mappings" table supplied with the Unicode standard.

Listing Three provides source code before migration, and Listing Four shows the same code after migration. For completeness, Listing Five displays the ANSI output, and Listing Six displays the Unicode output. You will notice the Unicode output displays an empty string for the token. This is because the string is Unicode and must be converted to ANSI for the standard C-library functions to work properly. This is an example of where a mapping-at-the-boundary function is required. The token's byte count of 14 is correct because a Unicode version of string length was used to calculate its length in characters (7), which was then scaled by the character size. I also added an explicit Unicode function to Listing Four named ExplicitUnicodeText().

Conclusion

ANSI-based applications can be migrated to Unicode following the guidelines presented here. Once the migration is complete, you should be able to create both ANSI- and Unicode-based applications from the same source code. Initially, a Unicode application doesn't have to read, write, or render Unicode data. Routines that map at the boundary may be added to supplement any of the I/O and rendering shortcomings associated with implementing Unicode on operating systems that do not completely support Unicode--yet. Similarly, third-party libraries may also be used with Unicode-compliant applications if mapping routines are used.

Figure 1 Application mapping boundaries for varying character sets.

Listing One

/** Name:   UNICODE.h
    Desc:   Contains both explicit/generic data types, macros, and function
            prototypes. Stage 1 modifications. ANSI compilation:    default
            UNICODE compilation: use  /DUNICODE. Hungarian notation is not used
**/
#ifndef _unicode_h_
#define _unicode_h_
/*  data type definitions */
/** CHAR and BYTE may already be #define'd or typedef'd in the compiler's 
standard include files, so a conditional check may need to be added to avoid 
redefinitions, errors and warnings. **/
// explicit types
typedef char            CHAR ;      // standard char 
typedef CHAR *          P_CHAR ;
typedef unsigned short  UNICHAR ;   // UNICODE explicit data types
typedef UNICHAR *       P_UNICHAR ;
typedef unsigned char   BYTE ;      // data buffer data types
typedef BYTE *          P_BYTE ;
// text character generic types 
#if defined( UNICODE )
   typedef UNICHAR      TCHAR ;     // generic data types (really Unicode)
   typedef TCHAR *      P_TCHAR ;
#else
   typedef CHAR         TCHAR ;     // generic data types (really ANSI)
   typedef TCHAR *      P_TCHAR ;
#endif
/* macros */
#if defined( UNICODE )
#   define TEXT(literal)   L##literal   // wide literal constant L'c' L"str"
#else
#   define TEXT(literal)   literal      // literal constant 'c' "str"
#endif
#define CALC_CHAR2BYTES(exp) ( (exp) * sizeof(TCHAR) ) // n chars -> n bytes 
#define CALC_BYTE2CHARS(exp) ( (exp) / sizeof(TCHAR) ) // n bytes -> n chars
/* function prototypes */
#if defined( UNICODE )
void        mapAnsi2Unic( P_CHAR pAnsiStr, P_UNICHAR pUnicStr ) ;
P_UNICHAR   ucschr( P_UNICHAR pUCS, UNICHAR token ) ;
P_UNICHAR   ucscpy( P_UNICHAR pDst, P_UNICHAR pSrc ) ;
int         ucslen( P_UNICHAR pUCS ) ;
UNICHAR     mapExtCh2Unic( CHAR ansiChar ) ;
#endif
/** The function prototypes should be placed in the appropriately named header
file. Always define the ANSI and Unicode explicit set first, then the generic
set using the explicit names previously defined. **/
#define strcmpA      strcmp   // ANSI explicit string compare
#define strcpyA      strcpy   // ANSI explicit string copy
#define strlenA      strlen   // ANSI explicit string length
#define strchrA      strchr   // ANSI explicit
#define strcmpU      ucscmp   // UNICODE explicit string compare
#define strcpyU      ucscpy   // UNICODE explicit string copy
#define strlenU      ucslen   // UNICODE explicit string length
#define strchrU      ucschr   // UNICODE explicit
#if defined( UNICODE )
#   define txtcmp    strcmpU   // generic string compare (really UNICODE)
#   define txtcpy    strcpyU   // generic string copy    (really UNICODE)
#   define txtlen    strlenU   // generic string length  (really UNICODE)
#   define txtchr    strchrU   // generic string         (really UNICODE)
#else
#   define txtcmp    strcmpA   // generic string compare (really ANSI)
#   define txtcpy    strcpyA   // generic string copy    (really ANSI)
#   define txtlen    strlenA   // generic string length  (really ANSI)
#   define txtchr    strchrA   // generic string         (really ANSI)
#endif
/** Prototype parameter examples. It is important not to declare functions, 
parameters and return values incorrectly.  Follow these two basic rules:
1) Never use generic data types for an explicit prototype
2) Avoid creating a generic prototype using explicit data types **/
#if defined( INCORRECT )
int     mapUnic2Ansi( P_TCHAR pUniStr, P_TCHAR pAnsiStr, TCHAR rplCh ) ;
P_CHAR  encryptString( P_CHAR pStr ) ;
int     uniStrLen( P_TCHAR pUniStr ) ;
#else
int     mapUnic2Ansi( P_UNICHAR pUniStr, P_CHAR pAnsiStr, CHAR rplCh ) ;
P_TCHAR encryptString( P_TCHAR pStr ) ;
int     uniStrLen( P_UNICHAR pUniStr ) ;
#endif
#endif   // end include file

Listing Two

/** Name:   UCS.c
    Desc:   Unicode character string functions. Hungarian 
notation is not used.
**/
#if defined( UNICODE )
#include <stdio.h>
#include "unicode.h"
#define ANSIEXTCHBASE   0x80      // start of ANSI extended characters
// mapExtCh2Unic() is already declared in unicode.h; redeclaring it
// static here would conflict with that prototype
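P_UNICHAR
ucschr( P_UNICHAR pUCS, UNICHAR token )
{
P_UNICHAR pStr = pUCS ;
   // stop at the token or at the terminator, whichever comes first
   for ( ; *pStr != token && *pStr != 0; pStr++ ) ;
   // like strchr(): NULL when the token is not in the string
   return( ( *pStr == token ) ? pStr : NULL ) ;
}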
P_UNICHAR
ucscpy( P_UNICHAR pDst, P_UNICHAR pSrc )
{
P_UNICHAR pStr = pDst ;
   while ( *pDst++ = *pSrc++ ) ;

   return( pStr ) ;
}
int
ucslen( P_UNICHAR pUCS )
{
P_UNICHAR pStr = pUCS ;
   for ( ; *pStr; pStr++ ) ;
   return( (int)( pStr - pUCS ) ) ;
}
void
mapAnsi2Unic( P_CHAR pAnsiStr, P_UNICHAR pUnicStr )
{
P_CHAR    pSrc = pAnsiStr ;
P_UNICHAR pDst = pUnicStr ;
   for ( ; *pSrc; pSrc++ )
   {
      // compare as unsigned: plain char may be signed, and 0x80 is extended too
      *pDst++ = (UNICHAR)( ( (BYTE)*pSrc >= ANSIEXTCHBASE ) ?
                mapExtCh2Unic( *pSrc ) : (BYTE)*pSrc ) ;
   }
   *pDst = L'\0' ;      // NULL terminate the Unicode string
}
UNICHAR
mapExtCh2Unic( CHAR ansiChar )
{
static UNICHAR extChArray[] = {
     /* 80 */  0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,
               0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,
     /* 90 */  0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,
               0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,
     /* A0 */  0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,
               0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB,
     /* B0 */  0x2591, 0x2592, 0x2593, 0x2502, 0x2524, 0x2561, 0x2562, 0x2556,
               0x2555, 0x2563, 0x2551, 0x2557, 0x255D, 0x255C, 0x255B, 0x2510,
     /* C0 */  0x2514, 0x2534, 0x252C, 0x251C, 0x2500, 0x253C, 0x255E, 0x255F,
               0x255A, 0x2554, 0x2569, 0x2566, 0x2560, 0x2550, 0x256C, 0x2567,
     /* D0 */  0x2568, 0x2564, 0x2565, 0x2559, 0x2558, 0x2552, 0x2553, 0x256B,
               0x256A, 0x2518, 0x250C, 0x2588, 0x2584, 0x258C, 0x2590, 0x2580,
     /* E0 */  0x03B1, 0x00DF, 0x0393, 0x03C0, 0x03A3, 0x03C3, 0x00B5, 0x03C4,
               0x03A6, 0x0398, 0x03A9, 0x03B4, 0x221E, 0x03C6, 0x03B5, 0x2229,
     /* F0 */  0x2261, 0x00B1, 0x2265, 0x2264, 0x2320, 0x2321, 0x00F7, 0x2248,
               0x00B0, 0x2219, 0x00B7, 0x221A, 0x207F, 0x00B2, 0x25A0, 0x00A0
} ;
   // index as unsigned so that a signed char cannot produce a negative offset
   return( extChArray[ (BYTE)ansiChar - ANSIEXTCHBASE ] ) ;
}
#endif   // UNICODE

Listing Three

/** Name:   BEFORE.c   (convert)
    Desc:   Shows functions before migration to allow you to 
            apply the steps discussed in the article and compare 
            your results. The functions are strictly examples. 
            Hungarian notation is not used.
**/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "unicode.h"
P_CHAR   GenericText( P_CHAR text, int *bytes ) ;
P_CHAR   ExplicitText( P_CHAR pStr ) ;
P_CHAR   DataStream( P_CHAR pDst, P_CHAR pSrc, int count ) ;
P_CHAR
GenericText( P_CHAR text, int *bytes )
{
P_CHAR   pStr1 ;
P_CHAR   pStr2 ;
   pStr1  = strchr( text, 'A' ) ;   // locate the token char
   pStr2  = strchr( pStr1, '@' ) ;   // locate the delimiting char
   *pStr2 = '\0' ;                  // terminate string at delimiter
   *bytes = pStr2 - pStr1 ;         // calc number of bytes
                                    // number chars == number of bytes
   return( pStr1 ) ;                // return start of token found
}
P_CHAR
ExplicitText( P_CHAR pStr )
{
P_CHAR   pAnsiStr ;
   // search a string that is explicitly an ANSI string
   pAnsiStr  = strchr( pStr, 'W' ) ;
   *pAnsiStr = '\0' ;
   return( pStr ) ;
}
P_CHAR
DataStream( P_CHAR pDst, P_CHAR pSrc, int count )
{
   strncpy( pDst, pSrc, count ) ; // strncpy used purely as an example
   *( pDst + 13 ) = 0 ;          // just to randomly truncate the stream
   return( pDst ) ;
}
int main( void )
{
CHAR      text[ 78 ] ;      // text string - conversion
P_CHAR   pToken ;
P_CHAR   pStr1 ;
CHAR      specialThanksTo[] = "Dawn Woods for editing my article" ;
P_CHAR   pStr2 ;
CHAR      data[ 78 ] = "Imagine this is byte data and not text!" ;
CHAR      temp[ 78 ] ;      // temp data stream buffer
P_CHAR   pStr3 ;
int      nBytes ;
   strcpy( text, "Text with A Token@ in the stream." ) ;
   /* text string that should be converted to a generic string */
   pStr1  = GenericText( text, &nBytes ) ; // get a token from the string
   pToken = (P_CHAR)malloc( nBytes + 1 ) ; // alloc space for token only
   strcpy( pToken, pStr1 ) ;               // cpy str to alloc'd space
   printf( "Token '%s' (A Token)\n", pToken ) ;
   printf( "Bytes: %d\nCharacters: %d\n\n",
            (int)strlen( pToken ), (int)strlen( pToken ) ) ;
   free( pToken ) ;
   /* explicit text string that should not be generic */
   pStr2 = ExplicitText( specialThanksTo ) ;
   printf( "Thanks '%s'\nBytes: %d\nCharacters: %d\n\n",
            pStr2, (int)strlen( pStr2 ), (int)strlen( pStr2 ) ) ;
   /* processing data in a stream */
   pStr3 = DataStream( temp, data, 78 ) ;
   printf( "Stream '%s'\nCopy   '%s'\n", data, pStr3 ) ;
    return( 0 ) ;
}

Listing Four

/** Name:   AFTER.c   (convert)
   Desc:   Shows functions after migration to allow you to 
           apply the steps discussed in the article and 
           compare your results. The functions are strictly 
           examples.
           Hungarian notation is not used
**/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "unicode.h"
P_TCHAR  GenericText( P_TCHAR text, int *bytes ) ;
P_CHAR   ExplicitText( P_CHAR pStr ) ;
P_BYTE   DataStream( P_BYTE pDst, P_BYTE pSrc, int count ) ;
P_TCHAR
GenericText( P_TCHAR text, int *bytes )
{
P_TCHAR  pStr1 ;
P_TCHAR  pStr2 ;
   pStr1  = txtchr( text, TEXT('A') ) ;      // locate the token char
   pStr2  = txtchr( pStr1, TEXT('@') ) ;     // locate the delimiting char
   *pStr2 = TEXT('\0') ;                     // terminate string at delimiter
   // the pointer difference counts characters; scale it to bytes
   *bytes = (int)CALC_CHAR2BYTES( pStr2 - pStr1 ) ;
   return( pStr1 ) ;                   // return start of token found
}
P_CHAR
ExplicitText( P_CHAR pStr )
{
P_CHAR   pAnsiStr ;
   // search a string that is explicitly an ANSI string
   pAnsiStr  = strchr( pStr, 'W' ) ;
   *pAnsiStr = '\0' ;
   return( pStr ) ;
}
P_UNICHAR
ExplicitUnicodeText( P_UNICHAR pStr )
{
P_UNICHAR   pUnicStr ;  // purely an example, code not called
   // search a string that is explicitly a Unicode string
   pUnicStr  = strchrU( pStr, L'W' ) ;
   *pUnicStr = L'\0' ;
   return( pStr ) ;
}
P_BYTE
DataStream( P_BYTE pDst, P_BYTE pSrc, int count )
{
   memmove( pDst, pSrc, count ) ;
   *( pDst + 13 ) = 0 ;          // just to randomly truncate the stream
   return( pDst ) ;
}
int main( void )
{
TCHAR    text[ 78 ] ;      // text string - conversion
P_TCHAR  pToken ;
P_TCHAR  pStr1 ;
CHAR     specialThanksTo[] = "Dawn Woods for editing my article" ;
P_CHAR   pStr2 ;
BYTE     data[ 78 ] = "Imagine this is byte data and not text!" ;
BYTE     temp[ 78 ] ;      // temp data stream buffer
P_BYTE   pStr3 ;
int      nBytes ;
   txtcpy( text, TEXT("Text with A Token@ in the stream.") ) ;
   /* text string that should be converted to a generic string */
   pStr1  = GenericText( text, &nBytes ) ;   // get a token from the string
   // alloc space for the token plus its terminator, sized in TCHARs
   pToken = (P_TCHAR)malloc( CALC_CHAR2BYTES( txtlen( pStr1 ) + 1 ) ) ;
   txtcpy( pToken, pStr1 ) ;                 // cpy str to alloc'd space
   printf( "Token '%s' (A Token)\n", pToken ) ;  // need mapUnic2Ansi()
   printf( "Bytes: %d\nCharacters: %d\n\n",
            (int)CALC_CHAR2BYTES( txtlen( pToken ) ), (int)txtlen( pToken ) ) ;
   free( pToken ) ;
   /* explicit text string that should not be generic */
   pStr2 = ExplicitText( specialThanksTo ) ;
   printf( "Thanks '%s'\nBytes: %d\nCharacters: %d\n\n",
            pStr2, (int)strlenA( pStr2 ), (int)strlenA( pStr2 ) ) ;
   /* processing data in a stream */
   pStr3 = DataStream( temp, data, 78 ) ;
   printf( "Stream '%s'\nCopy   '%s'\n", data, pStr3 ) ;
   return( 0 ) ;
}

Listing Five

Token 'A Token' (A Token)
Bytes: 7
Characters: 7

Thanks 'Dawn '
Bytes: 5
Characters: 5

Stream 'Imagine this is byte data and not text!'
Copy   'Imagine this '

Listing Six

Token '' (A Token)
Bytes: 14
Characters: 7

Thanks 'Dawn '
Bytes: 5
Characters: 5

Stream 'Imagine this is byte data and not text!'
Copy   'Imagine this '

Listing Seven

/** Name:   TRANSPAR.c
    Desc:  Example of how to use macros (such as the TEXT() macro) to
           eliminate unnecessary conditional directives, potential
           semantic errors and improve readability.
**/
//    Migration is NOT transparent. Requires a conditional directive in source.
void notTransparent( void )
{
      // ...
#if defined( UNICODE )
      ucscpy( text, L"Text with A Token@ in the stream." ) ;
      // ...
#else
      strcpy( text, "Text with A Token@ in the stream." ) ;
      // ...
#endif
      // ...
}
//    Migration is transparent. No conditional directive required
void transparent( void )
{
      // ...
      txtcpy( text, TEXT( "Text with A Token@ in the stream." ) ) ;
      // ...
}