Wide-Character Format String Vulnerabilities


December, 2005: Wide-Character Format String Vulnerabilities

Robert is a senior vulnerability analyst for CERT/CC and author of Secure Coding in C and C++ (Addison-Wesley, 2005). He can be reached at [email protected]

The ISO/IEC C Language Specification (commonly referred to as "C99") defines formatted output functions that operate on wide-character strings, as well as functions that operate on multibyte-character strings. The wide-character formatted output functions are fwprintf(), wprintf(), swprintf(), vfwprintf(), vswprintf(), and vwprintf(). (There is no need for snwprintf() or vsnwprintf() functions because swprintf() and vswprintf() include an output length argument.) These functions correspond functionally to the multibyte-character formatted output functions (that is, the similarly named functions with the "w" removed) except that they work on wide-character strings such as Unicode and not on multibyte-character strings such as ASCII. (A multibyte character is defined by ISO/IEC 9899:1999 as a sequence of one or more bytes representing a member of the extended character set of either the source or the execution environment. ASCII strings are represented as multibyte-character strings, although every ASCII character is represented as a single byte.)
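To make the role of the output length argument concrete, here is a minimal sketch using the C99 form of swprintf(). The helper name format_usage() and the message text are illustrative assumptions, not part of the article's listings.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: formats a usage line into a caller-supplied
 * wide-character buffer of `count` elements.
 * Returns the number of wide characters written (excluding the
 * terminating null), or -1 if the result does not fit. */
int format_usage(wchar_t *dst, size_t count, const wchar_t *name) {
    /* C99 swprintf() takes the buffer size as its second argument and
     * returns a negative value when the output would not fit. */
    int n = swprintf(dst, count, L"Usage: %ls <target>\n", name);
    return (n < 0) ? -1 : n;
}
```

Because the length is always passed, a too-small buffer produces an error return instead of an overflow; the caller still has to check that return value.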

Formatted output functions are susceptible to a class of vulnerabilities known as "format string" vulnerabilities. Format string vulnerabilities can occur when a format string (or a portion of a format string) is supplied by a user or other untrusted source. Listing One, for example, is a common programming idiom, particularly for UNIX command-line programs. The program prints usage information for the command. However, because the executable may be renamed, the actual name of the program entered by users and specified in argv[0] is printed instead of a hardcoded name.
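For contrast, the same idiom can be written so that the program name is never interpreted as a format string. This is a minimal sketch; the function names build_usage() and print_message() are assumptions introduced for illustration.

```c
#include <stdio.h>

/* Builds the usage line into `out`. The program name is passed as an
 * argument to a constant format string, so any '%' it contains is
 * copied literally instead of being interpreted. */
int build_usage(char *out, size_t size, const char *pname) {
    return snprintf(out, size, "Usage: %s <target>\n", pname);
}

/* Prints an already-built message without treating it as a format
 * string: fputs() performs no conversion-specifier processing. */
void print_message(const char *msg) {
    fputs(msg, stdout);
}
```

Calling print_message(usageStr), or equivalently printf("%s", usageStr), keeps the untrusted name in the data position rather than the format position.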

By calling this program using execl(), attackers can specify an arbitrary string as the value of argv[0], as in Listing Two. In this case, the specified string is likely to cause the program to crash, because the printf() function on line 6 of Listing One attempts to read many more arguments off the stack than are actually available. However, the consequences can be much worse. In addition to crashing a program (and possibly causing a denial-of-service attack), attackers can also exploit this vulnerability to view arbitrary memory or execute arbitrary code with the permissions of the vulnerable program. For example, attackers can execute arbitrary code by providing a format string of the form:

address advance-argptr %widthu%n

The address field contains a little-endian encoded address; for example, \xdc\xf5\x42\x01. The advance-argptr string consists of a series of format specifiers designed to advance the internal argument pointer within the formatted output function until it points to the address at the start of the format string. The %n conversion specifier at the end of the string writes out the number of characters output by the formatted output function. The %widthu conversion specifier advances the count to the required value. When processed by the formatted output function, this string writes an attacker-provided value (typically the address of some shellcode) to an attacker-specified address such as the return address on the stack. When the vulnerable function returns, control is transferred to the shellcode instead of the calling function, resulting in execution of arbitrary code with the permissions of the vulnerable program.

A detailed description of format string vulnerabilities and possible exploits with multibyte-character strings is presented in my book Secure Coding in C and C++ (Addison-Wesley, 2005; ISBN 0321335724) and by Scut/Team Teso (see "Exploiting Format String Vulnerabilities," http://www.mindsec.com/files/formatstring-1.2.pdf). In this article, I focus on the vulnerabilities resulting from the incorrect use of wide-character formatted output functions.


Formatted output functions that operate on wide-character strings are also susceptible to format string vulnerabilities. To understand the effect of wide characters on format string vulnerabilities, you must understand the interactions between the program and environment. Here, I examine the mechanisms used to manage these interactions for Windows and Visual C++.

Visual C++ defines a wide-character version of the main() function called wmain() that adheres to the Unicode programming model. Formal parameters to wmain() are declared in a similar manner to main():

int wmain( int argc [, wchar_t *argv[ ] [, wchar_t *envp[ ] ] ] );

The argv and envp parameters to wmain() are arrays of wchar_t * strings. For programs declared using the wmain() function, Windows creates a wide-character environment at program startup that includes wide-character argument strings and, optionally, a wide-character environment pointer to the program. When a program is declared using main(), a multibyte-character environment is created by the operating system at program startup. Typically, a program that uses wide characters internally uses wmain() to generate a wide-character environment, while a program that uses multibyte characters internally uses main() to generate a multibyte-character environment.

When a program specified for a multibyte-character environment calls a wide-character function that interacts with the environment (for example, the _wgetenv() or _wputenv() functions), a wide-character copy of the environment is created from the multibyte environment. Similarly, for a program declared using the wmain() function, a multibyte-character environment is created on the first call to _putenv() or getenv().

When an ASCII environment is converted to Unicode, alternate bytes in the resulting Unicode string are null. This is a result of the standard Unicode representation of ASCII characters. For example, the ASCII representation for a is \x61, while the 16-bit Unicode representation is \x0061, which Windows stores in memory as the bytes \x61 \x00. These alternating null bytes create an interesting obstacle for shellcode writers.
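The alternating pattern is easy to reproduce. The sketch below widens an ASCII string into 16-bit little-endian code units by hand; ascii_to_utf16le() is a hypothetical helper written for this illustration and is not the conversion routine Windows itself uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Widens an ASCII string into UTF-16 little-endian bytes.
 * `out` must hold at least 2 * (length of src + 1) bytes.
 * Returns the number of bytes written, excluding the terminator. */
size_t ascii_to_utf16le(const char *src, uint8_t *out) {
    size_t n = 0;
    for (; *src != '\0'; ++src) {
        out[n++] = (uint8_t)*src;  /* low byte: the ASCII code */
        out[n++] = 0x00;           /* high byte: always zero for ASCII */
    }
    out[n] = 0x00;                 /* two-byte null terminator */
    out[n + 1] = 0x00;
    return n;
}
```

For the input "AB" the output bytes are 41 00 42 00, so every second byte of the converted string is zero.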

Wide-Character Format String Vulnerabilities

Listing Three illustrates how the wide-character formatted output function wprintf() can be exploited by an attacker. This example was developed using Visual C++ .NET with Unicode defined and tested on a Windows 2000 Professional platform with an Intel Pentium 4 processor. Because the program was declared (on line 6) using wmain(), the vulnerable program accepts wide-character strings directly from the environment that have not been explicitly or implicitly converted. The shellcode is declared on line 7 as a series of nop instructions instead of actual malicious code. Unfortunately, examples of malicious shellcode are not hard to locate on the Internet and elsewhere. The wchar_t array format_str is declared as an automatic (stack) variable on line 9. I'll return to the mysterious float variable shortly.

The idea of the sample exploit is to create a wide-character format string such that the execution of this string by the wprintf() function call on line 31 results in the execution of the shellcode. This is typically accomplished by overwriting a return address on the stack with the address of the shellcode. However, any indirect address can also be used for this purpose. The modulo operations in Listing Three ensure that each subsequent write can output a larger value, while the remainder modulo 0x10000 preserves the required low-order 16 bits.

Although most of the conversion specifications interpreted by the formatted output function are used to format output, there is one that writes to memory. The %n conversion specifier writes the number of characters successfully output by the formatted output function (an integer value) to an address passed as an argument to the function. By providing the address of the return address, attackers can trick the formatted output function into overwriting the return address on the stack.
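The behavior of %n can be observed harmlessly with a constant format string. The following sketch assumes a C library that honors %n (some, such as recent Visual C++ runtimes, disable it by default); count_after_field() is a hypothetical name used only here.

```c
#include <stdio.h>

/* Formats `value` right-aligned in a field of `width` characters and
 * uses %n to record how many characters had been produced at that
 * point. `width` is assumed to be small enough to fit in the buffer. */
int count_after_field(unsigned value, int width) {
    char buf[64];
    int count = -1;
    /* The format string is a literal; only the data is variable. */
    snprintf(buf, sizeof buf, "%*u%n", width, value, &count);
    return count;
}
```

With a width of 10 the stored count is 10 regardless of how few digits the value needs, which shows how padding a conversion determines the number that %n writes.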

The number of characters output by the formatted output function can be influenced using the width and precision fields of a conversion specifier. The width field is controlled to output the exact number of characters required to write the address of the shellcode. Because there are some practical limitations to the size of the width and precision fields, the exploit writes out the first word of the address (line 20) followed by the second word (line 27). The first word is written to 0x0012f1e0 and the second word is offset by two bytes at 0x0012f1e2. These addresses are specified as part of the format string on lines 29-30.
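The width arithmetic from lines 15-27 of Listing Three can be isolated and checked on its own. This sketch only restates that calculation; next_width() is a name introduced here, and the minimum of 10 simply mirrors the listing.

```c
/* Returns the field width needed so that the running output count,
 * taken modulo 0x10000, equals `write_word`. As in Listing Three,
 * widths below 10 are bumped by 0x10000. */
unsigned next_width(unsigned already_written, unsigned write_word) {
    /* Unsigned subtraction wraps, so the remainder is the forward
     * distance from the current count to the target 16-bit value. */
    unsigned width = (write_word - already_written) % 0x10000u;
    if (width < 10u) {
        width += 0x10000u;
    }
    return width;
}
```

Starting from the listing's count of 0x084d, the first word 0xfad8 needs a width of 0xf28b (62091). The count is then 0xfad8, so the second word 0x0012 needs a width of 0x053a (1338): the total wraps past 0x10000 and leaves 0x0012 in the low-order 16 bits.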

The next trick is to get the argument pointer within the formatted output function to point to this address. Formatted output functions are variadic functions, typically implemented using ANSI C stdarg. These functions have no real way of knowing how many arguments have been passed, so they continue to consume arguments as long as there are additional conversion specifiers in the format string. Once the formatted output function has consumed all the actual arguments, the function's argument pointer starts to traverse the local stack variables and move up through the stack. This makes it possible for attackers to insert the address of the return address in a local stack variable or, as in this case, as part of the format string that is also located on the stack. The address is typically added at the start of the format string, which is then output by the formatted output function like any other character. The wide-character formatted output functions, however, are more likely to exit when an invalid Unicode character is detected. As a result, an address included at the beginning of a malicious format string is likely to cause the function to exit without accomplishing the attacker's goal of executing the shellcode, because the address is unlikely to map to valid Unicode. This is not a significant obstacle to determined attackers, however, because the address can be moved to the end of the format string, as in lines 29-30 of the wide-character exploit. These addresses may still cause the function to exit, but not before the return address has been overwritten.
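The reason a variadic function cannot detect missing arguments is visible in even the smallest stdarg example. The sketch below is not taken from any formatted output implementation; sum_ints() is a hypothetical function that trusts its first parameter in the same way printf() trusts its format string.

```c
#include <stdarg.h>

/* Sums `count` int arguments. The callee cannot verify `count`: it
 * simply calls va_arg that many times, just as a formatted output
 * function calls va_arg once per conversion specifier. */
long sum_ints(int count, ...) {
    va_list ap;
    long total = 0;
    va_start(ap, count);
    for (int i = 0; i < count; ++i) {
        total += va_arg(ap, int);
    }
    va_end(ap);
    return total;
}
```

A call such as sum_ints(3, 1, 2, 3) is correct. If the caller claimed more arguments than it actually passed, the extra va_arg calls would have undefined behavior and in practice would read whatever follows on the stack or in registers.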

The greatest problem introduced by moving the address pairs to the end of the string is that attackers must now progress the argument pointer past the conversion specifiers used to advance the argument pointer to the start of the dummy integer/address pairs. This creates a race in that adding conversion specifiers increases the distance the argument pointer must be advanced to reach the dummy integer/address pairs. In ASCII, the conversion specifier %x requires two bytes to represent and, when processed, advances the argument pointer by four bytes. This means that each conversion specifier of this form narrows the gap between the argument pointer and the start of the dummy integer/address pairs by two bytes (that is, the four bytes the argument pointer is advanced minus the two bytes the start of the dummy integer/address pairs is advanced). The wide-character representation of the %x conversion specifier requires four bytes to represent but only advances the argument pointer by four bytes (the length of an integer). As a result, the %x conversion specifier can no longer be used to gain ground on the dummy integer/address pairs.

Is there a conversion specifier that can be used to gain ground on the dummy integer/address pairs? One possibility is the use of a length modifier to indicate that the conversion specifier applies to a long long int or unsigned long long int. Because these data types are represented in eight bytes and not four, each conversion specifier advances the argument pointer by eight bytes. Visual C++ does not support the C99 ll length modifier but instead provides the I64 length modifier. A conversion specifier using the I64 length modifier takes the form %I64x. This conversion specifier requires five wide characters or 10 bytes to represent but, as already noted, only advances the argument pointer by eight bytes. Now you are actually losing ground! Using a compiler that supports the standard ll length modifier (such as GCC) is not much better because the conversion specifier requires four characters or eight bytes.

Another possibility is to use the a, A, e, E, f, F, g, or G conversion specifiers to output a 64-bit floating-point number and thereby advance the argument pointer by eight bytes. For example, the conversion specifier %f requires two wide characters or four bytes to represent but advances the argument pointer by eight bytes, which lets attackers gain four bytes on the address for each conversion specifier processed by the formatted output function. Lines 11-14 in the wide-character exploit show how the %f conversion specifier can be used to advance the argument pointer. Line 11 also adds a single wide character (a) to properly align the argument pointer to the start of the dummy integer/address pairs. The only problem with the %f conversion specifier is that it can cause the abnormal termination of the program if the floating-point subsystem is not loaded; hence, the otherwise unnecessary declaration of a float on line 10. In theory, this problem could limit the number of programs that could be attacked using this exploit. In practice, most nontrivial programs load the floating-point subsystem.
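The bookkeeping behind this comparison is a single subtraction, sketched below. The sizes are those assumed in the article (one-byte char, two-byte wchar_t, four-byte int, eight-byte long long and double); net_gain() is a name introduced for illustration.

```c
#include <stddef.h>

/* Net bytes gained on the trailing address pairs by one specifier:
 * the bytes the argument pointer advances minus the bytes the
 * specifier itself occupies in the format string. */
int net_gain(size_t spec_chars, size_t char_size, size_t arg_size) {
    return (int)arg_size - (int)(spec_chars * char_size);
}
```

Under those assumptions, narrow %x gains 2 bytes (net_gain(2, 1, 4)), wide %x gains 0 (net_gain(2, 2, 4)), wide %I64x loses 2 (net_gain(5, 2, 8)), wide %llx gains 0 (net_gain(4, 2, 8)), and wide %f gains 4 (net_gain(2, 2, 8)).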

Venetian Shellcode

Again, when a Windows program is declared using main(), an ASCII environment is created by the operating system for the program. If a wide-character representation of an environment variable is required, it is generated on demand. Because the Unicode string is converted from ASCII, every other byte will be zero. For example, if the ASCII string "AAAA" is converted to Unicode, the result (in hexadecimal, as stored in memory on a little-endian x86 system) is 41 00 41 00 41 00 41 00. This creates an interesting obstacle for exploit writers.

Chris Anley has done some work (see "Creating Arbitrary Shellcode In Unicode Expanded Strings: The 'Venetian' Exploit," http://www.ngssoftware.com/papers/unicodebo.pdf) on creating Venetian shellcode with alternating zero bytes (analogous to Venetian blinds). While creating these programs by hand is quite troublesome, Dave Aitel's makeunicode2.py and Phenoelit's "vense" generator are both capable of automatically generating Venetian shellcode.


Wide-character formatted output functions are susceptible to format string and buffer overflow vulnerabilities in a similar manner to multibyte-character formatted output functions, even in the extraordinary case where Unicode strings are converted from ASCII.

Unicode actually has characteristics that make it easier to exploit functions that use these strings. For example, multibyte-character functions recognize a null byte as the end of a string, making it impossible to embed a null byte (\x00) in the middle of a string. The null character in Unicode, however, is represented by \x0000. Because Unicode characters can contain null bytes, it is easier to inject a broader range of addresses into a Unicode string.
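The difference between a null byte and a null wide character can be checked directly. In this sketch, embedded_zero_bytes() is a hypothetical helper; the exact count it returns depends on the platform's wchar_t size and byte order.

```c
#include <stddef.h>
#include <wchar.h>

/* Counts the zero bytes that lie inside a wide string, before its
 * terminating null wide character. */
size_t embedded_zero_bytes(const wchar_t *ws) {
    const unsigned char *bytes = (const unsigned char *)ws;
    size_t total = wcslen(ws) * sizeof(wchar_t);
    size_t zeros = 0;
    for (size_t i = 0; i < total; ++i) {
        if (bytes[i] == 0) {
            ++zeros;
        }
    }
    return zeros;
}
```

For the two-character wide string { 0x0100, 0x0041, 0 }, wcslen() reports a length of 2 even though the string contains zero bytes; a byte-oriented function such as strlen() would stop at the first of them.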

There are a number of mitigation strategies for format string vulnerabilities. The simplest solution that works for both multibyte- and wide-character strings is to never allow (potentially malicious) users to control the contents of the format string.
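For the wide-character functions, that advice amounts to keeping the format string constant and passing untrusted text as a %ls argument. A minimal sketch, assuming the C99 form of swprintf(); format_untrusted() is a name introduced for this example.

```c
#include <stddef.h>
#include <wchar.h>

/* Copies untrusted wide text into `dst` through a constant format
 * string. Conversion specifiers inside `untrusted` are treated as
 * ordinary characters because they never reach the format position.
 * Returns the number of wide characters written, or a negative value
 * if the text does not fit in `count` elements. */
int format_untrusted(wchar_t *dst, size_t count, const wchar_t *untrusted) {
    return swprintf(dst, count, L"%ls", untrusted);
}
```

Passing L"%f%f%n" as the untrusted text yields those six characters verbatim in the output; no arguments are consumed and nothing is written through %n.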


Listing One

 1. #include <stdio.h>
 2. #include <stdlib.h>

 3. void usage(char *pname) {
 4.   char usageStr[1024];
 5.   snprintf(usageStr, 1024, 
        "Usage: %s <target>\n", pname);
 6.   printf(usageStr);
 7. }

 8. int main(int argc, char * argv[]) {
 9.   if (argc < 2) {
10.     usage(argv[0]);
11.     exit(-1);
12.   }
13. }

Listing Two
1. #include <unistd.h>
2. #include <errno.h>

3. int main(void) {
4.   execl("usage", "%s%s%s%s%s%s%s%s%s%s", (char *)NULL);
5.   return(-1);
6. }

Listing Three
 1. #include <stdio.h>
 2. #include <string.h>

 3. static unsigned int already_written, width_field;
 4. static unsigned int write_word;
 5. static wchar_t convert_spec[256];

 6. int wmain(int argc, wchar_t *argv[], wchar_t *envp[]) {

 7.   unsigned char exploit_code[1024] = "\x90\x90\x90\x90\x90";
 8.   int i;
 9.   wchar_t format_str[1024];
10.   float x = 5.3;

    // advance argument pointer 63 x 4 bytes
11.   wcscpy(format_str, L"a%f");  // 2 bytes filler
12.   for (i=0; i < 63; i++) {
13.     wcscat(format_str, L"%f");
14.   }

15.   already_written = 0x084d; 

    // first word   
16.   write_word = 0xfad8;
17.   already_written %= 0x10000;

18.   width_field = (write_word-already_written) % 0x10000;
19.   if (width_field < 10) width_field += 0x10000;
20.   swprintf(convert_spec, L"%%%du%%n", width_field);
21.   wcscat(format_str, convert_spec);

    // last word
22.   already_written += width_field;
23.   write_word = 0x0012;
24.   already_written %= 0x10000;

25.   width_field = (write_word-already_written) % 0x10000;
26.   if (width_field < 10) width_field += 0x10000;
27.   swprintf(convert_spec, L"%%%du%%n", width_field);
28.   wcscat(format_str, convert_spec);

    // two dummy int/address pairs
29.   wcscat(format_str, L"ab\xf1e0\x0012");
30.   wcscat(format_str, L"ab\xf1e2\x0012");

31.   wprintf(format_str);

32.   return 0;
33. }
