Multithreaded Asynchronous I/O & I/O Completion Ports

I/O completion ports provide an elegant solution to the problem of writing scalable server applications that use multithreading and asynchronous I/O.


August 03, 2007
URL:http://www.drdobbs.com/open-source/multithreaded-asynchronous-io-io-comple/201202921

When developing server applications, it is important to consider scalability, which usually boils down to two issues. First, work must be distributed across threads or processes to take advantage of today's multiprocessor hosts. Second, I/O operations must be scheduled efficiently to maximize responsiveness and throughput. In this article, I examine I/O completion ports—an elegant innovation available on Windows that helps you accomplish both of these goals.

I/O completion ports provide a mechanism that facilitates efficient handling of multiple asynchronous I/O requests in a program. The basic steps for using them are:

  1. Create a new I/O completion port object.
  2. Associate one or more file descriptors with the port.
  3. Issue asynchronous read/write operations on the file descriptor(s).
  4. Retrieve completion notifications from the port and handle accordingly.

Multiple threads may monitor a single I/O completion port and retrieve completion events—the operating system effectively manages the thread pool, ensuring that the completion events are distributed efficiently across threads in the pool.

A new I/O completion port is created with the CreateIoCompletionPort API. The same function, when called in a slightly different way, is used to associate file descriptors with an existing completion port. The prototype for the function looks like this:


HANDLE CreateIoCompletionPort(
   HANDLE FileHandle,
   HANDLE ExistingCompletionPort,
   ULONG_PTR  CompletionKey,
   DWORD NumberOfConcurrentThreads
   );

When creating a new port object, the caller simply passes INVALID_HANDLE_VALUE for the first parameter, NULL for the second, zero for the third, and either zero or a positive number for the NumberOfConcurrentThreads parameter. The last parameter specifies the maximum number of threads Windows schedules to concurrently process I/O completion events. Passing zero tells the operating system to allow as many concurrently running threads as there are processors, which is a reasonable default. For a discussion of why you might want to schedule more threads than available processors, see Programming Server-Side Applications for Windows 2000 by Jeffrey Richter and Jason D. Clark.
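The creation call can be sketched as a minimal, complete Win32 program (error handling kept deliberately small):

```cpp
#include <windows.h>
#include <cstdio>

int main()
{
    // INVALID_HANDLE_VALUE plus a NULL port handle means "create a new
    // port"; 0 for the last parameter lets Windows run one thread per
    // processor concurrently.
    HANDLE hPort = CreateIoCompletionPort(INVALID_HANDLE_VALUE,
                                          NULL, 0, 0);
    if (hPort == NULL) {
        fprintf(stderr, "CreateIoCompletionPort failed: %lu\n",
                GetLastError());
        return 1;
    }
    CloseHandle(hPort);
    return 0;
}
```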

Associating File Descriptors with a Port

Once a port has been created, file descriptors opened with the FILE_FLAG_OVERLAPPED flag (or created with WSA_FLAG_OVERLAPPED, for sockets) may be associated with the port via another call to the same function. To associate an open file descriptor (or socket) with an I/O completion port, the caller passes the descriptor as the first parameter, the handle of the existing completion port as the second, and a value to be used as the "completion key" as the third. The completion key value is passed back when completed I/O requests are removed from the port. The fourth parameter is ignored when associating files with an existing completion port; it is good practice to set it to zero.
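The association step can be sketched like this (hPort is the port created earlier; pConn stands in for whatever per-file context you want echoed back as the completion key, and the filename is made up for illustration):

```cpp
// Open the file for overlapped I/O -- FILE_FLAG_OVERLAPPED is required,
// or completion events will never be posted to the port.
HANDLE hFile = CreateFileW(L"data.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING,
                           FILE_FLAG_OVERLAPPED, NULL);

// Second form of the call: first parameter is the descriptor, second
// the existing port, third the completion key. Returns the port handle
// on success, NULL on failure. The fourth parameter is ignored here.
if (CreateIoCompletionPort(hFile, hPort,
                           (ULONG_PTR)pConn, 0) == NULL) {
    // association failed; consult GetLastError()
}
```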

Initiating Asynchronous I/O Requests: OVERLAPPED Explained

Once a descriptor is associated with a port (and you may associate many file descriptors with a single I/O completion port), an asynchronous I/O operation on any of those descriptors results in a completion event being posted to the port by the operating system. The same Windows APIs that let callers perform standard synchronous I/O have a provision for issuing asynchronous I/O requests. This is accomplished by passing a valid OVERLAPPED pointer to one of the standard functions. For example, take a look at ReadFile:


BOOL ReadFile(
  HANDLE    File,
  LPVOID    pBuffer,
  DWORD     NumberOfBytesToRead,
  LPDWORD   pNumberOfBytesRead,
  LPOVERLAPPED  pOverlapped
  );

For typical (synchronous) I/O operations, you pass NULL for the last parameter, but when doing asynchronous I/O, you must pass the address of an OVERLAPPED structure in order to specify certain parameters as well as to receive the results of the operation. Asynchronous calls to ReadFile typically return FALSE with GetLastError returning ERROR_IO_PENDING, indicating to the caller that the operation is expected to complete in the future.
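An asynchronous read might therefore be issued as follows (a sketch; hFile is assumed to be opened with FILE_FLAG_OVERLAPPED and already associated with a port):

```cpp
// The OVERLAPPED structure and the buffer must outlive this call, so
// both are heap-allocated; the braces zero-initialize the structure
// (Offset/OffsetHigh = 0 means "read from the start of the file").
OVERLAPPED *pOv  = new OVERLAPPED{};
char       *pBuf = new char[4096];

BOOL ok = ReadFile(hFile, pBuf, 4096, NULL, pOv);
if (!ok && GetLastError() != ERROR_IO_PENDING) {
    // genuine failure: the request was never queued
} else {
    // pending (or completed synchronously); either way, a completion
    // packet for this OVERLAPPED will arrive at the associated port
}
```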

A common mistake when using OVERLAPPED structures is to pass the address of an OVERLAPPED structure declared on the stack:


OVERLAPPED Gone;
// Set up 'Gone'..
ReadFile ( hFile, pBuf, Count,
          &NumRead, &Gone );

This just won't work because ReadFile returns immediately, and when the function containing the call to ReadFile exits, the stack is unwound and the data pointed to by &Gone becomes invalid. Thus, you must ensure that your program manages its OVERLAPPED structures (and any I/O buffers) carefully. The example employs a fairly common strategy: a C++ class representing a connection derives from OVERLAPPED, a choice that may offend some C++ purists but is a practical solution to the problem. The connections are allocated on the heap, and when I/O operations are initiated, a pointer to the connection itself is passed as the OVERLAPPED pointer.
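The idiom looks roughly like this (a sketch; the real Connection class in the sample carries more state than shown here):

```cpp
// Deriving from OVERLAPPED lets one heap allocation serve as both the
// per-connection context and the OVERLAPPED passed to the I/O calls.
struct Connection : OVERLAPPED {
    SOCKET sock;
    char   buf[4096];
    int    state;   // which asynchronous operation is outstanding

    Connection() : OVERLAPPED(), sock(INVALID_SOCKET), state(0) {}
};

Connection *pConn = new Connection;   // heap, never the stack
// 'pConn' itself is a valid LPOVERLAPPED:
// ReadFile((HANDLE)pConn->sock, pConn->buf, sizeof pConn->buf,
//          NULL, pConn);
```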

Retrieving Completed I/O Events from the Port

Now that we know how to create a completion port, associate descriptors to it, and initiate asynchronous I/O operations on the descriptors, it's on to retrieving completion events from the port. A thread removes an event from the port's queue by calling the GetQueuedCompletionStatus function:


BOOL GetQueuedCompletionStatus(
  HANDLE   CompletionPort,
  LPDWORD   pNumberOfBytes,
  PULONG_PTR   pCompletionKey,
  LPOVERLAPPED*   ppOverlapped,
  DWORD   Timeout
  );

The first parameter to this function is the handle to the port object, followed by several output pointers and a Timeout value. Once an operation has completed successfully, the variable pointed to by pNumberOfBytes contains the number of bytes read or written during the I/O operation, pCompletionKey receives the completion key passed when the file descriptor was associated with the port, and ppOverlapped receives the OVERLAPPED pointer that was passed to the asynchronous I/O function. The timeout value is specified in milliseconds and works just like those of other Windows wait functions: the special value INFINITE means "wait forever."
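A worker thread's main loop can be sketched as follows (Connection, OnIoComplete, and COMPLETION_KEY_SHUTDOWN refer to names used by the sample program):

```cpp
DWORD WINAPI WorkerThread(LPVOID pParam)
{
    HANDLE hPort = (HANDLE)pParam;
    for (;;) {
        DWORD        nBytes = 0;
        ULONG_PTR    key    = 0;
        LPOVERLAPPED pOv    = NULL;

        // Block until a completion packet (or a posted event) arrives.
        BOOL ok = GetQueuedCompletionStatus(hPort, &nBytes, &key,
                                            &pOv, INFINITE);
        if (key == COMPLETION_KEY_SHUTDOWN)  // posted at shutdown
            break;
        if (pOv == NULL)                     // port closed or hard error
            continue;

        // Recover the connection: Connection derives from OVERLAPPED.
        Connection *pConn = static_cast<Connection *>(pOv);
        pConn->OnIoComplete(nBytes, ok);
    }
    return 0;
}
```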

Sending Your Own Events: PostQueuedCompletionStatus

Before we move on to a practical example, there's one more function to discuss:


BOOL PostQueuedCompletionStatus(
  HANDLE       CompletionPort,
  DWORD   NumberOfBytesTransferred,
  ULONG_PTR    CompletionKey,
  LPOVERLAPPED pOverlapped
  );

This function lets you post completion events to the port. Typically, this function is used to send implementation-specific messages to the port. When you post a completion event to a port, one of the threads blocking on the port successfully returns from its call to GetQueuedCompletionStatus with copies of the parameters as they were posted.

This function is often used to notify worker threads of some global or application-wide event. Along those lines, the sample program presented in this article posts completion events with a special completion key value of COMPLETION_KEY_SHUTDOWN in order to tell the worker threads that the server is shutting down.
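A shutdown broadcast along those lines can be sketched like this (nWorkers and hThreads are hypothetical bookkeeping kept by main):

```cpp
// One sentinel packet per worker: each thread consumes exactly one,
// sees the shutdown key, and exits its loop.
for (DWORD i = 0; i < nWorkers; ++i)
    PostQueuedCompletionStatus(hPort, 0, COMPLETION_KEY_SHUTDOWN, NULL);

// Wait for every worker to drain and exit before tearing down.
WaitForMultipleObjects(nWorkers, hThreads, TRUE, INFINITE);
```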

A Practical Example: The Fire Web Server

When I first sat down to write this article, I recalled that my own learning experience was made somewhat difficult by the relative dearth of real-world examples. The same was true of my experience learning about some of the more esoteric Windows Sockets APIs. As these are both important Windows innovations, I decided to develop a simple, multithreaded web server that demonstrates the use of both. It is named in honor of my friend and colleague Ray Schraff, who often tells our customers that mankind has adopted the World Wide Web faster than any technology since the invention of fire. The Fire web server (available at www.ddj.com/code/) exploits I/O completion ports and the best features of Windows Sockets to deliver respectable performance in about 500 lines of C++ code.

The main() Event

All important initialization occurs in the main function, including the initialization of the Windows socket library, registration of an event handler to capture the user's request to stop the server via CTRL-C, and creation of the listener socket.

Next, a single I/O completion port is created, followed by the creation of a small pool of worker threads. Finally, a fixed number of Connection objects (each of which manages one socket) are created.

The Connection Class

The real meat of the program lies within the implementation of the Connection class. Its constructor creates a socket, associates it with the I/O completion port previously created in the main function, and finally, issues an asynchronous request to accept a client connection.

People familiar with the standard accept API may be confused by the fact that a client socket is created prior to the call to AcceptEx, so let me explain. AcceptEx requires that the client socket be created up-front, but this minor annoyance has a payoff in the end: It lets a socket descriptor be reused for a new connection via a special call to TransmitFile. This means that a server that deals with many short-lived connections can utilize a pool of allocated sockets without incurring the cost of creating new descriptors all the time.
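The accept-and-recycle sequence can be sketched as follows (sListen and pConn are assumed from the surrounding context; AcceptEx and TransmitFile are declared in mswsock.h and require Mswsock.lib):

```cpp
// Pre-create the client socket that AcceptEx will use.
SOCKET sClient = WSASocketW(AF_INET, SOCK_STREAM, IPPROTO_TCP,
                            NULL, 0, WSA_FLAG_OVERLAPPED);

// AcceptEx requires sizeof(sockaddr_in) + 16 bytes per address slot; a
// receive length of 0 means "complete as soon as a client connects".
char  addrBuf[2 * (sizeof(sockaddr_in) + 16)];
DWORD nReceived = 0;
AcceptEx(sListen, sClient, addrBuf, 0,
         sizeof(sockaddr_in) + 16, sizeof(sockaddr_in) + 16,
         &nReceived, pConn);   // pConn doubles as the OVERLAPPED

// Later, after the response is sent: disconnect but keep the socket,
// so it can be handed straight back to another AcceptEx call.
TransmitFile(sClient, NULL, 0, 0, pConn, NULL,
             TF_DISCONNECT | TF_REUSE_SOCKET);
```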

The rest of the Connection class is a simple state machine; any given connection is always in one of four waiting states.

Here's how things get rolling: When the Connection objects are allocated in the main function, they all issue asynchronous accept calls on their sockets. This means that shortly after startup, all connection objects are in the WAIT_ACCEPT state, until a client actually connects and the operating system wakes one of the worker threads.

The handling worker thread takes advantage of the fact that the Connection class is derived from OVERLAPPED, casts the OVERLAPPED pointer into a connection object, and assuming the pointer checks out, calls the Connection's OnIoComplete function.

OnIoComplete implements the Connection class's state machine—essentially transitioning from one waiting state to another by calling the appropriate CompleteXxx function. For example, when a new client connects, the CompleteAccept method is called to perform the necessary steps to prepare the socket for actual use.

Likewise, each CompleteXxx function's last move is to issue another asynchronous I/O request, whether to read more data from the client, transmit a response, or ask that the socket be reset and ready to accept a new client.
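Put together, OnIoComplete reduces to a dispatch on the current state. Only WAIT_ACCEPT is named in the text; the other state names below are illustrative:

```cpp
void Connection::OnIoComplete(DWORD nBytes, BOOL ok)
{
    // Error handling omitted for brevity ('ok' would normally trigger
    // a reset of the connection when FALSE).
    switch (state) {
    case WAIT_ACCEPT:   CompleteAccept();        break; // client arrived
    case WAIT_REQUEST:  CompleteRequest(nBytes); break; // request read
    case WAIT_TRANSMIT: CompleteTransmit();      break; // response sent
    case WAIT_RESET:    CompleteReset();         break; // socket recycled
    }
    // Each CompleteXxx ends by issuing the next asynchronous request;
    // nothing touches this object after that point.
}
```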

At this point, several items merit mention when designing around I/O completion ports and asynchronous I/O. First, as has already been discussed, because asynchronous I/O functions typically return immediately, you must ensure that any buffers passed to the calls remain valid at least until the completion event is handled. This implies heap allocation, since buffers allocated on a function's stack are destroyed when the function returns.

Second, in a server application such as Fire, any thread could handle any connection at any time. As soon as an asynchronous file operation is issued on a descriptor, it is up to the operating system to pick a thread to run the completion routine. Put differently, there is no guaranteed affinity between the thread issuing an asynchronous I/O call and the thread receiving the completion notification. For this reason, you must design your data structures carefully in order to ensure that threads don't tromp all over each other when trying to handle a request.

Last, you must be extremely careful to design your application to avoid races and the other classes of problems that arise when writing multithreaded programs. For example, when designing Fire, I spent a considerable portion of my development time convincing myself which states were necessary to consider. I was also careful to make sure that the various asynchronous I/O requests (read data, write data, reset the socket, and so on) were always the last operations performed in any of the completion handlers.

The reason is subtle, but clear: Were I to perform any other types of work in a handler function after issuing an I/O request, I would have a race condition—it would be possible for more than one thread to be operating on a connection object concurrently.

The benefit of all this careful planning, however, is that Fire does not require any mutual exclusion mechanism in its implementation.

The result should be that Fire scales well on multicore machines, since individual worker threads are never competing for resources.

Conclusion

I/O completion ports provide an elegant solution to the problem of writing scalable server applications that use multithreading and asynchronous I/O. While it is important to design such applications carefully to avoid certain types of problems such as race conditions or excessive resource contention, the benefits of doing so far outweigh the costs, especially considering the world of multicore, multiprocessor servers in which we now reside.

Acknowledgments

Thanks to Dave Cutler, Len Holgate, Paul Lloyd, and especially Dad for his detailed review.


Tom is a development team leader for Hyland Software. He can be contacted at [email protected].
