Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Embedded Systems

Asynchronous Communications Using select and poll


Sep98: Asynchronous Communications Using select and <i>poll

Sean has been a BSD developer and contributor for many years, starting with 386BSD. Sean can be contacted at [email protected].


One important feature of UNIX-like operating systems is their ability to efficiently handle diverse I/O requirements, including slow serial terminals, high-speed disk drives, multiple network protocols, and pseudodevices that provide a variety of services. For applications to efficiently juggle multiple I/O streams, the operating system provides the select() system call. With this call, you tell the operating system what streams you want to monitor, and it wakes you up when any of those streams have activity.

In "Tracing BSD System Calls" (DDJ, March 1998), I showed how I added support in the kernel for one process to monitor system calls made by another process. In this article, I'll make that facility easier to use by adding support for select() and poll().

Using Select

The select() system call was added to 4.2BSD to support a form of asynchronous I/O. Before select(), a program could only determine if a file descriptor was ready for I/O by actually doing the I/O. The two biggest advantages of select() were that it allowed multiple file descriptors to be queried at the same time, and it had a time-out value associated with it.

For example, select() makes it possible to check for input on 20 different network connections, waiting until input is available, or half a second has passed. This is particularly useful in interactive applications.

The select() call accepts three bitmaps and a time-out value. The bitmaps are of type fd_set, an opaque type you manipulate using the FD_SET, FD_CLR, FD_ISSET, and FD_ZERO macros. Each bit in one of these bitmaps refers to a particular file descriptor; the default in FreeBSD (http://www.freebsd.org/) allows for up to 256 file descriptors to be monitored, although you can make this limit larger.

Listing One ses select() to implement a simplistic but complete chat server. This server accepts connections on port 12345, and, once connected, sends whatever it receives to all other connections.

Listing One shows some of the power of select(). My simple server is a single process, yet it efficiently interleaves multiple inputs. If there is no input ready from any of the file descriptors (and, as the code shows, a new connection attempt counts as input being ready for the primary socket), the process simply waits, consuming no CPU resources. As Listing One also demonstrates, select() sets the appropriate bit on return to indicate which files match the specified criteria.

This sample program only checks for available input on a stream; the other two bitmasks let you check whether output is possible or there is an exception.

In select() parlance, exceptions are file-specific states that are not covered by the terms "input ready" or "output ready. A file descriptor is in exception state if the file-specific code decides that it is; there is no other standard. For example, a closed pipe, or out-of-band socket data may result in an exception state. Most file types and device drivers do not have an exception state.

Select Implementation

Since select() is a system call, the bulk of the code resides in the kernel -- /usr/src/sys/kern/sys_generic.c, in this case.

select() is a moderately complex system call; it also requires vnode-specific support. (A vnode is an operating-system data structure that represents a specific communications connection, such as an open file or socket.)

The top-level select() function works in a loop. On each iteration, select() checks all of the indicated file descriptors (calling the appropriate file-specific code). If any selected file descriptor is in the requested state, the loop ends, select() returns, and the process can service the request. Otherwise, select() relinquishes the processor immediately to let other processes run, and repeats the loop when something changes.

The two complicated parts are calling the appropriate file-specific code for each indicated descriptor and arranging for the loop to repeat only when some descriptor actually changes state. The first part is handled by the internal selscan() function. The second part involves some standard kernel functions -- tsleep() to put a process to sleep and wakeup() to wake it up again -- combined with a few tricks to make it all more efficient.

Checking File Descriptors

selscan calls the file-specific select routines for each selected file; if any of these functions returns a nonzero value, this indicates that the object in question is immediately ready for the requested I/O. selscan() returns the number of files ready in *retval (or p->p_retval1 in FreeBSD-current).

Some types of files (such as disk files) are always ready for I/O. In this case, selscan() will return immediately. For file systems that fully support select, the function will check to see if I/O is ready at the time, and, if not, schedule notification to be made when it is.

Waiting for a Change

The select() call waits for file-status changes in a pretty direct way. At the bottom of its top-level loop, select() calls tsleep() to put this process on the sleep queue and schedule another process. The channel it is waiting on, &selwait, is a global variable; every process sleeping in select() is sleeping on the same channel. In essence, whenever a file descriptor changes state, this channel is notified and all pending select() calls wake up and loop again to check the relevant descriptors. The actual implementation uses a few tricks to make this more efficient.

If I/O isn't immediately possible, the file-specific code calls selrecord(). Essentially, all selrecord() does is keep track of whether there is a "collision" -- that is, whether there's more than one process waiting on this event.

When a file's status changes, the file system code calls selwakeup() to wake up any processes that might be waiting for a change in this file. If there is a collision, selwakeup() simply wakes up all of the pending select() calls by notifying the shared channel with wakeup((caddr_t)&selwait). This prompts every select() call to recheck all of its file descriptors.

If there is only one process waiting for this event, selwakeup() cheats a bit and uses its knowledge of process scheduling to wake up only that single process; see Example 1.

The first if statement in Example 1 checks what channel the process is sleeping on. All processes waiting in select() have the same channel -- the address of the global variable selwait (which is used only as an address; it is never read nor modified by anything in the kernel). If the process is indeed waiting on &selwait, then selwakeup() wakes it up directly.

Once awaken, the select() call loops and calls the file-specific select functions again. This continues until a file matches the requested state, the given time-out runs out, or the process is interrupted by a signal.

File-Specific Select

The rest of the implementation of select() depends on the particular file, but is usually straightforward. For example, the UFS (UNIX File System) select function immediately indicates that I/O is ready by returning a 1. It does this even if the file is empty. A read() of such a file will return 0 bytes read, but will not hang, which satisfies the purpose of select().

Pipes have more complete support for select(). Listing Two, an excerpt from pipe_select(), is called with three parameters -- the file being examined, a flag indicating whether the file is being examined for input, output, or exception, and the process descriptor structure.

The first case in Listing Two, FREAD (input is being scanned), is typical for a file-specific select routine. If input is available immediately, it returns 1 (which tells selscan() that this file has I/O ready); otherwise, it calls selrecord() to indicate that the pipe is being selected on.

Whenever the pipe code reads or writes into the pipe buffer, or closes the pipe, it calls pipeselwakeup(). This function merely tests if the pipe is currently being selected; if so, it calls selwakeup(). This causes the monitoring process to wake up and repeat the select() loop, which results in another call to pipeselscan(), which will notice that I/O is ready.

The main point of the preceding discussion is that, if you can determine when I/O will be ready, it takes little code to fully support select().

Poll versus Select

The poll() system call is a similar routine that originated in System V Release 3.2 UNIX and has recently been added to FreeBSD. The main difference is that poll() takes an array of structures, instead of three bitmasks, to describe which file descriptors and events should be examined.

Each function has some advantages and disadvantages: select() allows time-outs to be measured in microseconds, while poll only allows tenths of a second. select() can only monitor for "input ready," "output ready," and "something else," while poll() supports many kinds of file-type-specific events.

Within the kernel, the main difference is that poll() can check for more events, and the file-specific routine specifies which events occurred. Since the functions called by selscan check a single event, and return nonzero if that event happened, the same file-specific function can serve both purposes.

procfs_select

Each file object needs to store certain select-related information. For procfs, I could have stored this information either in the proc structure (which is unique to each process) or in the pfsnode structure (which is unique to each open procfs object).

Placing it in the pfsnode structure makes sense because this information is relevant only to procfs. However, I need to add a selwakeup() call to the generic stopevent() kernel function, which is not part of procfs. For that reason, it makes more sense to store the procfs_select information in the proc structure. Modifying the proc structure is not something you do lightly; a lot of code uses and manipulates proc structures, including loadable kernel modules (LKMs) and programs such as ps. Since these aren't automatically recompiled when you recompile the kernel, there's a potential for version skew problems.

To alleviate these problems, I moved the procfs-related elements out of the proc structure, and stored only a pointer to them in the proc structure. I then padded the proc structure so that its length is unchanged; this reduces the version-skew problem.

Example 2 shows the new structure. The p_sel element is the new variable for handling select() and poll(). Now, when an event being monitored by procfs occurs, stopevent() (in /usr/src/kern/sys_process.c) must call selwakeup(); see Example 3.

The remainder of the work is to implement procfs_poll(). This is the function that will be called by selscan() (or the equivalent part of the poll() system call) for a procfs file handle. Listing Three is the result.

The bulk of procfs_poll() simply locates the proc structure of the process being monitored and checks (using the CHECKIO() macro) that the monitoring process has the necessary permission.

The semantics I've chosen for procfs_poll() are essentially identical to the procfs ioctl, PIOCWAIT; that is, it is used to find out when a monitored process stops for some event. This just requires testing procp->p_ev->p_step, which is set in stopevent() when a process stops. That is the only time it will be set.

If p_step is not set, then procfs_poll() calls selrecord(), and returns 0, indicating that none of the requested information is available now.

The only other kernel code change is to tell procfs about the new function; that is done by adding {&vop_poll_desc, (vop_t *) procfs_poll} to the structure that defines which features procfs supports.

Using select() in procfs

Initially, the UNIX truss utility used an ioctl to wait for a monitored process to stop (see "Tracing BSD System Calls," DDJ, March 1998). However, this can only monitor one process at a time, which is a problem if the program being trussed happens to do a fork() system call. To watch multiple processes, truss would need a way to watch multiple file descriptors -- which, obviously, is where all of this talk and code comes into play.

Example 4 shows the change to truss that allows it to watch multiple processes. This approach is more complex: In addition to keeping track of "maxfd," truss also has to go through ifds after select() returns to determine which file descriptors have input ready. However, there is no other practical way to monitor multiple processes short of running multiple copies of truss (which would make the output very difficult to follow).

Conclusion

At this point, the kernel has had a major feature added -- with only a few lines of code. And, with a bit more code being modified, truss is capable of using this feature.

DDJ

Listing One

/* Simple select()-driven chat server */#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <fcntl.h>
#include <err.h>
#include <netdb.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>


</p>
main() {
  struct protoent *tp = getprotobyname("tcp"); /* TCP protocol */
  int serv_sock; /* Master socket for server */
  fd_set ifds; /* Bitmap of active sockets */
  int maxfd; /* Last active socket */
  struct sockaddr_in sin;
  char **hostnames;  /* Names of currently connected hosts */


</p>
  /* Create a TCP socket */
  if (tp == NULL) err(1, "Cannot get tcp protocol\n");
  serv_sock = socket(PF_INET, SOCK_STREAM, tp->p_proto);
  if (serv_sock < 0)
    err(2, "Cannot create socket:  %s\n", strerror(errno));
  /* Bind the socket to port 12345 */
  memset(&sin, 0, sizeof(sin));
  sin.sin_len = sizeof(sin);
  sin.sin_family = PF_INET;
  sin.sin_port = htons(12345);
  sin.sin_addr.s_addr = htonl(INADDR_ANY);
  if (bind(serv_sock, (struct sockaddr*)&sin, sizeof(sin)) < 0)
    err(3, "Cannot bind socket:  %s\n", strerror(errno));
  /* Allow incoming connections */
  if (listen(serv_sock, 64) < 0)
    err(3, "Cannot listen to socket:  %s\n", strerror(errno));
  FD_ZERO(&ifds);
  FD_SET(serv_sock, &ifds);
  maxfd = serv_sock;
  for (;;) {
    int fd = -1; /* Current file descriptor (socket) with input */
    int i;
    struct timeval to = { 2, 0 }; /* 2 seconds, 0 microseconds */
    /* Wait for input from anywhere */
    fd_set tifds = ifds;
    int numfds = select(maxfd+1, &tifds, NULL, NULL, &to);
    if (numfds == -1) err(4, "Cannot select:  %s\n", strerror(errno));
    if (numfds == 0) continue;
    /* Find first connection with available input */
    fd = -1;
    for (i = 0; i <= maxfd; i++)
      if (FD_ISSET(i, &tifds)) {
        fd = i;
        break;
      }
    if (fd == -1) err(5, "Oops");
    if (fd == serv_sock) { /* New connection request */
      struct sockaddr_in remote;
      int len = sizeof(remote);
      int rfd;
      /* Accept the connection */
      rfd = accept(serv_sock, (struct sockaddr*)&remote, &len);
      if (rfd == -1) err(6, "Cannot accept:  %s\n", strerror(errno));
      /* Add the new connection to file descriptors I watch */
      FD_SET(rfd, &ifds);
      if (hostnames && rfd < maxfd && hostnames[rfd]) {
        free(hostnames[rfd]);
        hostnames[rfd] = strdup(inet_ntoa(remote.sin_addr));
      } else {
        maxfd = rfd;
        hostnames = realloc(hostnames, (maxfd+1)*sizeof(char*));
        hostnames[rfd] = strdup(inet_ntoa(remote.sin_addr));
      }
    } else {  /* Input is from a chat participant */
      char buf[80];
      int count;
      char *h = hostnames[fd];
      /* Get the input */
      count = read(fd, buf, sizeof(buf));
      if (count == 0) FD_CLR(fd, &ifds);
      else {
        /* Relay it to every other participant */
    int i;
        for (i = 0; i <= maxfd; i++) {
          if (FD_ISSET(i, &ifds) && i != serv_sock && i != fd) {
            write(i, h, strlen(h));
            write(i, ": ", 2);
            write(i, buf, count);
          }
        }
      }
    }
  }
}

Back to Article

Listing Two

/* Excerpt from pipe_select() in /usr/src/sys/kern/sys_pipe.c */switch (which) {
case FREAD:
  if ( (rpipe->pipe_state & PIPE_DIRECTW) ||
       (rpipe->pipe_buffer.cnt > 0) ||
       (rpipe->pipe_state & PIPE_EOF)) {
         return (1);
  }
  selrecord(p, &rpipe->pipe_sel);
  rpipe->pipe_state |= PIPE_SEL;
  break;
case FWRITE:
  if ((wpipe == NULL) ||
      (wpipe->pipe_state & PIPE_EOF) ||
      (((wpipe->pipe_state & PIPE_DIRECTW) == 0) &&
       (wpipe->pipe_buffer.size - wpipe->pipe_buffer.cnt) >= PIPE_BUF)) {
          return (1);
  }
  selrecord(p, &wpipe->pipe_sel);
  wpipe->pipe_state |= PIPE_SEL;
  break;
case 0:
  if ((rpipe->pipe_state & PIPE_EOF) ||
      (wpipe == NULL) ||
      (wpipe->pipe_state & PIPE_EOF)) {
         return (1);
  }
  selrecord(p, &rpipe->pipe_sel);
  rpipe->pipe_state |= PIPE_SEL;
  break;
}

Back to Article

Listing Three

/* Excerpt from file /usr/src/sys/miscfs/procfs/procfs_vnops.c *//* Implement poll for procfs -- not as featureful as it could be, but a start
 * for now, it only checks for a process stopping via STOPEVENT().
 */
static int
procfs_poll(ap)
  struct vop_poll_args /* {
    struct vnode *a_vp;
    int a_events;
    struct ucred *a_cred;
    struct proc *a_p;
  } */ *ap;
{
  struct pfsnode *pfs = VTOPFS(ap->a_vp);
  struct proc *procp, *p;
  int error;
  p = ap->a_p;
  procp = pfind(pfs->pfs_pid);
  if (procp == NULL) {
    return ENOTTY;
  }
  if (!CHECKIO(p, procp))
    return EPERM;
  if (!procp->p_ev) {
    get_procfs_event(procp);
  } else if (procp->p_ev->p_step) {
    return (ap->a_events & (POLLRDNORM | POLLWRNORM));
  }
  selrecord(p, &procp->p_ev->p_sel);
  return 0;
}

Back to Article


Copyright © 1998, Dr. Dobb's Journal

Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.