Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Design

Porting Unix to the 386: Device Drivers


APR92: PORTING UNIX TO THE 386 DEVICE DRIVERS

Bill was the principal developer of 2.8 and 2.9BSD and was the chief architect of National Semiconductor's GENIX project, the first virtual-memory, microprocessor-based UNIX system. Prior to establishing TeleMuse, a market research firm, Lynne was vice president of marketing at Symmetric Computer Systems. They conduct seminars on BSD, ISDN, and TCP/IP. Send e-mail questions or comments to [email protected]. (c) 1992 TeleMuse.


For the last couple of months, we've been examining device drivers and 386BSD. We pick up where we left off last month, focusing on interrupt routines.

In Listing One (page 108), we use a series of macros to implement the interrupt entry stubs, using the C preprocessor. Typically, for each system compiled, a configuration program creates this file of interrupt stub routines (see Listing Two, page 108), which handles interrupt entry (at Vwd0 in Listing Two) and processes this file to the point where it can be passed off to the driver interrupt-handler function (in this case, wdintr). To keep the macros succinct, they are written out of other macros. To keep the code short, it is implemented inline with a minimum of branches, although it could have been written as a series of subroutines.

The macros ORPL( ) and SETPL( ) are nearly identical to the internal macros __ORPL__( ) and __SETPL__( ), used to implement the spls. They are used for the same reasons, but in assembler, not C. Macro INTR( ) builds an interrupt entry stub per invocation. The address of these stubs is obtained from the IDT table entry that the 386 fetches when processing an interrupt. The first instructions executed when handling an interrupt are at the beginning of this macro. We begin by crafting a trap frame suitable for a variety of purposes that may result from this incoming interrupt, and saving the processor state so we may restore it when we return from the interrupt later.

After saving the state in a trap frame, we send a command to the ICUs to dismiss the "in service" bit (turned on when the processor began processing the interrupt to lockout lower-priority requests). This indicates to both ICUs that the interrupt is stacked and that we will allow any subsequent interrupts that are unmasked, regardless of priority. This is done as soon as possible to forestall interrupts that might otherwise be missed.

Next, the segment registers the kernel will use are saved and loaded with known values. Alternatively, the interrupt could have been received when the processor was in user mode. In that case, both the ES and DS segment selectors would reflect the user process, not the kernel. We need worry only about the ES and DS registers because the kernel will not use FS and GS. The instruction selector and stack selector are handled automatically by the interrupt mechanism of the 386 and are already stacked.

We then OR in our interrupt vector's group mask to disable any companion-device interrupts (as well as the device). We must save the old mask value on the stack, so that it is put back the way we found it. Finally, we stack the unit number of this interrupt vector and reenable interrupts to allow unmasked devices to interrupt while we are processing this interrupt (nesting). We can now call a C device interrupt-handler function.

At this point, an interrupt frame is present on the stack, the kernel's segment registers are set to be consistent with what the kernel's program expects, all competing interrupts are blocked and unrelated interrupts are unblocked and active. The interrupt vector stub (see Listing Two) then calls the C handler (in our earlier example wdintr) to process the device's request and return. At this point, all device interrupt routines are effectively called with:

wdintr(frame)       struct intr_frame frame;

Getting Out of Interrupts

You'd think that leaving an interrupt would be as easy or easier than getting in, but this is not the case. We still have some housekeeping chores to do on the way out that may be unrelated to the interrupt we were servicing. This clouds the picture a little, as seen in Listing Three (page 108). This housekeeping is basically a performance enhancement that moves the processing of certain kernel services running at interrupt level (and blocking the processor from other interrupts) to run just ahead of the user process.

In anticipation of these chores, we've given both interrupts and traps similar stack frames. By popping off two words, we are back into a trap frame and ready to look for windows to wash. We restore the previous mask into the ICUs and check whether we are returning to a point where interrupts were masked before (a nested interrupt or critical section, for example). If so, we branch to do a simple return by unwinding our trap frame. We can do this because we are certain that the code we are returning to will eventually return to an unmasked level, and then the chores will be done.

We first check whether any network protocol software interrupts must be processed. Incoming packets received by drivers are enqueued at the device-interrupt level, and a software interrupt is requested. On some machines (DEC VAX, for instance), the software-interrupt mechanism is implemented in hardware. (One sets a bit on a special register, which causes an interrupt when the processor returns to an unmasked level.)

For the 386, we emulate this feature in software. If we find a request, we lockout other competing interrupt returns by setting the PROCESSING bit and successively call all protocol input routines that have been requested to be called, clearing each request as we go.

In a similar fashion, we call the software interrupt associated with the rescheduling clock. Earlier versions of UNIX did most scheduling calculations and watchdog routines at high levels. (Even worse, because they didn't use queues of processes, linear searches of all processes were done, and this was expensive--cf. Version 6.) This method did not enhance UNIX's reputation in the real-time department. One significant change in the BSD kernel was to move as much of the interrupt-level processing out to interruptible points and minimize the impact of overall system overhead. Clock processing turns out to account for much overhead in this regard (so much so, that 4.2BSD's 100-Hz clocking was cut back to 60 Hz), so this change was well justified.

Our final chore is to check whether we are returning to a user process, as opposed to the kernel. The contents of the code selector are checked to see if the selector will be executing user or kernel code. We're a bit sloppy here, because we must check only the 2 low-order bits of the selector to find out if it belongs to the USER ring (3). This suffices for now, but we may have to revisit this code when 386BSD is enhanced to support multisegmented programming.

If we are headed back to a user program, we must quickly check to see if the kernel has marked this process to be rescheduled. If so, we take the rescheduling trap we prepared for at the start of the INTR( ) macro. As mentioned in the 386BSD articles on multiprogramming (see DDJ, September and October 1991), we have few opportunities to allow rescheduling. We could induce a case in which we're safe from deadlocks by rescheduling when we are returning to the user program.

UNIX as Interface--Is it Adequate?

Although every edition of UNIX has contained innovative new areas of design, many device interfaces have not been altered to take advantage of this new technology. As such, the mechanism for communicating with the device driver is virtually identical from micro- to supercomputer, from first to last edition.

Few people are willing to take on the daunting (or insane) challenge of breaking and fixing every single driver created! Because the best drivers already exploit the current interface to great advantage, changing the interface may seem a bigger problem than it is worth.

The rewards coupled with this challenge are less obvious, so a careful look at emerging technology considerations is in order. A problem this big must be carefully outlined:

Multithreaded I/O Devices. A near-term nuisance, commonly noticed with the use of disk arrays in particular, is the difficulty in adapting the characteristics of multithreaded (that is, more than one concurrent stream of I/O operations) devices to the flat, strictly synchronous I/O model implied by the current UNIX device-driver model. This does not suggest an arbitrary jump to an asynchronous model. History has shown the simplicity and elegance of the synchronous model to be valuable allies in the battle against operating-system kernel "bloat." However, here we need coordinated, synchronous scheduling of I/O that can be used by the higher levels in the operating system to arrange the "time domain" of multithreaded operation to best advantage.

Streamer Devices. Almost every UNIX system has a magical kludge to support streaming devices such as cartridge tape, DAT, or 8MM video tape. These devices require extensive buffering to maintain "streaming operation"--in which the device can synchronously accept data to match its physical transfer "commitment". If this commitment is not met, the device must stop, rewind, get up to speed again, and re-do the transfer when the data is finally present. This causes the tape drive to "wheeze" back and forth and the data transfer to slow to a fraction of its top data rate; this turns system backups into interminable waits. Worse yet, the nature of these devices is that you only learn of a tape write error long after the driver has reported the block "written." This is because our UNIX device-driver interface implies a 1:1 relationship between "raw" transfers and their error status, which means that only one can be outstanding at a given time. Therefore, most of the kludges break the rules to attempt to keep the device double-buffered, all to varying degrees of success. The amount of RAM and disk storage is going up radically with time (some say soon to about 1 gigabyte per user), so high-speed backup is not just desirable; it's necessary for survival! We need a class of device with so-called "tear-away" I/O, where a portion of a process's address space is passed off to the device driver and the backup process. This way, a queue of regions of memory can be operated on by the driver in strict, synchronous order, with the depth of the queue appropriate to the demands of the device.

Redundancy of Common Code. Even with the previously mentioned efforts to collect driver common code into what is effectively a support library of routines, many of the system's drivers still have considerable "common code" left inside. You begin to wonder if this method is adequate. An alternative approach would be to express the routines as methods of a driver "object" in an object-oriented language, such as C++. Thus, the commonality could be dealt with by having classes of drivers, and the state of an active driver/device could be expressed as an instance variable created on demand. In other words, it may be time to apply object-oriented languages to the kernel itself.

Information Caching. The driver interface also suffers in terms of cached information possibly shared by other processors (say, in a multiprocessor or a dual-ported disk or via a network). To maintain cache consistency, invalidation mechanisms are necessary, perhaps even at the driver level. Also, cache mechanisms above the driver level need additional information about the status of a drive's queued transfers to determine if the drive is already saturated with requests. If so, the cache should decline requests on that drive until it finishes that to which it has already committed. Drivers should be viewed as presenting an employable set of capabilities that the file-system cache can exploit.

The device-driver interface should provide a real-time schedule for I/O to conclude in a deterministic period, and a mechanism for ordered writes in a strictly ordered fashion. (With almost every UNIX system today, one has to "sync" multiple times to ensure that sandbagged disk I/O makes it out before shutting down the system.) The file-system cache is also handicapped on writes because it must keep the disk consistent--so a single write of an additional block of data in a file causes as many as three or four separate writes to update related data structures on the disk to keep them consistent. By adding these missing mechanisms, the system's cache could maintain stability without losing performance.

Device Conflict. Because autoconfiguration is usually only done at bootload time, devices that conflict (and are hence "lost") cannot be used, and devices cannot be added after the system is up and running. This was acceptable when BSD UNIX ran on VAX mainframes for months at a time between reconfigurations, but it's irritating to constantly reboot a PC if, say, a device was unplugged or turned off when the system last booted. Configuration and resource allocation (interrupts, DMA channels) need to be done uniformly, at will, and in an automatic fashion.

File System to Device Disparity. The semantics of many device drivers fit the "file" metaphor, but UNIX drivers don't have the complete semantics of files at all. Although this is no great loss in many cases (for example, unit record devices, such as serial ports and printers), for others (notably mass storage devices), file attributes such as size, write-protection, modes, and even device-naming conventions visible to the programmer are not visible to the device driver, causing a loss of potential functionality. We can repair this by developing a special device (spec) file system, where device drivers are attached. In this case, the primitive operators of the Virtual Filesystem of the BSD kernel correspond to all of the user-level process file operations to the device driver itself. (For example, we export the virtual file abstraction down to the device driver.) This is currently implemented in BSD. At the last moment, however, it converts to the ancient UNIX device-driver conventions to avoid dealing with the problem (for now). (If this sounds like shades of "Plan-9," who are we to say differently?...)

MACH devotees have decreed that device drivers in user processes are a must, in order to allow them to be dynamically loaded. But by choosing our file abstraction appropriately, we can gain the facility for loadable drivers without all the smoke and mirrors. If access through to the drivers is uniform through the central concept of a virtual file system, then the interface can be "remoted," much as NFS provides for remote files. (This is just an analogy, not a suggested mechanism.) In short, the choice of a better device-driver interface can offer substantial advantages without necessarily requiring large amounts of radical code that would contribute to the bloat factor of the kernel.



_PORTING UNIX TO THE 386_
by William Frederick Jolitz and Lynne Greer Jolitz


[LISTING ONE]
<a name="00eb_0006">

/* [Excerpted from /sys/i386/isa/icu.h] */
 ...
/* Macro's for interrupt level priority masks (used in assembly code) */
/* mask additional interrupts */
#define ORPL(m) \
    cli ;               /* disable interrupts */ \
    movw    m , %dx ;       /* get the mask */ \
    inb $ IO_ICU1+1, %al ;  /* next, get low order mask */ \
    xchgb   %dl, %al ;      /* switch the old with the new */ \
    orb %dl, %al ;      /* finally, or it in! */ \
    outb    %al, $ IO_ICU1+1 ; \
    inb $0x84, %al ; \
    inb $ IO_ICU2+1, %al ;  /* next, get high order mask */ \
    xchgb   %dh, %al ;      /* switch the old with the new */ \
    orb %dh, %al ;      /* finally, or it in! */ \
    outb    %al, $ IO_ICU2+1    ; \
    inb $0x84, %al ;    /* flush write buffer, delay bus cycle */ \
    movzwl  %dx, %eax ;     /* return old priority */ \
    sti ;               /* enable interrupts */
/* force interrupt mask */
#define SETPL(v)    \
    cli ;               /* disable interrupts */ \
    movw    v , %dx ; \
    inb $ IO_ICU1+1, %al ;  /* next, get low order mask */ \
    xchgb   %dl, %al ;      /* switch the old with the new */ \
    outb    %al, $ IO_ICU1+1 ; \
    inb $0x84, %al ; \
    inb $ IO_ICU2+1, %al ;  /* next, get high order mask */ \
    xchgb   %dh, %al ;      /* switch the copy with the new */ \
    outb    %al, $ IO_ICU2+1    ; \
    inb $0x84, %al ;    /* flush write buffer, delay bus cycle */ \
    movzwl  %dx, %eax ;     /* return old priority */ \
    sti ;               /* enable interrupts */
/* Mask a group of interrupts atomically - interrupt entry */
#define INTR(unit,mask,offst) \
    pushl   $0 ;          /* first, build a trap frame for ... */ \
    pushl   $ T_ASTFLT ;  /* ... possible rescheduling that may occur */ \
    pushal ; \
    nop ; \
    movb    $0x20, %al ;    /* next, as soon as possible send EOI ... */ \
    outb    %al, $ IO_ICU1 ; /* ...so in service bit may be cleared ...*/ \
    inb $0x84, %al ;    /* ... ASAP */ \
    movb    $0x20, %al ;    /* likewise, the other one as well */ \
    outb    %al,$ IO_ICU2 ; \
    inb $0x84,%al ; \
    pushl   %ds ;       /* save our data and extra segments ... */ \
    pushl   %es ; \
    movw    $0x10, %ax ;    /* ... and reload with kernel's own */ \
    movw    %ax, %ds ; \
    movw    %ax, %es ; \
    incl    _cnt+V_INTR ;   /* tally interrupts */ \
    incl    _isa_intr + offst * 4 ; \
    movw    mask , %dx ;    /* assert group mask */ \

   inb $ IO_ICU1+1, %al ; /* next, get low order mask */ \
    xchgb   %dl, %al ;  /* switch the old with the new */ \
    orb %dl, %al ;  /* finally, or it in! */ \
    outb    %al, $ IO_ICU1+1 ; \
    inb $0x84,%al ; \
    inb $ IO_ICU2+1, %al ; /* next, get high order mask */ \
    xchgb   %dh, %al ;  /* switch the old with the new */ \
    orb %dh, %al ;  /* finally, or it in! */ \
    outb    %al, $ IO_ICU2+1 ; \
    inb $0x84, %al ; \
    pushl   %edx ;      /* save old mask for when we return */ \
    pushl   $ unit ;    /* finish off interrupt frame with unit # */ \
    sti ;           /* and allow other unmasked interrupts */
...





<a name="00eb_0007">
<a name="00eb_0008">
[LISTING TWO]
<a name="00eb_0008">

/* AT/386 -- Interrupt vector routines -- Generated by config program  */

#include    "machine/isa/isa.h"
#include    "machine/isa/icu.h"

#define VEC(name)   .align 4; .globl _V/**/name; _V/**/name:
    .globl  _hardclock
VEC(clk)
    INTR(0, ___highmask__, 0)
    call    _hardclock
    INTREXIT1

    .globl _wdintr, _wd0mask
    .data
_wd0mask:   .long 0
    .text
VEC(wd0)
    INTR(0, ___biomask__, 1)
    call    _wdintr
    INTREXIT2

    .globl _fdintr, _fd0mask
    .data
_fd0mask:   .long 0
    .text
VEC(fd0)
    INTR(0, ___biomask__, 2)
    call    _fdintr
    INTREXIT1

    .globl _cnrint, _cn0mask
    .data
_cn0mask:   .long 0
    .text
VEC(cn0)
    INTR(0, ___ttymask__, 3)
    call    _cnrint
    INTREXIT1

    .globl _npxintr, _npx0mask
    .data
_npx0mask:  .long 0
    .text
VEC(npx0)
    INTR(0, _npx0mask, 4)
    call    _npxintr
    INTREXIT2

    .globl _comintr, _com0mask
    .data
_com0mask:  .long 0
    .text
VEC(com0)
    INTR(0, ___ttymask__, 5)
    call    _comintr
    INTREXIT1

    .globl _weintr, _we0mask
    .data
_we0mask:   .long 0
    .text
VEC(we0)
    INTR(0, ___netmask__, 6)
    call    _weintr
    INTREXIT1

    .globl _lptintr, _lpt0mask
    .data
_lpt0mask:  .long 0
    .text
VEC(lpt0)
    INTR(0, ___ttymask__, 7)
    call    _lptintr
    INTREXIT1





<a name="00eb_0009">
<a name="00eb_000a">
[LISTING THREE]
<a name="00eb_000a">

/* [Excerpted from /sys/i386/isa/icu.h] */
 ...
/* First eight interrupts (ICU1) */
#define INTREXIT1   \
    jmp doreti
/* Second eight interrupts (ICU2) */
#define INTREXIT2   \
    jmp doreti
 ...
/* [Excerpted from /sys/i386/isa/icu.s] */
 ...
/* Handle return from interrupt after device handler finishes  */
doreti:
    /* move to a trap frame */
    cli             /* interrupts off while we work ... */
    popl    %ebx            /* remove unit number */
    popl    %eax            /* get previous priority mask */

    /* restore previous mask */
    movw    %ax, %cx
    outb    %al, $ IO_ICU1+1
    inb $0x84, %al
    movb    %ah, %al
    outb    %al, $ IO_ICU2+1
    inb $0x84, %al

    /* are we at interrupt level / nested interrupt already ? */
    cmpw    ___nonemask__, %cx
    jne 3f

    /* do we need to process an network software interrupt ? */
    cmpl    $0, _netisr
    je  2f
    btsl    $ NETISR_PROCESS, _netisr
    jb  2f

#include "../net/netisr.h"
#define DOCALL(n, s, c) ; \
    .globl  c ;  \
    btrl    $ s , n ;  \
    jnb 1f ; \
    call    c ; \
1:
    /* process a network software interrupt */
    sti
    DOCALL(_netisr, NETISR_RAW, _rawintr)
#ifdef INET
    DOCALL(_netisr, NETISR_IP, _ipintr)
#endif
#ifdef IMP
    DOCALL(_netisr, NETISR_IMP, _impintr)
#endif
#ifdef NS
    DOCALL(_netisr, NETISR_NS, _nsintr)
#endif
    btrl    $ NETISR_PROCESS, _netisr
2:
    /* do we need to process a software clock "interrupt" */
    cli
    btrl    $ SCLK_NEED, ___softclock__
    jnb 1f
    btsl    $ SCLK_PROCESS, ___softclock__
    jb  1f

    /* process a software clock "interrupt" */
    sti
    pushl   ___nonemask__   /* to an interrupt frame again */
    pushl   %ebx
    call    _softclock
    popl    %eax        /* back to trap frame for possible AST */
    popl    %eax
    btrl    $ SCLK_PROCESS, ___softclock__
1:
    /* see if we need to process an AST (rescheduling) fault */
    cmpw    $0x1f, tCS*4(%esp) /* were we executing from a user mode ... */
    jne 3f      /* ... code selector? */
    DOCALL(___ast__, AST_NEED, _trap);
3:
    /* restore the state and return */
    popl    %es
    popl    %ds
    popal
    nop
    addl    $8, %esp
    iret
 ...


Copyright © 1992, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.