Software Partitioning for Multitasking Communication

Partitioning communications into layers is one technique for increasing performance.

September 01, 1991
URL:http://www.drdobbs.com/parallel/software-partitioning-for-multitasking-c/184408619

Figure 1

Figure 2

Figure 3

The key to high performance from any hardware

David McCracken

David is a consulting engineer in the embedded systems field and can be contacted at 6850 Freedom Blvd., Aptos, CA 95003.

Embedded applications are increasingly demanding concurrent functions. Users no longer tolerate, for example, a machine that shuts down the user interface while printing. Nowhere is this more evident than in a rapidly emerging application class in which the computer allows its operator to communicate transparently with other machines through a variety of media. Especially where the computer acts as a turn-key controller, the user does not interact with it as a computer and is unforgiving of behavioral restrictions dictated by the deficiencies of a hidden entity.

It has been suggested that general purpose multitasking operating systems will provide the basis for such applications. But to achieve generality, these operating systems suffer an enormous context-switch time penalty. Naive programmers think that hardware manufacturers will simply make machines that are fast enough to solve any performance problems. But embedded applications are often cost-sensitive. Additionally, at any point in time there is a bulk of computing hardware that provides the best cost/performance, has the most alternate sources, and is well-understood. If software can be crafted to keep an application within the limits of this hardware, the resulting product is easier to manufacture and maintain. Proper software partitioning is a key element in extracting the highest performance from any hardware. This is particularly true in the design of multitasking applications.

Multitasking generally serves two purposes: to simplify complex programming tasks and to improve performance. A general-purpose operating system such as Unix provides an example of the former. Independently written programs, both cooperative and stand-alone, can peacefully coexist in Unix. The easiest way to develop a large program is to divide it into processes that can be developed in relative isolation. But this doesn't afford any improvement in performance. In fact, the overhead of task swapping lowers performance compared to the same job being accomplished by a single-task process. To improve performance, event-driven multitasking is needed. Assume that a given application includes a job that can't complete without communicating with an external entity, and that this communication doesn't require the CPU's total processing bandwidth. By using an interrupt dedicated to the specific entity to grab CPU time, we not only tell the CPU when to work but also what to work on.

Machine Hierarchy

To simplify application programming, some operating systems use a single unified task-switching mechanism. The mechanism provides time slices for separate programs and responds to external events by analyzing their priorities and launching appropriate tasks. Thus, tasks that communicate with external entities are full-fledged programs. The principal problem with this is the enormous overhead associated with dispatching a task that has all the rights of a high-level program. External communication usually involves moving many individual bytes -- a very simple task -- until a critical amount of information has moved, allowing the program to advance to its next major state, at which point it probably requires high-level privileges.

An analogous situation exists in translators, such as compilers. Input text is analyzed by grouping individual characters into tokens and then parsing the tokens into statements. Tokenizing requires only a simple state machine, a Deterministic Finite Automaton (DFA). Parsing requires a more complex PushDown Automaton (PDA), which is essentially a state machine with a stack. A PDA certainly has the power to do anything that the DFA can, including tokenizing, but at a higher computational price, because it's a more complex machine. Similarly, the higher levels of translation are too complex for the PDA, requiring the power (and price) of a Turing machine.

Giving all levels of a communication task equally powerful computing facilities is equivalent to a compiler using a Turing machine for tokenizing (scanning), parsing, and code generation. But to achieve higher performance, most compilers use (simulated) DFA and PDA for the first two phases, reserving the powerful and expensive Turing capability for use only where it is essential.

Partitioning

To better utilize our machine's power, we need to identify its computing mechanisms of varying power. We then need to partition any communication function into layers that are distinguished by computing complexity and that align with the available mechanisms. Note that the mechanisms are machines only in a theoretical sense: They are not specific hardware. The same basic hardware can, for example, support two different kinds of task switchers, one that apportions CPU time to fully privileged high-level tasks, and another that provides reduced computing capability in response to external events. In this scheme, the typical application is partitioned horizontally by functions, such as user interface, LAN interface, printer, and so on, and vertically by computing complexity. Figure 1 illustrates an application consisting of four tasks, A through D, operating in a system with three computing levels, 1 through 3. Each task may execute at all three levels or at only one or two of them.

The horizontal task partitioning can usually be done without considering the vertical computing levels. Obviously, though, we can't do an adequate vertical partitioning without knowing just what kind of computing power is available at each level. To a large extent, this depends on the computing platform -- the base hardware, added hardware, the native operating system (if there is one), and any operating system support that we might provide in the application program itself. Many application designers overlook the last of these, believing, for example, that MS-DOS can't be used for a multitasking application.

A program designed for reliability and ease of maintenance minimizes the connections between the different task/level boxes. An extreme approach to task encapsulation would demand, for example, that taskA/level1 know nothing of the organization of data in taskA/level2 or in taskB/level1. In Unix, a pipe could be used to allow taskA to communicate with taskB while enforcing their complete separation. But performance can be improved by sharing data, either through a virtual pipe as in the MACH operating system, or through explicitly shared memory.

The job of partitioning a design is difficult because it entails many trade-offs and must be done before we understand the application fully. It must be done even when the scheme I've outlined here is not followed. This scheme, however, provides a rational approach to the process.

A Concrete Example

In many applications, a single computer is called upon to serve as a communication hub, connecting a variety of external devices. For example, I recently finished the design of a controller for a complex medical chemical analyzer. An ISA (Industry Standard Architecture, aka AT-compatible) computer was programmed to communicate simultaneously with the analyzer via GPIB (General-Purpose Interface Bus, IEEE488) and with a host computer via RS-232 while printing results and interacting with the user via a windowed interface. During all this overt activity, the application called for independent, concurrent long-range quality control related to both the test results and the controller's operation. I chose MS-DOS for the native operating system because it is compact, stable, and has a good supply of support applications, such as compilers and window libraries, and because most of the programmers on the project were familiar with it.

MS-DOS doesn't provide multitasking, and its functions are not reentrant. Confronted by these limitations, many programmers are surprised to see it providing the base for this sort of application. Identifying (and creating) the levels of computing machinery and partitioning the tasks to match are what make this design work.

The obviously discrete functions of the application determine the program's horizontal partitioning. DMA (Direct Memory Access) provides the lowest level computing mechanism. DMA represents a partial state machine, where each transition is based solely on the current state, regardless of the input. Smarter DMA devices that can look for defined inputs do exist, like Zilog's Z8410, but the Intel 8237 used in most ISA machines is not one of them. In fact, as used in ISA machines, the 8237 can't even provide complete DMA capability without some higher-level assistance because its 16-bit counters don't cover the 24-bit address range of the machine.

Interrupt SubRoutines (ISR) provide the next level of computation. Many operating systems use external event-driven interrupts only to trigger the dispatcher, which then launches the corresponding high-level task, effectively eliminating the ISR as a distinct computing level. In our design, the interrupts trigger an immediate and direct response, thereby automatically telling the CPU not only when to do a context switch but also what context to switch to, with very little overhead. However, the price we pay is that the ISRs cannot provide the highest level of computation. Interrupts can occur even while a DOS function is executing. Because DOS is not reentrant, the ISRs must execute without calling DOS. This doesn't mean that we have to, for example, write our own disk access functions for use in an ISR, but that we partition each task so that disk access is needed only at a higher level. Any data processed in an ISR must reside in memory.

The highest computing level is provided by simple, non-preemptive round-robin multitasking. Each task checks for any work to be done and then releases its time slot by calling a function whose sole purpose is to record the task's current context and restore that of the next task in the cycle. Each task in turn is given the opportunity to function as a full DOS program until it gives up its time. These are clearly cooperative tasks, in the usual sense that they operate on common data and toward common goals, and also in the sense that they must voluntarily give up their time.
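The round-robin cycle can be roughly illustrated in portable C. This is only a sketch under a simplifying assumption: the release function of Listing One saves and restores a task's context, whereas here each task is a run-to-completion function, and returning from it stands in for calling release. The task names and work counters are hypothetical.

```c
#include <stddef.h>

/* Hypothetical task type: one call is one time slot; returning releases it. */
typedef void (*task_fn)(void);

static int ui_work, printer_work;   /* pending work items (illustrative) */
static int ui_done, printer_done;   /* completed work items */

/* Each task checks for work, does at most one chunk, then gives up its slot. */
static void ui_task(void)
{
    if (ui_work > 0) { ui_work--; ui_done++; }
}

static void printer_task(void)
{
    if (printer_work > 0) { printer_work--; printer_done++; }
}

static task_fn const tasks[] = { ui_task, printer_task };

/* One full dispatch cycle: every task gets exactly one slot, in fixed order. */
static void dispatch_cycle(void)
{
    size_t i;
    for (i = 0; i < sizeof tasks / sizeof tasks[0]; i++)
        tasks[i]();
}
```

Because nothing preempts a task, the length of a cycle is simply the sum of the chunks the tasks choose to execute, which is why each chunk has to be kept short.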

The high-level partition of each task must use polling to determine the work to be done because its time slice is not synchronized to any events. This degrades performance only slightly because for each high-level event there can be thousands of low-level ones efficiently processed by the task's ISR and/or DMA partitions. For example, the GPIB ISR reduces all events to a single doubleword (unsigned long) bit array that the high level can test in a few CPU cycles.

Timing

In this application, the GPIB, RS-232, and printer tasks all have dedicated ISRs. Only the GPIB has a data flow volume--as much as 17K in one block--sufficient to demand using the DMA. The attached device limits the maximum data rate to 75K/sec, and often the rate is much lower. Consequently, I chose to operate the DMA in its CPU cycle-stealing mode, even though burst mode uses a more efficient memory cycle that effects a memory/input/output transfer in nearly half the time. When stealing cycles, the 8237 requires six clocks at five MHz (its clock is independent of the microprocessor's), or 1.2 microseconds, to perform a transfer. Thus, even at the maximum data rate of one byte every 13.3 microseconds, DMA consumes only 1.2/13.3, or nine percent, of the bus bandwidth. Further, on many DMA hits, the CPU won't be slowed down at all because its multicycle instructions and prefetch queue elastically couple execution to bus access. In contrast, burst mode would last long enough for the CPU to deplete its available instructions and then be forced to remain idle.
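The nine percent figure follows from the numbers quoted above. The small helper below reproduces the arithmetic; the function name is hypothetical, and it assumes "75K/sec" means 75,000 bytes per second with one byte moved per DMA transfer.

```c
/* Fraction of bus time taken by cycle-stealing DMA.
 * clocks_per_transfer: DMA clocks needed to move one byte
 * dma_clock_hz:        DMA controller clock rate in Hz
 * bytes_per_sec:       data rate of the attached device
 */
static double dma_bus_fraction(double clocks_per_transfer,
                               double dma_clock_hz,
                               double bytes_per_sec)
{
    double transfer_time = clocks_per_transfer / dma_clock_hz; /* s per byte */
    double byte_period   = 1.0 / bytes_per_sec;                /* s between bytes */
    return transfer_time / byte_period;
}
```

Calling dma_bus_fraction(6.0, 5.0e6, 75000.0) gives 1.2 microseconds divided by 13.3 microseconds, or about 0.09.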

That cycle-stealing DMA is essentially free (in CPU time) should be enough to convince anyone to try to find the portions of a task that such a dumb mechanism can compute. Unfortunately, ISA machines don't provide a means to connect the serial and parallel (printer) ports to DMA. GPIB capability is provided by a plug-in card that supports DMA.

The (CPU time) expense of an ISR is largely determined by the amount of data processing we want it to do. As expected, greater computing capability is more expensive. For example, most commercial serial communication libraries afford ISRs that provide little more than DMA does. Typically, the input function simply transfers bytes to a memory buffer and sets a flag when it sees a particular value. ISR context switching requires only the time needed for INT and IRET, 40 clocks (23+17), plus another 32 clocks for stacking and unstacking registers (say AX, DX, DS, and SI). With a 20MHz CPU clock, these 72 clocks consume 3.6 microseconds. The real work of the ISR probably consumes another 5 to 15 microseconds. But data processing is deferred. A high-level function is expected to poll the flag and process all data itself. This results in slow responses to external events. It also is not very efficient because the high level must poll for many communication situations that could have been handled by the ISR, and because the contents of the input buffer must usually be moved to memory locations determined by the application. It is often not possible to anticipate the ultimate destination of input before seeing some portion of it, such as a control header.

The ISR can be turned into a more powerful computing mechanism by giving it explicit state memory. Even the simple ISR has a little state awareness implied by the counter used to access memory buffers. One way to convert a simple ISR into an interrupt-driven state machine is to explicitly encode the address of the next state process in a table indexed by the current state either alone or in combination with an input value. At each interrupt, the appropriate process is invoked by calling (or jumping to) the address found in the table.

Most states persist for more than one interrupt, so it would be unreasonable to advance a state counter automatically. Instead, each state would independently decide when to advance. Figure 2 diagrams a hypothetical case in which state 4 persists for nine interrupts and then advances to 4a, where it increments the state counter to 5.

An explicit state table is not the best method of encoding the transition function. It is inefficient for each state to decide when to advance the counter and then simply increment a value that will subsequently be used to look up an address in the table. At lower cost, each state can put the address of the next state into the counter, which is not really a counter anymore, although it serves the same purpose. At the next interrupt, the ISR vectors directly through the counter. Thus, in Figure 2, the counter would contain an address, the table would be eliminated, and at 4a the address of state 5 would be put into the counter. Not only does this approach speed up the state machine overhead and reduce memory requirements, but it also simplifies modifying the transition function.
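In C terms, a state "counter" that holds an address is a function pointer. The sketch below mirrors the Figure 2 example under that reading: state 4 persists for nine interrupts and then stores the address of state 5. The actual ISRs are written in assembly; all names here are hypothetical.

```c
/* The "state counter" holds the address of the current state's handler. */
typedef void (*state_fn)(void);

static void state4(void);
static void state5(void);

static state_fn state_counter = state4; /* hypothetical starting state */
static int state4_remaining = 9;        /* state 4 persists for nine interrupts */
static int state5_hits;                 /* counts interrupts handled in state 5 */

static void state4(void)
{
    /* ...per-interrupt work for state 4 would go here... */
    if (--state4_remaining == 0)
        state_counter = state5;         /* the step labelled 4a in Figure 2 */
}

static void state5(void)
{
    state5_hits++;
}

/* Body of the interrupt handler once registers have been saved:
 * vector directly through the state counter, with no table lookup. */
static void isr_body(void)
{
    state_counter();
}
```

Changing the transition function means changing only the address a state stores, which is the flexibility the text refers to.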

The basic overhead of the interrupt-driven state machine is slightly more than that of the simple ISR, adding two instructions. After pushing registers we add a jump through the state counter in data memory, JMP [state_counter], and at some point in each state we add an instruction that loads the state counter with a literal address, MOV [state_counter], OFFSET next_state. These add 11 and 3 clocks respectively, or 0.7 microseconds at the 20MHz CPU clock, to each interrupt. For negligible cost we buy a substantially more powerful mechanism.
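The overhead figures can be checked with a one-line conversion from clocks to microseconds. The helper names are hypothetical; the clock counts are the ones quoted in the text, and a 20MHz CPU clock is assumed.

```c
/* Microseconds taken by a number of CPU clocks at a given clock rate in MHz. */
static double clocks_to_us(unsigned clocks, double cpu_mhz)
{
    return clocks / cpu_mhz;
}

/* INT (23) + IRET (17) + stacking and unstacking four registers (32). */
static double simple_isr_switch_us(void)
{
    return clocks_to_us(23 + 17 + 32, 20.0);
}

/* Indirect jump through the state counter (11) + loading the next state (3). */
static double state_machine_extra_us(void)
{
    return clocks_to_us(11 + 3, 20.0);
}
```

The first function yields the 3.6 microseconds of the simple ISR context switch, the second the additional 0.7 microseconds of the state machine.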

Up to this point, partitioning decisions have been compelled by simple logic. The remaining decisions become increasingly arbitrary. For one, assembly language was chosen for all ISRs. Few programmers dispute that it can deliver higher performance than even C or Forth, and likewise that it can encourage software chaos. Consider that much of the usefulness of HLLs comes from their libraries, which are not safe to use in this application because they can call DOS without our knowing it. Also, most ISRs contain a substantial amount of low-level hardware manipulation, which C allows but doesn't facilitate. Perhaps the most important argument for using assembly language is that because interrupts can occur at any time, we are willing to pay a lot to be able to dispose of them as quickly as possible. The CPU time taken by the ISRs varies considerably over the different states, the quickest executing in about 10 microseconds (including context switch) and the longest in about 100 microseconds.

The simple round-robin task dispatcher shown in Listing One (page 96) allows a high-level context switch to occur in 5.3 microseconds. This obviously affords more efficient use of CPU time than more complex preemptive dispatchers, many of which take 1000 times longer to perform a context switch. However, the dispatcher doesn't assume responsibility for ensuring that tasks actually give up their time. The individual task programs must be designed to do their work in discrete chunks. Typically, they scan a prioritized work list, execute the first ready item, and then give up their time by calling the release function shown in Listing One.

If the high-level portion of a task has no work ready to perform when it is dispatched, then it immediately releases, consuming less than 8 microseconds. If, on the other hand, it has work that involves disk access, its time slot may stretch out to 200 milliseconds or more. Consequently, the round-robin dispatch cycle time is nondeterministic. But we can determine the worst case maximum time. Any portion of a task that requires a faster response than the total cycle time must be partitioned into the task's ISR. Unfortunately, because the final program itself determines the cycle time, we can only estimate the timing threshold when we do the partitioning.

The ISRs are by nature more difficult to write, so we want to place most of a task in the high level. The worst result of overestimating the round-robin cycle time is that some of the ISRs may be doing more work than is dictated by timing requirements. Underestimating, however, may allow the threshold to push beyond a time-sensitive function that has been apportioned into the high level. The only way to guarantee reliable operation in this case is to move that function into the ISR. The actual design did not experience this problem because I guessed a very conservative 4-second cycle time. The actual worst case time is one second. The average time of about 80 milliseconds is fast enough that the user interface slows noticeably only during heavy window banging.

While the high-level portion of each task executes in a predetermined time slot (of varying width), the lower levels are distributed in time. Ideally, interrupts and DMA would be uniformly distributed in order to avoid time-demanding hot spots. The ISR and DMA portions of all tasks can steal CPU time from the high level of any task. Figure 3 illustrates a typical time slice in which ISRs steal cycles from the high level and DMA steals from ISRs as well as from the high level.

Data Connection

Having decided, for performance, to use directly shared memory for communication between the partitions, we have two major concerns: how to achieve data encapsulation and concurrency. It does seem as if shared memory and data encapsulation are incompatible. Consider communication between the GPIB high level and ISR. The obvious way to separate them is to provide one input pipe and one output pipe, thus establishing a very restricted connection. This incurs several severe performance penalties. One is that the high level has to move data in and out of the pipes. Small amounts could be moved reasonably quickly, but the GPIB data blocks are as large as 17Kbytes. A second problem is the memory wasted on pipes that have to be long enough to hold all of the data that might be transferred during a worst case round-robin dispatching cycle. A more subtle but equally severe problem is that inputs cannot be effectively prioritized, because the high level has to process the input in the order received. Even a virtual pipe, scanned by the high-level program, must be treated as a FIFO at least until the various input blocks have been separated and identified. As mentioned earlier, these problems also exist with most commercial serial communication libraries, but the greater speed and data volume of the GPIB amplify their effect.

The only way to avoid handling all the data twice and at inopportune times is to give the ISR some understanding of the contents, which violates encapsulation and moves some of the sophistication from the high level to the ISR. The key to resolving this dilemma is, not surprisingly, proper partitioning to match task functions to mechanisms of appropriate power. Each GPIB interchange consists of an initial transmission by the external device (the chemical analyzer) followed by a response from our controller. Each analyzer transmission is from 1 of 16 categories, as indicated in a 10-byte header. The controller's response, which must occur in about 50 milliseconds, is based on the transmission category. The relatively rapid response time requires the ISR to handle the interchange without timely help from the high level.

To enable the ISR to respond appropriately to the input category while limiting its understanding of the data, the ISR is allowed to know only a generic version of this transaction. It uses the category listed in an input's header to index a table of transaction descriptors, which is maintained by the high level. Each descriptor is a structure that tells the destination address for this type of input, the source address of the response, and several pieces of information used to assure the integrity of the transaction. The ISR code is oblivious to the specific categories. The high-level program can modify the transactions simply by modifying the table (statically or on-the-fly) without affecting ISR code.

Listing Two (page 96) shows the first three of 16 transaction descriptors. From the assembler's point of view each structure is just a group of eight words (16 bytes), but comments and initializing statements are arranged to portray an array of structures. The first structure describes the response to the "status" input. The input destination is a far global pointer, OFFSET _status, SEG _status. The input length element tells the ISR that this address is the beginning of a block of 12,040 bytes. The ISR will not allow input data to fill beyond this point. The input header is supposed to list a data length no greater than the available space, but defensive programming is essential whenever dealing with inputs from the outside world.
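For readers who think in C rather than assembler, one descriptor might be pictured roughly as follows. The field order, the field names, and the use of the eighth word as padding are assumptions made for illustration; only the overall size (eight words) and the kinds of information stored come from the text. Clamping an oversized length is likewise just one possible way to keep input from filling beyond the destination block.

```c
#include <stdint.h>

/* Hypothetical C view of one 16-byte transaction descriptor. */
struct xact_desc {
    uint16_t in_off, in_seg;    /* far pointer to the input destination */
    uint16_t in_len;            /* capacity of the destination block in bytes */
    uint16_t out_off, out_seg;  /* far pointer to the response source */
    uint16_t out_len;           /* response length in bytes */
    uint16_t chkstat_chklb;     /* CHKSTAT in the most significant byte */
    uint16_t reserved;          /* assumed filler to reach eight words */
};

#define N_CATEGORIES 16

static struct xact_desc xact_table[N_CATEGORIES];

/* Return the descriptor for a header category, or 0 if it is out of range. */
static const struct xact_desc *xact_lookup(unsigned category)
{
    return category < N_CATEGORIES ? &xact_table[category] : 0;
}

/* Defensive limit: never accept more input than the destination can hold,
 * whatever length the incoming header claims. */
static uint16_t xact_clamp_input(const struct xact_desc *d, uint16_t header_len)
{
    return header_len > d->in_len ? d->in_len : header_len;
}
```

The generic code never names a category; it only indexes the table, so the high level can retarget a transaction by rewriting one entry.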

The response half of the descriptor lists source address, OFFSET ack, SEG ack and its 11-byte length. This unvarying "acknowledge" response to several inputs, including status, doesn't need to be controlled by the high level. But the descriptor table treats all transactions uniformly so that the ISR code understands only the generic form. The 00h Most Significant Byte (MSB) of CHKSTAT/chkLB tells the ISR that the last word (2 bytes) of the source contains a valid checksum. The acknowledge response never changes, and recomputing the checksum at every transmission would be wasted effort.

In the second descriptor, the command_request input has an uninitialized destination so the high level must fill this in before the first command_request transmission. The CHKSTAT byte in the "command" response is 01h, not 00h as in the case of ack. This indicates that the ISR must compute the checksum before transmitting "command." Giving the ISR this responsibility allows the high level to change data in the "command" source at any time without having to recompute the checksum after each change. Whenever the high level changes data contents, it sets CHKSTAT to 1. The ISR computes the checksum only once, just before transmitting the data. It then changes CHKSTAT to 0 so that, unless the high level makes additional changes, the ISR knows that the checksum remains valid. Any response source that is in constant flux has CHKSTAT value 02h, which tells the ISR to always recompute the checksum. In this case, the high level can ignore CHKSTAT when changing data.
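The three CHKSTAT values amount to a small protocol between the high level and the ISR. The sketch below expresses it in C under several assumptions: the checksum is taken to be a 16-bit sum of the preceding bytes, stored low byte first in the last word, because the article does not give the real algorithm, and the function and constant names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

enum {
    CHK_VALID  = 0,   /* stored checksum is good; transmit as is */
    CHK_STALE  = 1,   /* data changed; compute once, then mark valid */
    CHK_ALWAYS = 2    /* data in constant flux; recompute every time */
};

/* Placeholder checksum (assumed): 16-bit sum of the given bytes. */
static uint16_t sum16(const uint8_t *buf, size_t len)
{
    uint16_t s = 0;
    size_t i;
    for (i = 0; i < len; i++)
        s = (uint16_t)(s + buf[i]);
    return s;
}

/* Called just before transmitting a response of len bytes, whose last
 * two bytes hold the checksum. */
static void prepare_response(uint8_t *buf, size_t len, uint8_t *chkstat)
{
    uint16_t s;
    if (*chkstat == CHK_VALID || len < 2)
        return;                           /* nothing to recompute */
    s = sum16(buf, len - 2);
    buf[len - 2] = (uint8_t)(s & 0xFF);   /* low byte first (assumed order) */
    buf[len - 1] = (uint8_t)(s >> 8);
    if (*chkstat == CHK_STALE)
        *chkstat = CHK_VALID;             /* valid until the next data change */
}
```

The high level's side of the bargain is a single store: after changing data it sets CHKSTAT to 1 and moves on.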

The third descriptor lists the same "command" response for the phase_request input as for the command_request input. Such redundancies are a small price to pay for generic ISR code. Another small price is that performance degrades slightly because the program has to access variables indirectly. A spaghetti-coded ISR could embed the values as literals.

Listing Three (page 96) shows the first three elements of a second table which the ISR consults to determine which flag to set to tell the high level that a particular input has been received. The high level, written in C, considers flags (gpib_stat in Listing Two) as a single unsigned long, two unsigned shorts, or four unsigned chars. The ISR treats them as an array of 4 bytes. Each entry in the table, which is indexed by the input category, indicates the byte and bit to set. This arbitrary mapping allows the high-level program to determine the optimum congregation of flags to minimize its testing effort. Note that the declaration extern unsigned char gpib_stat [4] supports simpler byte selection than if defined as a larger object, such as the unsigned long that it really is. For example, the second 2 bytes can be accessed as *(unsigned short*) (gpib_stat+2).
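The flag table idea can be sketched in C as follows. The byte and bit assignments are invented for illustration and are not those of Listing Three; the ISR side would really be assembly. A fixed-width 32-bit type with memcpy is used for the doubleword test because unsigned long is not 4 bytes on every modern compiler.

```c
#include <stdint.h>
#include <string.h>

unsigned char gpib_stat[4];          /* four flag bytes shared with the ISR */

struct flag_map {
    unsigned char byte;              /* which of the four bytes */
    unsigned char bit;               /* which bit within that byte */
};

/* Hypothetical mapping for the first three input categories. */
static const struct flag_map flag_table[] = { {0, 0}, {0, 1}, {2, 7} };

#define N_FLAGS (sizeof flag_table / sizeof flag_table[0])

/* ISR side: set the flag assigned to this input category. */
static void set_input_flag(unsigned category)
{
    if (category < N_FLAGS)
        gpib_stat[flag_table[category].byte] |=
            (unsigned char)(1u << flag_table[category].bit);
}

/* High-level side: test all 32 flags with one doubleword comparison. */
static int any_input_pending(void)
{
    uint32_t all;
    memcpy(&all, gpib_stat, sizeof all);
    return all != 0;
}
```

When nothing has arrived, the high level's poll costs one load and one compare, which is why polling degrades performance so little.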

The GPIB DMA/ISR connection is largely determined by the decisions regarding the high level/ISR relationship. The high level has essentially nothing to do with the mechanics of communication. Once set in motion, the ISR/DMA runs freely in a cycle that begins with the DMA being set to input the 10-byte header. DMA is incapable of intelligent processing, so it is set up to issue the GPIB interrupt after receiving the last header byte. The external device (analyzer) treats the header and data as an uninterrupted transmission. Fortunately, GPIB specifies a transmission synchronized by handshaking on every byte. This gives the ISR time to inspect the header and determine the appropriate response from the transaction descriptor array.

The ISR sets up the DMA again, this time to receive the remainder of the input, whose length was listed in the header. When the ISR is invoked at the end of this, it is automatically vectored to its next state (as described earlier). DMA is set to transmit the entire response with an interrupt occurring only at the end, at which point the cycle repeats.

This is an abbreviated description. The actual ISR contains several additional states for dealing with various GPIB peculiarities. One more state is worth discussing. As mentioned earlier, the DMA controller, I8237, doesn't support the full address range of the computer, or even the reduced range (640K-bytes) of MS-DOS. It provides only the lowest 16 address bits, the remainder being provided by a simple latch which is mapped into the computer's I/O space independently of the DMA chip. Thus, DMA can't handle any input or output buffer that crosses a 64K boundary: 65536, 131072, 196608, and so on. The operating system and C libraries can allocate memory blocks that don't cross logical segment boundaries, but they are oblivious to these hardware boundaries. There is no reasonable way to guarantee that the application will not produce such blocks. We have to move the boundary problem to a mechanism of more power than the DMA. The ISR checks each input or output for boundary crossing before setting up the DMA transfer. If a breach would occur, then the transmission is split into two parts separated by an interrupt, where the ISR adjusts the latched address. It is best to try to avoid such untidy partitioning, but sometimes we have little choice.
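The boundary check itself is simple arithmetic on the physical address. The following C sketch shows one way to compute the split; the structure and function names are hypothetical, and it assumes a transfer is never longer than 64K (the largest block in this application is about 17K), so at most one boundary can be crossed.

```c
#include <stdint.h>

/* One DMA programming step: physical start address and byte count. */
struct dma_part {
    uint32_t addr;
    uint32_t len;
};

/* Split the transfer [addr, addr+len) at the first 64K boundary it crosses.
 * Returns the number of parts written to out (1 or 2).
 * Assumes len <= 64K, so no more than one boundary is involved. */
static int dma_split(uint32_t addr, uint32_t len, struct dma_part out[2])
{
    uint32_t room = 0x10000UL - (addr & 0xFFFFUL); /* bytes left in this 64K page */

    out[0].addr = addr;
    if (len <= room) {
        out[0].len = len;        /* fits: one DMA setup is enough */
        return 1;
    }
    out[0].len  = room;          /* first part runs up to the boundary */
    out[1].addr = addr + room;   /* second part starts in the next page */
    out[1].len  = len - room;
    return 2;
}
```

For example, a 0x2000-byte buffer at physical address 0x1F000 splits into 0x1000 bytes at 0x1F000 and 0x1000 bytes at 0x20000, with the page latch rewritten between the two.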

A final data connection issue that must be addressed by all multitasking and multiprocessing systems is concurrency of nonatomic data. An isolated datum that can be read or written without interruption usually presents no problem (semaphores require a more demanding atomic test and set). In the C library, signal.h defines sig_atomic_t, an integer type that the processor can load or store atomically in the presence of asynchronous interrupts. This is a short integer when the compiler generates code for processors below the 80386.

Even if all integer types could be atomically accessed, structures, arrays, and isolated data related by application are not atomic. For example, one of the controller's functions is to send to the chemical analyzer a list of the tests to be performed. The number of tests can vary, so the transmitted data specifies the count as well as the test types. If the count doesn't match the actual number of tests, the analyzer will misinterpret the entire transmission. When the analyzer asks for the test list, the ISR/DMA responds immediately. If the high-level process happens to be writing into the list, the count could be new and the tests old data, or vice versa. The only way to guarantee concurrency of tests and count is to make the group atomic.

Any input or output may contain atomic data groups. Obviously, there is no native (hardware or operating system) mechanism to provide atomic access; the application must provide it. One possibility is already available in the transaction descriptor array. Every buffer could be duplicated. The high level could ping-pong between two duplicates by changing the address in the appropriate descriptor to point to the buffer not being accessed. This approach presents some problems. One is that the buffers already consume about 60K of memory. A more subtle problem is that the output data is updated piecemeal by asynchronous processes. We want to transmit the most recent data available. In a sense, output preparation is never completed, and the only reasonable time to switch buffers is when the analyzer asks the controller to transmit. But we have already established that this point is not synchronized to the high level. Therefore, double-buffering does not solve the concurrency problem.

The most general solution is to block high-level access to communication buffers whenever the GPIB ISR or DMA is active. Such a draconian solution would give the high level very few opportunities to access data. A more reasonable approach is to realize that data is atomic relative to specific asynchronous events. By identifying the blocking events more specifically, we create larger access windows. For example, given ten input and ten output buffers, blocking a particular access only when the ISR and DMA are processing a particular buffer makes the window 20 times larger.

To access a group of data atomically, the high-level process first sets up a "load list" of copy descriptors, each of which contains the source, destination, and length of an item to be copied. For example, an array of integers, regardless of its length, is specified by a single descriptor, while noncontiguous integers must be specified by separate descriptors. The load list also identifies the one GPIB event that blocks access. The load list is passed to an access control function that disables interrupts and then checks the blocking event against current GPIB activity. If they don't match, the function executes all of the moves specified by the load list and then reenables interrupts. If they do match, the function reenables interrupts and either releases or returns to the caller, depending upon which action the caller has requested. If the access function has released, then each time it is subsequently redispatched, it checks again for unblocked access, and finally completes the transaction.
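A load list and its access control function could look something like this in C. It is a sketch only: the structure layouts, the gpib_active_event variable, and the event numbering are hypothetical, and the interrupt enable/disable routines are empty stand-ins for the CLI and STI instructions the real target would use.

```c
#include <stddef.h>
#include <string.h>

/* One item to copy: source, destination, and length in bytes. */
struct copy_desc {
    void       *dst;
    const void *src;
    size_t      len;
};

struct load_list {
    int                     blocking_event; /* the one GPIB event that blocks access */
    const struct copy_desc *items;
    size_t                  count;
};

/* Hypothetical: maintained by the ISR; -1 means no GPIB activity. */
static volatile int gpib_active_event = -1;

/* Stand-ins for CLI/STI so the sketch compiles anywhere. */
static void irq_disable(void) { }
static void irq_enable(void)  { }

/* Returns 1 if every copy was performed atomically, 0 if access was blocked. */
static int atomic_load(const struct load_list *ll)
{
    size_t i;

    irq_disable();
    if (gpib_active_event == ll->blocking_event) {
        irq_enable();
        return 0;                /* caller may release its slot and retry later */
    }
    for (i = 0; i < ll->count; i++)
        memcpy(ll->items[i].dst, ll->items[i].src, ll->items[i].len);
    irq_enable();
    return 1;
}
```

A caller that wants the "release" behavior described above would loop, calling release() each time atomic_load returns 0; a caller that wants the "return" behavior simply checks the result once.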

Performance

In the application I've described, the GPIB communication link has certain critical time periods established by the external device. Because there is no combination of standard hardware and general-purpose MTOS (MultiTasking Operating System) that could meet these requirements, the general opinion (but not mine) was that the goals could not be met without multiprocessing. We considered a design in which each communication link would be handled by a separate processor. Obviously, the extra hardware is more expensive, reduces MTBF (Mean Time Between Failures), and presents a very complicated debugging situation (even getting the program-under-development into the independent processors for testing is complicated). Furthermore, this approach does not eliminate the difficulties of controlling and synchronizing data.

Multitasking and multiprocessing have essentially the same intertask communication overhead per CPU. Therefore, the only legitimate excuse for solving a problem with multiple processors would be if the sum of the data processing in all tasks is more than one processor can handle in the time available. There is no simple formula to tell you whether this is the case before you actually implement your solution. Something to keep in mind is that in embedded applications, "too slow" means lost data, while "too fast" means too expensive and probably less reliable. "Fast enough" is exactly what we want, and proper partitioning is essential to achieving a design that is fast enough.



[LISTING ONE]



COMMENT $ ------------------ _RELEASE --------------------------------------
  A task can call this from any point to release its time. However, better
heap utilization will result from freeing memory allocated during a time slot
before releasing. To the caller, release looks like a simple function whose
prototype is void release(void). All other currently enabled tasks are then
dispatched in turn until the cycle is complete and this task is redispatched
by returning from the call. The si, di, ds, bp registers are restored as
expected. Thus, the release function also acts as the dispatcher. Because it
dispatches with "ret", the dispatcher automatically uses the correct size of
task dispatch address (near or far) for SMALL or LARGE CODE.
$
_release   PROC
   push bp
   push ds
   push di
   push si
   mov ax,dseg
   mov ds,ax
   mov si,[task_no] ;Get current task number.
   shl si,1    ;*2 to convert to index for sp word table.
   mov word ptr stk_ptrs[si],sp ;Save sp for next dispatch of this task.
nextsk:   dec [task_no]
   jnz dpatch    ;If task_no is OK then go dispatch the next task.
;Task 1 (on bottom of stack heap) has just finished; so verify it didn't
; corrupt stack below allotment by checking "end of stack" marker is valid.
   mov bp,mark_pos      ;Point to marker location on stack.
   cmp [bp],marker_value   ;Is the mark still there?
   jne scrash       ;Crash on task_no=0 means task 1 overran its stack.
   mov si,[max_t_no]
   mov [task_no],si    ;Rollover task count to upper limit.
dpatch: mov si,[task_no]    ;Get current task number.
   cmp byte ptr task_enable[si],0
   je nextsk       ;If this task is disabled then try for next one.
   shl si,1       ;*2 to convert to index for sp word table.
   mov bp,mark_pos[si] ;Get this task's stack marker.
   cmp [bp],marker_value   ;Did the preceding task overflow its stack?
   jne scrash
   cli          ;Don't allow interrupts while monkeying with stack.
   mov sp,word ptr stk_ptrs[si] ;Retrieve sp saved during last dispatch
                                     ;of this task.
   sti                  ;Interrupts OK now because any individual stack
                             ;should be able to support them.
   pop si
   pop di
   pop ds
   pop bp            ;Assume these registers are only ones stored through
                         ;release and re-dispatch cycle. However, watch out
                         ;for possible register variables.
   ret       ;Dispatch task by returning to its release return
                         ;address. Note that dispatch/release is implicitly a
                         ;loop only by operation with dispatched tasks.
_release ENDP
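
The order in which _release visits tasks can be modeled in a few lines of C. This is a simplified sketch of the task-selection logic only; it assumes four tasks and an arbitrary enable table, and it omits the stack switching and stack-marker checks that the assembly performs. The name next_task is hypothetical.

```c
#define MAX_T_NO 4   /* assumed upper task number for this sketch */

/* Index 0 is unused: task numbers run from 1 to MAX_T_NO. */
static unsigned char task_enable[MAX_T_NO + 1] = { 0, 1, 0, 1, 1 };

/* Return the number of the task dispatched after task_no releases.
   Mirrors the nextsk/dpatch loop: count down, roll over from 0 to
   MAX_T_NO, and skip disabled tasks. Like the assembly, it never
   returns if no task is enabled. */
int next_task(int task_no)
{
    do {
        if (--task_no == 0)
            task_no = MAX_T_NO;
    } while (!task_enable[task_no]);
    return task_no;
}
```

With the table above, task 4 hands off to task 3, task 3 skips the disabled task 2 and hands off to task 1, and task 1 rolls over to task 4.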

[LISTING TWO]



COMMENT $ GPIB transaction descriptors.
  Each descriptor lists input destination address, maximum input length that
can be tolerated, type of response (used for concurrency control), checksum
status of the response, response source address and its length.
$
_gpib_trans LABEL WORD
;  |----INPUT DESTINATION---------| |------------- RESPONSE -----------------|
;      address              length   type CHKSTAT/chkLB    address     length
;...... analyzer status ..........| |.............. acknowledge .............|
dw OFFSET _status, SEG _status, 12040, 0f000h, 00fdh, OFFSET ack , SEG ack , 11
;....... command request .........| |.............. command .................|
 dw      0   ,     0   ,        12, 0200h, 0100h, OFFSET _cmnd, SEG _cmnd, 245
;....... phase request ...........| |.............. command .................|
 dw      0   ,     0   ,        12, 0200h, 0100h, OFFSET _cmnd, SEG _cmnd, 245
               .
               .

               .
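
For a C program that needs to read this table, each descriptor can be viewed as eight 16-bit words. The structure below is a sketch that assumes the word order shown in the column headings above. The field names are hypothetical, and the type and CHKSTAT/chkLB words are left as opaque values because the listing does not define their bit layout.

```c
#include <stdint.h>

/* One GPIB transaction descriptor: eight consecutive words. */
typedef struct {
    uint16_t in_off;      /* input destination: offset            */
    uint16_t in_seg;      /* input destination: segment           */
    uint16_t in_max_len;  /* maximum tolerated input length       */
    uint16_t resp_type;   /* response type (concurrency control)  */
    uint16_t resp_chk;    /* CHKSTAT/chkLB word, not decoded here */
    uint16_t resp_off;    /* response source: offset              */
    uint16_t resp_seg;    /* response source: segment             */
    uint16_t resp_len;    /* response length                      */
} gpib_trans_desc;

/* Convert a real-mode segment:offset pair to a 20-bit linear address. */
uint32_t far_to_linear(uint16_t seg, uint16_t off)
{
    return (((uint32_t)seg << 4) + off) & 0xFFFFFUL;
}
```

Because every member is a 16-bit word, the structure occupies 16 bytes with no padding, matching one dw line of the table.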

[LISTING THREE]



;-------------------- Input Available Status ----------------------
_gpib_stat db 0,0,0,0  COMMENT $ _gpib_stat is defined as a 4 byte object so
that a C program can access each byte as unsigned char (UC), any pair as
unsigned short (US), or the entire group as unsigned long (UL). In C, the data
is declared by "extern UC gpib_stat[4]". A quick check for any input would be
"if(*(UL*)gpib_stat)". Accessing as smaller objects allows categorical
checking without having to check each bit individually. For example,
"if(*(US*)(gpib_stat+1))" tests for any bit set in bytes 1 and 2. $

;-------------------- Status flag selector table ---------------------
flag_select   LABEL WORD        ;Flag selectors: byte selector and bit mask.
        dw 0001h                ;status: flag byte 0, bit 0.
        dw 0002h                ;command request: flag byte 0, bit 1.
        dw 0004h                ;phase request: flag byte 0, bit 2.
               .

               .
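
The status bytes and selector table above might be used from C along the following lines. This is a sketch under assumptions, not the application's code: it assumes that the high byte of each selector word picks the flag byte and the low byte is the bit mask (consistent with the comments on the three entries shown), it uses fixed-width types in place of the UC/US/UL typedefs, and it defines local copies of gpib_stat and flag_select so that it stands alone. It also uses memcpy rather than pointer casts so that the multi-byte reads are well defined on any compiler.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t  UC;
typedef uint16_t US;
typedef uint32_t UL;

static UC gpib_stat[4];                              /* local stand-in */
static const US flag_select[] = { 0x0001, 0x0002, 0x0004 };

/* Set the input-available flag for one transaction (hypothetical helper). */
void set_flag(unsigned transaction)
{
    US sel = flag_select[transaction];
    gpib_stat[sel >> 8] |= (UC)(sel & 0xFF);         /* byte selector, bit mask */
}

/* Quick check: is any flag set in the whole 4-byte group? */
int any_input(void)
{
    UL all;
    memcpy(&all, gpib_stat, sizeof all);
    return all != 0;
}

/* Categorical check: is any flag set in the byte pair starting at
   first_byte (0, 1, or 2)? */
int any_in_pair(unsigned first_byte)
{
    US pair;
    memcpy(&pair, gpib_stat + first_byte, sizeof pair);
    return pair != 0;
}
```

Checking a pair or the whole group costs one comparison, so the high level can poll for broad categories of input first and examine individual bits only when something is pending.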