
DEMAND PAGED VIRTUAL MEMORY

Putting a mainframe environment on a micro

Kent Dahlgren

Kent Dahlgren is the strategic planner for graphics products at Paradise Systems, a manufacturer of graphics devices. He can be reached at 800 E. Middlefield Rd., Mountain View, CA 94043.


Once an exotic large-system technology, Demand Paged Virtual Memory has come to the world of small computers. The new generation of 32-bit microprocessors, with blazing instruction throughput and advanced on-chip memory management, brings mainframe performance to the desktop. With it comes the ability to run several programs at the same time, each requiring more memory than the computer physically possesses. How? That's what this article is about.

So what is Demand Paged Virtual Memory, or DPVM? In a nutshell, it's the partitioning of a program's linear-address space into smaller units called "pages". Even very large programs confine their activities at any given instant to small regions of memory: a loop in executable code, a table in data space, some variables here and there. Thus, there is never a need for all the code and data to be in memory at once.

This observation, often referred to as locality of reference, is the basis for DPVM. As programs run, the pages they need are moved into memory, replacing pages no longer required. If a page containing changes is to be replaced, it's saved on disk to prevent data loss and for access later. Other jobs can run while this process, called swapping, is going on. In this way, DPVM supports concurrent execution of several programs, any one (or all) of which "thinks" its linear memory space is unlimited.

Naturally all this doesn't occur by magic. It takes a combination of hardware and software called a Virtual Memory Manager, or VMM. The CPU must provide memory management and the high-instruction rate required to run the sophisticated operating system that makes it happen. The system must also have a high-performance disk to handle swapping efficiently. And of course there have to be enough cycles left after the overhead to do productive work.

The key element in DPVM, then, is the CPU. Heretofore, microprocessors simply didn't have enough horsepower or features. Now they do.

So let's look at the workings and issues of DPVM, and how three current-generation microprocessors handle them.

How Does It Work?

Paged memory management uses a dynamic, relocatable partitioning scheme to manage main memory. The partitions, referred to as pages, are of uniform size, typically 4K. Unlike the segmented memory familiar to users of Intel processors, the existence and manipulation of pages are transparent to processes running on the system. This transparency comes from taking the addresses generated by the CPU (referred to as logical or virtual addresses) and translating them into real memory addresses. Consequently, a process always sees an uninterrupted linear address space, even though the pages allocated to it are probably not contiguous in main memory.

The logical to physical translation is accomplished by breaking addresses into two components. The high-order bits represent the page number, with the low-order bits giving the offset within the page. Logical addresses are mapped onto the physical addresses of page frames located in main memory using a data structure known as the page map table, which contains the current logical to physical mapping. The page offset bits are unmapped and therefore simply passed through as the physical page offset. The page map table also stores bit fields describing protection and cachability. Having such information for each page is another benefit of paged memory management. Figure 1 illustrates the mapping procedure.
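
To make the mapping concrete, here is a minimal sketch in C of the lookup that Figure 1 illustrates, using a single-level page map table. The 4K page size matches the discussion above, but the entry layout and the names (pte_t, page_map, translate) are illustrative assumptions, not any particular MMU's format.

        #define PAGE_SHIFT  12                    /* 4K pages: 12 offset bits   */
        #define OFFSET_MASK 0xFFFUL

        typedef struct {
            unsigned long frame;                  /* physical page frame number */
            unsigned present   : 1;               /* page is in main memory     */
            unsigned writable  : 1;               /* protection bit             */
            unsigned cacheable : 1;               /* cachability bit            */
        } pte_t;

        extern pte_t page_map[];                  /* the page map table         */

        /* Translate a logical address: returns 0 and fills in *phys on
           success, -1 when the page is not present (a page fault).            */
        int translate(unsigned long logical, unsigned long *phys)
        {
            unsigned long page   = logical >> PAGE_SHIFT;   /* high-order bits */
            unsigned long offset = logical &  OFFSET_MASK;  /* low-order bits  */

            if (!page_map[page].present)
                return -1;                                  /* page fault      */
            *phys = (page_map[page].frame << PAGE_SHIFT) | offset;
            return 0;
        }

A real MMU performs this lookup in hardware on every memory reference; the sketch is only meant to show how the page number indexes the table while the offset passes through unmapped.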

The power of this mapping methodology makes the most compelling benefit of paged memory management possible. Because a complete memory image of the process is kept in secondary storage, with only the portion currently being executed kept in main memory, the process can run in a physical memory space much smaller than its total image. As the program refers to new sections (that is, new pages) of the image, they're brought into main storage. Hence the term "demand paged virtual memory". Newly arriving pages replace pages that are not currently being referenced. To support this scheme, an additional data structure is required in order to find logical pages in secondary storage. This structure, the file map table, parallels the page map table.

Often, of course, a process refers to a page that is not physically present in main memory. This situation is called a page fault, and is treated as an exception. It requires that the CPU abort the memory cycle (data or instruction fetch) and save its state. The system must then find the required page and bring it into memory, after which the CPU can restart the aborted access.

Support for DPVM address translation and access validation are the core requirements for the Memory Management Unit (MMU). The scheme depends on its ability to translate each memory access transparently to the executing process. This is not a trivial task; mapping a logical address space of 4 gigabytes requires over a million 4K pages. Even when a process requires only 4 Mbytes, the MMU must translate a thousand pages to support it. This leads to extremely large map tables: so large that many systems swap portions of the map tables themselves in and out of main memory.
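
To put rough numbers on that (assuming 4-byte translation entries, as in the 80386's tables): a flat table covering the full 4-gigabyte space needs 1,048,576 entries, or 4 Mbytes of table per process, while even the 4-Mbyte process needs about 4K of table for its own range alone. That is why the translation tables themselves become candidates for paging.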

As a result of this swapping requirement, most systems don't use simple linear translation tables as shown in Figure 2. Instead the tables are arranged in a tree structure, in which the leaves contain page tables (arrays of logical to physical translation data) and the intermediate nodes are hierarchical tables of pointers, or page directories, to the next lower level. The number of levels is a function of system requirements. More levels allow the memory manager to add new address translation information in smaller increments, but cost more memory cycles during a page map table search.

The arrangement used in the 80386 represents a compromise between pay-off and penalty. The page size is fixed at 4K with the remaining 20 bits of the logical page address split into two 10-bit indexes. This results in a 1,024-entry page directory consisting of 32-bit pointers to page tables, each of which contains 1,024 32-bit address translation entries. The beauty of this scheme is that both the page tables and page directory are 4K in size, simplifying the paging of these entries themselves.
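
As a rough illustration, the two-level walk can be sketched in C as follows. The directory is modeled as an array of pointers for readability; in the real chip both the directory and each page table are 4K arrays of 32-bit entries that pack a physical address together with present, dirty, and protection bits, and the walk is performed by the 80386 itself.

        #define DIR_SHIFT    22         /* bits 31..22: page directory index */
        #define TABLE_SHIFT  12         /* bits 21..12: page table index     */
        #define INDEX_MASK   0x3FFUL    /* 10 bits -> 1,024 entries          */
        #define OFFSET_MASK  0xFFFUL    /* 12 bits -> 4K page offset         */

        /* pgdir: 1,024 pointers to page tables; each page table holds 1,024
           page-aligned physical frame base addresses.                        */
        unsigned long walk_386(unsigned long *const pgdir[], unsigned long linear)
        {
            unsigned long dir_idx = (linear >> DIR_SHIFT) & INDEX_MASK;
            unsigned long tbl_idx = (linear >> TABLE_SHIFT) & INDEX_MASK;
            unsigned long offset  =  linear & OFFSET_MASK;

            const unsigned long *ptable = pgdir[dir_idx];   /* one page table  */
            return ptable[tbl_idx] | offset;                /* frame + offset  */
        }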

But What About Performance?

The elegance of DPVM extracts a price: Managing the memory can consume a significant portion of CPU time. This is reflected in the fact that the bulk of the papers written about this topic deal with various performance optimization issues. The key areas are address translation and page replacement.

Two strategies can enhance address translation performance. The first is to include registers in the MMU for caching recent address translations. These caches are referred to as translation lookaside buffers (TLBs) or address translation caches (ATCs). A well-designed TLB can eliminate table searches for well over 90 percent of memory references, and for this reason all the available MMUs include them. The architecture of TLBs falls into two broad categories, fully associative and set associative.

Fully associative TLBs offer the most flexibility. Here, any of the entries can hold a given address translation, since logical addresses are compared with all of the logical address tags in the TLB. The drawback is that this requires an address comparator for each entry and uses up area on the chip. In addition, the hardware required to keep track of replacement information is considerably more complex.

The goal of a set associative TLB design is to simplify the hardware. This is done by using a portion of the logical page address to select what is referred to as a line in the cache. Each line consists of one TLB entry for each set. The problem with this approach is that it effectively reduces the number of available TLB entries for mapping specific sets of address ranges.
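
A software model of the lookup makes the trade-off visible. In the hedged sketch below (a two-way design with invented sizes and field names, not any particular chip's TLB), the low bits of the page number select a line, and only the entries in that line are compared:

        #define WAYS      2
        #define LINES     32                      /* 64 entries in total       */
        #define LINE_MASK (LINES - 1)

        struct tlb_entry {
            unsigned long tag;                    /* logical page number       */
            unsigned long frame;                  /* physical frame number     */
            int           valid;
        };

        struct tlb_entry tlb[LINES][WAYS];

        /* Look up a logical page number; returns 1 on a hit. */
        int tlb_lookup(unsigned long page, unsigned long *frame)
        {
            struct tlb_entry *line = tlb[page & LINE_MASK]; /* low bits pick a line */
            int way;

            for (way = 0; way < WAYS; way++)
                if (line[way].valid && line[way].tag == page)
                {
                    *frame = line[way].frame;               /* TLB hit         */
                    return 1;
                }
            return 0;                                       /* TLB miss        */
        }

Two pages whose numbers share the same low bits must compete for the same line, which is the loss of flexibility described above.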

Another strategy to improve translation performance is providing hardware or microcode in the CPU to search the page map table for a new translation when it cannot be found in the cache. This situation is called a TLB miss. This hardware also takes care of determining which entry to replace when a new entry is to be loaded. Not only does it improve performance, but it also means the programmer doesn't have to write code to handle this critical function.

Improvements in page handling performance are similar to those for address translation. The key issue here involves swapping: deciding which page to remove from main memory when a new one must be loaded. The goal is to remove pages that will not be needed again soon. Because it is not possible to predict future program behavior, past referencing patterns are used to guess which page is least likely to be needed. The two most popular techniques are FIFO and LRU. These algorithms have been the focus of much research over the past 20 years. The most preferable (in terms of performance) is LRU, which is achieved at the cost of algorithmic complexity (see sidebar). Hardware support for paging comes in the form of intelligent I/O controllers that relieve the CPU of some of the work involved in swapping pages in and out of memory. No doubt this approach, common in mainframes, will eventually migrate to small systems as multitasking operating systems become more prevalent.

Thrashing is a term describing situations in which the CPU spends more time managing memory (paging, searching tables, and so on) than executing process code. In some cases the situation can get so bad that only one percent of instructions executed are those of processes. The simplest way to reduce the problem is to add more real memory. Unfortunately, that costs real money. The alternative is to add sophisticated performance analysis and tuning software to the VMM. For example, the VMM in a multitasking environment can continually monitor the page faulting behavior of executing tasks and dynamically adjust their page allotments to optimize performance. Tasks creating large numbers of page faults have their allocation of available pages increased.
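
One simple form of such tuning might look like the following sketch. The thresholds, structures, and names are invented for illustration; a production VMM would apply a more careful policy, such as working-set estimation.

        #define HIGH_FAULTS  50       /* faults per interval: task is starved  */
        #define LOW_FAULTS    5       /* faults per interval: task has slack   */

        struct task {
            unsigned faults;          /* page faults since the last sample     */
            unsigned allotment;       /* page frames this task may hold        */
        };

        /* Grow starved tasks and shrink idle ones; returns the updated count
           of unallocated ("spare") page frames.                               */
        unsigned tune_allotments(struct task tasks[], int ntasks, unsigned spare)
        {
            int i;

            for (i = 0; i < ntasks; i++)
            {
                if (tasks[i].faults > HIGH_FAULTS && spare > 0)
                {
                    tasks[i].allotment++;       /* give a starved task more     */
                    spare--;
                }
                else if (tasks[i].faults < LOW_FAULTS && tasks[i].allotment > 1)
                {
                    tasks[i].allotment--;       /* reclaim from an idle task    */
                    spare++;
                }
                tasks[i].faults = 0;            /* start the next interval      */
            }
            return spare;
        }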


And Then There's Real Time

The key problem with DPVM in real-time systems is the performance degradation and indeterminacy resulting from page faulting and, to a lesser extent, TLB misses. There are two approaches to this problem.

One is to lock the pages containing time-critical routines into main memory, and their translation table entries into the TLB. The only performance penalty is the address translation overhead, which is usually small in single-chip CPU/MMU implementations. It's not difficult to lock pages in memory, because page swapping is handled by system software. All that is required is an additional bit in the page map tables' address translation entries to indicate whether the page is swappable. In addition, the page replacement algorithm must eliminate such pages from the pool of available swapping candidates.
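
As a sketch of that last point, suppose each page frame's bookkeeping gains a lock flag (the structure, names, and timestamp-based victim choice below are illustrative, not any particular VMM's code). The replacement scan then simply passes over locked frames:

        struct frame {
            int locked;               /* set for pages wired down for real time */
            unsigned long last_used;  /* reference timestamp kept by the VMM    */
        };

        /* Choose a frame to replace, never a locked one.  Returns -1 if every
           frame is locked, in which case the lock request itself should have
           been refused.                                                        */
        int pick_victim(struct frame frames[], int nframes)
        {
            int i, victim = -1;

            for (i = 0; i < nframes; i++)
            {
                if (frames[i].locked)
                    continue;                          /* never swap these out  */
                if (victim < 0 || frames[i].last_used < frames[victim].last_used)
                    victim = i;                        /* oldest unlocked frame */
            }
            return victim;
        }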

Locking TLB entries is another matter, because TLB reloading is usually handled by the CPU's microcode. In such cases, the CPU must directly furnish the ability to lock TLB entries. The Motorola 68851, the companion MMU for the 68020, includes such support.

Perhaps the best approach for enhancing real-time performance is to set aside regions of the virtual address space that are not mapped by the MMU at all. This completely eliminates address translation overhead and results in portions of the memory map where virtual and real addresses are the same. This is the approach taken by the designers of the 68030's MMU.

Virtual Address Spaces

An important decision in the design of a virtual memory manager is how many address spaces to support and how to arrange processes within them. This can go as far as running multiple operating systems concurrently, each in its own virtual machine. The most visible example of this type of implementation is Microsoft Windows/386, which uses the paging and protection capabilities of the 80386 to create multiple 640K DOS environments.

Other applications include high-reliability, fault-tolerant systems where it is essential that failing tasks not destroy data belonging to others. Putting each task in its own address space guarantees this.

Multiple address spaces can be created entirely via software in the virtual memory manager and/or with aid of hardware support in the MMU. The software approach involves setting up separate mapping tables for each address space and changing the pointer to the table area when activating a process that resides in a different address space.
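
On the 80386, for example, that pointer is the CR3 register, which holds the physical base address of the current page directory. A minimal sketch of the dispatch-time switch follows; the structures are invented, and set_map_base() stands in for whatever privileged operation the CPU provides (a move to CR3 on the 80386, which also flushes the TLB).

        struct address_space {
            unsigned long map_base;     /* physical address of this space's
                                           translation tables                  */
        };

        struct process {
            struct address_space *space;
            /* registers, stack pointer, accounting, and so on                 */
        };

        /* Stand-in for the privileged operation that points the MMU at a new
           set of translation tables.                                          */
        extern void set_map_base(unsigned long map_base);

        void activate(struct process *current, struct process *next)
        {
            if (next->space != current->space)
                set_map_base(next->space->map_base);  /* switch address spaces */
        }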

Most MMUs support multiple address spaces in hardware with pseudo address bits: status codes that may be combined with the regular address bits during address translation. Unfortunately, the pseudo address bits can confuse the VMM if it wants to move data between spaces. This can be handled by mapping addresses in both spaces to a common page in memory, but that circumvents the protection that was the goal of separate spaces in the first place. Consequently, most CPUs that use pseudo address bits also provide special privileged instructions that allow the VMM to read and write physical addresses.

Sizing Up The Players

There are currently three widely available 32-bit CPUs with on-chip MMUs: the Motorola 68030, the Intel 80386, and the AMD Am29000. The latter is one of the new RISC processors. The MMU architecture of each reflects its target market and historical background. Table 1 gives a comparison of their features.

Table 1: MMU architectures compared

  CPU     TLB Size/Type            TLB Reload   Page Sizes          Translation    Real-Time
                                                Supported           Tree Levels    Support
                                                                    Supported
  --------------------------------------------------------------------------------------------
  68030   22-entry,                Microcode    256, 512, 1K, 2K,   1 to 5         Transparent
          fully associative                     4K, 8K, 16K, 32K                   mapping

  80386   32-entry,                Microcode    4K                  1              None
          4-way set associative

  29000   64-entry,                Software     1K, 2K, 4K, 8K      Any number     None
          2-way set associative

The 68030 has the most sophisticated MMU of the three. Based on the 68851 MMU used with the 68020, it emphasizes flexibility and maximum hardware support. The reason for this approach can be traced to the 68000's success in the technical workstation market, where each of Motorola's customers had different system requirements: Silicon Graphics with a high-performance graphics environment, MassComp with a real-time Unix implementation, and Sun and Apollo with general-purpose workstations.

Flexibility is apparent in the wide range of page sizes as well as the ability to select from 1- to 5-level translation trees. The three-bit function code generated by the 68020 and 68030 supports up to eight separate 32-bit logical address spaces. The TLB is a fully associative 22 entry design with control fields for protection and memory caching suppression.

Real-time support is provided in the form of transparent translation (that is, no translation) for up to two regions. The regions can be as small as 16 Mbytes, with each allocated to a specific address space. The main drawback to this scheme is that it bypasses normal page protection methods, which are applied to the region as a whole.

Compared to the 68030, the 80386's paged memory management capabilities are minimal. The 80386 is the first member of the 80X86 family with any support for paging. Intel also retained the segmented memory architecture of the earlier chips in the design, and made it possible to use both schemes simultaneously. These backward compatibility requirements, combined with the need to keep the chip size within producible limits, dictated the simplicity of the 80386's page management support. Page size is fixed at 4K and only two-level translation tables are supported. Despite these limitations, the 80386's ability to support a virtual machine environment for DOS applications makes it a strong contender for the high-end personal computer market.

The AMD 29000 is the first commercial 32-bit RISC microprocessor with an on-chip MMU. In keeping with the bare-bones, high-performance goals of RISC, the 29000's MMU provides the minimum hardware needed to support DPVM. Systems software for the 29000 must therefore perform some functions that are done by hardware in other architectures: TLB reloads are an example. While this adds complexity to the job of writing a VMM, it does yield the ultimate in system flexibility. That's a major benefit in the embedded real-time controller market for which the chip is targeted.

The 29000 MMU supports up to 256 32-bit address spaces through an 8-bit process identifier (PID). Unlike the 68000 family's function codes, which the processor automatically generates, the PID is loaded into a register in the MMU by system software. This gives the system designer more control over the allocation of address spaces. In a multi-OS virtual machine, the PIDs separate the address spaces of each guest OS. In high-reliability, multitasking environments, each task can be allocated its own address space using the PIDs. Figure 3 shows the structure of the 29000's MMU.

The trend is clearly toward CPUs with on-chip MMUs, such as the 68030 and 80386, for both mid-range and high-end desktop systems. The incorporation of the MMU means that support for DPVM is a given in systems using these CPUs. The major remaining requirement for DPVM is a reasonably fast fixed disk, and these are already widely available at low cost. As a result you will begin to see more and more systems featuring the multitasking and unlimited address spaces that DPVM makes possible on small computers.

References

Advanced Micro Devices Corporation. Am29000 32-Bit Streamlined Instruction Processor User's Manual. Sunnyvale, Calif.: Advanced Micro Devices, 1988.

Bate, S.F.; and Kenah, L.J. VMS Internals and Data Structures. Burlington, Mass.: Digital Press, 1984.

Donovan, J.J.; and Madnick, S.E. Operating Systems. New York: McGraw-Hill, 1974.

Intel Corporation. 80386 Programmer's Reference Manual. Santa Clara, Calif.: Intel Corporation, 1988.

Milenkovic, M. Operating Systems, Concepts and Design. New York: McGraw-Hill, 1987.

Motorola Corporation. MC68851 Paged Memory Management Unit User's Manual. Englewood Cliffs, NJ: Prentice-Hall, 1987.

Motorola Corporation. MC68030 Enhanced 32-Bit Microprocessor User's Manual, Second Edition. Englewood Cliffs, NJ: Prentice-Hall, 1989.

Implementing the LRU Algorithm

Tom Gettys

Tom Gettys is a principal software engineer for GDP Technologies and can be reached at 700 Snowberry, Lafayette, CO 80026.

The design of a Demand Paged Virtual Memory system revolves around two key decisions: the buffer write policy and the page replacement algorithm.

The two main write policies are write through and copy-back. Both involve keeping two copies of each page image, one in main memory, the other in secondary storage such as a disk. The write-through policy updates the copy in secondary storage every time its image in main memory is altered. This is the safest, since both always contain the same data. Its effectiveness is significantly degraded, however, if the number of read operations does not greatly exceed the number of writes.

The copy-back strategy updates secondary storage only when a buffer in main memory is to be overwritten. The algorithm knows whether to do this by maintaining a TRUE/FALSE value known as a "dirty bit" for each page buffer. The dirty bit is set to FALSE (0) when the page is first loaded into memory. The bit is then changed to TRUE whenever a write occurs in the page's address space. If this bit is still FALSE when it's time to overwrite the page buffer, the image is unchanged so there's no need to copy it back to secondary storage.

A page fault occurs when a process refers to a page that is not currently in memory. If there are no free page buffers in main memory, a page replacement algorithm must determine which buffer to overwrite. Clearly, it's most efficient to use a method requiring the fewest accesses to secondary storage. The optimum strategy is to overlay the page that will not be used again for the longest time. Because this requires predicting the future, though, it's hard to implement, so we must find another algorithm that yields something close to the optimum. The most popular of such methods is called "least recently used," or LRU.

LRU replaces the page that has not been accessed for the longest period of time. It performs remarkably well due to the principle of locality: Processes tend to access storage in nonuniform, highly-localized patterns clustered together both in time and space. Consequently, a page that has not been referenced recently is unlikely to be referenced again in the near future.

With the proper choice of a data structure, the LRU algorithm is straightforward and efficient. It requires that the buffers be ordered according to their most recent access in time. This leads us directly to the ubiquitous FIFO (first-in, first-out) queue. When a page is referenced it is placed at the back of the queue. The page at the head of the queue is thus the least recently used, and its buffer is the one the algorithm can overlay. The size of this queue corresponds to the fixed number of existing page buffers, so -- because the queue doesn't shrink and grow -- it's not necessary to incur the overhead of dynamic allocation to manage it. Instead we can use an array.

When an existing page is accessed, the node moves to the back of the queue (and its dirty bit is set if the access is a write). Similarly, a page fault grabs the head of the queue, loads the new page into its buffer, and moves the node to the back. From the standpoint of queue management, then, these operations are identical. A doubly linked list provides the necessary flexibility and, by designating one of the nodes as the list head, there is no distinction between the two cases.

Listing One shows the structure of a queue node and the declaration of the queue as an array of these nodes. The elements pred and succ are the backward and forward pointers, respectively. Note that they are just array indices and not pointer types. This is due to the non-dynamic nature of the queue. The succ of the list head is the oldest (LRU) page, and the pred of the list head is the youngest (MRU) page. The other two elements are the dirty bit (actually a byte here, because there are no other bit fields with which it can share memory) and the identifier of the page associated with the node. In this example the buffer number associated with each queue node is equal to its array index, so no additional data is required to provide page-to-buffer mapping.

Listing Two is the queue initialization routine, which executes once to set up the forward and backward links and clear the dirty bit. It also sets the page ID to an invalid value, thus pre-loading the queue with phantom pages. As a new page request is received it will not match any of the current entries, and because the dirty bit is not set, the memory manager bypasses the copy-back step. The buffer associated with the head of the queue will hold the new page, and its node moves to the back of the queue. This small bit of software sleight of hand avoids special case checking that would otherwise be necessary with every page request.

Listing Three manipulates the pointers to delete the specified node and relink it to the back of the queue. The existence of the list head node allows this to work without case checking. This is not obvious; it is a valuable exercise to prove it by assuming instead a separate variable that points to the head of the queue and developing the equivalent function. You will find that the code becomes more complex.

The rest of the code shows what else is needed to put a complete Memory Management System in place. The details of the MMS front end will vary since the definition of a page depends on the application, and the exact nature of the interface between the MMS and the application will, of course, vary also. --T.G.


[LISTING ONE]


        struct queue_node
        {
           WORD succ;            /* index of next younger node        */
           WORD pred;            /* index of next older node          */
           WORD page_id;         /* ID of page associated w/this node */
           BYTE dirty_bit;       /* TRUE if page has been written to  */
        } lru_fifo[BFRS + 1];





[LISTING TWO]

        void init_fifo(void)
        {
           WORD i;

           for (i = 0; i <= BFRS; i++)
           {
              lru_fifo[i].succ = i + 1;
              lru_fifo[i].pred = i - 1;
              lru_fifo[i].page_id = BAD_PAGE;
              lru_fifo[i].dirty_bit = FALSE;
           }
           lru_fifo[0].pred = BFRS;
           lru_fifo[BFRS].succ = 0;
        }





[LISTING THREE]

        void requeue(WORD node)
        {
           lru_fifo[lru_fifo[node].pred].succ = lru_fifo[node].succ;
           lru_fifo[lru_fifo[node].succ].pred = lru_fifo[node].pred;

           lru_fifo[lru_fifo[LIST_HEAD].pred].succ = node;
           lru_fifo[node].pred = lru_fifo[LIST_HEAD].pred;

           lru_fifo[LIST_HEAD].pred = node;
           lru_fifo[node].succ = LIST_HEAD;
        }
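
As a rough illustration of the front end described above, the following sketch services a page request using Listings One and Three. None of it is part of the published listings: it assumes LIST_HEAD is node 0 (so the page buffers occupy nodes 1 through BFRS), it invents example values for the supporting constants, and page_in() and page_out() are hypothetical routines standing for the application's transfers between main memory and secondary storage.

        /* Assumed supporting definitions (not part of the published listings) */
        typedef unsigned short WORD;
        typedef unsigned char  BYTE;
        #define BFRS       64            /* number of page buffers (example)     */
        #define LIST_HEAD  0             /* node 0 serves as the dummy list head */
        #define BAD_PAGE   0xFFFF        /* a page ID that can never occur       */
        #define TRUE       1
        #define FALSE      0

        /* Hypothetical transfer routines supplied by the application. */
        extern void page_out(WORD page_id, WORD buffer);  /* copy buffer back    */
        extern void page_in(WORD page_id, WORD buffer);   /* load page to buffer */

        /* Return the buffer holding page_id, faulting it in if necessary.
           "writing" is TRUE when the caller intends to modify the page.         */
        WORD get_page(WORD page_id, BYTE writing)
        {
           WORD node;

           /* Linear search of the queue; a hash table would be faster. */
           for (node = 1; node <= BFRS; node++)
              if (lru_fifo[node].page_id == page_id)
                 break;

           if (node > BFRS)                            /* page fault            */
           {
              node = lru_fifo[LIST_HEAD].succ;         /* least recently used   */
              if (lru_fifo[node].dirty_bit)            /* copy-back if dirty    */
                 page_out(lru_fifo[node].page_id, node);
              page_in(page_id, node);
              lru_fifo[node].page_id = page_id;
              lru_fifo[node].dirty_bit = FALSE;
           }
           if (writing)
              lru_fifo[node].dirty_bit = TRUE;
           requeue(node);                              /* now most recently used */
           return node;                                /* buffer number == node  */
        }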












Copyright © 1989, Dr. Dobb's Journal

