High-Performance Programming for the PowerPC

While moving Windows NT applications from an 80x86- to a PowerPC-based platform is straightforward, performance may be sluggish after the port. Our authors examine how byte alignment, inlining, and exception handling affect performance, then show you how to avoid the pitfalls.


October 01, 1995
URL:http://www.drdobbs.com/parallel/high-performance-programming-for-the-pow/184409729

Figure 1


Copyright © 1995, Dr. Dobb's Journal

Figure 1


Copyright © 1995, Dr. Dobb's Journal

SP95: High-Performance Programming for the PowerPC

High-Performance Programming for the PowerPC

Avoid performance pitfalls when coding for Windows NT

Kip McClanahan, Mike Phillip, and Mark VandenBrink

Kip works in Motorola's RISC software group writing low-level PowerPC software and is the author of PowerPC Programming for Intel Programmers (IDG Books, 1995). He can be contacted at [email protected]. Mark has been hacking operating-system kernels for over ten years and is currently on the team responsible for the PowerPC port of Windows NT. He can be contacted at [email protected]. Mike is manager of the compiler and tools development group at Motorola in Austin, Texas. He can be contacted at [email protected].


Squeezing the best performance out of a processor requires both insight and experience. When it comes to the PowerPC microprocessor family, however, programmers are just starting to understand the architecture and each processor's implementation. The PowerPC architecture specification defines both required and optional features for any processor implementation. Each implementation of the PowerPC-architecture specification--the 601, 603, 604, and 620--may have a very different set of features. For example, cache size and type, bus width, power-management capabilities, and number of execution units can vary between each part. However, compliance with the PowerPC architecture specification ensures binary compatibility across each processor implementation.

In this article, we'll examine methods for measuring performance and techniques for improving the speed and efficiency of applications running under Windows NT for PowerPC. In doing so, we'll point out some of the pitfalls associated with Little-endian operating systems (such as Windows NT) and present an application that demonstrates the effect of byte alignment on performance. We'll also look at some optimization techniques that apply more generally to the PowerPC architecture.

Measuring Performance

Perhaps the most difficult step in improving performance is simply getting started. There are several ways to analyze performance for a particular application, but it's almost always necessary to narrow the scope of the analysis to factors that can be controlled by the programmer. Performance is typically affected by:

System hardware issues include the size and speed of memory and disk subsystems, the type of video cards and the amount of dedicated video memory, and the type and speed of the microprocessor itself. For most developers, it is important to characterize the impact of the system hardware and software on performance, but most opportunities to improve performance lie in how the application itself it was built. System-software issues are too numerous to list, but tend to center around the operating-system and networking configuration of the computer.

Most applications are too large to examine in their entirety, but performance-analysis tools such as profilers can often identify those regions of code in which most time is spent. If profiling tools are not available, you can gain insight into any possible performance bottlenecks by thoroughly inspecting the code and answering the following questions:

Improving Performance

Regardless of profiling information, a good place to start improving performance is to turn on compiler optimizations. Most modern compilers offer sophisticated code optimization that can be invoked by the user. Although such options will likely slow down compilation, the resulting application should execute much faster, often with speedups of 200 to 300 percent. Compiler optimization controls are akin to stereo-system controls, however: Just as increasing the volume to its maximum level may not be optimal for each piece of music, simply setting the default optimization flag is unlikely to yield optimal performance for every software application. Many compilers offer specific optimization flags for fine-tuning application performance. While such optimizations may not apply to enough applications to warrant inclusion in the default optimization settings, they can greatly improve the performance of a particular application.

One such option is automatic inlining of subroutines. Although not much of an optimization by itself, subroutine inlining exposes more code to the optimizer in the context in which it will be used. Of course, blindly inlining code is unwise, because replicating the subroutine body increases code size, which can actually decrease performance through loss of code locality in the memory system. However, when used in conjunction with profiling feedback, inlining small, heavily called routines often improves performance significantly for the overall application.

While selectively inlining C or C++ subroutines may improve performance, inline assembly code should be embedded with caution. When assembly code is inlined into a high-level program, the compiler must typically make conservative assumptions about register and memory usage, which can throttle many potential optimizations. But inlining critical PowerPC assembly instructions (such as synchronization primitives or status-register accesses) can improve performance. Some compilers, including those developed by Motorola, provide a set of built-in intrinsic functions that provide efficient access to low-level system instructions without incurring the performance penalty associated with inlining seemingly "random" assembly instructions.

Language and Design Considerations

The most significant factor affecting software performance is the design and implementation of the application itself. While algorithm design is application specific and clearly beyond the scope of this discussion, several general design considerations affect performance for virtually any PowerPC software application.

Stick to the standards. Languages such as C and C++ have well-defined standards intended to ensure the portability of source code among different development environments. Many providers of development tools offer nonstandard language extensions that differentiate their products. While often alluring to the developer, these extensions can easily tie an application to a particular tool set, and, to a lesser extent, a particular target architecture. Many of these features can significantly degrade performance on RISC microprocessors like a PowerPC.

Watch out for misalignment. Examples of such language extensions are the __unaligned keyword and #pragma pack, both of which can affect data alignment. For Little-endian implementations of an operating system such as Windows NT, misaligned accesses can lead to significant performance losses on the PowerPC architecture. Many modern microprocessors, including most PowerPC implementations, are optimized to handle properly aligned memory references, often at the expense of handling relatively infrequent misaligned references. Compilers typically will align data on its "natural" alignment boundary, where the address of the object is an exact multiple of the size of the object in bytes. However, certain programming practices, including the imprudent use of some language extensions, can create a situation where the compiler must access memory in chunks smaller than the natural size of an object. For example, if you use the __unaligned keyword in Microsoft C/C++ to inform a compiler that an object is misaligned, a PowerPC compiler will typically be required to load the corresponding object from memory one byte at a time. For a 32-bit integer object, this requires four memory accesses instead of one, plus three additional rotate instructions.

A similar situation can arise when the compiler is instructed to "pack" the elements of a structure via source-code pragmas. On older architectures, including the Intel 80x86 family, such language assertions do not necessarily affect performance. However, given the relatively high cost of servicing an alignment exception, PowerPC compilers will typically opt to generate conservative, albeit slower, code for known, potentially misaligned accesses. Removing the __unaligned keyword or structure-packing pragmas does not necessarily solve the problem. In fact, it may lead to incorrect code or even worse performance. To avoid these performance pitfalls, it's best to avoid such extensions when designing an application. The alignment example shown in Listing One demonstrates the performance penalties associated with the three common techniques for misalignment resolution.

Alignment exceptions can also be created through mismanagement of pointers in C and C++ programs, or by accessing data through a reference that is not of the same natural alignment as the original object; for example, accessing a series of characters or half-word objects through integer variables. For Windows NT applications, these misalignment exceptions are often hidden from the user through programmer instructions to the operating system. As the alignment example in this article demonstrates, it is clearly preferable to enable exceptions as a means of locating performance losses than to "hide" them from the user.

Structured Exception-Handling Issues

Structured exception handling can also adversely affect performance. While they are an elegant and maintainable means of managing the interaction between the application and the operating system when errors or unexpected events occur, exception handlers can also restrict the compiler's ability to safely optimize code. This is true even for code not directly included in the exception handler itself. Thus, exception handlers should be carefully designed to minimize the performance impact, as follows:

Target-Specific Issues

Target-specific factors can often be utilized to maximize performance. Although fine-tuning for one processor or architecture at the expense of others should be avoided, a few PowerPC-specific issues should be considered to maximize PowerPC performance.

Memory Subsystem

The frequency and speed of memory accesses are almost always key factors in overall application performance. Compared to most PowerPC-processor operations, off-chip memory references are extremely expensive (particularly if the accesses are misaligned). You can often maximize the utilization of on-chip and secondary off-chip caches by carefully managing large data structures such as arrays. Since all PowerPC chip implementations utilize associative cache designs, a slight change in the "stride" of array accesses can significantly affect the effectiveness of a cache. A simple guideline is to avoid array sizes that are an exact multiple of the cache size (typically powers of 2 in the range of 8-64 KB). However, predicting cache behavior through such simple guidelines is precarious, at best, since actual dynamic reference patterns vary between applications. The guidelines' intent is to avoid allocating heavily referenced variables to the same set of cache addresses. Profiling tools can often provide feedback concerning a program's cache-utilization efficiency.

A Real-World Alignment Example

To make the following alignment example more meaningful, we'll tie it to an operating system. Because Windows NT for the PowerPC is a Little-endian operating system, it is subject to the alignment restrictions described earlier. Remember, a multibyte access performed at an address not aligned with the size of the access (known as "natural alignment") will cause an alignment exception. Table 1 shows the natural-alignment boundaries for memory accesses of various sizes.

When a Little-endian PowerPC application performs a memory access that is not aligned on its natural boundary, an alignment exception will result. When alignment issues exist on PowerPC-based systems, they can be resolved in three ways:

Trap and Terminate

Under Windows NT, it is possible (and the default for PowerPC Windows NT) not to have any support for misaligned data. When there is no support for misalignment and your application accesses data on an unnatural boundary, an alignment fault is generated. The Windows NT alignment fault handler is configured not to fix misaligned accesses and will display a pop-up message and terminate your application. This seems like the one situation to avoid, but in fact the operating system is doing you a favor.

The ability to trap an alignment exception and terminate the faulting application is valuable when porting code, particularly when porting Windows NT applications from 80x86 to PowerPC architectures. And the ability to detect alignment exceptions is fundamental to handling misalignment efficiently.

Operating System Fix-ups

The Windows NT kernel can be configured to perform misaligned data fix-ups upon detection of an alignment exception. This means that the alignment-exception handler must break the multibyte memory access that caused the exception into individual byte accesses, which are not constrained by alignment issues. Although this sounds like a terrific service, it is the most inefficient solution. Even if the rest of your application is well constructed, a few OS-based alignment fix-ups can bring performance to its knees. As Table 2 shows, for 5 million misaligned memory references, the difference between OS handling and programmatic handling is nearly 30 seconds! Put another way, having the OS fix-up misaligned memory accesses is 43 times slower than the same number of aligned accesses.

If this solution is so slow, why is it around? Of course, you wouldn't want your released and shipping applications to terminate the first time they have an unexpected alignment fault--compatibility with 80x86 applications would be reduced significantly. OS-based fix-ups are a reasonable first-pass solution for PowerPC applications that have not specifically taken data alignment into consideration. Just as the number of 32-bit applications is slowly increasing in the 80x86 world, so will the number of PowerPC applications that make data alignment a priority.

Processes can enable OS-based alignment fix-ups using the call SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT). A child process inherits its parent's error mode, so any processes created by your application after enabling this mode will also suffer the performance penalty for misaligned data. Under Windows NT for PowerPC and MIPS processors, OS-based misalignment support is disabled by default. The Alpha version of NT enables OS-based misalignment support by default; Alpha NT applications must turn this feature off if programmatic solutions are used. Setting the SEM_NOALIGNMENTFAULTEXCEPT flag has no effect on x86 processors.

Programmatic Solutions

The first programmatic solution to data misalignment is the __unaligned pointer type qualifier. When the compiler sees a pointer reference (this qualifier works only with pointers) declared using the __unaligned type qualifier, it includes sufficient code to ensure that the memory access does not generate an alignment exception. In particular, it breaks multiple byte accesses into individual byte accesses. A single-alignment constrained, 32-bit load or store instruction is replaced by seven instructions that perform the same operation using only byte accesses. Similarly, a single 16-bit memory reference would be replaced by three instructions. Listing Two shows a single, 32-bit, PowerPC store-word instruction. If the store-word instructions in Listings One and Two were performed using an __unaligned pointer qualifier reference, the instruction would be converted to the seven instructions shown in Listing Three.

Listing Three represents compiler-generated code and, taken out of the context of the original flow of code, may appear suboptimal. Figure 1 depicts the operation of the seven instructions of Listing Three. The first instruction stores the low-order byte (0x78) of r3 into the address contained in r4. The rlwinm instruction is used to rotate the bytes within r3 such that each subsequent store-byte instruction references the proper value. The value contained in r3 is in Big-endian format, and r4 points to Little-endian memory. Therefore, the bytes must be swapped into Little-endian format during the store operation.

Another programmatic solution, the #pragma pack() directive, is particularly appropriate for porting between the various Windows NT platforms. One potentially recurring problem results from data structures and formats that were never designed from a portability perspective. In particular, graphics formats such as BMP and DIB do not address natural-boundary alignment. This is understandable: When these file formats were created for 80x86 software (such as Microsoft Windows), alignment issues were not a big concern--alignment was nice, but misaligned accesses didn't kill performance. With the advent of RISC processors, fixed-length instruction size, and the associated alignment restrictions, data misalignment has become an important issue.

The #pragma pack() directive tells the compiler two things. First, pack structure elements as close together as possible. In particular, avoid using the standard (aligned) structure padding. Second, the compiler knows to generate additional code to support misaligned accesses for elements within the "packed" structure, much like the effect of the __unaligned qualifier. In Listing One, a standard BMP header is packed, and the misaligned access is performed using the 32-bit BMPDataOffset element. This addresses application portability concerns because it allows the programmer to guarantee that the offsets within a native PowerPC structure will exactly match those defined in a native 80x86 structure.

Finally, there may be well-defined times when you know that you're about to perform a misaligned access. Instead of permanently packing a structure or declaring an __unaligned pointer variable to reference the memory, you can use the macros shown in Figure 2 to break the particular access into its byte components. To use the macros, simply bracket each misaligned memory reference with the appropriate macro.

The (Mis)alignment Demonstration Program

Three categories of misalignment resolution are demonstrated in the alignment program (ALIGN.C) shown in Listing One. This program lets you compare the performance impact of misaligned data by timing both aligned and misaligned accesses.

When timing a fixed number of operations in a preemptive, multitasking operating system, such as Windows NT, it is important to minimize noise in your timing measurements. To do so, elevate your thread to the highest priority possible and take multiple data samples. Listing One sets the thread-timing priority to THREAD_PRIORITY_TIME_CRITICAL. This increases the accuracy of the misalignment timing by reducing the number of interrupts that can influence the overall time required to complete the set of operations. In fact, when running ALIGN, the usefulness of your mouse will decrease dramatically.

To obtain the timing values, the GetTimeStamp() routine simply uses QueryPerformanceFrequency() and QueryPerformanceCounter() to sample the Win32 high-frequency counter. The time reported by ALIGN is derived by taking the difference between the time stamp before and after each set of memory-access operations.

ALIGN requires two parameters: an iteration count and one of the parameters in Figure 3. For example, the staggering 28-second measurement was generated using ALIGN 5000000 -2, which performed five million misaligned accesses using the OS to handle the alignment fix-ups. Listing One was used to generate the timing values shown in Table 2, which clearly demonstrates the cost of data misalignment.

Conclusion

While the increased availability of performance analysis tools and continual advancement of compiler optimization techniques will accelerate the tuning process, many key performance issues will remain embedded within the design and implementation of an application. An increased awareness among PowerPC developers of the interactions between the operating system and the microprocessor is critical to avoiding performance losses due to misaligned memory references, poor utilization of structured exception handling, and inefficient application of compiler optimizations.

For Little-endian-system implementations such as Windows NT, the means by which alignment issues are resolved can dominate application performance, as demonstrated in the ALIGN example of Listing One. Most importantly, many of the concepts in this article affect performance not only for PowerPC systems, but for other architectures, as well.

Acknowledgments

Thanks to Ray Essick and the Motorola RISC Software compiler team for a wonderfully mature PowerPC compiler.

Bibliography

McClanahan, K. PowerPC Programming for Intel Programmers. San Mateo, CA: IDG Books, 1995.

Win32 Programmer's Reference, MSDN CD-ROM, July 1995.

Table 1: Natural alignment boundaries.


Alignment Size            Form of 32-bit 
                          Address 
8-bit                     xxxx xxxx
16-bit integers           xxxx xxx0
32-bit integers and
  single-precision FP     xxxx xx00
64-bit integers and
  double-precision FP     xxxx x000
  
Table 2: Timing values generated by Listing One. Sample tests performed on a 100-MHz 604 running Windows NT 3.51 (build 1057), compiled with Motorola's NT compiler and averaged over six runs of the program.

Number of     Naturally        OS Fix-ups      __unaligned   #pragma Pack(1) for 
Iterations    Aligned Access                    Type Qualifier       Sample BMP Structure

500,000         61 ms            2858 ms          86 ms                82 ms
1,000,000       121 ms           5720 ms          171 ms               166 ms
5,000,000       657 ms           28662 ms         852 ms               807 ms

Figure 1: Multi-byte store into Little-endian memory.

Figure 2: Macros to break an access into bytes.

#define rULONG(x)  (ULONG)(                     \
            *(UCHAR *)(&x) |                   \
            (*((UCHAR *)(&x)+1)  8) |         \
            (*((UCHAR *)(&x)+2)  16) |        \
            (*((UCHAR *)(&x)+3)  24) )
#define rUSHORT(x)  (USHORT)(                   \
            *(UCHAR *)(&x) |                   \
            (*((UCHAR *)(&x)+1)  8))
            
Figure 3: Options for the second parameter to the ALIGN program.
-0 Use ONLY aligned accesses.
-1 NO alignment fix ups (causes an exception).
-2 Use OS-based fix ups for misaligned accesses.
-3 Use __UNALIGNED type qualifier.
-4 Use #PRAGMA PACK(1) directive.

Listing One

/*---------------------------------------------------------------------------+
| Windows NT for PowerPC Alignment Demonstration Program                     |
|                                                                            |
| Mark VandenBrink, [email protected]                                   |
| Kip McClanahan,   [email protected]  -or- [email protected]         |
| Mike Phillip,     [email protected]                                 |
|                                                                            |
|                                                                            |
+---------------------------------------------------------------------------*/
#include <stdlib.h>
#include <stdio.h>
#include <stdarg.h>
#include <windows.h>
#include <winioctl.h>
#include <string.h>
#include <ctype.h>
#include <memory.h>
// force compiler fix-ups for data accesses within the
// BMP structure by using #pragma pack(1).
#pragma pack(1)
// Standard Windows3.x BMP file header format
//
typedef struct BMPHeader {
        USHORT  FileType;               // offset 0
        ULONG   FileSize;               // offset 2
        USHORT  reserved1;              // offset 6
        USHORT  reserved2;              // offset 8
        ULONG   BMPDataOffset;          // offset 10
};
struct BMPHeader bmpBuffer;             // declare structure
//
// Print an error message to to the screen and exit.
//
static VOID
Die(char *format, ...)
{
    va_list va;
    va_start(va, format);
    fprintf(stderr, "\n\nALIGN: ");
    vfprintf(stderr, format, va);
    ExitProcess(2);
}
//
// Return a timestamp from the high frequency performance counters (if
// one exists).  Return the stamp in units of number of milliseconds
//
static UINT
GetTimeStamp(VOID)
{
    static DWORD FreqInMs = 0;
    LARGE_INTEGER Time;
    if (!FreqInMs) {
        if (QueryPerformanceFrequency(&Time) == TRUE) {
            if (Time.HighPart) {
                Die("Timer has too high a resolution\n");
            }
            //
            // 100-nanosecond units
            //
            FreqInMs = Time.LowPart / 1000;
        } else {
            Die("Could not get frequency of perfomance counter\n");
        }
    }
    if (QueryPerformanceCounter(&Time) == FALSE) {
        Die("System does not support high-resolution timer\n");
    }
    return Time.LowPart / FreqInMs;
}
//
// Essentially useless function that returns a value to place at 
// IntPointer.  Function used to prevent compiler from optimizing 
// away references to *IntPointer inside a loop.
//
DWORD
GetNextValue(VOID)
{
    static DWORD NextValue = 0;
    
    return NextValue++;
}
main(int argc, char **argv)
{
    CHAR Buffer[1024];
    UINT EndTime;
    UINT Max = 0;
    UINT StartTime;
    UINT i;
    struct BMPHeader *headerPtr;
    int *IntPointer2;
    __unaligned int *IntPointer;
    switch (argc) {
        case 3:
            //
            // Note: setting thread's priority to THREAD_PRIORITY_TIME_CRITICAL
            // can effectively bring a machine to its knees, depending on the 
            // process priority class. 
            //
            SetThreadPriority(GetCurrentThread(), 

                              THREAD_PRIORITY_TIME_CRITICAL
                              );
    
            Max = strtoul(argv[1], 0, 0);
            //
            // The naturally aligned case
            //
            if (argv[2][0] == '-' && argv[2][1] == '0') {
                printf("ONLY aligned references\n");
                IntPointer2 = (int *)(&Buffer[4]);
                printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
                StartTime = GetTimeStamp();
                for (i = 0; i < Max; i++) {
                    *IntPointer2 = GetNextValue();
                }
                EndTime = GetTimeStamp();
                break;
            }
            //
            // The no fix-ups, alignment exception causing case
            //
            if (argv[2][0] == '-' && argv[2][1] == '1') {
                printf("NO support for misaligned references\n");
                IntPointer2 = (int *)(&Buffer[3]);
                printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
                StartTime = GetTimeStamp();
                for (i = 0; i < Max; i++) {
                    *IntPointer2 = GetNextValue();
                }
                EndTime = GetTimeStamp();
                break;
            }
            //
            // OS-based fix-ups
            //
            if (argv[2][0] == '-' && argv[2][1] == '2') {
                printf("OS support of misaligned references.\n");
                SetErrorMode(SEM_NOALIGNMENTFAULTEXCEPT);
                IntPointer2 = (int *)(&Buffer[3]);
                printf("Buffer at %x, IntPointer = %x\n", Buffer, IntPointer2);
                StartTime = GetTimeStamp();
                for (i = 0; i < Max; i++) {
                    *IntPointer2 = GetNextValue();
                }
                EndTime = GetTimeStamp();
                break;
            }
            if (argv[2][0] == '-' && argv[2][1] == '3') {
                printf("Using __UNALIGNED qualifier.\n");
                IntPointer = (int *)(&Buffer[3]);
                printf("Buffer at %x, IntPointer = %x\n", 
                        Buffer, 
                        IntPointer);
                StartTime = GetTimeStamp();
                for (i = 0; i < Max; i++) {
                    *IntPointer = GetNextValue();
                }
                EndTime = GetTimeStamp();
                break;
            }
            if (argv[2][0] == '-' && argv[2][1] == '4') {
                headerPtr = (struct BMPHeader *)Buffer;
                printf("Using #pragma pack(1) directive\n"); 
             printf("Access offset @%x\n", (ULONG)&(headerPtr->BMPDataOffset));
                StartTime = GetTimeStamp();
                for (i = 0; i < Max; i++) {
                    headerPtr->BMPDataOffset = GetNextValue();
                }
                EndTime = GetTimeStamp();
                
                break;
            }
      //
      // fall through
      //
      default:
       fprintf(stderr, "Usage: ALIGN number-of-iterations [-option]\n");
       fprintf(stderr, "\nwhere option is one of the following:\n");
       fprintf(stderr, "\t-0 Use ONLY aligned accesses.\n");
       fprintf(stderr, "\t-1 NO alignment fix ups (causes an exception).\n");
       fprintf(stderr, "\t-2 Use OS-based fix ups for misaligned accesses.\n");
       fprintf(stderr, "\t-3 Use __UNALIGNED type qualifier.\n");
       fprintf(stderr, "\t-4 Use #PRAGMA PACK(1) directive.\n");
       ExitProcess(0);
    }
    printf("%d milliseconds\n", EndTime - StartTime);
    ExitProcess(0);
}

Listing Two

;
; Typical 32-bit store instruction
;
; Assumes:
; r3 contains word to store at address contained in r4
;
        stw         r3, 0(r4)           ; store word contained in r3
                                        ; to address contained in r4 + 0

Listing Three

;
; The equivalent 32-bit store resulting from use of the 
; __unaligned type qualifier in the pointer declaration
; for IntPointer.
; Assumes:
; r3 contains word to store at address contained in r4
;    For the purposes of this example, assume that 
;    r3 = 0x12345678.
;
        stb         r3, 0(r4)           ; store the lower byte (0x78) 
                                        ; of r3 into address contained
                                        ; in r4 + 0.
        rlwinm      r5, r3,24,8,31      ; extract bits 16-23 into the 
                                        ; low-order position of r5
                ; How the rlwinm instruction works:
                ; Step 1: rotate contents of r3 left by 24 bits
                ;         Result: 0x78123456
                ; Step 2: generate a mask with 1-bits from 
                ;         bit 8 to 31 Result: 0x00ffffff 
                ; Step 3: AND the contents of r3 with mask and
                ;       place the result into r5.
                ;         Result: r5 = 0x00123456
                ; NOTE: the next stb instruction will store 
                ;       0x56 into the address (r4 + 1). 
                ;       See Figure 1.
        stb         r5, 1(r4)       ; store next byte at r4 + 1
        rlwinm      r5, r3,16,16,31 ; extract bits 8-15 into r5
        stb         r5, 2(r4)       ; store next byte at r4 + 2
        rlwinm      r3, r3,8,24,31  ; extract bits 0-7 into r3
        stb         r3, 3(r4)       ; store final byte at r4 + 3
End Listings


Copyright © 1995, Dr. Dobb's Journal

Terms of Service | Privacy Statement | Copyright © 2024 UBM Tech, All rights reserved.