Embedded Systems

Symmetric Multiprocessing for PCs

By John Norwood and Shankar Vaidyanathan, January 01, 1994

Our authors focus on multithreaded application development for single-processor and symmetric-multiprocessor machines under Windows NT. In doing so, they present Fortran interface statements for the Win32 console API and a black-box solution for calling 32-bit DLLs from 16-bit applications under NT.

JAN94: Symmetric Multiprocessing for PCs

John and Shankar work at Microsoft Corp. and can be reached at One Microsoft Way, Redmond, WA 98052.

The traditional definition of supercomputing is generally restricted to Fortran and mainframe computers. But with the advent of the 80486, Pentium, MIPS, and DEC Alpha processors and 32-bit operating systems, such as Windows NT, PC computational power has come to rival that of heavy-metal systems. In this article, we'll describe techniques that provide this computational power, focusing on multithreaded application development for single-processor and symmetric-multiprocessor machines. We'll cover DLLs, shared common blocks, multiprocess/multithreaded programming, and the Win32 API. We'll also provide Fortran interface statements for the Win32 console API and a black-box solution for calling 32-bit DLLs from 16-bit applications (such as Visual Basic) under NT.

To illustrate, we'll implement a simple matrix-multiplication algorithm (Listing One, page 98), then gradually add functionality.

Matrix multiplication is an easy algorithm to multithread since there's no need to handle memory contention when writing to the final output matrix. The driver module (Listing Two, page 98) is a minimal program that calls the computation routine and provides the matrices to be multiplied, their dimensions (assuming conformation), and the number of threads to perform the task. In the compute module (compute.for), we have a common block that stores the matrices A, B, and C and keeps track of the number of spawned threads through MaxThreadCount. The subroutine Initialize initializes the common-block variables from the parameters provided by the driver module--an inefficient but simple approach. The subroutine Compute then spawns the specified number of threads. Each thread has as one of its parameters a number corresponding to the iteration that spawned it.

The essence of the implementation is the thread function MatMult, in which each thread begins on the row corresponding to its thread number. When the thread is finished multiplying that row by all columns of the second matrix, it jumps by the total number of threads to a new row. Thus, the threads "leap-frog" to the end of the matrix. This implementation can, at the end, leave a few final threads operating, but it's simple to implement and avoids memory-contention issues when writing to the final matrix.

SMP Issues and Results

Still, the approach we've just described doesn't avoid memory contentions on read access to the input arrays. This isn't a problem on single-processor machines since only one thread is accessing memory at any given instant. On symmetric-multiprocessor (SMP) machines, however, it becomes a potentially serious issue. SMP machines are coarse-grained, parallel machines that excel in performing separate, discrete tasks, but don't have fine-grained memory-arbitration and messaging facilities like those in vector or other super-scalar implementations. This can result in processors "stalling" on simultaneous access to a memory location that another processor is using. Listing Three (page 98) is a simple first pass at minimizing this problem by staggering the rows and columns each thread accesses. Each thread not only leapfrogs on the rows of the first matrix, but also "chases" the subsequent threads on the columns of the second matrix--thus minimizing the chances of simultaneous memory access in the columns.

We experimented briefly with changing the process and thread priorities, but achieved little measurable benefit. Pushing all threads and processes to the maximum available priority may steal a few more time-slices from other user-level threads, but the NT kernel still maintains its core activities. The net result is that the machine becomes totally unresponsive to user input, while net performance hardly changes. An additional side effect of real-time priority is that on a four-processor machine, using more than four threads is inefficient--the first four threads totally lock out any subsequent threads until they terminate. Thus, the best mix of performance and utility is achieved by eliminating all unnecessary applications and services and allowing the sophisticated NT thread and processor scheduler to deliver its designed functionality.

We ran the example under default priority, allowing a flexible number of threads. We minimized the external interferences by stopping all the extraneous applications and services we could, but we don't claim any particular rigor in the results; they simply illustrate the benefits of using SMP machines with minimal code redesign.

We ran Listing Two with the Listing Three modifications on an NCR 3400 machine with four processors and 64 Mbytes of physical RAM. We experimented with a different number of threads on a 300x300 matrix for A, B, and C and averaged our results over 100 runs. The compiler options used were /Ox and /Ob2; Table 1 shows the results.

On a four-processor machine, a simple task such as matrix multiplication with four threads was about 3.8 times faster than a task with a single thread and 3.7 times faster than a nonthreaded task. As more and more threads are generated, thread overhead and context switching take their toll.

Build Option Considerations

To convert this computational module into a DLL that can be called from different applications (and dynamically loaded), you simply add the dllexport attribute to the Compute subroutine and use the /LD compiler option. This automatically creates an import library for linking calling applications. While there are a number of options for linking NT Fortran executables, /LD is the only supported option for linking DLLs. /LD causes the DLL to be linked with a run-time import library. The actual run-time code resides in MSFRT10.DLL, which must be in the path. This has the following implications for applications that intend to call the DLL: If the calling executable is a console application, you should compile it with the /MD option. The executable can then be linked to the run-time import library, and both the application and the DLL will share the same instance of the run time from the run-time DLL. Unit numbers opened in the executable will be accessible in the DLL and vice versa, and screen I/O using WRITE or PRINT statements will coordinate correctly.

If the calling application is compiled and linked using the /ML (single-threaded static run-time) or /MT (multithreaded static run-time) options, the executable and DLL will use separate instances of the run-time code. Then unit numbers will be local and separate for the executable, and the DLL and screen I/O from the executable might not appear as expected if the DLL also does screen I/O.

A Win32 application can be either a console subsystem executable or a Windows subsystem application. Console applications are always in text mode and can't attain higher-resolution screen output, but screen output is rapid. Windows applications have full access to the user-interface functionality provided by the Win32 API. If a Microsoft Fortran application is compiled and linked to be a console application, the user interface consists of character-mode input and output in a console window using READ, WRITE, or PRINT statements. There's no default for text positioning or mouse input. Microsoft Fortran allows you to generate Windows subsystem applications using the QuickWin implementation and /MW compiler option. Because QuickWin has its own API--similar to that for graphics output under DOS and 16-bit Windows in earlier versions of the compiler--it's very easy to generate a Windows subsystem application.

QuickWin applications must be statically linked (implied by /MW). Furthermore, Windows subsystem applications don't have console windows available by default. Thus, any screen I/O from a Fortran DLL (which is, by definition, console I/O) won't be visible.

Common Blocks Across Processes

All threads of a given process share the same address space and can access the process's global variables; this can vastly simplify communication between threads. Suppose, however, we want to accomplish the same task of matrix multiplication by spawning processes instead of threads, which is a possible scenario on an SMP. We need to allow the same common block to be accessed across all the processes. Spawned processes inherit handles to files, console buffers, named pipes, serial-communication devices, and mailslots; but they don't inherit global variables.

Listing Four (page 98), a modified version of Listing Three, spawns processes instead of threads. The common block is named bridge with the attribute dllexport and is contained in the source file bridge.for along with a DATA statement. Since this module contains only data, the linker can be used with the /EDIT option to rename the .data section and set the new section attributes as read, write, and shared. The common block for the shared data must have at least one data item initialized in a DATA statement or it will not be stored in a section that can be modified. If there's any run-time functionality in this file, renaming the .data section will cause the code to fail. The parent process could wait for all the child processes to signal events, but for simplicity, here the parent process sleeps for ten seconds before it prints out the results. The initialization and driver routines remain the same, and the child process (process.for) identifies its number from the command-line argument passed to it by its parent (compute.for).

Console Input and Output

Many Fortran applications rely on character-mode input and output for speed and simplicity. While the default functionality of READ and WRITE statements is largely sufficient, sometimes more control is required (for text positioning and trapping mouse input, for example). Windows NT provides these text-mode services via the console API. The Win32 API documentation gives detailed descriptions of these functions and an overview of the console API. We've included complete interface files that can be included in Fortran console applications and DLLs for increased text-mode services.

Interfacing console output from Windows subsystem applications is largely unexplored territory. For example, if a Windows application calls a DLL that does console output, that output is normally just sent to the bit bucket. If the DLL calls AllocConsole to create a console window, the output can be displayed.

But there's another level of complexity in this situation. By default in console applications, run-time screen I/O functions like READ, PRINT, and WRITE are associated with the handles stdin and stdout. These handles are, in turn, associated with the console input screen buffer and the output screen buffer. A Windows subsystem application calling AllocConsole creates a console window but it won't automatically associate the run-time screen I/O file handles with that console window. This isn't a concern when using the console APIs, since they directly use the console input and output handles, but READ or WRITE statements to the screen will fail. The solution is to force an association between the run-time standard handles and the console standard handles. This is illustrated in the InitConsole routine (available electronically; see "Availability," page 3). The logical sequence is as follows. First, the console input and output handles are obtained by calling CreateFile with the special filenames of CONIN$ and CONOUT$. These console file handles are converted to C run-time file handles using the _open_osfhandle routine. The _dup2 C run-time function then forces stdin, stdout, and stderr to be identical to the C file handles pointing to the console. The C routines are always available to the Fortran code in this article, since the interface statements are included in the console.fi and console.fd files (available electronically).

The InitConsole routine allows a self-contained DLL capable of doing console I/O using both the console API and run-time screen I/O. The DLL needs to know the type of application (console, Windows subsystem, or other) calling it so it can adjust its functionality accordingly. This information is provided in the final parameter from the calling application. The code displays the progress of the matrix multiplication using different colors for different threads. This requires use of a global critical section to synchronize access to the console-output buffer. Console output is limited in resolution, but it is very rapid and easy to implement.

Mixed-language Considerations

Microsoft Fortran PowerStation 32 for Windows NT is capable of mixing with Microsoft Visual C/C++ for Windows NT. While this process is easier than with prior versions, certain tricks make the process even easier. For example, the default naming and passing convention for a Fortran subroutine is a modified version of stdcall that has the following effect on routine names: They are prepended with an underscore, appear in all capital letters, and are appended with the @ symbol, followed by the number of bytes passed on the stack for the argument list. The passing convention is that arguments are pushed on the stack from right to left. Unlike the cdecl convention, however, the callee, not the caller, cleans up the stack. By default, Fortran passes all arguments by reference, but this can be overridden using the value attribute. Passing by reference means that every argument will usually require four bytes of stack space for its address. The only exception is character variables, which require eight bytes: four for the address, followed by four for an integer passed by value that contains the string length. The hidden string-length parameter is required by the character*(*) indeterminate size type allowed in Fortran. The hidden string parameter following character variables can be suppressed using the stdcall or c attribute on the name of the subroutine or in an interface statement to the Fortran subroutine. The amount of stack space for an argument passed by value is always rounded up to the next multiple of four bytes that can contain the data item.

The default structure packing for both Microsoft C/C++ and Fortran is 8-byte packing. The only allowed metacommands or compiler options for Fortran are for 1-, 2-, and 4-byte packing, so

8-byte packing is only possible as the default. Common blocks are accessible from C as global static structs owned by Fortran. The dllexport attribute allows common blocks to be exported from DLLs, as demonstrated in the previous example illustrating shared memory in a common block.

Microsoft C/C++ and Fortran development environments let you manage projects in their respective languages. The Fortran Visual WorkBench allows debugging mixed-language applications since it has access to both C and Fortran expression evaluators. It is often convenient to use the Fortran Visual WorkBench debugger from the C/C++ Visual WorkBench. Using the Tools menu makes this easy. In the C/C++ Visual WorkBench, go to the Options menu and select Tools. Click the Add button and browse for the Fortran f32vwb.exe file. Enter $(TARGET) in the Arguments text field. When you select the Fortran Visual WorkBench under the Tools menu, it will automatically start debugging the application you're working on in the C/C++ Visual WorkBench.

Linking mixed-language C/C++ applications with Fortran object modules requires that the Fortran library LIBF.LIB (static single thread), LIBFMT.LIB (static multithread), or MSFRT.LIB (DLL run-time import library) precede the equivalent LIBC.LIB, LIBCMT.LIB, or MSVCRT.LIB. The versions of both libraries should come from the Fortran LIB directory. Since the order is important, the libraries should be added from the Options Projects Linker Input text field. You should not add them via the Additional Libraries text field or by including them in a project. To link a C/C++ Win32 Windows application that includes Fortran object files, use all the libraries normally used by C/C++ and add LIBF

.LIB and CONSOLE.LIB. The latter is necessary to provide routines for file I/O.

32-bit DLLs and 16-bit Applications

There are times when it is desirable to call a 32-bit DLL from a 16-bit application. In such cases, the 16-bit application could run either under 16-bit Windows 3.x or under the WOW layer on Windows NT. In the first case, you'd use Win32s and Universal thunk facilities. In the second, you'd turn to Generic thunk functionality. Since this article is mainly about running on NT and threading, we'll examine only the second case.

Peter Golde of Microsoft has graciously shared a 16-bit C DLL that can be used in conjunction with Visual Basic to access 32-bit DLLs under NT. The scope of this article doesn't permit a discussion of the implementation details of this solution, but the DLL is provided with a VB example in Listing Five (page xx) that calls our multithreaded matrix-multiplication DLL. Peter's solution is elegant because it can be used to call any 32-bit DLL under NT without creating a new 16-bit DLL each time. DLLs called from 16-bit applications can't do console I/O (AllocConsole fails), thus requiring the "other" (WIN16$) case in the final parameter that is passed in to the DLL from the driver module, and this will suppress all console I/O from the example code.

Conclusion

The advent of SMP machines at affordable prices promises to bring to the desktop more computational horsepower then ever before. The granularity of the parallelism they offer needs to be considered, but even simple modifications to conventional algorithms combined with multithreading can result in a tremendous boost beyond single-processor expectations.

References

Norwood, John. "Mixed Language Windows Programming." Dr. Dobb's Journal (October 1991).

Vaidyanathan, Shankar. "Multitasking Fortran and Windows NT." Dr. Dobb's Sourcebook of Windows Programming (Fall 1993).

Vaidyanathan, Shankar. "Building Windows NT Applications using Fortran." Proceedings of the Tech*Ed Conference. FR-301, volume 2. (March 1993).

Table 1: Matrix-multiplication results on an SMP.

     Threads     Time in Seconds
     none        21.77
     1           22.21
     2           11.33
     3           7.62
     4           5.91
     5           5.82
     6           5.89
     7           5.82
     8           5.82
     30          5.94
     60          6.02

[LISTING ONE]


C The triple DO loop that performs matrix multiplication
      Do i = 1, A_ROWS
         Do j = 1, B_COLUMNS
             Do k = 1, A_COLUMNS
                 C(i, j)  =  C(i, j) + A(i, k) * B(k, j)
             End Do
         End Do
      End Do

[LISTING TWO]


      include 'mt.fi'
      include 'flib.fi'

**** Driver program to do the Matrix Multiplication. Input matrices are  *****
**** initialized to random values here. Maximum number of threads to be  *****
**** spawned is also identified here.                                    *****
      Program Driver
      include 'flib.fd'
      real*4 ranval
      integer*4 i, j, k, inThreadCount
      integer*4 A_Rows, A_Columns, B_Columns
      real*4 A[Allocatable](:,:), B[Allocatable](:,:), C[Allocatable](:,:)

      A_Rows = 50          ! size of A array
      A_Columns = 100      ! size of B array
      B_Columns = 100      ! size of C array
      inThreadCount = 8    ! number of threads to be spawned

      Allocate (A(A_Rows, A_Columns), B(A_Columns, B_Columns),
     +    C(A_Rows, B_Columns) )
      Do  i = 1, A_Columns
          Do j = 1, A_Rows
             Call Random (ranval)
             A (j, i) = ranval
          End Do
          Do k = 1, B_Columns
             Call Random (ranval)
             B(i, k) = ranval
          End Do
      End Do
      Call Compute (A, B, C, A_Rows, A_Columns, B_Columns, inThreadCount)
      End
***** Initiate transfers data from the arguments into the common block. *****
      Subroutine Initiate(In_A, In_B, In_A_Rows, In_A_Columns,
     +                    In_B_Columns, In_Thread_count)
      real*4 In_A(In_A_Rows, In_A_Columns)
      real*4 In_B(In_A_Columns, In_B_Columns)
      integer*4 In_A_Rows, In_A_Columns, In_B_Columns
      integer*4 In_Thread_count, i, j, k
      include 'common.inc'
      MaxThreadCount = In_Thread_count
      A_Rows = In_A_Rows
      A_Columns = In_A_Columns
      B_Columns = In_B_Columns
      Do  i = 1, A_Columns
         Do j = 1, A_Rows
           A (j, i) = In_A(j, i)
         End Do
         Do k = 1, B_Columns
           B(i, k) = In_B(i, k)
         End Do
      End Do
      End ! Initiate
***** MatMult is where the actual calculation of a row times a column is *****
***** performed. This is the thread procedure.                           *****
      Subroutine MatMult (CurrentThread)
      include 'common.inc'
      integer*4 CurrentThread
      automatic
      integer*4 i, j, k
C The loop variable i ranges from the current thread number to the
C maximum number of rows in A in steps of the maximum number of threads
      Do i = CurrentThread, A_Rows,  MaxThreadCount
         Do j = 1, B_Columns
            Do k = 1, A_Columns
               C(i, j)  =  C(i, j) + A(i, k) * B(k, j)
            End Do
         End Do
      End Do
      End ! MatMult
***** Compute does the actual computation by spawning threads.          *****
      Subroutine Compute
     +                  (In_A, In_B, In_C, In_A_Rows, In_A_Columns,
     +                   In_B_Columns, In_Thread_count)
      real In_A(In_A_Rows, In_A_Columns)
      real In_B(In_A_Columns, In_B_Columns)
      real In_C(In_A_Rows, In_B_Columns)
      integer In_A_Rows, In_A_Columns, In_B_Columns
      integer In_Thread_count
      include 'common.inc'
      external MatMult
      integer*4 ThreadHandle [Allocatable](:), threadId
      integer*4 CurrentThread[Allocatable](:), count
      integer*4 waitResult
      integer*4 i, j
      Call Initiate (In_A, In_B, In_A_Rows, In_A_Columns,
     +      In_B_Columns, In_Thread_count)
      Allocate (ThreadHandle(MaxThreadCount),
     +      CurrentThread(MaxThreadCount) )
      Do count = 1, MaxThreadCount
        CurrentThread(count) = count
        ThreadHandle(count) = CreateThread( 0, 0, MatMult,
     +           CurrentThread(count), 0, threadId)
      End Do
C Can't wait on more than 64 threads
      waitResult = WaitForMultipleObjects(MaxThreadCount,
     +      ThreadHandle, .TRUE.,  WAIT_INFINITE)
C Transfer result from common back into return argument.
      Do i = 1, A_Rows
        Do j = 1, B_Columns
            In_C(i,j) = C(i,j)
            C(i, j) = 0.0
        End Do
      End Do
      Deallocate ( ThreadHandle, CurrentThread )
      End ! Compute

######################################################################
C File Name: common.inc
      include 'mt.fd'      ! Data declarations for Multithreading API
      include 'flib.fd'    ! Data declarations for runtime library
      real*4 A, B, C    ! Input Matrices A & B and Output Matrix C
      integer*4 A_Rows, A_Columns, B_Columns  ! Matrix Dimensions
      integer*4 MaxThreadCount  ! Maximum numner of Threads
      common  MaxThreadCount,        ! common block
     +        A_Rows,                ! Rows in A = Rows in C
     +        A_Columns,             ! Columns in A  = Rows in B
     +        B_Columns,             ! Columns in B = Columns in C
     +        A(1000, 1000),
     +        B(1000, 1000),
     +        C(1000, 1000)          ! Maximum Array size is 1000 X 1000

[LISTING THREE]


C This is variation in the MatMult subroutine Do loops. Loop variable i ranges
C from current thread number to maximum number of rows in A in steps of
C maximum number of threads. Loop variable j ranges across all columns of B,
C but is staggered according to current thread number to minimize memory
C contention on an SMP machine. Loop variable jj translates (maps) value of j
C to fall within permissible range of B, that is from 1 to B_Columns

      Do i = CurrentThread, A_Rows,  MaxThreadCount
         Do j = (CurrentThread-1)*MaxThreadCount,
     +           B_Columns + (CurrentThread-1)*MaxThreadCount - 1
            jj =  1 + mod(j, B_Columns)
            Do k = 1, A_Columns
               C(i, jj)  =  C(i, jj) + A(i, k) * B(k, jj)
            End Do
         End Do
      End Do

[LISTING FOUR]


C File Name: Driver.for
C Include contents of Program Driver from Listing 2 here
C Then modify all occurrences of InThreadCount to InProcCount

######################################################################
C File Name: Compute.for
      include 'mt.fi'
      include 'flib.fi'
C Include contents of Subroutine Initiate from Listing 2 here
C Then modify all occurrences of InThreadCount to InProcCount
C Compute does the actual computation by spawning processes
      Subroutine Compute(In_A, In_B, In_C, In_A_Rows, In_A_Columns,
     +      In_B_Columns, In_Proc_Count)
      real*4 In_A(In_A_Rows, In_A_Columns)
      real*4 In_B(In_A_Columns, In_B_Columns)
      real*4 In_C(In_A_Rows, In_B_Columns)
      integer*4 In_A_Rows, In_A_Columns, In_B_Columns
      integer*4 In_Proc_Count
      include 'mt.fd'
      include 'flib.fd'
      include 'common.inc'
      logical*4 ProcHandle                     ! Process Handle
      integer*4 x, y, count
      character*32 inbuffer [Allocatable] (:)
      record /PROCESS_INFORMATION/ pi          ! Process Information
      record /STARTUPINFO/ si                  ! Startup Information

      si.cb = 56                               ! Size of Startup Info
      si.lpReserved = 0
      si.lpDeskTop = 0
      si.lpTitle = 0
      si.dwFlags = 0
      si.cbReserved2 = 0
      si.lpReserved2 = 0

      Call Initiate (In_A, In_B, In_A_Rows, In_A_Columns, In_B_Columns,
     +     In_Proc_Count)
      Allocate (inbuffer(MaxProcCount) )

      Do count = 1, MaxProcCount
         write(inbuffer(count),"(A7, 1X, I4)") 'process', count
         ProcHandle = CreateProcess( 0, loc(inbuffer(count)),
     +               0, 0, .TRUE. , 0, 0, 0, loc(si), loc(pi))
         print"('+',a,i5)", "Generating Process # " , count
      End Do

      write(*,*)
      write(*,*)
      Call sleepqq(10000)     ! Sleep for 10000 milliseconds
      Do x = 1, A_Rows
        Do y = 1, B_Columns
            In_C(x,y) = C(x,y)
            C(x,y) = 0.0
        End Do
      End Do
      End  ! Compute
######################################################################
C File Name: Process.for -- MatMult is the Process that  multiplies the
C appropriate Row of A with the appropriate column of B
      Program MatMult
      include 'common.inc'
      automatic
      integer*4 CurrentProc, i, j, k, jj
      character*32 buffer
      integer*2 status
C Obtaining the command line arguments
      Call GetArg (1, buffer, status)
      read (buffer(1:status), '(i4)') CurrentProc
      Do i = CurrentProc, A_Rows,  MaxProcCount
         Do j = (CurrentProc-1)*MaxProcCount,
     +           B_Columns + (CurrentProc-1)*MaxProcCount - 1
            jj =  1 + mod(j, B_Columns)
            Do k = 1, A_Columns
               C(i, jj)  =  C(i, jj) + A(i, k) * B(k, jj)
            End Do
         End Do
      End Do
      End
######################################################################
C File Name: Bridge.for -- The common block for shared data must have one data
C item initialized in a DATA statement or it will not be stored in a section
C that can be modified. The LINK /EDIT command is used to rename the .data
C section and set the new sections attributes as read, write, shared. The
C source files should contain only the common declaration and the DATA
C statement. If there is any runtime statements, renaming the .data section
C will call the cause the code to fail.
      Subroutine dllsub[dllexport]
      real*4 A, B, C
      integer*4 A_Rows, A_Columns, B_Columns
      integer*4 MaxProcCount     ! Maximum number of processes
      common /bridge[dllexport]/ MaxProcCount,
     +                           A_Rows,
     +                           A_Columns,
     +                           B_Columns
     +                           A(100, 100),
     +                           B(100, 100),
     +                           C(100, 100)
      data MaxProcCount /0/
      End
######################################################################
C File Name: Common.inc
C Common Block contents
      real*4 A, B, C
      integer*4 A_ROWS, A_COLUMNS, B_COLUMNS
      integer*4 MaxProcCount
      common /bridge[dllimport]/ MaxProcCount,
     +                           A_ROWS,
     +                           A_COLUMNS,
     +                           B_COLUMNS,
     +                           A(1000, 1000),
     +                           B(1000, 1000),
     +                           C(1000, 1000)
######################################################################
# File Name: Makefile

all: bridge.dll process.exe driver.exe
bridge.dll: bridge.obj
  link /edit bridge.obj /section:.data=.bridge,srw
  fl32 /LD bridge.obj
bridge.obj: bridge.for
  fl32 /LD /c bridge.for
process.exe: process.obj bridge.lib
  fl32 /MD process.obj bridge.lib
process.obj: process.for common.inc
  fl32 /MD /c process.for
driver.exe: driver.obj compute.obj bridge.lib
  fl32 /MD driver.obj compute.obj bridge.lib
driver.obj: driver.for common.inc
  fl32 /MD /c driver.for
compute.obj: compute.for common.inc
  fl32 /MD /c compute.for

[LISTING FIVE]


VERSION 2.00
Begin Form Form1
   Caption         =   "Form1"
   ClientHeight    =   6045
   ClientLeft      =   1095
   ClientTop       =   1485
   ClientWidth     =   9180
   Height          =   6450
   Left            =   1035
   LinkTopic       =   "Form1"
   ScaleHeight     =   6045
   ScaleWidth      =   9180
   Top             =   1140
   Width           =   9300
   Begin CommandButton Compute
      Caption         =   "Compute"
      Height          =   375
      Left            =   1200
      TabIndex        =   1
      Top             =   5040
      Width           =   1575
   End
   Begin Grid grdC
      Height          =   4335
      Left            =   1200
      TabIndex        =   0
      Top             =   480
      Width           =   6495
   End
End
' These declarations set up the two core functions to access the CALL32 DLL:
' Declare32 and CALL32. These are the only two functions you need to use to get
' access to any 32-bit DLL. The Option Base is used to start arrays at
' index 1 just as in Fortran
Option Base 1
Declare Function Declare32 Lib "call32.dll" (ByVal Func As String, ByVal Library As String, ByVal Args As String) As Long
Declare Sub Compute Lib "call32.dll" Alias "Call32" (A As Single, B As Single, C As Single, A_ROWS As Long, A_COLUMNS As Long, B_COLUMNS As Long, MaxThreadCount As Long, DO_CONSOLE As Long, ByVal id As Long)
Const A_ROWS% = 30
Const A_COLUMNS% = 200
Const B_COLUMNS% = 30
Const DO_CONSOLE% = 3
Dim A(A_ROWS, A_COLUMNS) As Single
Dim B(A_COLUMNS, B_COLUMNS) As Single
Dim C(A_ROWS, B_COLUMNS) As Single
Dim MaxThreadCount As Long
Dim idCompute As Long
Dim i As Long
Dim j As Long
Sub Compute_Click ()
' This code simply initializes the two input arrays and then calls the
' 32-bit DLL to multiply them.  It then puts the result in the grid.
  MaxThreadCount = 8
  Randomize
  For i = 1 To A_COLUMNS
    For j = 1 To A_ROWS
         A(j, i) = Rnd
    Next j
    For k = 1 To B_COLUMNS
         B(i, k) = Rnd
    Next k
  Next i
  Call Compute(A(1, 1), B(1, 1), C(1, 1), A_ROWS, A_COLUMNS, B_COLUMNS,
                                        MaxThreadCount, DO_CONSOLE, idCompute)
  For i = 1 To A_ROWS
    grdC.Row = i
    For j = 1 To B_COLUMNS
      grdC.Col = j
      grdC.Text = Str$(C(i, j))
    Next j
  Next i
End Sub
Sub Form_Load ()
' This code sets up the call to the CALL32 DLL by first using the Declare32
' function to get an id number. At this point CALL32 creates a function pointer
' to that 32-bit DLL subroutine and all access to the routine will be through
' that function pointer. The code also initializes the row and column numbers
' and sets the size of the grid fields.
  idCompute = Declare32("COMPUTE", "compute", "pppppppp")
  grdC.Rows = A_ROWS + 1
  grdC.Cols = B_COLUMNS + 1
  grdC.Row = 0
  For i = 1 To B_COLUMNS
    grdC.Col = i
    grdC.Text = Str$(i)
    grdC.ColWidth(i) = TextWidth("123.1234567")
  Next i
  grdC.Col = 0
  For i = 1 To A_ROWS
    grdC.Row = i
    grdC.Text = Str$(i)
    grdC.RowHeight(i) = TextHeight("1") + 10
  Next i
End Sub
End Listings

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Symmetric Multiprocessing for PCs

SMP Issues and Results

Build Option Considerations

Common Blocks Across Processes

Console Input and Output

Mixed-language Considerations

32-bit DLLs and 16-bit Applications

Conclusion

References

Table 1: Matrix-multiplication results on an SMP.

[LISTING ONE]

[LISTING TWO]

[LISTING THREE]

[LISTING FOUR]

[LISTING FIVE]

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Embedded Systems

Symmetric Multiprocessing for PCs

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content