Channels ▼
RSS

Tools

Hot-Rodding Windows and Linux App Performance with CUDA-Based Plugins


CUDA is maturing to become a natural extension of the emerging CPU/GPU paradigm of high-speed computing to make it, and GPU computing, a candidate for all application development. A recent article in this series tutorial series, Running CUDA Code Natively on x86 Processors, noted recent developments that allow CUDA programs to transparently compile and run on x86 processors. This article focuses on incorporating CUDA into Windows and Linux workflows by exploiting the capabilities of the NVIDIA compiler driver, nvcc, to create native runtime loadable plugins. Source code is provided to create and utilize CUDA plugins and even dynamically compile and link a CUDA source file into a running application (just like the OpenCL). This tutorial also provides a general "click together tools" framework that can stream arbitrary messages (vectors, arrays, and complex nested structures) among heterogeneous CPU-, GPU- and CPU+GPU-based applications running within a single workstation, across a network of machines, or within a cloud computing framework. My production version of this same framework has successfully integrated multiple supercomputers and numerous computation nodes into a single unified workflow.

For simplicity, this tutorial uses the freely downloadable Google protobufs package so readers can easily extend this framework to operate on their own data structures. The Google protobufs package also provides binary interoperability across machines and the ability to incorporate applications written in CUDA C, CUDA C++, standard C, standard C++, Python, Java, R, and many other languages into workflows. The use of a binary format also provides a 20-100x increase in performance over XML.

Readers can manually copy and paste the source code in this article to create working applications. Alternatively, a zip file can be downloaded from my wiki.

The build and test scripts use bash and were purposely kept simple. Linux and Windows Cygwin users should be able to use the scripts in this tutorial to build and run the examples. Native Windows users should be able to easily adapt these bash scripts to run in Microsoft .bat files. Regardless of the scripting language, the Windows nvcc driver uses the Microsoft Visual Studio C compiler to create native Windows .exe executables that do not require Cygwin to run. I use the following .bat file to utilize the Microsoft Visual Studio 2010 compiler in command-line mode under Cygwin:

@echo off
call "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat" amd64
chdir C:\cygwin\bin
bash --login –i

Creating Dynamically Loadable Runtime CUDA DLLs and Shared Objects

When it encounters calls to libraries, the compiler simply notes the calls. The linker, in a follow-up step, completes the resolution of any unresolved references to create a complete executable that can run on the computer. Linking can be static or dynamic.

Static linking means that all references are resolved when the executable is built. Further, the executable contains the explicit machine code to run all library functions used by the program.

Dynamic load-time linking happens when the executable is loaded into memory. Just like static linking, all symbols in the executable are resolved by linking with one or more .dll (Windows) or .so files (Linux) at program startup. This form of linking provides fixed functionality (think of the C runtime library and other commonly used libraries). A big advantage of load-time linking is that all applications that link a library at load time will benefit from bug fixes and performance improvements just by installing the revised library file at the shared location. No applications need to be recompiled or relinked to use the improved library code. Further, shared libraries keep individual executable sizes small — a cost savings that can be multiplied many times for libraries that are commonly used.

Dynamic runtime linking enables the loading of plugins, which allows generic functionality to be added without recompiling the application. Thus, an application can call a generic function with functionality that depends entirely on whatever plugin happens to be loaded at runtime. In fact, the application can create the plugin (when a compiler is available) by generating a problem-specific source code, which is then compiled and linked into the application. This capability enables developers to create very highly optimized functions for specific problem parameters. I use this trick frequently to create very efficient applications and even to use CUDA C and C++ as a scripting language as I'll show later in this article.

The following is a simple C-language program that calls a generic external function, func(), and prints the value of x created by the generic function. To demonstrate additional generality, this program calls an init() method to perform any initialization, and a fini() method to provide any final processing.

#include <stdio.h>
extern int func(int *);

int main()
{
  int x;
  init();

  func(&x);
  printf("Example of static linking\n");
  printf("Valx=%d\n",x);

  fini();
  return 0;
}

The source code for crossplatform.cc extends this generic behavior and adds the capability to dynamically compile the DLL or .so at runtime, which is then linked to the running executable. The name of the source code is specified on the command line. Due to the flexibility of the CUDA nvcc compiler driver, the source code can be written in C or C++, and use the CUDA Thrust and Runtime APIs. (This code was not adapted to run on Apple CUDA-enabled computers because I do not have access to a Mac.) It is not hard to see how the application can be extended to generate the source code that generates the plugin.

The code walk-through of crossplatform.cc starts with the specification of the include files needed to build crossplatform.cc in the Linux and Windows environments. By default, nvcc defines the preprocessor variable _WIN32 when building under windows. The __declspec() method is used so Windows can find the symbol information for loadable methods.

#ifdef _WIN32
#include <windows.h>
extern __declspec(dllimport) int init();
extern __declspec(dllimport) int func(int*);
extern __declspec(dllimport) int fini();
#else
#include <cstdlib>
#include <sys/types.h>
#include <dlfcn.h>
#endif
#include <string>
#include <iostream>

using namespace std;

Some global handles and pointer to function types are defined.

#ifdef _WIN32
HINSTANCE lib_handle;
#else
void *lib_handle;
#endif

typedef int (*initFini_t)();
typedef int (*func_t)(int*);

The main() method begins by parsing the command-line argument, which contains the filename of the source to be built. The command to build the DLL (Windows) or .so (Linux) is created and performed with a system() call.

int main(int argc, char **argv) 
{
  if(argc < 2) {
    cerr << "Use: sourcefilename" << endl;
    return -1;
  }
  string base_filename(argv[1]);
  base_filename = base_filename.substr(0,base_filename.find_last_of("."));
 
  // build the shared object or dll
#ifdef _WIN32
  string buildCommand("nvcc ");
  buildCommand += string(argv[1]) 
	+ string(" -Xlinker \"/DLL /OUT:") + base_filename + string(".dll \"");
#else
  string buildCommand("nvcc -Xcompiler \"-fPIC -shared\" ");
  buildCommand += string(argv[1])
    + string(" -o ") + base_filename + string(".so ");
#endif
 
  cerr << buildCommand << endl; 
  if(system(buildCommand.c_str())) {
    cerr << "compile command failed!" << endl;
    cerr << "Build command " << buildCommand << endl;
    return -1;
  }
  cerr << buildCommand << endl; 

Assuming no errors occurred during the compilation phase, the next step is to load the library created in the previous step. If there is an error, the program exits.

  // load the library 
  string nameOfLibToLoad("./");
  nameOfLibToLoad += base_filename;
 
#ifdef _WIN32
  nameOfLibToLoad += ".dll";
  cerr << nameOfLibToLoad << endl; 
  lib_handle = LoadLibrary(TEXT(nameOfLibToLoad.c_str()));
  if (!lib_handle) {
     cerr << "Cannot load library: " << TEXT(nameOfLibToLoad.c_str()) << endl;
     return -1;
  }
#else
  nameOfLibToLoad += ".so";
  lib_handle = dlopen(nameOfLibToLoad.c_str(), RTLD_LAZY);
  if (!lib_handle) {
     cerr << "Cannot load library: " << dlerror() << endl;
     return -1;
  }
#endif

Finally, the symbols are loaded and the pointers to the init(),func(), and fini() methods are resolved.

  // load the symbols
  initFini_t dynamicInit= NULL;
  func_t dynamicFunc= NULL;
  initFini_t dynamicFini= NULL;

#ifdef _WIN32
  dynamicInit = (initFini_t) GetProcAddress(lib_handle, "init");
  if (!dynamicInit) {cerr << "sym load error on init" << endl; return -1;}
  dynamicFunc = (func_t) GetProcAddress(lib_handle, "func");
  if (!dynamicFunc) {cerr << "sym load error on func" << endl; return -1;}
  dynamicFini = (initFini_t) GetProcAddress(lib_handle, "fini");
  if (!dynamicInit) {cerr << "sym load error on fini" << endl; return -1;}
#else
  // reset errors
  dlerror();
 
  // load the function pointers
  dynamicFunc= (func_t) dlsym(lib_handle, "func");
  const char* dlsym_error = dlerror();
  if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
  dynamicInit= (initFini_t) dlsym(lib_handle, "init");
  dlsym_error = dlerror();
  if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
  dynamicFini= (initFini_t) dlsym(lib_handle, "fini");
  dlsym_error = dlerror();
  if (dlsym_error) { cerr << "sym load: " << dlsym_error << endl; return -1;}
#endif

Each function pointer is checked to make sure that symbol has been resolved. If so, the function is called. As a convenience to the plugin author, any of the calls can be made optional — meaning the method does not need to be included in the compiled source file. All that's required is to modify the logic in the previous step so a failure to resolve a reference does not cause the application to exit.

if( (*dynamicInit)() < 0) return -1;

  int x;
  (*dynamicFunc)(&x);
  cout << "Valx " << x << endl;
 
  if( (*dynamicFini)() < 0) 
    return -1;
 

Finally, the libraries are unloaded and the application exits.

 // unload the library 
 
#ifdef _WIN32
  FreeLibrary(lib_handle);
#else
  dlclose(lib_handle);
#endif
}

Following is the source for a simple C++ plugin, cctest1.cc. This source code is straightforward. Note that __declspec(dllexport) is required by Windows to export the function names.

#include <iostream>
using namespace std;
extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int init() {
  cerr << "Hello from Init" << endl;
  return(0); 
}

extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int func(int *i)
{
  cerr << "Hello from Func" << endl;
  *i=100;
  return(1);
}

extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int fini()
{
  cerr << "Hello from Fini" << endl;
  return(0); 
}

Here is a more-complicated CUDA example that calculates the sum of a sequence of numbers on the GPU using Thrust.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <iostream>
using namespace std;

extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int init()
{
  cerr << "Hello from Init" << endl;
  return(0); 
}

extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int func(int *i)
{
  cerr << "Hello from Func" << endl;
  thrust::device_vector<int> ara(1000);
  thrust::sequence(ara.begin(), ara.end() );
  *i = thrust::reduce(ara.begin(), ara.end() , (int) 0, thrust::plus<int>());
  return(1);
}

extern "C" 
#ifdef _WIN32
__declspec(dllexport)
#endif
int fini() 
{
  cerr << "Hello from Fini" << endl;
  return(0); 
}

Linux users will use the following commands to build and run the examples. Either of the plugin examples can be used simply by calling the crossplatform.cc executable with the name of the source file. Without changing or recompiling crossplatform.cc, arbitrary functionally can be added by writing your own plugin.

echo "------ static test cutest2.cu -----"
nvcc prog.c cutest2.cu -o prog
./prog
echo "------ building cross platform-----"
nvcc -o crossplatform crossplatform.cc -ldl
echo "------ dynamic version of cctest1.cc -----"
./crossplatform cctest1.cc
echo "------ dynamic version of cutest2.cu -----"
./crossplatform cutest2.cu

Similarly, Windows users will use the following commands.

echo "------ static test cutest2.cu -----"
nvcc prog.c cutest2.lib -o prog
./prog.exe
echo "------ building cross platform-----"
nvcc -o crossplatform crossplatform.cc
echo "------ dynamic version of cctest1.cc -----"
./crossplatform.exe cctest1.cc
echo "------ dynamic version of cutest2.cu -----"
./crossplatform.exe cutest2.cu

Note that the only real difference is that Linux needs to link to the libdl.a library and the executable names are slightly different.

Following is the output under Linux. Windows users will see similar behavior.

$:~/DDJ023/dynamic_load_link$ sh BUILD.linux 
------ static test cutest2.cu -----
Hello from Init
Hello from Func
Example of static linking
Valx=499500
Hello from Fini
------ building cross platform-----
------ dynamic version of cctest1.cc -----
nvcc -Xcompiler "-fPIC -shared" cctest1.cc -o cctest1.so 
nvcc -Xcompiler "-fPIC -shared" cctest1.cc -o cctest1.so 
Hello from Init
Hello from Func
Valx 100
Hello from Fini
------ dynamic version of cutest2.cu -----
nvcc -Xcompiler "-fPIC -shared" cutest2.cu -o cutest2.so 
nvcc -Xcompiler "-fPIC -shared" cutest2.cu -o cutest2.so 
Hello from Init
Hello from Func
Valx 499500
Hello from Fini

The application crossplatform.cc demonstrates how CUDA can be built and loaded into a running application. This demonstrates that CUDA can build kernels at runtime just like OpenCL can. Further, it opens up the possibility for highly optimized automatic plugin generation based on problem parameters.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
 

Video