CUDA integration with C#

CUDA-enabled hardware and .NET 4 (Visual Studio 2010 or Visual C# 2010 Express) are needed to run the example code. Visual C++ 2008 Express was used as the CUDA C editor (the 2010 version changed the custom build rules feature and cannot use the rules file that the CUDA SDK provides for easy VS integration).

Part 1: Environment and tools configuration for CUDA

CUDA is a general-purpose parallel computing architecture introduced by NVIDIA. CUDA programs (kernels) run on the GPU instead of the CPU for better performance (hundreds of cores that can collectively run thousands of computing threads). It comes with a software environment that allows developers to use C as a high-level programming language. This technology is used in mathematics, science, finance, modeling, image processing and so on.

Basic CUDA configuration for development:
  • Download and install the CUDA toolkit for your OS. Use version 3.1 or the latest 3.2RC, or version 3.0 if you need device emulation (limited usage, but it works without a CUDA-enabled device; not supported after version 3.0)
  • Download and install the SDK for the same OS and toolkit version
  • Update the display drivers if you encounter problems with the SDK installation

Visual C++ Express 2008 (or VS2008) configuration:

1. Syntax coloring

  • Open Tools -> Options from main window, then Text Editor -> File Extension, add .cu and .cuh
    extensions with Microsoft Visual C++ Editor
  • Copy file usertype.dat from [sdk dir]\C\doc\syntax_highlighting\visual_studio_8\ to Program
    Files\Microsoft Visual Studio 9.0\Common7\IDE\ folder
  • Restart Visual Studio

2. New project, 32bit Windows XP version
  • Default SDK location is c:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\
  • Default toolkit location is C:\CUDA\
  • Create empty Win32 console application and add source file with .cu extension
  • Select the newly created project in the Solution Explorer window, then right-click -> Custom Build Rules; use the Find Existing button to locate the Cuda.rules file in the [sdk dir]\C\common\ folder, add it, and tick it in the list of available rule files
  • Select the project again -> Properties, select the Release configuration, and then from the tree view:

    - Configuration Properties -> Linker -> General -> Additional Library Directories:
    add this line (default folders) C:\CUDA\lib;"C:\Documents and Settings\All Users\Application Data\NVIDIA Corporation\NVIDIA GPU Computing SDK\C\common\lib"
    - Configuration Properties -> Linker -> Input -> Additional Dependencies: cudart.lib
  • Use the same settings for the Debug configuration
  • For emulator configurations use Configuration Manager to add new configurations based on Release and Debug ones, name them as EMU-Release and EMU-Debug (or as you wish)
  • Select EMU-Release from Configuration combo-box:

    - Change cudart.lib to cudartemu.lib
    - From tree view select CUDA Build Rule v3.0.14 (or similar) -> General and set Emulation Mode to
  • Do the same for EMU-Debug

3. New project, 64bit Windows 7 version
  • Default SDK location is c:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK\
  • Default toolkit location is C:\CUDA\
  • The rest is basically similar to win32 variant except that:

    - Use Configuration Manager to create new solution configurations AMD64_Release, AMD64_Debug
    and emulation versions if needed (copy settings from Release and Debug configurations)
    - Use Configuration Manager to add new solution platform x64 and copy setting from Win32
    - In Project contexts window choose platform x64 for all AMD64 configurations
    - Linker -> General -> Additional Library Directories: c:\CUDA\lib64; "c:\ProgramData\NVIDIA
    Corporation\NVIDIA GPU Computing SDK\C\common\lib"
    - Linker -> Input -> Additional Dependencies: cudart.lib

Important notes about example files:

You must complete this basic configuration to be able to open the attached example, because the project file contains a custom build rule entry. VC++ 2008 will show errors if that rule is not accessible because of wrong paths. You can also edit the file cudalib.vcproj (*.vcproj) and fix the path to the cuda.rules file.

If that does not help, just create a new project, manually add the custom build rule as described in the configuration section above, and then copy the source .cu file from the example into the new project.

You also need to manually copy the file cutil.h from the \NVIDIA GPU Computing SDK\C\common\inc\ directory to C:\CUDA\include\ (this is the easiest way to fix the include paths) and add cutil32.lib (or cutil64.lib for a 64bit OS) to the linker's Additional Dependencies, which should then read: cudart.lib cutil32.lib (this is already done in the example projects).

CUDA toolkit and SDK version 3.0 were used to build the dll part.

The main project part (written in C#) will work even without these steps because I added an already built dll library to the \bin\Debug and \bin\Release folders (remember to replace it if you change anything in the dll part; this is not done automatically, at least in my example).

You also need VS 2010 for the C# project part because .NET 4 features are used.

Part 2: CUDA DLL

The dll part was written in the VC++ 2008 IDE. Assuming that the IDE is configured properly, create a new Win32 Console Application, switch the application type to "DLL" and tick "Empty project" (no precompiled header or dllmain() function is needed). Add a new source file and save it with the .cu extension. Syntax coloring should work. Remember to add the CUDA custom build rule and the linker dependencies.

To run a CUDA kernel (a function executed on the GPU device) we need some sort of wrapper function exported from the dll. Kernel setup and invocation happen inside that function. A nice feature is that we can pass the kernel execution configuration parameters to it (grid size, block size and shared memory size) instead of hard-coding them as constants, which makes it possible to run a benchmark on the target machine to determine the best values.
// cuda wrapper function
extern "C" int __declspec(dllexport) __stdcall SomeCalculationsCU
      (
      float *a_h,                               // pointer to input array
      const unsigned int N,                     // input array size
      const unsigned int M,                     // kernel M parameter
      const int cuBlockSize = 512,              // kernel block size (max 512)
      const int showErrors = 1                  // show CUDA errors in console window
      )
{
      int tmp = PRINT_ERRORS;
      PRINT_ERRORS = showErrors;

      float *a_d;                               // pointer to device array
      size_t size = N * sizeof(float);
      int cuerr = 0;                            // no errors
      unsigned int timer = 0;

      cudaMalloc((void**)&a_d, size);           // allocate array on device
      cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

      // number of blocks needed to cover N elements (rounded up)
      int n_blocks = N / cuBlockSize + (N % cuBlockSize == 0 ? 0 : 1);

      cutCreateTimer(&timer);                   // from cutil.h
      cutStartTimer(timer);
      some_calculations <<<n_blocks, cuBlockSize>>> (a_d, N, M);  // kernel invocation
      cudaThreadSynchronize();                  // by default kernel runs in parallel with CPU
      cutStopTimer(timer);                      // stop timing once the kernel has finished
      cuerr = checkCUDAError("cuda kernel");

      cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);
      if(!cuerr) cuerr = checkCUDAError("cuda memcpy");
      sExecutionTime = cutGetTimerValue(timer);

      cudaFree(a_d);                            // release device array
      if(!cuerr) cuerr = checkCUDAError("cuda free");

      PRINT_ERRORS = tmp;
      return cuerr;
}
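
The n_blocks expression in the wrapper is a rounded-up division: when N is not a multiple of the block size, one extra block is launched and some of its threads have nothing to do. The host-only sketch below isolates that arithmetic so it can be checked without a GPU. The helper names blocks_for and idle_threads are my own, not part of the example DLL.

```cpp
#include <cassert>

// Hypothetical helper (not in the example DLL): number of blocks needed to
// cover n elements with block_size threads per block, rounded up exactly as
// the n_blocks expression in the wrapper does.
unsigned int blocks_for(unsigned int n, unsigned int block_size)
{
    assert(block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0u : 1u);
}

// Threads launched beyond n. The kernel's "if (idx < N)" guard is what keeps
// these threads from touching memory outside the array.
unsigned int idle_threads(unsigned int n, unsigned int block_size)
{
    return blocks_for(n, block_size) * block_size - n;
}
```

For N = 2000 and a block size of 256 this gives 8 blocks (2048 threads), so 48 threads in the last block fall outside the array and rely on the bounds check inside the kernel.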

The most important part of SomeCalculationsCU is the extern "C" int __declspec(dllexport) __stdcall declaration, which makes it visible outside the dll. The calling convention (__stdcall here) must be specified because C functions use __cdecl by default, while .NET platform invoke uses CallingConvention.Winapi, which is __stdcall. In general, the important points are to use the same convention for the caller and the callee, and not to use __stdcall with variadic functions.
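
As a rough illustration of keeping caller and callee on the same convention, the sketch below wraps the export attributes in two macros and reuses the same macro in the function pointer type. The macro names DLL_EXPORT and DLL_CALL, the function ScaleArray and the helper call_through_pointer are all made up for this example; on non-Windows compilers the macros expand to nothing so that the code still builds.

```cpp
#include <cassert>

// Assumption: illustrative macro names, not taken from the CUDA SDK.
#if defined(_WIN32)
#  define DLL_EXPORT __declspec(dllexport)
#  define DLL_CALL   __stdcall
#else
#  define DLL_EXPORT
#  define DLL_CALL
#endif

// extern "C" switches off C++ name mangling; DLL_CALL fixes the convention.
extern "C" DLL_EXPORT int DLL_CALL ScaleArray(float *data, unsigned int n, float factor)
{
    if (data == 0) return 1;                 // non-zero return signals an error
    for (unsigned int i = 0; i < n; ++i)
        data[i] *= factor;
    return 0;
}

// The caller's function pointer type repeats the same convention macro.
typedef int (DLL_CALL *ScaleArrayFn)(float *, unsigned int, float);

int call_through_pointer(ScaleArrayFn fn, float *data, unsigned int n, float factor)
{
    assert(fn != 0);
    return fn(data, n, factor);
}
```

Because both the definition and the pointer type go through DLL_CALL, changing the convention in one place changes it for both sides.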

The body of SomeCalculationsCU could equally be the main() function of a console application project, so you can modify it and add a section that prints the results, or just add a main() function like this one:
int main(void)
{
      float *a_h;
      const unsigned int N = 2000;
      const unsigned int M = 10;
      const int cublocks = 256;

      size_t size = N * sizeof(float);
      a_h = (float*)malloc(size);
      for(unsigned int i = 0; i < N; i++) a_h[i] = (float)i;

      SomeCalculationsCU(a_h, N, M, cublocks, 1);

      printf("exec time = %f ms\n", sExecutionTime);

      free(a_h);                                // release host array
      return 0;
}

To run this as an application (instead of building a dll), change Configuration Type to Application (.exe) in project Properties -> Configuration Properties -> General. This also allows you to use the CUDA Profiler (located in the C:\CUDA\cudaprof\bin\ directory).

The file also contains an example of a variable that can be accessed from outside the dll:
// external variable example
extern "C" { float __declspec(dllexport) sExecutionTime = -1; }

// variable wrapper function
extern "C" float __declspec(dllexport) __stdcall GetExecutionTime()
{
      return sExecutionTime;
}

It is possible to access this kind of variable directly (this will be shown in the C# part), but it is much easier with the wrapper function approach (the wrappers can work like get/set accessors).
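
A get/set pair in that spirit might look like the sketch below. It is only an outline: the names g_execution_time_ms, GetExecutionTimeMs and SetExecutionTimeMs are my own, and the __declspec(dllexport) and __stdcall attributes are left out so that the snippet compiles with any C++ compiler.

```cpp
#include <cassert>

// Module-level state, analogous to sExecutionTime in the dll.
static float g_execution_time_ms = -1.0f;    // -1 means "nothing measured yet"

// Getter: the only way callers read the value.
extern "C" float GetExecutionTimeMs()
{
    return g_execution_time_ms;
}

// Setter: validates the input and returns non-zero when it is rejected.
extern "C" int SetExecutionTimeMs(float ms)
{
    assert(ms == ms);                        // NaN is a bug, not a measurement
    if (ms < 0.0f) return 1;                 // negative durations are refused
    g_execution_time_ms = ms;
    return 0;
}
```

The setter gives one place to validate values, which a directly exported variable cannot do.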

Finally, the example kernel function:
// cuda kernel (internal)
__global__ void some_calculations(float *a, unsigned int N, unsigned int M)
{
      unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
      if (idx < N)
      {
            // note1: no need for shared memory here
            // note2: global memory access is coalesced
            //        (no structs, float only used)

            // do computations M times on each thread
            // to extend processor's time
            for(unsigned int i = 0; i < M; i++)
            {
                  // some easy arithmetics
                  a[idx] = a[idx] * a[idx] * 0.1 - a[idx] - 10;
            }
      }
}
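
To see why the idx formula together with the idx < N guard touches every array element exactly once, the index mapping can be replayed on the CPU with two nested loops standing in for blockIdx.x and threadIdx.x. This is a host-only sketch that assumes a one-dimensional grid and one-dimensional blocks, as in some_calculations; the function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Replays the kernel's indexing on the host and counts how many simulated
// threads land on each element of an n-element array.
std::vector<int> emulate_launch(unsigned int n, unsigned int n_blocks, unsigned int block_dim)
{
    assert(block_dim > 0);
    std::vector<int> hits(n, 0);
    for (unsigned int block = 0; block < n_blocks; ++block)            // blockIdx.x
        for (unsigned int thread = 0; thread < block_dim; ++thread)    // threadIdx.x
        {
            unsigned int idx = block * block_dim + thread;
            if (idx < n)                     // same guard as in the kernel
                ++hits[idx];
        }
    return hits;
}

// True when every element is handled by exactly one simulated thread.
bool covers_each_once(unsigned int n, unsigned int n_blocks, unsigned int block_dim)
{
    std::vector<int> hits = emulate_launch(n, n_blocks, block_dim);
    for (unsigned int i = 0; i < n; ++i)
        if (hits[i] != 1) return false;
    return true;
}
```

With N = 2000 and 256 threads per block, 8 blocks cover the array, while 7 blocks (1792 threads) would leave the tail untouched.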

And the CPU version for execution time comparison:
extern "C" void __declspec(dllexport) __stdcall SomeCalculationsCPU
      (
      float *a_h,
      const unsigned int N,
      const unsigned int M
      )
{
      unsigned int timer = 0;
      cutCreateTimer(&timer);                   // from cutil.h
      cutStartTimer(timer);
      for(unsigned int i = 0; i < N; i++)
            for(unsigned int j = 0; j < M; j++)
                  *(a_h + i) = *(a_h + i) * *(a_h + i) * 0.1 - *(a_h + i) - 10;
      cutStopTimer(timer);
      sExecutionTime = cutGetTimerValue(timer);
}
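
If cutil.h is not at hand, the same CPU measurement can be sketched with the standard C++11 <chrono> header instead of the cutil timer. This is a swapped-in technique, not what the example project does, and the names some_calculations_cpu and timed_run_ms are my own.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Same arithmetic as the CPU routine above, applied m times per element.
void some_calculations_cpu(std::vector<float> &a, unsigned int m)
{
    for (std::size_t i = 0; i < a.size(); ++i)
        for (unsigned int j = 0; j < m; ++j)
            a[i] = static_cast<float>(a[i] * a[i] * 0.1 - a[i] - 10);
}

// Runs the routine in place and returns the elapsed wall-clock time in ms.
double timed_run_ms(std::vector<float> &a, unsigned int m)
{
    const std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    some_calculations_cpu(a, m);
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    assert(elapsed.count() >= 0.0);          // steady_clock never goes backwards
    return elapsed.count();
}
```

The returned value plays the role of sExecutionTime; for a fair comparison with the GPU path, the input array should be filled the same way in both runs.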