Simple GPU Path Tracing, Part 1.1: Adding a CUDA backend to the project

 
Here's the code for this part 



Adding a CUDA backend

I also wanted to add a CUDA back end to this project. I've used a lot of OpenGL in the past, but not much CUDA, so this was a good opportunity to learn more about the API and to see how performance might differ between the two.
 
I want to have a single path tracing code file that runs on both the CUDA and OpenGL backends.
The catch is that CUDA kernels are written in C++, whereas OpenGL compute shaders are written in GLSL.
So we'll have to use some trickery to achieve that, but it will work eventually!

In the CMake project, we add some compile-time definitions that tell us which backend we're using. We can access those variables in the preprocessor and use #if statements with them. That's how we're going to execute specific code for each backend.
set(API_GL 0)
set(API_CU 1)
set(API ${API_CU})

add_definitions(-DAPI_GL=${API_GL})
add_definitions(-DAPI_CU=${API_CU})
add_definitions(-DAPI=${API})
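
In C++ code, those definitions can then be tested like this (a trivial sketch; the real usage shows up in App.cu later in this post):

#if API == API_GL
    // OpenGL-specific path
#elif API == API_CU
    // CUDA-specific path
#endif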

Writing to an image from a CUDA kernel

We don't have to use a texture object to represent images in CUDA; we can just use a simple GPU buffer that stores vec4 values.
I added a simple bufferCu class to do that. Here's how I create it in the InitGpuObjects() function: 
    RenderBuffer = std::make_shared<bufferCu>(Window->Width * Window->Height * 4 * sizeof(float));
It will simply call cudaMalloc() to allocate the memory, and that's pretty much it.
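
For reference, here's a minimal sketch of what such a buffer class might look like; the real bufferCu in the repository may differ (error checking omitted):

#include <cuda_runtime.h>

struct bufferCu
{
    void *Data = nullptr;     // raw device pointer
    size_t SizeBytes = 0;

    bufferCu(size_t Size) : SizeBytes(Size)
    {
        cudaMalloc(&Data, Size);   // allocate GPU memory
    }

    ~bufferCu()
    {
        cudaFree(Data);            // release GPU memory
    }
};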

OpenGL interop

Writing into this buffer from a CUDA kernel is cool and all, but how do we actually show that buffer on the screen, given that we're using ImGui, which can only display OpenGL textures?
 
That's where we'll use the OpenGL interop that CUDA offers: it's possible to allocate some CUDA device memory and bind it to the memory of an OpenGL texture.

We'll create a cudaTextureMapping struct that will take care of all the interop between the CUDA buffer and the GL texture. The code is not too interesting, so I won't show it here, but check the link above if you're curious.
We can then call:
std::shared_ptr<cudaTextureMapping> CreateMapping(std::shared_ptr<textureGL> Tex,
                                                  bool Write = false);
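
If you're curious what the mapping involves without opening the repository, here's a rough sketch of the CUDA/OpenGL interop calls it is built on. The member and parameter names mirror the ones used later in this post (CudaTextureArray, TextureID), but the actual implementation may differ:

#include <cuda_gl_interop.h>
#include <memory>

struct cudaTextureMapping
{
    cudaGraphicsResource_t CudaResource = nullptr;  // handle to the registered GL texture
    cudaArray_t CudaTextureArray = nullptr;         // CUDA view of the texture memory
};

std::shared_ptr<cudaTextureMapping> CreateMapping(std::shared_ptr<textureGL> Tex, bool Write = false)
{
    auto Mapping = std::make_shared<cudaTextureMapping>();
    // Register the GL texture as a CUDA graphics resource
    // (the Write flag could be used to pick more specific register flags).
    cudaGraphicsGLRegisterImage(&Mapping->CudaResource, Tex->TextureID,
                                GL_TEXTURE_2D, cudaGraphicsRegisterFlagsNone);
    // Map it and retrieve the cudaArray we can copy into.
    cudaGraphicsMapResources(1, &Mapping->CudaResource);
    cudaGraphicsSubResourceGetMappedArray(&Mapping->CudaTextureArray, Mapping->CudaResource, 0, 0);
    return Mapping;
}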

Once we have that, we can copy the content of our CUDA RenderBuffer into the device memory of the mapping, and that will effectively write into the OpenGL texture memory!
We can then just display that texture on the screen, and that's it.
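
Displaying it is then a regular ImGui image call, along these lines (a minimal sketch; the exact UI code depends on how the window is laid out):

// Inside the UI loop, after the texture content has been updated:
ImGui::Image((ImTextureID)(intptr_t)RenderTexture->TextureID,
             ImVec2((float)Window->Width, (float)Window->Height));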
 
So, in the InitGpuObjects() function, I'll create a RenderTexture that we'll use to create a mapping with CUDA device memory: 
    RenderTexture = std::make_shared<textureGL>(Window->Width, Window->Height, 4);
    RenderTextureMapping = CreateMapping(RenderTexture);    
 
Cool, now we have everything in place to start writing the CUDA kernel.

The CUDA kernel

The tricky part here is that we want to use the same code for the CUDA kernel and the OpenGL compute shader. C++ and GLSL are not too dissimilar, so we can use preprocessor macros to make it work. I don't know if it's the best solution, but it works.
 
I'll create a PathTraceCode.cpp file that will contain the actual path tracing code. This code will compile both as GLSL and as a C++ CUDA kernel.
 
Here's the content of this file: 

MAIN()
{
    INIT()

    ivec2 ImageSize = IMAGE_SIZE(RenderImage);
    int Width = ImageSize.x;
    int Height = ImageSize.y;

    uvec2 GlobalID = GLOBAL_ID();
    if (GlobalID.x < Width && GlobalID.y < Height) {
        vec2 UV = vec2(GlobalID) / vec2(ImageSize);
        imageStore(RenderImage, ivec2(GlobalID), vec4(UV, 0, 1));
    }
}
 
As you can see, there are a few macro functions being called here.
That's where the differences between compute shaders and CUDA kernels live: each macro is defined differently depending on whether we compile for CUDA or OpenGL.
 

CUDA macros


For CUDA, we define those macros in PathTrace.cu, and include PathTraceCode.cpp in that file.
Here's how it looks: 
#pragma once
#include <glm/glm.hpp>
using namespace glm;

__device__ u32 Width;
__device__ u32 Height;

#define MAIN() \
__global__ void TraceKernel(glm::vec4 *RenderImage, int _Width, int _Height)

#define INIT() \
    Width = _Width; \
    Height = _Height;

#define IMAGE_SIZE(Img) \
    ivec2(Width, Height)

#define GLOBAL_ID() \
    uvec2(blockIdx.x * blockDim.x + threadIdx.x, blockIdx.y * blockDim.y + threadIdx.y)


__device__ void imageStore(vec4 *Image, ivec2 p, vec4 Colour)
{
    Image[p.y * Width + p.x] = Colour;
}

The first thing to note is the using namespace glm: because glm maps almost directly onto the GLSL vector types, we can just use the same code and it will work in most cases (glm doesn't really support swizzle operators, for example).

Then, the MAIN() macro expands to the definition of a CUDA kernel with some inputs. In this case, the inputs are the image buffer, and the width and height of the image.

We want to access those inputs everywhere in the code, just like we could in a GL compute shader, where all the inputs are declared at global scope.
To do that, for each input to the kernel we declare a variable at global scope and assign it at the start of the kernel in the INIT() macro.

Then, GLOBAL_ID() returns the index of the current thread. It's effectively the pixel index.

We also define the imageStore function, which exists in GLSL but not in CUDA; here it just writes the colour into the linear buffer at the given pixel coordinates.
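
To make the trick concrete, this is roughly what the PathTraceCode.cpp above turns into once the CUDA macros are expanded:

__global__ void TraceKernel(glm::vec4 *RenderImage, int _Width, int _Height)
{
    // INIT(): copy the kernel arguments into the global __device__ variables
    Width = _Width;
    Height = _Height;

    // IMAGE_SIZE(RenderImage): just reads the globals back
    ivec2 ImageSize = ivec2(Width, Height);
    int Width = ImageSize.x;   // local copies shadow the globals from here on
    int Height = ImageSize.y;

    // GLOBAL_ID(): the usual CUDA thread index computation
    uvec2 GlobalID = uvec2(blockIdx.x * blockDim.x + threadIdx.x,
                           blockIdx.y * blockDim.y + threadIdx.y);
    if (GlobalID.x < Width && GlobalID.y < Height) {
        vec2 UV = vec2(GlobalID) / vec2(ImageSize);
        imageStore(RenderImage, ivec2(GlobalID), vec4(UV, 0, 1));
    }
}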

OpenGL macros

Here's the content of PathTrace.glsl: 
#version 460
#include Inputs.glsl

#define INIT()

#define MAIN()  void main()

#define GLOBAL_ID() \
    gl_GlobalInvocationID.xy

#define IMAGE_SIZE(Img) \
    imageSize(Img)

#include ../PathTraceCode.cpp
 

It's much simpler than the CUDA backend.
INIT() is empty, MAIN() simply resolves to void main(), and so on.
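
One thing worth mentioning: the compute-specific declarations presumably live in Inputs.glsl, which I'm not showing here. I'd expect it to declare at least the work group size and the RenderImage binding, something along these lines (an assumption on my part, check the repository for the real file):

layout(local_size_x = 16, local_size_y = 16, local_size_z = 1) in;
layout(binding = 0, rgba32f) uniform image2D RenderImage;

The 16x16 work group size matches the dispatch dimensions used in App.cu below.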

The PathTrace.glsl and PathTrace.cu files will not evolve much in the future. All of the code will mainly be written in the PathTraceCode.cpp file.

Granted, it's a bit convoluted, but it's very helpful to maintain a single code base for the main path tracing algorithm. Writing a path tracer is a very iterative process, and it would be a pain to maintain two different files for the two backends.


App.cu

We had to change the extension of App.cpp to App.cu, because we will now be launching CUDA kernels from this file, and it therefore needs to be compiled as CUDA code.

That's where we will be using our #if API==API_GL and #elif API==API_CU directives.
Here's the InitGpuObjects() function now: 
#if API==API_GL
    PathTracingShader = std::make_shared<shaderGL>("resources/shaders/PathTrace.glsl");
    RenderTexture = std::make_shared<textureGL>(Window->Width, Window->Height, 4);
#elif API==API_CU
    RenderBuffer = std::make_shared<bufferCu>(Window->Width * Window->Height * 4 * sizeof(float));
    RenderTexture = std::make_shared<textureGL>(Window->Width, Window->Height, 4);
    RenderTextureMapping = CreateMapping(RenderTexture);    
#endif
 
The GL code didn't change. For CUDA, we create the RenderBuffer, and then the texture and the mapping.

And the Run() function is where we launch the shaders: 
#if API==API_GL
        PathTracingShader->Use();
        PathTracingShader->SetTexture(0, RenderTexture->TextureID, GL_READ_WRITE);
        PathTracingShader->Dispatch(Window->Width / 16 + 1, Window->Height / 16 +1, 1);
#elif API==API_CU
        dim3 blockSize(16, 16);
        dim3 gridSize((Window->Width / blockSize.x)+1, (Window->Height / blockSize.y) + 1);
        TraceKernel<<<gridSize, blockSize>>>((glm::vec4*)RenderBuffer->Data, Window->Width, Window->Height);
        cudaMemcpyToArray(RenderTextureMapping->CudaTextureArray, 0, 0, RenderBuffer->Data, Window->Width * Window->Height * sizeof(glm::vec4), cudaMemcpyDeviceToDevice);
#endif

The GL code didn't change here either.
For CUDA, we just launch the kernel and copy the RenderBuffer content into the RenderTextureMapping array.
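
One small note: cudaMemcpyToArray is marked as deprecated in recent CUDA toolkits. The same copy can be expressed with cudaMemcpy2DToArray if you want to avoid the deprecation warning, roughly like this:

size_t RowBytes = Window->Width * sizeof(glm::vec4);
cudaMemcpy2DToArray(RenderTextureMapping->CudaTextureArray, 0, 0,
                    RenderBuffer->Data,
                    RowBytes,          // source pitch
                    RowBytes,          // width of the copied region, in bytes
                    Window->Height,    // height, in rows
                    cudaMemcpyDeviceToDevice);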

Results


 
 
 

Links

