With our CPU ray tracer in our hand, we can handle multiple materials and create some beautiful images. However, the process is slow, and we can optimize many things in our code. In this post, we will optimize our ray tracer with acceleration structures, tiled multithreading and some linearization.

Bounding Volume Hierarchy

In our ray tracer, we need to check every object in the scene to find the closest intersection point. This process is slow, especially when we have many objects, like millions of triangles, in the scene. To speed up this process, we can use a Bounding Volume Hierarchy (BVH) structure. BVH is a tree structure that divides the scene into volumes and stores the objects in these volumes. With this structure, we can check the intersection with the volumes first, and if the ray intersects with the volume, we can check the objects in that volume. This way, we can reduce the number of intersection checks and speed up the process.

First, I thought implementing a basic BVH with nodes that have pointers in them pointing other nodes or objects. However, this approach is not cache-friendly and can be slow. I looked into the linear versions of BVH and seemed simple enough to create it from the scratch. First, I created the node structure to hold the bounding box and the children nodes, and for the leaf nodes we need to have the object positions in the respected structure.

struct BVHNode
{
	vec3 aabb_min, aabb_max;
	uint32_t left_child;
	uint32_t first, count;
	bool is_leaf() const { return count > 0; }
};

At first, I thought having the right child index and a bool to hold the leaf status would be enough. However, I realized that I only need to have the left child index, and then while traversing the tree I can calculate the right child index with the placements I made while building the BVH. And for the leaf nodes, I can use the count variable to hold the number of objects in the leaf node, and if the child count is zero, then it is not a leaf node, because In BVH, only the leaf nodes have triangles in them. This made my node structure 36 bytes, which is a good size for cache-friendly operations.

After creating the BVH structure, I needed to build the BVH. To build the BVH, as a first step, I calculated the centroids of the triangles and to split each node, I used a basic split mode which halves the most extended axis of the bounding box. After the split, I created the left and right child nodes and assigned the triangles to the respected child nodes. I repeated this process until I reached the leaf node.

And voila, I have a BVH structure that can handle the scene with millions of triangles. For testing, I used the Dragon model which has 869928 triangles, and the BVH structure handled it like a charm.

Dragon model with BVH structure rendered in 789ms

Before the BVH structure, the same scene was rendered in 15-20 mins. This is a huge improvement, like 20x faster.

However, while implementing the BVH structure, I realized that there are some black artifacts present in the image. At first, I thought that BVH structure is causing this, however the issue was the intersection code. I implemented the Möller–Trumbore intersection algorithm, I tried to adjust the epsilon values to fix the issue, it did help decrease the artifacts, however, it did not eliminate them. To solve this issue, I jumped back to the barycentric coordinates method, and adjusts the epsilons accordingly, and the artifacts were gone. And to my surprise, the barycentric coordinates method was faster than the Möller–Trumbore algorithm, at least in my implementation.

Dragon model without artifacts in 730ms

Tiled Multithreading

There is an indeed huge speedup with the BVH structure, however, we can still optimize the process. In the previous post, I mentioned that I used a simple multithreading approach. While examining the code, I realized that I spawned a thread for each pixel (I used std::future and async, and I believe behind the scenes It doesn’t use a thread pool). To fix this issue, I implemented a basic thread pool, in which each thread gets the jobs from a job queue until there is no job left in the queue. Which ensures that no thread is waiting for a job after finishing the previous one, if we enqueue jobs fast enough. While doing the pooled multithreading, I also implemented a tiled multithreading approach. In this approach, I divided the image into 32x32 tiles, and enqueued each tile as a job into the pool. While testing the structure I came across a deadlock which prevented the threads getting new jobs as they are not notified when a job is available if we wait all the threads to finish. I solved this issue by adding a second conditional variable. The results were satisfying, for the dragon model I can render the scene in 60ms, which is a huge improvement in addition to the BVH structure. In the end, we decreased the render time from 15-20 mins to 60ms. The job loop for the thread pool is below:

    ThreadPool(size_t threads) {
        for(size_t i = 0; i < threads; ++i) {
            workers.emplace_back([this] {
                while(true) {
                    std::function<void()> task;
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        task_condition.wait(lock, [this] {
                            return stop || !tasks.empty();
                        });
                        
                        if(stop && tasks.empty()) {
                            return;
                        }
                        
                        task = std::move(tasks.front());
                        tasks.pop();
                    }
                    
                    active_tasks++;
                    task();
                    
                    {
                        std::unique_lock<std::mutex> lock(queue_mutex);
                        active_tasks--;
                    }
                    finished_condition.notify_all();
                }
            });
        }
    }

Instancing and Transformations

If we have a fast ray tracer with a BVH structure, we can push the limits and have multiple of these huge models. And in fact, if we want to have multiple of the same model, we can cut from the memory cost by using instancing. We can just store each base model once in our memory and after that we can just reference the base model for other instances. However, to have meaningful renders, we need to move them around, rotate them, and scale them. To do this, we can implment transformations and apply them to the instances. For the transformations, I implemented a 4x4 matrix structure and converted each transformation into the matrix equivalent.

To implmement instancing to our BVH structure, I created a new structure called BVHInstance to reference the base model’s BVH structure, the transformation matrix and the material information.

struct BVHInstance : public hittable
{
	std::shared_ptr<BVH> bvh;
	std::shared_ptr<material> mat_ptr;

	BVHInstance(std::shared_ptr<BVH> _bvh) : bvh(_bvh) {}
	BVHInstance(std::shared_ptr<BVH> _bvh, mat4 _model) : bvh(_bvh) { model = _model; }

	bool hit(const ray& r, double tMin, double tMax, hitRecord& rec, mat4* model) const override;
};

Hitting Transformed Instances

These instances are nice and can be created easily, but our camera spawns rays in the world space, and we need to transform these rays into the object space to check the intersections. To do this, we can use the inverse of the transformation matrix. We need to transform the ray’s origin and direction, with respect to this inverse matrix. For a small note, we need to add 4th elements to the ray’s origin and direction to make the matrix multiplication work with the 4x4 matrix. This is a common practice in computer graphics, and it is called homogenous coordinates. We add 0 as the 4th element for the direction and 1 for the points, because the directions do not need to be translated, only the points need to be translated, which we calculate our matrices accordingly.

auto invModel = object->model.inverse();
ray newRay = r;
vec4 origin = vec4(newRay.origin(), 1.0f);
vec4 direction = vec4(newRay.direction(), 0.0f);

vec4 newOrigin = invModel * origin;
vec4 newDirection = invModel * direction;

newRay = ray(vec3(newOrigin.x(), newOrigin.y(), newOrigin.z()), 
             vec3(newDirection.x(), newDirection.y(), newDirection.z()));

After this implementation, we can have multiple instances of the same model with different transformations and materials.

9 Dragon instances rendered in 1188ms

Issue with Normals

While implementing the transformations, I realized that the normals are not transformed correctly. At first, I thought that the normal transformation calculations are the suspect, (we transform normal values with respect to the inverse transpose of the model matrix), but this was not the case. In the previous iteration, I calculated the normals in the hit function of the objects, and if the normal does not face the ray direction, I negated it, and made the refraction calculations based on this assumption. I believe, this model has some planar surfaces, and the normal calculations are not correct for these surfaces. To fix this issue, I calculated the normals in the intersection functions of the objects, and I did not negate the normals. And for the refracted materials, I negated the normals if it is a backface.

Faulty normals of the grass

Correct version of the normals rendered in 6679ms

Back to the BVH, is a better split possible?

While testing the BVH structure, some scenes, didn’t render as fast as I expected. While experimenting with some stuff on the code, I noticed that some leaf nodes didn’t split any triangles, and have all the triangles in one child which can have like 240 triangles which is bad. To solve this issue I tried to split the triangles equally, for each axis, I tried different split points and calculated the cost of the split. I used a very basic cost function, which is the count difference between the nodes after we split them. Then I used the minimun cost split point for the split. This approach worked well, and the render times decreased for some meshes. However, for some meshes like the dragon model, the render time increased. I believe this is the result of my cost functions, it can work well for meshes with longer triangles, however bad for meshes with shorter and smaller triangles. I am not exactly sure this is the cause, it is just a guess and needs further investigation.

For the windmill scene, it worked well, and the render time decreased from 1080ms to 484ms.

Windmill scene rendered with a better split in 484ms, 2x faster than the basic split method

Too Many BVHs, Time For a BVH of BVHs Also Known As Top Level Acceleration Structures

In the previous windmill scene, there are 20 meshes to render, and each mesh has its own BVH structure. And our rays won’t intersect with many of them, in this case we don’t have many meshes, but we still can reduce the number of intersection checks. To do this, we can create a top-level BVH structure, which holds the BVH structures of the meshes. We can check the intersection with the top-level BVH first, then with the BVH structures of the meshes. However, there is a catch, we need to transform the bounding boxes of these BVH structures with respect to their model matrices and move them into the word space to correctly build the top level BVH structure. I used the same building method and split function to build the top-level BVH structure, the main difference is that leaf nodes hold the BVH structures of the meshes.

The results for the windmill scene were satisfying, the render time decreased from 484ms to 419ms. However, for the grass scene this approach increased the render time from 6679ms to 7226ms. I believe this is the result of the additional intersections with bounding boxes, which can be unnecessary for some scenes. I believe this approach can be useful for scenes with many meshes, and for scenes with a few meshes, it can be slower than the basic BVH structure.

Windmill scene rendered with a TLAS BVH in 419ms

Timings with Different Approaches

Scene	Half Split BVH Tiled MP	Experimental Split BVH Tiled MP	Half Split TLAS BVH Tiled MP	Experimental Split TLAS BVH Tiled MP	Experimental Split BVH Basic MP
Dragon	84ms	363ms	90ms	375ms	1051ms
Glaring Davids	283ms	748ms	222ms	757ms	1324ms
Lobster	837ms	2211ms	893ms	2274ms	3510ms
Science Tree	145ms	157ms	157ms	173ms	1179ms
Ton Roosendaal	73ms	68ms	84ms	76ms	312ms
Dragon Metal	1225ms	8773ms	1315ms	7327ms	9803ms
Ellipsoids	101ms	103ms	1086ms	112ms	744ms
Marching Dragons	1227ms	3655ms	1230ms	3673ms	5699ms
Metal Glass Plates	632ms	747ms	639ms	688ms	1163ms
Simple Transform	46ms	44ms	45ms	47ms	741ms
Spheres	87ms	90ms	106ms	104ms	726ms
Two Berserkers	364ms	357ms	210ms	130ms	993ms
Windmill	1158ms	483ms	961ms	444ms	1034ms
Grass Desert	6930ms	7886ms	7330ms	8512ms	10097ms