Blog

A Platform-Independent Thread Pool Using C++14

Introduction

One of the major benefits provided by the new generation of graphics APIs is much better support for multithreaded command list generation and submission.  It’s not uncommon for computers nowadays to contain 2, 4, 8, or even 16 core processors.  The goal of the solution in this post is to ensure we can use the power our CPU provides, not just for generating graphics command lists, but for any task that can be easily parallelized.

At its simplest, a thread pool is a collection of threads that run continuously, waiting to take on a task to complete.  If there’s no task available, they yield or sleep for some amount of time, wake back up, and check again.  When a task is available, one of the waiting threads claims it, runs it, and returns to the waiting state.

The reason we would want to use a thread pool instead of creating new threads over and over for each task we want to run on a separate thread is to save on the time it would otherwise take to construct a thread, submit work to it, and deconstruct it when it’s done running.  With a small collection of threads continuously running and waiting on tasks, we’re only left with the middle step – work submission.

Implementation

The thread pool presented here is based on the implementation provided in [1].  It has been updated to include variadic argument support for added flexibility.

A Thread-Safe Queue

Before we build the pool itself, we need a means of submitting work in a thread-safe manner.  Jobs should be picked up in the same order they are submitted to the pool, which means a queue is a good candidate.  Jobs are pushed to the back of the queue, and popped from the front.

/**
 * The ThreadSafeQueue class.
 * Provides a wrapper around a basic queue to provide thread safety.
 */
#pragma once

#ifndef THREADSAFEQUEUE_HPP
#define THREADSAFEQUEUE_HPP

#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

namespace MyNamespace
{
	template <typename T>
	class ThreadSafeQueue
	{
	public:
		/**
		 * Destructor.
		 */
		~ThreadSafeQueue(void)
		{
			invalidate();
		}

		/**
		 * Attempt to get the first value in the queue.
		 * Returns true if a value was successfully written to the out parameter, false otherwise.
		 */
		bool tryPop(T& out)
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			if(m_queue.empty() || !m_valid)
			{
				return false;
			}
			out = std::move(m_queue.front());
			m_queue.pop();
			return true;
		}

		/**
		 * Get the first value in the queue.
		 * Will block until a value is available unless clear is called or the instance is destructed.
		 * Returns true if a value was successfully written to the out parameter, false otherwise.
		 */
		bool waitPop(T& out)
		{
			std::unique_lock<std::mutex> lock{m_mutex};
			m_condition.wait(lock, [this]()
			{
				return !m_queue.empty() || !m_valid;
			});
			/*
			 * Using the condition in the predicate ensures that spurious wakeups with a valid
			 * but empty queue will not proceed, so only need to check for validity before proceeding.
			 */
			if(!m_valid)
			{
				return false;
			}
			out = std::move(m_queue.front());
			m_queue.pop();
			return true;
		}

		/**
		 * Push a new value onto the queue.
		 */
		void push(T value)
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			m_queue.push(std::move(value));
			m_condition.notify_one();
		}

		/**
		 * Check whether or not the queue is empty.
		 */
		bool empty(void) const
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			return m_queue.empty();
		}

		/**
		 * Clear all items from the queue.
		 */
		void clear(void)
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			while(!m_queue.empty())
			{
				m_queue.pop();
			}
			m_condition.notify_all();
		}

		/**
		 * Invalidate the queue.
		 * Used to ensure no conditions are being waited on in waitPop when
		 * a thread or the application is trying to exit.
		 * The queue is invalid after calling this method and it is an error
		 * to continue using a queue after this method has been called.
		 */
		void invalidate(void)
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			m_valid = false;
			m_condition.notify_all();
		}

		/**
		 * Returns whether or not this queue is valid.
		 */
		bool isValid(void) const
		{
			std::lock_guard<std::mutex> lock{m_mutex};
			return m_valid;
		}

	private:
		std::atomic_bool m_valid{true};
		mutable std::mutex m_mutex;
		std::queue<T> m_queue;
		std::condition_variable m_condition;
	};
}

#endif

Most of this is pretty standard fare for designing a thread-safe class.  We lock a mutex anytime we need to read or write data and provide a simplified interface over a std::queue where pop operations are checked for validity before being performed.  This is why tryPop and waitPop return bools indicating success and write to the provided out parameter only in successful cases.

Any time push is called with a new task, it calls notify_one() on the condition variable which will wake one thread blocked on the condition.  The mutex is locked, the predicate is checked, and if all conditions are met (the queue is not empty and the queue is still valid), a task is popped and returned from the queue.

Because this queue provides a blocking method, waitPop, that depends on the condition variable being notified to continue, it also needs a way to signal anything waiting on the condition when the queue must be destroyed while threads are still blocked on it.  This is accomplished through the invalidate() method, which first sets the m_valid member to false and then calls notify_all() on the condition variable.  This wakes up every thread blocked on the condition, and waitPop returns false, indicating to the call site that no work is being returned.

Another nicety of waiting on the condition variable with a predicate is protection from spurious wakeups [3].  If a spurious wakeup does occur and the predicate isn’t satisfied, the thread simply goes back to waiting.
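
As a quick illustration, below is a minimal sketch of driving the queue from a producer and a consumer thread.  It assumes the header above is saved as ThreadSafeQueue.hpp; note that invalidate wakes the consumer and waitPop returns false, so any values not yet popped are simply dropped.

#include "ThreadSafeQueue.hpp"

#include <iostream>
#include <thread>

int main()
{
	MyNamespace::ThreadSafeQueue<int> queue;

	// Consumer: blocks in waitPop until a value arrives or the queue is invalidated.
	std::thread consumer{[&queue]()
	{
		int value = 0;
		while(queue.waitPop(value))
		{
			std::cout << "Consumed " << value << '\n';
		}
	}};

	// Producer: push a handful of values, then shut the queue down.
	for(int i = 0; i < 5; ++i)
	{
		queue.push(i);
	}
	queue.invalidate();  // wakes the consumer; waitPop returns false and its loop exits

	consumer.join();
	return 0;
}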

The Thread Pool

The implementation of the thread pool is shown below.

/**
 * The ThreadPool class.
 * Keeps a set of threads constantly waiting to execute incoming jobs.
 */
#pragma once

#ifndef THREADPOOL_HPP
#define THREADPOOL_HPP

#include "ThreadSafeQueue.hpp"

#include <algorithm>
#include <atomic>
#include <cstdint>
#include <functional>
#include <future>
#include <memory>
#include <thread>
#include <type_traits>
#include <utility>
#include <vector>

namespace MyNamespace
{
	class ThreadPool
	{
	private:
		class IThreadTask
		{
		public:
			IThreadTask(void) = default;
			virtual ~IThreadTask(void) = default;
			IThreadTask(const IThreadTask& rhs) = delete;
			IThreadTask& operator=(const IThreadTask& rhs) = delete;
			IThreadTask(IThreadTask&& other) = default;
			IThreadTask& operator=(IThreadTask&& other) = default;

			/**
			 * Run the task.
			 */
			virtual void execute() = 0;
		};

		template <typename Func>
		class ThreadTask: public IThreadTask
		{
		public:
			ThreadTask(Func&& func)
				:m_func{std::move(func)}
			{
			}

			~ThreadTask(void) override = default;
			ThreadTask(const ThreadTask& rhs) = delete;
			ThreadTask& operator=(const ThreadTask& rhs) = delete;
			ThreadTask(ThreadTask&& other) = default;
			ThreadTask& operator=(ThreadTask&& other) = default;

			/**
			 * Run the task.
			 */
			void execute() override
			{
				m_func();
			}

		private:
			Func m_func;
		};

	public:
		/**
		 * A wrapper around a std::future that adds the behavior of futures returned from std::async.
		 * Specifically, this object will block and wait for execution to finish before going out of scope.
		 */
		template <typename T>
		class TaskFuture
		{
		public:
			TaskFuture(std::future<T>&& future)
				:m_future{std::move(future)}
			{
			}

			TaskFuture(const TaskFuture& rhs) = delete;
			TaskFuture& operator=(const TaskFuture& rhs) = delete;
			TaskFuture(TaskFuture&& other) = default;
			TaskFuture& operator=(TaskFuture&& other) = default;
			~TaskFuture(void)
			{
				if(m_future.valid())
				{
					m_future.get();
				}
			}

			auto get(void)
			{
				return m_future.get();
			}


		private:
			std::future<T> m_future;
		};

	public:
		/**
		 * Constructor.
		 */
		ThreadPool(void)
			:ThreadPool{std::max(std::thread::hardware_concurrency(), 2u) - 1u}
		{
			/*
			 * Always create at least one thread.  If hardware_concurrency() returns 0,
			 * subtracting one would turn it to UINT_MAX, so get the maximum of
			 * hardware_concurrency() and 2 before subtracting 1.
			 */
		}

		/**
		 * Constructor.
		 */
		explicit ThreadPool(const std::uint32_t numThreads)
			:m_done{false},
			m_workQueue{},
			m_threads{}
		{
			try
			{
				for(std::uint32_t i = 0u; i < numThreads; ++i)
				{
					m_threads.emplace_back(&ThreadPool::worker, this);
				}
			}
			catch(...)
			{
				destroy();
				throw;
			}
		}

		/**
		 * Non-copyable.
		 */
		ThreadPool(const ThreadPool& rhs) = delete;

		/**
		 * Non-assignable.
		 */
		ThreadPool& operator=(const ThreadPool& rhs) = delete;

		/**
		 * Destructor.
		 */
		~ThreadPool(void)
		{
			destroy();
		}

		/**
		 * Submit a job to be run by the thread pool.
		 */
		template <typename Func, typename... Args>
		auto submit(Func&& func, Args&&... args)
		{
			auto boundTask = std::bind(std::forward<Func>(func), std::forward<Args>(args)...);
			using ResultType = std::result_of_t<decltype(boundTask)()>;
			using PackagedTask = std::packaged_task<ResultType()>;
			using TaskType = ThreadTask<PackagedTask>;
			
			PackagedTask task{std::move(boundTask)};
			TaskFuture<ResultType> result{task.get_future()};
			m_workQueue.push(std::make_unique<TaskType>(std::move(task)));
			return result;
		}

	private:
		/**
		 * Constantly running function each thread uses to acquire work items from the queue.
		 */
		void worker(void)
		{
			while(!m_done)
			{
				std::unique_ptr<IThreadTask> pTask{nullptr};
				if(m_workQueue.waitPop(pTask))
				{
					pTask->execute();
				}
			}
		}

		/**
		 * Invalidates the queue and joins all running threads.
		 */
		void destroy(void)
		{
			m_done = true;
			m_workQueue.invalidate();
			for(auto& thread : m_threads)
			{
				if(thread.joinable())
				{
					thread.join();
				}
			}
		}

	private:
		std::atomic_bool m_done;
		ThreadSafeQueue<std::unique_ptr<IThreadTask>> m_workQueue;
		std::vector<std::thread> m_threads;
	};

	namespace DefaultThreadPool
	{
		/**
		 * Get the default thread pool for the application.
		 * This pool is created with std::thread::hardware_concurrency() - 1 threads.
		 */
		inline ThreadPool& getThreadPool(void)
		{
			static ThreadPool defaultPool;
			return defaultPool;
		}

		/**
		 * Submit a job to the default thread pool.
		 */
		template <typename Func, typename... Args>
		inline auto submitJob(Func&& func, Args&&... args)
		{
			return getThreadPool().submit(std::forward<Func>(func), std::forward<Args>(args)...);
		}
	}
}

#endif

There are a few pieces to touch on here.  First, we have an IThreadTask interface that defines an execute() pure virtual function.  The reason for this interface is simply so we can maintain a collection of tasks in one container type (the ThreadSafeQueue<T>).  ThreadTask<T> implements IThreadTask and takes a callable type T as its template parameter.

When constructing the thread pool, we attempt to read the number of hardware threads available to the system by using std::thread::hardware_concurrency().  We always ensure the pool is started with at least one thread running, and ideally started with hardware_concurrency – 1 threads running.  The reason for the minus one will be discussed later.  For each thread available, we construct a std::thread object that runs the private member function worker().

The worker function’s only job is to endlessly check the queue to see if there is work to be done and execute the task if there is.  Since we’ve taken care to design the queue in a thread-safe manner, we don’t need to do any additional synchronization here.  The thread will enter the loop, get to waitPop, and either pop and execute a queued task, or wait on a task to become available via the submit function.  If waitPop returns true, we know pTask has been written to and can immediately execute it.  If it returns false, it most likely means that the queue has been invalidated.

The submit function is the public-facing interface of the thread pool.  It starts by creating a few handy type definitions that make the actual implementation easier to follow.  First, the provided function and its arguments are bound into a callable object with no parameters using std::bind.  We need this so our ThreadTask<T> class can call execute on its functor without having to know the arguments that came with the original function.  We then create a std::packaged_task from the bound callable and extract the std::future from it before pushing the task onto the queue.  Here again, we do not need any additional synchronization thanks to the thread-safe implementation of the queue.

You’ll notice the std::future returned from the std::packaged_task is wrapped in a class called TaskFuture<T>.  This was a design decision based on the way I intend to use the pool in my specific application.  I wanted the futures to mimic the way std::async futures work, specifically that they block until their work is complete when they go out of scope and are destructed.  Futures obtained from a std::packaged_task don’t do this out of the box, so we give them a simple wrapper to emulate the behavior [2].  Like std::future, TaskFuture is move-only, so the synchronization does not have to occur in the method that submitted the work as long as the future is passed along from that method.
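
To make that destructor behavior concrete, here is a minimal sketch (assuming the pool header above is saved as ThreadPool.hpp) showing the difference between discarding and keeping the returned TaskFuture:

#include "ThreadPool.hpp"

#include <chrono>
#include <thread>

void example(void)
{
	using namespace MyNamespace;

	// The temporary TaskFuture returned here is destructed immediately,
	// so this statement blocks until the job has finished running.
	DefaultThreadPool::submitJob([]()
	{
		std::this_thread::sleep_for(std::chrono::milliseconds(100));
	});

	// Holding the future lets the calling thread keep working; the wait
	// happens either at the explicit get() or when 'result' goes out of scope.
	auto result = DefaultThreadPool::submitJob([]()
	{
		return 42;
	});
	// ... other work on the calling thread ...
	int value = result.get();
	(void)value;
}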

The queue’s invalidate method is called in the thread pool’s destroy() method, which runs from the destructor or when an exception is thrown while creating the threads in the constructor.  destroy() first sets the pool’s done flag to true, then invalidates the queue, and only then joins the threads.  The order is important to ensure the threads know to exit their worker functions instead of re-attempting to obtain more work from the invalidated queue.  Due to the way the predicate is set up on the queue’s condition variable, it is not an error to re-enter waitPop on an invalidated queue since it will just return false, but it is a waste of time.

An optional nicety I decided to throw in is the DefaultThreadPool namespace.  This creates a thread pool with the maximum number of threads as discussed previously and is accessible from anywhere in the application that includes the thread pool header.  I prefer using this as opposed to having each subsystem owning its own thread pool, but there’s nothing wrong with creating thread pool instances through the constructors, either.

Submitting Work to the Thread Pool

With the above in place, submitting work is as simple as including the thread pool’s header file and calling its submit function with a callable object and, optionally, arguments to pass to it.

auto taskFuture = DefaultThreadPool::submitJob([]()
{
	lengthyProcess();
});

auto taskFuture2 = DefaultThreadPool::submitJob([](int a, float b)
{
	lengthyProcessWithArguments(a, b);
}, 5, 10.0f);

If submitting a reference for an argument, it is important to remember to wrap it with std::ref or std::cref.

MyObject obj;
auto taskFuture = DefaultThreadPool::submitJob([](const MyObject& object)
{
	lengthyProcessThatNeedsToReadObject(object);
}, std::cref(obj));

Does It Work?

To ensure the thread pool and backing queue work not only in ideal cases, but also in the case where work is being submitted faster than the threads can take it on, we can write a little program that submits a bunch of jobs that sleep for a while and then synchronizes on them.  My machine reports eight as the result of std::thread::hardware_concurrency(), so I create a thread pool with seven threads.  The task I’m running is just to sleep whatever thread is executing for one second and finish.  I’ll submit twenty-one of these jobs to the pool.  We know that this would take about twenty-one seconds if executed serially, and since we’re running a thread pool with seven threads, we know that if everything is working well the jobs should all complete in about three seconds.

Timer saturationTimer;
const auto startTime = saturationTimer.tick();
std::vector<ThreadPool::TaskFuture<void>> v;
for(std::uint32_t i = 0u; i < 21u; ++i)
{
	v.push_back(DefaultThreadPool::submitJob([]()
	{
		std::this_thread::sleep_for(std::chrono::seconds(1));
	}));
}
for(auto& item: v)
{
	item.get();
}
const auto dt = saturationTimer.tick() - startTime;

Running the above code on my machine, the result is just about what would be expected, averaging around 3.005 seconds over a dozen runs.

About the Number of Pooled Threads

Earlier I mentioned that I start the thread pool with std::thread::hardware_concurrency() – 1 threads.  The reason for this is simple.  The thread that’s calling the thread pool is a perfectly valid thread to do work on while you’re waiting for the results of submitted tasks to become available.  Despite the example from the Does It Work? section, submitting a bunch of jobs and then just waiting on them to complete is hardly optimal, so it makes sense to have the thread pool executing up to NumThreads – 1 jobs and the main thread doing whatever work it can accomplish in the meantime.  Splitting the workload up evenly across all available threads is usually the best approach with a task-based setup like this.
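
As a sketch of what that split can look like, the example below divides a workload so that the calling thread handles one chunk itself while the pool handles the rest.  The processChunk and processAll helpers are hypothetical and only exist for illustration; they are not part of the pool.

#include "ThreadPool.hpp"

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-chunk work: square a contiguous range of values in place.
void processChunk(std::vector<int>& items, std::size_t start, std::size_t count)
{
	for(std::size_t i = start; i < start + count; ++i)
	{
		items[i] *= items[i];
	}
}

void processAll(std::vector<int>& items)
{
	using namespace MyNamespace;

	const std::size_t workerCount = std::max(std::thread::hardware_concurrency(), 2u);
	const std::size_t chunkSize = items.size() / workerCount;

	std::vector<ThreadPool::TaskFuture<void>> futures;
	for(std::size_t i = 0u; i + 1u < workerCount; ++i)
	{
		futures.push_back(DefaultThreadPool::submitJob([&items, i, chunkSize]()
		{
			processChunk(items, i * chunkSize, chunkSize);
		}));
	}

	// The calling thread handles the final chunk (plus any remainder) itself.
	const std::size_t lastStart = (workerCount - 1u) * chunkSize;
	processChunk(items, lastStart, items.size() - lastStart);

	for(auto& future : futures)
	{
		future.get();
	}
}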

Conclusion

This post has discussed what a thread pool is, why they’re useful, and how to get started implementing one.  There are very likely ways to make the provided thread pool more performant by specializing it more to avoid memory allocations on job submissions, but for my use cases I typically ensure the jobs being submitted are large enough that they make up for the time lost to allocating and deallocating memory with the time gained by running them in parallel with other large tasks.  Your mileage may vary, but at the very least you should have a solid start to customizing a thread pool to fit your exact needs.

Thank You

A big thank you to the members of /r/cpp who helped with code review and provided excellent feedback!

References

[1] Williams, Anthony.  C++ Concurrency in Action: Practical Multithreading.  ISBN: 9781933988771

[2] http://scottmeyers.blogspot.com/2013/03/stdfutures-from-stdasync-arent-special.html

[3] http://en.cppreference.com/w/cpp/thread/condition_variable/wait

Screen Space Glossy Reflections Demo Available

It’s Been a While

Earlier this year I wrote a post discussing an implementation of real-time screen space glossy reflections.  The post has received a lot of positive feedback, and I’ve had some very interesting conversations with various individuals since it went up discussing theory, details, shortcomings, and everything in between.  The response has been great, and I appreciate the community’s interest.  One request I’ve received a few times is for a working demo that users could play with to get a better feel for the effect in action.  I had originally hoped to finish updating the engine to support DirectX 12 before releasing anything, and while it’s probably about 90% done, there are still some areas that need work and my time lately has been limited.

Thankfully, it’s the year 2015 (for a little while longer) and we have this magical thing called source control.  I’ve decided to use a tag I created right before the DX 12 update began, and have modified it to provide a small demo for anyone that’s been waiting on it.  The good news is that it’s entirely DirectX 11-based, so the hardware support will be much broader than that of a DX 12 solution.  The downside is that a few improvements I’ve made since then, especially around blending missed ray hits with fallback solutions, won’t be present in the demo provided.  I should get a chance to release a demo with the new features once things settle down a bit, and all will be right with the world.

Demo Controls

Once the scene loads, anyone familiar with first-person applications should feel more or less at home with the basics.  A, W, S, and D control movement, with W and S moving the camera forwards and backwards, and A and D strafing the camera left and right.  The mouse controls where the camera looks.  The user is not glued to the ground, so the camera moves in whatever direction it is facing.  J and K control the floor roughness value, with J making the floor smoother and K making it rougher.  A uniform roughness texture is applied over the entire floor, but in a real-world application an artist-authored texture would be used to make the results much more convincing.  Q and E are used to change the time of day.

The Esc key is used to quit the application.  To restart the scene without exiting, use the left, right, or up arrow keys.

Some Ugliness

The fallback environment maps are set up exactly as they were in the original post.  Specifically, this means that a large area of the scene only has the global, undetailed environment map to fall back to.  This is quite noticeable in the beginning area of the scene underneath the characters.  If you move directly forward from the starting point of the scene, you’ll pass through a few walls and end up in an enclosed hallway-type structure.  This area does have localized environment maps to fall back to on ray hits, and the results are cleaner.  As stated in the first section, more work has been done to improve blending that is not present in the demo.

Besides the shortcomings of the screen space approach discussed in the original blog post, the stack of boxes in the scene still uses the engine’s old physics and collision system.  In the latest version, all of this has been updated to use the Bullet Physics library, but if you choose to knock the stack down (clicking the left mouse button throws a ball), be aware that you’re likely to see quite a bit of oddness.  That being said – go for it, it’s always fun to knock things over!

Also, ambient light is handled by sampling from environment maps placed throughout the scene.  To ensure maintaining these doesn’t become a bottleneck, only one is ever updated per frame, and they’re only updated when the lighting changes.  Namely, this means that as the time of day changes the environment maps will get rebuilt.  If the time of day changes slowly enough, as it would in a real-world application, these updates would be mostly unnoticeable.  However, since the user can control the time of day the overall lighting situation can change faster than the environment maps can keep up.  If the user holds down one of the keys to change the time of day, they’ll see stale lighting data being applied to most parts of the scene.  Once the key is released, the environment map renderer will catch up and the lighting will become coherent again.

The Demo

Below is a link to download the demo.  Feel free and encouraged to continue commenting, asking questions, and offering constructive criticism.

Download the demo here.

Screen Space Glossy Reflections

Introduction

Reflections are an important effect present in any routine attempting to approximate global illumination.  They give the user important spatial information about an object’s location, as well as provide an important visual indicator of the surface properties of certain materials.

For several years now, engineers and researchers in real-time graphics have worked towards improving reflections in their applications.  Simple implementations like cube maps used as reflection probes have been around for decades, while much newer techniques build upon their predecessors, such as parallax-corrected environment maps [4].

More recently, screen space ray tracing has become a widely used supplement to previously established methods of applying reflections to scenes.  The idea is simple enough – a view ray is reflected about a surface normal, then the resultant vector is stepped along until it has intersected the depth buffer.  With that location discovered, the light buffer is sampled and the value is added  to the final lighting result.  The below image shows a comparison of a scene looking at a stack of boxes without and with screen space ray tracing enabled.

sslr_off_on_comparison

In practice, there are more than a few pitfalls to this approach that need special care and handling to avoid.  The most obvious shortcoming of this and any other screen space effect is the limited information available.  If a ray doesn’t hit something before leaving the screen bounds, it will not return a result, even if its would-be collider is just barely off-screen.

This effect also tends to have a lot of trouble with rays facing back towards the viewer.  With a bit of thought, it makes sense that this presents an issue.  For one, if the ray reflects directly back at the viewer, it will never intersect the depth buffer, essentially degenerating into the off-screen case already discussed.  The other issue is similar, but maybe not as obvious.  If a ray is traveling back in the general direction of the viewer and it does intersect the depth buffer, it’s likely to do so on a face of an object that is facing away from the viewer.  This means that even if an intersection is reported, an incorrect result will be sampled from the light buffer at that position.  This can lead to ugly artifacts such as streaks across surfaces.  The figure below shows a top-down view of a ray being cast from the viewer, hitting a mirrored surface, and finally making contact with the back of a box.  Since, from the viewer’s perspective, the back of the box is not currently on-screen, erroneous results will be returned if that hit is used.

reflect_back_of_box

There are ways to mitigate many of these artifacts, including fallback methods and fading that will be addressed later on.

Glossy Ray Traced Reflections

One further challenge with the generic approach described above is that if the result is used directly, only perfectly mirror-like reflections can be generated.  In the real world, most surfaces do not reflect light perfectly, but instead scatter, absorb, and reflect it in varying proportions due to microfacets [9].  To account for this, the technique needs to not only consider where the ray intersects the depth buffer, but also the roughness of the reflecting surface and the distance the ray has traveled.  The following image shows a comparison of mirror-like and glossy reflections.  Notice on the right half of the image how the further the ray has to travel to make contact, the blurrier it becomes.

sslr_mirror_glossy_comparison

The rest of this post will re-touch on some of these issues as it discusses and provides a full implementation of ray tracing in screen space and creating glossy reflections via cone tracing.

Setting Up

The effects described in this post are implemented using DirectX 11 and HLSL.  That’s not at all to say those are mandatory for following along.  In fact, the ray tracing shader used below is a translation of one written in GLSL, which would use OpenGL as its graphics API.

This implementation was designed as part of a deferred shading pipeline.  The effect runs after geometry buffer generation and lighting has completed.  The ray tracing step needs access to the depth buffer and a buffer containing the normals of the geometry in view.  The blurring step needs access to the light buffer.  The cone tracing step needs access to all of the aforementioned buffers, including the resultant ray traced buffer and blurred light buffer, as well as a buffer containing the specular values for materials in view.  It is also beneficial to include a fallback buffer containing indirect specular contributions derived from methods such as parallax-corrected cube maps.  These will each be addressed as they are used in the implementation.

Therefore, the final list of buffers needed before starting the effect becomes:

  • Depth buffer – the implementation uses a non-linear depth buffer due to its ready availability after the geometry buffer is generated.  McGuire’s initial implementation [1] uses a linear depth buffer and may be more efficient.
  • Normal buffer – the geometry buffer used in this implementation stores all values in view space.  If the implementer stores their values in world space, they will need to be cognizant of the differences and prepared to apply appropriate transforms when necessary.
  • Light buffer – this is a buffer containing all lighting to be applied to the scene.  The exact values stored in this buffer will be refined further during implementation discussion.
  • Specular buffer – stored linearly as Fresnel reflectance at normal incidence (F(0°)) [5].  Some engines, such as Unreal Engine 4, have different workflows where this value may be hard-coded for dielectrics to a value of around 0.04 and stored in base color for metals.  The engine in use for this project is custom and stores the value directly.
  • Roughness buffer – this engine stores the roughness value in the w-component of the specular buffer, and is thus readily available when the previous buffer is bound.
  • Fallback indirect specular buffer – this buffer contains specular lighting values calculated before the ray tracing step using less precise techniques such as parallax-corrected cube maps and environment probes to help alleviate jarring discontinuities between ray hits and misses.

The depth buffer used in this implementation has 32 bits for depth.  All buffers containing lighting data are 16-bit-per-channel floating point buffers.

Also needed for this effect is a constant buffer containing values specific to the effect.  In the initial GLSL implementation these were passed as uniforms, but in HLSL we set up a constant buffer like so:

/**
 * The SSLRConstantBuffer.
 * Defines constants used to implement SSLR cone traced screen-space reflections.
 */
#ifndef CBSSLR_HLSLI
#define CBSSLR_HLSLI

cbuffer cbSSLR : register(b0)
{
 float2 cb_depthBufferSize; // dimensions of the z-buffer
 float cb_zThickness; // thickness to ascribe to each pixel in the depth buffer
 float cb_nearPlaneZ; // the camera's near z plane

 float cb_stride; // Step in horizontal or vertical pixels between samples. This is a float
 // because integer math is slow on GPUs, but should be set to an integer >= 1.
 float cb_maxSteps; // Maximum number of iterations. Higher gives better images but may be slow.
 float cb_maxDistance; // Maximum camera-space distance to trace before returning a miss.
 float cb_strideZCutoff; // More distant pixels are smaller in screen space. This value tells at what point to
 // start relaxing the stride to give higher quality reflections for objects far from
 // the camera.

 float cb_numMips; // the number of mip levels in the convolved color buffer
 float cb_fadeStart; // determines where to start screen edge fading of effect
 float cb_fadeEnd; // determines where to end screen edge fading of effect
 float cb_sslr_padding0; // padding for alignment
};

#endif

This constant buffer is contained in its own .hlsli file and included in the various steps where needed.  Most of the values map directly to uniform values in the GLSL implementation, and a few others will be discussed as they become pertinent.

Ray Tracing in Screen Space

The ray tracing portion of this technique is directly derived from Morgan McGuire and Mike Mara’s open source implementation of using the Digital Differential Analyzer (DDA) line algorithm to evenly distribute ray traced samples in screen space [1].  Their method handles perspective-correct interpolation of a 3D ray projected to screen space, and helps avoid over- and under-sampling issues present in traditional ray marches.  This helps more evenly distribute the limited number of samples that can be afforded per frame across the ray instead of skipping large portions at the start of the ray and bunching up samples towards the end.

McGuire and Mara’s initial implementation was presented in GLSL and assumed negative one (-1) to be the far plane Z value.  Below, the implementation has been converted to HLSL and uses positive one for the far plane.  The initial implementation also uses a linear depth buffer, though their accompanying paper provides source code for running the effect with a non-linear depth buffer.  The provided implementation assumes non-linear depth, and reconstructs linear Z values as they are sampled from the depth buffer using the methods described in [6].
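
For reference, a typical linearization of a non-linear depth buffer produced by a standard left-handed D3D-style perspective projection (an assumption here; the engine’s actual DepthUtils.hlsli implementation is not shown) recovers view-space Z from the sampled depth d, near plane n, and far plane f as

z_{view} = \frac{n \cdot f}{f - d \cdot (f - n)}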

// By Morgan McGuire and Michael Mara at Williams College 2014
// Released as open source under the BSD 2-Clause License
// http://opensource.org/licenses/BSD-2-Clause
//
// Copyright (c) 2014, Morgan McGuire and Michael Mara
// All rights reserved.
//
// From McGuire and Mara, Efficient GPU Screen-Space Ray Tracing,
// Journal of Computer Graphics Techniques, 2014
//
// This software is open source under the "BSD 2-clause license":
//
// Redistribution and use in source and binary forms, with or
// without modification, are permitted provided that the following
// conditions are met:
//
// 1. Redistributions of source code must retain the above
// copyright notice, this list of conditions and the following
// disclaimer.
//
// 2. Redistributions in binary form must reproduce the above
// copyright notice, this list of conditions and the following
// disclaimer in the documentation and/or other materials provided
// with the distribution.
//
// THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
// CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES,
// INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
// MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
// DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
// CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
// SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
// LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
// USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
// AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
// IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF
// THE POSSIBILITY OF SUCH DAMAGE.
/**
 * The ray tracing step of the SSLR implementation.
 * Modified version of the work stated above.
 */
#include "SSLRConstantBuffer.hlsli"
#include "../../ConstantBuffers/PerFrame.hlsli"
#include "../../Utils/DepthUtils.hlsli"

Texture2D depthBuffer : register(t0);
Texture2D normalBuffer: register(t1);

struct VertexOut
{
 float4 posH : SV_POSITION;
 float3 viewRay : VIEWRAY;
 float2 tex : TEXCOORD;
};

float distanceSquared(float2 a, float2 b)
{
 a -= b;
 return dot(a, a);
}

bool intersectsDepthBuffer(float z, float minZ, float maxZ)
{
 /*
 * Based on how far away from the camera the depth is,
 * adding a bit of extra thickness can help improve some
 * artifacts. Driving this value up too high can cause
 * artifacts of its own.
 */
 float depthScale = min(1.0f, z * cb_strideZCutoff);
 z += cb_zThickness + lerp(0.0f, 2.0f, depthScale);
 return (maxZ >= z) && (minZ - cb_zThickness <= z);
}

void swap(inout float a, inout float b)
{
 float t = a;
 a = b;
 b = t;
}

float linearDepthTexelFetch(int2 hitPixel)
{
 // Load returns 0 for any value accessed out of bounds
 return linearizeDepth(depthBuffer.Load(int3(hitPixel, 0)).r);
}

// Returns true if the ray hit something
bool traceScreenSpaceRay(
 // Camera-space ray origin, which must be within the view volume
 float3 csOrig,
 // Unit length camera-space ray direction
 float3 csDir,
 // Number between 0 and 1 for how far to bump the ray in stride units
 // to conceal banding artifacts. Not needed if stride == 1.
 float jitter,
 // Pixel coordinates of the first intersection with the scene
 out float2 hitPixel,
 // Camera space location of the ray hit
 out float3 hitPoint)
{
 // Clip to the near plane
 float rayLength = ((csOrig.z + csDir.z * cb_maxDistance) < cb_nearPlaneZ) ?
 (cb_nearPlaneZ - csOrig.z) / csDir.z : cb_maxDistance;
 float3 csEndPoint = csOrig + csDir * rayLength;

 // Project into homogeneous clip space
 float4 H0 = mul(float4(csOrig, 1.0f), viewToTextureSpaceMatrix);
 H0.xy *= cb_depthBufferSize;
 float4 H1 = mul(float4(csEndPoint, 1.0f), viewToTextureSpaceMatrix);
 H1.xy *= cb_depthBufferSize;
 float k0 = 1.0f / H0.w;
 float k1 = 1.0f / H1.w;

 // The interpolated homogeneous version of the camera-space points
 float3 Q0 = csOrig * k0;
 float3 Q1 = csEndPoint * k1;

 // Screen-space endpoints
 float2 P0 = H0.xy * k0;
 float2 P1 = H1.xy * k1;

 // If the line is degenerate, make it cover at least one pixel
 // to avoid handling zero-pixel extent as a special case later
 P1 += (distanceSquared(P0, P1) < 0.0001f) ? float2(0.01f, 0.01f) : 0.0f;
 float2 delta = P1 - P0;

 // Permute so that the primary iteration is in x to collapse
 // all quadrant-specific DDA cases later
 bool permute = false;
 if(abs(delta.x) < abs(delta.y))
 {
 // This is a more-vertical line
 permute = true;
 delta = delta.yx;
 P0 = P0.yx;
 P1 = P1.yx;
 }

 float stepDir = sign(delta.x);
 float invdx = stepDir / delta.x;

 // Track the derivatives of Q and k
 float3 dQ = (Q1 - Q0) * invdx;
 float dk = (k1 - k0) * invdx;
 float2 dP = float2(stepDir, delta.y * invdx);

 // Scale derivatives by the desired pixel stride and then
 // offset the starting values by the jitter fraction
 float strideScale = 1.0f - min(1.0f, csOrig.z * cb_strideZCutoff);
 float stride = 1.0f + strideScale * cb_stride;
 dP *= stride;
 dQ *= stride;
 dk *= stride;

 P0 += dP * jitter;
 Q0 += dQ * jitter;
 k0 += dk * jitter;

 // Slide P from P0 to P1, (now-homogeneous) Q from Q0 to Q1, k from k0 to k1
 float4 PQk = float4(P0, Q0.z, k0);
 float4 dPQk = float4(dP, dQ.z, dk);
 float3 Q = Q0; 

 // Adjust end condition for iteration direction
 float end = P1.x * stepDir;

 float stepCount = 0.0f;
 float prevZMaxEstimate = csOrig.z;
 float rayZMin = prevZMaxEstimate;
 float rayZMax = prevZMaxEstimate;
 float sceneZMax = rayZMax + 100.0f;
 for(;
 ((PQk.x * stepDir) <= end) && (stepCount < cb_maxSteps) &&
 !intersectsDepthBuffer(sceneZMax, rayZMin, rayZMax) &&
 (sceneZMax != 0.0f);
 ++stepCount)
 {
 rayZMin = prevZMaxEstimate;
 rayZMax = (dPQk.z * 0.5f + PQk.z) / (dPQk.w * 0.5f + PQk.w);
 prevZMaxEstimate = rayZMax;
 if(rayZMin > rayZMax)
 {
 swap(rayZMin, rayZMax);
 }

 hitPixel = permute ? PQk.yx : PQk.xy;
 // You may need hitPixel.y = depthBufferSize.y - hitPixel.y; here if your vertical axis
 // is different than ours in screen space
 sceneZMax = linearDepthTexelFetch(int2(hitPixel));

 PQk += dPQk;
 }

 // Advance Q based on the number of steps
 Q.xy += dQ.xy * stepCount;
 hitPoint = Q * (1.0f / PQk.w);
 return intersectsDepthBuffer(sceneZMax, rayZMin, rayZMax);
}

float4 main(VertexOut pIn) : SV_TARGET
{
 int3 loadIndices = int3(pIn.posH.xy, 0);
 float3 normalVS = normalBuffer.Load(loadIndices).xyz;
 if(!any(normalVS))
 {
 return 0.0f;
 }

 float depth = depthBuffer.Load(loadIndices).r;
 float3 rayOriginVS = pIn.viewRay * linearizeDepth(depth);

 /*
 * Since position is reconstructed in view space, just normalize it to get the
 * vector from the eye to the position and then reflect that around the normal to
 * get the ray direction to trace.
 */
 float3 toPositionVS = normalize(rayOriginVS);
 float3 rayDirectionVS = normalize(reflect(toPositionVS, normalVS));

 // output rDotV to the alpha channel for use in determining how much to fade the ray
 float rDotV = dot(rayDirectionVS, toPositionVS);

 // out parameters
 float2 hitPixel = float2(0.0f, 0.0f);
 float3 hitPoint = float3(0.0f, 0.0f, 0.0f);

 float jitter = cb_stride > 1.0f ? float(int(pIn.posH.x + pIn.posH.y) & 1) * 0.5f : 0.0f;

 // perform ray tracing - true if hit found, false otherwise
 bool intersection = traceScreenSpaceRay(rayOriginVS, rayDirectionVS, jitter, hitPixel, hitPoint);

 depth = depthBuffer.Load(int3(hitPixel, 0)).r;

 // move hit pixel from pixel position to UVs
 hitPixel *= float2(texelWidth, texelHeight);
 if(hitPixel.x > 1.0f || hitPixel.x < 0.0f || hitPixel.y > 1.0f || hitPixel.y < 0.0f)
 {
 intersection = false;
 }

 return float4(hitPixel, depth, rDotV) * (intersection ? 1.0f : 0.0f);
}

The DepthUtils.hlsli header contains the linearizeDepth function that’s used to convert a perspective-z depth into a linear value.  The PerFrame.hlsli header contains several values that are set at the start of a frame and remain constant throughout.  Of particular interest are texelWidth and texelHeight, which contain the texel size for the client (1 / dimension).  We use these values to convert pixel positions from the trace result into UV coordinates for easy lookup in subsequent steps.

An idea borrowed from Ben Hopkins (@kode80), who also open sourced an implementation of ray tracing based on McGuire’s initial work, is to use a cutoff value for the stride based on Z distance [2].  The idea is that as distance from the viewer grows and perspective projection makes objects smaller in screen space, the stride can be shortened and the ray is still likely to find its contact point.  This helps distant locations produce higher quality reflections than if they were to use a large stride similar to closer locations.  In the above implementation, this idea was extended by also adding additional thickness to surfaces as their distance from the viewer increases.  This resulted in fewer artifacts at shallow angles, where the rayZMin and rayZMax values would otherwise grow such that the sampled sceneZMax would be rejected by small margins.

Another interesting idea from Hopkins’ implementation was to store the values and the step derivatives in float4 types.  The goal is to encourage the HLSL compiler to take advantage of SIMD operations, since the same operations are applied to all of the components at the same time.  In practice, the output from the Visual Studio 2013 Graphics Debugger showed the bytecode was nearly identical between McGuire’s implementation and Hopkins’, but it was left in as a neat idea.

The image below shows the results of the ray tracing step.  The buffer values include the UV coordinates of the ray hit in the x and y components, the depth in the z component, and the dot product of the view ray and the reflection ray in the w component.  The value stored in the w-component is used in the cone tracing step to fade rays facing towards the camera.  Black pixels mark areas where no intersection occurred.

sslr_ray_traced_buffer

Blurring the Light Buffer

The next step to obtaining glossy reflections is to blur the light buffer.  Specifically, the light buffer is copied to the top-most mip level of a texture supporting a full mip chain, and from there the result is blurred into its lower mip levels.  A separable 1-dimensional Gaussian blur is used.  The below implementation uses a 7-tap kernel, but the implementer should experiment to find a kernel size appropriate for their particular needs.  First the blur is applied vertically to a temporary buffer, then the blur is applied horizontally to the next level down in the mip chain.  The following code listing shows a simple blur shader.  Note that to use it, two additional shaders are needed, each defining one of the pre-processor directives specifying directionality and then including the file below.

/**
 * The Convolution shader body.
 * Requires either CONVOLVE_VERTICAL or CONVOLVE_HORIZONTAL
 * to be defined.
 */
#ifndef CONVOLUTIONPS_HLSLI
#define CONVOLUTIONPS_HLSLI

#include "SSLRConstantBuffer.hlsli"

struct VertexOut
{
 float4 posH : SV_POSITION;
 float2 tex : TEXCOORD;
};

Texture2D colorBuffer : register(t1);

#if CONVOLVE_HORIZONTAL
static const int2 offsets[7] = {{-3, 0}, {-2, 0}, {-1, 0}, {0, 0}, {1, 0}, {2, 0}, {3, 0}};
#elif CONVOLVE_VERTICAL
static const int2 offsets[7] = {{0, -3}, {0, -2}, {0, -1}, {0, 0}, {0, 1}, {0, 2}, {0, 3}};
#endif
static const float weights[7] = {0.001f, 0.028f, 0.233f, 0.474f, 0.233f, 0.028f, 0.001f};

float4 main(VertexOut pIn): SV_Target0
{
 float2 uvs = pIn.tex * cb_depthBufferSize; // make sure to send in the SRV's dimensions for cb_depthBufferSize
 // sample level zero since only one mip level is available with the bound SRV
 int3 loadPos = int3(uvs, 0);

 float4 color = float4(0.0f, 0.0f, 0.0f, 1.0f);
 [unroll]
 for(uint i = 0u; i < 7u; ++i)
 {
 color += colorBuffer.Load(loadPos, offsets[i]) * weights[i];
 }
 return float4(color.rgb, 1.0f);
}

#endif

During the blur passes the constant buffer values storing depth buffer size for the rest of the effect are re-purposed for recovering the load positions for fetches from the bound texture.  At the end of all blur passes these values should be reset to the correct dimensions before proceeding.
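
To make the pass sequence concrete, below is a rough CPU-side sketch of one way to drive the blur down the mip chain.  Every name here (MipTexture, setBlurConstants, bindMipAsShaderResource, bindMipAsRenderTarget, setConvolutionShader, drawFullScreenQuad) is a hypothetical stand-in for the engine’s own D3D11 plumbing, and running the vertical pass at the source mip’s resolution is just one possible arrangement.

#include <algorithm>
#include <cstdint>

// Hypothetical handles/helpers standing in for the engine's D3D11 plumbing.
struct MipTexture;                                                    // texture with a full mip chain
void setBlurConstants(std::uint32_t width, std::uint32_t height);     // feeds cb_depthBufferSize
void bindMipAsShaderResource(MipTexture& texture, std::uint32_t mip); // binds to t1 in the shader
void bindMipAsRenderTarget(MipTexture& texture, std::uint32_t mip);
void setConvolutionShader(bool horizontal);                           // picks the vertical/horizontal variant
void drawFullScreenQuad();

// For each destination mip, blur the previous mip vertically into a temporary
// texture, then blur that horizontally into the destination mip of the light buffer.
void convolveLightBuffer(MipTexture& lightBuffer, MipTexture& tempBuffer,
	std::uint32_t mipCount, std::uint32_t width, std::uint32_t height)
{
	for(std::uint32_t mip = 1u; mip < mipCount; ++mip)
	{
		const std::uint32_t srcWidth = std::max(width >> (mip - 1u), 1u);
		const std::uint32_t srcHeight = std::max(height >> (mip - 1u), 1u);

		// Vertical pass: light buffer mip (mip - 1) -> temporary buffer.
		setBlurConstants(srcWidth, srcHeight);
		bindMipAsShaderResource(lightBuffer, mip - 1u);
		bindMipAsRenderTarget(tempBuffer, mip - 1u);
		setConvolutionShader(false);
		drawFullScreenQuad();

		// Horizontal pass: temporary buffer -> light buffer mip (mip).
		bindMipAsShaderResource(tempBuffer, mip - 1u);
		bindMipAsRenderTarget(lightBuffer, mip);
		setConvolutionShader(true);
		drawFullScreenQuad();
	}

	// Restore cb_depthBufferSize to the real depth buffer dimensions before cone tracing.
	setBlurConstants(width, height);
}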

Cone Tracing

At this point in the effect, the ray traced buffer is complete and the full mip chain of the light buffer has been generated.  The idea in this section comes from Yasin Uludag’s article in GPU Pro 5 [3].

It was mentioned earlier in the post that for glossy reflections to be represented, both the surface roughness and the distance traveled from the reflecting point to its point of contact needed to be accounted for.  Whereas a perfect mirror would cast a straight line outwards from the origin point, a rougher surface would cast a cone shape.  The figure below shows a representation of this phenomenon (albeit a bit crudely).

sslr_ray_vs_cone_comparison

With these observations made, it can further be noted that in screen space a cone (3-dimensional) projects to an isosceles triangle (2-dimensional).  Knowing the location of the starting point and the ray’s end point tells us how far in screen space the ray has traveled.  With the roughness value available for the current surface through sampling the appropriate texture, everything needed to move forward is on hand.

The steps for cone tracing are as follows.

  1. The adjacent length of the isosceles triangle is found by finding the magnitude of the vector from the origin position to the ray hit position.
  2. The sampled roughness is converted into a specular power.
  3. The specular power is then used to calculate the cone angle (theta) for the isosceles triangle.
  4. The opposite length of the triangle is found by halving the cone angle to form a right triangle and applying basic trigonometry, specifically that tan(theta) = oppositeLength/adjacentLength, which rearranges to oppositeLength = tan(theta) * adjacentLength.
  5. The result is then doubled to recover the full length.
  6. The radius of a circle inscribed in the triangle is found using the formula for isosceles triangles given in [7].  This is used to determine the sample position and the mip level from which to sample.
  7. The color is sampled and weighted based on surface roughness.
  8. Steps 4-7 are repeated until the accumulated alpha reaches 1, or the loop hits its iteration limit.  During each iteration, the triangle’s adjacent length is shortened by the diameter of the previously calculated incircle, then each value is recomputed for the new, smaller triangle.

Step 7 in particular differs from Uludag’s implementation where he builds out an entire visibility buffer that is used to help diminish contributions from sampled pixels that should not be included as part of the result.  For most cases, the results tend to be good enough with this simplified approach, and the cost saved from not creating the visibility buffer and the hierarchical z-buffer from Uludag’s article can be re-assigned to further refinements or other effects.

The formula for finding the radius of the circle inscribed in an isosceles triangle is shown below.  In the formula, a represents the opposite (base) length of the triangle and h represents the adjacent length [7].
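
Reconstructed here to match the isoscelesTriangleInRadius function in the shader further down:

r = \frac{a \left( \sqrt{a^{2} + 4h^{2}} - a \right)}{4h}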

Once the cone traced color is found, it’s modulated by the calculated Fresnel term using the values from the specular buffer, a normalized vector pointing from the surface location back towards the viewer, and the surface normal.  Finally, several fading steps are applied to help diminish the pronouncement of areas where the ray tracing step failed to find an intersection.  The results of this step are added back to the original light buffer and the process is complete.

The below shader code demonstrates this process.

#include "SSLRConstantBuffer.hlsli"
#include "../../LightingModel/PBL/LightUtils.hlsli"
#include "../../ConstantBuffers/PerFrame.hlsli"
#include "../../Utils/DepthUtils.hlsli"
#include "../../ShaderConstants.hlsli"

struct VertexOut
{
 float4 posH : SV_POSITION;
 float3 viewRay : VIEWRAY;
 float2 tex : TEXCOORD;
};

SamplerState sampTrilinearClamp : register(s1);

Texture2D depthBuffer : register(t0); // scene depth buffer used in ray tracing step
Texture2D colorBuffer : register(t1); // convolved color buffer - all mip levels
Texture2D rayTracingBuffer : register(t2); // ray-tracing buffer
Texture2D normalBuffer : register(t3); // normal buffer - from g-buffer
Texture2D specularBuffer : register(t4); // specular buffer - from g-buffer (rgb = ior, a = roughness)
Texture2D indirectSpecularBuffer : register(t5); // indirect specular light buffer used for fallback

///////////////////////////////////////////////////////////////////////////////////////
// Cone tracing methods
///////////////////////////////////////////////////////////////////////////////////////

float specularPowerToConeAngle(float specularPower)
{
 // based on phong distribution model
 if(specularPower >= exp2(CNST_MAX_SPECULAR_EXP))
 {
 return 0.0f;
 }
 const float xi = 0.244f;
 float exponent = 1.0f / (specularPower + 1.0f);
 return acos(pow(xi, exponent));
}

float isoscelesTriangleOpposite(float adjacentLength, float coneTheta)
{
 // simple trig and algebra - soh, cah, toa - tan(theta) = opp/adj, opp = tan(theta) * adj, then multiply * 2.0f for isosceles triangle base
 return 2.0f * tan(coneTheta) * adjacentLength;
}

float isoscelesTriangleInRadius(float a, float h)
{
 float a2 = a * a;
 float fh2 = 4.0f * h * h;
 return (a * (sqrt(a2 + fh2) - a)) / (4.0f * h);
}

float4 coneSampleWeightedColor(float2 samplePos, float mipChannel, float gloss)
{
 float3 sampleColor = colorBuffer.SampleLevel(sampTrilinearClamp, samplePos, mipChannel).rgb;
 return float4(sampleColor * gloss, gloss);
}

float isoscelesTriangleNextAdjacent(float adjacentLength, float incircleRadius)
{
 // subtract the diameter of the incircle to get the adjacent side of the next level on the cone
 return adjacentLength - (incircleRadius * 2.0f);
}

///////////////////////////////////////////////////////////////////////////////////////

float4 main(VertexOut pIn) : SV_TARGET
{
 int3 loadIndices = int3(pIn.posH.xy, 0);
 // get screen-space ray intersection point
 float4 raySS = rayTracingBuffer.Load(loadIndices).xyzw;
 float3 fallbackColor = indirectSpecularBuffer.Load(loadIndices).rgb;
 if(raySS.w <= 0.0f) // either means no hit or the ray faces back towards the camera
 {
 // no data for this point - a fallback like localized environment maps should be used
 return float4(fallbackColor, 1.0f);
 }
 float depth = depthBuffer.Load(loadIndices).r;
 float3 positionSS = float3(pIn.tex, depth);
 float linearDepth = linearizeDepth(depth);
 float3 positionVS = pIn.viewRay * linearDepth;
 // since calculations are in view-space, we can just normalize the position to point at it
 float3 toPositionVS = normalize(positionVS);
 float3 normalVS = normalBuffer.Load(loadIndices).rgb;

 // get specular power from roughness
 float4 specularAll = specularBuffer.Load(loadIndices);
 float gloss = 1.0f - specularAll.a;
 float specularPower = roughnessToSpecularPower(specularAll.a);

 // convert to cone angle (maximum extent of the specular lobe aperture)
 // only want half the full cone angle since we're slicing the isosceles triangle in half to get a right triangle
 float coneTheta = specularPowerToConeAngle(specularPower) * 0.5f;

 // P1 = positionSS, P2 = raySS, adjacent length = ||P2 - P1||
 float2 deltaP = raySS.xy - positionSS.xy;
 float adjacentLength = length(deltaP);
 float2 adjacentUnit = normalize(deltaP);

 float4 totalColor = float4(0.0f, 0.0f, 0.0f, 0.0f);
 float remainingAlpha = 1.0f;
 float maxMipLevel = (float)cb_numMips - 1.0f;
 float glossMult = gloss;
 // cone-tracing using an isosceles triangle to approximate a cone in screen space
 for(int i = 0; i < 14; ++i)
 {
 // intersection length is the adjacent side, get the opposite side using trig
 float oppositeLength = isoscelesTriangleOpposite(adjacentLength, coneTheta);

 // calculate in-radius of the isosceles triangle
 float incircleSize = isoscelesTriangleInRadius(oppositeLength, adjacentLength);

 // get the sample position in screen space
 float2 samplePos = positionSS.xy + adjacentUnit * (adjacentLength - incircleSize);

 // convert the in-radius into screen size then check what power N to raise 2 to reach it - that power N becomes mip level to sample from
 float mipChannel = clamp(log2(incircleSize * max(cb_depthBufferSize.x, cb_depthBufferSize.y)), 0.0f, maxMipLevel);

 /*
 * Read color and accumulate it using trilinear filtering and weight it.
 * Uses pre-convolved image (color buffer) and glossiness to weigh color contributions.
 * Visibility is accumulated in the alpha channel. Break if visibility is 100% or greater (>= 1.0f).
 */
 float4 newColor = coneSampleWeightedColor(samplePos, mipChannel, glossMult);

 remainingAlpha -= newColor.a;
 if(remainingAlpha < 0.0f)
 {
 newColor.rgb *= (1.0f - abs(remainingAlpha));
 }
 totalColor += newColor;

 if(totalColor.a >= 1.0f)
 {
 break;
 }

 adjacentLength = isoscelesTriangleNextAdjacent(adjacentLength, incircleSize);
 glossMult *= gloss;
 }

 float3 toEye = -toPositionVS;
 float3 specular = calculateFresnelTerm(specularAll.rgb, abs(dot(normalVS, toEye))) * CNST_1DIVPI;

 // fade rays close to screen edge
 float2 boundary = abs(raySS.xy - float2(0.5f, 0.5f)) * 2.0f;
 const float fadeDiffRcp = 1.0f / (cb_fadeEnd - cb_fadeStart);
 float fadeOnBorder = 1.0f - saturate((boundary.x - cb_fadeStart) * fadeDiffRcp);
 fadeOnBorder *= 1.0f - saturate((boundary.y - cb_fadeStart) * fadeDiffRcp);
 fadeOnBorder = smoothstep(0.0f, 1.0f, fadeOnBorder);
 float3 rayHitPositionVS = viewSpacePositionFromDepth(raySS.xy, raySS.z);
 float fadeOnDistance = 1.0f - saturate(distance(rayHitPositionVS, positionVS) / cb_maxDistance);
 // ray tracing steps stores rdotv in w component - always > 0 due to check at start of this method
 float fadeOnPerpendicular = saturate(lerp(0.0f, 1.0f, saturate(raySS.w * 4.0f)));
 float fadeOnRoughness = saturate(lerp(0.0f, 1.0f, gloss * 4.0f));
 float totalFade = fadeOnBorder * fadeOnDistance * fadeOnPerpendicular * fadeOnRoughness * (1.0f - saturate(remainingAlpha));

 return float4(lerp(fallbackColor, totalColor.rgb * specular, totalFade), 1.0f);
}

The following image roughly illustrates the process.  From top to bottom, the floor of the image starts off perfectly mirror-like and gradually becomes rougher.  The red lines indicate the cones.  The circles inscribed in them show how the radii are used for mip selection (i.e., the larger the circle, the further down the mip chain), and the center of each circle is where the sample would be taken.  Notice that for a perfectly mirror-like surface, the cone diminishes to a straight line.

sslr_cone_width_comparison

Bringing It All Together

It’s mentioned earlier that a fallback technique is useful for any screen space reflection technique.  This implementation uses parallax-corrected cube maps based on Lagarde’s post [4]. These also include a fallback to generic, non-corrected cube maps as a last resort.  These values are all computed before the screen space reflections technique starts and are accessed above in the cone tracing step through the “indirectSpecularBuffer” resource.  While fallback methods won’t be as exact as ray traced results, properly set-up cube maps can certainly help alleviate jarring artifacts.  The image below shows a comparison of two sections of the same scene.  The left half of the image does not have good cube map placement and the missed reflection data is quite noticeable under the sphere.  The right half includes blended parallax-corrected cube maps and introduces a much less severe penalty for missed rays.

sslr_fallback_bad_good_comparison

Another artifact of inadequate fallback techniques can also be seen in the left image above.  As the traced ray nears closer to the edge of the screen, it starts to become faded.  The code for this is towards the bottom of the cone tracing shader.  Without a decent fallback technique in place, the differences between the center of the screen and the edges can be quite drastic.  The right half of the image shows such fading only to a very minor degree, most noticeably on the left edge of the picture.

Due to the numerous issues mentioned towards the start of the post, rays facing back towards the viewer are disallowed entirely.  This is an implementation choice and by no means a requirement.  Implementers should experiment with their own scenes and determine whether backwards-traversing rays provide acceptable results for their application’s use cases.  In the implementation above, ray results also fade as the reflection vector approaches perpendicular to the view direction, so there is no sharp cutoff at any one point.
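
For reference, the early-out enforcing this could look something like the sketch below.  The resource names here are assumptions; the only requirement is that the ray tracing pass stores rdotv in the w component, as noted in the shader above.

// sketch of the early-out at the start of the cone tracing pass (resource and sampler names are assumptions)
float4 raySS = rayTracingBuffer.SampleLevel(pointSampler, uv, 0.0f);
float3 fallbackColor = indirectSpecularBuffer.SampleLevel(pointSampler, uv, 0.0f).rgb;
if(raySS.w <= 0.0f)
{
	// the reflection ray pointed back toward the viewer and was never traced - use the fallback only
	return float4(fallbackColor, 1.0f);
}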

A final nicety in this implementation is that the indirect specular buffer is included as part of the light buffer during the initial convolution and is subtracted back out before the cone tracing pass is applied.  This allows metals to be reflected more appropriately in the cone traced step.  In the image below, the left half does not take these steps into consideration and the metal’s reflection is black.  The specular highlight still shows up in the reflection since it is contributed by direct lighting, the sun in this case, but none of the indirect light is included.  In the right half of the image, these effects are enabled and the sky is visible in the reflected sphere.

sslr_metal_reflections_comparison

The U-shape on the bottom of each sphere is due to not having good fallback techniques in this area of the scene, and can be alleviated as discussed previously.
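
To make the idea concrete, the subtraction amounts to something like the following sketch.  The buffer names are assumptions rather than the exact resources used in the implementation.

// sketch only - remove the cube map contribution that was folded into the light buffer for the
// convolution so it is not counted twice once the cone traced result is applied
float3 convolvedLight = lightBuffer.SampleLevel(pointSampler, uv, 0.0f).rgb;
float3 indirectSpecular = indirectSpecularBuffer.SampleLevel(pointSampler, uv, 0.0f).rgb;
float3 lightWithoutIndirect = convolvedLight - indirectSpecular;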

Areas of Improvement

The biggest improvement this technique needs in its current state is a better blur.  The current separable Gaussian blur, while fast, can lead to reflections being blurred onto parts of the scene where they don’t belong.  A feature-aware blur similar to a bilateral blur is likely a better candidate in this space and will be followed up on in a separate post once a better method is determined.  Specifically, the blur will likely need to account for large depth discrepancies and reject samples that do not fall within a specified threshold.  It should be noted that battling these types of artifacts is a potential strength of Uludag’s proposed visibility buffer.
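
As a starting point for such a blur, a per-tap depth rejection could look like the sketch below.  The resource names and cb_depthThreshold are assumptions, and gaussianWeight would come from whatever kernel the blur already uses.

// sketch of a depth-aware tap for the blur (names and threshold are assumptions)
float centerDepth = depthBuffer.SampleLevel(pointSampler, uv, 0.0f).r;
float tapDepth = depthBuffer.SampleLevel(pointSampler, uv + tapOffset, 0.0f).r;
// reject taps across large depth discontinuities so reflections do not bleed onto unrelated geometry
float tapWeight = gaussianWeight * (abs(tapDepth - centerDepth) < cb_depthThreshold ? 1.0f : 0.0f);
blurredColor += reflectionBuffer.SampleLevel(pointSampler, uv + tapOffset, 0.0f) * tapWeight;
totalWeight += tapWeight;
// after the last tap, divide blurredColor by totalWeight to renormalize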

The blur can also be sped up while still obtaining the same results by using the approach found at [10].  This is slated as future work for the current effect, and will likely be included in the same follow up post that revisits a better overall blurring solution.
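
The core of that optimization is that two neighboring taps of the discrete kernel can be replaced by a single bilinear fetch.  The sketch below shows the idea for a horizontal pass; the weight and offset variables stand in for values taken from the existing kernel.

// two discrete taps with weights w1 and w2 at texel offsets t1 and t2 collapse into a single
// bilinear tap placed at their weighted average position
float combinedWeight = w1 + w2;
float combinedOffset = (t1 * w1 + t2 * w2) / combinedWeight;
color += sourceTexture.SampleLevel(linearSampler, uv + float2(combinedOffset * texelSize.x, 0.0f), 0.0f) * combinedWeight;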

A Texture2DArray was also tried out as storage for the blurred results.  While it improved the overall perceived smoothness of the blur across varying roughness values, the memory requirements and the added time to run the blur several times over full-resolution textures were simply not worth the small improvement.  The mip-chained texture provides decent results and blends adequately with trilinear sampling.  While testing values for various kernel sizes and sigmas, the calculator at [11] was extremely helpful for quick iteration.

One further improvement that can be made to the blurred result using the current implementation is to sample several points within the inscribed circle instead of just the center and blend all the results together.  The trade-off for sampling multiple points in this fashion is between performance and quality.  This technique is demonstrated in [8] on page 3 of the conversation.
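
A rough sketch of that idea is shown below.  The sample count, the offsets, and the incircleRadiusUV variable are assumptions made for illustration; the existing coneSampleWeightedColor call is simply evaluated at a few points spread across the inscribed circle and the results are averaged.

// sketch - average a few taps across the inscribed circle instead of a single tap at its center
static const float2 circleOffsets[4] =
{
	float2(0.0f, 0.0f),
	float2(0.5f, 0.0f),
	float2(-0.25f, 0.433f),
	float2(-0.25f, -0.433f)
};

float4 circleColor = float4(0.0f, 0.0f, 0.0f, 0.0f);
[unroll]
for(int i = 0; i < 4; ++i)
{
	circleColor += coneSampleWeightedColor(samplePos + circleOffsets[i] * incircleRadiusUV, mipChannel, glossMult);
}
float4 newColor = circleColor * 0.25f;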

Another area of improvement for this technique would be to update the reflection model to better match the lighting model used in the rest of the engine’s rendering pipeline.  As mentioned previously, the above implementation for the cone-tracing step is based off Uludag’s explanation provided in [3].  In its current state, the effect uses an approximation of the Phong model, while the rest of the pipeline uses GGX for its specular distribution term.  Uludag does offer suggestions in his article on how to adapt to other reflection models, and this will likely be the topic of a future post once implemented.

Furthermore, more efficiently packed lighting buffers could prove to be a performance improvement for this technique.  As mentioned above, all buffers containing lighting data currently use 64 bits per pixel, with 16 bits of floating point precision in each channel.  Future experimentation with a more compact 32-bit-per-pixel format such as DirectX’s DXGI_FORMAT_R11G11B10_FLOAT should be considered.

Results

This section contains images generated using the techniques described above.  Each image is composed of a few smaller images showing increasing roughness in the floor material.

The first image shows the effect working on a large scale in an area of the scene spanning over 100 meters.

sslr_result_1

The second image shows the effect working in a more localized setting at ground level, similar to how a user would perceive the world in a first-person game or application.  The area uses parallax-corrected cube maps as a fallback technique, and missed ray intersections, such as those that would likely occur around concave objects (the soldier in this case), are very well-blended.

sslr_result_2

The third image again shows the effect in a localized setting.  The later time of day creates a steeper contrast between shadowed and un-shadowed areas, making the effect more pronounced and better showing how a rougher surface blurs and even starts to pull the reflection vertically.

sslr_result_3

The fourth image again uses a steeper lighting contrast to help demonstrate how the effect applies as the floor material changes from very smooth to very rough.

sslr_result_4

The following videos show the effect running in a real-time interactive application.  For best viewing, it is recommended to either run the videos in full-screen with high-definition enabled, or visit their respective YouTube pages by following these links:  Video 1  Video 2.

Conclusion

This post has presented a full implementation of a solution for glossy screen space reflections.  While the abundant programmer art and MS Paint images may not be quite as fantastical as scenes rendered with a proper studio’s asset collection, the contribution of the effect to the final result should be clear.  Even with a basic reflection model, the technique adds realism to a scene and provides a means for believable real-time reflections on rough surfaces.

Acknowledgements

I first came into contact with Bruce Wilkie about a year ago when he posted a topic on gamedev.net.  We were both working on implementing Yasin Uludag’s article from GPU Pro 5 [3].  We spoke a few times on the subject, and it became abundantly clear that he was much more knowledgeable on the matter than me.  He was critical in helping me understand and figure out Uludag’s use of the hierarchical Z-buffer for ray tracing and work the kinks out of my initial attempts at implementing it [8].  Bruce was kind enough to offer that we keep in touch and that I could ask him questions around issues I might have while implementing different features in my engine, which I work on as a hobby in my spare time.  I’ve certainly taken advantage of that offer over the course of the year, and he’s offered various ranges of advice on almost everything graphics-related that’s been posted to this blog to date.  He has shown a great deal of patience in helping clarify certain concepts to me, and has a knack for explaining how to arrive at a solution without simply giving the answer away – an extremely valuable teaching technique.  He also brought the idea of the more efficient blur using [10] to my attention as a solid alternative to the standard approach used above, as well as offered a few more suggestions for improvement over the first draft of this post.

Thank you, Bruce.

I would also like to thank Morgan McGuire (@morgan3d) and Mike Mara for open-sourcing and generously licensing their DDA-based ray tracing code.  A thank you also goes to Ben Hopkins (@kode80) for doing the same with his implementation.

References

[1] Morgan McGuire and Mike Mara.  http://casual-effects.blogspot.com/2014/08/screen-space-ray-tracing.html

[2] Ben Hopkins.  http://www.kode80.com/blog/2015/03/11/screen-space-reflections-in-unity-5/

[3] Yasin Uludag.  GPU Pro 5.  Hi-Z Screen-Space Cone-Traced Reflections.

[4] Sébastien Lagarde.  https://seblagarde.wordpress.com/2012/09/29/image-based-lighting-approaches-and-parallax-corrected-cubemap/

[5] Sébastien Lagarde.  https://seblagarde.wordpress.com/2011/08/17/feeding-a-physical-based-lighting-mode/

[6] Matt Pettineo.  https://mynameismjp.wordpress.com/2010/09/05/position-from-depth-3/

[7] Weisstein, Eric W. “Inradius.” From MathWorld–A Wolfram Web Resource. http://mathworld.wolfram.com/Inradius.html

[8] https://www.gamedev.net/topic/658702-help-with-gpu-pro-5-hi-z-screen-space-reflections/

[9] https://en.wikipedia.org/wiki/Specular_highlight

[10] http://rastergrid.com/blog/2010/09/efficient-gaussian-blur-with-linear-sampling/

[11] http://dev.theomader.com/gaussian-kernel-calculator/

Dealing with Shadow Map Artifacts

In a previous post on stack stabilization, the linked video showed a few major issues with shadow mapping.  These issues have plagued the technique since its inception, and while there are many methods that help alleviate them, it’s still very difficult to get rid of them completely.  Here we’ll review some common artifacts and discuss potential ways to squash them.

Perspective Aliasing

These types of artifacts are perhaps the simplest to alleviate.  Stair-like artifacts outlining the projected shadows are generally caused by the resolution of the shadow map being too low.  Compare the halves in the image below.  The top half shows a scene using a shadow map resolution of 256×256, while the bottom shows the same scene using a resolution of 2048×2048.

0_sm_resolution_comparison

Unfortunately, increasing the resolution will only get us so far.  Even at high resolutions, if the viewer is close enough to the receiving surface, tiny stair-like artifacts will still be noticeable along the edges of projected shadows.  A common solution is a technique called percentage closer filtering (PCF).  Instead of sampling at one location, this algorithm samples several points around the initial location, weights the shadowed versus non-shadowed results, and produces soft shadow edges.  The image below shows an up-close view of a shadow map with 2048×2048 resolution without and then with PCF enabled.

1_pcf_comparison

There are several different sampling patterns that can be used for the PCF algorithm.  Currently, I’m using a simple box filter around the center location.  Other sampling patterns, such as a rotated Poisson disc, are also popular and produce varying results.
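
For reference, a minimal box-filter PCF might look like the sketch below.  The shadow map, comparison sampler, and texel size names are assumptions, shadowPosH.xy is assumed to already be in shadow map texture space, and depth is the biased comparison depth computed later in this post.

// sketch of a 3x3 box-filter PCF (resource names and texel size are assumptions)
float shadowFactor = 0.0f;
[unroll]
for(int y = -1; y <= 1; ++y)
{
	[unroll]
	for(int x = -1; x <= 1; ++x)
	{
		float2 offset = float2(x, y) * shadowMapTexelSize;
		shadowFactor += shadowMap.SampleCmpLevelZero(shadowComparisonSampler, shadowPosH.xy + offset, depth);
	}
}
shadowFactor /= 9.0f;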

Shadow Acne

Another common artifact found in shadow mapping is shadow acne, or erroneous self-shadowing.  This generally occurs when the depth stored in the shadow map and the depth of the surface being shaded are so close that precision errors cause the depth comparison to fail incorrectly.  The image below shows an example of these artifacts present (top) and addressed (bottom).

2_shadowacne_comparison

There are a few ways to address this issue.  It’s so prevalent that most graphics APIs provide a means to create a rasterizer state that includes both a depth bias and a slope-scaled depth bias.  Essentially, during shadow map creation, these values are combined to offset the stored depth and push it out of the range where precision issues would cause incorrect comparisons.  One must be careful when setting these bias values.  Too high a value can cause the next issue to be discussed, peter panning, while too low a value will still let acne artifacts creep back into the final image.

Peter Panning

It’s frustrating when introducing a fix for one thing breaks something else, and that’s exactly what can happen when we use depth biases for shadow maps.  Peter Panning is caused by offsetting the depth values in light space too much.  The result is that the shadow becomes detached from the object casting it.  The image below displays this phenomenon.  In both halves of the image, the blocks are resting on the ground, but in the top half the depth bias is so large that it pushes the shadows away from the casters, making the blocks appear as though they could be floating.  The bottom half uses a more appropriate depth bias and the shadows appear properly attached.

3_peter_panning_comparison

Bangarang!

Working in the Shader

Using hardware depth biasing in the rasterizer is nice in that it’s fast and easy to set up.  Sometimes, however, we have different needs for our shadow maps and want to delay these types of correction steps until further along in the pipeline.  Though I’ve since reverted to a more basic approach, when first implementing transmittance through thin materials I switched my shadow map vertex shaders to output linear values to make the implementation a bit more straightforward.  If I used the rasterizer state offsets described above, I would have to somehow track and undo those offsets before I could use the values effectively in my transmittance calculations, or else accept major artifacts from the depth discrepancies.  Fortunately, there are several excellent resources that describe alternative methods for getting rid of shadow artifacts (see references), and with a combination of ideas borrowed from all of them, I’ve been able to get a fairly decent implementation working.  Below is some example code in HLSL.

Storing linear values to the shadow map:

// client code
Matrix4x4f linearProjectionMtx = createPerspectiveFOVLHMatrix4x4f(fovy, aspect, nearPlane, farPlane);
linearProjectionMtx.rc33 /= farPlane;
linearProjectionMtx.rc34 /= farPlane;

// shadow map vertex shader
float4 main(VertexIn vIn) : SV_POSITION
{
 // transform to homogeneous clip space
 float4 posH = mul(float4(vIn.posL, 1.0f), worldViewProjectionMatrix);
 // store linear depth to shadow map - there is no change to the value stored for orthographic projections since w == 1
 posH.z *= posH.w;
 return posH;
}

Using a scaled normal offset in the light shader before transforming a point in world space by the shadow transform matrix.  I use a deferred shading pipeline and store data in the G-Buffer in view space, hence having to transform the new position by the inverse of the camera view matrix first:

#if DIRECTIONALLIGHT
 float3 toLightV = normalize(-light.direction);
#else
 float3 toLightV = normalize(light.position - position);
#endif
 // the offset grows as the surface normal turns away from the light, where acne is most likely
 float cosAngle = saturate(1.0f - dot(toLightV, normal));
 float3 scaledNormalOffset = normal * (cb_normalOffset * cosAngle * smTexelDimensions);
 // positions are stored in view space in the G-Buffer, so return to world space before applying the shadow transform
 float4 shadowPosW = mul(float4(position + scaledNormalOffset, 1.0f), inverseViewMatrix);

Once the point has been transformed by the shadow matrix, finish projecting it and apply a depth offset:

// complete projection by doing division by w
shadowPosH.xyz /= shadowPosH.w;
shadowPosH.z -= cb_depthBias * smTexelDimensions;
float depth = shadowPosH.z; // depth to use for PCF comparison

And that’s it.  The values for depth bias and normal offset have to be adjusted per light and depend on various factors, such as the light range, the shadow projection matrix, and to some extent the resolution of the shadow map, but when properly set the results can be quite nice and artifacts are almost entirely mitigated.

References

http://www.dissidentlogic.com/old/images/NormalOffsetShadows/GDC_Poster_NormalOffset.png

http://c0de517e.blogspot.co.at/2011/05/shadowmap-bias-notes.html

http://www.digitalrune.com/Support/Blog/tabid/719/EntryId/218/Shadow-Acne.aspx

https://msdn.microsoft.com/en-us/library/windows/desktop/ee416324%28v=vs.85%29.aspx

https://www.mvps.org/directx/articles/linear_z/linearz.htm

Stacks on Stacks

A long while back, I realized my scenes would be better served and more interesting if there was a more dynamic component to them.  Outside of the very basics, implementing  a proper physics engine with accurate collision detection and response was quite foreign to me.  Therefore, I picked up Ian Millington’s book Game Physics Engine Development and got to work.  I enjoyed the author’s approachable writing style and well-explained information on both particle and rigid body dynamics.  Within about a week or so, I was able to integrate a fairly robust adaptation of the engine presented in the book into my own engine’s architecture.

While the information presented on physical body simulation is quite good, the book’s main shortcoming is in collision detection and resolution.  In fairness, the author calls this out and tries to set the reader’s expectations realistically, but there’s a lot left to be desired when two boxes can’t reliably be stacked on top of one another due to non-converging solutions for contact generation and impulse resolution.  Regardless, this is the approach that lived in my engine for well over a year and still remains in the code base, although I consider it deprecated for anything beyond very simple simulations.

After a lot of research and a short back and forth email exchange with Randy Gaul, I tried my hand at implementing a more complex collision detection routine.  The new routine generated an entire contact manifold, as opposed to the old one, which only ever recorded one contact between two objects for any given point in time.  The contact manifold contained up to 4 points per collision pair.  This data, combined with a few other tricks I picked up here and there, finally allowed a small stack of boxes to sit on top of each other without shaking and falling over.

Eventually, I decided I wanted an overall more robust solution for both physics simulation and collision detection and resolution, so I spent a weekend integrating the Bullet Physics library into my engine.  Bullet’s API has proven to be reasonably straightforward, and I was able to get a stable stack of boxes set up in a very short amount of time.

The video below shows the dramatic difference between the old collision resolution method and the newly integrated engine backed by Bullet.

With the old setup, I would place objects in the world with a sleep state and a tiny amount of space between each to give the appearance of a stack, but as soon as I interacted with anything in the stack, all bets were off.  With the new implementation, I can safely let objects fall into place and rest on top of each other at the start of the simulation without worrying too much about the whole thing going haywire.

(Regarding the ugly shadow artifacts in the video, those will be addressed in a follow-up post specific to the topic.)

References

http://www.randygaul.net/

http://allenchou.net/

https://code.google.com/p/box2d/downloads/list

https://github.com/bulletphysics/bullet3/releases

Bachelor Thesis Acknowledgment

I recently received an acknowledgement in Lukas Hermanns’ bachelor’s thesis entitled Screen Space Cone Tracing for Glossy Reflections, which I thought was really cool of him.  He’s produced some great results, and I’m happy to have lent a hand in the excellent work he’s done.

The full thesis can be found here:  http://publica.fraunhofer.de/documents/N-336466.html

Separable Subsurface Scattering

I’ve recently implemented Screen-Space Separable Subsurface Scattering into my rendering engine.  This implementation is based off the incredible work that’s been done over the past several years, and documented here and here.  I’m quite pleased with the results I’m getting from the effect and so am posting a few screenshots of it in action.

The first screenshot shows the effect in daylight.  Hopefully it’s quite obvious which head in the picture has the new technique applied and which is being lit with the engine’s standard lighting model.

SSSonSSSoff

The second screenshot shows another part of the overall effect, which is the transmittance of light through very thin slabs of materials, such as ears.

TransmittanceAndSSS1

The next screenshot better shows both subsurface scattering and transmittance working together.  In particular, notice how the light behaves along the ridge of the nose.

TransmittanceAndSSS2

Finally, I cobbled together a quick setup showing how this technique could be used to create a nice effect for candles.  In truth, I cheated a little in that I have not yet incorporated a wax kernel for the subsurface scattering technique, so instead I aimed a bright spotlight straight down at a cylinder using the same skin kernel as used in the above screenshots.  Even with such a simple (and quite lazy) setup, the result is still decent looking, and it bodes well for creating a proper candle in an actual scene using a correct SSS kernel.

SSSCandle

Update:  I didn’t like that I had left the post at “think about how nice a candle could look”, so I went ahead and brought in a more wax-like kernel.  The setup is basically the same in that it’s just a cylinder with a light shining down on it, but now it definitely exhibits light interactions much more like an actual candle.

SSSWaxCandle

References

http://www.iryoku.com/separable-sss-released

http://cg.tuwien.ac.at/~zsolnai/gfx/separable-subsurface-scattering-with-activision-blizzard/