The official place where I do most of the work for this is over on GitLab, though I do field questions on the GitHub mirror. To contact me, you can always @ me on Twitter as @DefPriPub (or via my personal website).
Now with an Android, iOS, and macOS (Apple Silicon) GUI:
(It also runs on Windows, Mac (Intel), and Linux, but you'll need to build those yourself in the meantime.)
Contents
- What Is This?
- Why?
- Some Blog Posts
- Building and Basic Usage
- How To Use
- What's Different?
- The Changes That Made An Impact
- Other Experiments
- Looking Back and Thinking Forward
- Thanks
This repo is an implementation of Peter Shirley's Ray Tracing mini-books, done in modern C++ (17), with a different project layout and all sorts of performance boosts. The CMake build script allows the user to toggle on/off most of the changes made from the reference implementation, so as to better see the effects of these changes.
Compared to the book's code, mine is able to render the final scene in about 1/4 of the time, without radically changing the architecture of the renderer, and using (near) vanilla C++.
There are other fun features such as multi-core rendering, configurable render parameters (e.g. samples per pixel, max ray depth), and progress bars; it's the little things that count. It should be able to replicate all of the scenes from all three books.
What I like about how this book is structured is that it shows you how to make incremental changes to the ray tracer. With the step-by-step changes in book 3 though, it was getting very difficult to have my code reproduce every step, due to many structural changes in critical sections. Originally the code for book 3 was off in its own separate branch (called book3), but it has now been merged into the mainline branch (that is, master). The PDFVariant optimization has also come along with it (and is now a CMake configuration variable instead of a second separate branch). Note that getting book 3's code to work alongside that of books 1 & 2 did require adjusting some of the architecture a bit, but not by too much.
I've decided to license this implementation under the Apache License (2.0). The full text is in the file LICENSE.txt. The only exception to this is src/third_party, as those are externally provided projects; check those files for their licenses. Note that the books (and the book code) are under CC0 1.0 Universal. If you end up using any of this, a shout out would be appreciated (and telling me too).
I first went through the books back in mid 2016 when they were new; that time I did it as an exercise to learn Nim. It's kind of cool to see how much these books have blown up in popularity. Ray Tracing seems to be quite in vogue recently, and I was looking for a project to learn modern C++ and better project architecture. This was also an experiment to see how far I could push a CPU renderer and optimize my code while keeping things simple, portable, and reproducible.
Care to read a little more about this project? Check out these:
- Making the first few revisions of the project
- Using Python to do automated testing & performance measuring
- Adding a Qt/QML UI that works seamlessly on Mobile & Desktop
- Localizing the GUI app into Japanese (日本語) and German (Deutsch)
- This one is more about localizing Qt apps & software localization in general; not so much ray tracing
- Measuring the impact of the final keyword
- Finding out when noexcept helps performance (and when it doesn't)
I initially developed this on an Ubuntu 18.04 machine using GCC 10.x, but it has since moved to newer versions of Ubuntu (22.04 LTS) and GCC (11 & 12). It runs on Windows 10/11 via MSYS2 (also GCC). It also compiles via clang 11+ without a hitch. macOS & Xcode/clang have also been tested and verified. Something I could use help with is getting a build working with MSVC on Windows.
Don't want to use the command line interface? Want to see how this performs on your phone, tablet, or chromebook?
Take a look at the qt_ui/ subfolder. Be sure to check the README.rst over there for some more instructions.
- A C++17 compiler. I'd recommend GCC 11/12 (or higher)
- CMake 3.x. Using a CMake GUI application (such as ccmake) will make your life easier if you want to toggle on/off changes from the reference implementation
- In the root of this project, make a build directory and go into it:
mkdir build && cd build/
- Set your desired compiler (e.g. export CC=gcc-12, export CXX=g++-12)
- Run CMake with the build type set to Release:
cmake ../ -DCMAKE_BUILD_TYPE=Release
- Build it:
make
If you ever get lost, doing a simple ./PSRayTracing --help
should give you a list of all the options
available, as well as their defaults. Here are some of the more important ones:
- --scene <scene_id_string>, scene from book(s) to render (default is book2::final_scene)
- -n <integer>, number of samples per pixel (default is 25)
- -j <integer>, how many cores/threads to render with (default is 1)
- -s <integer>x<integer>, resolution of the render (default is 960x540)
- -o <filename>.png, file to save the render to (default is render.png). Note that this will always overwrite any existing file.
If you want to see what scenes are available to render, supply --list-scenes
as an argument and it should
show you all that are implemented. They should be in order as they were presented in the book.
So for example, if you do ./PSRayTracing by itself, it will render Book 2's final scene with 25 samples per pixel, on a single core, at a resolution of 540p, saving it to render.png. Though if you do ./PSRayTracing -n 250 -j 4 -s 1920x1080 --scene book1::normal_sphere, you'll get that sphere shaded using its surface normal, with 250 samples per pixel, rendering on four cores at 1080p; also saving to render.png.
You should be good now to start rendering.
If you want to see what scenes are available, supply the flag --list-scenes to the executable. The output from this is different depending upon which branch you're currently on.
Oh, quite a bit. This isn't an exhaustive list though. (I forgot to document some stuff along the way, my bad ¯\_( ツ )_/¯)
- Naming changes; e.g. I did things like BVHNode instead of bvh_node. These are just preferences of mine
- Classes like material have become strict abstract base classes, IMaterial for instance; take a look at src/Interfaces/ to see more
- Instead of rendering to PPM files (via standard output), stb_image_write is used to write directly to PNGs; a file format that you're not ashamed to take back home and introduce to Mom
- The random_* functions have been put into their own RandomGenerator object; this helps perf. and multi-threading (more further down)
- More files/structs/classes and compartmentalization of functionality. Imma neat freak
- A lot more const and const ref parameters; these help perf. and create cleaner, more predictable code
- More/less inline functions
  - Sometimes inlining makes sense and can boost perf well (e.g. Vec3). For virtual functions (e.g. IMaterial, IHittable, etc.), it does jack
- Using Vec3 as a Point
  - Yeah yeah, I can hear one of my CG professors from college saying “A point is not a vector, a vector is not a point. You can't add or multiply two points.” But I wanted to keep things easy here. Vec3, 3D points, and colors are used quite interchangeably in this code
- Vec3 is also not backed by an array of three elements. We have a hard x, a hard y, and a hard z. Saved me typing parentheses
- The Box type also requires you to pass in an RNG to its constructor. More on why in the perf. section
- Avoiding defining our own destructors and copy/move/assignment constructors
  - Modern compilers are really nice. Sometimes they do this for you. It's less code we have to write, which also means fewer bugs
- ImageTexture's method of object creation is different; its constructor has a different signature, and you are instead recommended to use a "static constructor" method: one loads an image from a file path, the other loads an image from a memory buffer
  - This was done to make the Android & iOS ports easier to develop. There is only a single image used in the entire program (assets/earthmap.jpg), so embedding it makes life much easier instead of having to worry about how those platforms look for runtime assets. And if you want to add your own image textures into a scene (directly from a file), I've left the older loading function in, commented out (see Scenes/Book2.cpp)
- The ray tracer has been split into two parts, a static library (where the rendering code lives) and a "CLI Runner". This was done so I could make it easier to build the Qt UI. This is an architectural change that was introduced after the r7 release
- To get all three books to render from the same branch, I needed to add an enumeration called RenderMethod. This is passed into some of the IMaterial methods. Books 1 & 2 have the exact same method of performing renders, but book 3 is different in its "rendering equation" (e.g. the use of PDFs). It doesn't clutter up the code/architecture too much, and after testing, it didn't hamper performance whatsoever
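To give a rough idea of that last point, here is a hedged sketch of a strict abstract material base class that takes a RenderMethod. This is not the project's actual code; the real interfaces live in src/Interfaces/ and the actual method signatures take rays, hit records, RNGs, and more.

```c++
// Simplified sketch only; the real interfaces in src/Interfaces/ have fuller
// signatures (rays, hit records, RNGs, etc.).
enum class RenderMethod {
    Books1And2,   // the rendering equation shared by the first two books
    Book3         // book 3's PDF-based rendering equation
};

class IMaterial {
public:
    virtual ~IMaterial() = default;

    // Concrete materials check `method` wherever books 1 & 2 and book 3
    // need to behave differently.
    virtual bool scatter(const RenderMethod method /*, ... */) const = 0;
};
```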
Not too much. The final scene for book 2 replicates the Perlin noise texture of the book's cover, not what's actually in the book code. A SurfaceNormal material was also added so one of the first generated images can be included for completeness.
This is the juicy good stuff that you're here for.
Using CMAKE_BUILD_TYPE=Release, and rendering the final scene with the same samples per pixel and resolution (single threaded, of course), my implementation computes the result in about 1/4 the time it took the book's code.
This implementation supports multiple cores/threads, so it can render the final image even faster.
I do want to note that doubling the number of cores/threads rendering doesn't give me that ideal 50% reduction in render time (e.g. 1 core = 120 sec, 2 cores = 72 sec, 4 cores = 43 sec, etc.).
This was my first time implementing a thread pool in C++. If someone knows how to improve on my multi-threading
code, please send a PR my way. I think there is some pointer chasing going on that is hampering perf. too.
My code is structured differently, so it's very likely there were perf. benefits from that as well. If you
look at the CMake build file (src/CMakeLists.txt
), you should see that there are a lot of added ON/OFF
options that all begin with WITH_*
(e.g. WITH_BOOK_PERLIN
, WITH_BOOK_SQRT
, etc). These are
changes that differ from the book code. I made them toggleable on/off so you can better see the effects they
have on render time (some even have effects on the final image). Using a CMake GUI application (like ccmake
)
it is pretty easy to do all this toggling.
Some of them created massive perf. boosts for me (e.g. 12%, 7%), while others were so tiny (e.g. 1%, 0.5%) that they were hard to measure and could be within the margin of error. So I will say that not all of these are fully conclusive, but I would like others to take a look as well and report to me whether my method or the book's method was better on your system. For reference, I'm on an Intel(R) Core(TM) i5-7300U CPU @ 2.60GHz. (I'd like to see what one of those magical AMD Ryzen Threadrippers does. :] )
I'm not going to go into the gory details of the code, as it can get a bit long. If you're interested in seeing what I did for a certain option, search through the code (C++) looking for #ifdef statements that have a corresponding USE_* to them. For example, if you wanted to see what I wrote for WITH_BOOK_AABB_HIT, search for USE_BOOK_AABB_HIT.
There were many parts of the code that could be rewritten and moved around while still producing the same computed result, but in a more efficient manner. I'm finding it a little hard to correctly explain how this works,
so I think it would be best to go in and look at the code differences. In some places, I actually had the
result of an if
body computed right before that if
, hoping that the compiler would reorder instructions
and pack computations together via auto-vectorization. This is what I did in a lot of *::hit()
functions.
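As a contrived illustration of that pattern (not actual code from the project), the arithmetic is done unconditionally and the branch only picks which value to keep, which gives the compiler more freedom to reorder and vectorize:

```c++
// Hypothetical sketch of "compute the if-body's result before the if".
// The math runs unconditionally; the branch merely selects a value.
double pick(const double a, const double b, const double c,
            const double t, const double t_min, const double fallback)
{
    const double candidate = (a * b) + c;   // computed up front
    return (t > t_min) ? candidate : fallback;
}
```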
Branching (i.e. if
statements for you non-assembly people) can be a real performance killer. Only do
branching if you need to for the correctness of an algorithm, or to save time on an expensive computation
(e.g. checking a cache).
I think the best change to show off for this is my implementation of the AABB::hit()
function. It went
from a bunch of if
s, swap()
s, and value comparisons computed sequentially, to a very parallelizable
batch of min()
s and max()
es.
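To sketch what that looks like (an approximate illustration of the min()/max() "slab" style; the real code lives behind the WITH_BOOK_AABB_HIT toggle and uses the project's Vec3/Ray types rather than raw arrays):

```c++
#include <algorithm>

// Approximate sketch of a branchless slab test in the spirit of AABB::hit().
// Each axis produces a [t_near, t_far] interval via min()/max(), and the
// intervals are intersected at the end -- no per-axis ifs or swap()s.
bool aabb_hit(const double box_min[3], const double box_max[3],
              const double ray_origin[3], const double ray_dir[3],
              double t_min, double t_max)
{
    for (int axis = 0; axis < 3; ++axis) {
        const double inv_d = 1.0 / ray_dir[axis];
        const double t0 = (box_min[axis] - ray_origin[axis]) * inv_d;
        const double t1 = (box_max[axis] - ray_origin[axis]) * inv_d;

        // min/max instead of branching; these map nicely onto SIMD-friendly
        // instructions and avoid unpredictable jumps.
        t_min = std::max(t_min, std::min(t0, t1));
        t_max = std::min(t_max, std::max(t0, t1));
    }

    return t_min < t_max;
}
```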
I would like to make a note about branch prediction. I'm not sure if I was able to successfully exploit it in my implementation (see my ray_color() function). If you have the time, give this Stack Overflow post a read. It provides some great insight into what's going on at the CPU level to make your code faster.
The people who write compilers and create CPUs are the smartest in the world. The best we can do is write our programs to utilize their genius.
Trig functions are necessary for almost anything math related. Though, they can also get a little expensive to compute. But in some cases, such as graphics, we can get away with doing a faster approximation of these functions. In our case, we use the functions sin(), cos(), asin(), and atan2().
Sine, cosine, and arcsine use a Taylor series approximation. It's easy enough to implement. atan2() was a bit more tricky and required bit-twiddling magic. My method was taken from this page. I'd really recommend reading through it if you want to know the details of how it works.
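For a flavour of the Taylor series approach, here is a hedged sketch; the exact number of terms and the range reduction step used in the project differ from what is shown here.

```c++
// Rough sketch of a truncated Taylor (Maclaurin) series for sine:
//   sin(x) ~= x - x^3/3! + x^5/5! - x^7/7!
// A real implementation also reduces x into a small range first; that step
// is omitted here for brevity.
double approx_sin(const double x)
{
    const double x2 = x * x;
    const double x3 = x2 * x;
    const double x5 = x3 * x2;
    const double x7 = x5 * x2;

    return x - (x3 / 6.0) + (x5 / 120.0) - (x7 / 5040.0);
}
```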
Keep in mind that since these are approximations, they're going to differ from the ground truth. Here's a series of images that explain it better.
If it's hard for you to see the difference between the first and third renders, load them up in an image viewer and toggle between them really fast.
I used to have a math professor who scoffed at approximations. I mean, they are technically incorrect. But in our program we express Pi as 3.1415926535; that's an approximation, not the actual value of Pi. If we're doing that, any approximation is fair game to use as long as the viewer has no idea it's different.
In the books' code, the Box object is actually made up of six components: two XYRects, two XZRects, and two YZRects. A HittableList is used to store them all, which is then looped through for hit() detection. While this is pretty simple, it can be done better.
Take for example this stress test. It is a 5x5x5 matrix of glass cubes in a Cornell Box. Using the book's method on my machine, it took around 8 minutes and 48 seconds (528 sec); 600x600 with 100 samples per pixel (single threaded). Its scene id is fun::cornell_glass_boxes.
Now, in a prior chapter, we actually made a BVHNode object. Having Box use that to store the rectangles (and perform the hit()) was much faster. The only complication from this is that construction of a Box object now requires an RNG, but that's really a small price to pay.
With the use of the BVHNode
, this dropped down to 6 minutes and 5 seconds (365 sec). That's already a speedup
of ~30%.
I've removed this implementation of Box from the current revision because the method mentioned below is more performant. If you wish to see this one though, check out the commit tagged r1.
To push this even further, it's better to give the Box object its own hit()
implementation, rather than relying
on that of the *Rect
children. To do this we simply take the code for each individual *Rect::hit()
function,
place it in Box::hit()
, and then rewrite it to take better advantage of SIMD instructions. This way the ray-side
hit intersection is being computed in parallel. And since we don't need to store a list of pointers to more objects,
this also helps us trim down on memory usage.
This reduced the render time of the "Cornell Glass Boxes" down to 5 minutes and 34 seconds (334 sec). That's an extra ~10% upon the BVHNode method, but ~40% on the book's method!
When doing random number generation, you're not limited to what's provided out of the box in C++. As a replacement for the Mersenne Twister engine from the standard library, PCG provides a drop-in RNG that performs better.
On top of that, the book's RNG solution was to use essentially one source for generation; a bunch of functions
prefixed with random_*
. This created issues when I was trying to multi-thread the ray tracer (initially
with OpenMP).
Black speckles were showing up on the multi-core enabled renders. In a Twitter thread someone suggested that I make sure the RNG was thread safe, which it wasn't. My implementation creates a brand new RNG object per scanline (each seeded from a master RNG). This fixed the issue and improved performance.
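The gist of the fix looks something like the sketch below; std::mt19937 stands in for the project's actual RandomGenerator/PCG engine, and the function shape is purely illustrative.

```c++
#include <cstdint>
#include <random>

// Sketch of "one RNG per scanline, each seeded from a master RNG".
// std::mt19937 is a stand-in for the project's RandomGenerator / PCG engine.
void render_scanlines(const int image_height, const std::uint32_t master_seed)
{
    std::mt19937 master_rng(master_seed);

    for (int row = 0; row < image_height; ++row) {
        // Each scanline gets its own generator, so threads never share state.
        std::mt19937 scanline_rng(master_rng());

        // ... render row `row` using only `scanline_rng` ...
        (void)scanline_rng;
    }
}
```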
The list of objects to render (or tree, whatever term you prefer), if you notice, is actually a collection
of std::shared_ptr<T>
types. In some cases, using a shared pointer can make a lot of sense. For example
if two different objects have the same texture or material. If you update the material on one of the objects,
you'll see the change on the other. A common complaint of shared pointers is that they are slow (e.g. because
of reference counting). This can even further cause issues when you throw multi-core access into the mix.
Luckily, the rendering process is read-only when it comes to the scene, so we don't need to worry about any of
those multi-threaded reading/writing/access issues you hear so much about.
I started with the hypothesis: "Copying the scene (to render) to each thread/core would improve the render
performance." I created the IDeepCopyable
interface, and then implemented it on every class that could be
rendered (e.g. Spheres, Textures, Boxes, etc.). It requires that you add a function called copy()
, which
must return a deep copy of the object and of its child objects. Then when setting up the render threads, copy() is called in each one on the root of the scene.
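Roughly, the interface looks something like this simplified sketch; the exact return type and signature in the project may differ.

```c++
#include <memory>

// Simplified sketch of the IDeepCopyable idea (the project's actual return
// type may differ): every renderable object implements copy(), which returns
// a deep copy of itself and of everything it owns, so each render thread can
// work on its own private copy of the scene tree.
class IDeepCopyable {
public:
    virtual ~IDeepCopyable() = default;

    virtual std::shared_ptr<IDeepCopyable> copy() const = 0;
};
```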
So for single core rendering, there's no improvement. But when rendering with multiple cores, I saw significant improvements, sometimes in the range of 20-30% faster! I also want to note that different scenes did not all see the same benefits. To be honest, I'm not fully sure why this is the case. My assumption is that multiple threads are not having to fight over a single shared pointer tree. I did ask Reddit's /r/cpp_questions/ why it could be faster. If you would like to read the thread, it's right over here. Or if you happen to know why this is more performant, please tell me so I can put that information here.
A better thing to do in my opinion is not to rely on shared pointers unless you really need to.
By default the "deep copy per thread" feature is on, but it can be toggled off at runtime by supplying the
--no-copy-per-thread
flag to the executable.
Following on from the above section, it would be good to eliminate (or reduce) the use of std::shared_ptr<T> in our code.
Going off on a little aside here, part of my inspiration to revisit this book series was watching Tyler Morgan-Wall build his Rayrender/Rayshader project (a path tracer for R). Like this one, it's also based off of the Peter Shirley books. In a recent tweet he announced that he was able to improve render speeds by about 20%. After poking through his commit history, there is a change in the HitRecord structure: mat_ptr was changed from being a std::shared_ptr<IMaterial> over to a plain old raw pointer (IMaterial *).
Making that change to this ray tracing project also led to a good speed boost. When testing this out on an older Gen 7 i5 CPU, I got a render speed improvement of about 30%! On a Gen 10 i7 CPU, the speed boost was closer to 10%; not as grand, but still quite significant.
Since HitRecord
is used a lot, including the mat_ptr
field, it makes a lot of sense here to remove the shared
pointer usage. We have no intention of modifying the material used, only to know what it is. And during the render
process, none of the objects or materials will change. This is a perfect place to use a raw pointer.
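In sketch form, the change looks like this; fields other than mat_ptr are illustrative, not the project's exact layout.

```c++
// Sketch of the mat_ptr change (other fields are illustrative).  The material
// is owned elsewhere (by the scene) and never changes during a render, so a
// plain non-owning pointer is enough.
class IMaterial;

struct HitRecord {
    // Before:  std::shared_ptr<IMaterial> mat_ptr;   // ref-count traffic on every hit
    IMaterial *mat_ptr = nullptr;                      // After: plain raw pointer
    // ... hit point, normal, t, UV coordinates, etc. ...
};
```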
If you want to try toggling this on/off, this is controlled by the WITH_BOOK_MAT_PTR
flag at CMake configuration
time.
While working on Book 3, I couldn't help but notice that during the rendering process, we were allocating dynamic
memory and creating shared pointers when it came to using classes like CosinePDF
, HittablePDF
, and
MixturePDF
(all subclasses of IPDF
). These classes weren't being used in any extraordinarily complex ways.
For instance CosinePDF
is only being used in IMaterial
objects. HittablePDF
is only being used with
the light objects for a scene. And MixturePDF
is only instantiated in the ray_color()
function.
Leveraging the std::variant<T> type, we can still pass around these various IPDF subclasses in a flexible manner, but ensure that they stay allocated on the stack (thus no dynamic memory or reference counting). This was shoved into an aliased type called PDFVariant. We still need to work with pointers to IPDF (namely for MixturePDF), but these are much faster raw pointers (and the memory actually lives on the stack).
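The core trick, in hedged sketch form; the real CosinePDF, HittablePDF, and MixturePDF classes carry actual state (normals, hittables, child PDFs) and fuller method signatures than shown here.

```c++
#include <variant>

// Hedged sketch of the PDFVariant idea.
class IPDF {
public:
    virtual ~IPDF() = default;
    virtual double value() const = 0;
};

class CosinePDF final : public IPDF {
public:
    double value() const override { return 1.0; }
};

class HittablePDF final : public IPDF {
public:
    double value() const override { return 1.0; }
};

class MixturePDF final : public IPDF {
public:
    double value() const override { return 1.0; }
};

// All of the concrete PDFs live on the stack inside the variant, while code
// that only needs the interface can still grab a raw IPDF* to the active one.
using PDFVariant = std::variant<CosinePDF, HittablePDF, MixturePDF>;

IPDF *as_ipdf(PDFVariant &pdf)
{
    return std::visit([](auto &concrete) -> IPDF * { return &concrete; }, pdf);
}
```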
Not everything was a success. I had some theories I wanted to test that turned out to fail.
If you look through the code, you'll find there's next to no mention of float
or double
directly. It
has this type called rreal
. That's actually an alias to one of those two; it defaults to double
.
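The alias itself is tiny; here is a sketch of the idea, with the preprocessor toggle name being illustrative rather than the project's actual one.

```c++
// Sketch of the rreal alias; the toggle name USE_SINGLE_PRECISION_REAL shown
// here is illustrative -- the project wires this up through CMake.
#ifdef USE_SINGLE_PRECISION_REAL
    using rreal = float;
#else
    using rreal = double;    // the default
#endif
```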
I was wondering if using less precision would be more performant (since it doesn't have to use as many bytes
in memory). Turns out that wasn't the case for most of the development; it was exactly the same.
Though later on (I can't remember where/when), float
started to perform worse than double
. I haven't
figured out what the cause of it was. It's truly a bit perplexing to me. If anyone might know why, I'm all
ears.
As an aside, I used to use real as the data type, but I later found out that std::complex<T> has a member function called real(). Luckily humans were smart enough to invent refactoring tools.
Chasing after that famous fast inverse square root approximation (of Quake 3 fame), I did some research into square root approximation. This article was an interesting read. I tried the Babylonian method without much success, though I did learn quite a bit about the history of computing square roots. For me, in the end, those methods turned out to be slower and less accurate.
std::sqrt()
is king.
Update Aug 5th, 2024: After doing an experiment with the final
keyword and writing a blog post on it,
someone reached out and asked me if I had done the same but with noexcept
. Now a more comprehensive
performance test of the noexcept
keyword has been done. The results are mixed and you can read it right here .
Originally I left noexcept on, but now I've decided to turn it off.
I remember hearing about how marking my functions with the noexcept keyword could make my code run faster. So to test this out (and test toggling it on & off), I added a macro called NOEXCEPT, which will either expand to the noexcept keyword, or be a null string. As a benchmark I did single core renders with the samples per pixel set to 250.
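The macro is straightforward; a sketch of the idea follows, with the guard name being illustrative (the project wires the toggle up through a CMake option).

```c++
// Sketch of the NOEXCEPT macro idea; the guard name USE_NOEXCEPT is
// illustrative, not the project's exact spelling.
#ifdef USE_NOEXCEPT
    #define NOEXCEPT noexcept
#else
    #define NOEXCEPT            /* expands to nothing */
#endif

// Usage (hypothetical signature):
//   rreal dot(const Vec3 &a, const Vec3 &b) NOEXCEPT;
```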
To my surprise, I found out that there was no significant change in render time with noexcept on the functions or not. Each one took about 230 seconds in total, with a difference of about ~0.5 seconds (which can be attributed to error). My code is very much exception sparse, so I don't think it would have helped that much anyway. Other code bases may benefit from this, but this one definitely did not.
Despite having no real performance benefits (at least for me), I still think adding noexcept is a good C++ practice, especially for APIs. It informs other developers of your design intentions. If you write code that could crash someone else's, it's best to tell others that it could happen. C++ documentation tools such as Doxygen will pick up those noexcept keywords and mark them in the generated docs.
Update Aug 24th, 2024: After further investigation with the noexcept keyword, I actually found this alternative implementation to be slightly slower than what the book series originally offered. I've now switched back to that as the default option. My _MorePerformant variant did run faster on an older system (and compiler), but that is no longer the case. I'm keeping the code available though.
The idea here is that I thought the BVHNode object was a little inefficient when it came to memory usage. It required that you have two IHittable objects as children (which could also be BVHNodes). Instead, the BVH tree could be a list of AABBs that also contained indices to child AABBs. But maybe some of those indices actually pointed to objects that could be hit and produce colour. It practically became "pointers but with a lot more steps involved". I don't want to go into the gory details of how it works; if you want to see, look at the code for the class BVHNode_MorePerformant.
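A loose sketch of the "indices instead of pointers" layout; the names and fields below are illustrative, not the real BVHNode_MorePerformant internals.

```c++
#include <cstdint>
#include <vector>

// Illustrative flattened-BVH layout: the whole tree is one contiguous vector
// of nodes, traversed by index ("pointers, but with a lot more steps").
struct AABB_Sketch {
    float min[3];
    float max[3];
};

struct FlatBVHNode {
    AABB_Sketch box;
    std::uint32_t left;     // index of the left child...
    std::uint32_t right;    // ...and the right child
    bool is_leaf;           // if true, left/right index into the object list instead
};

using FlatBVH = std::vector<FlatBVHNode>;
```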
This one had a much more minor speedup. On some newer hardware, I only saw about a 1-2% performance boost. On an older machine, it was more in the range of 5-9% (which is more on the significant side). This one didn't seem to be as significant as other changes, not to mention it was hard to reproduce results that saw a consistent performance boost across different hardware. It is on by default though; it can be toggled on/off via a CMake configuration variable.
The tree construction and hit algorithms are the same as the book's BVH node (depth first). It's very likely that alternative construction and hit algorithms could produce more performant results.
Overall this was a fun project.
I'd love to visit some of these ideas, as they could bring better perf and add all around fun features, but I want to get onto other projects. Someday...
- Being able to pause and restart renders. Should be simple, but I'd want to do it
- Adding in a scripting language to define scenes (instead of hard coding them in).
- I think ChaiScript would be a good candidate for this. Fancy animations would also be more possible then!
- There should be some ways to lay out the memory and objects differently to gain more perf.
std::shared_ptr<>
isn't a zero cost abstraction. Reducing the amount of pointers (and dynamically allocated memory) can really boost performance. Though, I think a more radically different structure is required for this renderer - CPUs and AVX instructions are fun and all, but let's not kid ourselves, GPUs are the alpha dogs in this realm. If I knew CUDA, OpenCL, or Vulkan better this renderer could very likely be in a real time state.
- Techniques such as adaptive sampling would be a boon. But I wanted to keep this repo strictly to topics that were mentioned in the book
- Adding in some more fun features like metaballs or an “0ldskool” plasma effect. Let's be real here, it isn't a true CG application unless you support these. LAN party like it's '96.
I will be visiting Ray Tracing again sometime in the future.
We're all working off of the works of others, in some way or another. Let me highlight those that had a bit more of an impact on this project.
- Peter Shirley, who initially wrote this book series
- Matt Godbolt for his compiler explorer tool. It has been invaluable when trying to play code golf with generated assembly and seeing if things get vectorized. It's a must use for anyone doing performance tweaking in C++
- Nic Taylor for his
atan2()
approximation - Roman Wiche (a.k.a. Bromanz) for his Ray-AABB intersection article (and code)
- Tyler Morgan-Wall for working on Rayrender/Rayshader, which provided me with inspiration to start this project, as well as providing another hint on how to boost render speed
- The folks over on Reddit's C++ community answering my questions (/r/cpp and /r/cpp_questions)
- Those who work on the Boost, PCG Random, cxxopts (jarro2783), and stb libraries