The rocky road to Vulkan
One of our long-running projects is to add Vulkan support to our engine. In this document, I want to sketch a map of how to get there, because the terrain is not easy to navigate and there are a lot of things that need to be changed in FSO and how it interacts with GPUs.
Despite the render backend being one of the most fundamentally changed areas of the code compared to the original source code release, in many ways our engine is still operating like it's 1999. Back in the day, rendering was done mostly in what we now call "immediate mode": when the engine hits a point where it decides it wants to render something, it issues a command, waits on the GPU to execute it, then carries on with whatever it was doing at the time. Moving away from that mode and into using various forms of buffer objects is one of the fundamental performance improvements that happened over time; this put GPU resource management more firmly in the hands of the GPU driver, on the assumption that the driver knows best how to get performance out of the hardware.
The remnants of this design can be seen in our hardware abstraction layer. There is a strict separation between the code that actually talks to the GPU and the rest of the engine, expressed as a bundle of function prototypes in 2d.h:
an excerpt of 2d.h
This creates a handy interface that, in theory at least, anyone wishing to add another render backend can implement and then be ready to go.
Theory, however, disagrees with reality when it comes to this interface and Vulkan. The problem is that this interface mixes high- and low-level abstractions freely: there are functions that are reasonably abstract (like gf_print_screen) and functions that are very granular (like gf_sync_fence) that make it hard to adapt this interface to something that isn't following the same paradigms as OpenGL 3/Direct3D prior to 12.
This high/low mix also has a very unwelcome side effect: in order to completely render a frame, the engine has to reach the point where it can call gr_flip to switch the front and back buffers. This results in the GPU idling for long parts of the frame while the CPU grinds through the gameplay logic, or (in the case of batched effects) having to wait while the data the GPU needs is being prepared and sent over the PCIe bus. The end result of this is that our throughput (the number of commands we issue per frame) is far below what the GPU can actually render.
A screenshot of NVIDIA Nsight, showing throughput statistics for a single full frame on an RTX 3080. Note the barely filled gauges showing the utilization of various parts of the GPU.
This brings us, at long last, to Vulkan.
When Khronos designed Vulkan, one design choice they made was that Vulkan should be optimized for easy threading. As CPUs gained more and more threads, hitching render performance directly to single-thread performance became increasingly infeasible; the solution to this was to design the Vulkan API around the concept of pipelines (static objects that define how a given set of commands should be processed), command buffers (data structures recording lists of actual Vulkan commands) and descriptor sets (data structures describing, for example, all the information needed to render a given piece of geometry). These can be prepared in parallel by worker threads, then dispatched to the GPU for processing while the CPU moves on to the next frame and prepares the next set of command buffers.
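To make the threading idea concrete, here is a minimal sketch of worker threads each recording into their own command buffer. It deliberately does not use the real Vulkan API: `DrawCommand`, `CommandBuffer` and `record_in_parallel` are hypothetical stand-ins invented for illustration, and the "commands" are just pairs of IDs.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command; not a real Vulkan type.
struct DrawCommand {
    std::uint32_t pipeline_id;   // which pre-built pipeline object to use
    std::uint32_t descriptor_id; // which descriptor set describes the geometry
};

// Hypothetical stand-in for a command buffer: an ordered list of commands.
using CommandBuffer = std::vector<DrawCommand>;

// Record `draws_per_buffer` commands into each of `buffer_count` buffers,
// using one worker thread per buffer. No two threads ever touch the same
// buffer, so recording needs no locking.
std::vector<CommandBuffer> record_in_parallel(std::size_t buffer_count,
                                              std::uint32_t draws_per_buffer) {
    std::vector<CommandBuffer> buffers(buffer_count);
    std::vector<std::thread> workers;
    workers.reserve(buffer_count);

    for (std::size_t i = 0; i < buffer_count; ++i) {
        workers.emplace_back([&buffers, i, draws_per_buffer] {
            CommandBuffer& out = buffers[i];
            out.reserve(draws_per_buffer);
            for (std::uint32_t d = 0; d < draws_per_buffer; ++d) {
                out.push_back(DrawCommand{static_cast<std::uint32_t>(i), d});
            }
        });
    }

    // Wait until every buffer is fully recorded before handing them off.
    for (std::thread& w : workers) {
        w.join();
    }
    return buffers;
}
```

Calling `record_in_parallel(4, 3)` yields four independent buffers of three commands each. In a real backend the hand-off after the join would be a queue submission, and the engine would immediately start recording the next frame. Depending on the toolchain, building this may require enabling thread support (for example `-pthread`).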
A lot of the performance improvements that Vulkan can provide come from this radical restructuring of how frames are defined and rendered. It is, of course, possible to write Vulkan like one would OpenGL, but doing so would be pointless: we want Vulkan for performance gains and features not (yet) supported by OpenGL, after all.
In essence, in order to gain the most benefit from Vulkan, we need to adapt the engine to operate in a way compatible with it, meaning that the render backend needs to be rearchitected from the ground up in terms of creating command buffers and throwing them over the fence to keep the GPU fed as much as possible.
FSO's high-level render pipeline can be expressed in the following graph:
All of these steps are, at least theoretically, present and processed in every frame, regardless of game state. They are always present in this exact order: moving left to right, pixels written in an earlier step can be overwritten by a later one. This gives us a handy guide on how to structure draw commands issued by the engine's gameplay side; each of these pipeline steps can be seen as a single command buffer that can be filled, dispatched and processed. By restructuring our hardware abstraction layer to accommodate this new paradigm, we can likely already get some benefits on the OpenGL side. At the same time, it simplifies the abstraction layer itself, makes the render backend easier to use for developers not familiar with graphics programming, and makes new implementations of the render interface easier to write because less state tracking is needed.
As with any large scale project, it is good to establish some core design principles early. So here's a set of them for this project.
As much as is feasible, data structures used and discussed here should be POD objects, with member data being immutable. If the underlying data changes, these objects should be recreated rather than changed.
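As a rough illustration of this principle, the sketch below shows a plain-data element type with compile-time checks that it stays plain. The type `ImageElement`, its fields and the helper `with_alpha` are hypothetical names chosen for the example, not existing engine code.

```cpp
#include <type_traits>

// Hypothetical element for a 2D layer; field names are illustrative only.
struct ImageElement {
    int bitmap_handle;
    float x;
    float y;
    float alpha;
};

// Compile-time guards: the type must remain plain data (no virtual
// functions, no custom copy logic, predictable memory layout).
static_assert(std::is_trivially_copyable<ImageElement>::value,
              "ImageElement must be trivially copyable");
static_assert(std::is_standard_layout<ImageElement>::value,
              "ImageElement must have standard layout");

// "Changing" an element means building a new value from the old one;
// the source object is left untouched.
constexpr ImageElement with_alpha(const ImageElement& src, float alpha) {
    return ImageElement{src.bitmap_handle, src.x, src.y, alpha};
}
```

Immutability is enforced here by convention (objects are passed as `const` and replaced wholesale) rather than by `const` members, since `const` members would make the type awkward to store in containers.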
In order to achieve high throughput in the render thread, all data processed in it should be laid out in a cache-friendly manner. The principles of struct-of-arrays design should apply throughout, and the use of polymorphism, virtual functions, and other OOP functionality should be reduced.
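For readers less familiar with struct-of-arrays layout, here is a small sketch of what it could look like for 2D draw elements. `ImageElements` and `count_visible` are hypothetical names; the point is only that each field lives in its own contiguous array.

```cpp
#include <cstddef>
#include <vector>

// Struct-of-arrays storage for 2D draw elements (hypothetical names).
// Element i is described by index i in each of the arrays below.
struct ImageElements {
    std::vector<int> bitmap_handle;
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> alpha;

    // Append one element, keeping all arrays the same length.
    void push(int handle, float px, float py, float a) {
        bitmap_handle.push_back(handle);
        x.push_back(px);
        y.push_back(py);
        alpha.push_back(a);
    }

    std::size_t size() const { return alpha.size(); }
};

// Count elements that are not fully transparent. This pass reads only the
// alpha array, so positions and handles never enter the cache.
std::size_t count_visible(const ImageElements& e) {
    std::size_t visible = 0;
    for (float a : e.alpha) {
        if (a > 0.0f) {
            ++visible;
        }
    }
    return visible;
}
```

With an array-of-structs layout the same pass would drag every field of every element through the cache just to look at one float.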
One of our perennial problems as a team is that the number of people able to work with and reason about the graphics subsystem is always much lower than we might want it to be. This creates a bottleneck that can hold up feature requests for very long periods. The new renderer should thus be designed in a manner that allows people not intimately familiar with graphics programming to make meaningful changes by breaking it down into components that can be easily composed into new functionality.
To achieve our goal of making the new graphics API composable, it is perhaps useful to borrow the concept of layers from (insert your favourite photo editing software here). Each layer has defined inputs and outputs, and configurable rules for how to compose its content into the final frame. As an example, a generic 2D layer only needs two functions, drawImage and drawString. Layering (as used in mainhalls) can be achieved either through implicit (i.e. by preserving the temporal order of drawImage calls) or explicit ordering (via a function parameter) or a combination of the two. In the standard render pipeline as described above, this layer forms the base that every other layer is then composited over through alpha blending.
The same layer can be repurposed to render the HUD: the difference would just be a matter of where in the final composite pipeline the layer is placed.
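A minimal sketch of such a 2D layer, combining implicit and explicit ordering, might look like this. Only the names drawImage and drawString come from the description above; the class name `Layer2D`, the `Element` record, the `order` parameter and the `sorted` method are assumptions made for the example.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of the generic 2D layer.
class Layer2D {
public:
    struct Element {
        int order;         // explicit ordering key; lower values draw first
        bool is_text;      // true for drawString, false for drawImage
        int bitmap_handle; // meaningful only when is_text is false
        std::string text;  // meaningful only when is_text is true
        float x;
        float y;
    };

    void drawImage(int bitmap_handle, float x, float y, int order = 0) {
        elements_.push_back(Element{order, false, bitmap_handle, std::string(), x, y});
    }

    void drawString(std::string text, float x, float y, int order = 0) {
        elements_.push_back(Element{order, true, -1, std::move(text), x, y});
    }

    // Elements in draw order: sorted by the explicit key, with call order
    // preserved among equal keys (stable_sort gives the implicit ordering).
    std::vector<Element> sorted() const {
        std::vector<Element> out = elements_;
        std::stable_sort(out.begin(), out.end(),
                         [](const Element& a, const Element& b) {
                             return a.order < b.order;
                         });
        return out;
    }

private:
    std::vector<Element> elements_;
};
```

Callers that never pass `order` get purely implicit layering; callers that need a specific element on top can raise its key without caring when the call is made.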
Expressed as class diagrams, this would look roughly like this:
This diagram is obviously incomplete and intended as a rough sketch of the interface that will be exposed to the rest of the code.
Pipelines are the next step up from Layers. Each pipeline is composed of multiple layers and produces a single output at its end. That output will generally be a RenderTarget, meaning a texture or collection of textures. All pipelines feed ultimately into a single master pipeline that is responsible for updating the screen.
All pipelines and all resources associated with them are under the purview of a single RenderManager class. This class is responsible for setting up a graphics context and offers functionality to upload textures, models, and whatever else is required to render a frame.
This includes managing the pipelines themselves. The way I envision this system, pipelines are registered with the RenderManager, which will then create an execution graph to resolve, for example, render-to-texture dependencies. When the engine begins a new frame, it notifies the RenderManager, which will initialize all registered pipelines. The engine then fills up the pipelines with whatever it wants to render this frame, then notifies the RenderManager that the current frame is finished. The RenderManager will then in turn expose the accumulated frame data to the render thread, which will grab said data when it's done processing the current frame.
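The frame lifecycle described above can be sketched as follows. This is a single-threaded illustration under heavy assumptions: the method names (`registerPipeline`, `beginFrame`, `submit`, `endFrame`), the `Pipeline` and `FrameData` types, and the use of plain integers as draw elements are all hypothetical, and the execution graph is left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pipeline record: a name plus this frame's draw elements.
struct Pipeline {
    std::string name;
    std::vector<int> elements; // placeholder for real per-frame elements
};

// Hypothetical snapshot handed to the render thread at end of frame.
using FrameData = std::vector<Pipeline>;

class RenderManager {
public:
    // Register a pipeline once; the returned index addresses it later.
    std::size_t registerPipeline(std::string name) {
        pipelines_.push_back(Pipeline{std::move(name), {}});
        return pipelines_.size() - 1;
    }

    // Start of frame: reset every registered pipeline.
    void beginFrame() {
        assert(!in_frame_);
        for (Pipeline& p : pipelines_) {
            p.elements.clear();
        }
        in_frame_ = true;
    }

    // During the frame: the engine fills pipelines with draw elements.
    void submit(std::size_t pipeline, int element) {
        assert(in_frame_ && pipeline < pipelines_.size());
        pipelines_[pipeline].elements.push_back(element);
    }

    // End of frame: return a copy of the accumulated data. The registered
    // pipelines themselves persist for the next frame.
    FrameData endFrame() {
        assert(in_frame_);
        in_frame_ = false;
        return pipelines_;
    }

private:
    std::vector<Pipeline> pipelines_;
    bool in_frame_ = false;
};
```

In this sketch `endFrame` simply returns the snapshot; in the envisioned design that snapshot would instead be parked somewhere the render thread can pick it up once it has finished the frame it is currently processing.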
This is a place for notes and unorganized thoughts so that The E can fill them out later
- each pipeline step consists of a series of elements
- elements are specific to each pipeline step, but are self-contained units that include all state necessary to render them (i.e. for the 2D pipeline a bitmap, screen coordinates, transparency setting)
- to process a pipeline step, the renderer processes all elements in a set order
- pipeline processing may discard elements if they are found to be invisible
- all render commands must include the pipeline step they are associated with
- command buffers are always present in three states: rendering, ready, preparing. gr_flip sets the preparing buffer to ready, discards the existing ready buffer, and starts a new preparing set
- pipeline steps must be able to render their output to a render buffer that can be used by other steps in the pipeline. A one-frame lag is acceptable.
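The three-state command buffer note above (rendering, ready, preparing) could be sketched roughly like this. `FrameQueue`, `CommandBufferSet` and the method names are hypothetical, and commands are plain integers; the sketch only shows the state transitions, assuming one engine thread and one render thread.

```cpp
#include <mutex>
#include <utility>
#include <vector>

// Placeholder for a full set of per-frame command buffers.
using CommandBufferSet = std::vector<int>;

class FrameQueue {
public:
    // Engine thread: record into the set that is currently being prepared.
    void record(int command) { preparing_.push_back(command); }

    // Engine thread, standing in for gr_flip: the preparing set becomes the
    // ready set (any unconsumed ready set is discarded), and a fresh
    // preparing set is started.
    void flip() {
        std::lock_guard<std::mutex> lock(mutex_);
        ready_ = std::move(preparing_);
        has_ready_ = true;
        preparing_ = CommandBufferSet();
    }

    // Render thread: if a ready set exists, promote it to the rendering set
    // and return true; otherwise keep the current rendering set.
    bool acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (!has_ready_) {
            return false;
        }
        rendering_ = std::move(ready_);
        ready_.clear();
        has_ready_ = false;
        return true;
    }

    // Render thread: the set currently being rendered.
    const CommandBufferSet& rendering() const { return rendering_; }

private:
    std::mutex mutex_;           // guards ready_ and has_ready_
    CommandBufferSet preparing_; // touched only by the engine thread
    CommandBufferSet ready_;
    CommandBufferSet rendering_; // touched only by the render thread
    bool has_ready_ = false;
};
```

If the engine flips twice before the render thread calls `acquire`, the older frame is dropped and never rendered, which matches the "discards the existing ready buffer" behaviour in the note. Neither side ever waits on the other for longer than the brief swap under the mutex.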