Graphic Rants

Variable sized work

2026-03-15T14:11:00.000-05:00

What about small TessFactors? This commonly happens in the form of a base patch with DiceFactor << MaxDiceFactor or during splitting where SplitFactor << MaxSplitFactor. For example, if the max factor is 8 then a patch could generate anywhere from 1 to 64 triangles or subpatches. Statically assigning threads to a fixed size means half of them will be idle on average, assuming all sizes are equally probable. Alternatively, multiple output queues with different fixed sizes can be used. Multiple queues are complicated though, and there will still be waste from rounding up.
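As a quick sanity check of the "half idle" estimate, here is a minimal sketch. It assumes each work source statically reserves `max_items` threads and that every size from 1 to `max_items` is equally likely. The function name is mine, not from any engine code.

```python
from fractions import Fraction

def mean_utilization(max_items: int) -> Fraction:
    """Average fraction of busy threads under static assignment.

    Assumes every source reserves max_items threads and produces
    1..max_items items with equal probability.
    """
    busy = sum(range(1, max_items + 1))   # busy threads summed over every size
    reserved = max_items * max_items      # reserved threads summed over every size
    return Fraction(busy, reserved)
```

For a max factor of 8 (up to 64 items), `mean_utilization(64)` is 65/128, so just under half of the reserved threads sit idle.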

Really what is desired is the ability to enqueue variable sized work at a constant cost to the producer that gets perfectly packed into waves by the consumer. This is a frustratingly common problem in my experience and there is no good general purpose parallel programming primitive that I’m aware of that addresses it.

This turns out to be an extremely similar problem to rasterization where triangles are large or pixel shaders are reasonably expensive. 1 triangle expands to a variable number of pixels. That pixel work must be distributed across many threads and packed into waves. There are more rasterization specific things needed to efficiently match up to the ROP, but from a high level this is the same general problem.

In fact it is so similar that the HW rasterizer could be abused for this purpose. A generator task runs in VS and produces a variable amount of child work items, which are the pixels. The PS work can then access data from the generator task using interpolators. Using rect lists where the rects are 2xN allows the finest granularity of work expansion. There is weirdness though because of this abuse. HW rasterizers can have static assignment of pixel tiles to units, serialize on overlapping in a tile, and other things that make sense for pixels but not for arbitrary work. I did try this experiment, which I jokingly called rasterizer inception since I was running a SW rasterizer inside a HW rasterizer. It was not competitive against the design I will explain.

Minimize data movement

The other problem with small TessFactors is the cost of queuing itself. This gets to an often misunderstood aspect of Nanite. Why is Nanite’s software rasterization faster than hardware? I always have to preface this topic with the fact that I am not a hardware engineer nor do I have access to the details of any GPU’s rasterizer block.

Many think the primary reason is because of pixel quads, or that it isn’t more efficient, it just uses more power, or more transistors on the task at once. Those are true but I don’t think that’s the core of the issue.

I believe it is more about data movement. Assume the hardware is roughly similar to tile based software rasterizers like cudaraster or cuRE. Triangles get set up, which means calculating depth and attribute planes, edge equations, and bounding boxes. Those will be stored somewhere so the following stages can access them. Already this is a significant amount of data. Then the triangle gets binned, meaning added to tile lists. This might be hierarchical. That means multiple passes of reading an index and a mask from a list, then reading the triangle data to compute a new mask, then writing out the index and masks again, followed by prefix sums of the masks to repack into dense waves of work. All of this is optimized and balanced for typical workloads that are pixel bound: the number of triangles that can be set up in 1 clock, the size of caches and queues.

Instead of binning, the Nanite software rasterizer just writes the pixels. No need for any of the state from triangle setup to leave registers or to move at all. The amount of pixel work is small enough that expanding, consolidating, distributing, and repacking it is more expensive than just doing it.

Local work distribution

The same data movement concern is present with tessellation. As already explained there is more generated work than the simple rasterization case such that load balancing of some kind is required, but there is also a lot of state that makes distributing that work expensive. So instead of either writing it all out to memory or rederiving it, a local approach is used. Redistribute the work only across the wave, keep the state in registers, and read it using WaveReadLaneAt if it isn’t scalar.

A wave starts first by deciding how much work it will produce. Each lane writes the number of items it is producing to the work queue. Then the wave switches to consuming work from the queue, a wave worth of work in each iteration until the queue is empty. A consuming lane gets the index of the producing lane and an index of which item from that producing lane it is consuming. Data for the work itself is read from the producing lane using WaveReadLaneAt or is scalar and was shared by all producing lanes.

Threads are thus dynamically assigned to the work instead of statically. They are only empty and idling in the last iteration of the loop, the last wave worth of packed work. This isn’t without drawbacks. It can only distribute work amongst lanes of the same wave, no further. This means state doesn’t need to be put on the queue itself, it can be read directly from the producer by the consumer, but it also means it is a very limited form of load balancing. Care must be taken not to produce too much work or the advantages will be outweighed by the rest of the machine idling waiting for 1 wave to retire.

groupshared uint WorkBatch[ THREADGROUP_SIZE ];

template< typename FTask >
void DistributeWork( FTask Task, uint GroupIndex, uint NumWorkItems )
{
    const uint LaneCount    = WaveGetLaneCount();
    const uint LaneIndex    = GroupIndex &  ( LaneCount - 1 );
    const uint QueueOffset  = GroupIndex & ~( LaneCount - 1 );

    uint FirstWorkItem  = WavePrefixSum( NumWorkItems );
    uint TotalWorkItems = WaveReadLaneAt( FirstWorkItem + NumWorkItems, LaneCount - 1 );
    
    uint SourceData = ( FirstWorkItem << 8 ) | LaneIndex;

    // Pull work from queue
    for( uint BatchFirstItem = 0; BatchFirstItem < TotalWorkItems; BatchFirstItem += LaneCount )
    {
        uint ItemIndex = BatchFirstItem + LaneIndex;

        WorkBatch[ GroupIndex ] = 0xFFFFFFFFu;
        GroupMemoryBarrier();
        
        if( NumWorkItems > 0u )
        {
            // Mark the first work item present in the batch for each source.
            int FirstItemLane = int( FirstWorkItem - BatchFirstItem );
            if( FirstItemLane < ( int )LaneCount && FirstItemLane + ( int )NumWorkItems - 1 >= 0 )
                WorkBatch[ QueueOffset + max( FirstItemLane, 0 ) ] = SourceData;
        }

        GroupMemoryBarrier();

        uint BatchValue     = WorkBatch[ GroupIndex ];
        uint BatchMask      = WaveActiveBallot( BatchValue != 0xFFFFFFFFu ).x;
        // Highest Index <= LaneIndex with a bit set
        uint BatchLane      = firstbithigh( BatchMask & ~( 0xFFFFFFFEu << LaneIndex ) );
        // This is where we wrote SourceData relevant to this item.
        uint SourceValue    = WaveReadLaneAt( BatchValue, BatchLane );
        uint SourceLane     = SourceValue & 0xFFu;
        uint LocalItemIndex = ItemIndex - ( SourceValue >> 8 );
        bool bActive        = ItemIndex < TotalWorkItems;

        Task.RunChild( bActive, SourceLane, LocalItemIndex );
    }
}

For brevity this code assumes wave32. Thanks to Rune for optimizing the original code to not need compaction when NumWorkItems == 0.
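To make the intended result concrete, here is a CPU-side reference in Python rather than HLSL. It is only a sketch of what the packed output should look like, not how the shader computes it: it flattens every producer's items into one list and chops that list into wave sized batches. The function name and the use of `None` for idle lanes are my own conventions.

```python
def distribute_work_reference(num_work_items, lane_count):
    """Return batches of (source_lane, local_item_index) pairs.

    num_work_items[lane] is how many items that lane produces.
    Each batch has lane_count entries; None marks an idle lane.
    """
    # Packed list: all items of lane 0, then lane 1, and so on.
    packed = [(lane, item)
              for lane, count in enumerate(num_work_items)
              for item in range(count)]

    batches = []
    for first in range(0, len(packed), lane_count):
        batch = packed[first:first + lane_count]
        # Padding can only be needed for the final batch.
        batch += [None] * (lane_count - len(batch))
        batches.append(batch)
    return batches
```

Every batch except possibly the last is completely full, which is the property the shader is after.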

Step through

I'll walk through how this works with example VGPR state. To make it easier to follow, let's pretend the wave size is 8. I replaced the packed SourceData values with letters. It's more important to track where those get scattered to than caring about what the values are themselves.

	0	1	2	3	4	5	6	7
NumWorkItems	12	2	5	1	0	6	10	14
FirstWorkItem	0	12	14	19	20	20	26	36
SourceData	A	B	C	D	E	F	G	H

Then looking at one iteration of the loop where BatchFirstItem == 16. Only some work sources have a nonzero amount of work that fits in this batch's window. Specifically lanes 2, 3, 5 corresponding to SourceData C, D, F. I’ll color code any values associated with these sources.

	0	1	2	3	4	5	6	7
SourceData	A	B	C	D	E	F	G	H
FirstItemLane	-16	-4	-2	3	4	4	10	20

These get scattered to max( FirstItemLane, 0 ) in groupshared and then pulled back to VGPR as BatchValue.

	0	1	2	3	4	5	6	7
BatchValue	C			D	F
BatchLane	0	0	0	3	4	4	4	4
SourceValue	C	C	C	D	F	F	F	F
SourceLane	2	2	2	3	5	5	5	5
LocalItemIndex	2	3	4	0	0	1	2	3

Each iteration of the loop slides the wave sized batch window through what is essentially a contiguous list of (SourceLane, LocalItemIndex) pairs. If you laid all the batches out head to tail that's what they'd look like. You can see that C’s items started in the last batch and F’s items will be completed in the next batch.
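The table values above can be reproduced with a small CPU-side emulation of one loop iteration. This is a Python sketch, not the shader: the wave intrinsics are replaced with plain loops over lanes, the groupshared slice is a list, and `run_batch` is a name I made up.

```python
EMPTY = 0xFFFFFFFF

def run_batch(num_work_items, batch_first_item):
    """Emulate one iteration of the loop for a whole wave.

    Returns one (active, source_lane, local_item_index) tuple per lane.
    """
    lane_count = len(num_work_items)

    # WavePrefixSum is an exclusive prefix sum across lanes.
    first_work_item, total = [], 0
    for count in num_work_items:
        first_work_item.append(total)
        total += count
    assert 0 <= batch_first_item < total, "the loop only runs while work remains"

    # Producers scatter their packed SourceData into the batch.
    work_batch = [EMPTY] * lane_count
    for lane, count in enumerate(num_work_items):
        first_item_lane = first_work_item[lane] - batch_first_item
        if count > 0 and first_item_lane < lane_count and first_item_lane + count - 1 >= 0:
            work_batch[max(first_item_lane, 0)] = (first_work_item[lane] << 8) | lane

    # WaveActiveBallot: one bit per lane that received a source.
    batch_mask = sum(1 << lane for lane in range(lane_count) if work_batch[lane] != EMPTY)

    results = []
    for lane in range(lane_count):
        item_index = batch_first_item + lane
        at_or_below = batch_mask & ((2 << lane) - 1)   # keep bits 0..lane
        batch_lane = at_or_below.bit_length() - 1      # firstbithigh
        source_value = work_batch[batch_lane]          # WaveReadLaneAt
        results.append((item_index < total,
                        source_value & 0xFF,
                        item_index - (source_value >> 8)))
    return results
```

Calling `run_batch([12, 2, 5, 1, 0, 6, 10, 14], 16)` yields source lanes 2, 2, 2, 3, 5, 5, 5, 5 with local indexes 2, 3, 4, 0, 0, 1, 2, 3, matching the SourceLane and LocalItemIndex rows.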

Applied

This local work distribution is a generally useful parallel programming primitive. In the tessellation use case the producer is a patch with TessFactors. It knows from the Tessellation Table how many triangles will be produced (NumWorkItems) and enqueues that work. No other patch state needs to be written to a queue. A consumer reads the source patch state directly from registers. That state includes which Tessellation Pattern to use, and the consumer combines it with the item’s index to read from the Tessellation Table and produce a tessellated triangle output.

This tool is used for immediate dicing and rasterization in ClusterRasterize when TessFactors <= MaxDiceFactor for the base patch. The work that is distributed is the diced triangles. The same is done for splitting. SplitFactors were already calculated so it would be a shame to do that again redundantly in PatchSplit. Instead 1 step of splitting is done immediately in ClusterRasterize. The writing of each subpatch to the split queue is distributed across the wave. The third place this is used is in PatchSplit. Again there are SplitFactors calculated for a patch but the actual work of splitting is variable in size and may be significant. Calculating the barycentrics for the child subpatches and writing them to the split queue is distributed across the wave.

Nanite + Reyes

2026-02-06T19:43:00.000-06:00

Let's get into the details of how this integrates with Nanite. We aren’t replacing anything Nanite already does. Nothing is changed about the Nanite data structure or how it is built. We are only adding new stages to the rendering pipeline. Instead of directly rasterizing a cluster’s triangles, those triangles are considered to be patches with a displacement function. As such, depending on their size they may need additional tessellation before rasterizing.

That means there can be a mix of both simplification from base Nanite and amplification from tessellation simultaneously on the same mesh. Small patches are collapsed away by Nanite’s LOD selection and large patches are tessellated. The base LOD heuristic is still valid. It takes into account both positional error and normal error. If both are low then many patches can be represented by a few with little loss whether they are displaced or not. It isn’t simplifying the displacement function in this case, just the base geometry that is being displaced.

The error is calculated for the original surface area, which could change significantly due to displacement, but that is one of many cases where the displacement signal must be known to compute the correct error. Because the displacement isn’t known, its impact is ignored.

Pipeline

This is a high level diagram of the stages in the base Nanite rendering pipeline:

These are the new pipeline stages appended to the end to support tessellation:

As you can see, the new pipeline architecture is very similar to what was there before. Not everything needs tessellation though. Currently only materials with displacement mapping enable tessellation. In the future we may also support higher order surfaces. Non-tessellated materials still end where they did before.

The programmable blocks are actually multiple passes. There was one such block before but now there are two: ClusterRasterize and PatchRasterize. There are binning setup passes that I’ve omitted from those blocks but more importantly there is a Dispatch and Draw call for each material using a programmable rasterization feature. We can ignore the previously supported programmable features for the moment. What is important to understand is that displacement mapping is a programmable feature. The displacement function is shader logic expressed by a node graph and authored by an artist. Any shader that needs to evaluate the displacement function must be programmable meaning there needs to be a material shader permutation compiled for it and a DispatchIndirect to launch it.

It is also important to realize what is not a programmable stage. Artist programmable shaders should in general be minimized to reduce the number of unique compiled shaders and dispatches, but in this case there are also scheduling reasons to avoid them. Not being programmable obviously means not having access to any of the programmable logic, which can be difficult. I’ll get into all of this more later.

ClusterRasterize

Time to start digging into these stages. The ClusterRasterize compute shader used to do exactly what its name implies, take a visible cluster and rasterize it using our software rasterizer.

Rough sketch of old ClusterRasterize:

Scalar

Load cluster data

Thread per vertex

Load position
Transform position
Store in groupshared

Thread per triangle

Load indexes
Load transformed positions from groupshared
Rasterize triangle
- If pixel inside triangle then atomic write to VisBuffer

In the triangle stage it now needs to check if a patch needs tessellation first. To do that it calculates its TessFactors. If all are <=1 it doesn’t need tessellation. Even still the vertices need to be displaced, hence why this is still a programmable stage. If all TessFactors <= MaxDiceFactor it can be added to the dice queue, otherwise it’s added to the split queue.
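The routing decision described above can be sketched as follows. This is illustrative Python, not the shader. `route_patch`, the route names, and any `max_dice_factor` value passed in are assumptions of mine.

```python
def route_patch(tess_factors, max_dice_factor):
    """Decide what ClusterRasterize does with a patch given its 3 edge TessFactors."""
    if all(factor <= 1 for factor in tess_factors):
        # No tessellation needed, but the vertices still get displaced.
        return "rasterize"
    if all(factor <= max_dice_factor for factor in tess_factors):
        return "dice_queue"
    # At least one edge is too long to dice in one go.
    return "split_queue"
```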

We’ll return later to ClusterRasterize as it is far more complex than that but this is enough for now to stand up a functional pipeline.

PatchSplit

Each thread reads 1 subpatch from the split queue and operates on it. It will bound and cull. If visible, determine TessFactors and split it.

If TessFactors <= MaxDiceFactor it is queued for dicing and PatchRasterize. Otherwise it is split into child subpatches according to the SplitFactors and the Tessellation Table. The Tessellation Table’s barycentrics for each child are now relative to the subpatch so they need to be made absolute, ie relative to the base patch. Those child subpatches are finally added back to the split queue for another round of PatchSplit. I’ll skip over how to load balance the work of a variable number of children for now.
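Making child barycentrics absolute is just a weighted sum of the parent subpatch's corner barycentrics. A minimal sketch, assuming full 3 component barycentrics as plain floats rather than the packed format. `to_absolute` is a hypothetical helper name.

```python
def to_absolute(parent_corners, child_bary):
    """Convert a barycentric relative to a subpatch into one relative to the base patch.

    parent_corners: the subpatch's 3 corners, each a (u, v, w) barycentric
                    in the base patch.
    child_bary:     (a, b, c) weights relative to the subpatch, summing to 1.
    """
    return tuple(
        sum(weight * corner[k] for weight, corner in zip(child_bary, parent_corners))
        for k in range(3)
    )
```

Because the output is again relative to the base patch, the same function can be applied at every level of recursion without unwinding a stack of parents.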

Subpatch format

I’ve been referring to reading and writing subpatches but how are they stored? Ideally the data to express a subpatch is as small as possible since recursive splitting means a lot of reading and writing of these subpatches from the split work queue. Unlike regularized topology we can’t just store an index representing which region of the original patch this subpatch covers. The Tessellation Table generates irregular topology by design. While it might be tempting to store an index into the Tessellation Table, that would only work for 1 level of recursion. Each level is relative to the last and the stack would need to be unwound to derive the actual coordinates. Instead a subpatch stores barycentric coordinates for all 3 corners.

struct Subpatch
{
    uint32 VisibleClusterIndex : 25;
    uint32 TriIndex : 7;
    uint32 BarycentricsUV[3]; // 16:16
};
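Here is how that 16 byte layout might pack and unpack, sketched in Python with integers standing in for dwords. The bit order within each dword (cluster index in the low bits, U in the low half) is an assumption. C++ bitfield layout is compiler dependent and the original code may differ.

```python
def pack_subpatch(visible_cluster_index, tri_index, barycentrics_uv):
    """Pack a subpatch into 4 dwords.

    barycentrics_uv: three (u, v) pairs already quantized to 16 bit integers.
    """
    assert 0 <= visible_cluster_index < (1 << 25)
    assert 0 <= tri_index < (1 << 7)
    words = [visible_cluster_index | (tri_index << 25)]
    for u, v in barycentrics_uv:
        assert 0 <= u <= 0xFFFF and 0 <= v <= 0xFFFF
        words.append(u | (v << 16))
    return words

def unpack_subpatch(words):
    """Inverse of pack_subpatch."""
    header = words[0]
    corners = [(word & 0xFFFF, word >> 16) for word in words[1:4]]
    return header & ((1 << 25) - 1), header >> 25, corners
```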

Culling

Visibility culling for subpatches works the same as clusters; their bounds are tested against the frustum, HZB for occlusion, and VSM page mask in the case of shadows. Unlike clusters, tight bounds around displaced patches aren’t known until they are displaced, classic chicken or egg problem. Since the displacement function is arbitrary shader logic there is limited ability to be clever, outside of involved interval arithmetic analysis of shader code.

Instead we ask the user to specify the max range of the displacement function. This is a key source of user error. I’ve tried my best to reframe the problem as a user defined mapping of a [0,1] displacement function to encourage full use of the specified range, but it continues to be a pitfall artists commonly trip over, overly bloating bounds and destroying culling and performance.

But, even if we wanted to evaluate displacement in this shader we couldn’t. PatchSplit is a global shader and not specialized per material. I’ll explain why in a moment. This one shader is responsible for splitting patches from all materials so all logic contributing to patch splitting must be fixed function.

This user provided displacement range is in the same space as the displacement function, ie it is scalar. To determine screen space bounds for testing I use the technique from [Niessner and Loop]. We have since removed the normalizing of displacement vectors after interpolation so the spherical capped cone logic is unnecessary but it has yet to be adjusted. A simple union of prism corners should be slightly tighter.
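The "union of prism corners" idea can be sketched like this. It assumes displacement moves each point along the interpolated, unnormalized corner vector by an amount within the user provided range, so the displaced surface stays inside the convex hull of the six offset corners. The sketch works in whatever space the inputs are given and leaves out projection to screen space. All names are illustrative.

```python
def prism_bounds(corners, displacement_vectors, min_displacement, max_displacement):
    """Axis aligned bounds of a patch swept through its displacement range.

    corners, displacement_vectors: three 3D points / vectors, one per patch corner.
    Returns (bounds_min, bounds_max) as 3 element lists.
    """
    # Each corner contributes two points: offset by the min and by the max.
    points = [
        [p[i] + d * n[i] for i in range(3)]
        for p, n in zip(corners, displacement_vectors)
        for d in (min_displacement, max_displacement)
    ]
    bounds_min = [min(point[i] for point in points) for i in range(3)]
    bounds_max = [max(point[i] for point in points) for i in range(3)]
    return bounds_min, bounds_max
```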

Recursive splitting

Recursive splitting is effectively recursive tree expansion of an implicit tree. This task looks just like Nanite’s hierarchical cluster culling so the same tool to solve work distribution comes to mind, a persistent threads shader with a global MPMC lockless work queue. That is exactly what I started with and was what shipped in the first version.

Multipass for recursive

As previously mentioned this is not D3D spec compliant and not guaranteed to make forward progress on all GPU architectures. It does work well on many though. On PC, where we can’t be certain, we no longer use it for cluster culling. Dealing with the support burden wasn’t worth it for the perf improvement. For fixed platforms like consoles we still use it.

Likewise, instead of persistent threads, PatchSplit now is multipass on all platforms not just PC. Remember the key advantage of persistent threads was avoiding sync points with drain and spin up where the GPU isn’t filled. With PatchSplit we can async overlap with non-tessellated Nanite rasterization work through smart scheduling and not worry about idling.

Also worth noting that there is a known and small limit to the number of passes required. Subpatches write 16b barycentrics. The MaxSplitFactor can only divide 65534 so many times before written child subpatches will be the same as their parents. With the persistent threads approach this actually caused problems since in rare cases recursion would go past this depth and become an infinite loop. Detecting that requires a bunch of checks that a fixed recursion depth of multipass doesn’t need.
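A rough bound on the pass count follows from that reasoning: keep dividing the barycentric range by MaxSplitFactor until a child can no longer be smaller than its parent. The sketch below takes the range as a parameter. The MaxSplitFactor values in the usage note are made up for illustration, since the text doesn't state the real one.

```python
def max_split_passes(barycentric_range, max_split_factor):
    """Upper bound on useful recursion depth for fixed point barycentrics."""
    assert max_split_factor >= 2
    passes = 0
    extent = barycentric_range
    while extent > 1:
        extent = -(-extent // max_split_factor)   # ceiling division
        passes += 1
    return passes
```

With the range of 65534 quoted above this gives 6 passes for a hypothetical MaxSplitFactor of 8 and 16 passes for a factor of 2.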

One could argue that those cases showing up in practice suggests more than 16b is needed for subpatch barycentrics but I’d argue that 1 base triangle being tessellated to a resolution of MaxDiceFactor * 64k is surely plenty. The encountered crashes only happened in unreasonable cases with cameras crashing through surfaces.

Either way, whether persistent threads or multipass, it becomes clear now why it is important this is a global shader and thus all logic needs to be fixed function, decoupled from the programmable material displacement. If there were per material dispatches of PatchSplit a persistent threads style would need to spin up enough threads to fill the machine each time, often only to immediately retire since many wouldn’t have much work. Impossible to know how much work a single starting patch might turn into due to recursion. No overlap could be used between dispatches if they used the same buffer for the queue. Multipass has a simpler reason: overhead of lots of dispatches.

Being a global shader means any data about the materials that needs to be accessed, such as the displacement range needs to be in global buffers. It also means no user programmable normals, no programmable dicing rate.

PatchRasterize

Unlike the split queue, the dice queue doesn’t contain subpatch structs since every patch in this queue already was on the split queue or is a base patch. Instead we can index them and reduce memory and traffic. What has also been done already in a prior stage is calculating the TessFactors for dicing. To avoid repeating that, they are also written to the dice queue, or more accurately, the index of the Tessellation Table pattern they correspond to is.

Subpatches after splitting are mostly close to uniform topology and full size. By that I mean most have TessFactors near MaxDiceFactor. This is because splitting already dealt with most of the irregularity, assuming that patches reaching here were the result of splitting. That isn’t true from what I’ve explained so far but will be true by the end.

Given this, PatchRasterize can follow the same pattern as base Nanite ClusterRasterize. It's fine to statically assign threads to diced vertices and triangles and not worry about empty threads since there will be few of those.

Rough sketch of PatchRasterize:

Scalar

Load cluster data
Load patch data
Load patch corner data
Transform corners

Thread per diced vertex

Load vertex barycentrics from Tessellation Table
Lerp the corners using barycentrics
Evaluate displacement function
Transform position
Store in groupshared

Thread per diced triangle

Load indexes from Tessellation Table
Load transformed positions from groupshared
Rasterize triangle

Too much scalar

While there is a decent chunk of scalar work in base ClusterRasterize in the form of per cluster work, there is much more with PatchRasterize since there is also per patch work. In most cases scalar work gets hidden by vector work. If there is too much of it, too front loaded, it can start to matter. Worse yet is when that group uniform work is float math. RDNA3 and earlier don’t have any scalar float ops so these ops go to the vector unit with only 1 lane utilized. Basically while the virtual “domain shader” (DS) work is vectorized the “vertex shader” (VS) work isn’t.

Having all this scalar work makes poor use of what is primarily a vector processor. To try and vectorize this while still keeping the results in registers, Rune Stubbe made an optimization where both the 3x work from patch corners as well as multiple patches are spread across the threads. This means 1 wave works on multiple patches, purely to try and vectorize per patch work. First each thread works on 1 patch corner (VS). Then the shader switches to 1 patch at a time, looping over all patches this wave covers. A patch loads its data from the first phase using WaveReadLaneAt.

Don’t normalize

Another optimization Rune made was noticing that if he removed normalizing the vertex normal after interpolation, that all the rest of the transform math is linear. That means that the base patch positions and base patch normals can be transformed to clip space and shared for the whole patch. This is basically like moving work from DS to VS. The visual difference and more importantly the choice to normalize in the first place is conventional but arbitrary.
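A small numeric check of that linearity claim, in Python. It assumes a 4x4 clip space matrix applied to positions with w = 1 and to displacement vectors with w = 0, a scalar displacement `d`, and barycentrics that sum to 1. The helper names are mine.

```python
def mat_vec(m, v):
    """4x4 matrix times 4 vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def lerp3(bary, a, b, c):
    """Barycentric interpolation of three equally sized vectors."""
    return [bary[0] * x + bary[1] * y + bary[2] * z for x, y, z in zip(a, b, c)]

def displace_then_transform(m, bary, positions, normals, d):
    """Interpolate, displace in local space, then transform (per diced vertex)."""
    p = lerp3(bary, *positions)
    n = lerp3(bary, *normals)          # deliberately not normalized
    return mat_vec(m, [p[i] + d * n[i] for i in range(3)] + [1.0])

def transform_then_displace(m, bary, positions, normals, d):
    """Transform the 3 corners once, then interpolate and displace in clip space."""
    clip_p = [mat_vec(m, list(p) + [1.0]) for p in positions]
    clip_n = [mat_vec(m, list(n) + [0.0]) for n in normals]
    p = lerp3(bary, *clip_p)
    n = lerp3(bary, *clip_n)
    return [p[i] + d * n[i] for i in range(4)]
```

Both functions return the same clip space position up to floating point rounding. Inserting a normalize after interpolating the normal in the first one would break that equality, which is why dropping it lets the corner transforms be shared by the whole patch.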

Software rasterization only

Unlike clusters, patches don’t have a hardware rasterization path. The most obvious reason is that Reyes generates micropolys by design. Even if the dicing rate is larger than 1 pixel it will still be far smaller than the threshold for switching to HW rasterization. This is convenient in that it means we don’t need an additional shader permutation for HW. We also don’t need to issue the draws for them, most of which would be empty. This presents some issues though.

Near plane clipping

The first is easy to address. Previously, to avoid having to deal with near plane clipping in SW, we sent any clusters that intersected the near plane down the HW path. We still don’t want to deal with it but at least this time the triangles are small. So instead of clipping I cull triangles that cross the near plane. This looks nearly indistinguishable from clipping since the triangles are so small. Initially I culled subpatches but that was too coarse and depended on their bounds, which may be considerably bigger than the subpatches themselves, leading to unexpected culling of subpatches nowhere near the camera.

No MaxEdgeLength test

While tessellation tries to achieve triangles that are the size of the dicing rate it only does so before displacement. Typically the difference is small but it isn’t guaranteed. Sharp discontinuities in the displacement function are the worst case. These will cause a diced triangle to stretch the distance from min to max displacement. That can be longer than is good for the SW rasterizer to handle. Instead of allowing unbounded rasterization cost there is a clamp to the screen rect a triangle covers in the rasterizer (set to max 64 pixels). This means this worst case will visually appear as the surface tearing apart under too much stretching as the camera gets close. This may force us to add HW rasterization in the future.

Holes due to SW rasterized triangles longer than 64 pixels

DS derivatives

The last issue isn’t because of SW raster, nor is it even particular to Nanite Tessellation. There are no automatic derivatives in domain shaders. Texture samples in the displacement function require UV derivatives for mipmapping to work. Mipmapping needs to work for band limiting the signal and reducing aliasing but more importantly it is needed for cache coherency. Using mip0 can be a serious performance loss.

Traditional Reyes renderers dice into rectangular grids. Finite differences along grid UV directions are simple and can be used for shading but the Tessellation Table’s irregular meshing doesn’t provide that simple neighbor lookup. Perhaps analytic gradients could be used instead like we do for deferred materials? It still needs to be relative to a sampling rate though. For that we can use the chain rule:

\begin{equation} \frac{dUV}{dTessFactors} = \frac{dUV}{dXYZ} \frac{dXYZ}{dTessFactors} \end{equation}

Pardon my lack of mathematical rigor here in terms of dimensionality. Anisotropic texture filtering isn’t important, only isotropic trilinear filtering is needed, so for simplicity assume this is projected in the direction that maximizes this derivative and the equation above can be treated as scalar.

The second term that includes TessFactors is inconvenient. It’s only defined at the edges. The interior could somehow lerp it but the corners would still remain undefined. Thankfully $\frac{dXYZ}{dTessFactors}$ is effectively $\frac{1}{DicingRate}$. That was how the TessFactors were computed in the first place. That is a perfectly smooth function in XYZ space, independent from the surface. So it is defined, continuous, and thus will always match between patches.

The problem of continuity is actually with the first term. Taken directly this is a piecewise constant function. It’s basically the tangent basis of each face before orthonormalization. Like TessFactors, edges could be made to match by only using data of the edge. Edges could exclusively use the UV difference along the edge, ignoring any component of the gradient orthogonal to the edge direction. Unlike TessFactors this doesn’t entirely solve it due to the corners. Corners won’t have just 1 neighbor. They will have an arbitrary number of them depending on the valence of the vertex. Following a similar trick would mean using only vertex data which doesn’t make sense for a rate of change since a point is zero size.

Solving this requires some form of continuous function over the mesh through preprocessing. Like how the tangent basis is typically calculated, this could be averaged and stored per vertex but at a memory cost. Worth noting we support multiple UV channels. From a workflow perspective it would mean extra heavyweight data would need to be optionally built for meshes to allow displacement on them.

UV density

Instead of averaging per vertex Jamie Hayes implemented averaging this UV density over the whole mesh (for each UV channel). This is the same thing that most engines will do for informing mip based texture streaming. This won’t work well for large differences in texture density across a single mesh but that is considered undesirable for anything visible anyways. This also doesn’t account for any other attribute that could be used in the shader for procedural texturing. For anything else we can’t provide a reasonable derivative for, we use zero. These cases are rare and the performance loss is acceptable.
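With a single mesh wide UV density, picking a mip level reduces to the usual log2 of texels per sample. A hedged sketch: `uv_density` is UV units per world unit, `sample_spacing` is what I take the second term of the chain rule to be (the world space distance between diced vertices), and the texture is assumed square. None of these names come from the engine.

```python
import math

def displacement_mip_level(uv_density, sample_spacing, texture_size):
    """Isotropic mip level for sampling a displacement texture at diced vertices."""
    uv_per_sample = uv_density * sample_spacing        # chain rule, treated as scalar
    texels_per_sample = uv_per_sample * texture_size
    if texels_per_sample <= 0.0:
        return 0.0                                     # zero derivative falls back to mip 0
    return max(0.0, math.log2(texels_per_sample))
```

For example, a density of 0.25 UV per unit, a spacing of 0.0625 units and a 1024 texel texture cover 16 texels per sample, which selects mip 4.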

How to tessellate

2026-02-01T20:29:00.002-06:00

There are many different approaches for determining whether and by how much a patch needs to be tessellated. It is important that we don’t create cracks between neighboring patches or subpatches. Many Reyes implementations choose tessellation patterns which do not match and must stitch them together after they are generated. We need something that can be determined purely from the subpatch itself with no communication needed between subpatches such that they can run completely in parallel.

TessFactors

The approach Moreton and D3D’s hardware tessellation stage take is a simple solution to this problem and the one I use. So long as the only data about a patch that affects the placement of vertices on an edge is data about the edge itself, those vertices will match between different patches. Each edge computes a tessellation factor, ie the number of segments they wish to be subdivided into. I will refer to these 3 edge factors as TessFactors. D3D also has an inner tessellation factor but in practice that is derived from the edge factors and mostly is just an artifact of the tessellation pattern D3D uses.

That doesn’t explain how these TessFactors are calculated. There are various approaches to that as well. Calculating the length of the undisplaced edge in screen space doesn’t work since that might be zero but displacement causes the patch to face towards the camera. Other approaches suggest the opposite, only densely tessellate at silhouettes and reduce tessellation where the displacement direction faces the camera and displacing would only change depth. Depth matters for object intersections and shadows though.

Diagsplit samples the displacement function at a few points to estimate its screen space length. Not only is that too expensive, it requires every shader that needs to calculate TessFactors to have access to the displacement function, ie be programmable, which isn’t practical. Also ruled out are artist provided density hints expressed by the material graph for the same reason.

The approach I take is simple, common, and similar to what Nanite already uses to project object space error to screen space. TessFactor for an edge is based solely on its world space undisplaced length which is projected to pixels as if the edge was perpendicular to the view vector. This length in pixels is divided by a global DiceRate setting to convert it to TessFactor. Often DiceRate >1 pixel can save cost with little visual difference. The UE default is 2 pixels.
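A minimal sketch of that calculation in Python. The text doesn't say where along the edge distance is measured or how the projection scale is derived, so measuring from the edge midpoint and using a standard perspective scale from vertical FOV and viewport height are my assumptions, as are all parameter names.

```python
import math

def edge_tess_factor(p0, p1, camera_pos, fov_y, viewport_height, dice_rate=2.0):
    """TessFactor for one edge using only its undisplaced world space endpoints."""
    length = math.dist(p0, p1)
    midpoint = [(a + b) * 0.5 for a, b in zip(p0, p1)]
    distance = math.dist(midpoint, camera_pos)   # assumed nonzero
    # Pixels covered by 1 world unit at that distance if it faced the camera.
    pixels_per_unit = 0.5 * viewport_height / (math.tan(0.5 * fov_y) * distance)
    return length * pixels_per_unit / dice_rate
```

Because only the two endpoints feed the result, and swapping them changes nothing, two patches sharing an edge arrive at the same factor without communicating.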

Uniform density dicing

We have these TessFactors for a patch we want to dice. They represent roughly the length of each edge of the patch. We want to dice that triangular patch into uniformly sized triangles. What is typically called a uniform tessellation is one where a shape is tessellated into many smaller identical shapes similar to the original. The angles are congruent. This is topologically uniform. That is not helpful. What we need is uniform density.

Uniform topology

Uniform topology snapped to edge TessFactors

Uniform density

I’ll define an optimal uniform density tessellation as one with the minimum number of triangles possible where all edges are <= a chosen length. The length of the longest edge dictates the worst case sampling rate of the displacement function and thus the signal resolution. Achieving this target edge length with any more triangles wouldn’t be optimal. It isn’t important to achieve a perfectly optimal tessellation but it is useful to understand the target and its properties to understand how to approach it.

Remeshing

Thankfully generating meshes like these, isotropic meshes made of roughly equal length edges forming equilateral triangles with vertices close to valence 6, is a common operation called remeshing and there are many published approaches to do so. Perhaps the most popular is [Botsch and Kobbelt 2004].

The algorithm goes as follows:

For N iterations:
- For all edges:
  - If edge is too long then split it
  - If edge is too short then collapse it
  - If edge could be shorter if it was flipped then flip it
- For all vertices:
  - Move position to the average of its neighbors

After many iterations the result will approach an isotropic mesh matching the desired properties. There are more considerations when this is meant to express a particular surface but we are only concerned with remeshing a flat triangular patch. For this case the only consideration needed is to constrain boundary vertices to the patch edge they started on.

Tessellation Table

This remeshing process is far too expensive to do in real-time though. Thankfully since we are working with just the patches themselves and there is a limit on the maximum TessFactor for dicing, every permutation of TessFactors can be precomputed and placed in a lookup table which I will call the Tessellation Table. The TessFactors index into the Tessellation Table. I will call an entry in this table a Tessellation Pattern.

What exactly is stored in the Tessellation Table? For each pattern there is a vertex and index buffer as if it were a little mesh. Instead of positions, the vertex buffer stores barycentric coordinates in the patch. When rendering, the patch can be replaced with this little mesh. The barycentrics are used to interpolate the patch corners. Because each pattern is of variable size there is also an offset table translating the TessFactor index into the buffer offsets for the VB and IB. The barycentrics are stored as 2 16bit coordinates with the 3rd coordinate implied. The indexes are 10bit so all 3 corners of a triangle can be packed into 1 dword.
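To make the packing concrete, here is a small Python sketch of the two encodings described above. The bit layout (first corner in the low bits, u in the low half) and the fixed point scale of 65534 are assumptions for illustration, as are all of the names.

```python
INDEX_BITS = 10
INDEX_MASK = (1 << INDEX_BITS) - 1   # 1023, the largest vertex index a pattern can use
BARY_ONE = 65534                     # assumed fixed point value representing 1.0

def pack_triangle(i0, i1, i2):
    """Pack three 10 bit vertex indexes into one 32 bit dword (assumed layout)."""
    assert all(0 <= i <= INDEX_MASK for i in (i0, i1, i2))
    return i0 | (i1 << INDEX_BITS) | (i2 << (2 * INDEX_BITS))

def unpack_triangle(dword):
    """Recover the three vertex indexes from a packed dword."""
    return (dword & INDEX_MASK,
            (dword >> INDEX_BITS) & INDEX_MASK,
            (dword >> (2 * INDEX_BITS)) & INDEX_MASK)

def unpack_barycentrics(packed):
    """Decode two 16 bit coordinates and derive the implied third one."""
    u = (packed & 0xFFFF) / BARY_ONE
    v = (packed >> 16) / BARY_ONE
    return u, v, 1.0 - u - v
```

Three 10 bit indexes use 30 of the 32 bits, which is why a pattern is limited to 1024 vertices in this layout.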

Tessellation Table redundancy

There is a lot of redundancy in this indexing though. The Tessellation Pattern for (3,4,2), (3,2,4), (4,3,2) for example are all the same. They are just rotated or mirrored versions of the same pattern. Instead a unique index into the table is defined as an ordering of the TessFactors from largest to smallest.

This reduces the number of patterns stored from $N^3$ to $\binom{N+2}{3}$ or $\frac{N(N+1)(N+2)}{6}$, where $N$ is the max TessFactor the Tessellation Table covers. For N=16 this is the difference between 4096 and 816 or <20% the patterns needing to be stored. Size reduction also reduces cache pollution.

To correctly alias patterns the reordering must also be undone when the pattern is used. First, if the winding flips it needs to be reversed so backface culling is preserved. Second, the barycentrics stored in the table need to be unswizzled so they correctly index the corners of the patch. Alternatively, the patch corners themselves can be swizzled which is often cheaper since it happens at lower frequency.
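A minimal Python sketch of this aliasing follows. It sorts the TessFactors from largest to smallest, records the permutation so it can be undone, and derives whether the winding flips from the parity of that permutation. The particular index formula is one possible dense numbering of sorted triples and is an assumption, as are the names.

```python
from math import comb

def pattern_key(tess_factors):
    """Map three TessFactors (each >= 1) to a unique sorted-pattern index.

    Returns (index, order, flip_winding) where order[i] is the original slot
    that ended up in sorted position i.
    """
    order = sorted(range(3), key=lambda i: -tess_factors[i])
    a, b, c = (tess_factors[i] for i in order)   # a >= b >= c
    # An odd permutation mirrors the triangle, so the winding must be reversed.
    inversions = sum(order[i] > order[j] for i in range(3) for j in range(i + 1, 3))
    flip_winding = inversions % 2 == 1
    # Dense numbering of sorted triples in [0, N(N+1)(N+2)/6).
    index = comb(a + 1, 3) + comb(b, 2) + (c - 1)
    return index, order, flip_winding
```

For example (3,4,2), (3,2,4) and (4,3,2) all produce the same index, and only (3,4,2) reports a winding flip since it is a mirrored rather than rotated version of (4,3,2).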

Tessellation Pattern building

The remeshing algorithm previously explained was written for meshes with Cartesian coordinates but Tessellation Patterns only store barycentric coordinates. Another way to describe this issue is to say the original algorithm assumes extrinsic geometry but Tessellation Patterns are intrinsic geometry. An extrinsic triangle is defined by the position of its corners. It is embedded in a space. An intrinsic triangle is defined by its edge lengths. It doesn’t have any specific position or orientation. A Tessellation Pattern is intrinsic. A pattern exists for each combination of TessFactors. TessFactors are treated as patch edge lengths so the goal for a pattern is to tessellate the patch into triangles with roughly unit length edges.

Intrinsic isn’t a problem for the relaxation step. The average of Cartesian and barycentric coordinates will result in equivalent positions since the math is linear. What extrinsic appears to be needed for is the edge length calculations. Thankfully that is not the case. A good number of geometry calculations can be done with only barycentrics and edge lengths.

For a pair of barycentric points P and Q there is a vector between them $\mathbf{PQ} = Q - P$. Unlike normalized barycentric points where $u+v+w=1$, for normalized barycentric vectors $u+v+w=0$, since $1-1=0$. The squared length of a barycentric vector $\mathbf{PQ}(u,v,w)$ in a triangle with edge lengths $(a,b,c)$ is:
\begin{equation} \lVert \mathbf{PQ} \rVert^2 = -a^2 v w -b^2 w u -c^2 u v \end{equation} Thankfully this also implicitly handles a nonobvious issue. While treating TessFactors as edge lengths is sensible and expresses what we are optimizing for, they aren’t exactly lengths. There are rare cases where TessFactors interpreted as edge lengths express a non-Euclidean triangle, specifically $a>b+c$, which violates the triangle inequality. If extrinsic geometry was required to calculate edge lengths it wouldn’t be possible for these patterns. Deriving the extrinsic coordinates would result in divides by zero and square roots of negative numbers. Working as intrinsic instead, using the formula above, no special handling is needed. While the distances lose geometric meaning in these cases it degrades gracefully.
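The formula is easy to check numerically. The Python sketch below assumes the usual labeling where edge length $a$ is opposite the corner weighted by $u$, $b$ is opposite $v$ and $c$ is opposite $w$. That labeling and the function name are assumptions for illustration.

```python
def bary_length_sq(p, q, a, b, c):
    """Squared distance between barycentric points p and q in a triangle
    with edge lengths (a, b, c), using only intrinsic quantities."""
    u, v, w = (q[i] - p[i] for i in range(3))   # barycentric vector, sums to 0
    return -a * a * v * w - b * b * w * u - c * c * u * v

# Check against a 3-4-5 right triangle with corners A=(0,0), B=(3,0), C=(0,4),
# so a=|BC|=5, b=|CA|=4, c=|AB|=3. The centroid sits at (1, 4/3).
centroid = (1 / 3, 1 / 3, 1 / 3)
corner_a = (1.0, 0.0, 0.0)
dist_sq = bary_length_sq(centroid, corner_a, 5.0, 4.0, 3.0)   # expect 1 + 16/9

# Edge lengths that violate the triangle inequality still evaluate without error.
degenerate = bary_length_sq((0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 8.0, 2.0, 3.0)
```

The second call walks the full length of the edge with length 8 and returns 64, even though no Euclidean triangle with edges (8, 2, 3) exists.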

Barycentric quantization

The last thing to take care of is to make sure boundary vertices bitwise match along an edge with all other patterns with the same TessFactor. If they don’t there will be cracks. That can be left for when the barycentrics are quantized to 16bit. It is important though that this quantization is symmetric.

Consider 2 adjacent triangles rotated 180 degrees from one another. Their shared edge will also share the same corners except swapped. Their coordinates lerp in opposite directions. If this edge is where $w=0$, then for vertices along the edge to match their counterpart $(u,v,w)=(1-v,1-u,w)$. This implies that quantization must be symmetric about 0.5. Whatever direction x rounds needs to be the opposite of what happens to 1-x. If 0.25 rounds down, 0.75 needs to round up, or vice versa.

The most obvious way to quantize to 16b fixed point would be to represent the range [0.0, 1.0] as [0, 65535]. With that all float values can be made to match a reversed counterpart at the boundaries except one: 0.5, the midpoint. This is a point that will be used by any pattern that has an even TessFactor. 0.5 can’t round in the opposite direction as 0.5. It needs to be stored exactly. The easiest fix is to use an even max value, so map 1.0 to 65534.

If each coordinate is quantized separately they might not still sum to 1. To fix this for interior vertices any coordinate could be chosen to be rederived, so we can say it's always w=1-u-v. But boundary vertices need to be symmetric. To do that I quantize the median barycentric and rederive the max barycentric given they are normalized (the min is always zero on the boundary).
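Here is a minimal Python sketch of this quantization scheme. It assumes round to nearest for the single coordinate that gets quantized and an exact zero for the minimum coordinate on the boundary. The names are hypothetical.

```python
BARY_ONE = 65534   # even max value so 0.5 maps exactly to 32767

def quantize(x):
    """Round a coordinate in [0, 1] to the nearest fixed point value."""
    return int(x * BARY_ONE + 0.5)

def quantize_interior(u, v):
    """Interior vertices: quantize u and v, rederive w so the sum stays exact."""
    qu, qv = quantize(u), quantize(v)
    return qu, qv, BARY_ONE - qu - qv

def quantize_boundary(bary):
    """Boundary vertices: quantize only the median coordinate and rederive
    the max one, so a reversed copy of the edge produces the same bits."""
    i_min, i_med, i_max = sorted(range(3), key=lambda i: bary[i])
    q = [0, 0, 0]                      # the min coordinate is zero on the boundary
    q[i_med] = quantize(bary[i_med])
    q[i_max] = BARY_ONE - q[i_med]
    return tuple(q)
```

Because a vertex and its counterpart on the reversed edge hold the same two values in swapped slots, they pick the same median, quantize it identically, and end up with the same pair of integers in swapped slots.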

Uniform density splitting

The Tessellation Table can be used for splitting as well. Typically binary splitting is used. There are advantages to using a wider branching factor though. Wider means a shallower tree, less recursion, and thus less traffic to and from a work queue. Perhaps more importantly though it has the potential to generate more uniformly shaped subpatches for the same reason we were interested in a uniformly dense final tessellation. This can reduce the number of subpatches and make each subpatch more likely to be uniform in dimensions and closer to having max DiceFactors, which we'll see matters later.

Diagsplit longest edge splitting vs Uniform splitting

SplitFactors

How best to take advantage of this flexibility? Simplify the question down to 1D and look at just a single edge. If an edge has a TessFactor of 32, which is larger than the max TessFactor of 16, what SplitFactor should be used in this step? 32 is a multiple of 16 but it isn’t a power of 16 so there is a choice. Clamping to 16 means it will split into 16 subedges which each will then have a DiceFactor of 2. For reasons I will get into later it is important to have DiceFactors be as large as possible, or in other words do as much of the tessellation in the dicing phase as possible. So the other option in this example is SplitFactor of 2 and DiceFactor of 16.

The same question can be asked at every step of recursive splitting. Is it better to do the smaller factors early? Does the ordering matter besides for the final dicing step I mentioned already? Predicting too far ahead won’t work well since the desired TessFactor in an early step may not be what is chosen by the end because the projection on screen refines with smaller edges. How about the aspect ratio? Would it be better to address aspect ratio early and generate uniform sized subpatches as soon as possible? Unfortunately that isn't possible while still conforming to the limit of TessFactors being determined purely from that edge’s data.

In my tests the best calculation for turning desired TessFactor into the SplitFactor is the following:

SplitFactor = min( TessFactor / MaxDiceFactor, MaxSplitFactor )

This tries to emit subpatches from splitting with maximized DiceFactors but nothing else. Other choices were slower.
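As a small worked example of that formula, here is a Python sketch. The formula above does not say how the division is rounded, so rounding up and clamping to at least 1 are assumptions here, and the names are hypothetical.

```python
import math

def split_factor(tess_factor, max_dice_factor, max_split_factor):
    """SplitFactor = min(TessFactor / MaxDiceFactor, MaxSplitFactor)."""
    wanted = math.ceil(tess_factor / max_dice_factor)   # assumption: round up
    return max(1, min(wanted, max_split_factor))        # assumption: never below 1

def remaining_tess_factor(tess_factor, max_dice_factor, max_split_factor):
    """TessFactor left on each subedge after one split step."""
    return tess_factor / split_factor(tess_factor, max_dice_factor, max_split_factor)
```

For the edge with TessFactor 32 and both limits at 16, this picks a SplitFactor of 2, leaving each subedge with a TessFactor of 16 that can be diced directly at the max DiceFactor.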

Results

Uniform dicing using the Tessellation Table results in 69% of the diced triangles compared to D3D style uniform topology. Uniform splitting using the Tessellation Table results in 68% of the split patches compared to binary splitting. More uniformly sized triangles also benefit the rasterizer.

I believe this Tessellation Table approach could have wide applicability due to its more optimal density. The first such use has already been out for a while. The Tessellation Table UE builds has already been used with permission outside of UE in https://github.com/nvpro-samples/vk_tessellated_clusters.

Possible approaches for tessellation

2026-02-01T20:28:00.000-06:00

Requirements

All the original goals of Nanite should apply to our approach here as well. One of them implicitly covers the rest: authoring decisions should have no impact on performance. While not 100% realized, Nanite mostly behaves that way so that principle should be maintained if at all possible. Applied here, that would mean a run-time tessellated and displaced mesh should render just as fast as an offline tessellated and displaced Nanite mesh. So the performance characteristics of Nanite, and thus its goals to the extent they were realized, should ideally be shared by Nanite Tessellation. This is not strictly a requirement because, as we'll see by the end, it was not achieved. But it very much was a goal and the closer we can get to this ideal the better.

Tracing

Tessellation is actually not a requirement, it’s an implementation detail. Displacement is the artist facing feature. There are a variety of tracing based approaches to scalar displacement.

Shell mapping and its descendants find the ray enter and exit point through the prism formed by the base triangle extruded along its vertex normals, then ray march the displacement function in tangent space. Displaced Micro Maps are similar except the ray march is replaced with a quadtree traversal that acts as an acceleration structure for the barycentric space. Neither have any acceleration in the displacement direction. This means if the displacement range is large compared to the size of base triangles there will be performance problems. There will be many tall skinny prisms with high overdraw and no way to quickly skip empty space. The expectation with these techniques is that the prisms will be shallow. In the case of DMMs this comes from their creation offline. That does not apply here since the base mesh will be an arbitrary Nanite mesh that could be dense and the displacement function is a shader with a range specified by an artist.

Thonat et al. traverse a quadtree acceleration structure that bounds the displacement direction as well. Unfortunately it assumes that displacement is a single texture, not a shader, and that minmax mipmaps (the acceleration structure) can be generated for that texture. If the shader results were cached in texture space this might be viable. The origins of the approach to bound displacement with interval or affine arithmetic were actually for the purpose of tracing shaders, so it is possible. Unfortunately, the cost, even in the simple single texture case, puts this outside our needs.

That leaves us with rasterization and thus some form of tessellation.

Work within the current framework

Tessellation and displacement mapping were in mind with the original design of Nanite. The idea was that triangle clusters could be synthesized instead of streamed from disk when finer detail levels were requested. Once we had the core Nanite based on offline simplification working well and shipped, we could add on, with tessellation being the first obvious extension.

This is a really elegant approach to the problem. It means the same framework can be used to solve that problem as well. Tessellation and displacement mapping, in addition to other potential forms of geometry synthesis such as marching cubes or subd surfaces, could be implemented at basically no additional run-time cost. IO and transcoding would be traded for generation. The per frame cull and rasterization would be identical regardless of the source the triangles came from. All the work of generating that geometry is cached and reused across many frames. Going this route makes it conceivable that the goals from the Nanite dream could be achieved instead of this just adding cost.

As the design of Nanite was being realized, little by little it was becoming clear this idea wasn’t realistic. Every step of the Nanite build process was a bit more complicated than expected with extra details, constraints, or edge cases that weren’t obvious until we seriously implemented them and battle tested it all. Synthesizing tessellated clusters changes the process from simplification to amplification. Basically this amounts to running the Nanite build process in reverse. Now that the details of that process were better understood, doing it in reverse, in a random access sort of way where a complete level worth of clusters isn’t all available at once, in constrained memory, in the same budget that transcoding is currently costing, is not remotely straightforward and potentially not even possible.

I have ideas on how the process could be modified by maybe replacing graph cuts with something more spatial or precomputing portions but not all of the task. There are some important flaws though that were discovered that are inherent to the idea.

Both simplification and amplification

Tessellation to a uniform sampling resolution suitable for displacement mapping is not simply a matter of adding additional levels, like generating a level -1 past level 0 (0 being the original source triangles). Triangles from the source mesh larger than 1 pixel may not be flat anymore and need to be tessellated but at the same time tiny triangles may be far smaller than a pixel. Levels can’t simply be divided into exclusively simplification or amplification, both are needed simultaneously.

The base Nanite structure accounts for only the error against the base mesh, not the error against the displaced mesh. A simple way to solve that is the stored levels are tessellated at build time to a desired resolution matching that level’s error. Then runtime amplification picks up from there by generating more such levels. But that means the Nanite mesh needs to be built differently to support displacement mapping and at a significant cost. Far more triangles and vertices will need to be stored and rendered than otherwise needed compared to the normal Nanite mesh.

Adaptive tessellation

Another major flaw with this idea is that base Nanite adapts the triangle density to the content. This happens naturally with quadric based mesh simplification. Flatter areas can use fewer larger triangles to hit the same error. To achieve the same rendering performance and run-time memory overhead from generated triangle clusters the triangles would need to be just as efficiently placed. This simply isn’t possible. Even getting half way there with good content adaptive tessellation is incredibly challenging. So the reality is there is no chance generated triangle clusters will be the same cost to store and render as offline simplified ones, even if the cost to generate them is free. Far more of them will be needed to hit the same error due to less efficient use of triangles to approximate the surface.

Not only is it very difficult to adaptively place triangles to efficiently represent the underlying signal, unlike core Nanite, the signal is not known up front. Displacement comes from a user defined shader. It must be sampled. This presents a problem. The error, or the difference between the limit surface and the tessellated one, can’t be known exactly. The best we can do is consider the error to be the sampling rate. This is reasonable if the signal is band limited. Hopefully it is due to mipmaps but given it is user defined there is no guarantee of that.

True micropoly

With this new measurement for error it is clear that what we will be rendering are true micropolys. To hit 1 pixel error all triangles must be <=1 pixel wide. Many people mistakenly think this is what Nanite was already doing. Our software rasterizer is designed to be efficient for micropolys but that is not the LOD target. Nanite targets a LOD with 1 pixel of error not 1 pixel triangles. Some triangles need to be pixel sized to be within that error but most aren’t. Losing content adaptive thus means far more triangles for the same content so there is no chance this doesn’t cost more than an offline tessellated Nanite mesh.

That assumes all else being equal, as in the original Nanite rendering pipeline is still the fastest way to render this. Maybe not. Nanite’s LOD decisions are very coarse. It works on decently large cluster group granularity with conservative bounding volumes and jumps in power of 2 increments. Ignoring the spatial granularity and conservativeness, pow2 alone means on average the triangle count is at least 33% greater than the ideal. Could a different design make up for the increase in triangle count from losing content adaptive by hitting closer to the optimal number of uniformly sized triangles? Doing so means stepping outside the existing framework and dynamically tessellating patches every frame.

Reyes

The Reyes rendering architecture was the first to support displacement mapping and is designed around efficiently tessellating surfaces into micropolys. Therefore it is an obvious reference point for this problem.

In Reyes a primitive goes through the following pipeline:

Bound
Split
Dice
Shade
Rasterize

The bounding box for a primitive is computed. If it is off screen, cull it. If it is too large the primitive is split, usually in two, and the sub primitives are sent back to Bound. This will continue recursively until a primitive is small enough to dice. Dicing converts the primitive into a uniform grid of micropolys. The vertices of that grid are shaded and finally the micropolys are rasterized.
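The control flow of Bound, Split and Dice can be sketched as a simple work queue. The Python below is only a schematic of the loop described above, with Shade and Rasterize left out. The callbacks and the toy 1D primitives are hypothetical stand-ins, not part of any real renderer.

```python
from collections import deque

def reyes_split_dice(primitives, bound, on_screen, small_enough, split, dice):
    """Run Bound -> Split -> Dice over a work queue and return the diced grids."""
    queue = deque(primitives)
    grids = []
    while queue:
        prim = queue.popleft()
        box = bound(prim)
        if not on_screen(box):
            continue                      # off screen: cull it
        if small_enough(box):
            grids.append(dice(prim))      # small enough: dice into a grid
        else:
            queue.extend(split(prim))     # too large: sub primitives go back to Bound
    return grids

# Toy 1D usage: primitives are intervals, the "screen" is (0, 100),
# and an interval can be diced once it is at most 16 units long.
def split_in_two(interval):
    lo, hi = interval
    mid = 0.5 * (lo + hi)
    return [(lo, mid), (mid, hi)]

grids = reyes_split_dice(
    [(-50.0, 50.0)],
    bound=lambda interval: interval,
    on_screen=lambda box: box[1] > 0.0 and box[0] < 100.0,
    small_enough=lambda box: box[1] - box[0] <= 16.0,
    split=split_in_two,
    dice=lambda interval: interval,
)
```

In this toy run the off screen half of the interval is culled after the first split, and the visible half is split twice more before its four pieces are diced.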

Why have both split and dice? Recursive splitting allows visibility to be retested at a more uniform granularity. It also allows surfaces that cover a large depth range to tessellate at a varying density, better matching the view. This recursive splitting is actually very similar to Nanite’s cluster hierarchy traversal, both in the approach and the reasons for it. Why not only split? There are efficiencies that are important at the leaf level that motivate dicing being special.

Reyes has been used heavily in offline renderers and film production for decades. Even after the move away from Reyes to path tracing, some production path tracers (Manuka, PRMan, etc) still run the majority of this pipeline. Instead of rasterizing the micropolys they trace rays against them.

Real-time Reyes

Because of its success in film, real-time Reyes adapted for GPUs has long been a target with numerous research papers (Patney and Owens, RenderAnts, DiagSplit, FracSplit, Sattlecker and Steinberger, etc) dedicated to possible approaches. As far as I know Nanite Tessellation in Unreal Engine 5.4 is the first shipping real-time Reyes implementation and Fortnite is the first game it has shipped in (although only used on the ground).

While Nanite Tessellation retains every aspect of the high-level Reyes algorithm, there are many differences in the details, more than just in how it integrates with the base Nanite algorithm. Starting with the most basic: the primitives in our case are triangular patches. They start as triangles from a triangle mesh which are further split into triangular subpatches. This continues recursively until the subpatches are small enough to dice. Dicing uniformly tessellates the patch into microtris.

Shade only evaluates the displacement function at the diced triangle vertices. All other shading is done at pixel frequency in screen space. This is simply more efficient due to the amount of overshade that object space shading incurs. True preshading is a relic of the past. Modern production path tracers either shade on hit for everything not displacement as well or at most evaluate material shaders into BxDF lobes similar to a GBuffer and then shade with that on hit.

Nanite Tessellation

2026-02-01T20:27:00.007-06:00

Nanite Tessellation, aka Nanite Dynamic Tessellation, aka Nanite Dynamic Displacement was the next major feature I worked on after Nanite itself. Initial prototypes started back in 2020, only months after showing Nanite publicly for the first time. UE5 still hadn’t even been released to customers yet. After a couple years of shipping Nanite in things and then a couple of years more of development, Nanite Tessellation finally shipped in UE 5.4 in 2024. I started this write up shortly after but it has taken far longer to complete than I ever expected. Needless to say, publishing all this has been a long time coming. It will come in a series of posts starting with this one.

List of posts:

Intro
Possible approaches
How to tessellate
Nanite + Reyes
Variable sized work
Vertex deduplication / Post transform cache (coming soon)
VisBuffer / Deferred materials
Wrapping up

What is Nanite Tessellation?

A system for dynamically tessellating meshes and displacing them. The displacement comes from a shader graph, authored in UE’s material editor. This tessellation is in addition to what Nanite already provides.

Tessellation demoed at GDC in Marvel 1943: Rise of Hydra

Patches

Diced triangles

Final pixels

Why?

Why is geometry amplification needed when we have Nanite? I’ve argued in the past that amplification approaches to the virtualized geometry problem were not good enough, so why would I start working on it now? Have I changed my mind? No. My argument was that amplification approaches are not a general purpose solution to the virtualized geometry problem. They do not solve all cases. They can’t change the genus of a surface. A simplification approach would always be needed. If you have that it could also solve amplification in a way. A mesh could always be synthesized, subdivided, tessellated, and/or displaced offline. Then simplification can reduce it down. This ignores the data storage implications but it does show that it is the more general purpose solution to that problem.

So we have that now. It’s called Nanite and is pretty cool. It was the right thing to work on first. But just because this solution is general purpose enough that it can be used for these other cases does not mean it is ideal for them. Storing full topology of an irregular mesh covers all cases but is expensive. Storing every position on the surface as a full 3d point with a complete set of attributes is expensive. We do our best to compress that data but nothing beats not having that data at all.

Compression

Scalar displacement fields, whether they are artist authored maps or captured through projection of a detailed surface to a simpler one, are much less data. 1 value compared to 5+. Compression of regular 2D data, ie images, in relation to human perception is vastly more researched and well understood.

Comparison of disk size for high poly vs low poly with normal and displacement maps

Even better data compression than that are procedural texturing approaches. What do I mean by procedural texturing? I don’t just mean mathematical functions like Perlin noise. I might be pushing the definition a bit but even simple texture tiling in a way is a form of procedural texturing. But certainly once shaders are involved where multiple textures are mixed and modified we are in the realm of procedural. The simplest form of this is detail texturing. A much higher frequency signal can be represented than stored explicitly. Viewed statically like this, the compression ratio can be far higher than is achievable through any other means.

Authoring

But beyond data compression, procedural content generation can be an incredibly effective time saver for an artist. It also can be reusable, dynamic, and animatable. By reusable I mean that a base material type, like snow, can be authored once and applied to many surfaces. By dynamic I mean the same asset can accumulate snow over time by changing the shader parameters, all the way to full animation like a moving lava field flow.

Displacement maps are extremely common in film and the primary reason for their use is not data compression. A good bit of it is tooling and while I could say that should be improved and is someone else’s responsibility to keep up with Nanite’s capabilities, the fact of the matter is I can’t snap my fingers and change all the DCCs. Even if every application were all optimized to better work with high poly meshes there is always something inherently simpler with 2d textures, and displacement maps are no different.

Displacement’s use in film also helps animation. For a character the base cage can be rigged and deformed. The deformed cage can then be smoothly subdivided and displaced to get the final detail. This simplifies the rigger’s and animator’s concerns and separates them to an extent from the sculptor who might carve out individual dragon scales.

The last use case is specific to games. Scalability is an important consideration for Fortnite as well as other games that still need to support lower end platforms that aren’t powerful enough to run the Nanite pipeline. We can easily generate low poly fallback meshes through the same mesh simplification algorithm that Nanite uses, but what is fine for the distance isn’t necessarily good enough for up close. The art of low poly modelling is often a matter of abstraction of shape and artists are much more picky about the results. They will also move detail between domains, from mesh to texture, that requires involvement of other assets that is difficult or impossible to automatically do reliably. For these reasons, when a large scalability range is required, like with Fortnite, our art teams have been more comfortable authoring for low or mid in the scalability range and amplifying up rather than authoring for high and simplifying down.

Normal map filtering using vMF (part 3)

2018-05-12T14:07:00.000-05:00

$$ \newcommand{\vv}{\mathbf{v}} \newcommand{\rv}{\mathbf{r}} \newcommand{\muv}{\boldsymbol\mu} \newcommand{\omegav}{\boldsymbol\omega} \newcommand{\mudotv}{\muv\cdot\vv} $$ What can we use this for? One example of a place where distributions are summed up is in normal map and roughness filtering. Normal and roughness maps are textures describing the distribution of microfacet normals. The normal of the normal map is the mean of the distribution and the roughness describes the width of the distribution. We can fit our chosen NDF with a vMF by finding a mapping from roughness to sharpness $\lambda$.

This mapping for Beckmann is given by [1] as: \begin{equation} \lambda \approx \frac{2}{\alpha^2} \label{eq:roughness_to_lambda} \end{equation} and following my previous post about specular models we can use the $\alpha$ from any of those distributions in this equation for a reasonable approximation.

Once you have vMFs we can sum or filter them in $\rv$ form. Then we can turn it back to normal and roughness by inverting the function: \begin{equation} \alpha \approx \sqrt{\frac{2}{\lambda}} \label{eq:lambda_to_roughness} \end{equation} We must be careful with floating point precision and divide by zero though. Instead of calculating $\lambda$ we can instead calculate its reciprocal which avoids multiple places where a divide by nearly zero can happen.

// Convert normal and roughness to r form

float InvLambda = 0.5 * Alpha*Alpha;

float exp2L = exp( -2.0 / InvLambda );

float CothLambda = InvLambda > 0.1 ? (1 + exp2L) / (1 - exp2L) : 1;

float3 r = ( CothLambda - InvLambda ) * Normal;

// Filter in r form

// Convert back to normal and roughness

float r2 = clamp( dot(r,r), 1e-8, 1 );

InvLambda = rsqrt( r2 ) * ( 1 - r2 ) / ( 3 - r2 );

Alpha = sqrt( 2 * InvLambda );

Normal = normalize(r);
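For experimenting with these conversions outside a shader, here is a rough Python port of the snippet above. It is a sketch for checking the math, not a drop-in replacement: the function names are hypothetical, it skips the exp when it is not needed to avoid a Python divide by zero, and it assumes the filtered r is never the zero vector.

```python
import math

def to_r(normal, alpha):
    """Convert a unit normal and roughness to r form."""
    inv_lambda = 0.5 * alpha * alpha
    if inv_lambda > 0.1:
        exp2l = math.exp(-2.0 / inv_lambda)
        coth_lambda = (1.0 + exp2l) / (1.0 - exp2l)
    else:
        coth_lambda = 1.0   # coth is effectively 1 for sharp lobes
    scale = coth_lambda - inv_lambda
    return tuple(scale * n for n in normal)

def from_r(r):
    """Convert r form back to a unit normal and roughness."""
    length_sq = sum(c * c for c in r)
    r2 = min(max(length_sq, 1e-8), 1.0)
    inv_lambda = (1.0 - r2) / (3.0 - r2) / math.sqrt(r2)
    alpha = math.sqrt(2.0 * inv_lambda)
    length = math.sqrt(length_sq)   # assumption: r is not the zero vector
    return tuple(c / length for c in r), alpha

# Filter two normals tilted 30 degrees either side of +Z, both with alpha 0.3.
s, c = math.sin(math.radians(30.0)), math.cos(math.radians(30.0))
r_a = to_r((s, 0.0, c), 0.3)
r_b = to_r((-s, 0.0, c), 0.3)
filtered_normal, filtered_alpha = from_r(tuple(0.5 * (x + y) for x, y in zip(r_a, r_b)))
```

A single lobe round trips to nearly the same roughness, while averaging the two diverging normals shortens r and so comes back with a noticeably larger roughness.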

How does this compare to the common approaches? The first to do something like this was Toksvig [2] which follows similar logic with vector length corresponding with gloss and uses properties of Gaussian distributions but not SGs exactly. LEAN mapping [3] is based on Gaussians as well but planar distributions, not spherical. The approach I just explained should in theory work just as well with object space normals.

Even though it was part of the original approach the common way to use "Toksvig" filtering (including UE4's implementation) is to find the normal variance and increase the roughness by it. There is no influence from the roughness on the normals when doing that and there should be. The correct way will affect how the normals are filtered. A smooth normal should have more weight in the filter than a rough normal.

vMF has been used for this task before in [5] and later [6]. There is a major difference from our approach in that Frequency domain normal map filtering relies on convolving instead of averaging. It finds the vMF for the distribution of normals over the filter region. It then convolves the normal and roughness by that vMF. But what is a convolution?

Convolution

Graphics programmers know of convolutions like blurring. It sort of spreads data out right? What does it mean mathematically though? A convolution of one function by another creates a new function that is equivalent to the integral of the function being convolved multiplied by the convolving function translated to that point.

Think of a blur kernel with weights per tap. That kernel center is translated to the pixel in the image that we write the blur result to. Each tap of the kernel is a value from the blur function. We multiply those kernel values by the image that is being convolved. All of those samples are then added together. Now usually a blur doesn't have infinite support or every pixel of the image would need to be sampled but the only reason that doesn't need to happen is because the convolving function, ie the blur kernel, is zero past a certain distance from the center of the function. Otherwise the integral needs to cover the entire domain. In the 1D case that means from negative to positive infinity. In the case of a sphere that means over the entire surface of the sphere.

This symbolically looks like this for 1D: \begin{equation} (f * g) (x) = \int_{-\infty}^\infty f(t) g(x-t)\,dt \end{equation} We now have the definition but why would we want to convolve a function besides image blurring? A convolution of one function by another creates a new function that, when evaluated, equals the result of multiplying both functions together and integrating at that translated point. Think of this like precalculating the multiplication and integration of those functions for any translated point. The integral of the product is done ahead of time and now we can evaluate it for any translation.

This is exactly the use case for preconvolving environment maps by the reflected GGX distribution. GGX is the convolving function, the environment map is the function being convolved, the reflection vector direction used to sample the preconvolved environment map is the "translation". SGs are very simple to multiply and integrate as we have already seen so precomputing it often doesn't save much. Convolving does have its uses though so let's see how to do it.

Convolving SGs

The convolution of two SGs is not closed in the SG basis, meaning it does not result in exactly a SG. Fortunately it can be well approximated by one. [7] gave a simple approximation that is fairly accurate so long as the lobe sharpnesses aren't very low: \begin{equation} \begin{aligned} \left(G_1 * G_2\right) \left( \vv \right) &= \int_{S^2} G_1(\omegav) G_2\left( \omegav; \vv, \lambda_2, a_2 \right) \,d\omegav \\ &\approx G \left( \vv; \muv_1, \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2}, 2\pi\frac{a_1 a_2}{\lambda_1 + \lambda_2} \right) \end{aligned} \label{eq:convolve_sg} \end{equation} The first line of the equation above may shed more light on how we can use this if it isn't clear already. This is identical to the inner product but with $\muv_2$ replaced with a free parameter.

For the case of normal map filtering we don't care about amplitude. We want a normalized SG. That means for this case the only part that matters is the convolved $\lambda'$: \begin{equation} \lambda' = \frac{\lambda_1 \lambda_2}{\lambda_1 + \lambda_2} \end{equation} We can ignore the rest of eq \eqref{eq:convolve_sg} for the moment. If we replace $\lambda$ everywhere with $\alpha$ using eq \eqref{eq:lambda_to_roughness} we get a nice simple equation: \begin{equation} \alpha' = \sqrt{ \alpha_1^2 + \alpha_2^2 } \end{equation} Leave $\lambda$ in for one of them and we get: \begin{equation} \alpha' = \sqrt{ \alpha^2 + \frac{2}{\lambda} } \label{eq:alpha_prime} \end{equation} which looks just like what [6] used except for the factor of 2. I believe this is a mistake in their course notes. In equation (37) of their notes they have it as 1/2 and in the code sample it is 1. I think the confusion comes from the Frequency Domain Normal Map Filtering paper working with Torrance-Sparrow and not Cook-Torrance, and $\sigma \neq \alpha$. Overall their factor means less roughness from normal variance. In my tests, using eq \eqref{eq:alpha_prime} that we just derived looks closer to Toksvig results. Otherwise the range is off and less rough. MJP uses the same $2/\alpha^2$ for SG sharpness in his blog post so we don't disagree there.

As a gut check if $\alpha=0$ and all final roughness comes from normal variation then $\alpha'=\sqrt{\frac{2}{\lambda}}$ which is what we established in eq \eqref{eq:lambda_to_roughness}. If there is no normal variation then this equation explodes but if you calculate InvLambda like I did in the code snippet the second term becomes zero and $\alpha'=\alpha$ which is what we want.
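
To make the gut check concrete, here is a minimal Python sketch of eq \eqref{eq:alpha_prime}. It is not the original shader code. It assumes the vMF is fit to the averaged normal with the approximation $\lambda = \|\rv\| \frac{3 - \|\rv\|^2}{1 - \|\rv\|^2}$ from part 2. The function names, the clamp to 1, and the handling of a zero length normal are my own assumptions.

```python
import math


def inv_lambda_from_normal(avg_normal):
    """Return 1/lambda for the vMF fit to an averaged, unnormalized normal."""
    r2 = min(sum(c * c for c in avg_normal), 1.0)
    if r2 == 0.0:
        return math.inf  # normals cancel completely: no preferred direction
    r = math.sqrt(r2)
    # Reciprocal of lambda = r (3 - r^2) / (1 - r^2). It is exactly zero
    # when r = 1, so the no-variation case never divides by zero.
    return (1.0 - r2) / (r * (3.0 - r2))


def filtered_alpha(alpha, avg_normal):
    """alpha' = sqrt(alpha^2 + 2 / lambda), clamped to 1 (an assumption)."""
    inv_lambda = inv_lambda_from_normal(avg_normal)
    return min(1.0, math.sqrt(alpha * alpha + 2.0 * inv_lambda))
```

With a unit length averaged normal, `filtered_alpha(0.3, (0, 0, 1))` returns the input roughness. Any shortening of the averaged normal makes the result rougher.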

Next up, converting from SH to SG (coming soon).

References

[1] Wang et al. 2007, "All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance"
[2] Toksvig 2004, "Mipmapping Normal Maps"
[3] Olano et al. 2010, "LEAN Mapping"
[4] Hill 2011, "Specular Showdown in the Wild West"
[5] Han et al. 2007, "Frequency Domain Normal Map Filtering"
[6] Neubelt et al. 2013, "Crafting a Next-Gen Material Pipeline for The Order: 1886"
[7] Iwasaki 2012, "Interactive Bi-scale Editing of Highly Glossy Materials"

von Mises-Fisher (part 2)

2018-05-12T12:44:00.000-05:00

$$ \newcommand{\vv}{\mathbf{v}} \newcommand{\rv}{\mathbf{r}} \newcommand{\muv}{\boldsymbol\mu} \newcommand{\mudotv}{\muv\cdot\vv} $$ A normalized SG has the same equation as the probability distribution function for a von Mises-Fisher (vMF) distribution on the 3 dimensional sphere. This affords us a few more tools and applications to work with. A vMF distribution can be defined for any dimension. I'll focus on 3D here because it is the most widely usable for computer graphics and simplifies discussion. Because a vMF does not have a free amplitude parameter it is written as: \begin{equation} \begin{aligned} V(\vv;\muv,\lambda) = \frac{\lambda}{ 2\pi \left( 1 - e^{-2 \lambda} \right) } e^{\lambda(\mudotv - 1)} \end{aligned} \label{eq:vmf} \end{equation} The more common form you will likely see in literature is this: \begin{equation} \begin{aligned} V(\vv;\muv,\lambda) = \frac{\lambda}{ 4\pi \sinh(\lambda) } e^{\lambda(\mudotv)} \end{aligned} \label{eq:vmf_sinh} \end{equation} which is equivalent due to the identity \begin{equation} \begin{aligned} \sinh(x) = \frac{ 1 - e^{-2x} }{ 2e^{-x} } \end{aligned} \label{eq:sinh_identity} \end{equation} The form in eq \eqref{eq:vmf} is more numerically stable so should be used in practice as explained by [2].

Compare the equation for a vMF to the equation for a SG and it is easy to see that: \begin{equation} \begin{aligned} V(\vv;\muv,\lambda) = G\left( \vv; \muv, \lambda, \frac{\lambda}{ 2\pi \left( 1 - e^{-2 \lambda} \right) } \right) \end{aligned} \label{eq:vmf_to_sg} \end{equation} That means a vMF is equivalent to a normalized SG and by moving terms from one side to the other we can show that a SG is equivalent to a scaled vMF. \begin{equation} \begin{aligned} G\left( \vv; \muv, \lambda, a \right) = \frac{2\pi a}{\lambda} \left( 1 - e^{-2 \lambda} \right) V(\vv;\muv,\lambda) \end{aligned} \label{eq:sg_to_vmf} \end{equation}
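
The relationships above are easy to check numerically. This is a small Python sketch with function names of my own choosing. Each function takes the cosine $\mudotv$ directly instead of the two vectors.

```python
import math


def sg_eval(cos_angle, lam, a):
    """G(v; mu, lam, a) with cos_angle = dot(mu, v)."""
    return a * math.exp(lam * (cos_angle - 1.0))


def vmf_pdf(cos_angle, lam):
    """Stable form of the vMF density."""
    norm = lam / (2.0 * math.pi * (1.0 - math.exp(-2.0 * lam)))
    return norm * math.exp(lam * (cos_angle - 1.0))


def vmf_pdf_sinh(cos_angle, lam):
    """Textbook form. exp and sinh overflow for large lam."""
    return lam / (4.0 * math.pi * math.sinh(lam)) * math.exp(lam * cos_angle)


def sg_to_vmf_scale(lam, a):
    """Scale s such that G(v; mu, lam, a) = s * V(v; mu, lam)."""
    return 2.0 * math.pi * a / lam * (1.0 - math.exp(-2.0 * lam))
```

Both vMF forms agree for moderate sharpness. Only the stable form can be evaluated at something like $\lambda = 1000$, where `math.sinh` raises an overflow error in Python.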

Fitting a vMF distribution to data

Fitting a vMF distribution to directions or points on a sphere is a very similar process to fitting a normal distribution to points on a line. In the case of a normal distribution, one calculates the mean and variance of the data set and then chooses a normal distribution with the same mean and variance as the best fit to the data.

For the vMF distribution the mean direction and spherical variance are used. Calculating these properties for a set of directions is simple. \begin{equation} \begin{aligned} \rv = \frac{1}{n}\sum_{i=1}^{n} \textbf{x}_i \end{aligned} \label{eq:r_average} \end{equation} where $\textbf{x}_1, \textbf{x}_2, ..., \textbf{x}_n$ are a set of unit vectors.

Often values are associated with these directions. So instead of taking a simple average we can take a weighted average. \begin{equation} \begin{aligned} \rv = \frac{\sum_{i=1}^{n} \textbf{x}_i w_i}{\sum_{i=1}^{n} w_i} \end{aligned} \label{eq:r_weighted_average} \end{equation} We have the two properties, the mean direction $\muv = \frac{\rv}{\|\rv\|}$ and the spherical variance $\sigma^2 = 1 - \|\rv\|$. To fit a vMF distribution to the data we need to know what these properties are for the vMF distribution. Since the vMF distribution is unimodal, circularly symmetric about its axis, and is max in the direction of $\muv$, it is fairly obvious that the mean direction will be $\muv$ so I won't derive that here.

The spherical variance $\sigma^2$ on the other hand is a bit more involved. Because we already know the direction of $\rv$ is $\muv$ we can simplify this calculation to the integral of the projection of the function onto $\muv$. \begin{equation} \begin{split} \|\rv\| &= \int_{S^2} V(\vv;\muv,\lambda) (\mudotv) d\vv \\ &= \frac{\lambda}{ 4\pi \sinh(\lambda) } \int_{S^2} e^{\lambda(\mudotv)} (\mudotv) d\vv \\ \end{split} \end{equation} Because the integral over the sphere is rotation-invariant we will replace $\muv$ with the x-axis. \begin{equation} \begin{split} &= \frac{\lambda}{ 4\pi \sinh(\lambda) } \int_{0}^{2 \pi} \int_{0}^{\pi} e^{\lambda\cos\theta} \cos\theta\sin\theta d\theta d\phi \\ &= \frac{\lambda}{ 4\pi \sinh(\lambda) } 2 \pi \int_{0}^{\pi} e^{\lambda\cos\theta} \cos\theta\sin\theta d\theta \\ \end{split} \end{equation} Substituting $t=-\cos\theta$ and $dt=\sin\theta d\theta$ \begin{equation} \begin{split} &= \frac{\lambda}{ 2 \sinh(\lambda) } \int_{-1}^{1} -t e^{-\lambda t} dt \\ &= \frac{\lambda}{ 2 \sinh(\lambda) } \left( \frac{ 2 \lambda \cosh(\lambda) - 2 \sinh(\lambda) }{ \lambda^2 } \right) \\ &= \frac{\cosh(\lambda)}{ \sinh(\lambda) } - \frac{\sinh(\lambda)}{ \lambda \sinh(\lambda) } \\ \end{split} \end{equation} Arriving at its final form \begin{equation} \|\rv\| = \coth(\lambda)-\frac{1}{\lambda} \label{eq:r_length} \end{equation} Although simple in form, this function unfortunately has no closed form inverse. [1] provides an approximation which is close enough for our purposes. \begin{equation} \begin{aligned} \lambda &= \|\rv\| \frac{ 3 - \|\rv\|^2}{1 - \|\rv\|^2} \end{aligned} \end{equation} Now that we have a way to calculate the mean and spherical variance for a data set and we know the corresponding vMF mean and spherical variance, we can fit a vMF to the data set.

Using eq \eqref{eq:r_weighted_average} to calculate $\rv$, the vMF fit to that data is \begin{equation} V\left( \vv; \frac{\rv}{\|\rv\|},\|\rv\| \frac{ 3 - \|\rv\|^2}{1 - \|\rv\|^2} \right) \label{eq:r_to_vmf} \end{equation}
Going the other direction from $V(\vv;\muv,\lambda)$ form to $\rv$ form using eq \eqref{eq:r_length} is this: \begin{equation} \rv = \left( \coth(\lambda)-\frac{1}{\lambda} \right) \muv \label{eq:vmf_to_r} \end{equation}
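
Both directions can be sketched in a few lines of Python. The function names are mine, and the sketch assumes $0 < \|\rv\| < 1$ so neither division blows up. Note the round trip is only approximate because the fit uses the approximation from [1].

```python
import math


def fit_vmf(directions, weights=None):
    """Fit (mu, lam) to unit vectors. Assumes 0 < |r| < 1."""
    if weights is None:
        weights = [1.0] * len(directions)
    total = sum(weights)
    # Weighted average of the directions, component by component.
    r = [sum(w * d[i] for d, w in zip(directions, weights)) / total
         for i in range(3)]
    r_len = math.sqrt(sum(c * c for c in r))
    mu = [c / r_len for c in r]
    lam = r_len * (3.0 - r_len * r_len) / (1.0 - r_len * r_len)
    return mu, lam


def vmf_mean_length(lam):
    """|r| of a vMF: coth(lam) - 1 / lam."""
    return 1.0 / math.tanh(lam) - 1.0 / lam
```

For four directions tilted symmetrically around the z axis with an average length of 0.9, the fit gives $\lambda \approx 10.37$. Converting that back gives a length of about 0.904, which shows the size of the approximation error.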

Addition of SGs

We now have a way to convert to and from $\rv$ form. $\rv$ is linearly filterable as shown in how it was originally defined in eq \eqref{eq:r_weighted_average}. This means if our vMF functions are representing a spherical distribution of something then a weighted sum of those distributions can be approximately fit by another vMF. In other words we can approximate the resulting distribution by converting to $\rv$ form, filtering, and then converting back to traditional $V(\vv;\muv,\lambda)$ form.

By using the weighted average eq \eqref{eq:r_weighted_average} we can apply this concept to non normalized SGs too. This allows us to not just filter (ie sum with a total weight of 1) but add as well. A non-normalized SG as shown in eq \eqref{eq:sg_to_vmf} is a scaled vMF. We can use this scale as the weight when summing and use the total weight as the final scale for the summed SG.

This is the $\rv$ form for $G(\vv;\muv,\lambda, a)$. It includes an additional weight value you can think of like the energy this SG is adding to the sum: \begin{equation} \begin{aligned} \rv_i &= \left( \coth(\lambda_i)-\frac{1}{\lambda_i} \right) \muv_i \\ w_i &= \frac{2\pi a_i}{\lambda_i} \left( 1 - e^{-2 \lambda_i} \right) \\ \end{aligned} \end{equation} This weight is of course used in the weighted sum \begin{equation} \begin{aligned} \rv &= \frac{\sum_{i=1}^{n} \rv_i w_i}{\sum_{i=1}^{n} w_i} \\ w &= \sum_{i=1}^{n} w_i \\ \end{aligned} \end{equation} Using eq \eqref{eq:r_to_vmf} and eq \eqref{eq:vmf_to_sg} we can convert back to a scaled vMF and finally to a SG in $G(\vv;\muv,\lambda, a)$ form: \begin{equation} \begin{aligned} G\left( \vv; \muv, \lambda, a \right) &= w V(\vv;\muv,\lambda) \\ &= G\left( \vv; \muv, \lambda, w \frac{\lambda}{ 2\pi \left( 1 - e^{-2 \lambda} \right) } \right) \end{aligned} \end{equation} While addition and filtering are approximate they can be useful. The accuracy of the result is very dependent on the angle between the $\mu$ vectors or lobe axes. Adding sharp lobes pointed in different directions will result in a single wide lobe.
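
Putting the steps above together, here is a Python sketch of SG addition. The names `sg_to_r` and `add_sgs` are made up, SGs are passed as `(mu, lam, a)` tuples, and the sketch assumes the summed $\rv$ has a length strictly between 0 and 1.

```python
import math


def sg_to_r(mu, lam, a):
    """Return (r, w): mean vector and weight (integral) of an SG."""
    scale = 1.0 / math.tanh(lam) - 1.0 / lam
    w = 2.0 * math.pi * a / lam * (1.0 - math.exp(-2.0 * lam))
    return [scale * c for c in mu], w


def add_sgs(sgs):
    """Approximate a sum of SGs (mu, lam, a) by one SG."""
    total_w = 0.0
    acc = [0.0, 0.0, 0.0]
    for mu, lam, a in sgs:
        r_i, w_i = sg_to_r(mu, lam, a)
        total_w += w_i
        for k in range(3):
            acc[k] += w_i * r_i[k]
    r = [c / total_w for c in acc]
    r_len = math.sqrt(sum(c * c for c in r))
    mu = [c / r_len for c in r]
    # Approximate inverse of |r| = coth(lam) - 1 / lam.
    lam = r_len * (3.0 - r_len * r_len) / (1.0 - r_len * r_len)
    # Scale the fitted vMF by the total weight to get the SG amplitude.
    a = total_w * lam / (2.0 * math.pi * (1.0 - math.exp(-2.0 * lam)))
    return mu, lam, a
```

By construction the integral of the result equals the sum of the input integrals. The sharpness is only approximate: adding two identical SGs with $\lambda = 10$ comes back with $\lambda \approx 10.37$.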

Next, what we can use this for:
Normal map filtering using vMF

References

[1] Banerjee et al. 2005, "Clustering on the Unit Hypersphere using von Mises-Fisher Distributions"
[2] Jakob 2012, "Numerically stable sampling of the von Mises Fisher distribution on S2 (and other tricks)"

Spherical Gaussians (part 1)

2018-05-12T12:42:00.000-05:00

$$ \newcommand{\vv}{\mathbf{v}} \newcommand{\rv}{\mathbf{r}} \newcommand{\muv}{\boldsymbol\mu} \newcommand{\mudotv}{\muv\cdot\vv} $$ A Spherical Gaussian (SG) is a function of unit vector $\vv$ and is defined as \begin{equation} G(\vv;\muv,\lambda, a) = a e^{\lambda(\mudotv - 1)} \end{equation} where unit vector $\muv$, scalar $\lambda$, and scalar $a$ represent the lobe axis, lobe sharpness, and lobe amplitude of the SG, respectively.

The formula can be read as evaluating a SG in the direction of $\vv$ where the SG has the parameters of $\muv,\lambda, a$. An abbreviated notation $G(\vv)$ can be used instead when the parameters can be assumed. Often the more verbose notation is used to assign values to the parameters.

SGs have a number of nice properties including simple equations for a number of common operations.

Product of Two SGs

The product of two SGs can be represented exactly as another SG. This product is sometimes referred to as the vector product. This formula was first properly given in [1] (it was shown earlier but in a non-normalized form).

Let $\lambda_m = \lambda_1 + \lambda_2$ and let $\muv_m = \frac{\lambda_1\muv_1 + \lambda_2\muv_2}{\lambda_1 + \lambda_2}$, then \begin{equation} \begin{split} G_1(\vv)G_2(\vv) = G\left(\vv; \frac{\muv_m}{\|\muv_m\|}, \lambda_m\|\muv_m\|, a_1 a_2 e^{\lambda_m\left(\|\muv_m\| - 1\right)}\right) \end{split} \label{eq:sg_product} \end{equation}
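
Since the product is exact it can be verified pointwise. Below is a Python sketch with SGs passed as `(mu, lam, a)` tuples. The helper names are my own.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sg_eval(v, mu, lam, a):
    return a * math.exp(lam * (dot(mu, v) - 1.0))


def sg_product(sg1, sg2):
    """Exact product of two SGs given as (mu, lam, a) tuples."""
    (mu1, lam1, a1), (mu2, lam2, a2) = sg1, sg2
    lam_m = lam1 + lam2
    mu_m = [(lam1 * x + lam2 * y) / lam_m for x, y in zip(mu1, mu2)]
    # Zero only for opposing lobes of equal sharpness, which is not handled.
    len_m = math.sqrt(dot(mu_m, mu_m))
    mu = [c / len_m for c in mu_m]
    return mu, lam_m * len_m, a1 * a2 * math.exp(lam_m * (len_m - 1.0))
```

Evaluating `sg_eval(v, *sg_product(g1, g2))` matches `sg_eval(v, *g1) * sg_eval(v, *g2)` to floating point precision for any unit vector `v`.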

Raising to a power

Given that the product of two SGs is another SG it shouldn't be much of a surprise that a SG raised to a power can be expressed exactly as another SG: \begin{equation} \begin{aligned} G(\vv)^n &= G(\vv; \muv,n\lambda, a^n) \end{aligned} \label{eq:sg_power} \end{equation}

Integration Over The Sphere

The integral of a SG over the sphere has a closed form solution.

[2] showed that the integral was: \begin{equation} \int_{S^2}G(\vv) d\vv = 2 \pi \frac{a}{\lambda} \left( 1 - e^{-2\lambda} \right) \label{eq:sg_integral} \end{equation}

Inner product

The inner product is defined as the integral over the sphere of the product of two SGs. We can already find the product of two SGs and integrate over a sphere. Putting those together, with $\lambda_m$ and $\muv_m$ as defined for the product, we have: \begin{equation} \int_{S^2}G_1(\vv) G_2(\vv) d\vv = \frac{4 \pi a_1 a_2}{e^{\lambda_m}} \frac{ \sinh\left( \lambda_m\|\muv_m\| \right) }{ \lambda_m\|\muv_m\| } \label{eq:sg_inner_product_sinh} \end{equation} This equation has numerical precision issues when evaluated with floating point arithmetic. An alternative form which is more stable is the following: \begin{equation} \int_{S^2}G_1(\vv) G_2(\vv) d\vv = 2 \pi a_1 a_2 \frac{ e^{ \lambda_m\|\muv_m\| - \lambda_m } - e^{ -\lambda_m\|\muv_m\| - \lambda_m } }{ \lambda_m\|\muv_m\| } \label{eq:sg_inner_product_exp} \end{equation}
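
Here is a Python sketch of the integral and of the stable inner product form. The function names are my own. It works with the length of the unnormalized sum $\lambda_1\muv_1 + \lambda_2\muv_2$, which equals $\lambda_m\|\muv_m\|$ using the definitions from the product section. The case where that length is zero is not handled.

```python
import math


def sg_integral(lam, a):
    """Integral of an SG over the sphere."""
    return 2.0 * math.pi * a / lam * (1.0 - math.exp(-2.0 * lam))


def sg_inner_product(mu1, lam1, a1, mu2, lam2, a2):
    """Integral over the sphere of the product of two SGs (stable form)."""
    lam_m = lam1 + lam2
    um = [lam1 * x + lam2 * y for x, y in zip(mu1, mu2)]
    x = math.sqrt(sum(c * c for c in um))  # equals lam_m * |mu_m|
    # Both exponents are <= 0 because x <= lam_m, so nothing overflows.
    return 2.0 * math.pi * a1 * a2 * (math.exp(x - lam_m) - math.exp(-x - lam_m)) / x
```

When both lobes share an axis this reduces to `sg_integral(lam1 + lam2, a1 * a2)`, which is a handy sanity check.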

Normalization

Although there are other definitions for normalization I use the term to mean having an integral over the sphere equal to 1. Normalizing a SG is a simple matter of dividing it by its integral over the sphere. \begin{equation} \begin{aligned} \frac{ G(\vv) }{ \int_{S^2}G(\vv) d\vv } = G\left( \vv; \muv, \lambda, \frac{\lambda}{ 2\pi \left( 1 - e^{-2 \lambda} \right) } \right) \end{aligned} \label{eq:sg_normalized} \end{equation} Notice that the original $a$ parameter canceled out. Instead lobe amplitude is derived purely from the lobe sharpness $\lambda$.

These are all the common operations that have closed form solutions. So far nothing new here but hopefully it is helpful to have all these equations in a centralized place for reference. I didn't include derivations for any of these formulas. If readers think that would be useful to see maybe those could be added at a later date.

Special thanks to David Neubelt. Although this has been heavily modified from what we previously had I'm sure his touch is still present.

Now on to some less well covered concepts.
von Mises-Fisher (part 2)

References

[1] Wang et al. 2007, "All-Frequency Rendering of Dynamic, Spatially-Varying Reflectance"
[2] Tsai et al. 2006, "All-Frequency Precomputed Radiance Transfer using Spherical Radial Basis Functions and Clustered Tensor Approximation"

Spherical Gaussian series

2018-05-12T12:41:00.000-05:00

Intro and backstory

About 4 years ago now I ran into Spherical Gaussian (SG) math in a few different publications in a row, enough that it triggered the pattern detection in my brain. All were using SGs to approximate specular lobes. I remember feeling very similar 10 years prior when Spherical Harmonics were starting to become all the rage. Back with SH I noticed it fairly early, primarily from Tom Forsyth's slides on the topic and took the time to dig into the math and make sure I had this new useful thing in my toolbox. Doing so has proven to be well worth the time. I decided that I should do the same again and learn SGs and related math, in particular to build up a toolbox of operations I can do with them. Maybe it would prove as useful as SH has.

In the years since I'd say it has certainly been worth the effort. I don't think I can say it has proven as useful as SH has been to computer graphics but it was still worth learning. I intended to write up what I had found and share the toolbox of equations compiled in a centralized place at the very least. Unfortunately laziness and procrastination got in the way.

Also scope creep. About 3 years ago I mentioned to David Neubelt that I intended to write this up and he too had done a lot of work with SGs at Ready at Dawn so we decided we'd collaborate and write a joint paper and submit it to JCGT. The intention was to make something similar to Stupid Spherical Harmonics (SH) Tricks but for SG. That scope and seriousness is much larger than a simple single author blog post. We worked on derivations to all the formulas, wanted to have quality solutions to cube map fitting, multiple use cases proven in production, and a ton of other things to make it an exhaustive, professional, and ultimately great paper. This was much more than I ever intended as a simple blog post. I still think that paper we had in mind would be great to exist but the actual end result is it bloated the expectation of what either of us had the bandwidth or maybe the attention span to complete and after a couple of months of work the unfinished paper regretfully stagnated.

A year went by and eventually MJP of Ready at Dawn did in fact do a blog series write up on Spherical Gaussians. It is excellent and I suggest you read it before continuing if you haven't already. There are still some things I intended to cover that he did not as well as things he did that I never have done nor planned to cover so I think these should complement each other well.

It is a royal shame I have not posted this in the 3+ years since I intended to. I had even promised folks publicly a write up was coming and then didn't deliver. I haven't posted anything on this blog since then in fact. I hope to do better and hopefully finally getting this out will unclog the pipes.

Without further ado,

Part 1 - Spherical Gaussians
Part 2 - von Mises-Fisher
Part 3 - Normal map filtering using vMF
Part 4 - Coming soon...

UE4 available to all

2014-03-29T19:50:00.000-05:00

The big news from Epic at GDC this year was that Unreal Engine 4 is now available to everyone for only $19/month + 5% royalties.

There's been a lot said already about why this is cool, opinions, comparisons to competitors and so forth. From a business model point of view I feel it has been covered better than I could possibly say. If you are interested check out Mark Rein's post or hear it from the man himself Tim Sweeney for very compelling reasons why this is a good idea for us to provide and for developers to subscribe.

I wanted to give my own personal perspective as an engineer. What Epic just did is absolutely revolutionary and I am privileged to be part of it. I am super excited! Let me explain why.

If you've followed this blog for a while you may remember I gave Epic huge props for releasing UDK back in 2009. That was great for artists and designers, not that great for programmers. This is the next step down that path and it's a doozy.

Now there have been naysayers that compare the numbers 19/month + 5%, is that better or worse, is it new or been done before, and so on. They are missing the point. It isn't the price, it isn't even the features. The revolution is this: for 1/3 the cost of a new video game you can get access to the complete source code for our cutting edge game engine. The entire engine source code, every bit of it, the exact same we use in house and the same provided to private licensees, is available to you for only $19 (except for XB1 and PS4 code we aren't legally able to give you due to NDAs). This is the same tech that will be powering many of the top AAA games developed this generation. As Mark put it "Right now 7 of the top 21 (!) all-time highest rated Xbox 360 games (by Metacritic score) were powered by Unreal Engine 3" and you damn well better bet we are going to do the same or better this time around with UE4.

Now you may say John Carmack and id have open sourced their engines many times. That is true and I commend them for it. John has had a lasting impact on how many coders, myself included, write software due to these efforts. Unfortunately, the strict licensing terms have resulted in basically no commercial products using the id engines from that open source license. John has even lamented this recently "It is self-serving at this point, but looking back, I wish I could have released all the Id code under a more permissive license. GPL never did really do anything for game code, and I do wonder whether it was a fundamental cultural incompatibility. GPL was probably the best that could have flown politically for the code releases -- posterity without copy-paste into competition." And my own personal twist with this situation is after Zenimax bought id, their engines were no longer commercially licensable meaning anything built on top of that code base had to be scrapped. GPL is simply not an option for commercial games.

There are other examples of open sourced game engines or cheaply priced engine source code but they all have something in common which is they were never up to date. The id engines were only open sourced years after the game that used them shipped. Id had already moved onto their next technology. For example the engine that powered Doom 3 wasn't open sourced until 6 years after the game shipped. By that point Rage had even shipped. The reason for this is obvious. The point where releasing the code is non-threatening to execs is the point where a competitor can't use the code to their advantage. This means every single engine code base available to indie devs or students was either not commercially developed or purposely not competitive in its capabilities.

With the UE4 subscription, you can today get complete source that is totally up to date and will continue to be. This isn't any old engine, it's the same cutting edge technology that will be powering many of the top AAA games of this generation. And as soon as we add new features you'll get them. In fact you'll even see the work in progress before they're done. How crazy is that?

Look, it's easy to brush this off as the guy who works for Epic telling me I should give him money. That isn't why I'm writing this. What I am most excited about by far is what I can share and give back to the gamedev community. That is the reason I have this blog, it's the reason I have presented at conferences and will again in the future.

If you are a teacher, check out the license terms, they are ridiculously nice for schools. If you are a student, tell your school about it or grab a personal subscription.

For all those wanting to get into the game industry, the same response you will hear from everyone is make something. No one lands a game job because of their great grades or fancy degrees. They get it by making something cool. I wrote my own engine. That's how I landed my first job. Although I learned a ton, if I were in that place today I would modify UE4. There is literally nothing else more applicable to an engine programming position than proving you can do the work. Hell, make something cool enough we'll hire you!

I can't wait to see what people make with this. It's a great time to be a programmer!

Tone mapping

2013-12-15T15:30:00.000-06:00

When working with HDR values, two troublesome situations often arise.

The first happens when one tries to encode an HDR color using an encoding that has a limited range, for instance RGBM. Values outside the range still need to be handled gracefully, ie not clipped.

The second happens when an HDR signal is under sampled. One very bright sample can completely dominate the result. In path tracing these are commonly called fireflies.

In both cases the obvious solution is to reduce the range. This sounds exactly like tonemapping so break out those tone mapping operators, right? Well yes and no. Common tone mapping operators work on color channels individually. This has the downside of desaturating the colors which can look really bad if later operations attenuate the values, for instance reflections, glare, or DOF.

Instead I use a function that modifies only the luminance of the color. The simplest of which is this:

$$ T(color) = \frac{color}{ 1 + \frac{luma}{range} } $$
Where $T$ is the tone mapping function, $color$ is the color to be tone mapped, $luma$ is the luminance of $color$, and $range$ is the range that I wish to tone map into. If the encoding must fit RGB individually in range then $luma$ is the max RGB component.

Inverting this operation is just as easy. $$ T_{inverse}(color) = \frac{color}{ 1 - \frac{luma}{range} } $$
This operation, when used to reduce fireflies, can also be thought of as a weighting function for each sample: $$ weight = \frac{1}{ 1 + luma } $$
For a weighted average, multiply each sample by its weight, sum them, and divide by the summed weights. The result will be the same as if the samples were tone mapped using $T$ with $range$ of 1, averaged, then inverse tone mapped using $T_{inverse}$.
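
That equivalence is easy to confirm with a few lines of Python. This is only a sketch: the function names are mine and the Rec. 709 luminance weights are an assumption, since any linear luma works.

```python
def luma(c):
    """Rec. 709 luminance weights (an assumption, any linear luma works)."""
    return 0.2126 * c[0] + 0.7152 * c[1] + 0.0722 * c[2]


def tonemap(c, rng=1.0):
    scale = 1.0 / (1.0 + luma(c) / rng)
    return [x * scale for x in c]


def tonemap_inverse(c, rng=1.0):
    scale = 1.0 / (1.0 - luma(c) / rng)
    return [x * scale for x in c]


def weighted_average(samples):
    """Sum of weight * sample divided by the sum of weights."""
    total_w = 0.0
    acc = [0.0, 0.0, 0.0]
    for c in samples:
        w = 1.0 / (1.0 + luma(c))
        total_w += w
        acc = [s + w * x for s, x in zip(acc, c)]
    return [s / total_w for s in acc]


def tonemapped_average(samples):
    """Tone map, average, then inverse tone map (range of 1)."""
    mapped = [tonemap(c) for c in samples]
    mean = [sum(ch) / len(samples) for ch in zip(*mapped)]
    return tonemap_inverse(mean)
```

Both functions return the same color for the same samples, and a single very bright sample pulls the result up far less than it would in a plain average.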

If a more expensive function is acceptable then keeping more of the color range linear is best. To do this use the functions below where 0 to $a$ is linear and $a$ to $b$ is tone mapped. $$ T(color) = \left\{ \begin{array}{l l} color & \quad \text{if $luma \leq a$}\\ \frac{color}{luma} \left( \frac{ a^2 - b*luma }{ 2a - b - luma } \right) & \quad \text{if $luma \gt a$} \end{array} \right. $$ $$ T_{inverse}(color) = \left\{ \begin{array}{l l} color & \quad \text{if $luma \leq a$}\\ \frac{color}{luma} \left( \frac{ a^2 - ( 2a - b )luma }{ b - luma } \right) & \quad \text{if $luma \gt a$} \end{array} \right. $$
These are the same as the first two functions if $a=0$ and $b=range$.
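
A Python sketch of the piecewise version follows. The function names are made up, it assumes $a < b$, and the color wrapper uses the max RGB component as luma, which is the choice described above for encodings that must fit each channel in range.

```python
def tonemap_luma(l, a, b):
    """Linear up to a, then a smooth shoulder that approaches b. Needs a < b."""
    if l <= a:
        return l
    return (a * a - b * l) / (2.0 * a - b - l)


def tonemap_luma_inverse(l, a, b):
    """Inverse of tonemap_luma for tone mapped luma l < b."""
    if l <= a:
        return l
    return (a * a - (2.0 * a - b) * l) / (b - l)


def tonemap(color, a, b):
    """Scale the color so its luma (here the max component) is tone mapped."""
    l = max(color)
    if l <= a:
        return list(color)
    scale = tonemap_luma(l, a, b) / l
    return [x * scale for x in color]
```

Because the whole color is multiplied by one scale factor, the ratios between channels are preserved, which is what avoids the desaturation of per channel operators.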

I have used these methods for lightmap encoding, environment map encoding, fixed point bloom, screen space reflections, path tracing, and more.

Specular BRDF Reference

2013-08-03T21:16:00.000-05:00

$$ \newcommand{\nv}{\mathbf{n}} \newcommand{\lv}{\mathbf{l}} \newcommand{\vv}{\mathbf{v}} \newcommand{\hv}{\mathbf{h}} \newcommand{\mv}{\mathbf{m}} \newcommand{\rv}{\mathbf{r}} \newcommand{\ndotl}{\nv\cdot\lv} \newcommand{\ndotv}{\nv\cdot\vv} \newcommand{\ndoth}{\nv\cdot\hv} \newcommand{\ndotm}{\nv\cdot\mv} \newcommand{\vdoth}{\vv\cdot\hv} $$ While I worked on our new shading model for UE4 I tried many different options for our specular BRDF. Specifically, I tried many different terms for the Cook-Torrance microfacet specular BRDF: $$ f(\lv, \vv) = \frac{D(\hv) F(\vv, \hv) G(\lv, \vv, \hv)}{4(\ndotl)(\ndotv)} $$ Directly comparing different terms requires being able to swap them while still using the same input parameters. I thought it might be a useful reference to put these all in one place using the same symbols and same inputs. I will use the same form as Naty [1], so please look there for background and theory. I'd like to keep this as a living reference so if you have useful additions or suggestions let me know.

First let me define alpha that will be used for all following equations using UE4's roughness: $$ \alpha = roughness^2 $$

Normal Distribution Function (NDF)

The NDF, also known as the specular distribution, describes the distribution of microfacets for the surface. It is normalized [12] such that: $$ \int_\Omega D(\mv) (\ndotm) d\omega_m = 1 $$ It is interesting to notice all models have $\frac{1}{\pi \alpha^2}$ for the normalization factor in the isotropic case.

Blinn-Phong [2]: $$ D_{Blinn}(\mv) = \frac{1}{ \pi \alpha^2 } (\ndotm)^{ \left( \frac{2}{ \alpha^2 } - 2 \right) } $$ This is not the common form but follows when $power = \frac{2}{ \alpha^2 } - 2$.

Beckmann [3]: $$ D_{Beckmann}(\mv) = \frac{1}{ \pi \alpha^2 (\ndotm)^4 } \exp{ \left( \frac{(\ndotm)^2 - 1}{\alpha^2 (\ndotm)^2} \right) } $$

GGX (Trowbridge-Reitz) [4]: $$ D_{GGX}(\mv) = \frac{\alpha^2}{\pi((\ndotm)^2 (\alpha^2 - 1) + 1)^2} $$

GGX Anisotropic [5]: $$ D_{GGXaniso}(\mv) = \frac{1}{\pi \alpha_x \alpha_y} \frac{1}{ \left( \frac{(\mathbf{x} \cdot \mv)^2}{\alpha_x^2} + \frac{(\mathbf{y} \cdot \mv)^2}{\alpha_y^2} + (\ndotm)^2 \right)^2 } $$
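
The isotropic NDFs above can be checked against the normalization condition numerically. Below is a Python sketch with names of my own choosing. It integrates over $c = \ndotm$, using the fact that for an isotropic distribution the solid angle element reduces to $2\pi\,dc$.

```python
import math


def d_blinn(n_dot_m, alpha):
    a2 = alpha * alpha
    return n_dot_m ** (2.0 / a2 - 2.0) / (math.pi * a2)


def d_beckmann(n_dot_m, alpha):
    a2 = alpha * alpha
    c2 = n_dot_m * n_dot_m
    return math.exp((c2 - 1.0) / (a2 * c2)) / (math.pi * a2 * c2 * c2)


def d_ggx(n_dot_m, alpha):
    a2 = alpha * alpha
    d = n_dot_m * n_dot_m * (a2 - 1.0) + 1.0
    return a2 / (math.pi * d * d)


def projected_integral(ndf, alpha, steps=20000):
    """Midpoint estimate of the integral of D(m) (n.m) over the hemisphere."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        c = (i + 0.5) * h  # c = n.m, and d_omega = 2 pi dc for isotropic D
        total += ndf(c, alpha) * c
    return 2.0 * math.pi * total * h
```

All three integrate to 1, and all three evaluate to $\frac{1}{\pi \alpha^2}$ at $\ndotm = 1$.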

Geometric Shadowing

The geometric shadowing term describes the shadowing from the microfacets. This means ideally it should depend on roughness and the microfacet distribution.

Implicit [1]: $$ G_{Implicit}(\lv,\vv,\hv) = (\ndotl)(\ndotv) $$

Neumann [6]: $$ G_{Neumann}(\lv,\vv,\hv) = \frac{ (\ndotl)(\ndotv) }{ \mathrm{max}( \ndotl, \ndotv ) } $$

Cook-Torrance [11]: $$ G_{Cook-Torrance}(\lv,\vv,\hv) = \mathrm{min}\left( 1, \frac{ 2(\ndoth)(\ndotv) }{\vdoth}, \frac{ 2(\ndoth)(\ndotl) }{\vdoth} \right) $$

Kelemen [7]: $$ G_{Kelemen}(\lv,\vv,\hv) = \frac{ (\ndotl)(\ndotv) }{ (\vdoth)^2 } $$

Smith

The following geometric shadowing models use Smith's method [8] for their respective NDF. Smith breaks $G$ into two components: light and view, and uses the same equation for both: $$ G(\lv, \vv, \hv) = G_{1}(\lv) G_{1}(\vv) $$ I will define $G_1$ below for each model and skip duplicating the above equation.

Beckmann [4]: $$ c = \frac{\ndotv}{ \alpha \sqrt{1 - (\ndotv)^2} } $$ $$ G_{Beckmann}(\vv) = \left\{ \begin{array}{l l} \frac{ 3.535 c + 2.181 c^2 }{ 1 + 2.276 c + 2.577 c^2 } & \quad \text{if $c < 1.6$}\\ 1 & \quad \text{if $c \geq 1.6$} \end{array} \right. $$

Blinn-Phong:
The Smith integral has no closed form solution for Blinn-Phong. Walter [4] suggests using the same equation as Beckmann.

GGX [4]: $$ G_{GGX}(\vv) = \frac{ 2 (\ndotv) }{ (\ndotv) + \sqrt{ \alpha^2 + (1 - \alpha^2)(\ndotv)^2 } } $$ This is not the common form but is a simple refactor by multiplying by $\frac{\ndotv}{\ndotv}$.

Schlick-Beckmann:
Schlick [9] approximated the Smith equation for Beckmann. Naty [1] warns that Schlick approximated the wrong version of Smith, so be sure to compare to the Smith version before using. $$ k = \alpha \sqrt{ \frac{2}{\pi} } $$ $$ G_{Schlick}(\vv) = \frac{\ndotv}{(\ndotv)(1 - k) + k } $$

Schlick-GGX:
For UE4, I used the Schlick approximation and matched it to the GGX Smith formulation by remapping $k$ [10]: $$ k = \frac{\alpha}{2} $$
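
For side by side comparison, here are the Smith $G_1$ terms as a Python sketch. The function names are mine, and the early out at normal incidence in the Beckmann version is my own guard against dividing by zero.

```python
import math


def g1_beckmann(n_dot_v, alpha):
    """Rational approximation of Smith G1 for Beckmann from [4]."""
    sin2 = 1.0 - n_dot_v * n_dot_v
    if sin2 <= 0.0:
        return 1.0  # c goes to infinity at normal incidence
    c = n_dot_v / (alpha * math.sqrt(sin2))
    if c >= 1.6:
        return 1.0
    return (3.535 * c + 2.181 * c * c) / (1.0 + 2.276 * c + 2.577 * c * c)


def g1_ggx(n_dot_v, alpha):
    a2 = alpha * alpha
    return 2.0 * n_dot_v / (n_dot_v + math.sqrt(a2 + (1.0 - a2) * n_dot_v * n_dot_v))


def g1_schlick(n_dot_v, k):
    """Schlick form; k = alpha * sqrt(2 / pi) or alpha / 2 as described above."""
    return n_dot_v / (n_dot_v * (1.0 - k) + k)


def g_smith(g1, n_dot_l, n_dot_v, param):
    """Smith: the same G1 evaluated for light and view, multiplied."""
    return g1(n_dot_l, param) * g1(n_dot_v, param)
```

For example `g_smith(g1_schlick, n_dot_l, n_dot_v, alpha / 2.0)` gives the Schlick-GGX variant, while `g_smith(g1_ggx, n_dot_l, n_dot_v, alpha)` gives the Smith GGX term it approximates.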

Fresnel

The Fresnel function describes the amount of light that reflects from a mirror surface given its index of refraction. Instead of using IOR we use the parameter $F_0$, which is the reflectance at normal incidence.

None: $$ F_{None}(\mathbf{v}, \mathbf{h}) = F_0 $$

Schlick [9]: $$ F_{Schlick}(\mathbf{v}, \mathbf{h}) = F_0 + (1 - F_0) ( 1 - (\vdoth) )^5 $$

Cook-Torrance [11]: $$ \eta = \frac{ 1 + \sqrt{F_0} }{ 1 - \sqrt{F_0} } $$ $$ c = \vdoth $$ $$ g = \sqrt{ \eta^2 + c^2 - 1 } $$ $$ F_{Cook-Torrance}(\mathbf{v}, \mathbf{h}) = \frac{1}{2} \left( \frac{g - c}{g + c} \right)^2 \left( 1 + \left( \frac{ (g + c)c - 1 }{ (g - c)c+ 1 } \right)^2 \right) $$
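
The two non-trivial Fresnel terms translate directly to code. This is a Python sketch with my own function names. The Cook-Torrance version assumes $F_0 < 1$ so that $\eta$ stays finite.

```python
import math


def f_schlick(v_dot_h, f0):
    return f0 + (1.0 - f0) * (1.0 - v_dot_h) ** 5


def f_cook_torrance(v_dot_h, f0):
    """Requires f0 < 1 so that eta is finite."""
    sqrt_f0 = math.sqrt(f0)
    eta = (1.0 + sqrt_f0) / (1.0 - sqrt_f0)
    c = v_dot_h
    g = math.sqrt(eta * eta + c * c - 1.0)
    a = (g - c) / (g + c)
    b = ((g + c) * c - 1.0) / ((g - c) * c + 1.0)
    return 0.5 * a * a * (1.0 + b * b)
```

Both return $F_0$ at $\vdoth = 1$ and 1 at $\vdoth = 0$. They differ in between, so it is worth plotting them against each other before picking one.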

Optimize

Be sure to optimize the BRDF shader code as a whole. I chose these forms of the equations to either match the literature or to demonstrate some property. They are not in the optimal form to compute in a pixel shader. For example, grouping Smith GGX with the BRDF denominator we have this: $$ \frac{ G_{GGX}(\lv) G_{GGX}(\vv) }{4(\ndotl)(\ndotv)} $$ In optimized HLSL it looks like this:

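// Smith GGX G term over 4 NoL NoV. a is alpha; the function name is illustrative.
float SmithGGXOverDenominator( float a, float NoV, float NoL )
{
    float a2 = a*a;
    float G_V = NoV + sqrt( (NoV - NoV * a2) * NoV + a2 );
    float G_L = NoL + sqrt( (NoL - NoL * a2) * NoL + a2 );
    return rcp( G_V * G_L );
}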

If you are using this on an older non-scalar GPU you could vectorize it as well.
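
If you want to convince yourself that the optimized HLSL above really is the same quantity, a quick Python check is enough. This is only a verification sketch with illustrative names. It compares the straightforward $\frac{G(\lv) G(\vv)}{4(\ndotl)(\ndotv)}$ against the folded form, where the $2(\ndotv)$ and $2(\ndotl)$ numerators cancel the denominator.

```python
import math

def g_smith_ggx(n_dot_x, alpha):
    """Smith G1 term for GGX."""
    return 2.0 * n_dot_x / (n_dot_x + math.sqrt(alpha**2 + (1.0 - alpha**2) * n_dot_x**2))

def vis_reference(n_dot_l, n_dot_v, alpha):
    """G(l) * G(v) / (4 NoL NoV), written out directly."""
    return g_smith_ggx(n_dot_l, alpha) * g_smith_ggx(n_dot_v, alpha) / (4.0 * n_dot_l * n_dot_v)

def vis_optimized(n_dot_l, n_dot_v, alpha):
    """Line-for-line port of the optimized HLSL."""
    a2 = alpha * alpha
    g_v = n_dot_v + math.sqrt((n_dot_v - n_dot_v * a2) * n_dot_v + a2)
    g_l = n_dot_l + math.sqrt((n_dot_l - n_dot_l * a2) * n_dot_l + a2)
    return 1.0 / (g_v * g_l)

print(vis_reference(0.7, 0.3, 0.25), vis_optimized(0.7, 0.3, 0.25))
```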

References

[1] Hoffman 2013, "Background: Physics and Math of Shading"
[2] Blinn 1977, "Models of light reflection for computer synthesized pictures"
[3] Beckmann 1963, "The scattering of electromagnetic waves from rough surfaces"
[4] Walter et al. 2007, "Microfacet models for refraction through rough surfaces"
[5] Burley 2012, "Physically-Based Shading at Disney"
[6] Neumann et al. 1999, "Compact metallic reflectance models"
[7] Kelemen 2001, "A microfacet based coupled specular-matte brdf model with importance sampling"
[8] Smith 1967, "Geometrical shadowing of a random rough surface"
[9] Schlick 1994, "An Inexpensive BRDF Model for Physically-Based Rendering"
[10] Karis 2013, "Real Shading in Unreal Engine 4"
[11] Cook and Torrance 1982, "A Reflectance Model for Computer Graphics"
[12] Reed 2013, "How Is the NDF Really Defined?"

Epic, SIGGRAPH, etc

2013-07-28T15:52:00.000-05:00

I'm resurrecting this blog from the dead. I'm sorry it's been neglected for a year but I've been busy. If you follow me on twitter (@BrianKaris) then this probably isn't news, but for those that don't here's an update:

A year ago I left Human Head and accepted a position on the rendering team at Epic Games. Since then we made the UE4 Infiltrator demo. I've worked on temporal AA, reflections, shading, materials, and other misc cool stuff for UE4 and games being developed here at Epic. I'm surrounded by a bunch of really smart, talented people, with whom it has been a pleasure to work.

Just this last week I presented in the SIGGRAPH 2013 course: Physically Based Shading in Theory and Practice. If you saw my talk and are interested in the subject but haven't looked at the course notes, I highly suggest you follow that link and check them out as well as the other presenters' materials. Like previous years, the talks are only a taste of the content that the course notes cover in detail.

Now with that out of the way, hopefully I can start making some good posts again.

Sparse shadows through tracing

2012-05-14T00:04:00.000-05:00

The system I described last time allowed specular highlights to reach large distances but only required calculating them on the tiles where they will show up. This is great but it means now we must calculate shadows for these very large distances. Growing the shadow maps to include geometry at a much greater distance is hugely wasteful. Fortunately there is a solution.

Before I get to that though I want to talk about a concept I think is going to be very important for next gen renderers and that is having more than one representation for scene geometry. Matt Swoboda talked about this in his GDC presentation this year [1] and I am in complete agreement with him. We will need geometry in similar formats as we've had in the past for efficient rasterization (vertex buffers, index buffers, displacement maps). This will be used whenever the rays are coherent simply because HW rasterization is much faster than any other algorithm currently for coherent rays. Examples of use are primary rays and shadow rays in the form of shadow maps.

Incoherent rays will be very important for next gen renderers but we need a different representation to efficiently trace rays. Any that support tracing cones will likely be more useful than ones which can only trace rays. Possible representations are signed distance fields [2][1][9], SVOs [3], surfel trees [4], and billboard cloud trees [5][9]. I'll also include screen space representations although these don't store the whole scene. These include mip map chains of min/max depth maps [6], variance depth maps [7] and adaptive transparency screen buffers [8]. Examples of use for these trace friendly data structures are indirect diffuse (radiosity), indirect specular (reflections) and sparse shadowing of direct specular. The last one is what helps with our current issue.

The Samaritan demo[9] from Epic had a very similar issue that they solved in the same way I am suggesting. They had many point lights which generated specular highlights at any distance. To shadow them they did a cone trace in the direction of the reflection vector against a signed distance field that was stored in a volume texture. This was already being done for other reflections so using that data to shadow the point lights doesn’t come at much cost. The signed distance field data structure could be swapped with any of the others I listed. What is important is that the shadowing is calculated with a cone trace.

What I propose as the solution to our problem is to use traditional shadow maps only within the diffuse radius. Do a cone trace down the reflection vector. The cone trace will return a visibility function that any specular outside the range of a shadow map can cheaply use to shadow.
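
To make the idea of a cone trace returning visibility concrete, here is a minimal Python sketch against a signed distance field. It is not the Samaritan implementation. It uses the common approximation where the distance to the nearest surface, divided by the cone radius at that step, stands in for the unblocked fraction of the cone. The scene (a single sphere), the start offset and the minimum step are all assumptions made for illustration.

```python
import math

def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    """Signed distance from point p to a sphere (hypothetical test scene)."""
    return math.dist(p, center) - radius

def cone_trace_visibility(origin, direction, sdf, cone_tan, max_dist):
    """March a cone along a unit direction and return visibility in [0, 1]."""
    visibility = 1.0
    t = 0.05  # start slightly off the surface to avoid self-occlusion
    while t < max_dist and visibility > 0.0:
        p = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(p)
        cone_radius = cone_tan * t
        # Nearest occluder distance relative to the cone radius approximates
        # the unblocked fraction of the cone at this step.
        visibility = min(visibility, max(0.0, min(1.0, dist / cone_radius)))
        t += max(dist, 0.01)  # sphere-trace step with a floor to guarantee progress
    return visibility

blocked = cone_trace_visibility((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf, 0.1, 20.0)
clear = cone_trace_visibility((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), sphere_sdf, 0.1, 20.0)
print(blocked, clear)
```

The returned value is what a specular highlight outside shadow map range would multiply by. The same loop works with any of the other trace structures as long as they can answer "how far is the nearest surface" or something close to it.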

Actually, having shadowing data independent from the lights means it can be used for culling as well. The max unoccluded ray distance can be accumulated per tile which puts a cap on the culling cone for light sources. I anticipate this form of occlusion culling will actually be a very significant optimization.

This shadowing piece of the puzzle means the changes I suggested in my last post, in theory, come at a fairly low cost assuming you already do cone tracing for indirect specular. That may seem like a large assumption but to demonstrate how practical cone tracing is, a very simple, approximate form of cone tracing can be done purely against the depth buffer. This is what I do with screen space reflections on current gen hardware. I don’t do cone tracing exactly but instead reduce the trace distance with low glossiness and fade out the samples at the end of the trace. This behaves as if the occlusion coverage fades with the radius of the cone at the point of impact, which is a visually acceptable approximation. In other words the crudest form of cone tracing can already be done in current gen. It is fairly straightforward to extend this to true cone tracing on faster hardware using one of the screen space methods I listed. Replacing screen space with global is much more complex but doable.

The result is that, hopefully, point light specularity “just works”. The problem is then shifted to determining which lights in the world to attempt to draw. Considering we have >10000 in one map in Prey 2 this may not be easy :). Honestly I haven’t thought about how to solve this yet.

I, like everyone else who has talked about tiled light culling, am leaving out an important part which is how to efficiently meld shadow maps and tiled culling for the diffuse portion. I will be covering ideas on how to handle that next time.

Finally, I want to reach out to everyone who has read these posts: if you have an idea on how the cone based culling can be adapted to a Blinn distribution please let me know.

[1] http://directtovideo.wordpress.com/2012/03/15/get-my-slides-from-gdc2012/
[2] http://iquilezles.org/www/material/nvscene2008/rwwtt.pdf
[3] http://maverick.inria.fr/Publications/2011/CNSGE11b/GIVoxels-pg2011-authors.pdf
[4] http://www.mpi-inf.mpg.de/~ritschel/Papers/Microrendering.pdf
[5] http://graphics.cs.yale.edu/julie/pubs/bc03.pdf
[6] http://www.drobot.org/pub/M_Drobot_Programming_Quadtree%20Displacement%20Mapping.pdf
[7] http://www.punkuser.net/vsm/vsm_paper.pdf
[8] http://software.intel.com/en-us/articles/adaptive-transparency/
[9] http://www.nvidia.com/content/PDF/GDC2011/GDC2011EpicNVIDIAComposite.pdf

Tiled Light Culling

2012-04-29T20:29:00.000-05:00

First off I'm sorry that I haven't updated this blog in so long. Much of what I have wanted to talk about on this blog, but couldn't, was going to be covered in my GDC talk but that was cancelled due to forces outside my control. If you follow me on twitter (@BrianKaris) you probably heard all about it. My comments were picked up by the press and quoted in every story about Prey 2 since. That was not my intention but oh, well. So, I will go back to what I was doing which is to talk here about things I am not directly working on.

Tiled lighting

There has been a lot of talk and excitement recently concerning tiled deferred [1][2] and tiled forward [3] rendering.

I’d like to talk about an idea I’ve had on how to do tile culled lighting a little differently.

The core behind either tiled forward or tiled deferred is to cull lights per tile. In other words for each tile, calculate which of the lights on screen affect it. The base level of culling is done by calculating a min and max depth for the tile and using this to construct a frustum. This frustum is intersected with a sphere from the light to determine which lights hit solid geometry in that tile. More complex culling can be done in addition to this such as back face culling using a normal cone.
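
The base sphere vs frustum test can be sketched in a few lines of Python. This is a CPU-side illustration with made-up names, not how a compute shader would lay it out. A tile frustum is given as a list of inward-facing planes, and a light is kept unless its sphere lies completely behind one of them. For the example I use an axis-aligned box as a stand-in for the tile frustum; a real one would build the four side planes from the tile's screen bounds and the near/far planes from the tile's min/max depth.

```python
def sphere_intersects_frustum(center, radius, planes):
    """planes: list of (unit_normal, offset), normals pointing inward, so a
    point p is inside a plane when dot(normal, p) + offset >= 0."""
    for normal, offset in planes:
        signed_dist = sum(n * c for n, c in zip(normal, center)) + offset
        if signed_dist < -radius:
            return False  # sphere is entirely outside this plane
    return True  # conservative: can keep a few lights near frustum corners

def cull_lights_for_tile(lights, planes):
    """Return indices of lights whose sphere touches the tile frustum."""
    return [i for i, (center, radius) in enumerate(lights)
            if sphere_intersects_frustum(center, radius, planes)]

# Stand-in tile volume: x,y in [-1, 1], depth from 2 (min) to 10 (max).
tile_planes = [
    ((1.0, 0.0, 0.0), 1.0), ((-1.0, 0.0, 0.0), 1.0),
    ((0.0, 1.0, 0.0), 1.0), ((0.0, -1.0, 0.0), 1.0),
    ((0.0, 0.0, 1.0), -2.0), ((0.0, 0.0, -1.0), 10.0),
]
lights = [((0.0, 0.0, 5.0), 1.0), ((0.0, 0.0, 0.0), 1.0),
          ((0.0, 0.0, 1.5), 1.0), ((5.0, 0.0, 5.0), 1.0)]
print(cull_lights_for_tile(lights, tile_planes))
```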

This very basic level of culling, sphere vs frustum, only works with the addition of an artificial construct which is the radius of the light. Physically correct light falloff is inverse squared.

Light falloff

Small tangent I've been meaning to talk about for a while. To calculate the correct falloff from a sphere or disk light you should use these two equations [4]:

Falloff:
$$Sphere = \frac{r^2}{d^2}$$
$$Disk = \frac{r^2}{r^2+d^2}$$

If you are dealing with light values in lumens you can replace the r^2 factor with 1. For a sphere light this gives you 1/d^2 which is what you expected. The reason I bring this up is I found it very helpful in understanding why the radiance appears to approach infinity when the distance to the light approaches zero. Put a light bulb on the ground and this obviously isn’t true. The truth from the above equation is the falloff approaches 1 when the distance to the sphere approaches zero. This gets hidden when the units change from lux to lumens and the surface area gets factored out. The moral of the story is don’t allow surfaces to penetrate the shape of a light because the math will not be correct anymore.
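
A tiny Python sketch of the two falloff equations shows the behavior described above. I am assuming $d$ is measured from the center of the sphere (so the surface is at $d = r$) and, for the disk, along the disk's axis from its center. The function names are illustrative.

```python
def sphere_falloff(r, d):
    """Falloff from a sphere light of radius r; d is distance to its center, d >= r."""
    return (r * r) / (d * d)

def disk_falloff(r, d):
    """Falloff from a disk light of radius r; d is distance along its axis."""
    return (r * r) / (r * r + d * d)

# At the surface of the light the falloff is 1 rather than infinity.
print(sphere_falloff(0.5, 0.5), disk_falloff(0.5, 0.0))
# Far away both behave like r^2 / d^2.
print(sphere_falloff(0.1, 100.0), disk_falloff(0.1, 100.0))
```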

Culling inverse squared falloff

Back to tiled culling. Inverse squared falloff means there is no distance at which the light contributes zero illumination. This is very inconvenient for a game world filled with lights. There are two possibilities. The first is to subtract a constant term from the falloff and max with 0. The second is windowing the falloff with something like (1-d^2/a^2)^2. The first loses energy over the entire influence of the light. The second loses energy only away from the source. I should note the tolerance should be proportional to the light’s intensity. For simplicity I will use the following for this post:
$$Falloff = max( 0, \frac{1}{d^2}-tolerance)$$
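
Here is a short Python sketch of both options and the culling radius that falls out of the subtracted form. Setting $\frac{1}{d^2} - tolerance = 0$ gives $d = \frac{1}{\sqrt{tolerance}}$. The clamp inside the window function and the choice of setting $a$ to that same radius are my assumptions, made so the two versions reach zero at the same distance.

```python
import math

def falloff_subtract(d, tolerance):
    """Inverse squared falloff with a constant term subtracted, clamped at zero."""
    return max(0.0, 1.0 / (d * d) - tolerance)

def cutoff_radius(tolerance):
    """Distance at which falloff_subtract reaches zero."""
    return 1.0 / math.sqrt(tolerance)

def falloff_windowed(d, a):
    """Inverse squared falloff windowed by (1 - d^2/a^2)^2, clamped past a."""
    window = max(0.0, 1.0 - (d * d) / (a * a)) ** 2
    return window / (d * d)

radius = cutoff_radius(0.01)
for d in (1.0, 5.0, radius, 2.0 * radius):
    print(d, falloff_subtract(d, 0.01), falloff_windowed(d, radius))
```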

The distance cutoff can be thought of as an error tolerance per light. Unfortunately glossy specular doesn’t work well in this framework at all. The intensity of a glossy, energy conserving specular highlight, even for a dielectric, will be WAY higher than the lambert diffuse. This spoils the idea of the distance falloff working as an error tolerance for both diffuse and specular because they are at completely different scales. In other words, for glossy specular, the distance will have to be very large for even a moderate tolerance, compared to diffuse.

This points to there being two different tolerances, one for diffuse the other for specular. If these both just affect the radius of influence we might as well just set the radius of both as the maximum because diffuse doesn’t take anything more to calculate than specular. Fortunately, maximum intensity of the specular inversely scales with the size of the highlight. This of course is the entire point of energy conservation but energy conservation helps us in culling. The higher the gloss, the larger the radius of influence but the tighter the cone of influencing normals.

If it isn’t clear what I mean, think of a chrome ball. With a mirror finish, a light source, even as dim as a candle, is visible at really large distances. The important area on the ball is very small, just the size of the candle flame’s reflection. The less glossy the ball, the less distance the light source is visible but the more area on the ball the specular highlight covers.

Before we can cull using this information we need specular to go to zero past a tolerance just like distance falloff. The easiest is to subtract the tolerance from the specular distribution and max it with zero. For simplicity I will use Phong for this post:
$$Phong = max( 0, \frac{n+2}{2}dot(L,R)^n-tolerance)$$

Specular cone culling

This nicely maps to a cone of L vectors per pixel that will give a non-zero specular highlight.

Cone axis:
$$R = 2 N dot( N, V ) - V$$

Cone angle:
$$Angle = acos \left( \sqrt[n]{\frac{2 tolerance}{n+2}} \right)$$
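
The cone angle comes from solving $\frac{n+2}{2}\cos^n\theta - tolerance = 0$ for $\theta$. A small Python sketch (illustrative names, per-pixel only, no tile union) shows the axis, the angle, and that the thresholded Phong term is zero just outside the cone. It assumes $tolerance \le \frac{n+2}{2}$, otherwise the highlight never exceeds the tolerance and the acos is undefined.

```python
import math

def reflect(n_vec, v_vec):
    """Cone axis: R = 2 N dot(N, V) - V, for unit N and V."""
    n_dot_v = sum(n * v for n, v in zip(n_vec, v_vec))
    return tuple(2.0 * n * n_dot_v - v for n, v in zip(n_vec, v_vec))

def phong_with_tolerance(cos_lr, n, tolerance):
    """Normalized Phong with the tolerance subtracted, clamped at zero."""
    return max(0.0, (n + 2.0) / 2.0 * max(cos_lr, 0.0) ** n - tolerance)

def specular_cone_angle(n, tolerance):
    """Half angle of the cone of L vectors that give non-zero specular."""
    return math.acos((2.0 * tolerance / (n + 2.0)) ** (1.0 / n))

angle = specular_cone_angle(64, 0.01)
print(math.degrees(angle))
print(phong_with_tolerance(math.cos(0.99 * angle), 64, 0.01))  # just inside
print(phong_with_tolerance(math.cos(1.01 * angle), 64, 0.01))  # just outside
```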

Just like how a normal cone can be generated for the means of back face culling, these specular cones can be unioned for the tile and used to cull. We can now cull specular on a per tile basis which is what is exciting about tiled light culling.

I should mention the two culling factors need to actually be combined for specular. The sphere for falloff culling needs to expand based on gloss. The (n+2)/2 should be rolled into the distance falloff which leaves angle as just acos(tolerance^(1/n)). I’ll leave these details as an exercise for the reader. Now, to be clear I'm not advocating having diffuse and specular light lists. I'm suggesting culling the light if diffuse is below tolerance AND spec is below tolerance.
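
As a rough illustration of the "cull only if both are below tolerance" rule, here is a per-sample Python sketch. It is not the separated sphere and cone bounds described above, which are what a tile would actually test. It simply multiplies the normalized Phong lobe by the inverse squared falloff and compares against the tolerance, which is my own assumption about how the two factors combine. N dot L and light intensity are ignored to keep it short.

```python
def light_contributes(d, cos_lr, n, tolerance):
    """True if either diffuse or specular is above tolerance at this sample.

    d: distance to the light, cos_lr: dot(L, R), n: Phong exponent.
    """
    distance_falloff = 1.0 / (d * d)
    diffuse = max(0.0, distance_falloff - tolerance)
    lobe = (n + 2.0) / 2.0 * max(cos_lr, 0.0) ** n
    specular = max(0.0, lobe * distance_falloff - tolerance)
    return diffuse > 0.0 or specular > 0.0

# With tolerance 0.01 the diffuse radius is 10 units.
print(light_contributes(5.0, 0.0, 64, 0.01))     # inside the diffuse radius
print(light_contributes(50.0, 1.0, 1000, 0.01))  # far away, glossy, on the cone axis
print(light_contributes(50.0, 0.9, 1000, 0.01))  # far away, glossy, off axis
print(light_contributes(50.0, 1.0, 8, 0.01))     # far away, rough
```

The second case is the chrome ball: well outside the diffuse radius, but the highlight is still above tolerance for the few normals that line up.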

This leaves us with a scheme much like biased importance sampling. I haven’t tried this so I can’t comment on how practical it is but it has the potential to produce much more lively reflective surfaces due to having more specular highlights for minimal increase in cost. It also is nice to know your image is off by a known error tolerance from ground truth (per light in respect to shading).

The way I handle this light falloff business for current gen in P2 is by having all lighting beyond the artist set bounds of the deferred light get precalculated. For diffuse falloff I take what was truncated from the deferred light and add it to the lightmap (and SH probes). For specular I add it to the environment map. This means I can maintain the inverse squared light falloff and not lose any energy. I just split it into runtime and precalculated portions. Probably most important, light sources that are distant still show up in glossy reflections. This new culling idea may get that without the slop that comes from baking it into fixed representations.

I intended to also talk about how to add shadows but this is getting long. I'll save it for the next post.

References:
[1] http://visual-computing.intel-research.net/art/publications/deferred_rendering/
[2] http://www.slideshare.net/DICEStudio/spubased-deferred-shading-in-battlefield-3-for-playstation-3
[3] http://aras-p.info/blog/2012/03/27/tiled-forward-shading-links/
[4] http://www.iquilezles.org/www/articles/sphereao/sphereao.htm

New Prey 2 screenshot

2011-08-10T00:16:00.000-05:00

It has been a long time since I updated this blog with substantial content but I wanted to point out that Bethesda just released this new screenshot of Prey 2. It's a great shot but it's also a fantastic demonstration of some new graphical features I added to our latest build of the game.

First, there's the depth of field in the background which is HDR circular bokeh DOF.

Secondly, in the puddles on the ground, you will see screen space reflections. They aren't planar reflections, they work on every surface and run on every platform. SSR really adds a ton of dimension and accuracy to our wet, metal filled, alien noir city. I can't talk yet about how it works unfortunately.

So, check it out and tell me what you think. Hopefully in not too long I can start talking about how some of the tech works but for now you just get a glimpse.

-Brian

Virtualized volume textures

2011-01-30T17:42:00.000-06:00

First off it's been a very long time since I made a post. Sorry about that. I've found it difficult to come up with subjects to discuss that I both know enough about and am allowed to publicly talk about. For many, personal hobby projects can be the source of subjects to write about but all the at home pet project stuff I do, I do in the HH codebase and check in if it is a success. Personally I find this more rewarding than the alternative because it can go into a commercial product that hopefully will be seen by millions as well as it can get artist love which is very hard to get with hobby projects. The two biggest downsides are that I no longer own work I do in my free time and I can't easily talk about it. Now on to a technique that fits the bill as I have no particular commercial use for it at the moment.

Irradiance volumes using volume textures is a technique that has been getting some use lately. Check out the following for some places I've noticed it.
Split/Second
Cryengine 3 (Cascaded Light Propagation Volumes)
FEAR 2
Rust

Probably the biggest downside to volumes vs more traditional lightmaps is the resolution. Volume textures take up quite a bit of memory so they need to be fairly low resolution. Unfortunately much of this data is covering empty space. It's convenient for empty space to have some data coverage so that the same solution can be used for dynamic objects but the same resolution is certainly not needed. How would you store a different resolution of volume data on world geometry than in empty space?

The most straightforward solution to me is an indirection texture. Interestingly what this turns into is virtualizing the volume texture just like you would a 2d texture. That indirection volume texture acts like the page table to your physical texture. Each page table texel translates the XYZ coordinates into a page or brick in the physical volume texture and subsequently the UVW coordinates in the physical volume texture. If you need a refresher on virtual texturing check out Sean Barrett's and id's presentations on the topic. All the same limitations apply in 3d as they do in 2d. Pages will need borders to have proper filtering. The smaller the page size the more waste due to borders and the larger the page table gets.
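
A sketch of the page table lookup in Python may help. Everything here is hypothetical since I haven't implemented this: the page table resolution, brick size, border width, and the layout of a page table entry (the brick's position in the physical texture plus an octree level) are all assumptions picked for illustration.

```python
PAGE_TABLE_RES = 4  # page table is 4x4x4 texels (assumed)
BRICK_SIZE = 8      # payload texels per brick edge (assumed)
BORDER = 1          # border texels per side, needed for filtering (assumed)

def make_page_table():
    """Every cell starts out pointing at one coarse brick covering the world."""
    cells = range(PAGE_TABLE_RES)
    return {(x, y, z): ((0, 0, 0), 0) for x in cells for y in cells for z in cells}

def virtual_to_physical(xyz, page_table):
    """Translate a virtual coordinate in [0, 1)^3 to physical texel coordinates."""
    # Which page table texel covers this point.
    cell = tuple(min(int(c * PAGE_TABLE_RES), PAGE_TABLE_RES - 1) for c in xyz)
    brick_origin, level = page_table[cell]
    # A brick at this level covers 1 / 2**level of the volume along each axis.
    bricks_per_axis = 2 ** level
    local = tuple((c * bricks_per_axis) % 1.0 for c in xyz)
    stride = BRICK_SIZE + 2 * BORDER
    return tuple(o * stride + BORDER + l * BRICK_SIZE
                 for o, l in zip(brick_origin, local))

table = make_page_table()
print(virtual_to_physical((0.5, 0.5, 0.5), table))  # resolved by the coarse brick
table[(2, 2, 2)] = ((1, 0, 0), 2)                   # a finer brick becomes resident
print(virtual_to_physical((0.5, 0.5, 0.5), table))  # now resolved by the fine brick
```

The single dictionary lookup plays the role of the single texture fetch: no tree traversal is needed because every cell already stores the deepest resident brick that covers it.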

Another way of thinking of this is as a sparse voxel octree. Instead of the page table being managed like a quadtree it would work like an octree. Typically this data structure is thought of only for ray casting but there's nothing inherent about it that requires that. SVOs have also only been stored as trees, requiring traversal to get to the leaves. So long as you have bricks and aren't working with the granularity of individual voxels the traversal can be changed into a single texture look up just like the quadtree traversal in a 2d virtual texture.

Thinking about it as a SVO is helpful because volume data usually has more sparseness than 2d textures. In this case we don't really care about bricks where there is no geometry. If you use a screen read back to determine which pages to load this will happen automatically. Better yet you don't need to even store this data on disk. Better than that you don't need to generate that data in the first place during the offline baking process. Don't worry about dynamic objects. There is still data covering the empty space, it is just at the resolution of one page covering the world. If you need it at a higher resolution than that you can force a minimum depth to the octree.

In the end it still likely can't compete with the resolution a lightmap can get you if high res is what you are looking for. If lower res is the target it probably will be quite a bit more memory efficient because of not having any waste due to 2d parameterization. As for what a good page size would be I'm not sure as I haven't implemented any of this. If I did I probably couldn't talk about it yet. If anyone does implement this I'd love to hear about it.

RGBD

2010-01-03T15:20:00.002-06:00

This entire post turned out to be hogwash. I'm wiping it out to prevent the spread of misinformation. If you are interested in why it's all nonsense see my comment below. Thanks to Sean Barrett for pointing it out. The result of all of this has been positive ignoring the part of me looking extremely foolish. RGBM is more useful than I originally claimed because a larger scale factor can be used or if a fairly small range is required the gamma correction is likely not needed.

UDK

2009-11-06T00:11:00.002-06:00

Epic released UDK today. It is the next step in their plan to completely dominate the game engine licensing marketplace. They've been doing a pretty good job of that so far. To all the other companies trying to compete I'd suggest following in their footsteps. The reason I believe they are so successful is because they do everything they can to get the engine out there and in people's hands. You can't be scared of people looking at or stealing your stuff. If devs can't look at it they certainly won't plunk down major money to buy it. Epic has been powering their freight train of engine licensing with visibility of their product. With UDK they now have a brand new group of students and indie developers who are going to be familiar with Unreal Engine, and the barrier to evaluating the engine for commercial purposes is practically nothing. You get to see all the tools and a major portion of documentation without even having to talk to someone at Epic. You can see engine updates whenever they release a new build. If you want to know more they will send you more than enough stuff to make up your mind in the form of an evaluation version. Compare that to most of the other engine providers. Many will never let you see any code and require all sorts of paperwork to see private viewings of just an engine demonstration. As expected they are also not getting very many licensees. Stop being so paranoid and let people see your stuff.

On to the tech stuff. I haven't spent a ton of time looking at it yet but I've noticed some new things already. Of course there's the obvious stuff like Lightmass that they've talked all about. There's docs on all that which is great. I'm going to talk about what they aren't talking about. First they changed the way they encode their lightmaps. Previously it was 3 DXT1's that stored the 3 colors of incoming light in the 3 HL2 basis directions. They still use the same concept now but instead only have 2 DXT1's. The first stores the incoming light intensity in each of the 3 directions as RGB. The second is the average color of all 3 directions. The loss in quality may be small as the error is mainly bumps picking up different colors of light. The memory is 2/3 of what it was so it seems like a smart optimization.

What impressed me though was their signed distance field baked shadows. Previously they stored baked shadow masks as G8 or 8 bit luminance format textures. This was used simply as a mask to the light. Now they are computing a signed distance field like this paper. It's still stored in a G8 format texture so there's no difference in storage. The big difference comes from the sharpness and quality from a relatively low res texture. The same smooth lines that are useful for vector graphics for Valve's use also work well for shadows. After I read this paper when it came out I thought the exact same thing. This is perfect for baked shadow textures! I implemented it then but I was trying to have another value in the texture to specify how far the occluder was so I could control the width of the soft edge. The signed distance field could be in the green channel of a DXT1 and the softness could be in red. If that wasn't good enough softness in green and distance field in alpha of a DXT5. The prototype was scrapped after it didn't really give me what I needed and was being replaced with a different shadow system anyways. I am really impressed with Epic's results though. For a fairly large map they get good quality sun shadows from only 4 1024x1024 G8 textures. That's only 4 MB of shadow data. I bet it would still work well from DXT1's too with the data in the green channel. That would bring it down to 2 MB. The only downside is it is unable to represent soft shadows unless someone can get my encoding softness as an extra value idea to work which would be cool. Apparently I should have given that experiment more consideration.
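
For anyone who hasn't used distance field textures, here is roughly what the lookup side looks like, plus the storage arithmetic from above, as a Python sketch. I don't know what UDK's shader actually does. The convention that 0.5 marks the shadow edge with larger values being lit, and the smoothstep width, are assumptions in the spirit of the distance field technique for vector graphics.

```python
def smoothstep(edge0, edge1, x):
    """Hermite interpolation between edge0 and edge1, clamped to [0, 1]."""
    t = min(1.0, max(0.0, (x - edge0) / (edge1 - edge0)))
    return t * t * (3.0 - 2.0 * t)

def sdf_shadow(sample, softness=0.05):
    """Light visibility from a bilinearly filtered distance field sample.

    sample is the G8 value in [0, 1]; 0.5 is assumed to be the shadow edge.
    """
    return smoothstep(0.5 - softness, 0.5 + softness, sample)

def texture_bytes(width, height, bits_per_texel, count=1):
    """Storage for `count` textures without mips."""
    return width * height * bits_per_texel * count // 8

print(sdf_shadow(0.2), sdf_shadow(0.5), sdf_shadow(0.8))
print(texture_bytes(1024, 1024, 8, count=4))  # four G8 textures
print(texture_bytes(1024, 1024, 4, count=4))  # the same four as DXT1 (4 bits per texel)
```

A per-texel softness value, as in my scrapped prototype, would replace the constant `softness` argument.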

Uncharted 2

2009-10-01T00:27:00.007-05:00

We are hiring!
First up, Human Head Studios, where I work, is hiring for basically every department. We are currently staffing up and are looking for talented people. I can't tell you what we are working on but I can say it's awesome and we are doing some very interesting and innovative things on many fronts including technology. The tech department where I am, is developing our own cutting edge tech that I intend to be competitive with all the games I talk about on this blog. I really wish I could be more specific but alas, I cannot. Not yet at least. I'm trying to whet your appetite without getting in trouble if you haven't noticed.

Of most interest to the common visitor of this blog, we are looking for tech programmers. Of most interest to me we are looking for a graphics guy that I will work with on aforementioned secret, awesome, graphics tech. If you understand all of what I talk about here and live and breathe graphics you may be the perfect fit. For submission details see here. Mention my name and you might filter up in the pile. Being a reader of my blog has to count for something right?

Uncharted 2
On to the main event. The Uncharted 2 multiplayer demo came out yesterday and I suggest you check it out. The first game has been a benchmark for graphics on the PS3 and I'm sure when number two comes out it will set the bar again. Although all you can see at the moment are some of the mp maps it is very pretty and much can be discerned from dissecting it.

There are many options in the demo that offer an ability to study things. You can start games with just yourself. There is a machinima mode that I haven't fully figured out yet but it allows you to fool with more than in a normal game. You can also replay a play through using the cinema mode. That allows you to pause, single step forward, fly around and even change post processing and lighting a bit.

Right out of the gate I noticed their nice DOF blur. It's one of the best I've seen. It looks like there is both a small blur and a large blur that are lerped between to simulate a continuous blur radius. It looks better than gaussian but I could be imagining things. It is definitely done in HDR as bright things dominate in the blur.

Speaking of blur they now have object motion blur. In the first game there was motion blur when the camera swung around quickly but it only happened in the distance. It was at low resolution and pretty blurry. I didn't really care for it as any time I moved the camera quickly to look at something it blurred and my eyes lost focus on what I was trying to look at. This type of motion blur is still there but doesn't seem to bother me as much. It's likely less blurry than before but I can't say for sure. It is done in HDR though. In addition to the camera motion blur objects have their own geometry that draws to blur them. This is the same thing Tomb Raider Underworld did for character motion blur. U2 uses it for objects as well as characters, like the bus that drives by in one of the videos. The blur trails behind and can look weird in certain situations when paused but while playing the game these oddities aren't noticeable and it looks pretty nice.

Glare or bloom seemed to be map specific. In the snow map the sky was bright enough to bloom everywhere. In other maps light sources that were bright enough to go almost white but were obviously saturated when blurred didn't bloom at all. The control to change the sun intensity could be jacked up to the point of being completely blown out but with no hint of bloom. On that note, they are correctly tone mapping and not clamping off bright colors, something so many games are getting wrong. Please, people. Use a tone curve that doesn't just clamp off bright colors, as in not linear. Bloom all you want but without a nice response curve your brights will look terrible.
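
To show what I mean by clamping versus a response curve, here is a minimal Python sketch. The Reinhard curve $\frac{x}{1+x}$ is used purely as the simplest example of a non-clamping curve. I am not claiming it is what Uncharted 2 uses.

```python
def tonemap_clamp(x):
    """Linear with a hard clamp: everything above 1 becomes the same white."""
    return min(x, 1.0)

def tonemap_reinhard(x):
    """Simple Reinhard curve: compresses brights but keeps them distinguishable."""
    return x / (1.0 + x)

for hdr in (0.5, 1.0, 2.0, 4.0, 16.0):
    print(hdr, tonemap_clamp(hdr), tonemap_reinhard(hdr))
```

With the clamp, intensities 2, 4 and 16 all land on exactly 1. With the curve they stay ordered and approach 1 gradually, which is why a saturated bright light keeps some color instead of going flat white.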

In regards to lighting the only dynamic shadow in their maps was the sun. I don't know whether that is the same in single player or whether it's mp only. There were other dynamic light sources but they didn't cast dynamic shadows. Strangely they did have precomputed shadows that looked like it was the light being baked into a lightmap. I'm not sure what was happening here as the machinima playground map showed evidence of the artifact from storing precomputed lighting at the verts which is what they did in U1. It's possible that some objects had lightmaps and others were vertex lit. Another possibility is the lightmap looking shadows were not in the baked lighting but were more like light masks. It's hard to say from what I saw. There were also other lights that seemed to be completely baked and not influence characters.

I don't remember from U1 but in this demo the ambient character lighting doesn't seem to change much or be influenced by local features. There seemed to be a warm light direction always there in the night time city map that should be a very cool ambient practically everywhere. I'm guessing the ambient lighting is artist set up and isn't based on sampling the environment.

They are still using a lot of vertex color blending of textures. I think this is done in the shader where it lerps between 2 sets of textures based on a vertex value and a heightmap texture. The heightmap corresponds with one or both of the texture sets and the vertex value is like an alpha test. This hides the typical smooth vertex colored blending and matches the material that is fading. Think of brick and mortar to get the idea. Brick is one material, concrete the other. The heightmap is based off of the brick so as it transitions the mortar starts to dominate and cover the brick, masking the gradient of the vertex values until it's all concrete. In U1 it was very much like an alphatest because it was a hard transition. For U2 these transitions can be smooth. The vertex value is likely a separate vertex stream so that different variations can be used with a single model instance. A material set up in this fashion can be used in a variety of places and with a simple bit of vertex coloring the geometry seems to be uniquely textured.
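
Here is my guess at the blend weight as a Python sketch. This is speculation about their shader, not something I have seen, and the exact remapping is my own choice. The vertex value acts as a threshold against the heightmap, and a softness parameter controls whether the transition is hard like an alpha test (U1) or smooth (U2).

```python
def saturate(x):
    return min(1.0, max(0.0, x))

def blend_weight(vertex_value, height, softness=0.1):
    """Weight of the second material (the concrete in the brick example).

    vertex_value and height are in [0, 1]. The vertex value is stretched by
    (1 + softness) so that 0 gives all brick and 1 gives all concrete
    regardless of height. Small softness approaches an alpha test.
    """
    return saturate((vertex_value * (1.0 + softness) - height) / softness)

def blend(color_a, color_b, weight):
    """Lerp between the two texture sets."""
    return tuple(a + (b - a) * weight for a, b in zip(color_a, color_b))

# Halfway through the transition the low mortar is already covered
# while the top of the brick still shows through.
print(blend_weight(0.5, 0.1), blend_weight(0.5, 0.9))
```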

Here's just a list of some other miscellaneous things I noticed:

SSAO darkens ambient term. Nice addition, helps ground the characters and lessens the pressure on detailed baked lighting. It's much more subtly done than Gears which tended to look very contrasty.

Shadowmap sampling pattern changed to be a bit smoother. Not a big change. There is major shadow acne on characters at times which is not cool.

Particle systems with higher fill expense fade out when you get close to them. There were some nice clouds of dust in a few places around the maps that are gone when you get up to them.

Last shadowmap cascade (3rd?) fades out to no shadow in the distance. This is mostly not noticeable but it can result in the inside of buildings when outdoors turning bright in the distance.

They still have low res translucent drawing although I don't think it's frame rate dependent anymore. I saw low and normal res translucent things at the same time. The odd thing was the low res wasn't bilinearly filtered but nearest. I have no idea why they would do that.

The snow particles stretch with camera movement to fake motion blur. Cool trick.

I can't wait to play the full game. I loved the first and this is looking to be even better. As I said before this will be the new bar so you should definitely check it out for yourself.

RGBM color encoding

2009-04-28T01:38:00.012-05:00

LogLUV
There has been some talk about using LogLUV encoding to store HDR colors in 32 bits by packing it all into a RGBA_8888 target. I won't go into the details as a better explanation than I could give is here. This encoding has the benefit over standard floating point buffers of reducing ROP bandwidth and storage space at the expense of some shader instructions for both encoding and decoding. It has been used in shipped games. Heavenly Sword used it to achieve 4xaa with HDR on a PS3. Uncharted used it for the 2xaa output from their material pass. The code for encoding and decoding is as follows (copied from previous link):

// M matrix, for encoding
const static float3x3 M = float3x3(
  0.2209, 0.3390, 0.4184,
  0.1138, 0.6780, 0.7319,
  0.0102, 0.1130, 0.2969);

// Inverse M matrix, for decoding
const static float3x3 InverseM = float3x3(
  6.0014, -2.7008, -1.7996,
  -1.3320,  3.1029, -5.7721,
  0.3008, -1.0882,  5.6268);

float4 LogLuvEncode(in float3 vRGB)  {
  float4 vResult;
  float3 Xp_Y_XYZp = mul(vRGB, M);
  Xp_Y_XYZp = max(Xp_Y_XYZp, float3(1e-6, 1e-6, 1e-6));
  vResult.xy = Xp_Y_XYZp.xy / Xp_Y_XYZp.z;
  float Le = 2 * log2(Xp_Y_XYZp.y) + 127;
  vResult.w = frac(Le);
  vResult.z = (Le - (floor(vResult.w*255.0f))/255.0f)/255.0f;
  return vResult;
}

float3 LogLuvDecode(in float4 vLogLuv) {
  float Le = vLogLuv.z * 255 + vLogLuv.w;
  float3 Xp_Y_XYZp;
  Xp_Y_XYZp.y = exp2((Le - 127) / 2);
  Xp_Y_XYZp.z = Xp_Y_XYZp.y / vLogLuv.y;
  Xp_Y_XYZp.x = vLogLuv.x * Xp_Y_XYZp.z;
  float3 vRGB = mul(Xp_Y_XYZp, InverseM);
  return max(vRGB, 0);
}

RGBM
There is a different encoding which I prefer. I don't know whether anyone else is using this but I imagine someone is. The idea is to use RGB and a multiplier in alpha. This is often used as an HDR format for textures. I prefer it over RGBE for HDR textures. Here's some info related even though 2 DXT5's is a bit overkill for most applications. Here's some slides from Lost Planet on storing lightmaps as DXT5's. Both LogLUV and these texture encodings are about storing the luminance information separately with a higher precision. This is a standard color compression thing which becomes even more powerful when dealing with HDR data. What at first doesn't make sense is if RGBM is stored in an RGBA_8888 there is no increase in precision by placing luminance in the alpha over having it stored with RGB. The thing is luminance isn't only in alpha. What is essentially stored in alpha is a range value. The remainder of the luminance is stored with the chrominance in rgb. The code is really simple to do this encoding:

float4 RGBMEncode( float3 color ) {
  float4 rgbm;
  color *= 1.0 / 6.0;
  rgbm.a = saturate( max( max( color.r, color.g ), max( color.b, 1e-6 ) ) );
  rgbm.a = ceil( rgbm.a * 255.0 ) / 255.0;
  rgbm.rgb = color / rgbm.a;
  return rgbm;
}

float3 RGBMDecode( float4 rgbm ) {
  return 6.0 * rgbm.rgb * rgbm.a;
}
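
To see what the 8 bit storage does to a color, here is a minimal CPU-side sketch in Python that mirrors the two HLSL functions above and simulates the write to an RGBA_8888 target. The helper names (`quantize8`, `round_trip`) are made up for illustration and the round-to-nearest quantization is an assumption about how the target stores values.

```python
import math

RGBM_RANGE = 6.0  # same constant as the HLSL above


def saturate(x):
    """Clamp to [0, 1], like the HLSL intrinsic."""
    return min(max(x, 0.0), 1.0)


def rgbm_encode(color):
    """Mirror of RGBMEncode: returns (r, g, b, m)."""
    r, g, b = (c / RGBM_RANGE for c in color)
    m = saturate(max(max(r, g), max(b, 1e-6)))
    # Round the multiplier up to an 8 bit step so rgb stays at or below 1.
    m = math.ceil(m * 255.0) / 255.0
    return (r / m, g / m, b / m, m)


def rgbm_decode(rgbm):
    """Mirror of RGBMDecode."""
    r, g, b, m = rgbm
    return (RGBM_RANGE * r * m, RGBM_RANGE * g * m, RGBM_RANGE * b * m)


def quantize8(rgbm):
    """Assumed model of an RGBA_8888 write: clamp, then round to 8 bits."""
    return tuple(round(saturate(c) * 255.0) / 255.0 for c in rgbm)


def round_trip(color):
    """Encode, store in 8 bits per channel, decode."""
    return rgbm_decode(quantize8(rgbm_encode(color)))
```

Under these assumptions the error per channel after a round trip is at most one 8 bit step of rgb scaled by the stored multiplier, so dark pixels (small multiplier) get proportionally finer steps than bright ones.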

I should also note that it is best to convert the colors from linear to gamma space before encoding. If you plan to use them again in linear a simple additional sqrt and square will work fine for encoding and decoding respectively. The constant 6 gives a range in linear space of 51.5. Sure it's no 1.84e19 of LogLUV but honestly did you really need that? 51.5 should be plenty so long as exposure has already been factored in. This constant can be changed to fit your tastes. Those 3 max's can be replaced with a max4 on the 360 if the compiler is smart enough. I haven't looked to see if it does this. Also the epsilon value to prevent dividing by zero I haven't found necessary in practice. The hardware must output black in the event of denormals which is the same as handling it correctly. I haven't tried it on a large range of hardware so beware if you remove it.
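
The range figure can be checked with a couple of lines. This is a small sketch, assuming the 51.5 comes from a gamma of 2.2. The function names are made up for illustration. With the sqrt and square approximation (a gamma of 2.0) the same constant gives a linear range of 36 instead.

```python
def max_linear_value(rgbm_range, gamma):
    """Largest linear value representable when gamma-space colors are encoded."""
    return rgbm_range ** gamma


def linear_to_gamma_approx(c):
    """Cheap gamma 2.0 approximation applied before encoding."""
    return c ** 0.5


def gamma_to_linear_approx(c):
    """Inverse of the approximation, applied after decoding."""
    return c * c


print(max_linear_value(6.0, 2.2))  # roughly 51.5
print(max_linear_value(6.0, 2.0))  # 36.0
```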

There are some major advantages of RGBM over LogLUV. First off is the cost of encoding and decoding. There is no need for matrix multiplies or logs and exp. Especially of note is how cheap the decoding is. It behaves very well in filtering so you can still use the 4 samples in 1, bilinear trick for downsizing. This isn't technically correct but the difference is negligible.
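
Why filtering isn't technically correct can be seen with a two texel average standing in for the bilinear fetch. This is a minimal sketch with made-up helper names, not a measurement of real content. Blending the encoded channels and then decoding matches the reference exactly when the two multipliers are equal, and drifts from it as neighbouring multipliers diverge.

```python
RGBM_RANGE = 6.0


def rgbm_decode(rgbm):
    r, g, b, m = rgbm
    return (RGBM_RANGE * r * m, RGBM_RANGE * g * m, RGBM_RANGE * b * m)


def average(a, b):
    """Equal-weight blend of two tuples, component by component."""
    return tuple((x + y) * 0.5 for x, y in zip(a, b))


def filter_then_decode(texel_a, texel_b):
    """What the texture unit does: blend encoded channels, then decode."""
    return rgbm_decode(average(texel_a, texel_b))


def decode_then_filter(texel_a, texel_b):
    """Reference result: decode each texel, then blend."""
    return average(rgbm_decode(texel_a), rgbm_decode(texel_b))
```

In a rendered image neighbouring pixels tend to have similar multipliers, which is the case where the two paths stay close.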

As far as quality I can't see any banding even in dark stress test cases on a fancy monitor after I've turned all the lights off. It also unsurprisingly handles very bright and saturated colors with the same level of quality. I found no discernible differences in my testing versus LogLUV. I don't have any sort of data on what amount of error it has or whether it covers whatever color space. What I can tell you is that it handles my HDR needs perfectly.

Storing your colors encoded means you cannot do any blending into the buffer. This rules out multi-pass additive lighting and transparency. You will have to use another buffer for transparent things such as particles. This is also a good time to try a downsized buffer since you need a separate one anyways. Now a transparency buffer can store additive, alpha blended and multiply type transparency but only grayscale multiplies since they are going into the alpha channel. Multiply decals can be very useful in adding surface variation while still having tiling textures underneath. These often use color to tint the underlying surface and need to be at full res.

Now for the cool part. Because what is stored in RGB is basically still a color, you can apply multiply blending straight into a buffer stored as RGBM. Multiplying will never increase the range required to store the colors so this is a non destructive operation. In practice I have seen no perceivable precision problems crop up due to this. It is also mathematically correct so there are no worries as to whether it will get weird results.
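
The reason this is mathematically correct is that decoding is linear in rgb. A short sketch of it, assuming the blend state multiplies only the color channels and leaves the stored alpha untouched (the post doesn't spell out the blend state, and `multiply_blend` is a made-up name):

```python
RGBM_RANGE = 6.0


def rgbm_decode(rgbm):
    r, g, b, m = rgbm
    return (RGBM_RANGE * r * m, RGBM_RANGE * g * m, RGBM_RANGE * b * m)


def multiply_blend(dst_rgbm, src_rgb):
    """dest.rgb = dest.rgb * src.rgb, with alpha writes assumed masked off."""
    r, g, b, m = dst_rgbm
    return (r * src_rgb[0], g * src_rgb[1], b * src_rgb[2], m)


stored = (0.5, 0.25, 1.0, 128.0 / 255.0)
tint = (0.5, 1.0, 0.25)

blended = rgbm_decode(multiply_blend(stored, tint))
expected = tuple(t * c for t, c in zip(tint, rgbm_decode(stored)))
# blended and expected agree up to floating point rounding
```

Since the tint is at most 1 in every channel, rgb never leaves the 0 to 1 range, so the stored multiplier stays valid.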

Killzone 2

2009-02-25T18:24:00.000-06:00

I got a bogus DMCA notice on this post. Google took it down and now I'm putting it back up. I just finished Killzone 2 and it really is graphically impressive. If you are reading this blog then you are interested in graphics which means you owe it to yourself to play this game. The other levels in the game I think are actually more impressive than the one in the demo. The level in the demo was pretty geometrically simple. Lots of boxy bsp brush looking shapes. The later levels are a lot more complex. In particular the sand level was very pretty.

Level Construction
There didn't seem to be much high poly mesh rendered to a normal map looking stuff. Most everything was made from texture tiles and heightmap generated normalmaps. Most of the textures are fairly desaturated to the point of being likely grayscale with most of the color coming from the lighting and post processing. This is something we did quite a bit in Prey and is something we are trying to change. You may notice the post changing when you walk through some doorways. The most likely candidates are doors from inside to outside.

FX
Their biggest triumph I think is in the fx and atmospherics. There is a ridiculous number of particles. The explosions are some of the best I've seen in a game. There is a lot of dust from bullet impacts, foot falls, wind, explosions. There's smoke coming from explosions, world fires, rocket trails. Each bullet impact also causes a spray of trailed sparks that collide with the world and bounce. Particles are not the only thing contributing. There are also a lot of tricks with sheets and moving textures. For the dust blowing in the wind effect there is a distinct shell above the ground with a scrolling texture plus lots of particles. The common trick with sheets is fading them out when they get edge on and when you get close to them. Add soft z clipping and a flat sheet can look very volumetric. There are also a lot of light shafts done with sheets. One of these situations you can see in the demo. All of this results in a huge amount of overdraw. It has already been pointed out that they are using downsized drawing. This looks to be 1/4 the screen dimensions (1/16 the fill). This is bilinearly upsampled to apply it to the screen as opposed to using the msaa trick and drawing straight in. Having the filtering makes it look quite a bit better. It looks like it averages about 10% of the GPU frame time. That would mean they didn't need to sacrifice much to get these kinds of effects.

Shadows
All the shadows are dynamic shadow maps. Sunlight is cascaded shadow maps with each level at 512x512. Omni lights use cube shadow maps. They are drawing the back faces to the shadow map to reduce aliasing. Some of the shadow maps can be pretty low resolution. This isn't as bad as Dead Space because they have really nice filtering. This is likely because the RSX has shadow map bilinear pcf for free. I can't tell exactly what the sample pattern is but it looks to alternate. They have stated there are up to 12 samples per pixel. There is a really large number of lights casting dynamic shadows at a time. Even muzzle flashes cast shadows. Lightning flashes cast shadows. At a distance the shadows fade out but the distance is pretty far. To be expected their triangle counts were evenly split between screen rendering and shadow map rendering at about 200k-400k. They should be able to get away with a lot more than that amount of tris.

Lighting
I think this is the first game to really milk deferred lighting for what it's worth. There are a ton of lights. The good guys have like 3 small lights on each one of them. That doesn't include muzzle flashes. The bad guys are defined by the red glowing eyes. These have a small light attached to them so the glowing eyes actually light the guys and anything they are close to. In the factory level you can see 230 lights on screen at once. I'm curious if all of these are drawn or if a good fraction is faded out. If there aren't any faded that is insane. 200 draw calls just in lights and that doesn't count stencil drawing that can happen before. Their draw counts seem to always be below 1000 so this is not likely the case.

Post processing
A fair amount of their screen post processing is done on SPU's. As far as I know this is a first. The DOF has a variable blur size. This is most easily visible when going back and forth to the menu. There is motion blur on everything but the blur distance is clamped to very small.

Misc
Environment maps are used on many surfaces. They are mostly crisp to show sharp reflections. I didn't see any situation where they were locational. They are instead tied to the material.

Another neat effect was the water from the little streams. This wasn't actually clipping with the ground or another piece of geometry at all. It is merely in the ground material and is masked to where it should be. The plane moves up and down by changing what range of a heightmap to mask to.

Their profiler says they are spending up to 30% of an SPU on scene portals. I assumed this meant area / portal visibility. In the demo this made sense. After playing it all it no longer makes sense. There are many areas in the game that are just not portalable. I'm not sure what that could mean anymore. They could use it as a component of visibility and the other component is not on the SPUs. In that case I am curious what they used for visibility.

The texture memory amount stayed constant. This must mean that they are not doing any texture streaming.

They have the player character cast shadows but you cannot see his model. I found this to be kind of strange especially when you can see the shadows at the feet z fighting with the ground but no feet that would have conveniently hid the problem. It's expensive to get the camera in the head thing to work really well so I understand why they didn't wish to do it but personally I would have gone with both or nothing concerning the player's shadow. BTW, why is the player like a foot and a half shorter than everyone else?

For more killzone info:
Deferred lighting
Profiling numbers

It isn't quite to the level of the original prerendered footage but honestly who expected it to be? It is a damn good effort from the folks at Guerrilla. I look forward to their presentation at GDC next week. This is the first year since I've been doing this professionally that I am not going to GDC. I'll have to try and get what I can from the powerpoints and audio recordings. You are all posting your slides right? Wink, wink.

Virtual Geometry Images

2009-01-10T20:07:00.003-06:00

Geometry images are one of those ideas so simple you ask yourself "Why didn't I think of this?" I'll admit it isn't the topic of much discussion concerning the "more geometry" problem for the next generation. They work great for compression but they don't inherently solve any of the other problems. Multi-chart geometry images have a complicated zipping procedure that is also invalid if a part is rendered at a different resolution.

A year ago when I was researching a solution to "more geometry" on DirectX 9 level hardware I came across this paper that was in line with the direction I was thinking. The idea is an extension to virtual textures by having another layer with the textures that is a geometry image. For every texture page that is brought in there is a geometry image page with it. By decomposing the scene into a seamless texture atlas you are also doing a Reyes like split operation. The splitting is a preprocess and the dice is real time. The paper also explains an elegant seam solution.

My plan on how to get this running really fast was to use instancing. With virtual textures every page is the same size. This simplifies many things. The way detail is controlled is similar to a quad tree. The same size pages just cover less of the surface and there are more of them. If we mirror this with geometry images every time we wish to use this patch of geometry it will be a fixed size grid of quads. This works perfectly with instancing if the actual position data is fetched from a texture like geometry images imply. The geometry you are instancing then is a grid of quads with the vertex data being only texture coordinates from 0 to 1. The per instance data is passed in with a stream and the appropriate frequency divider. This passes data such as patch world space position, patch texture position and scale, edge tessellation amount, etc.
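
The shared mesh that every patch instance would reuse can be sketched on the CPU. This is a minimal Python sketch, not code from an engine. `make_patch_grid` and `quads_per_side` are names made up for illustration, and the triangle winding is an arbitrary choice.

```python
def make_patch_grid(quads_per_side):
    """Build the instanced mesh: vertices carry only a UV in [0, 1].

    Returns (verts, indices) where verts is a list of (u, v) pairs and
    indices is a triangle list with two triangles per quad.
    """
    n = quads_per_side
    verts = [(x / n, y / n) for y in range(n + 1) for x in range(n + 1)]

    indices = []
    for y in range(n):
        for x in range(n):
            i0 = y * (n + 1) + x   # top left corner of the quad
            i1 = i0 + 1            # top right
            i2 = i0 + (n + 1)      # bottom left
            i3 = i2 + 1            # bottom right
            indices += [i0, i2, i1, i1, i2, i3]
    return verts, indices


verts, indices = make_patch_grid(2)
# 2x2 quads: 9 vertices and 8 triangles (24 indices)
```

The vertex shader would then use each UV, together with the per instance texture position and scale, to fetch the actual position from the geometry image page.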

If patch tessellation is tied to the texture resolution this provides the benefit that no page table needs to be maintained for the textures. This does mean that there may be a high amount of tessellation in a flat area merely because texture resolution was required. Textures and geometry can be at a different resolution but still be tied such as the texture is 2x the size as the geometry image. This doesn't affect the system really.

If the performance is there to have the two at the same resolution a new trick becomes available. Vertex density will match pixel density so all pixel work can be pushed to the vertex shader. This gets around the quad problem with tiny triangles. If you aren't familiar with this, all pixel processing on modern GPU's gets grouped into 2x2 quads. Unused pixels in the quad get processed anyways and thrown out. This means if you have many pixel size triangles your pixel performance will approach 1/4 the speed. If the processing is done in the vertex shader instead this problem goes away. At this point the pipeline is looking similar to Reyes.
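
The 1/4 figure falls out of counting quads. Here is a small sketch that takes the set of pixels a triangle covers and reports what fraction of the shaded lanes do useful work. `quad_efficiency` is a made-up name and the model ignores everything about a real GPU except the 2x2 grouping.

```python
def quad_efficiency(covered_pixels):
    """Fraction of shaded pixels that are actually covered.

    covered_pixels: iterable of integer (x, y) pixel coordinates covered
    by one triangle. Every 2x2 quad touched is shaded in full.
    """
    pixels = set(covered_pixels)
    quads = {(x // 2, y // 2) for x, y in pixels}
    return len(pixels) / (4 * len(quads))


print(quad_efficiency([(5, 3)]))                                     # 0.25
print(quad_efficiency([(x, y) for x in range(4) for y in range(4)]))  # 1.0
```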

If this is not a possibility for performance reasons, and it's likely not, the geometry patches and the texture can be untied. This allows the geometry to tessellate in detailed areas and not in flat areas. The texture page table will need to come back though which is unfortunate.

Geometry images were first designed for compression so disk space should be a pretty easy problem. One issue though is edge pixels. Between each page the edge pixels need to be exact otherwise there will be cracks. This can be handled by losslessly compressing just the edge and using normal lossy image compression for the interiors. As the patches mip down they will be using shared data from disk so this shouldn't be an issue. It should be stored uncompressed in memory though or the crack problem will return.

Unfortunately vertex texture fetch performance, at least on current console hardware, is very slow. There is a high amount of latency. Triangles are not processed in parallel either. With DirectX 11 tessellators it sounds like they will be processed in parallel. I do not know whether vertex texture fetch will be up to the speed of a pixel texture fetch. I would sure hope so. I need to read specs for both the API and this new hardware before I can postulate on how exactly this scheme can be done with tessellators instead of instanced patches but I think it will work nicely. I also have to give the disclaimer that I have not implemented this. The performance and details of the implementation are not yet known because I haven't done it.

To compare this scheme with the others it has some advantages. Given that it is still triangle rasterization dynamic objects are not a problem. To make this work with animated meshes it will probably need bone indexes and weights stored in a texture along with the position. This can be contained to an animation only geometry pool. It doesn't have the advantage subd meshes have that you can animate just the control points. This advantage may not work that well anyways because you need a fine grained cage to get good animation control which increases patch number, draw count, and tessellation of the lowest detail LOD (the cage itself).

Its ability to LOD is better than subd meshes but not as good as voxels. The reason for this is the charts a model has to be split up into are usually quite a bit bigger than the patches of a subd mesh. This really depends on how intricate the mesh is though. It scales the same way subd meshes do but just with a different multiplier. Things like terrain will work very well. Things like foliage work terribly.

Tools side, anything can be converted into this format. Writing the tool unfortunately looks very complicated. This primarily lies with the texture parametrization required to build the seamless texture atlas. After UV's are calculated the rest should be pretty straightforward.

I do like this format better than subd meshes with displacement maps but it's still not ideal. Tiny triangles start to lose the benefits of rasterization. There will be overdraw and triangles missing the center of pixels. More important I think is that it doesn't handle all geometry well, so it doesn't give the advantage of telling your artists they can make any model and place it however they want and it will have no impact on the performance of the game. Once they start making trees or fences you might as well go back to how they used to work because this scheme will run even slower than the old way. The same can be said for subd meshes btw.

To sum it up I think it's a pretty cool system but it's not perfect and doesn't solve all the problems.

More Geometry

2009-01-08T22:40:00.013-06:00

There has been a lot of talk lately about the next generation of hardware and how we are going to render things. The primary topic seems to be "How are we going to render a lot more geometry than we can now?" There are two approaches that are getting a lot of attention. The first is subdivision surfaces with displacement maps and the second is voxel ray casting.

Here are some others that are getting a bit of attention.

Ray casting against triangles
Intel's ray tracer
Cuda ray tracer

Point splatting
Far Voxels
QSplat
Atom

Progressive meshes
Progressive Buffers
View dependent progressive mesh on the GPU

Progressive Buffers is one of my favorite papers. It's one that I keep coming back to time and time again.

Otoy
interview

Who knows exactly what is going on in Otoy. It almost seems like they are being deliberately confusing. It's for a game engine, it's for movie CG, it's lightstage (which I thought was a non-commercial product), it's a server side renderer, it's a web 3d viewer / streaming video.

What I have gathered it generates an unstructured point cloud on the gpu and creates a point hierarchy. It uses this for ray tracing not just eye rays but shadows and reflections. The reflections are massively cached. It's not clear how. I can't figure out how this is working with full animations like they have in their videos. Either that would require regenerating the points, which makes ray casting into it kind of pointless, or it has to deal with holes. Whatever they are doing the results are very impressive.

Subdivision surfaces with displacement maps
This has a lot of powerful people behind it. Both nVidia and AMD are behind it along with DirectX 11 API support through the hull shader, tessellator, and domain shader. I'm not a big fan of this. First off, it's only really useful in data amplification not data reduction. For example our studio and many others are now using the subd cage for the in game models for our characters. That means the lowest tessellation level the subd surface can get to, the subd cage, is the same poly count as our current near LOD character meshes. It makes subdivision surfaces not useful at all in LODing moderate distances. This can be reduced some but likely not by enough to solve the problem. It looks really complicated to implement and rife with problems. It requires artist input to create models that work well with it. The data imported from the modelling package can be specific to that package. It's hard to beat the generality of a triangle soup.

The plus side is there is no issue with multiple moving meshes or deforming. They are fully animatable. To allow good animation control the tessellation of the cage may need to be higher. In theory every piece of geometry in your current engine could be replaced with subd models with displacement maps and the rest of your pipeline would work exactly the same.

Check out some work in this direction from Michael Bunnell of Fantasy Lab:
GPU Gems 2 chapter
Fantasy Lab

Voxel ray casting
This has been made popular by John Carmack who has described this as his planned approach for idTech6. Considering he's always at the head of the pack this should give it pretty strong backing. John refers to his implementation as a sparse voxel octree (SVO). The idea is to extend his current virtual texturing mip blocks to 3D with a "mip" hierarchy of voxels that will be stored as an octree. The way this is even remotely reasonable is that you only need to store the important data, no data in empty space. This is very different from most scientific and medical applications that require the whole data block. This structure is great for compression. Geometry compression now turns into image compression which is a well studied problem and effective. LODing works from the whole screen being a single voxel to subpixel detail. To render it every screen pixel casts a ray into the octree until it hits a voxel the size of a pixel or smaller. This means that LODing reduces both rendering cost and memory. If you don't need the voxel data you don't need it in memory.
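
The stopping rule for the ray can be sketched as a depth calculation. This is a rough Python sketch, not anything from id's implementation. It assumes a perspective camera, approximates a pixel's world space footprint at a given distance from the vertical field of view, and halves the voxel size at each octree level. All of the names are made up for illustration.

```python
import math


def pixel_footprint(distance, fov_y_radians, screen_height_px):
    """Approximate world space height of one pixel at the given distance."""
    return 2.0 * distance * math.tan(fov_y_radians * 0.5) / screen_height_px


def stop_depth(root_size, footprint):
    """Octree depth at which a voxel is the size of a pixel or smaller.

    root_size is the edge length of the root voxel. Each level down
    halves the edge length.
    """
    depth = 0
    size = root_size
    while size > footprint:
        size *= 0.5
        depth += 1
    return depth


print(stop_depth(1024.0, 1.0))     # 10
print(stop_depth(1024.0, 2048.0))  # 0, the root already fits in a pixel
```

Since the footprint grows linearly with distance, doubling the distance drops the required depth by one level, which is where the rendering and memory savings come from.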

I like this approach because it gives one elegant way of handling textures, geometry, streaming, compression and LODing all in one system. There are no demands on the input data. Anything can be converted into voxels. Due to a streaming structure very similar to virtual texturing it allows unique geometry and unique texturing. This means there are no restrictions on the artists. There is little way an artist can impact performance or memory with assets they create. That puts the art and visuals in the artists hands and the technical decisions in the engineers hands.

There are some problems. Ray tracing has always been slow. Ray casting is a search where rasterization is binning. Binning is almost always faster than searching. In a highly parallel environment the synchronization required in binning may tip the scales in searching's favor. As triangles shrink the number of multiple triangles in one bin or bin misses also hurts the speed advantage. It has now been demonstrated to be fast enough to render on current hardware at 60fps. This should be enough proof to let this concern slide a bit. It does mean it will only be able to be rendered once. It's unlikely there will be power left to render a shadow map or ray trace to the light for that matter. My guess is John's intent is to have the lighting fully baked into the texture like how idTech5 works currently. This also cuts down on the required memory as only one color is required per voxel.

Memory is another possible problem. Jon Olick's demo of one character using a SVO required ~1 GB of video memory which was not enough to completely hide paging. His plan to decrease this size was entropy encoding which means each child's data is based on its parent's data. As far as I'm aware this is only going to work if you use a kd-tree restart traversal which is slower than the other alternatives. Otherwise he would need to evaluate the voxel data for the whole stack once he wishes to draw the pixel.

The most important problem is it doesn't work with dynamic meshes. The scheme I believe John Carmack is planning on using is a static world with baked lighting with all dynamic objects and characters using traditional triangle meshes. I expect this to work out well. The performance of this type of situation has been shown to be there so it's not too risky to pursue this direction. There is something about it that bugs me. You release the constraints of the environmental artists but leave the other artists with the same problems. If you handle texturing inherently with voxels does that mean he needs to keep around his virtual texturing for everything else? Treating the world and the dynamic objects in it separately has been required in the past with lightmaps and vertex lighting. To bring back this Hanna-Barbera effect in an even more complicated way leaves a bad taste in my mouth. I'm really looking for a uniform solution for geometry and textures.

For more detailed information see Jon Olick's Siggraph presentation. He is a former employee of id. Also check out the interview that started this all off.

The brick based voxel implementations seem like a better solution to me than having the tree uniform. This means the leaves are a different type than the nodes. They consist of a fixed size brick of voxels. Being in a brick format has many advantages. It allows free filtering and hardware volume texture compression through DXT formats.

Check out these for brick approaches:
Gigavoxels
GPU ray casting of voxels

Other blogs

2008-11-20T00:56:00.003-06:00

I just stumbled on this blog the other day that I haven't seen linked in the graphics blog circle so I thought I'd make a point of mentioning it. Chris Evans, a technical artist from Crytek, has a blog. He has just left Crytek to go to ILM and I hope he keeps up the blog as it's full of art and technical topics. His main site also has some good stuff like cryTools, Crytek's suite of max scripts.

I didn't get a chance to play Resistance 2 yet but for a good rendering breakdown of the game check out Timothy Farrar's post. Timothy is a Senior Systems Programmer I work with at Human Head so you can trust him ;). From a part of the game I did see I think the object shadows work similarly to the first cascade of cascaded shadow maps as in there is one shadow map that is fit to the view frustum within a short range and fades out past that range. I haven't seen the rest of the game to know whether shadows can come from more than one direction that would make this not work.