Friday, February 6, 2026

Nanite + Reyes

Let's get into the details of how this integrates with Nanite. We aren’t replacing anything Nanite already does. Nothing is changed about the Nanite data structure or how it is built. We are only adding new stages to the rendering pipeline. Instead of directly rasterizing a cluster’s triangles, those triangles are considered to be patches with a displacement function. As such, depending on their size they may need additional tessellation before rasterizing.

That means there can be a mix of both simplification from base Nanite and amplification from tessellation simultaneously on the same mesh: small patches collapsed away by Nanite’s LOD selection and large patches tessellated. The base LOD heuristic is still valid. It takes into account both positional error and normal error. If both are low then many patches can be represented by a few with little loss, whether they are displaced or not. It isn’t simplifying the displacement function in this case, just the base geometry that is being displaced.

The calculated error is for the original surface, whose area could change significantly due to displacement. That is one example among many where the displacement signal would need to be known to compute the correct error. Because the displacement isn’t known, its impact is ignored.

Pipeline


This is a high level diagram of the stages in the base Nanite rendering pipeline:
[Diagram: main pass — instance cull → cluster cull → HW/SW rasterize for visible items; occluded instances, clusters, and nodes carry over to a second pass that repeats instance cull → cluster cull → HW/SW rasterize.]

These are the new pipeline stages appended to the end to support tessellation:
[Diagram: clusters feed patch split; split patches recurse back through patch split, visible patches go to SW rasterize, and occluded patches are re-tested in a second pass; non-tessellated clusters continue to HW/SW rasterize as before.]
As you can see, the new pipeline architecture is very similar to what was there before. Not everything needs tessellation though. Currently only materials with displacement mapping enable tessellation. In the future we may also support higher order surfaces. Non-tessellated materials still end where they did before.

The programmable blocks are actually multiple passes. There was one such block before but now there are two: ClusterRasterize and PatchRasterize. There are binning setup passes that I’ve omitted from those blocks but more importantly there is a Dispatch and Draw call for each material using a programmable rasterization feature. We can ignore the previously supported programmable features for the moment. What is important to understand is that displacement mapping is a programmable feature. The displacement function is shader logic expressed by a node graph and authored by an artist. Any shader that needs to evaluate the displacement function must be programmable meaning there needs to be a material shader permutation compiled for it and a DispatchIndirect to launch it.

Also important is realizing what is not a programmable stage. Artist-programmable shaders should in general be minimized to reduce the number of unique compiled shaders and dispatches, but in this case there are also scheduling reasons to avoid them. Not being programmable obviously means not having access to any of the programmable logic, which can be difficult. I’ll get into more of this later.



ClusterRasterize


Time to start digging into these stages. The ClusterRasterize compute shader used to do exactly what its name implies: take a visible cluster and rasterize it using our software rasterizer.

Rough sketch of old ClusterRasterize:
Scalar
  • Load cluster data
Thread per vertex
  • Load position
  • Transform position
  • Store in groupshared
Thread per triangle
  • Load indexes
  • Load transformed positions from groupshared
  • Rasterize triangle
    • If pixel inside triangle then atomic write to VisBuffer

In the triangle stage it now needs to first check whether a patch needs tessellation. To do that it calculates the patch’s TessFactors. If all are <= 1 it doesn’t need tessellation. Even so, the vertices still need to be displaced, which is why this is still a programmable stage. If all TessFactors are <= MaxDiceFactor the patch can be added to the dice queue, otherwise it’s added to the split queue.
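The classification above can be sketched as follows. This is illustrative only: the names `ClassifyPatch`, `PatchRoute`, and the scalar `maxDiceFactor` parameter are my own, and the real shader works on per-edge factors inside a compute kernel.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical sketch of the per-triangle routing decision described above.
enum class PatchRoute { RasterizeDirect, Dice, Split };

PatchRoute ClassifyPatch(float tf0, float tf1, float tf2, float maxDiceFactor)
{
    float maxTf = std::max(tf0, std::max(tf1, tf2));
    if (maxTf <= 1.0f)
        return PatchRoute::RasterizeDirect; // no tessellation, but verts still get displaced
    if (maxTf <= maxDiceFactor)
        return PatchRoute::Dice;            // small enough to dice in one go
    return PatchRoute::Split;               // too big; recursive splitting first
}
```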

We’ll return later to ClusterRasterize as it is far more complex than that but this is enough for now to stand up a functional pipeline.



PatchSplit


Each thread reads one subpatch from the split queue and operates on it. It computes bounds and culls. If the subpatch is visible, it determines TessFactors and splits it.

If TessFactors <= MaxDiceFactor it is queued for dicing and PatchRasterize. Otherwise it is split into child subpatches according to the SplitFactors and the Tessellation Table. The Tessellation Table’s barycentrics for each child are now relative to the subpatch so they need to be made absolute, ie relative to the base patch. Those child subpatches are finally added back to the split queue for another round of PatchSplit. I’ll skip over how to load balance the work of a variable number of children for now.
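Making the child barycentrics absolute is just a linear combination: the subpatch corners are themselves barycentrics in the base patch, so a child vertex relative to the subpatch is re-expressed by weighting those corners. A minimal sketch (the `Bary` struct and function name are mine, and this uses floats where the real code works on quantized coordinates):

```cpp
#include <cassert>

struct Bary { float u, v, w; }; // normalized: u + v + w = 1

// Convert a child vertex's barycentrics, expressed relative to a subpatch,
// into barycentrics relative to the base patch. c0..c2 are the subpatch
// corners as base-patch barycentrics. Barycentric interpolation is linear,
// so this is a weighted sum of the corners.
Bary ToBasePatch(Bary child, Bary c0, Bary c1, Bary c2)
{
    Bary out;
    out.u = child.u * c0.u + child.v * c1.u + child.w * c2.u;
    out.v = child.u * c0.v + child.v * c1.v + child.w * c2.v;
    out.w = child.u * c0.w + child.v * c1.w + child.w * c2.w;
    return out;
}
```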

Subpatch format

I’ve been referring to reading and writing subpatches but how are they stored? Ideally the data to express a subpatch is as small as possible since recursive splitting means a lot of reading and writing of these subpatches from the split work queue. Unlike regularized topology we can’t just store an index representing which region of the original patch this subpatch covers. The Tessellation Table generates irregular topology by design. While it might be tempting to store an index into the Tessellation Table, that would only work for 1 level of recursion. Each level is relative to the last and the stack would need to be unwound to derive the actual coordinates. Instead a subpatch stores barycentric coordinates for all 3 corners.
struct Subpatch
{
    uint32 VisibleClusterIndex : 25; // which visible cluster this subpatch came from
    uint32 TriIndex : 7;             // base triangle (patch) within that cluster
    uint32 BarycentricsUV[3];        // 3 corners, each packed u:v as 16:16; w implied
};

Culling

Visibility culling for subpatches works the same as clusters; their bounds are tested against the frustum, the HZB for occlusion, and the VSM page mask in the case of shadows. Unlike clusters, tight bounds around displaced patches aren’t known until they are displaced, a classic chicken-or-egg problem. Since the displacement function is arbitrary shader logic there is limited ability to be clever, outside of involved interval arithmetic analysis of shader code.

Instead we ask the user to specify the max range of the displacement function. This is a key source of user error. I’ve tried my best to reframe the problem as a user-defined mapping of a [0,1] displacement function to encourage full use of the specified range, but this continues to be a pitfall artists commonly trip over, bloating the bounds and destroying culling and performance.

But, even if we wanted to evaluate displacement in this shader we couldn’t. PatchSplit is a global shader and not specialized per material. I’ll explain why in a moment. This one shader is responsible for splitting patches from all materials so all logic contributing to patch splitting must be fixed function.

This user-provided displacement range is in the same space as the displacement function, ie it is scalar. To determine screen space bounds for testing I use the technique from [Niessner and Loop]. We have since removed the normalizing of displacement vectors after interpolation, so the spherical capped cone logic is unnecessary, but it has yet to be adjusted. A simple union of prism corners should be slightly tighter.
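The simpler prism-corner bound mentioned above could look like this sketch: offset each undisplaced corner along its (unnormalized) displacement direction by both extremes of the user-declared range, and take the AABB of the six resulting points. All names here are mine, and a real implementation would work in whatever space the culling tests expect.

```cpp
#include <algorithm>
#include <initializer_list>

struct Vec3 { float x, y, z; };
struct Bounds { Vec3 mn, mx; };

// Hedged sketch: AABB of a displaced patch as the union of "prism" corners.
// Each base corner p can land anywhere on p + d * dir for d in
// [minDisp, maxDisp], the user-declared scalar displacement range.
Bounds PatchPrismBounds(Vec3 p0, Vec3 p1, Vec3 p2,
                        Vec3 d0, Vec3 d1, Vec3 d2,
                        float minDisp, float maxDisp)
{
    Vec3 pos[3] = { p0, p1, p2 };
    Vec3 dir[3] = { d0, d1, d2 };
    Bounds b = { { 1e30f, 1e30f, 1e30f }, { -1e30f, -1e30f, -1e30f } };
    for (int i = 0; i < 3; i++)
    {
        for (float d : { minDisp, maxDisp })
        {
            Vec3 p = { pos[i].x + d * dir[i].x,
                       pos[i].y + d * dir[i].y,
                       pos[i].z + d * dir[i].z };
            b.mn.x = std::min(b.mn.x, p.x); b.mx.x = std::max(b.mx.x, p.x);
            b.mn.y = std::min(b.mn.y, p.y); b.mx.y = std::max(b.mx.y, p.y);
            b.mn.z = std::min(b.mn.z, p.z); b.mx.z = std::max(b.mx.z, p.z);
        }
    }
    return b;
}
```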

Recursive splitting

Recursive splitting is effectively recursive tree expansion of an implicit tree. This task looks just like Nanite’s hierarchical cluster culling so the same tool to solve work distribution comes to mind, a persistent threads shader with a global MPMC lockless work queue. That is exactly what I started with and was what shipped in the first version.

Multipass for recursive

As previously mentioned this is not D3D spec compliant and not guaranteed to make forward progress on all GPU architectures. It does work well on many though. On PC, where we can’t be certain, we no longer use it for cluster culling. Dealing with the support burden wasn’t worth it for the perf improvement. For fixed platforms like consoles we still use it.

Likewise, instead of persistent threads, PatchSplit is now multipass on all platforms, not just PC. Remember the key advantage of persistent threads was avoiding sync points, with drain and spin-up periods where the GPU isn’t filled. With PatchSplit we can async overlap with non-tessellated Nanite rasterization work through smart scheduling and not worry about idling.

Also worth noting that there is a known and small limit to the number of passes required. Subpatches write 16b barycentrics. The MaxSplitFactor can only divide 65534 so many times before written child subpatches will be the same as their parents. With the persistent threads approach this actually caused problems since in rare cases recursion would go past this depth and become an infinite loop. Detecting that requires a bunch of checks that a fixed recursion depth of multipass doesn’t need.
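The bounded pass count follows directly from the 16-bit barycentrics. A sketch of the reasoning (assuming, as an illustration, that the smallest useful SplitFactor is 2; the function name and parameter are mine):

```cpp
// Upper bound on the number of PatchSplit passes. Barycentrics are stored
// in 16 bits mapped onto [0, 65534]; each split divides a subpatch's
// barycentric extent by at least minSplitFactor, so the loop counts how
// many times the extent can shrink before a child would occupy a single
// quantized step and be bitwise identical to its parent.
int MaxSplitPasses(int minSplitFactor = 2)
{
    int passes = 0;
    for (int extent = 65534; extent > 1; extent /= minSplitFactor)
        passes++;
    return passes;
}
```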

One could argue that those cases showing up in practice suggest more than 16b is needed for subpatch barycentrics, but I’d argue that 1 base triangle being tessellated to a resolution of MaxDiceFactor * 64k is surely plenty. The crashes we encountered only happened in unreasonable cases with cameras crashing through surfaces.

Either way, whether persistent threads or multipass, it becomes clear now why it is important this is a global shader and thus all logic needs to be fixed function, decoupled from the programmable material displacement. If there were per-material dispatches of PatchSplit, a persistent threads style would need to spin up enough threads to fill the machine each time, often only to immediately retire since many wouldn’t have much work. It’s impossible to know how much work a single starting patch might turn into due to recursion. No overlap could be used between dispatches if they shared the same buffer for the queue. Multipass has a simpler reason: the overhead of lots of dispatches.

Being a global shader means any data about the materials that needs to be accessed, such as the displacement range, needs to be in global buffers. It also means no user-programmable normals and no programmable dicing rate.



PatchRasterize


Unlike the split queue, the dice queue doesn’t contain subpatch structs, since every patch in this queue was already on the split queue or is a base patch. Instead we can index them and reduce memory and traffic. The TessFactors for dicing have also already been calculated in a prior stage. To avoid repeating that work they are also written to the dice queue, or more accurately, the Tessellation Table index that pattern corresponds to is.

Subpatches after splitting are mostly close to uniform topology and full size. By that I mean most have TessFactors near MaxDiceFactor. This is because splitting already dealt with most of the irregularity, assuming that patches reaching here were the result of splitting. That isn’t true from what I’ve explained so far but will be true by the end.

Given this, PatchRasterize can follow the same pattern as base Nanite ClusterRasterize. It's fine to statically assign threads to diced vertices and triangles and not worry about empty threads since there will be few of those.

Rough sketch of PatchRasterize:
Scalar
  • Load cluster data
  • Load patch data
  • Load patch corner data
  • Transform corners
Thread per diced vertex
  • Load vertex barycentrics from Tessellation Table
  • Lerp the corners using barycentrics
  • Evaluate displacement function
  • Transform position
  • Store in groupshared
Thread per diced triangle
  • Load indexes from Tessellation Table
  • Load transformed positions from groupshared
  • Rasterize triangle

Too much scalar

While there is a decent chunk of scalar work in base ClusterRasterize in the form of per-cluster work, there is much more in PatchRasterize since there is also per-patch work. In most cases scalar work gets hidden by vector work. If there is too much of it, or it is too front loaded, it can start to matter. Worse yet is when that group-uniform work is float math. RDNA3 and earlier don’t have any scalar float ops, so these ops go to the vector unit with only 1 lane utilized. Basically, while the virtual “domain shader” (DS) work is vectorized, the “vertex shader” (VS) work isn’t.

Having all this scalar work is not well utilizing what is primarily a vector processor. To try and vectorize this while still keeping the results in registers Rune Stubbe made an optimization where both the 3x work from patch corners as well as multiple patches are spread across the threads. This means 1 wave works on multiple patches, purely to try and vectorize per patch work. First each thread works on 1 patch corner (VS). Then the shader switches to 1 patch at a time, looping over all patches this wave covers. A patch loads its data from the first phase using WaveReadLaneAt.

Don’t normalize

Another optimization Rune made was noticing that if he removed normalizing the vertex normal after interpolation, all the rest of the transform math is linear. That means that the base patch positions and base patch normals can be transformed to clip space once and shared for the whole patch. This is basically like moving work from DS to VS. The visual difference is minor, and more importantly, the choice to normalize in the first place is conventional but arbitrary.

Software rasterization only

Unlike clusters, patches don’t have a hardware rasterization path. The most obvious reason is that Reyes generates micropolys by design. Even if the dicing rate is larger than 1 pixel it will still be far smaller than the threshold for switching to HW rasterization. This is convenient in that it means we don’t need an additional shader permutation for HW. We also don’t need to issue the draws for them, most of which would be empty. This presents some issues though.

Near plane clipping

The first is easy to address. Before, to avoid having to deal with near plane clipping in SW, we sent any clusters that intersected the near plane down the HW path. We still don’t want to deal with it, but at least this time the triangles are small. So instead of clipping I cull triangles that cross the near plane. This looks nearly indistinguishable from clipping since the triangles are so small. Initially I culled subpatches, but that was too coarse and depended on their bounds, which may be considerably bigger than the subpatches themselves, leading to unexpected culling of subpatches nowhere near the camera.

No MaxEdgeLength test

While tessellation tries to achieve triangles that are the size of the dicing rate, it only does so before displacement. Typically the difference is small but it isn’t guaranteed. Sharp discontinuities in the displacement function are the worst case. These will cause a diced triangle to stretch the distance from min to max displacement. That can be longer than is good for the SW rasterizer to handle. Instead of allowing unbounded rasterization cost, the rasterizer clamps the screen rect a triangle covers (to a max of 64 pixels). This means this worst case will visually appear as the surface tearing apart under too much stretching as the camera gets close. This may force us to add HW rasterization in the future.

Holes due to SW rasterized triangles longer than 64 pixels


DS derivatives

The last issue isn’t because of SW raster, nor is it particular to Nanite Tessellation. There are no automatic derivatives in domain shaders. Texture samples in the displacement function require UV derivatives for mipmapping to work. Mipmapping needs to work for band limiting the signal and reducing aliasing, but more importantly it is needed for cache coherency. Using mip0 can be a serious performance loss.

Traditional Reyes renderers dice into rectangular grids. Finite differences along grid UV directions are simple and can be used for shading but the Tessellation Table’s irregular meshing doesn’t provide that simple neighbor lookup. Perhaps analytic gradients could be used instead like we do for deferred materials? It still needs to be relative to a sampling rate though. For that we can use the chain rule:

\begin{equation} \frac{dUV}{dTessFactors} = \frac{dUV}{dXYZ} \frac{dXYZ}{dTessFactors} \end{equation} Pardon my lack of mathematical rigor here in terms of dimensionality. Anisotropic texture filtering isn’t important, only isotropic trilinear filtering is needed, so for simplicity assume this is projected in the direction that maximizes this derivative and the equation above can be treated as scalar.

The second term that includes TessFactors is inconvenient. It’s only defined at the edges. The interior could somehow lerp it but the corners would still remain undefined. Thankfully $\frac{dXYZ}{dTessFactors}$ is effectively $\frac{1}{DicingRate}$. That was how the TessFactors were computed in the first place. That is a perfectly smooth function in XYZ space, independent from the surface. So it is defined, continuous, and thus will always match between patches.

The problem of continuity is actually with the first term. Taken directly this is a piecewise constant function. It’s basically the tangent basis of each face before orthonormalization. Like TessFactors, edges could be made to match by only using data of the edge. Edges could exclusively use the UV difference along the edge, ignoring any component of the gradient orthogonal to the edge direction. Unlike TessFactors this doesn’t entirely solve it due to the corners. Corners won’t have just 1 neighbor. They will have an arbitrary number of them depending on the valence of the vertex. Following a similar trick would mean using only vertex data which doesn’t make sense for a rate of change since a point is zero size.

Solving this requires some form of continuous function over the mesh through preprocessing. Like how the tangent basis is typically calculated, this could be averaged and stored per vertex but at a memory cost. Worth noting we support multiple UV channels. From a workflow perspective it would mean extra heavyweight data would need to be optionally built for meshes to allow displacement on them.

UV density

Instead of averaging per vertex, Jamie Hayes implemented averaging this UV density over the whole mesh (for each UV channel). This is the same thing most engines do for informing mip-based texture streaming. It won’t work well for large differences in texture density across a single mesh, but that is considered undesirable for anything visible anyways. This also doesn’t account for any other attribute that could be used in the shader for procedural texturing. For anything else we can’t provide a reasonable derivative for, we use zero. These cases are rare and the performance loss is acceptable.

Sunday, February 1, 2026

How to tessellate

There are many different approaches for determining whether and by how much a patch needs to be tessellated. It is important that we don’t create cracks between neighboring patches or subpatches. Many Reyes implementations choose tessellation patterns which do not match and must stitch them together after they are generated. We need something that can be determined purely from the subpatch itself with no communication needed between subpatches such that they can run completely in parallel.

TessFactors

The approach Moreton and D3D’s hardware tessellation stage take is a simple solution to this problem and the one I use. So long as the only data about a patch that affects the placement of vertices on an edge is data about the edge itself, those vertices will match between different patches. Each edge computes a tessellation factor, ie the number of segments they wish to be subdivided into. I will refer to these 3 edge factors as TessFactors. D3D also has an inner tessellation factor but in practice that is derived from the edge factors and mostly is just an artifact of the tessellation pattern D3D uses.

That doesn’t explain how these TessFactors are calculated. There are various approaches to that as well. Calculating the length of the undisplaced edge in screen space doesn’t work since that might be zero while displacement causes the patch to face the camera. Other approaches suggest the opposite: only densely tessellate at silhouettes and reduce tessellation where the displacement direction faces the camera and displacing would only change depth. Depth matters for object intersections and shadows though.

Diagsplit samples the displacement function at a few points to estimate its screen space length. Not only is that too expensive, it requires every shader that needs to calculate TessFactors to have access to the displacement function, ie be programmable, which isn’t practical. Also ruled out are artist provided density hints expressed by the material graph for the same reason.

The approach I take is simple, common, and similar to what Nanite already uses to project object space error to screen space. TessFactor for an edge is based solely on its world space undisplaced length which is projected to pixels as if the edge was perpendicular to the view vector. This length in pixels is divided by a global DiceRate setting to convert it to TessFactor. Often DiceRate >1 pixel can save cost with little visual difference. The UE default is 2 pixels.
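A sketch of that projection, with the caveat that `projScale` stands in for the usual projection constant (something like half the screen height times the projection matrix scale) and the exact constant and clamping are assumptions of mine, not UE's code:

```cpp
#include <algorithm>

// TessFactor for one edge: project the undisplaced world-space edge length
// to pixels as if the edge were perpendicular to the view vector, then
// divide by the global DiceRate (UE default: 2 pixels per segment).
float EdgeTessFactor(float worldEdgeLength, float viewDistance,
                     float projScale, float diceRate)
{
    float pixels = worldEdgeLength * projScale / viewDistance;
    return std::max(pixels / diceRate, 1.0f); // never less than 1 segment
}
```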



Uniform density dicing


We have these TessFactors for a patch we want to dice. They represent roughly the length of each edge of the patch. We want to dice that triangular patch into uniformly sized triangles. What is typically called a uniform tessellation is one where a shape is tessellated into many smaller identical shapes similar to the original. The angles are congruent. This is topologically uniform. That is not helpful. What we need is uniform density.
[Figure: three tessellations of the same patch — uniform topology (11, 11, 11), uniform topology snapped to edge TessFactors (11, 9, 6), and uniform density (11, 9, 6).]

I’ll define an optimal uniform density tessellation as one with the minimum number of triangles possible where all edges are <= a chosen length. The length of the longest edge dictates the worst case sampling rate of the displacement function and thus the signal resolution. Achieving this target edge length with any more triangles wouldn’t be optimal. It isn’t important to achieve a perfectly optimal tessellation, but it is useful to understand the target and its properties to understand how to approach it.

Remeshing

Thankfully generating meshes like these, isotropic meshes made of roughly equal length edges forming equilateral triangles with vertices close to valence 6, is a common operation called remeshing and there are many published approaches to do so. Perhaps the most popular is [Botsch and Kobbelt 2004].

The algorithm goes as follows:

For N iterations
  • For all edges
    • If edge is too long then split it
    • If edge is too short then collapse it
    • If edge could be shorter if it was flipped then flip it
  • For all vertices
    • Move position to the average of its neighbors

After many iterations the result will approach an isotropic mesh matching the desired properties. There are more considerations when this is meant to express a particular surface but we are only concerned with remeshing a flat triangular patch. For this case the only consideration needed is to constrain boundary vertices to the patch edge they started on.

Tessellation Table

This remeshing process is far too expensive to do in real-time though. Thankfully since we are working with just the patches themselves and there is a limit on the maximum TessFactor for dicing, every permutation of TessFactors can be precomputed and placed in a lookup table which I will call the Tessellation Table. The TessFactors index into the Tessellation Table. I will call an entry in this table a Tessellation Pattern.
[Figure: example Tessellation Patterns for TessFactors (7, 4, 4), (7, 6, 3), (11, 9, 6), and (14, 14, 11).]
What exactly is stored in the Tessellation Table? For each pattern there is a vertex and index buffer as if it were a little mesh. Instead of positions, the vertex buffer stores barycentric coordinates in the patch. When rendering, the patch can be replaced with this little mesh. The barycentrics are used to interpolate the patch corners. Because each pattern is of variable size there is also an offset table translating the TessFactor index into the buffer offsets for the VB and IB. The barycentrics are stored as 2 16bit coordinates with the 3rd coordinate implied. The indexes are 10bit so all 3 corners of a triangle can be packed into 1 dword.
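The packing described above can be sketched like this. The exact bit layout is my assumption; the post only states 2×16-bit barycentrics with the third coordinate implied, and 3×10-bit indexes per dword.

```cpp
#include <cstdint>

struct BaryUV { float u, v, w; };
struct TriIndices { uint32_t i0, i1, i2; };

// A pattern vertex stores u and v quantized to 16 bits each, with [0, 65534]
// mapping to [0, 1] (even max so 0.5 is exact); w is implied by u + v + w = 1.
BaryUV UnpackPatternVertex(uint32_t packed)
{
    float u = float(packed & 0xffff) / 65534.0f;
    float v = float(packed >> 16) / 65534.0f;
    return { u, v, 1.0f - u - v };
}

// A pattern triangle packs three 10-bit indexes into one dword.
TriIndices UnpackPatternTriangle(uint32_t packed)
{
    return { packed & 1023u, (packed >> 10) & 1023u, (packed >> 20) & 1023u };
}
```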

Tessellation Table redundancy

There is a lot of redundancy in this indexing though. The Tessellation Patterns for (3,4,2), (3,2,4), and (4,3,2), for example, are all the same. They are just rotated or mirrored versions of the same pattern. Instead, a unique index into the table is defined by ordering the TessFactors from largest to smallest.

This reduces the number of patterns stored from $N^3$ to $\binom{N+2}{3}$ or $\frac{N(N+1)(N+2)}{6}$, where $N$ is the max TessFactor the Tessellation Table covers. For N=16 this is the difference between 4096 and 816 or <20% the patterns needing to be stored. Size reduction also reduces cache pollution.
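One way to index only the sorted triples densely is the closed form below. The post doesn't show UE's actual indexing scheme, so this is just a straightforward construction that realizes the $\binom{N+2}{3}$ count.

```cpp
#include <algorithm>

// Canonical index for a table storing only sorted TessFactor triples
// (a >= b >= c >= 1). Sorting collapses rotations/mirrors of the same
// pattern. Triples are enumerated densely: all triples with a smaller max
// come first, then within a max, by median, then by min.
int PatternIndex(int f0, int f1, int f2)
{
    int a = std::max({ f0, f1, f2 });
    int c = std::min({ f0, f1, f2 });
    int b = f0 + f1 + f2 - a - c;        // the median factor
    return (a - 1) * a * (a + 1) / 6     // triples with max < a
         + (b - 1) * b / 2               // max == a, median < b
         + (c - 1);
}

// Total patterns for max TessFactor n: n(n+1)(n+2)/6.
int PatternCount(int n) { return n * (n + 1) * (n + 2) / 6; }
```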

To correctly alias patterns the reordering must also be undone when the pattern is used. First, if the winding flips it needs to be reversed so backface culling is preserved. Second, the barycentrics stored in the table need to be unswizzled so they correctly index the corners of the patch. Alternatively, the patch corners themselves can be swizzled which is often cheaper since it happens at lower frequency.

Tessellation Pattern building

The remeshing algorithm previously explained was written for meshes with Cartesian coordinates but Tessellation Patterns only store barycentric coordinates. Another way to describe this issue is to say the original algorithm assumes extrinsic geometry but Tessellation Patterns are intrinsic geometry. An extrinsic triangle is defined by the position of its corners. It is embedded in a space. An intrinsic triangle is defined by its edge lengths. It doesn’t have any specific position or orientation. A Tessellation Pattern is intrinsic. A pattern exists for each combination of TessFactors. TessFactors are treated as patch edge lengths so the goal for a pattern is to tessellate the patch into triangles with roughly unit length edges.

Intrinsic isn’t a problem for the relaxation step. The average of Cartesian and barycentric coordinates will result in equivalent positions since the math is linear. What extrinsic appears to be needed for is the edge length calculations. Thankfully that is not the case. A good number of geometry calculations can be done with only barycentrics and edge lengths.

For a pair of barycentric points P and Q there is a vector between them $\mathbf{PQ} = Q - P$. Unlike normalized barycentric points where $u+v+w=1$, for normalized barycentric vectors $u+v+w=0$, since $1-1=0$. The squared length of a barycentric vector $\mathbf{PQ}(u,v,w)$ in a triangle with edge lengths $(a,b,c)$ is:
\begin{equation} \lVert \mathbf{PQ} \rVert^2 = -a^2 v w -b^2 w u -c^2 u v \end{equation} Thankfully this also implicitly handles a nonobvious issue. While treating TessFactors as edge lengths is sensible and expresses what we are optimizing for, they aren’t exactly lengths. There are rare cases where TessFactors interpreted as edge lengths express a non-Euclidean triangle, specifically $a>b+c$. If extrinsic geometry were required to calculate edge lengths, it wouldn’t be possible for these patterns. Deriving the extrinsic coordinates would result in divide by zeros and negative sqrts. Working intrinsically instead, using the formula above, no special handling is needed. While the distances lose geometric meaning in these cases, it degrades gracefully.
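The formula is a one-liner; here it is as code to sanity-check it against known triangles. The edge-length convention (edge $a$ opposite the $u$ corner, and so on) is the standard one and is an assumption on my part.

```cpp
// Squared length of a barycentric vector (u, v, w), where u + v + w = 0,
// inside a triangle with intrinsic edge lengths a, b, c. No Cartesian
// embedding of the triangle is needed. In the unit equilateral triangle the
// vector between two corners, e.g. (-1, 1, 0), has squared length 1.
float BaryLengthSq(float u, float v, float w, float a, float b, float c)
{
    return -a * a * v * w - b * b * w * u - c * c * u * v;
}
```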

Barycentric quantization

The last thing to take care of is to make sure boundary vertices bitwise match along an edge with all other patterns with the same TessFactor. If they don’t there will be cracks. That can be left for when the barycentrics are quantized to 16bit. It is important though that this quantization is symmetric.
[Figure: two adjacent triangles rotated 180 degrees from one another, with the shared edge’s coordinates running in opposite directions.]
Consider 2 adjacent triangles rotated 180 degrees from one another. Their shared edge will also share the same corners except swapped. Their coordinates lerp in opposite directions. If this edge is where $w=0$, then for vertices along the edge to match their counterpart $(u,v,w)=(1-v,1-u,w)$. This implies that quantization must be symmetric about 0.5. Whatever direction x rounds needs to be the opposite of what happens to 1-x. If 0.25 rounds down, 0.75 needs to round up, or vice versa.

The most obvious way to quantize to 16b fixed point would be to represent the range [0.0, 1.0] as [0, 65535]. With that all float values can be made to match a reversed counterpart at the boundaries except one: 0.5, the midpoint. This is a point that will be used by any pattern that has an even TessFactor. 0.5 can’t round in the opposite direction as 0.5. It needs to be stored exactly. The easiest fix is to use an even max value, so map 1.0 to 65534.

If each coordinate is quantized separately they might not still sum to 1. To fix this for interior vertices any coordinate could be chosen to be rederived, so we can say it's always w=1-u-v. But boundary vertices need to be symmetric. To do that I quantize the median barycentric and rederive the max barycentric given they are normalized (the min is always zero on the boundary).
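A minimal sketch of a quantizer satisfying the symmetry requirement, assuming round-half-to-even for ties (the post only requires that ties for x and 1-x round in opposite directions; half-to-even is one way to get that, since 16383.5 rounds up to 16384 while 49150.5 rounds down to 49150):

```cpp
#include <cmath>
#include <cstdint>

// Symmetric 16-bit quantization: map [0, 1] onto [0, 65534]. The even max
// puts 0.5 exactly at 32767, and round-half-to-even (the default FP rounding
// mode used by nearbyint) keeps Quantize(x) + Quantize(1 - x) == 65534, so
// boundary vertices bitwise match their flipped counterparts.
uint16_t Quantize(float x)
{
    return uint16_t(std::nearbyint(x * 65534.0));
}
```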



Uniform density splitting


The Tessellation Table can be used for splitting as well. Typically binary splitting is used. There are advantages to using a wider branching factor though. Wider means a shallower tree, less recursion, and thus less traffic to and from a work queue. Perhaps more importantly though it has the potential to generate more uniformly shaped subpatches for the same reason we were interested in a uniformly dense final tessellation. This can reduce the number of subpatches and make each subpatch more likely to be uniform in dimensions and closer to having max DiceFactors, which we'll see matters later.
[Figure comparing the two splitting strategies on patches with TessFactors (7, 6, 3) and (11, 9, 6):]

Diagsplit longest edge splitting vs Uniform splitting

SplitFactors

How to best take advantage of this flexibility? Simplify the question down to 1D and look at a single edge. If an edge has a TessFactor of 32, which is larger than the max TessFactor of 16, what SplitFactor should be used in this step? 32 is a multiple of 16 but not a power of 16, so there is a choice. Clamping to 16 means it will split into 16 subedges, each of which will then have a DiceFactor of 2. For reasons I will get into later it is important to have DiceFactors be as large as possible, or in other words do as much of the tessellation in the dicing phase as possible. So the other option in this example is a SplitFactor of 2 and a DiceFactor of 16.

The same question can be asked at every step of recursive splitting. Is it better to do the smaller factors early? Does the ordering matter besides for the final dicing step I mentioned already? Predicting too far ahead won’t work well since the desired TessFactor in an early step may not be what is chosen by the end because the projection on screen refines with smaller edges. How about the aspect ratio? Would it be better to address aspect ratio early and generate uniform sized subpatches as soon as possible? Unfortunately that isn't possible while still conforming to the limit of TessFactors being determined purely from that edge’s data.

In my tests the best calculation for turning desired TessFactor into the SplitFactor is the following:

SplitFactor = min( TessFactor / MaxDiceFactor, MaxSplitFactor )

This tries to emit subpatches from splitting with maximized DiceFactors but nothing else. Other choices were slower.
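Worked through in code, with MaxDiceFactor = 16 and an assumed MaxSplitFactor of 8 (the post doesn't give the latter, and the ceil-division is also my assumption about how a fractional factor would be handled):

```cpp
#include <algorithm>

// SplitFactor = min(TessFactor / MaxDiceFactor, MaxSplitFactor), biasing
// toward children whose DiceFactors are as large as possible. For a
// TessFactor of 32 this chooses a SplitFactor of 2, so each child edge
// dices at ~16 rather than splitting 16 ways and dicing at 2.
int SplitFactor(int tessFactor, int maxDiceFactor = 16, int maxSplitFactor = 8)
{
    int needed = (tessFactor + maxDiceFactor - 1) / maxDiceFactor; // ceil
    return std::min(needed, maxSplitFactor);
}
```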



Results

Uniform dicing using the Tessellation Table results in 69% of the diced triangles compared to D3D-style uniform topology. Uniform splitting using the Tessellation Table results in 68% of the split patches compared to binary splitting. More uniformly sized triangles also benefit the rasterizer.

I believe this Tessellation Table approach could have wide applicability due to its more optimal density. The first such use has already been out for a while. The Tessellation Table UE builds has already been used with permission outside of UE in https://github.com/nvpro-samples/vk_tessellated_clusters.