Realtime CPU octree splatting

How I wrote a realtime(ish) software octree renderer in C#
 

Introduction

Why would anyone bother with CPU rendering, nowadays?

  • Often, it’s for the sake of challenge in a constrained environment [A1, A2, A3], or simply as a learning experience [A4, A5].
  • In some cases, people just don’t need the extreme speed of modern GPUs for their projects, and would rather use a slower but predictable / tweakable / simple renderer that works anywhere [B1, B2].
  • CPUs have their own strengths [C1, C2], and on modern multi-core processors software rendering can even reach competitive performance [C3, C4].

For me, the primary motivation was curiosity.

Back when the first “Unlimited Detail” video was making rounds, I got really intrigued how they managed to pull that off. To my knowledge, CPUs were simply too slow for anything of that complexity… yet here was a clear demonstration that such a thing was possible.

At the time, many other people were also stumped as to what their algorithm was. Which is kinda weird, because an idea so old and simple (and really obvious, in hindsight), by all rights, shouldn’t have been quite so obscure.

Still, no tutorials or articles on software rendering I’ve seen back then have mentioned that method… and even though I later stumbled upon various demos by other people [D1, D2, D3, D4] that demonstrated comparable results, there was either no source code at all, the code was too esoteric/convoluted, or simply not very efficient.

So, lacking a well-established solution to realtime pointcloud/voxel rendering on CPU, I set out to do my own experiments. A lot of the things I tried didn’t work — but, eventually, I got to the point where my single-threaded renderer could run in 640x480 at realtime-ish framerates. Granted, I’m writing it in C# / Unity, so there’s plenty of room for more low-level optimizations… but even in the theoretically optimal case, it would still be miles behind any GPU-based engine.

That said — if anyone is interested, here’s what I learned or figured out along the way :-)

The basic algorithm

I’ll use Python-like pseudocode for simplicity:

def Render(node):
boundingBox = CalculateScreenSpaceBoundaries(node)

if IsOutOfView(boundingBox): return

projectedSize = max(boundingBox.SizeX, boundingBox.SizeY)

if node.IsLeaf or (projectedSize <= 1):
Draw(boundingBox, node)
return

if IsFullyOccluded(boundingBox): return

for octant in CalculateFrontToBackOrder(node):
if node.HasChild(octant):
Render(node.GetChild(octant))

This gives the O(Npixels) average-case performance scaling, virtually independent of model complexity. Technically, raycasting also scales this way, but splatting avoids a lot of work (and cache misses) by processing each node only once.

Of course, theoretical scaling doesn’t automatically translate into good speed in practice. If you implement this basic algorithm naively, you’ll see realtime framerates only at very small resolutions. So, how do we improve it?

Well, before getting into specifics, we first need to address a couple of important architectural choices.

Ways to split a cuboid

Slicing: the corners of subnodes either coincide with the parent’s corners, or lie at the centroids of the parent’s edges / faces / all vertices. This sort of “slices” each face along its bimedians, hence the samurai :)

Tiling: each subnode is a twice-smaller copy of the parent shape, translated by some offset. If the shape is a perfect parallelepiped, the subnodes will tile it exactly.

They both have their pros and cons. Tiling requires much less calculations (and due to self-similarity, a lot of things can be precalculated), but only works for parallelepiped shapes and requires special care to guarantee the absence of gaps. Slicing is considerably more expensive, but gives correct and gapless results under any amount of deformation.

Since we aim for realtime, we can’t afford the much lower efficiency of the more general method. Of course, this limits us to rendering only parallel projections of cubic volumes and their affine transformations, but hey — at least you can do top-down, side-scrolling and isometric games with that :-)

Octree storage

All nodes are stored in a contiguous array, and each node is a struct consisting of:

  • Address — the index of this node’s first child (subnode) in the array
  • Mask — the bitmask of this node’s non-empty octants (subnodes)
  • Data — the data associated with this node (for example, color)

For this to work, all children of the same parent must be stored in adjacent locations — but, in the general case, this block of adjacent locations may reside anywhere in the array (which can result in a lot of cache misses during the traversal).

Since it’s likely that our renderer will visit many neighboring nodes of the same octree depth (specifically, the one at which nodes are pixel-sized), we can improve data locality by sorting the nodes in their breadth-first traversal order.

There are also a couple of ways we can improve memory usage, if the octree doesn’t need to be modified:

  • Instead of always storing all 8 children of a node (even if some are empty), we can store only the non-empty ones, at the cost of potentially doing some extra lookups during the traversal.
  • We can split the node array into chunks, and store chunks in compressed form. In combination with a virtual addressing / memory paging system, we can unpack on-demand and keep in memory only the chunks we actually need to access.

To be fair, there exists a more memory-efficient approach (at least when it comes to geometry itself), based on directed acyclic graphs [E1, E2]. Its core idea is that octrees typically contain a lot of identical subtrees that can be merged together, which saves quite a bit of memory (although it also reduces data locality). I haven’t tried it yet, but I’m aware of at least one person who has a working implementation.

For simplicity, however, here I’ll be using the basic storage format outlined above.

Nuts and bolts

// nodeX, nodeY: screen coordinates of the node's center
// nodeZ: min Z coordinate of the node's screen-space bounding box
// address: index of this node in the node array
// level: node's depth in the (sub-)octree
void Render(int nodeX, int nodeY, int nodeZ, uint address, int level) {
// Calculate screen-space bounding rectangle
var boundingRect = CalculateScreenSpaceBoundaries(nodeX, nodeY, level);

// Find the part of the rectangle visible within the viewport
var visibleRect = boundingRect.Intersection(Viewport);

// If the rectangle is invisible, stop processing this node
if ((visibleRect.SizeX < 0) | (visibleRect.SizeY < 0)) return;

// Read our node array at the specified node address
var node = Octree[address];

// Find the maximal projected size in pixels
var projectedSize = Math.Max(boundingRect.SizeX, boundingRect.SizeY);

// If this is a leaf node (indicated by an empty children mask)
// or it fits within 1 pixel, draw and be done with it
if ((node.Mask == 0) | (projectedSize < 1)) {
Draw(visibleRect, nodeZ, node);
return;
}

// If the rectangle is fully occluded, stop processing this node
if (IsFullyOccluded(visibleRect, nodeZ)) return;

for (int i = 0; i < 8; i++) {
// queue is a pre-calculated front-to-back sequence of octants,
// packed into the bits of a 32-bit integer
uint octant = (queue >> (i*4)) & 7;

// Skip this octant if the node has no corresponding child
if ((node.Mask & (1 << (int)octant)) == 0) continue;

// deltas contains pre-calculated offset vectors for each subnode
var delta = deltas[octant];

// Calculate subnode position and address
int childX = nodeX + (delta.X >> level);
int childY = nodeY + (delta.Y >> level);
int childZ = nodeZ + (delta.Z >> level);
var childAddress = node.Address + octant;

// Process the subnode
Render(childX, childY, childZ, childAddress, level+1);
}
}

Draw(...) tests the nodeZ value against the depth of each pixel within the visible rectangle, and overwrites the pixel's depth and color if nodeZ is smaller. IsFullyOccluded(...) is very similar, but simply returns false as soon as the depth test passes.

// Pixels is a row-major 2D array of pixels,
// where each row's length is BufferStride
void Draw(Range2D rect, int nodeZ, OctreeNode node) {
for (int y = rect.MinY; y <= rect.MaxY; y++) {
int index = rect.MinX + (y * BufferStride);
for (int x = rect.MinX; x <= rect.MaxX; x++, index++) {
ref var pixel = ref Pixels[index];
if (nodeZ < pixel.Depth) {
pixel.Depth = nodeZ;
pixel.Color24 = node.Data;
}
}
}
}
bool IsFullyOccluded(Range2D rect, int nodeZ) {
for (int y = rect.MinY; y <= rect.MaxY; y++) {
int index = rect.MinX + (y * BufferStride);
for (int x = rect.MinX; x <= rect.MaxX; x++, index++) {
ref var pixel = ref Pixels[index];
if (nodeZ < pixel.Depth) return false;
}
}
return true;
}

In floating-point arithmetic, CalculateScreenSpaceBoundaries(...) would be equivalent to the formula max, min = floor(center ± (extent / pow(2, level) - 0.5)), where extent is the root node's half-size. However, since here we use fixed-point arithmetic, the actual implementation looks like this:

// SubpixelBits is the number of bits in the fractional part
// of the coordinate's fixed-point representation.
// Correspondingly, SubpixelHalf = (1 << SubpixelBits) / 2
 
Range2D CalculateScreenSpaceBoundaries(int nodeX, int nodeY, int level) {
var nodeExtentX = (extentX >> level) - SubpixelHalf;
var nodeExtentY = (extentY >> level) - SubpixelHalf;

return new Range2D {
MinX = (nodeX - nodeExtentX) >> SubpixelBits,
MinY = (nodeY - nodeExtentY) >> SubpixelBits,
MaxX = (nodeX + nodeExtentX) >> SubpixelBits,
MaxY = (nodeY + nodeExtentY) >> SubpixelBits,
};
}

Subtracting half-pixels from each edge is necessary for correct pixel coverage in some cases. In the image below, both the original and the shrinked rectangles overlap the same pixel centers, but floor()-ing the original rectangle’s corners will give wrong results.

And, for completeness’ sake, the Range2D implementation is this:

public struct Range2D {
public int MinX, MinY, MaxX, MaxY;

public int SizeX => MaxX - MinX;
public int SizeY => MaxY - MinY;

public Range2D Intersection(Range2D other) {
return new Range2D {
MinX = (MinX > other.MinX ? MinX : other.MinX),
MinY = (MinY > other.MinY ? MinY : other.MinY),
MaxX = (MaxX < other.MaxX ? MaxX : other.MaxX),
MaxY = (MaxY < other.MaxY ? MaxY : other.MaxY),
};
}
}

Well, that’s basically all the logic in the Render(...) method. However, it depends on a few pre-calculated variables that deserve their own explanation.

The full(er) picture

public class OctreeRenderer {
// Viewport & renderbuffer info
public Range2D Viewport;
public int BufferStride;
public PixelData[] Pixels;

// Model info
public Matrix4x4 Matrix;
public OctreeNode[] Octree;
public uint RootAddress;

public void Render() {
if (!Setup()) return;

Render(startX, startY, startZ, RootAddress, 0);
}

...

One thing that should be noted here is that Matrix is expected to represent an object's coordinate system in renderbuffer space, which ranges from (0, 0, 0) to (Width, Height, Depth). E.g., for a 640x480 buffer with 16-bit depth, this would be (640, 480, 65536).

The non-public part looks like this:

...

private const int SubpixelBits = 16;
private const int SubpixelSize = 1 << SubpixelBits;
private const int SubpixelHalf = SubpixelSize >> 1;

private struct Delta {
public int X, Y, Z;
}

// Subnode offset vectors
private Delta[] deltas = new Delta[8];

// Front-to-back sequence of octants
private uint queue;

// Root node info
private int extentX, extentY, extentZ;
private int startX, startY, startZ;

// The int matrix
private int XX, XY, XZ;
private int YX, YY, YZ;
private int ZX, ZY, ZZ;
private int TX, TY, TZ;

// Render(...) and related methods
...

// Setup() and related methods
...
}

And the Setup() itself:

bool Setup() {
int maxLevel = CalculateMaxLevel();

if (maxLevel < 0) return false;

CalculateIntMatrix(maxLevel);

CalculateRootInfo();

if (startZ < 0) return false;

CalculateDeltas();

CalculateQueue();

return true;
}

CalculateMaxLevel() estimates the depth of octree at which adjacent nodes are guaranteed to be no more than 1 pixel apart.

int CalculateMaxLevel() {
float absXX = Math.Abs(Matrix.M11);
float absXY = Math.Abs(Matrix.M12);
float absYX = Math.Abs(Matrix.M21);
float absYY = Math.Abs(Matrix.M22);
float absZX = Math.Abs(Matrix.M31);
float absZY = Math.Abs(Matrix.M32);

float maxGap = 0;
maxGap = Math.Max(maxGap, absXX + absYX + absZX);
maxGap = Math.Max(maxGap, absXY + absYY + absZY);

// The integer part of our fixed-point representation
// contains (32 - SubpixelBits) bits. But to avoid
// overflows and crashes, we can left-shift at most by
int maxShift = 28 - SubpixelBits;

for (int maxLevel = 0; maxLevel <= maxShift; maxLevel++) {
if (maxGap < (1 << maxLevel)) return maxLevel;
}

return -1; // too big; can't render
}

CalculateIntMatrix(...) calculates the fixed-point representation of the root node's matrix. To guarantee the absence of gaps, all nodes must lie exactly at the vertices of an integer lattice, with the lattice vectors equal to half-axes of a (sub)pixel-sized node.

void CalculateIntMatrix(int maxLevel) {
// Scale x and y components of axes to 1-pixel precision
float levelScale = (float)Math.Pow(2, SubpixelBits - maxLevel);

// Calculate fixed-point representations
XX = (int)(Matrix.M11 * levelScale);
XY = (int)(Matrix.M12 * levelScale);
XZ = (int)(Matrix.M13);
YX = (int)(Matrix.M21 * levelScale);
YY = (int)(Matrix.M22 * levelScale);
YZ = (int)(Matrix.M23);
ZX = (int)(Matrix.M31 * levelScale);
ZY = (int)(Matrix.M32 * levelScale);
ZZ = (int)(Matrix.M33);
TX = (int)(Matrix.M41 * SubpixelSize);
TY = (int)(Matrix.M42 * SubpixelSize);
TZ = (int)(Matrix.M43);

// Halve all components of the axes
XX >>= 1; XY >>= 1; XZ >>= 1;
YX >>= 1; YY >>= 1; YZ >>= 1;
ZX >>= 1; ZY >>= 1; ZZ >>= 1;

// Scale x and y components of axes back to root node level
XX <<= maxLevel; XY <<= maxLevel;
YX <<= maxLevel; YY <<= maxLevel;
ZX <<= maxLevel; ZY <<= maxLevel;
}

CalculateRootInfo() calculates the extent (half-size) and position of the root node. To slightly simplify the rendering logic, Z position is counted from the bounding box minimum instead of the center.

void CalculateRootInfo() {
extentX = (Math.Abs(XX) + Math.Abs(YX) + Math.Abs(ZX)) << 1;
extentY = (Math.Abs(XY) + Math.Abs(YY) + Math.Abs(ZY)) << 1;
extentZ = (Math.Abs(XZ) + Math.Abs(YZ) + Math.Abs(ZZ)) << 1;

startX = TX;
startY = TY;
startZ = TZ - extentZ;
}

CalculateDeltas() calculates the octant offset vectors (from the node's position to the position of each of its subnodes).

void CalculateDeltas() {
int offsetZ = extentZ >> 1;
int octant = 0;
for (int z = -1; z <= 1; z += 2) {
for (int y = -1; y <= 1; y += 2) {
for (int x = -1; x <= 1; x += 2) {
deltas[octant].X = (XX * x + YX * y + ZX * z);
deltas[octant].Y = (XY * x + YY * y + ZY * z);
deltas[octant].Z = (XZ * x + YZ * y + ZZ * z) + offsetZ;
octant++;
}
}
}
}

And, finally, the CalculateQueue(). It uses the helper class OctantOrder, which stores look-up tables of pre-calculated sequences of octants (and some other related things).

void CalculateQueue() {
int axisOrder = OctantOrder.Order(in Matrix); // 6 values
int startingOctant = OctantOrder.Octant(in Matrix); // 3 bits
int nodeMask = 255; // 8 bits

int lookupIndex = axisOrder;
lookupIndex = (lookupIndex << 3) | startingOctant;
lookupIndex = (lookupIndex << 8) | nodeMask;

queue = OctantOrder.SparseQueues[lookupIndex].Octants;
}

The octant sequences in the look-up table are pre-calculated for each combination of axis order (XYZ, XZY, YXZ, YZX, ZXY, ZYX), starting octant (0 .. 7) and node mask.

Axis order defines how the octant axes should be looped over:

# XYZ                 # XZY                 # YXZ                 ...
for x in (0, 1): for x in (0, 1): for y in (0, 1):
for y in (0, 1): for z in (0, 1): for x in (0, 1):
for z in (0, 1): for y in (0, 1): for z in (0, 1):
... ... ...
 
# Regardless of the axis loop order, octant index is calculated as
octant = (x | (y << 1) | (z << 2)) ^ startingOctant

Axis order can be determined from the magnitudes of Z-components of the matrix axes:

if abs(matrix.Xz) > abs(matrix.Yz):
if abs(matrix.Yz) > abs(matrix.Zz):
axisOrder = ZYX
elif abs(matrix.Xz) > abs(matrix.Zz):
axisOrder = YZX
else:
axisOrder = YXZ
else:
if abs(matrix.Xz) > abs(matrix.Zz):
axisOrder = ZXY
elif abs(matrix.Yz) > abs(matrix.Zz):
axisOrder = XZY
else:
axisOrder = XYZ

Starting octant is the first octant in the front-to-back sequence, formally the one with the lowest Z coordinate. However, an easier way to calculate it is by checking how the object’s YZ/XZ/XY planes are oriented relative to the view (essentially, checking the sign of the Z-components of the corresponding planes’ normals):

x = int(matrix.Yy * matrix.Zx <= matrix.Yx * matrix.Zy) # YZ plane
y = int(matrix.Zy * matrix.Xx <= matrix.Zx * matrix.Xy) # XZ plane
z = int(matrix.Xy * matrix.Yx <= matrix.Xx * matrix.Yy) # XY plane
startingOctant = x | (y << 1) | (z << 2)

Node mask is the bitmask of a node’s non-empty octants. Since the naive Render(...) implementation checks subnode existence explicitly, a queue for all 8 children (nodeMask = 255) is used.

Benchmarking

I’m far from being an expert in benchmarking / profiling / optimization (and I have no idea how to properly do rigorous performance tests for such data-dependent algorithms), so here I’m simply going to measure the typical time-per-frame for a particular scene at fixed viewing conditions.

My testing environment:

  • CPU: AMD Ryzen 7 3700X, 3.6 GHz, 8 cores (16 logical processors)
  • L1 data cache: 8 x 32 KB (8-way, 64-byte line size)
  • L1 instruction cache: 8 x 32 KB (8-way, 64-byte line size)
  • L2 cache: 512 KB (8-way, 64-byte line size)
  • L3 cache: 16 MB (16-way, 64-byte line size)
  • RAM: 32 GB, 2.4 GHz
  • OS: Windows 10 64-bit

As of this writing, I have two demo projects that I’ll be testing in the following configurations:

  • OpenTK 4.6.4 under .NET 5: NET5 (Release)
  • Unity 2020.3.11: Mono (Release), IL2CPP (Master)

Theoretically, I could also test in Debug mode, as well as for WebGL and Android platforms, but it’s a lot of extra effort — and we’re interested in the “best fighting chance” performance anyway.

The testing scene consists of a small grid of slightly overlapping octree models, viewed at an empirically chosen angle and zoom (at which I observed the worst performance) and rendered at the 640x480 resolution.

Keep in mind, though, that “worst performance” here refers to the initial implementation, and after some optimizations this exact angle / zoom may no longer indicate the worst-case performance in this scene. However, I’ll keep those settings throughout all tests for consistency.

The octree itself is a voxelized version of the “medieval kind of seaport” model by tokabilitor. It’s pretty heavy, so I didn’t commit it to the repository directly. But if you wish to try the demo yourself, you can download the octree model here (the instructions are inside the archive).

So! Without further ado, here are the first results (in milliseconds per frame):

NET5: 83, Mono: 143, IL2CPP: 74

As expected, pretty slow. Out of curiosity, let’s also check the effect of different octree sorting methods:

  • Breadth-first: NET5: 83, Mono: 143, IL2CPP: 74
  • Depth-first: NET5: 110, Mono: 169, IL2CPP: 100
  • Random: NET5: 121, Mono: 182, IL2CPP: 114

In this particular case, depth-first is about 25 ms slower, and random is about 40 ms slower. In practice, these differences can be quite situation-dependent, but this demonstrates the general principle that breadth-first sorting gives the best overall worst-case.

General optimizations

#1: Call it quits

Adding [MethodImpl(MethodImplOptions.AggressiveInlining)] to the methods used in Render(...) gives us this:

NET5: 83 → 67, Mono: 143 → 132, IL2CPP: 74 → 73

Basically no effect on IL2CPP, but an observable improvement for NET5/Mono. Can we do better?

I manually inlined everything in Render(...), except for the recursion:

NET5: 83 → 67, Mono: 143 → 118, IL2CPP: 74 → 70

Same result for NET5, but better results for Mono and IL2CPP. Seems to be worth it :-)

#2: Let’s get unsafe-rous

First, I tried passing the pointers as extra arguments to the Render(...) method:

NET5: 67 → 68, Mono: 118 → 120, IL2CPP: 70 → 75

Hmm, this made things worse. Passing extra arguments around appears to be kind of expensive.

For an alternative approach, I moved the Render(...) method to an unsafe struct:

NET5: 67 → 64, Mono: 118 → 116, IL2CPP: 70 → 71

That’s closer to what I expected :-)

Next, instead of copying array elements into local variables, let’s use references to those elements. In principle, C# 7.0 allows to do this without resorting to pointers (and I even used ref vars in the drawing/culling loops), but oh well :-)

NET5: 64 → 63, Mono: 116 → 106, IL2CPP: 71 → 65

This definitely helped.

(Out of curiosity, I also tried to rewrite the pixel-processing loops using pointers instead of the (x, y) coordinates, but it only complicated the code without any performance benefit.)

#3: Stacking up

NET5: 63 → 62, Mono: 106 → 98, IL2CPP: 65 → 68

Not a consistent improvement (IL2CPP got slightly worse), but this approach will help us in the long run.

Speaking of stacks: now that we use pointers, we can also try to put all internal renderer state on the program execution stack. Here’s what stackalloc'ing the node stack and deltas does:

NET5: 62 → 63, Mono: 98 → 97, IL2CPP: 68 → 65

Well, at least IL2CPP is back to its previous result.

#4: Shooting the stars

NET5: 63 → 62, Mono: 97 → 98, IL2CPP: 65 → 66

Though this didn’t make the matters worse either (all changes are within the margin of error). So, eh. Let’s keep it for now :-)

#5: Parsimony

NET5: 62 → 55, Mono: 98 → 74, IL2CPP: 66 → 55

= = = = = = = = = =

Well, that’s about all for the general-purpose optimizations (i.e., which don’t introduce any changes at the conceptual level). Our renderer has gotten quite a bit faster, but it still remains the same naive algorithm at heart. To reach even better performance, it will have to get smarter ;-)

Algorithmic tricks

#1: Pixels separately

NET5: 55 → 50, Mono: 74 → 67, IL2CPP: 55 → 50

#2: No checks for the masked

NET5: 50 → 45, Mono: 67 → 62, IL2CPP: 50 → 45

#3: Saving the progress

NET5: 45 → 48, Mono: 62 → 62, IL2CPP: 45 → 46

On its own, this doesn’t provide any improvement. However, storing the bounding rectangle along with other node information allows us to skip the renderbuffer rows that have already been found to be occluded:

NET5: 48 → 43, Mono: 62 → 58, IL2CPP: 46 → 42

#4: Room for stencil

In fact, we don’t actually need a separate stencil buffer, since (by construction) all depth values in our renderer are ≥ 0 (0 at near plane, max depth value at far plane), so the sign bit can be used for stencil data. And in this case our depth test will also automatically work as a stencil test:

# Since node.Z is always >= 0, this test will never pass
# if the sign (stencil) bit in pixel.Depth is set
if node.Z < pixel.Depth:

So, when writing depth values, we just bitwise OR the depth value with a sign bit. The only tricky part here is that we need to clear the stencil bits before we render another object. The easiest way would be to just clear the whole bounding rectangle, but this is quite inefficient. A better way (as far as my experiments indicated) seems to be storing the indices of each overwritten pixel, and then clearing the stencil bits of only those pixels.

NET5: 43 → 43, Mono: 58 → 56, IL2CPP: 42 → 42

Well… stencil test didn’t really help in this particular case, but since these things are highly data-dependent, it may still come in handy in other situations.

Side note: out of curiosity, I also experimented with an explicit stencil buffer, since it allows to potentially test multiple pixels at the same time. However, in practice it created more overhead than any gains from multi-pixel testing could compensate for. Either way, according to my experiments, even completely skipping the known-to-be-occluded / out-of-view nodes and not doing occlusion test for the known-to-be-visible nodes has a rather small impact… So it’s no surprise that “multi-pixel” stencil test optimization couldn’t provide us with any noticeable benefit.

#5: XY marks the spot

Well, yes! We can pre-calculate a small 2D map of bitmasks (or, in my case, two 1D maps for the X and Y axes), where each bit indicates an intersection of the map cell with the corresponding octant. Then, when reaching a node with pixel-sized subnodes (i.e., a node that’s 2x2 pixels in size), we just do this:

for y in range(yMin, yMax):
mapY = ... # convert y from screen to map coordinates
for x in range(xMin, xMax):
mapX = ... # convert x from screen to map coordinates

# This gets us all octants that possibly intersect this pixel
mask = maskMap[mapX, mapY] & node.Mask
if mask == 0: continue

# Get the first octant in the front-to-back sequence
octant = ForwardQueues[mask] & 7

... # Do the depth test, draw the octant, etc.

And yes, it is as awesome as it sounds:

NET5: 43 → 33, Mono: 56 → 49, IL2CPP: 42 → 34

= = = = = = = = = =

Whew! This was quite a ride (at least for me). With all our optimizations applied, we went from kinda-interactive to near-realtime, gaining 2x-3x speed boost compared to the initial implementation.

It’s the end of the road for the renderer itself, unfortunately, and (unless I missed some additional optimization tricks) it can’t be further improved within the current C#/Unity constraints.

However, it’s still possible to squeeze some extra FPS out of our program ;-) But for that, we’ll have to resort to…

Cheating

#1: Cutting corners

To illustrate the problem, let’s consider a toy 1D case. A 4-pixel node divides into two 2-pixel nodes. A 3-pixel node also divides into two nodes (1-pixel and 2-pixel, correspondingly) — but a lot more 3-pixel nodes can fit on the same screen, compared to the 4-pixel ones.

Well, what if we use octant map for 3x3-pixel nodes as well?

NET5: 33 → 16, Mono: 49 → 27, IL2CPP: 34 → 17

Pretty compelling! And most of the time, it looks OK too. But beware of occasional unsightly artifacts:

As a compromise, we can try to use a "higher fidelity" map at 3x3 size (e.g., a map containing 64-bit masks, tested against the combined masks of the node's children). This reduces the above-mentioned artifacts, though obviously with a smaller performance gain:

NET5: 33 → 23, Mono: 49 → 37, IL2CPP: 34 → 21

#2: Job quarter done

In my implementation, the scene is rendered with an appropriate jitter offset to a half-resolution renderbuffer, from which then pixel data is copied to the corresponding positions in the full-resolution color buffer. Also, since we need to sample octrees at resolution higher than render target’s, I added a special case for using octant map at 1x1-pixel nodes.

Here’s the outcome:

NET5: 33 → 14, Mono: 49 → 19, IL2CPP: 34 → 14

That’s even faster than the 3x3 map trick :) But how does it look?

Perfect when everything is still, but kinda noisy in motion. In games, this is typically mitigated via motion blur, but calculating motion vectors on CPU would be prohibitive. However, this technique already works kind of like a poor man’s motion blur (and its noisy nature isn’t particularly glaring at high framerates), so one could almost say it’s a feature ;-)

#3: Brute force

Splitting the screen into horizontal slices (one for each rendering thread) resulted in this:

I guess if you have some processors to spare, even higher resolutions could be viable :D

= = = = = = = = = =

Well, at this point we’ve basically did all we could in our quest for speed. That journey is now over… so if you were only looking for performance optimization tricks, I guess there’s not much else left for you to read here.

However, we are still not quite done with our renderer, feature-wise. In the spirit of curiosity, let’s see (or at least ponder) what other things can be bolted on top ;-)

Bells and whistles

Level cap

Splatting the atoms

Just in case, I also implemented an option to draw nodes as points (or, rather, at fixed size in pixels), regardless of their actual projected size. Perhaps it’s also useful in some situations.

Cubical precision

I also considered an alternative of rasterizing a voxel’s hexagonal projection, but then it would have flat depth — and that would make object intersections look weird.

Going round

Trying to draw actual spheres with proper depth (or, for that matter, any other non-cubical volume) isn’t particularly useful in our case, however, since they won’t “3D intersect” other spherical nodes of the same octree anyway due to the stencil test.

Crossfading data

Normal circumstances

  1. Calculate normals from the depth buffer. This might look okay in some cases, but is prone to artifacts and will be useless for nodes larger than a pixel.
  2. Calculate normals during the octree splatting. This means an extra matrix multiplication for each written pixel, but requires access only to the local data.
  3. Calculate normals after the octree splatting. This avoids matrix multiplications for pixels that get overwritten, but requires storing object ID and the raw normal data (or, alternatively, the node address) in the renderbuffer.

I don’t currently have any octree datasets with normals, so I don’t really know which of the last two options would be better.

Besides, lighting isn’t that much useful without shadows, and rendering the scene from yet another view (for a dynamic shadowmap) will definitely tank FPS, at least for a single-threaded renderer. In principle, a shadowmap could be made using meshes and GPU… but then it kinda defeats the point of a CPU voxel renderer? Alternatively, we could precompute shadowmap(s) for the scene (or a shadow volume octree), but then it obviously would only work for static environments.

So… I guess I’ll just leave it as a food for thought, or maybe for some future investigations.

= = = = = = = = = =

There may be some other ideas I could speculate about here (texture mapping, custom “shaders”, etc.), but none of them are something I’ve been seriously thinking about. All in all, there isn’t that much else left…

Except for one last important bit.

Return of the samurai

Both the slicing and the tiling methods could be used for that, but tiling may suffer from accumulation of floating-point errors after some number of subdivisions. Slicing possesses no such weakness, hi-ya! So, let’s sharpen our katanas ;)

Since we’ll be dealing with explicit cube vertices now, first we need a way to calculate the model transformation matrix from them. In my implementation, I just average two opposing corners and their corresponding half-edges to get the translation vector and the X/Y/Z axes.

Onto the subdivision itself: here we need a recursive traversal, so I made a helper class CageSubdivider with a stack of 3x3x3 grids / octant queues / some other data. After it is provided with 8 starting cube vertices and a callback (that tells whether to subdivide a given octant), it:

  1. Copies the cube vertices to the corners of a grid in the stack;
  2. Sorts the octant order/queue by the Z coordinate of the corresponding corners;
  3. Calculates (via averaging) the remaining grid vertices;
  4. Until the stack is exhausted:
  5. For each octant, passes the corresponding grid vertices to the callback;
  6. If the callback returned a “should subdivide further” value, adds a new grid to the stack and performs the first 3 steps, using the 8 vertices of the octant.

The callback, in its turn, checks for a bunch of conditions (whether the octant overlaps the viewport, whether it’s too big to render or intersects the near plane, whether it’s a leaf node…) to decide if a subtree can be rendered, or the octant should be subdivided further.

And… presto! Now we can zoom in much further than before:

But not just that. With slicing, we now have access to deformations and perspective!

(The character model is a voxelized version of “Samurai Mayumi” by SurplusBun. If you wish to test it yourself, the corresponding octree can be downloaded here.)

All that’s needed is an additional check in the subdivision callback. The more you subdivide a deformed cube, the closer to a parallelepiped each sub-cube becomes — so we just need to check whether the cube’s edges are parallel enough. The amount of parallelism (or, alternatively, distortion) can be estimated, for example, from the difference between the cube’s opposite edge vectors.

The final subdivision logic ended up looking something like this:

if node does not overlap the viewport:
skip
else if node intersects the near plane:
if it's a leaf node:
skip
else:
subdivide
else if node is too big:
subdivide
else if node is too distorted:
if using cube shape or node is non-leaf:
subdivide
else:
render
else:
render

However, since “parallel enough” is not “exactly parallel”, the neighboring subtrees will never quite match on the edges, which occasionally manifests as “seams”:

I tried dilating the subtrees by their amount of distortion, but this didn’t solve the problem completely. My suspicion is that, while the holes between the subtrees are covered, the dilated black voxels on the inside of the buildings still peek through the outer walls where the seams meet. And since dilated dense models are slower to render, in the end I disabled this effect.

A more promising approach would probably be a post-processing hole-filling filter, similar to what is used in point cloud rendering [F1, F2, F3]. But these sorts of things are much better suited for a shader, which is kinda beyond the scope of this project.

If you decide to take a look at my code, keep in mind that handling subdivisions is not very optimized there. But since I added support for deformations mostly for demonstration purposes, I didn't particularly bother with that.

Vs. Animation

Of course, rigid (non-deforming) and flip-book (each frame is a separate octree) animations are trivial for voxel models. But skeletal animation is trickier, as the usual approaches aren’t directly applicable.

Surprisingly, there doesn’t seem to be that much material on the subject of combining skinning with octrees. The only notable exception is probably Dennis Bautembach’s “Animated Sparse Voxel Octrees”, popularly explained in this short series of videos. His technique (implemented in CUDA) essentially traverses the octree to a LOD level sufficient for display, and then transforms each voxel at that level as if it was a regular mesh vertex. Naturally, this can lead to holes between the transformed voxels, which he counters by enlarging each voxel by some amount. Dennis also considers transforming the 8 corners of a voxel instead of just the center (in order to eliminate holes), but he notes that this is too computationally expensive. In the end, he concludes that this technique is mostly suitable for moderate deformations.

Obviously, we can’t afford anything of this sort in our CPU renderer. However, there exists another (and pretty popular!) animation technique that’s perfectly compatible with octrees: cage deformations! Or, at least, their linear variant.

In essence, this is a direct extension of textured triangles / quads to 3D. Texture (2D array of pixels) becomes 3D array of voxels, and triangles / quads become tetrahedrons and cuboids, respectively. So if we split a high-detail model into a collection of connected cubical / tetrahedral volumes (which together form a cage) and assign bone weights to their vertices, the cage itself may be animated using any of the traditional techniques. As a result, we get a skeletally animated voxel/point-cloud model (composed of many small octrees) at a much lower price :)

Interestingly, back in 2019 Atomontage engine released a teaser video with a (what I assume) skeletally animated voxel dragon swinging its head left and right. I have no idea what approach Atomontage uses there, but I wouldn’t be surprised if it’s something along the same lines.

Afterword

For me, it was. I still don’t have the slightest idea what exactly gave 2010’s Unlimited Detail its claimed performance (30 FPS at 1024x768 on a budget laptop), but at this point I’m pretty sure it’s more down to SIMD and low-level optimizations than some genius algorithmic tricks. At long last, I can consider my curiosity satisfied — and, having published a detailed overview of my findings, now can finally put this quest to rest with a peace of mind.

Perhaps some of this information will come up useful even for you too, if you decide to write your own renderer or improve upon mine. And if not — well, I hope at least this was an entertaining read :-)

Good luck in your endeavors!

dairin0d, 2021

 

P.S.

Comparing notes with PND3D, Atomontage and pabloreda's renderer

  1. Predicting which 4 vertices contribute to the bounding rectangle
  2. Splitting the screen into tiles (for multithreading)
  3. Skipping occlusion test for nodes larger than 32 pixels
  4. Using a small stencil buffer and testing an entire row for occlusion
  5. Rasterizing leaf voxels as convex hexagons and raycasting cube faces on GPU
  6. Using SIMD instructions (and, I assume, a lot of optimized assembly)

Indisputably, PND3D is very fast (though it would be cool to make an apples-to-apples comparison with my renderer… but alas, not with my skills or free time). However, perhaps ironically, none of the tricks mentioned by Ken ended up useful in my case:

  1. Since I wanted to support deformations, predicting bounding rectangle vertices is impossible in the general case.
  2. This would be genuinely useful for multithreading, but my focus here was on the single-threaded performance.
  3. A long time ago I tried something similar, but it didn’t bring any improvement, or maybe even was slower.
  4. As I mentioned earlier, in my experiments the overhead of multi-pixel stencil testing only made things worse. Though I haven’t looked at Ken’s code… maybe he found a way to do it very efficiently?
  5. I had no intention of supporting textured voxels, so this is simply not applicable in my case.
  6. .NET framework has SIMD support since some version, but it doesn’t work in Unity’s fork of Mono. Since I wanted to run my renderer both in .NET and Unity, I had to refrain from using it. And as for assembly-level optimizations… not in the CLR land ;-)

As for Atomontage, it’s not so much comparing notes as just one interesting bit of information I’ve stumbled across. It seems to be using raycasting (at least according to the title of this 2016 video), but in 2010 Branislav mentioned (in comments) that “AE is still mostly CPU-based” and “we had the necessary hardware some 7–10y(!!!) ago already”. I can’t even begin to comprehend what sort of crazy raycasting algorithm he must have invented that could do Atomontage-quality realtime raycasting on the 2000s-era CPUs… Or maybe I’m just reading it the wrong way. But hey, kudos to the man if that’s actually true! Perhaps he’s the one who will finally prove the impracticality of CPU renderers wrong :-)

Edit (December 2021): Well, an early version of Atomontage got finally released, and at least in the web demo it doesn't use any sort of CPU rendering (in fact, some people suspect it might not even use raycasting, and simply relies on traditional polygon rendering). Perhaps that's what Branislav meant when he mentioned 2000s-era hardware..?

Edit (December 2021): A few months after I published these results, I stumbled upon a series of blog posts written in 2020 by Pablo Reda, where he outlines the design of his software octree raycaster. Though our approaches obviously differ in some ways, it seems we ended up using quite a number of the same core ideas. I guess it goes to show that these sorts of optimizations are pretty fundamental, and many programmers would eventually come to them after thinking long enough :-)

Other things I tried

▸ Classic octree raycasting

▸ Stochastic point cloud splatting with “reprojection”

▸ “Chain-like” point splatting

▸ Wave surfing (“heightmap” raycasting)

I made an attempt to do something similar, but didn’t go very far in that direction. Under certain limitations, this could be quite fast, but in general it just doesn’t scale as well as tree-based methods.

▸ Splatting-like “raycasting”

Of course, for more than one pixel this results in a lot of repeated work (not to mention a lot more cache misses), so this was about an order of magnitude slower than splatting. Potentially, some work could be shared via beam tracing… but occlusion-culled splatting is kind of like extreme version of beam tracing anyways, and, as you’ve seen above, I later used the map idea to speed up the splatting of 2x2 pixel nodes.

▸ Traverse-order caching

However, while this solution improved the average performance in my test scenes, it was too impractical to be really useful: it required a lot of extra memory to store just one buffer with the traverse-order copies of nodes (and for each additional view / camera, the memory requirements would be correspondingly multiplied), and in the worst-case scenario (when no nodes from the previous frame can be reused) it would actually be slower than no caching at all. So when I switched to breadth-first storage, I was quite happy to sacrifice a bit of average-case performance for not having to deal with that memory-hungry maintenance nighmare :-)

The end!

Comments