## Smooth Voxel Terrain (Part 2)

Last time we formulated the problem of isosurface extraction and discussed some general approaches at a high level.  Today, we’re going to get very specific and look at meshing in particular.

For the sake of concreteness, let us suppose that we have approximated our potential field $f$ by sampling it onto a cubical grid at some fixed resolution.  To get intermediate values, we’ll just interpolate between grid points using the standard trilinear interpolation.  This is like a $C^0$ generalization of Minecraft-style voxel surfaces.  Our goal in this article is to figure out how to extract a mesh of the implicit surface (or zero-crossings of $f$).  In particular, we’re going to look at three different approaches to this problem:

## Marching Cubes

By far the most famous method for extracting isosurfaces is the marching cubes algorithm.  In fact, it is so popular that the term ‘marching cubes’ is even more popular than the term ‘isosurface’ (at least according to Google)!   It’s quite a feat when an algorithm becomes more popular than the problem which it solves!  The history behind this method is very interesting.  It was originally published back in SIGGRAPH 87, and then summarily patented by Lorensen and Cline.  This fact caused a lot of outrage, and has been widely cited as one of the classic examples of patents hampering innovation.  Fortunately, the patent on marching cubes expired back in 2005, and so today you can freely use this algorithm in the US with no fear of litigation.

Much of the popularity of marching cubes today is due in no small part to a famous article written by Paul Bourke.  Back in 1994 he made a webpage called “Polygonizing a Scalar Field”, which presented a short, self-contained reference implementation of marching cubes (derived from some earlier work by Cory Gene Bloyd.)  That tiny snippet of a C program is possibly the most copy-pasted code of all time.  I have seen some variation of Bloyd/Bourke’s code in every implementation of marching cubes that I’ve ever looked at, without exception.  There are at least a couple of reasons for this:

1. Paul Bourke’s exposition is really good.  Even today, with many articles and tutorials written on the technique, none of them seem to explain it quite as well.  (And I don’t have any delusions that I will do any better!)
2. Also, the Bloyd/Bourke implementation is very small and fast.  It uses some clever tricks, like a precalculated edge table, to speed up vertex generation.  It is difficult to think of any non-trivial way to improve upon it.
3. Finally, marching cubes is incredibly difficult to code from scratch.

This last point needs some explaining.  Conceptually, marching cubes is rather simple.  What it does is sample the implicit function along a grid, and then check the sign of the potential function at each point (either +/-).  Then, for every edge of the cube with a sign change, it finds the point where this edge crosses the surface and adds a vertex (this is just like ray casting a bunch of tiny little segments between each pair of grid points).  The hard part is figuring out how to stitch some surface between these intersection points.  Up to the position of the zero crossings, there are $2^8 = 256$ different possibilities, each of which is determined by the sign of the function at the 8 vertices of the cube:

Some of the marching cubes special cases.  (c) Wikipedia, created by Jean-Marie Favreau.

Even worse, some of these cases are ambiguous!  The only way to resolve this is to somewhat arbitrarily break the symmetry of the table based on a case-by-case analysis. What a mess!  Fortunately, if you just download Bloyd/Bourke’s code, then you don’t have to worry about any of this and everything will just work.  No wonder it gets used so much!
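For what it’s worth, the one step of marching cubes that is easy to write from scratch is the vertex placement along each sign-changing edge.  A minimal sketch of that step (the function name and argument layout here are my own, not Bourke’s):

```javascript
// Given the potential values fa, fb at the two endpoints a, b of a grid
// edge (fa and fb must have opposite signs), estimate where the zero
// crossing lies by linear interpolation.  This is essentially the same
// per-edge rule that the Bloyd/Bourke code uses.
function edgeCrossing(a, b, fa, fb) {
  var t = fa / (fa - fb);   // fraction of the way from a to b
  return [
    a[0] + t * (b[0] - a[0]),
    a[1] + t * (b[1] - a[1]),
    a[2] + t * (b[2] - a[2])
  ];
}
```

For example, `edgeCrossing([0,0,0], [1,0,0], -1, 1)` returns the midpoint `[0.5, 0, 0]`.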

## Marching Tetrahedra

Both the importance of isosurface extraction and the perceived shortcomings of marching cubes motivated the search for alternatives.  One of the most popular is marching tetrahedra, introduced by Doi and Koide.  Besides the historical advantage that marching tetrahedra was not patented, it does have a few technical benefits:

1. Marching tetrahedra does not have ambiguous topology, unlike marching cubes.  As a result, surfaces produced by marching tetrahedra are always manifold.
2. The amount of geometry generated per tetrahedron is much smaller, which might make it more suitable for use in, say, a geometry shader.
3. Finally, marching tetrahedra has only $2^4 = 16$ cases, a number which can be further reduced to just 3 special cases by symmetry considerations.  This is enough that you can work them out by hand.

Exercise:  Try working out the cases for marching tetrahedra yourself.  (It is really not bad.)

The general idea behind marching tetrahedra is the same as marching cubes, only it uses a tetrahedral subdivision.  Again, the standard reference for practical implementation is Paul Bourke (same page as before, just scroll down a bit.)  While there is a lot to like about marching tetrahedra, it does have some drawbacks.  In particular, the meshes you get from marching tetrahedra are typically about 4x larger than marching cubes.  This makes both the algorithm and rendering about 4x slower.  If your main consideration is performance, you may be better off using a cubical method.  On the other hand, if you really need a manifold mesh, then marching tetrahedra could be a good option.  The other nice thing is that if you are obstinate and like to code everything yourself, then marching tetrahedra may be easier since there aren’t too many cases to check.

## The Primal/Dual Classification

By now, both marching cubes and marching tetrahedra are quite old.  However, research into isosurface extraction hardly stopped in the 1980s.  In the intervening years, many new techniques have been developed.  One general class of methods which has proven very effective is the so-called ‘dual’ schemes.  The first dual method, surface nets, was proposed by Sarah Frisken Gibson in 1999:

S.F. Gibson, (1999) “Constrained Elastic Surface Nets”  Mitsubishi Electric Research Labs, Technical Report.

The main distinction between dual and primal methods (like marching cubes) is the way they generate surface topology.  In both algorithms, we start with the same input: a volumetric mesh determined by our samples, which I shall take the liberty of calling a sample complex for lack of a better term.  If you’ve never heard the term ‘cell complex’ before, you can think of it as an n-dimensional generalization of a triangular mesh, where the ‘cells’ (or facets) don’t have to be simplices.

In the sample complex, vertices (or 0-cells) correspond to the sample points; edges (1-cells) correspond to pairs of nearby samples; faces (2-cells) bound edges and so on:

Here is an illustration of such a complex.  I’ve drawn the vertices where the potential function is negative in black, and the ones where it is positive in white.

Both primal and dual methods walk over the sample complex, looking for those cells which cross the 0-level of the potential function.  In the above illustration, this would include the following faces:

### Primal Methods

Primal methods, like marching cubes, try to turn the cells crossing the boundary into an isosurface using the following recipe:

• Edges crossing the boundary become vertices in the isosurface mesh.
• Faces crossing the boundary become edges in the isosurface mesh.
• n-cells crossing the boundary become (n-1)-cells in the isosurface mesh.

One way to construct a primal mesh for our sample complex would be the following:

This is pretty nice because it is easy to find intersection points along edges.  Of course, there is some topological ambiguity in this construction.  For non-simplicial cells crossing the boundary it is not always clear how you would glue the cells together:

As we have seen, these ambiguities lead to exponentially many special cases, and are generally a huge pain to deal with.

### Dual Methods

Dual methods on the other hand use a very different topology for the surface mesh.  Like primal methods, they only consider the cells which intersect the boundary, but the rule they use to construct surface cells is very different:

• For every edge crossing the boundary, create an (n-1) cell.  (Face in 3D)
• For every face crossing the boundary, create an (n-2) cell. (Edge in 3D)
• For every d-dimensional cell, create an (n-d) cell.
• For every n-cell, create a vertex.

This creates a much simpler topological structure:

The nice thing about this construction is that unlike primal methods, the topology of the dual isosurface mesh is completely determined by the sample complex (so there are no ambiguities).  The disadvantage is that you may sometimes get non-manifold vertices:

## Make Your Own Dual Scheme

To create your own dual method, you just have to specify two things:

1. A sample complex.
2. And a rule to assign vertices to every n-cell intersecting the boundary.

The second item is the tricky part, and much of the research into dual methods has focused on exploring the possibilities.  It is interesting to note that this is the opposite of primal methods, where finding vertices was pretty easy, but gluing them together consistently turned out to be quite hard.

### Surface Nets

Here’s a neat puzzle: what happens if we apply the dual recipe to a regular, cubical grid (like we did in marching cubes)?  Well, it turns out that you get the same boxy, cubical meshes that you’d make in a Minecraft game (topologically speaking)!

Left: A dual mesh with vertex positions snapped to integer coordinates.  Right: A dual mesh with smoothed vertex positions.

So if you know how to generate Minecraft meshes, then you already know how to make smooth shapes!  All you have to do is squish your vertices down onto the isosurface somehow.  How cool is that?

This technique is called “surface nets” (remember when we mentioned them before?)  Of course the trick is to figure out where you place the vertices.  In Gibson’s original paper, she formulated the process of vertex placement as a type of global energy minimization and applied it to arbitrary smooth functions.  Starting with some initial guess for the point on the surface (usually just the center of the box), her idea is to perturb it (using gradient descent) until it eventually hits the surface somewhere.  She also adds a spring energy term to keep the surface nice and globally smooth.  While this idea sounds pretty good in theory, in practice it can be a bit slow, and getting the balance between the energy terms just right is not always so easy.
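To make that description a little more concrete, here is a rough sketch of what a single relaxation step might look like.  To be clear, this is my own toy illustration of the idea (a Newton-style step toward the surface, plus a neighbor-averaging spring term), not Gibson’s actual formulation:

```javascript
// One toy relaxation step: move a vertex toward the zero set of f along
// the gradient, plus a "spring" term pulling it toward the average of
// its neighbors.  Assumes gradF(v) is nonzero near the surface.
function relaxVertex(v, neighbors, f, gradF, springWeight) {
  var g = gradF(v);
  var g2 = g[0] * g[0] + g[1] * g[1] + g[2] * g[2];
  var step = f(v) / g2;               // Newton-style step toward f = 0
  var avg = [0, 0, 0];                // centroid of the neighboring vertices
  for (var i = 0; i < neighbors.length; ++i) {
    avg[0] += neighbors[i][0];
    avg[1] += neighbors[i][1];
    avg[2] += neighbors[i][2];
  }
  var s = 1.0 / neighbors.length;
  return [
    v[0] - step * g[0] + springWeight * (avg[0] * s - v[0]),
    v[1] - step * g[1] + springWeight * (avg[1] * s - v[1]),
    v[2] - step * g[2] + springWeight * (avg[2] * s - v[2])
  ];
}
```

Getting `springWeight` right is exactly the balancing act described above: too much and the mesh shrinks away from the surface, too little and it stays bumpy.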

### Naive Surface Nets

Of course we can often do much better if we make a few assumptions about our functions.  Remember how I said at the beginning that we were going to suppose that we approximated $f$ by trilinear filtering?  Well, we can exploit this fact to derive an optimal placement of the vertex in each cell — without having to do any iterative root finding!  In fact, if we expand out the definition of a trilinearly filtered function, then we can see that the 0-set is always a hyperboloid.  This suggests that if we are looking for a 0-crossing, then a good candidate would be to just pick the vertex of the hyperboloid.
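To spell out what “expand out the definition” means, the trilinearly filtered reconstruction of $f$ inside one cell looks like this in code (a sketch; the corner indexing convention is my own):

```javascript
// Trilinear interpolation of the 8 corner samples c[i][j][k]
// (i, j, k in {0, 1}) at fractional coordinates (x, y, z) in the unit
// cube.  Multiplying this expression out gives a polynomial with an
// xyz term, whose 0-set is the hyperboloid-like surface mentioned above.
function trilerp(c, x, y, z) {
  function lerp(a, b, t) { return a + (b - a) * t; }
  var c00 = lerp(c[0][0][0], c[1][0][0], x);   // interpolate along x
  var c10 = lerp(c[0][1][0], c[1][1][0], x);
  var c01 = lerp(c[0][0][1], c[1][0][1], x);
  var c11 = lerp(c[0][1][1], c[1][1][1], x);
  var c0 = lerp(c00, c10, y);                  // then along y
  var c1 = lerp(c01, c11, y);
  return lerp(c0, c1, z);                      // then along z
}
```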

Unfortunately, calculating this can be a bit of a pain, so let’s do something even simpler: Rather than finding the optimal vertex, let’s just compute the edge crossings (like we did in marching cubes) and then take their center of mass as the vertex for each cube.  Surprisingly, this works pretty well, and the mesh you get from this process looks similar to marching cubes, only with fewer vertices and faces.  Here is a side-by-side comparison:

Left: Marching cubes.  Right: Naive surface nets.

Another advantage of this method is that it is really easy to code (just like the naive/culling algorithm for generating Minecraft meshes.)  I’ve not seen this technique published or mentioned before (probably because it is too trivial), but I have no doubt someone else has already thought of it.  Perhaps one of you readers knows a citation or some place where it is being used in practice?  Anyway, feel free to steal this idea or use it in your own projects.  I’ve also got a javascript implementation that you can take a look at.
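In case it helps, the vertex rule itself is only a few lines.  Here is a sketch, assuming the `crossings` list for a cell was computed from its edges exactly as in marching cubes:

```javascript
// Place the surface nets vertex for one cell at the center of mass of
// the edge crossings found on that cell's 12 edges.
function cellVertex(crossings) {
  var v = [0, 0, 0];
  for (var i = 0; i < crossings.length; ++i) {
    v[0] += crossings[i][0];
    v[1] += crossings[i][1];
    v[2] += crossings[i][2];
  }
  var s = 1.0 / crossings.length;
  return [v[0] * s, v[1] * s, v[2] * s];
}
```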

### Dual Contouring

Say you aren’t happy with a mesh that is bevelled.  Maybe you want sharp features in your surface, or maybe you just want some more rigorous way to place vertices.  Well my friend, then you should take a look at dual contouring:

T. Ju, F. Losasso, S. Schaefer, and J. Warren.  (2004)  “Dual Contouring of Hermite Data”  SIGGRAPH 2004

Dual contouring is a very clever solution to the problem of where to place vertices within a dual mesh.  However, it makes a very big assumption.  In order to use dual contouring you need to know not only the value of the potential function but also its gradient!  That is, for each edge you must compute the point of intersection AND a normal direction.  But if you know this much, then it is possible to reformulate the problem of finding a nice vertex as a type of linear least squares problem.  This technique produces very high quality meshes that can preserve sharp features.  As far as I know, it is still one of the best methods for generating high quality meshes from potential fields.
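To sketch the idea (and only sketch it; this is not the paper’s actual solver, which uses a more careful SVD/QR-based approach to handle degenerate configurations), the minimization can be set up as normal equations and solved directly as a 3x3 system:

```javascript
// Sketch of dual contouring vertex placement: minimize
//   sum_i (n_i . (x - p_i))^2
// over the Hermite samples (p_i, n_i) of one cell.  This forms the
// normal equations (A^T A) x = A^T b and solves the 3x3 system by
// Cramer's rule.  A real implementation needs something more robust,
// since A^T A is often near-singular on flat or noisy regions.
function qefVertex(points, normals) {
  var AtA = [0, 0, 0, 0, 0, 0, 0, 0, 0];   // symmetric 3x3, row-major
  var Atb = [0, 0, 0];
  for (var i = 0; i < points.length; ++i) {
    var n = normals[i], p = points[i];
    var d = n[0] * p[0] + n[1] * p[1] + n[2] * p[2];   // b_i = n_i . p_i
    AtA[0] += n[0] * n[0]; AtA[1] += n[0] * n[1]; AtA[2] += n[0] * n[2];
    AtA[4] += n[1] * n[1]; AtA[5] += n[1] * n[2]; AtA[8] += n[2] * n[2];
    Atb[0] += d * n[0]; Atb[1] += d * n[1]; Atb[2] += d * n[2];
  }
  AtA[3] = AtA[1]; AtA[6] = AtA[2]; AtA[7] = AtA[5];
  function det3(m) {
    return m[0] * (m[4] * m[8] - m[5] * m[7])
         - m[1] * (m[3] * m[8] - m[5] * m[6])
         + m[2] * (m[3] * m[7] - m[4] * m[6]);
  }
  var D = det3(AtA);
  var x = [0, 0, 0];
  for (var c = 0; c < 3; ++c) {
    var m = AtA.slice();                 // replace column c with A^T b
    m[c] = Atb[0]; m[c + 3] = Atb[1]; m[c + 6] = Atb[2];
    x[c] = det3(m) / D;
  }
  return x;
}
```

With three axis-aligned planes meeting at a corner, the minimizer recovers the corner exactly, which is precisely how dual contouring reproduces sharp features.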

Of course there are some downsides.  The first problem is that you need to have Hermite data, and recovering this from an arbitrary function requires using either numerical differentiation or applying some clunky automatic differentiator.  These tools are nice in theory, but can be difficult to use in practice (especially for things like noise functions or interpolated data).  The second issue is that solving an overdetermined linear least squares problem is much more expensive than taking a few floating point reciprocals, and is also more prone to blowing up unexpectedly when you run out of precision.  There is some discussion in the paper about how to manage these issues, but it can become very tricky.  As a result, I did not get around to implementing this method in javascript (maybe later, once I find a good linear least squares solver…)

## Demo

As usual, I made a WebGL widget to try all this stuff out (caution: this one is a bit browser heavy):

This toolbox lets you compare marching cubes/tetrahedra and the (naive) surface nets that I described above.  The Perlin noise examples use the javascript code written by Kas Thomas.  Both the marching cubes and marching tetrahedra algorithms are direct ports of Bloyd/Bourke’s C implementation.  Here are some side-by-side comparisons.

#### Left-to-right:  Marching Cubes (MC), Marching Tetrahedra (MT), Surface Nets (SN)

MC: 15268 verts, 7638 faces. MT: 58580 verts, 17671 faces. SN: 3816 verts, 3701 faces.

MC: 1140 verts, 572 faces.  MT: 4200 verts, 1272 faces. SN: 272 verts, 270 faces.

MC: 80520 verts, 40276 faces. MT: 302744 verts, 91676 faces. SN: 20122 verts, 20130 faces.

MC: 172705 verts, 88071 faces. MT: 639522 verts, 192966 faces. SN: 41888 verts, 40995 faces.

A few general notes:

• The controls are left mouse to rotate, right mouse to pan, and middle mouse to zoom.  I have no idea how this works on Macs.
• I decided to try something different this time and put a little timing widget so you can see how long each algorithm takes.  Of course you really need to be skeptical of those numbers, since it is running in the browser and timings can fluctuate quite randomly depending on totally arbitrary outside forces.  However, it does help you get something of a feel for the relative performance of each method.
• In the marching tetrahedra example there are frequently many black triangles.  I’m not sure if this is because there is a bug in my port, or if it is a problem in three.js.  It seems like the issue might be related to the fact that my implementation mixes quads and triangles in the mesh, and that three.js does not handle this situation very well.
• I also didn’t implement dual contouring.  It isn’t that much different than surface nets, but in order to make it work you need to get Hermite data and solve some linear least squares problems, which is hard to do in Javascript due to lack of tools.

## Benchmarks

To compare the relative performance of each method, I adapted the experimental protocol described in my previous post.  As before, I ran the experiments on a sample sinusoid, varying the frequency over time.  That is, I generated a $65^3$ volume plot of

$\sin( \frac{n \pi}{2} x ) + \sin( \frac{n \pi}{2} y ) + \sin( \frac{n \pi}{2} z )$

over the range $[ - \frac{\pi}{2}, + \frac{\pi}{2} ]^3$.  Here are the timings I got, measured in milliseconds:

| Frequency | Marching Cubes | Marching Tetrahedra | Surface Nets |
|---|---|---|---|
| 0 | 29.93 | 57 | 24.06 |
| 1 | 43.62 | 171 | 29.42 |
| 2 | 61.48 | 250 | 37.78 |
| 3 | 93.31 | 392 | 47.72 |
| 4 | 138.2 | 510 | 51.36 |
| 5 | 145.8 | 620 | 74.54 |
| 6 | 186 | 784 | 83.99 |
| 7 | 213.2 | 922 | 97.34 |
| 8 | 255.9 | 1070 | 112.4 |
| 9 | 272.1 | 1220 | 109.2 |
| 10 | 274.6 | 1420 | 124.3 |
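For reference, the benchmark volumes can be generated with a loop along these lines (a sketch; the flat array layout and names are my own, not the actual test harness):

```javascript
// Sample f(x,y,z) = sin(kx) + sin(ky) + sin(kz), with k = n*pi/2, onto
// a dims^3 grid spanning [-pi/2, pi/2]^3, stored in a flat array in
// z-major order.  The benchmark used dims = 65.
function sampleVolume(n, dims) {
  var k = n * Math.PI / 2;
  var lo = -Math.PI / 2, hi = Math.PI / 2;
  var data = new Float32Array(dims * dims * dims);
  var ptr = 0;
  for (var i = 0; i < dims; ++i)
  for (var j = 0; j < dims; ++j)
  for (var l = 0; l < dims; ++l) {
    var x = lo + (hi - lo) * l / (dims - 1);
    var y = lo + (hi - lo) * j / (dims - 1);
    var z = lo + (hi - lo) * i / (dims - 1);
    data[ptr++] = Math.sin(k * x) + Math.sin(k * y) + Math.sin(k * z);
  }
  return data;
}
```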

By far marching tetrahedra is the slowest method, mostly on account of it generating an order of magnitude more triangles.  Marching cubes, on the other hand, despite generating nearly 2x as many primitives, was still pretty quick.  For small geometries both marching cubes and surface nets perform comparably.  However, as the isosurfaces become more complicated, surface nets eventually win just on account of creating fewer primitives.  Of course this is a bit like comparing apples to oranges, since marching cubes generates triangles while surface nets generate quads, but even so surface nets still produce slightly less than half as many facets on the benchmark.  To see how they stack up, here is a side-by-side comparison of the number of primitives each method generates for the benchmark:

| Frequency | Marching Cubes | Marching Tetrahedra | Surface Nets |
|---|---|---|---|
| 0 | 0 | 0 | 0 |
| 1 | 15520 | 42701 | 7569 |
| 2 | 30512 | 65071 | 14513 |
| 3 | 46548 | 102805 | 22695 |
| 4 | 61204 | 130840 | 29132 |
| 5 | 77504 | 167781 | 37749 |
| 6 | 92224 | 197603 | 43861 |
| 7 | 108484 | 233265 | 52755 |
| 8 | 122576 | 263474 | 58304 |
| 9 | 139440 | 298725 | 67665 |
| 10 | 154168 | 329083 | 73133 |

## Conclusion

Each of the isosurface extraction methods has their relative strengths and weaknesses.  Marching cubes is nice on account of the free and easily usable implementations, and it is also pretty fast.  (Not to mention it is also the most widely known.)  Marching tetrahedra solves some issues with marching cubes at the expense of being much slower and creating far larger meshes.  On the other hand surface nets are much faster and can be extended to generate high quality meshes using more sophisticated vertex selection algorithms.  It is also easy to implement and produces slightly smaller meshes.  The only downside is that it can create non-manifold vertices, which may be a problem for some applications.  I unfortunately never got around to properly implementing dual contouring, mostly because I’d like to avoid having to write a robust linear least squares solver in javascript.  If any of you readers wants to take up the challenge, I’d be interested to see what results you get.

### PS

I’ve been messing around with the wordpress theme a lot lately.  For whatever reason, it seems like the old one I was using would continually crash Chrome.  I’ve been trying to find something nice and minimalist.  Hopefully this one works out ok.

## Smooth Voxel Terrain (Part 1)

My god…  I’ve now written three blog posts about Minecraft.  If you are at all like me, you may be getting just a bit tired of all those cubes.  To avoid fatigue, I think now is a good time to change things up.  So today let’s talk about something cool and different — like how to deal with smooth blobby shapes:

A rendering of Goursat’s surface.

## Blobby Objects

But before we get too far ahead of ourselves, we need to say what exactly a blobby object is.  The term ‘blobby object’ is one of Jim Blinn’s many contributions to computer graphics, though some of the basic ideas go back much further.  In fact mathematicians have known about blobs for centuries (before they even had calculus!)  Conceptually, the basic idea of a blobby object is quite simple, though it can be a bit tricky to formalize it properly.

The theoretical basis for blobby object modeling is the concept of an implicit solid.  The idea is that we can represent a solid object by a contour of a smooth function.  I found the following beautiful picture from wikipedia that illustrates the concept:

Public domain illustration of a 2D implicit set. (c) Wikimedia, created by Oleg Alexandrov.

Now we want to get a solid (that is a closed regular set) — not just any old sublevel set.  Fortunately, the definition is pretty simple once you’ve seen it and get used to it.  Given any continuous function, $f(x,y,z)$, it determines a set which we call the regularized 0-sublevel,

$V_0(f) \stackrel{def}{=} \kappa f^{-1}([-\infty, 0)) = \kappa \{ (x,y,z) \in \mathbb{R}^3: f(x,y,z) < 0 \}$

(Here I’m using the symbol $\kappa$ to denote the closure operator.)  Unpacked, this means that $V_0(f)$ is the closure (of the preimage of (the open interval $[-\infty, 0)$ ) ).  Our definition of $V_0$ is a little bit different than the standard (but unfortunately slightly broken) definition which you see quite frequently:

$Z_0(f) \stackrel{def}{=} f^{-1}( [-\infty, 0] ) = \{ (x,y,z) \in \mathbb{R}^3 : f(x,y,z) \leq 0 \}$

Spiritually, this is the right idea but technically it doesn’t quite work.  It is true that in many situations, these two definitions give similar results.  For example, let’s consider the 1D function $f(x) = x^2 - 1$.  This has a pair of roots at $\pm 1$, and looks like this:

In this case, both $Z_0(f)$ and $V_0(f)$ will give us the same result, that is the interval $[-1, 1]$.  Where things break down however is near the edge cases.  If for example we took the function $f(x) = x^2$:

Then $Z_0(f)$ would give us a point while $V_0(f)$ is empty!  Now you might ask, why should we prefer one definition over the other?  There are at least a couple of good reasons:

1. First, $Z_0$ is not necessarily a solid.  This means that it is not always uniquely determined by its boundary, and that certain mass properties (like volume, inertia, momentum, etc.) may fail to be well defined.  On the other hand, $V_0$ is always a nice solid object, and so we don’t ever have to worry about such pathologies.
2. Closely related to the first point, it is very difficult to define regularized Boolean operations on the latter representation, while it is possible to perform CSG operations on regularized sublevels using Rvachev functions.  (Maybe I will say more about this at some point in the future.)
3. Finally, it is not even practically possible to numerically compute $Z_0$ anyway!  The reason for this is that most general purpose methods for finding 0-sets use sampling to check for places where f crosses the 0-axis (this is an application of the intermediate value theorem from calculus).  If we include these isolated 0 points in our definition, then we are in the hopeless situation where we must be infinitely lucky to get one of our sample points to even detect them!  (And if you get really unlucky and one of them actually does show up, then there is a good chance your code might even break if you are using the $Z_0$ definition!)

To boil all this down to a pithy remark:

The reason we prefer $V_0$ over $Z_0$ is that it detects 0-crossings — not 0-values!

My personal reason for preferring $V_0$ over $Z_0$ is that it is what is actually used in practice anyway.  In fact, even when most people say they are using $Z_0$, what they end up actually computing is $V_0$, because the only thing they can actually compute are the crossings!
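This distinction is easy to see in code.  A sampled root finder only ever reports sign changes between adjacent samples; here is a 1D sketch:

```javascript
// Scan a sampled 1D function for sign changes between adjacent samples.
// An isolated zero (where f touches 0 without crossing it) is invisible
// to this test -- which is exactly why V_0, not Z_0, is what such code
// actually computes.
function findCrossings(samples) {
  var crossings = [];
  for (var i = 0; i + 1 < samples.length; ++i) {
    if ((samples[i] < 0) !== (samples[i + 1] < 0)) {
      crossings.push(i);   // a crossing lies somewhere in [i, i+1]
    }
  }
  return crossings;
}
```

Sampling $f(x) = x^2 - 1$ this way produces crossings near $\pm 1$, while sampling $f(x) = x^2$ produces none, exactly matching $V_0$ rather than $Z_0$.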

## Isosurface Extraction

Now that we’ve hopefully got a decent mental picture of what we want to find, let’s try to figure out how to compute something with it. In particular, we are most interested in finding the boundary of the implicit set. An obvious application would be to draw it, but knowing the boundary can be equally useful for other tasks as well; like computing mass properties for example.

In light of our previous definition, the boundary of our implicit solid is going to be the set of all 0-crossings of $f$.  One perspective is that computing the boundary is like a higher dimensional generalization of root finding.  Thinking metaphorically for a moment, and confusing 0-values with 0-crossings, it is like saying that instead of trying to find solutions for:

$f(x) = 0$

We are now looking for all solutions to the more general problem:

$f(x,y,z) = 0$

In the first case, we get some points as an output, while in the second case the solution set will look like some surface.

### Implicit Function Theorem

It turns out that at least when f is smooth, we can convert implicit sets locally into parametric surfaces.  This is the content of the implicit function theorem. Finding such a parametric surface is as good as solving the equation.  For example, say we had a paraboloid given by:

$f(x,y,z) = z - x^2 - y^2$

Then we could parameterize the 0-set by the surface:

$x(u,v) = u$

$y(u,v) = v$

$z(u,v) = u^2 + v^2$

When you have a parametric solution, life is good.  Unfortunately, for more complicated equations (like Perlin noise for example), finding a parametric solution is generally intractable.  It also doesn’t work so well if your implicit function happens to come from some type of sampled geometry, such as a distance field or a point cloud.  As a result, exact parametric solutions are generally too much to hope for.

### Ray Tracing

A more general solution to the problem of isosurface extraction is ray tracing, which is basically a form of dimensionality reduction.  The idea is that instead of trying to find the whole isosurface, we instead just look to query it locally.  What we do is define some ray (or more generally any curve) $\varphi : [0,1] \to \mathbb{R}^3$, and then try to solve for the place where f composed with $\varphi$ crosses 0:

$f( \varphi(t) ) = 0$

This is now a standard 1D root-finding problem, and we could solve it either symbolically or numerically, using techniques like bisection or Newton’s method.  For example, in path tracing, one usually picks a collection of rays corresponding to the path light would take bouncing through the scene.  Then all we need to do is check where these individual rays meet the surface, rather than compute the entire surface all at once (though it can still become very expensive).
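Here is what the bisection variant might look like for such a ray query (a sketch; it assumes the composed function changes sign over the interval, so the intermediate value theorem guarantees a crossing):

```javascript
// Bisect for a root of g(t) = f(ray(t)) on [lo, hi], assuming g(lo)
// and g(hi) have opposite signs.  Each iteration halves the interval
// containing the sign change.
function bisect(g, lo, hi, iters) {
  var flo = g(lo);
  for (var i = 0; i < iters; ++i) {
    var mid = 0.5 * (lo + hi);
    var fmid = g(mid);
    if ((fmid < 0) === (flo < 0)) {
      lo = mid; flo = fmid;   // the crossing lies in the upper half
    } else {
      hi = mid;               // the crossing lies in the lower half
    }
  }
  return 0.5 * (lo + hi);
}
```

For example, `bisect(function (t) { return t * t - 2; }, 0, 2, 60)` converges to $\sqrt{2}$.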

### Meshing

For most interesting functions, it is neither possible to find a parametric description, nor is ray tracing really fast enough.  Instead, it is often more practical to find some polygonal approximation of the implicit surface.  Broadly speaking, this process has two parts:

1. First, we approximate $f$ by sampling
2. Then, from this sampled version of $f$, we try to find some mesh which approximates $f = 0$.

Of course there are many details hidden within this highly abstract description.  I wrote up a fairly lengthy exposition, but ended up more than tripling the length of this post!  So, I think now is a good time to take a break.

## Next Time

Next time we’ll talk about how to put all this stuff into practice.  The focus is going to be on meshing, since this is probably the most important thing to deal with in practice.  As a sneak preview of things to come, here is a demo that I put together.  I promise all (well ok at least something) will be explained in the coming post!

## Meshing in a Minecraft Game (Part 2)

Last time, I wrote about how to generate meshes in a Minecraft-style voxel engine.  I got a lot of interesting feedback, and so today I’m going to do a follow up highlighting some of the points that came up in the various discussions.  I’ll also talk about yet another interesting approach to generating meshes, which occurred to me in the mean time.  But first, I would like to take a moment to address some specific comments relating to greedy meshing:

## Multiple Voxel Types

By far, the most frequent question about greedy meshing was how to extend it to handle multiple voxel types and different normal directions.  This surprised me somewhat, though it is perhaps understandable since I did not spend much time explaining it in the previous post.  The general idea is pretty simple.  What you do is group blocks together according to their type, and then do the meshing on each part separately:

That’s it!  This same idea works for normals too.  All you need to do is add an extra bit to keep track of the orientation.  To show that this isn’t so difficult, I made an updated demo that shows how some of these methods work for different voxel types.
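The change amounts to a single comparison in the meshing pass: two faces may merge only if their mask entries are identical.  Here is a sketch of how a mask entry might encode both type and orientation (my own illustration, not the demo’s exact code):

```javascript
// When building the 2D mask for a slice, record not just face/no-face
// but a signed type id: positive for faces pointing one way, negative
// for the other, 0 for no face.  Greedy merging then only joins mask
// cells with identical entries.
function maskEntry(typeA, typeB) {
  // typeA, typeB are the voxel types on either side of the face
  // (0 = empty).  A face is exposed only if exactly one side is solid.
  if ((typeA !== 0) === (typeB !== 0)) return 0;  // both solid or both empty
  return typeA !== 0 ? typeA : -typeB;            // sign encodes orientation
}
```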

## T-Junctions

Now this one is pretty bizarre.  By far the most common criticism of the greedy method I saw on reddit was that the meshes contain many T-vertices.  I’m not really sure why so many people latched on to this concept, but it is certainly true that greedy meshing will produce them.  What is less clear though is what the harm of having a T-junction is.

Intuitively, one would assume that as long as you have a pair of colinear edges, subdividing one of them shouldn’t make much difference (at least that’s what one would hope anyway).  After thinking fairly hard about rendering, I couldn’t come up with a good reason why this would be a problem, and so I decided to code up a demo to test it:

Surprisingly, no matter how I messed with the parameters I couldn’t get any visible discontinuity to show up!  Here are some pictures:

The mesh with edges drawn on to show the T-junctions.

The mesh without edges drawn on.

In summary, I’m not sure that there is actually a problem with T-junctions.  You can try out the demo yourself if you aren’t convinced.  If anyone can come up with a plausible scenario where one of the greedy meshes has visible rendering artefacts, I’d be quite interested to see it.  I’ve heard old ghost stories that on some ancient ATi cards, when the moon is full, these type of cracks actually do appear and make a big difference; but I unfortunately do not have access to such hardware.  If you can manage to get a crack to show up in this demo (using WebGL), please take a screen shot and post a comment.

That much concludes what I have to say about greedy meshing.  The next thing I want to discuss is an efficient alternative to greedy meshing.  The inspiration for this idea comes from a comment made by fr0stbyte on the last post.  Implicitly, he asked the question of whether triangles could be used to create better meshes than quads.  Thinking about this made me realize that there is a much more direct way to optimize Minecraft-type meshes using standard computational geometry techniques.  Ultimately, we can think of this problem as plain old polygon triangulation, and solve it using purely classical methods.  Starting from scratch, it is by no means a trivial thing to figure out how to triangulate a polygon, however it is by now very well understood and something of a standard technique (which makes me feel very foolish for having overlooked such a basic connection earlier 😐 ).

One of the most frequently used methods for polygon triangulation is the monotone decomposition algorithm due to Fournier et al.  The basic algorithm proceeds in two steps:

1. Decompose your polygon into monotone polygons.
2. Triangulate each monotone polygon.

The second step can be done in optimal linear time in the number of vertices of the monotone polygon, and is quite well documented elsewhere.  I won’t spend any time discussing it here, and instead I’ll just point you to a standard tutorial or textbook.  The more interesting problem in this context is how to do the first step efficiently.  It can be shown that for non-simple polygons (which our regions generally are), it requires at least $\Omega(n \log(n))$ operations to perform a monotone subdivision, where $n$ is the number of vertices.

However, in the case of Minecraft-style meshing, we can actually do much better.  The key is to realize that the number of voxels is much greater than the number of vertices.  This suggests that we can hide the cost of generating the monotone subdivision in the fixed cost of walking along the voxels, and still get an $O(n)$ algorithm out at the end of the day (where this $n$ is the number of voxels, not the number of vertices).  One way to do both at the same time is to adapt the standard sweep line algorithm for monotone decomposition to process the polygon as we march over the volume.  Here is how this works in pseudocode for a single slice:

1. Initialize polygons and frontier to empty lists
2. For each scan line:
   1. Run length encode the scan line
   2. Set pf = pointer to start of frontier, pr = pointer to start of runs
   3. While pf and pr are not at the end of their respective lists:
      1. Let r = *pr and p = *pf be the run and polygon at the current positions
      2. If r overlaps with the bottom of p and has the same type:
         1. Merge r with p
         2. Increment pr and pf and continue
      3. If the end of r is past the bottom right of p:
         1. Close off p and remove it from the frontier
         2. Increment pf
      4. If the bottom end of p is past the end of r:
         1. Turn r into a new monotone polygon
         2. Increment pr
3. Close off all remaining polygons on the frontier

That looks like a terrible mess, but hopefully it makes some sense.  If it helps, here is some actual javascript which implements this method.  The time complexity of this approach is dominated by the cost of voxel traversal, and so it is $O(n)$ again.
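For concreteness, here is a minimal sketch of the run length encoding step of the sweep (note that rleScanLine is a hypothetical helper name, and that the real implementation interleaves this with the merge step rather than building the run list separately):

```javascript
// Run-length encode one scan line of voxel values.
// Returns an array of runs [value, start, end) so that consecutive
// equal voxels collapse into a single entry for the sweep to process.
function rleScanLine(line) {
  const runs = [];
  let start = 0;
  for (let i = 1; i <= line.length; ++i) {
    // Close the current run at the end of the line or on a value change
    if (i === line.length || line[i] !== line[start]) {
      runs.push([line[start], start, i]);
      start = i;
    }
  }
  return runs;
}
```

For example, `rleScanLine([0,0,1,1,1,0])` produces three runs, one per maximal stretch of equal voxels, which is what lets the sweep advance in time proportional to the number of runs rather than the number of voxels per merge.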

## WebGL Demo!

The main changes from last time are the addition of different colors and types for voxels, the extension to handle orientations and the addition of a new algorithm.  Here are some pictures for comparison:

### Left: Naive culling, Middle: Greedy,  Right: Monotone

Naive: 26536 verts, 6634 quads. Greedy: 7932 verts, 1983 quads.  Monotone: 7554 verts, 4306 tris.

Naive: 19080 verts, 4770 quads.  Greedy: 8400 verts, 2100 quads.  Monotone: 8172 verts, 4572 tris.

Naive: 1344 verts, 336 quads.  Greedy: 264 verts, 66 quads.  Monotone: 204 verts, 104 tris.

While monotone meshing takes a rather different approach, the resulting mesh quality isn’t noticeably different.  One thing to keep in mind is that the primitive counts for the monotone mesh are given in triangles, while the other two measurements are given in quads, so there is a factor of two difference between the quantities.  In all of the examples, the monotone mesh had the fewest vertices.  The reason for this is that the monotone triangle decomposition code pushes all the vertices up front, while the greedy mesher emits a new set of vertices for each quad.  It would be interesting to see if the greedy mesh vertex count could be reduced using a similar method.  On the other hand, the adjusted primitive count (taking each quad as 2 triangles) swings both ways: on the terrain example greedy meshing was a little better, while on the triangle example monotone wins by a similar margin.

In the end it is hard to declare a clear winner from this data.  Both greedy and monotone meshing have their advantages and drawbacks, and there are situations where the primitive count advantage can swing easily from one to the other.  A slight edge should be given to the greedy method in terms of code complexity, though only barely.  On the other hand, if your rendering is vertex shader limited, you may get a small FPS boost by switching to monotone meshing since the vertex count is typically lower.  This also saves a bit of GPU memory, which is always nice, but the savings are so small that I’d have a hard time believing it is that big of a deal.

## Overtime Shoot Out

To break the stalemate, let’s race the different meshers against each other.  Recall that in terms of performance, the cost of each algorithm can be broken down into two factors:

1. The size of the underlying voxel grid, $n$.
2. The number of surface primitives, $k$.

One way to think of these two parameters is that $O(n)$ measures the overhead of running any algorithm on its own (that is, the cost of running the algorithm on an empty volume), while $O(k)$ measures the algorithm-dependent overhead which is determined by the complexity of the surface and the type of mesh generation.  Since it is mesh generation that we want to study, ideally we’d like to keep $n$ fixed and let $k$ vary.  Another way to say this is that we want to construct some family of volumes with ever increasing surface areas.  A simple way to do this is to use trig functions of increasingly higher frequency.  For example, consider the following volume, which is described by a functional equation:

$\sin( \frac{\omega}{2 \pi} x ) + \sin( \frac{\omega}{2 \pi} y ) +\sin( \frac{\omega}{2 \pi} z )< 0$.

As $\omega$ increases, the number of chambers increases as well, along with the surface area.  This gives us a pretty good way to control for the complexity of surfaces in an experiment.  I implemented this idea in a node.js script that takes as input a mesher and runs it on a volume of some specified size.  To control for cache warm-up and JIT issues, the script first runs each program some number of iterations on a non-empty volume (in my experiments, I found 10 iterations to be sufficient on a volume where $\omega=4$).  Garbage collection costs are amortized by repeated runs (10 in this case).  All experiments were performed on a 65x65x65 volume with $0 \leq \omega \leq 10$.
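As a sketch (the actual benchmark script may differ in details such as array layout), the test volume can be sampled like this:

```javascript
// Sample the benchmark volume: a voxel is solid wherever
// sin(w/(2π)·x) + sin(w/(2π)·y) + sin(w/(2π)·z) < 0.
function makeTestVolume(dims, omega) {
  const f = omega / (2.0 * Math.PI);
  const volume = new Int8Array(dims[0] * dims[1] * dims[2]);
  let n = 0;
  for (let z = 0; z < dims[2]; ++z)
    for (let y = 0; y < dims[1]; ++y)
      for (let x = 0; x < dims[0]; ++x, ++n) {
        volume[n] = (Math.sin(f * x) + Math.sin(f * y) + Math.sin(f * z) < 0) ? 1 : 0;
      }
  return volume;
}
```

Note that at $\omega = 0$ every sample is exactly 0, which is not less than 0, so the volume comes out empty; this is the degenerate case that isolates the fixed traversal overhead in the tables below.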

Here is the data from running this benchmark on the naive culling method:

Naive Meshing Results:

| $\omega$ | Time (ms) | Vertices | Faces |
|---:|---:|---:|---:|
| 0 | 81.80 | 0 | 0 |
| 1 | 129.05 | 82488 | 20622 |
| 2 | 147.85 | 114696 | 28674 |
| 3 | 166.50 | 146016 | 36504 |
| 4 | 180.80 | 178792 | 44698 |
| 5 | 206.10 | 209256 | 52314 |
| 6 | 208.45 | 243672 | 60918 |
| 7 | 258.85 | 272304 | 68076 |
| 8 | 267.60 | 306640 | 76660 |
| 9 | 278.45 | 334968 | 83742 |
| 10 | 297.15 | 371496 | 92874 |

The first column is $\omega$.  The second column is the average time required to mesh the volume in milliseconds.  The third and fourth columns are the number of vertices and faces respectively.  These results shouldn’t be too surprising.  As the frequency increases, the number of surface elements goes up, and so it ends up taking longer to generate the mesh.  In the 0 frequency case, you get an empty volume, and so the time required is reduced to just the overhead of iterating over the volume.  Just for fun, here are the results for stupid meshing:

Stupid Meshing Results:

| $\omega$ | Time (ms) | Vertices | Faces |
|---:|---:|---:|---:|
| 0 | 6.95 | 0 | 0 |
| 1 | 2733.65 | 3293184 | 823296 |
| 2 | 2848.05 | 3292128 | 823032 |
| 3 | 2727.35 | 3293184 | 823296 |
| 4 | 2673.40 | 3289032 | 822258 |
| 5 | 2729.50 | 3293184 | 823296 |
| 6 | 2741.10 | 3293088 | 823272 |
| 7 | 2687.75 | 3293184 | 823296 |
| 8 | 2729.20 | 3286512 | 821628 |
| 9 | 2682.40 | 3293184 | 823296 |
| 10 | 2772.95 | 3293136 | 823284 |

This may at first seem a little bizarre, but it makes sense.  For stupid meshing, iterating over the volume is basically ‘free’; the only real limit is memory bandwidth.  Another weird thing is that the number of primitives in the stupid mesh does not scale with surface area, but rather with volume.  In this case, the solid volume is relatively constant across frequencies, and so the run-time quickly spikes to 2.7 seconds or so, then stays flat as the frequency changes.

Anyway, let’s now get to the main point, which is how well greedy and monotone meshing stack up:

| $\omega$ | Greedy time (ms) | Greedy verts | Greedy quads | Monotone time (ms) | Monotone verts | Monotone tris |
|---:|---:|---:|---:|---:|---:|---:|
| 0 | 92.40 | 0 | 0 | 79.00 | 0 | 0 |
| 1 | 99.10 | 20712 | 5178 | 92.20 | 20202 | 11034 |
| 2 | 103.10 | 44068 | 11017 | 110.10 | 43326 | 23631 |
| 3 | 110.35 | 61644 | 15411 | 122.30 | 60420 | 32778 |
| 4 | 126.00 | 87984 | 21996 | 144.60 | 86328 | 47319 |
| 5 | 134.25 | 102024 | 25506 | 155.80 | 100056 | 54033 |
| 6 | 151.40 | 129344 | 32336 | 192.10 | 127074 | 68871 |
| 7 | 153.60 | 142416 | 35604 | 197.40 | 139476 | 75273 |
| 8 | 167.85 | 172140 | 43035 | 227.60 | 167410 | 92843 |
| 9 | 164.90 | 182256 | 45564 | 239.60 | 178302 | 96081 |
| 10 | 198.30 | 213452 | 53363 | 297.60 | 209838 | 113559 |

A victory for greedy meshing!  Not only did it beat monotone meshing, but for sufficiently complicated geometry it is also the fastest method overall.  Though this is not exactly a fair apples-to-apples comparison, since greedy meshing spits out quads while monotone meshing generates triangles.  This fact alone is enough to account for nearly a 30% difference in performance, and explains some, but not all, of the discrepancy between these two charts.  The remainder of the overhead is most likely due to the fact that monotone meshing is a more complex algorithm and has to do a bit more work per triangle, while greedy meshing is simpler but does more work per voxel.

While I believe that the general conclusions of these benchmarks are valid, I wouldn’t say that this definitively proves greedy meshing is faster than monotone meshing.  Javascript is far from the best language to use for doing benchmarks, and these implementations are not particularly tuned for performance.  There may also be subtle bugs lurking in the code somewhere, like silent out of bounds accesses, that could end up having disastrous consequences for performance (and account for nearly the entire gap between the two algorithms).  I’ve also tried to control for as many variables as I could, but it is still possible that I overlooked something and that there is some systemic bias in these results.

## Conclusion

We’ve now seen at least two different ways to do better than naive meshing in Minecraft, and there are probably many others out there.  Portponky from r/gamedev sent me some C++ code to generate meshes, though I’ve not yet looked at it carefully enough to figure out how to port it to Javascript (sorry!).  I am still quite curious about it though, since it seems to be doing something quite different than either the greedy method or monotone triangulation.  Anyway, that’s all for now.  If you have any questions or other interesting thoughts you’d like to share, please leave a comment!

## Meshing in a Minecraft Game

The last post I wrote on Minecraft-like engines got a lot of attention, and so I figured it might be interesting to write a follow up.  This time I’ll try to tackle a different problem:

## Meshing

One of the main innovations in Infiniminer (and Minecraft) is that they use polygons instead of raycasting to render their volumes.  The main challenge in using polygons is figuring out how to convert the voxels into polygons efficiently.  This process is called meshing, and in this post I’m going to discuss three different algorithms for solving this problem.   But before doing this, we first need to say a bit more precisely what exactly meshing is (in the context of Minecraft maps) and what it means to do meshing well.  For the sake of exposition, we’ll consider meshing in a very simplified setting:

• Instead of having multiple block types with different textures and lighting conditions, we’ll suppose that our world is just a binary 0-1 voxel map.
• Similarly we won’t make any assumptions about the position of the camera, nor will we consider any level of detail type optimizations.
• Finally, we shall suppose that our chunks are stored naively in a flat bitmap (array).
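Under the flat-bitmap assumption, a voxel lookup is just index arithmetic.  A minimal sketch (voxelAt is an illustrative name, and the x-major layout shown here is one of several common conventions):

```javascript
// With chunks stored as a flat array, reading voxel (x, y, z) is just
// index arithmetic: x varies fastest, then y, then z.
// dims = [sizeX, sizeY, sizeZ]; no bounds checking is done here.
function voxelAt(volume, dims, x, y, z) {
  return volume[x + dims[0] * (y + dims[1] * z)];
}
```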

I hope you are convinced that making these assumptions does not appreciably change the core of the meshing problem (and if not, well I’ll try to sketch out in the conclusion how the naive version of these algorithms could be extended to handle these special cases.)

In a typical Minecraft game chunks do not get modified that often compared to how frequently they are drawn.  As a result, it is quite sensible to cache the results from a mesher and only ever call it when the geometry changes.  Thus, over the lifetime of a mesh, the total amount of computational work it consumes is asymptotically dominated by the cost of rendering.

Recall that at a high level polygon rendering breaks down into two parts: 1.) primitive assembly, and 2.) scan conversion (aka polygon filling).  In general, there really isn’t any way to change the number of pixels/fragments which need to be drawn without changing either the resolution, framerate or scene geometry; and so for the purposes of rendering we can regard this as a fixed cost (this is our second assumption). That leaves primitive assembly, and the cost of this operation is strictly a function of the number of faces and vertices in the mesh.  The only way this cost can be reduced is by reducing the total number of polygons in the mesh, and so we make the following blanket statement about Minecraft meshes:

### Criterion 1: Smaller meshes are better meshes.

Of course, the above analysis is at least a little naive.  While it is true that chunk updates are rare, when they do happen we would like to respond to them quickly.  It would be unacceptable to wait for more than a frame or two to update the displayed geometry in response to some user change.  Therefore, we come to our second (and final) criterion for assessing mesh algorithms:

### Speed vs. Quality

Intuitively, it seems like there should be some tradeoff here.  One could imagine a super-mesher that does a big brute force search for the best mesh compared to a faster, but dumber method that generates a suboptimal mesh.  Supposing that we were given two such algorithms, it is not a priori clear which method we should prefer.  Over the long term, maybe we would end up getting a better FPS with the sophisticated method, while in the short term the dumb algorithm might be more responsive (but we’d pay for it down the road).

Fortunately, there is a simple solution to this impasse.  If we have a fast, but dumb, mesher; we can just run that on any meshes that have changed recently, and mark the geometry that changed for remeshing later on.  Then, when we have some idle cycles (or alternatively in another thread) we can run the better mesher on them at some later point.  Once the high quality mesh is built, the low quality mesh can then be replaced.  In this way, one could conceivably achieve the best of both worlds.
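The two-tier scheme could be sketched as follows.  Note that fastMesh, goodMesh, and uploadMesh are hypothetical stand-ins here, not part of any real engine API:

```javascript
// Two-tier remeshing (sketch): respond to an edit immediately with the
// fast mesher, and queue the chunk for a higher-quality remesh later.
function makeRemesher(fastMesh, goodMesh, uploadMesh) {
  const dirty = new Set();
  return {
    onChunkChanged(chunk) {
      uploadMesh(chunk, fastMesh(chunk));  // cheap mesh, this frame
      dirty.add(chunk);                    // remember to do better later
    },
    onIdle() {                             // call during idle cycles
      for (const chunk of dirty) {
        uploadMesh(chunk, goodMesh(chunk)); // replace with quality mesh
        dirty.delete(chunk);
        break;                              // one chunk per idle slice
      }
    }
  };
}
```

In a browser the onIdle hook could be driven by something like requestIdleCallback, or the quality pass could run in a worker thread.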

## Some Meshing Algorithms

The conclusion of this digression is that of these two criteria, it is the first which is ultimately the most important (and unfortunately also the most difficult) to address.  Thus our main focus in the analysis of meshing algorithms will be on how many quads they spit out.  So with the preliminaries settled, let’s investigate some possible approaches:

### The Stupid Method:

The absolute dumbest way to generate a Minecraft mesh is to just iterate over each voxel and generate one 6-sided cube per solid voxel.  For example, here is the mesh you would get using this approach on a 8x8x8 solid region:

The time complexity of this method is linear in the number of voxels.  Similarly, the number of quads used is 6 * number of filled voxels.  For example, in this case the number of quads in the mesh is 8*8*8*6 = 3072.  Clearly this is pretty terrible, and not something you would ever want to do.  In a moment, we’ll see several easy ways to improve on this, though it is perhaps worth noting that for suitably small geometries it can actually be workable.
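As a sketch of the bookkeeping (counting quads only; a real mesher would also emit the four corner vertices of each face, and stupidQuadCount is just an illustrative name):

```javascript
// The "stupid" mesher: one 6-sided cube per solid voxel, no culling.
function stupidQuadCount(volume, dims) {
  let quads = 0;
  for (let i = 0; i < dims[0] * dims[1] * dims[2]; ++i) {
    if (volume[i]) quads += 6;  // 6 faces per filled voxel
  }
  return quads;
}
```

On the 8x8x8 solid region from the example, this gives 8\*8\*8\*6 = 3072 quads, matching the count above.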

The only thing good about this approach is that it is almost impossible to screw up.  For example, I was able to code it up correctly in minutes and it worked the first time (compared to the culling method which took me a few iterations to get the edge cases right.)  Even if you don’t plan on using this approach it can still be handy to have it around as a reference implementation for debugging other algorithms.

### Culling:

Clearly the above method is pretty bad.  One obvious improvement is to just cull out the interior faces, leaving only quads on the surface.  For example, here is the mesh you get from culling the same 8x8x8 solid cube:

Practically speaking, this method is pretty much the same as the previous one, except we have to do a bit of extra bookkeeping when we traverse the volume.  The reason for this is that now we not only have to read each voxel, but we also have to scan through its neighbors.  This requires a little more thought when coding, but complexity-wise the cost of generating this mesh is asymptotically the same as before.  The real improvement comes in the reduction of quads.  Unlike the stupid method, which generates a number of quads proportional to the volume, culling produces quads proportional to the surface area.  In this case, the total number of faces generated is 8*8*6 = 384 — a factor of 8x better!
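The neighbor test at the heart of culling could be sketched like this (again only counting quads, treating out-of-bounds neighbors as empty so faces on the chunk boundary are kept; culledQuadCount is an illustrative name):

```javascript
// Culling: emit a face only where a solid voxel borders an empty one.
function culledQuadCount(volume, dims) {
  const at = (x, y, z) =>
    (x < 0 || y < 0 || z < 0 || x >= dims[0] || y >= dims[1] || z >= dims[2])
      ? 0
      : volume[x + dims[0] * (y + dims[1] * z)];
  let quads = 0;
  for (let z = 0; z < dims[2]; ++z)
  for (let y = 0; y < dims[1]; ++y)
  for (let x = 0; x < dims[0]; ++x) {
    if (!at(x, y, z)) continue;
    // Check the 6 face neighbors; each empty neighbor exposes a face
    for (const [dx, dy, dz] of [[1,0,0],[-1,0,0],[0,1,0],[0,-1,0],[0,0,1],[0,0,-1]]) {
      if (!at(x + dx, y + dy, z + dz)) ++quads;
    }
  }
  return quads;
}
```

For the solid 8x8x8 cube this produces 8\*8\*6 = 384 faces, in agreement with the count above.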

It is not hard to see that the culled mesh is strictly smaller than the stupid mesh (and often by quite a lot!).  We can try to estimate how much smaller using a dimensional analysis argument:  assuming that our geometry is more-or-less spherical, let $n$ be its radius.  Then, in the stupid method we would expect to generate about $O(n^3)$ quads, while in the culled  version we’d get only $O(n^2)$.  This gives an improvement of a factor of about $O(n)$, which is pretty good!  So, if our chunks were say 16x16x16, then we might expect the culled volume to have about 16x fewer faces than the naive method (heuristically).  Of course, spherical geometry is in some sense the best possible case.  If your world is really jagged and has lots of holes, you wouldn’t expect much improvement at all.  In the worst case (namely a checkerboard world) the size of the meshes produced would be identical!  In practice, one would usually expect something somewhere in between.

### Greedy Meshing:

The previous two approaches are probably the most frequently cited methods for generating Minecraft-like meshes, but they are quite far from optimal.  The last method that I will talk about today is a greedy algorithm which merges adjacent quads together into larger regions to reduce the total size of the geometry.  For example, we could try to mesh the previous cube by fusing all the faces along each side together:

This gives a mesh with only 6 quads (which is actually optimal in this case!)  Clearly this improvement by a factor of 64 is quite significant.  To explain how this works, we need to use two ideas.  First, observe that it suffices to consider only the problem of generating a quad mesh for some 2D cross section.  That is, we can just sweep the volume once in each direction and mesh each cross section separately:

This reduces the 3D problem to 2D.  The next step (and the hard one!) is to figure out how to mesh each of these 2D slices.  Doing this optimally (that is, with the fewest quads) is quite hard.  So instead, let’s reformulate the problem as a different type of optimization.  The idea is that we are going to define some type of total order on the set of all possible quadrangulations, and then pick the minimal element of this set as our mesh.  To do this, we will first define an order on each quad, which we will then extend to an order on the set of all meshes.  One simple way to order two quads is to sort them top-to-bottom, left-to-right, and then compare them by their width and height.

More formally, let’s denote a quad by a tuple $(x,y,w,h)$, where $(x,y)$ is the upper left corner and $(w,h)$ is the width / height of the quad.  Given such a representation of a quad, we denote its underlying set by:

$Q_{x,y,w,h} = \{ (s,t) \in \mathbb{R}^2 : x \leq s \leq x + w \: \mathrm{ and } \: y \leq t \leq y+h \} = [x,x+w] \times [y,y+h]$

By a 2D quad mesh, here we mean a partition of some set $S$ into a collection of quads $Q_i$.  Now let’s define a total order on these quads, given $(x_0, y_0, w_0, h_0), (x_1, y_1, w_1, h_1)$ define the relation:

$(x_0, y_0, w_0, h_0) \leq (x_1, y_1, w_1, h_1) \Leftrightarrow \left \{ \begin{array}{cc} y_0 < y_1 & \mathrm{if } \: y_0 \neq y_1 \\ x_0 < x_1 & \mathrm{if } \: x_0 \neq x_1 \\ w_0 > w_1 & \mathrm{if } \: w_0 \neq w_1 \\ h_0 \geq h_1 & \mathrm{otherwise} \end{array} \right.$

Or if you prefer to read pseudocode, we could write it like this:

    compareQuads( (x0, y0, w0, h0), (x1, y1, w1, h1) ):
        if ( y0 != y1 )   return y0 < y1
        if ( x0 != x1 )   return x0 < x1
        if ( w0 != w1 )   return w0 > w1
        return h0 >= h1

Exercise: Check that this is in fact a total order.

The next thing we want to do is extend this ordering on quads to an ordering on meshes, and we do this in a very simple way.  Given two sorted sequences of quads $( q_0, q_1, ... ), ( p_0, p_1, ... )$ such that $q_i \leq q_{i+1}$ and $p_i \leq p_{i+1}$, we can simply compare them lexicographically.  Again, this new ordering is in fact a total order, and this means that it has a unique least element.  In greedy meshing, we are going to take this least element as our mesh, which we’ll call the greedy mesh.  Here is an example of what one of these lexicographically minimal meshes looks like:

You may be skeptical that finding the least element in this new order always produces a better quad mesh.  For example, there is no guarantee that the least element in this new order is also the mesh with the fewest quads.  Consider what you might get when quad meshing a T-shape using this method:

Exercise 2:  Find a mesh of this shape which uses fewer quads.

However, we can say a few things.  For example, it is always true that each quad in the greedy mesh completely covers at least one edge on the perimeter of the region.  Therefore, we can prove the following:

Proposition: The number of quads in the greedy mesh is strictly less than the number of edges in the perimeter of the region.

This means that in the worst case, we should expect the size of the greedy mesh to be roughly proportional to the number of edges.  By a heuristic/dimensional analysis, it would be reasonable to estimate that the greedy mesh is typically about $O(\sqrt{n})$ times smaller than the culled mesh for spherical geometries (where $n$ is the radius of the sphere as before).  But we can be more precise about how good the greedy mesh actually is:

Theorem: The greedy mesh has no more than 8 times as many quads as the optimal mesh.

That’s within a constant factor of optimality!  To prove this, we’ll need to introduce a bit of nomenclature first.  We’ll call a vertex on the perimeter of a region convex if it turns to the right when walking around the mesh clockwise, and otherwise we’ll call it concave.  Here is a picture, with the convex vertices colored in red and the concave vertices colored in green:

There’s a neat fact about these numbers.  It turns out that if you sum them up, you always get the same thing:

Proposition: For any simply connected region with $V_+$ convex and $V_-$ concave vertices on the perimeter, $V_+ - V_- = 4$.

This can be proven using the winding number theorem from calculus.  But we’re going to apply this theorem to get a lower bound on the number of quads in an arbitrary mesh:

Lemma: Any connected genus $G$ region with a perimeter of $E$ edges requires at least $\frac{E + 2G + 4}{8}$ quads to mesh.

Proof:  Let $V_+, V_-$ denote the number of convex/concave vertices on the perimeter respectively.  Because each quad can cover at most 4 convex vertices, the number of quads, $N$, in the mesh is always at least $N \geq \frac{V_+}{4}$.  By the winding number theorem, for simply connected regions $V_+ = V_- + 4$; since every perimeter vertex is either convex or concave, $V_+ + V_- = E$, and so $V_+ = \frac{E}{2} + 2$.  Therefore any mesh of a simply connected region requires at least $\frac{E + 4}{8}$ quads.  To extend this to non-simply connected regions, we observe that any quad mesh can be cut into a simply connected one by slicing along some edges in the quad mesh connecting out to the boundary.  Making this cut requires introducing at most two edges per hole in the object, and so we can just as well treat this as a simply connected region having $E + 2 G$ edges; applying the previous result gives the bound in the lemma. $\Box$

To prove our main theorem, we just combine these two lemmas together, using the fact that $G \geq 0$.  In summary, we have the following progression:

### Optimal $\leq$ Greedy $\leq$ Culled $\leq$ Stupid

So now that I’ve hopefully convinced you that there are some good theoretical reasons why greedy meshing outperforms more naive methods, I’ll try to say a few words about how you would go about implementing it.  The idea (as in all greedy algorithms) is to grow the mesh by making locally optimal decisions.  To do this, we start from an empty mesh and then find the quad which comes lexicographically first.  Once we’ve located this, we remove it from the region and continue.  When the region is empty, we’re done!  The complexity of this method is again linear in the size of the volume, though in practice it is a bit more expensive since we end up doing multiple passes.  If you are the type who understands code better than prose, you can take a look at this javascript implementation which goes through all the details.
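To make the idea concrete, here is a minimal sketch of the greedy step for one 2D slice (a simplification for illustration, not the linked implementation): take the lexicographically first uncovered cell, grow the quad rightward as far as possible, then downward while the whole row fits, emit it, and clear the covered cells.

```javascript
// Greedy meshing of one 2D binary slice.  mask is a 0/1 array of size
// w*h in row-major order; it is consumed (cleared) as quads are emitted.
// Returns a list of quads [x, y, width, height].
function greedyMesh2D(mask, w, h) {
  const quads = [];
  for (let y = 0; y < h; ++y)
  for (let x = 0; x < w; ++x) {
    if (!mask[x + y * w]) continue;
    // Grow the quad to the right
    let qw = 1;
    while (x + qw < w && mask[x + qw + y * w]) ++qw;
    // Grow the quad downward while the entire row segment is filled
    let qh = 1;
    grow: while (y + qh < h) {
      for (let i = 0; i < qw; ++i)
        if (!mask[x + i + (y + qh) * w]) break grow;
      ++qh;
    }
    // Clear the covered cells and emit the quad
    for (let j = 0; j < qh; ++j)
      for (let i = 0; i < qw; ++i)
        mask[x + i + (y + j) * w] = 0;
    quads.push([x, y, qw, qh]);
  }
  return quads;
}
```

On a completely filled 8x8 slice this emits the single quad [0, 0, 8, 8], matching the optimal 6-quad result for the solid cube once all six sweep directions are accounted for.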

## Demo and Experiments

To try out some of these different methods, I put together a quick little three.js demo that you can run in your browser:

Here are some pictures comparing naive meshing with the greedy alternative:


As you can see, greedy meshing strictly outperforms naive meshing despite taking asymptotically the same amount of time.  However, the relative improvement diminishes as the curvature and number of components of the data increase.  This is because the number of edges in each slice increases.  In the extreme case of a collection of disconnected singleton cubes, the two meshes are identical (and optimal).

## Conclusion and some conjectures:

Well, that’s it for now.  Of course this is only the simplest version of meshing in a Minecraft game, and in practice you’d want to deal with textures, ambient occlusion and so on (which adds a lot of coding headaches, but not much which is too surprising).  Also I would ask the reader to excuse the fairly rough complexity analysis here.  I’d actually conjecture that the true approximation factor for the greedy algorithm is much less than 8, but I’ve spent enough time writing this post so I guess I’ll just leave it as is for now.  I think the real weakness in my analysis lies in the computation of the upper bound on the size of the greedy mesh.  I’d actually guess the following is true:

Conjecture: The size of the greedy mesh is at most $\frac{E}{2}$.

If this conjecture holds true, then it would reduce the current bound on the approximation factor to 4x instead of 8x.  It’s also not clear that this is the best possible greedy approximation.  I’d be interested to learn how Minecraft and other voxel engines out there solve this problem.  From my time playing the game, I got the impression that Minecraft was doing some form of mesh optimization beyond naive culling, though it was not clear to me exactly what was going on (maybe they’re already doing some form of greedy meshing?)

It is also worth pointing out that the time complexity of each of these algorithms is optimal (i.e. linear) for a voxel world which is encoded as a bitmap.  However, it would be interesting to investigate using some other type of data structure.  In the previous post, I talked about using run-length encoding as an alternative in-memory storage of voxels.  But greedy meshing seems to suggest that we could do much better:  why not store the voxels themselves as a 3D cuboid mesh?  Doing this could drastically reduce the amount of memory, and it is plausible that such a representation could also be used to speed up meshing substantially.  Of course the additional complexity associated with implementing something like this is likely to be quite high.  I’ll now end this blog post with the following open problem:

Open Problem: Do there exist efficient data structures for dynamically maintaining greedy meshes?

If you have other questions or comments, please post a response or send me an email!

#### Errata

6/30/12: Corrected an error in the definition of the order on each quad.  Was written (x,y,w,h) when it should have been (y,x,w,h) (Thanks Matt!)

## Making an MMO in 48 Hours

Ludum Dare 23 is done, and here are my results:

I really doubt that I am going to win any awards for graphics…  or gameplay… or sound..   BUT!

• It is multiplayer
• It has persistent state
• It is HTML5
• And it actually works!

How cool is that?  Pretty amazing what you can do in a weekend.  I’m not sure if this is the first successfully completed MMO made for Ludum Dare, but I daresay that it is not something which is frequently attempted (possibly with good reason).  The tools I used to create this monstrosity were node.js, mongodb and crafty.  In terms of development, here is what went right:

• Node.JS is a fantastic platform for making multiplayer games.  Event driven concurrency is basically the model most games use anyway, and I have yet to see any system which does this better.
• Even though it would not be my first choice of language, if you are doing web programming you are stuck using javascript.  Being able to use the same language on the server (and share code) is a huge advantage.  The result was I didn’t have to write, debug and maintain two different versions of A-star while doing the development.
• MongoDB is also quite a nice tool for game programming.  While in most situations I would say the lack of a structured column format is a disadvantage, it actually turns out to be a real help for gamedev and rapid prototyping.  In particular, it was very convenient to make a table with player stats and just keep adding fields to it while developing without worrying about it.  Of course you could probably get similar results by just making a raw string column in a SQL table, but this is a bit ugly and does not integrate as well with node.js.
• Nodejitsu is a very nice host.  Nuno Job graciously offered me a coupon earlier, and I have to say I was very impressed with their service.  Compared to nodester, one of their best features is the automatic database set up and configuration.
• Making a click based system for navigation is a cheesy way to handle latency, but it works.

Now for what went wrong:

• Unfortunately, I was not very disciplined with the amount of time I spent programming.  For me coding stuff is the most fun, and I tend to procrastinate on the drawing side of things.  As a result, the graphics in the game look pretty terrible.  The grass texture in particular turned out eye-bleedingly bad 😦
• The other consequence of over indulging in coding was that I didn’t spend much time on the game itself.  There is really only one type of enemy (a rat), and not much to do in the game besides kill people.  The map is also very empty/boring.  (Though there is a boss battle/final secret.)
• No sound
• This was my first time using crafty.js, which I selected using the “deliberative” method of random google search for “HTML 5 game engine”.  As a result, I spent some time up front trying to figure out how to get everything working.  Crafty itself seems like a fine piece of technology, but the documentation and examples really need some work.  In particular the “RPG tutorial” is out of date and doesn’t work with the latest version.  This cost me maybe 2-3 hours trying to get it to work, which I would say was the low point of the whole competition.  In the end I just gave up with the docs and read the source code directly, which turned out to be much easier to understand.  However, despite this setback, using crafty was still a net savings over doing everything from scratch, especially taking into account debugging and portability issues.

In the end, I had a lot of fun making this (although the final game is kind of terrible).  Once again, if you want to give it a try here is the link:

http://ludumdare23.nodejitsu.com/

# Voxel engines are everywhere…

…and given the huge success of Minecraft, I think they are going to become a permanent fixture in the landscape of game engines.  Minecraft (and Infiniminer) live in a unique space somewhere between high-resolution voxels and static prefabricated environments.  Modifying a Minecraft world is both intuitive and elegant; you simply pick up and place blocks.  In many ways, it is the natural 3D generalization of older 2D tile based games.  Of course creating a Minecraft engine is much more challenging than making a 2D engine for a number of reasons.  First of all, it is in 3D and second it is generally expected that the environments must be much more dynamic, with interactive lighting, physics and so on.

When building a voxel game, it is important to choose a data structure for representing the world early on.  This decision, more than any other, has the greatest impact on the overall performance, flexibility, and scale of the game.  This post discusses some of the possible choices that can be made along these lines, and hopefully gives a better explanation of what sort of technical considerations are involved.  In fact, I posit that some commonly held performance assumptions about Minecraft-type engines are probably wrong. In particular, the importance of random-access reads/writes for modifications is often overestimated, while the relative cost of iteration is greatly underestimated.  In concluding this article, I introduce a new data structure that exploits this observation, and ultimately gives much faster iteration (for typical Minecraft-esque maps) at the expense of slightly slower random accesses.

# Conventional data structures

But before getting ahead of ourselves, let’s review some of the approaches that are currently being used to solve this problem.  We start with the absolute simplest choice, which is the “flat array”.  In the flat array model, you simply store all of your level data in one gigantic array:
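As a concrete sketch (the struct and its names here are my own illustration, not taken from any particular engine), a flat array world is little more than index arithmetic:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A hypothetical flat-array voxel store: one contiguous block of voxel ids.
struct FlatWorld {
    int sx, sy, sz;                 // world dimensions
    std::vector<uint8_t> data;      // one byte per voxel, sx*sy*sz total

    FlatWorld(int x, int y, int z) : sx(x), sy(y), sz(z), data(x * y * z, 0) {}

    // O(1) random access: just arithmetic on the coordinates
    uint8_t get(int i, int j, int k) const { return data[i + sx * (j + sy * k)]; }
    void set(int i, int j, int k, uint8_t b) { data[i + sx * (j + sy * k)] = b; }
};
```

Reads and writes are a single multiply-add away, which is why this layout is so hard to beat for random access.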

As Minecraft and Infiniminer have shown us, this strategy can be viable — provided you limit the number of voxel types and make the world small enough.  Storing blocks in a big array has many advantages, like constant time random access reads and writes, but it is far from perfect. The most obvious downsides are that the amount of memory consumed grows with the cube of the linear dimension, and that the array is ultimately fixed in size.  These properties make flat arrays suitable for small, sand-box worlds (like the old-school Minecraft creative mode), but prevent them from being used in an infinite world (like Minecraft survival mode).

The most naive solution to the perceived disadvantages of flat arrays (and I say this because it is almost invariably the first thing that anyone suggests) is to try putting all the voxels in an octree.  An octree is a tree that recursively subdivides space into equal sized octants:

On the face of it, this does sound like a good idea.  Minecraft maps are mostly empty space (or at least piecewise constant regions), and so removing all those empty voxels ought to simplify things quite a bit.  Unfortunately, for those who have tried octrees, the results are typically not so impressive:  in many situations, they turn out to be orders of magnitude slower than flat arrays.  One commonly cited explanation is that access times for neighborhood queries in octrees are too slow, causing needless cache misses, and that (re)allocating nodes leads to memory fragmentation.

However, to understand why octrees are slower for Minecraft games, it isn’t really necessary to invoke such exotic explanations.  The underlying reason for the worse performance is purely algorithmic: each time an arbitrary voxel is touched, either when iterating or performing a random access, it is necessary to traverse the entire height of the octree.  Since octrees are not necessarily balanced, this height can be as much as log(maximum size of game world)!  Assuming the coordinate system is, say, 32 bit, this brings an added overhead of 32 additional non-coherent memory accesses for each operation that touches the map (as opposed to the flat array, where everything is simply constant time).

A more successful — and even less sophisticated — method is to use paging or “virtual memory” to expand the size of the flat array.  To do this, one breaks down the flat 3D array into a collection of pages or “chunks”, and then uses a hash table to sparsely store only the subset of the pages which are currently required by the game:

Pages/chunks are typically assigned in the same way as any virtual memory system: by reserving the upper $k$-bits of the coordinate for the chunk ID.  Using a virtual memory system has the additional advantage that chunks can be lazily initialized, which is great for games with procedural content.  The same concept also allows chunks to be mapped back to disk when they are not needed (that is, when there are no players nearby).  Using a hash map to store the chunks allows one to maintain constant time random access, while simultaneously taking advantage of sparsity as in an octree.
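A minimal sketch of this idea (the chunk size, bit packing and names are my own illustrative choices):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// A hypothetical virtualized voxel array: 16x16x16 chunks stored sparsely
// in a hash map, keyed by the upper bits of the coordinates.
const int CHUNK_SHIFT = 4;                        // chunk side = 1 << 4 = 16
const int CHUNK_MASK  = (1 << CHUNK_SHIFT) - 1;

struct ChunkedWorld {
    std::unordered_map<uint64_t, std::vector<uint8_t>> chunks;

    // Pack the upper bits of each coordinate into a single chunk ID
    // (21 bits per axis here, purely as an illustrative choice).
    static uint64_t chunkId(int i, int j, int k) {
        uint64_t ci = uint32_t(i >> CHUNK_SHIFT) & 0x1FFFFF;
        uint64_t cj = uint32_t(j >> CHUNK_SHIFT) & 0x1FFFFF;
        uint64_t ck = uint32_t(k >> CHUNK_SHIFT) & 0x1FFFFF;
        return (ci << 42) | (cj << 21) | ck;
    }

    // Offset of a voxel inside its chunk, from the low bits of the coordinates.
    static int offset(int i, int j, int k) {
        return (i & CHUNK_MASK)
             | ((j & CHUNK_MASK) << CHUNK_SHIFT)
             | ((k & CHUNK_MASK) << (2 * CHUNK_SHIFT));
    }

    uint8_t get(int i, int j, int k) const {
        auto it = chunks.find(chunkId(i, j, k));
        return it == chunks.end() ? 0 : it->second[offset(i, j, k)];
    }

    void set(int i, int j, int k, uint8_t b) {
        auto& c = chunks[chunkId(i, j, k)];       // lazily creates the chunk
        if (c.empty()) c.resize(1 << (3 * CHUNK_SHIFT), 0);
        c[offset(i, j, k)] = b;
    }
};
```

Note how set() lazily allocates a chunk on first touch, which is exactly the hook needed for procedural generation or paging chunks in from disk.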

It seems like the current direction of Minecraft engines has begun to converge on the “virtualization” approach.  (And I base this claim on numerous blog posts).  But is this really the best solution?  After all, before we jump in and try to solve a problem we should at least try to quantitatively understand what is going on first; otherwise we might as well be doing literary criticism instead of computer science.  So I ask the following basic question:

# What factors really determine the performance of a voxel engine?

There is a superficial answer to this question, which is “read/write access times”; after all, every operation on a Minecraft map can be reduced to simple voxel queries.  If you accept this claim, then virtualization is optimal — end of story.  However, for reasons which I shall soon explain, this assertion is false.  In fact, I hope to persuade you that it is really the following that is true:

### Claim:  The main bottleneck in voxel engines is iteration — not random access reads and writes!

The basic premise hinges on the fact that the vast majority of accesses to the voxel database come in the form of iterations over 3D ranges of voxels.  To demonstrate this point let us first consider some of the high level tasks involving the world map that a voxel engine needs to deal with:

• Placing and removing blocks
• Collision detection
• Picking/raytracing
• Mesh generation
• Disk serialization
• Network serialization

The last two items here could be considered optional, but I shall leave them in for the sake of completeness (since most voxel games require some level of persistence, and many aspire at least to one day be multiplayer games).  We shall classify each task in terms of whatever operations are required to implement it within the game loop, which are either reads or writes, performed by either random access or iteration.  To measure cost, we count the number of raw voxel operations (read-writes) that occur over some arbitrary interval of gameplay.  The idea here is that we wait some fixed unit of time (t) and count up how many times we hit the voxels (ie the number of bytes we had to read/write).  Under the assumption that the world is much larger than cache (which is invariably true), these memory operations are effectively the limiting factor in any computation involving the game world.

At a fairly coarse level, we can estimate this complexity in terms of quantities in the game state, like the number of MOBs and players in the world, or in terms of the number of atomic events that occur in the interval, like elapsed frames, chunk updates etc.  This gives an asymptotic estimate for the cost of each task in terms of intelligible units:

| Task | Operation | Cost (in voxel ops) |
|---|---|---|
| Block placement | Write | (number of clicks) |
| Collision check | Read | (number of MOBs) * (frame count) |
| Picking | Read* | (pick distance) * (frame count) |
| Lighting | Iteration (RW) | (torch radius)^3 * (number of torches) + (active world size) * (sun position changes) |
| Physics update | Iteration (RW) | (active world size) * (frame count) |
| Mesh generation | Iteration (R) | (chunk size) * (chunk updates) |
| Disk serialization | Iteration (R) | (active world size) * (save events) |
| Network IO | Iteration (R) | (active world size) * (number of players) + (chunk size) * (chunk updates) |
* Assuming that picking is implemented by ray-marching.
Disclaimer:  The entries in this chart are estimates, and I do not claim that the dimensions for each task are 100% accurate.  If you disagree with anything here, please leave a comment.

Our goal is to pinpoint which operation(s) in this chart is the bottleneck, and to do so we need dimensional analysis to convert all our units into a common set of dimensions.  Somewhat arbitrarily, we choose frame count and number of players, and so our units will be in (voxel ops * frames * players).  Going through the list of each item, we now try to estimate some plausible bounds for each dimension:

1. (number of clicks) is the number of times the user clicked the mouse in the interval.  Since a user can click at most once per frame (and in practice is likely to click much much less), this value is bounded above by:
• (number of clicks) < (frame count) * (number of players)
2. (number of MOBs) is the number of active entities, or movable objects in the game.  Assuming that they are not moving too fast (ie there is a maximum speed of a constant number of blocks per frame), then the number of reads due to a collision check is going to be proportional to the number of MOBs and the number of frames.  We assume that the number of MOBs in the world is proportional to the number of players, according to some constant $k_{MOB}$,
• (number of MOBs) = $k_{MOB}$ * (number of players)
3. (pick distance) is the maximum range a block can be selected by a player, and we shall assume that it is just a small constant.
4. Again, (torch radius) is a small number.  In Minecraft, this could theoretically be as high as 16, but we will simply leave it as a tiny constant for now.
5. (number of torches) is the number of torches placed / removed during the interval.  Torch placement is generally an irregular event, and so we estimate that a player will place a torch somewhat infrequently, at a rate governed by some constant $k_{TORCH}$,
• (number of torches) = (number of players) * (frame count) * $k_{TORCH}$
6. (sun position changes) is the number of times that the sun changes.  This happens regularly every few frames, and so we estimate that,
• (sun position changes) = (frame count) * $k_{SUN}$
7. (active world size) is the size of the world.  It is basically proportional to the number of chunks visible to each player.  Assuming a Minecraft style world with a height cap, this would vary quadratically with the visible radius, $r_V$, or cubically for a world with unlimited z-value.
• (active world size) = (number of players) * (chunk size) * $r_V^{2}$
8. (chunk updates) occur whenever a chunk changes.  Assuming this is random, but proportional to the size of the active world, we get
• (chunk updates) = (active world size) / (chunk size) * $k_{UPDATE}$
9. (save events) occur at regular intervals and save the map to disk.  Again, because they are regular we can just bound them by a constant times the number of frames.
• (save events) = (frame count) * $k_{SAVE}$

Plugging all that in, we get the following expression for the total amount of random access voxel ops:

(random voxel ops) = C * (number of players) * (frame count)

And for the sequential or iteration voxel ops, we get:

(sequential voxel ops) = C * (number of players) * (frame count) * (chunk size) * $r_V^2$

In general, the last quantity is much larger, by a factor of (chunk size) * $r_V^2$, or (chunk size) * $r_V^3$ if you have unlimited height.  So, if you believe these estimates are reasonable, then you should be convinced that iteration by far dominates the performance of a Minecraft style game.  In fact, this estimate explains a few other things, like why visible radius is such an important performance factor in Minecraft.  Increasing the draw radius linearly degrades the performance quadratically!

# Can we make iteration faster?

The main conclusion to draw from the above estimate is that we should really focus our optimization efforts on iteration.  If we can bring that cost down measurably, then we can expect to see large improvements in the game’s performance.  But the question then becomes: is this even possible?  To show that at least in principle it is, we make use of a simple observation:  In any situation where we are iterating, we only need to consider local neighborhoods around each cell.  For example, if you are computing a physics update (in Minecraft) you only consider the cells within your Moore neighborhood.  Moreover, these updates are deterministic:  If you have two voxels with the same Moore neighborhood, then the result of applying a physics update to each cell will be the same.  As a result, if we can iterate over the cells in groups which all have the same neighborhood, then we can apply the same update to all cells in that group at once!

This is the basic idea behind our strategy for speeding up iteration.  We will now try to explain this principle in more detail.  Let $B$ be the set of block types, and let $G : \mathbb{Z}^3 \to B$ be a map representing our world (ie an assignment of block types to integral coordinates on a regular grid).  A radius $r$-Moore neighborhood is an element of the set $B^{2r+1} \times B^{2r + 1} \times B^{2r + 1} \cong \{ f : \mathbb{Z}_{2r + 1}^3 \to B \}$.  The Moore neighborhood of a coordinate $(i,j,k)$ in $G$ is determined by a map, $M_{r} : \mathbb{Z}^3 \to (\mathbb{Z}_{2 r + 1}^3 \to B)$ such that,

$M_r(i,j,k)(o, p, q) = G(i + o - r, j + p - r, k + q - r)$

Then an update rule is a map sending a Moore neighborhood to a block, $P : B^{2r+1} \times B^{2r + 1} \times B^{2r + 1} \to B$, and the application of $P$ to $G$ is the rule sending,

$G \mapsto P \circ M_r$

Then the fact that we can group cells corresponds to the observation that:

$M_r(i_1, j_1, k_1) = M_r(i_2, j_2, k_2) \implies P(M_r(i_1, j_1, k_1)) = P(M_r(i_2, j_2, k_2))$

So, if our world data allows us to group up similar Moore neighborhoods, then we can expect that on average we should be able to reduce the complexity of iteration by a substantial amount.

In fact, this same idea also applies to meshing:  If you have a sequence of adjacent cubes that all have the same 3x3x3 neighborhood, then you can fuse their faces together to make a smaller mesh.  For example, instead of generating a mesh for each cube one at a time:

We can fuse cells with the same Moore neighborhood together:

So if we can iterate over the cells in batches, we should expect that not only will our physics loop get faster, but that mesh quality will improve as well!

# Run length encoding and interval trees

At this point, we now have some pretty good intuition that we can do iteration faster, so the obvious next step is to figure out exactly how this works.  Here is one strategy that I have used in a simple javascript based Minecraft game (which you can get here).  The idea follows from the simple observation that Minecraft style worlds can be easily run-length-encoded.  In run length encoding, one first flattens the 3D array describing a chunk down to a 1D string.  Then, substrings of repeated characters or “runs” are grouped together.  For example, the string:

     aaaaabbbbacaa

Would become:

     5a4b1a1c2a

Which is a size reduction of about 25%.  You can imagine replacing the characters in the string with voxel types, and then compressing them down this way.  Now, applying run-length encoding to chunks is by no means a new idea.  In fact, it is even reasonable to do so as a preprocessing step before g-zipping the chunks and sending them over the network/spooling to disk.  The fact that run-length encoding does compress chunks works to our advantage:  We can use it to iterate over the voxel set in units of runs instead of units of voxels, and this can easily be extended to walk runs of Moore neighborhoods as well.
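To make the idea concrete, here is a tiny sketch of run-length encoding and run-at-a-time iteration (the representation is my own illustration, not the exact code from the linked game):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Encode a flattened chunk as (length, block) runs.
std::vector<std::pair<int, char>> rleEncode(const std::string& voxels) {
    std::vector<std::pair<int, char>> runs;
    for (char v : voxels) {
        if (!runs.empty() && runs.back().second == v) {
            ++runs.back().first;        // extend the current run
        } else {
            runs.push_back({1, v});     // start a new run
        }
    }
    return runs;
}

// Iterating the chunk now touches one entry per run instead of one per voxel.
int countVoxels(const std::vector<std::pair<int, char>>& runs, char b) {
    int total = 0;
    for (const auto& r : runs) {
        if (r.second == b) total += r.first;  // whole run processed at once
    }
    return total;
}
```

Encoding the string "aaaaabbbbacaa" from the example above yields the five runs (5,a)(4,b)(1,a)(1,c)(2,a), and countVoxels visits five entries rather than thirteen.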

This sounds great for iterations, but what about doing random-access reads and writes?  After all, we still need to handle collision detection, block placement and raytracing somehow.  If we were dumb about it, just accessing a random block from a run-length-encoded chunk could take up to linear time.  It is obvious that we shouldn’t be happy with this performance, and indeed there is a simple solution.  The key observation is that a run-length encoding of a string is formally equivalent to an interval tree representation.

An interval tree is just an ordinary binary tree, where the key of each node is the start of a run and the value is the block type of the run.  Finding the greatest lower bound for a coordinate is equivalent to finding the run containing that coordinate.  Insertion is a bit trickier, but not by much.  It does involve working through a few special cases by hand, but it is nothing that cannot be done with a bit of time and a pad of paper.  If we implement this data structure using some type of self-balancing binary tree (for example a red-black tree or a splay tree), then we can perform random reads and writes in time logarithmic in the number of runs.  Best of all: we also get improved mesh generation and in-memory compression for free!
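The read path can be sketched with std::map (a red-black tree in most standard library implementations); the run-splitting logic needed for writes is omitted in this illustration:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Keys are run start coordinates (in the flattened chunk), values are block
// types.  Reading a voxel = finding the greatest key <= the coordinate,
// which costs O(log #runs).
struct RunMap {
    std::map<int, uint8_t> runs;        // must always contain key 0

    uint8_t get(int x) const {
        auto it = runs.upper_bound(x);  // first run starting strictly after x
        --it;                           // ...so the previous run contains x
        return it->second;
    }
};
```

Given runs starting at 0, 5, 9 and 10, get(7) lands in the run starting at 5, with a single logarithmic tree search.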

Now in practice, you would want to combine this idea with virtualization.  Basically, you would store each chunk as an interval tree, then represent the entire world as a hash map.  The reason for doing this would be to integrate paging and chunk generation more seamlessly.  In fact, this is the method I took in the last two minecraft type games that I wrote, and you can find some example code illustrating this data structure right here.

# Comparisons and conclusion

Ok, now that we’ve gone through the details, let’s break down our options by data structure and time complexity:

| Data structure | Random access time | Iteration time | Size | Notes |
|---|---|---|---|---|
| Flat array | $O(1)$ | $O(n)$ | $O(n)$ | |
| Virtualized array | $O(1)$ | $O(n)$ | $O(v)$ | $v$ is the number of occupied (virtual) chunks |
| Octree | $O(h)$ | $O(n h)$ | $O(v h)$ | $h$ is the height of the octree |
| Interval tree + Hash table | $O( \log(r_C) )$ | $O( r )$ | $O( v r_C )$ | $r$ is the number of runs in a region, $r_C$ is the number of runs in a single chunk |

Under our previous estimates, we can suppose quite reasonably that an interval tree should outperform a virtualized array for typical maps.  (Of course this claim goes out the window if your level geometry is really chaotic and not amenable to run-length encoding).  To test this hypothesis, I coded up a pretty simple benchmark.  At the outset, the map is initialized to a 256x256x256 volume with 5 different layers of blocks.  Then 2^16 blocks, drawn from 32 different types, are randomly sprinkled around the domain.  To assess performance, we measure both how long an average random read takes and how long a sequential iteration requires (for Moore radii of 0 and 1).  The results of the benchmark are as follows:

In each column, the best result is underlined in bold.  There are a few things to take home from this.  First, for random access reads and writes, nothing beats a flat array.  Second, octrees are not a very good idea for voxel worlds.  On all accounts, a virtual array is far superior.  Finally, while the access time for random reads is noticeably slower in an interval tree, the speed benefits of improved sequential iteration make them more than worthwhile in practice.  The code for the benchmark can be found here.

Of course, interval trees will not perform well in worlds which are more-or-less random, or if there are not many runs of similar blocks that can be grouped together.  But when your world can be described in this way (which is certainly true of a Minecraft like game), switching to intervals can offer a huge improvement in performance.  I also firmly believe that there is still quite a bit of room for improvement.  If you have any suggestions or ideas, please leave a comment!

## Goofing around in node.js

In my spare time I’ve been messing around a bit with node.js, and have been trying out building a browser based multiplayer online game.  Right now I am using sprites/minecraft style levels since they are easy to make and low bandwidth; but ultimately I would like to try out doing some more stuff with meshes.  You can give it a try if you have a browser which supports WebGL.   Here is a screen shot:

The URL for the game is http://mmotest.nodester.com .  Right now I am using the freebie hosting coupon for nodester, but I have no idea how much longer it will last.  The database hosting is from mongolabs, and so again is fairly limited.  However, this does show that you can build something like a minecraft game in the browser pretty easily.  Of course the latency with a shared hosting provider like nodester starts to become a bigger problem as you get more players in the game.  I will try to post the source code once I get a chance to clean some things up.  I am having some issues configuring the nodester deployment so that I don’t have to hardcode in various strings like the database password.  Unfortunately, for the life of me I can’t figure out how to pass those sorts of config settings in on the command line via nodester’s REST API, and so for now they are just stored in version control.  Once I can figure out how to remove this information, I will make the github repository public so that others can try it out.  On the other hand, if I can’t solve the problem I may just do something irresponsible like release a repository with all the passwords in it and let everyone go crazy hacking my demo accounts.

## Polynomials in C++

Last time, we showed how one can use symmetric tensors to conveniently represent homogeneous polynomials and Taylor series.  Today, I am going to talk about how to actually implement a generic homogeneous polynomial/symmetric tensor class in C++.  The goal of this implementation (for the moment) is not efficiency, but rather generality and correctness.  If there is enough interest we can investigate some optimizations perhaps in future posts.

## How to deal with polynomials?

There are quite a few ways to represent polynomials.  For the moment, we aren’t going to assume any type of specialized sparse structure and will instead simply suppose that our polynomials are drawn at random from some uniform distribution on their coefficients.  In some situations, for example in the solution of various differential equations, this is not too far from the truth, and while there may be situations where this assumption is too pessimistic, there are at least equally many situations where it is a good approximation of reality.

In 1D, the most direct method to represent a dense, inhomogeneous polynomial is to just store a big array of coefficients, where the $i^{th}$ entry corresponds to the coefficient for the degree i monomial.  Naively generalizing a bit, the same idea works for multivariate polynomials as well, where you simply make the array multidimensional.  For example, suppose you had the 2D coefficient matrix:

$\begin{array}{cc} a_{00} & a_{01} \\ a_{10} & a_{11} \end{array}$

Which we could then expand out to give a polynomial like:

$a_{00} + a_{01} x + a_{10} y + a_{11} xy$

Though it is a bit simplistic, working in this format does have a few advantages.  For example:

1. Evaluating a polynomial is easy.  Just compute the vectors $(1, x, x^2, ...), (1, y, y^2, ...)$ and then plug them into the coefficient tensor using ordinary Einstein summation.
2. Multiplying can also be pretty fast.  Observe that the coefficient matrix of the product of two polynomials is just the convolution of their coefficient matrices.  Using fast convolution algorithms (for example an FFT based method would work), this can be computed with only a mere log factor overhead!
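Both points can be sketched in a few lines (a naive O(nm) convolution stands in for the FFT mentioned above; the function names are my own):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Evaluate a dense 2D polynomial stored as a coefficient matrix A, where
// A[i][j] is the coefficient of y^i * x^j (matching the 2x2 example above):
// p(x, y) = sum_ij A[i][j] * y^i * x^j.
double evalPoly2D(const std::vector<std::vector<double>>& A, double x, double y) {
    double result = 0, yp = 1;
    for (const auto& row : A) {
        double xp = 1;
        for (double a : row) {
            result += a * yp * xp;
            xp *= x;
        }
        yp *= y;
    }
    return result;
}

// Multiply two 1D polynomials: the product's coefficients are the convolution
// of the inputs.  (An FFT would bring this down to a log-factor overhead.)
std::vector<double> polyMul(const std::vector<double>& p, const std::vector<double>& q) {
    std::vector<double> r(p.size() + q.size() - 1, 0.0);
    for (std::size_t i = 0; i < p.size(); ++i)
        for (std::size_t j = 0; j < q.size(); ++j)
            r[i + j] += p[i] * q[j];
    return r;
}
```

For instance, evaluating the matrix {{1,2},{3,4}} at (x,y) = (2,3) computes 1 + 2*2 + 3*3 + 4*2*3 = 38, and polyMul({1,1},{1,1}) gives the coefficients of (1 + x)^2.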

The amount of storage for representing a polynomial on $d$ variables with a maximum per-coefficient degree of $r$ using this scheme requires exactly $(r+1)^d$ scalars worth of memory. However, this representation scheme does have some major problems:

1. It doesn’t capture all the terms above degree r properly.  For example, if you take the term $a_{11} xy$ from the above example, it is actually second order, and so is the polynomial that it represents.  However, we are missing the extra terms for $x^2, y^2$.
2. Related to the previous point, it is not closed under linear coordinate transformations.  If you were to say do a rotation in $x, y$, that $a_{11} xy$ term could change into something like $a x^2 + b xy + c y^2$, and thus we would need to resize the matrix to keep up with the extra stuff.
3. If you only care about storing terms with degree at most $r$, then the above representation is wasteful by at least a factor of about $d!$.
4. And if you want to deal with terms that are exactly degree r, then it is even worse, with an overhead of $O( d! r^{ \frac{d}{d-1} })$.

The underlying cause for each of these problems is similar, and ultimately stems from the fact that the function $M_{ij...} x^i y^j ...$ is not multilinear.  As we saw last time it can be useful to deal with multivariate polynomials using tensors.  Similarly, it is also useful to split them up into homogeneous parts (that is grouping all the terms which have the same degree together).  We also showed how this approach greatly simplified the expression of a general multidimensional Taylor series expansion.  So today, we are going to investigate a different, and more “tensor-flavored” method for representing homogeneous polynomials.

## Quick Review

Recall that a homogeneous polynomial on d variables, $x_0, x_1, ... x_{d-1}$, with degree r is a polynomial,

$p(x_0, x_1, ...) = A_{0...0} x_0^r + r A_{0...1} x_0^{r-1} x_1 + ...$

We also showed that this could be written equivalently using tensor notation:

$p(x) = A_{i j k ... } x_i x_j x_k ...$

Which allows us to reinterpret p as a “r-multilinear function” on x.  We say that a function $f(x_0, x_1, ..., x_{d-1} )$ is multilinear if for any of its arguments $x_i$, it is a linear function, that is:

$f(x_0, x_1, ... a x_i + y, .... x_{d-1}) = af(x_0, x_1, ... x_i, .... x_{d-1}) + f(x_0, x_1, ... y, .... x_{d-1})$

While it may seem a bit restrictive to work in homogeneous polynomials only, in practice it doesn’t make much difference.  Here is a simple trick that ought to be familiar to anyone in computer graphics who knows a bit about projective matrices.  Let us consider the following 1D example (which trivially generalizes):  given a polynomial in one variable,

$p(x) = a_0 + a_1 x + a_2 x^2$

Let us create a new variable $w$, and construct the new polynomial:

$p'(x, w) = a_0 w^2 + a_1 xw + a_2 x^2 = w^2 p(\frac{x}{w})$

Now we observe that:

1. $p'(x,w)$ is a homogeneous polynomial of degree 2.
2. If we pick $w = 1$, then $p'(x, 1) = p(x)$.

And so we have faithfully converted our inhomogeneous degree 2 polynomial in one variable into a new homogeneous degree 2 polynomial in two variables!

In general, converting from an inhomogeneous polynomial to a homogeneous polynomial is a simple matter of adding an extra variable.  The fully generalized version works like this:  Suppose that we have a polynomial $p(x_0, x_1, ... x_{d-1})$ of degree r in d unknowns.  Then we add a variable, w, and construct the new homogeneous degree r polynomial,

$p'(x_0, x_1, ... x_{d-1}, w) = w^r p( \frac{x_0}{w}, \frac{x_1}{w}, \frac{x_2}{w}, ...)$

And again, $p'(x_0, x_1, ... , 1) = p(x_0, x_1, ...)$.
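A quick numerical sketch of this construction, using the 1D example from above (illustrative code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// p(x) = a[0] + a[1] x + ... + a[r] x^r, evaluated by Horner's rule.
double evalP(const std::vector<double>& a, double x) {
    double s = 0;
    for (int i = (int)a.size() - 1; i >= 0; --i) s = s * x + a[i];
    return s;
}

// The homogenized polynomial p'(x, w) = w^r * p(x / w), expanded term by term
// so that it is also defined at w = 0.  Here r = degree of p.
double evalHomogenized(const std::vector<double>& a, double x, double w) {
    int r = (int)a.size() - 1;
    double s = 0;
    for (int i = 0; i <= r; ++i) s += a[i] * std::pow(x, i) * std::pow(w, r - i);
    return s;
}
```

Setting w = 1 recovers p, and scaling both arguments by t scales the result by t^r, which is exactly homogeneity of degree r.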

## Indexing, iteration and combinatorics

So with that stuff above all out of the way, we shall now settle on an implementation: we’re going to use symmetric tensors to represent homogeneous polynomials.  Let us start off by listing a few basic properties about symmetric tensors:

1. A tensor $A_{ijk...}$ is called symmetric if for all permutations of an index, $ijk..., jki...$ and so on, $A_{ijk...} = A_{jki...} = ...$.
2. As a result, all vectors (aka rank-1 tensors) are trivially symmetric.
3. Similarly, a matrix is symmetric if and only if $M_{ij} = M_{ji}$.
4. A rank 3 tensor is symmetric if $M_{ijk} = M_{ikj} = M_{jik} = M_{jki} = M_{kij} = M_{kji}$ (and so on).
5. Consequently, the dimensions of each of the indices in a symmetric tensor must all be equal, and so we shall sometimes refer to this quantity as the “dimension of the tensor”, or when we need to be more specific we shall call it the index dimension.
6. The dimension of the space of all symmetric tensors with rank r and (index) dimension d is ${ r + d - 1 \choose d - 1 }$.

And so the first problem that we must solve is to figure out how to pack the coordinates of a general rank r, dimension d symmetric tensor into an array of size ${ r + d - 1 \choose d - 1 }$.  Closely related to this issue is the problem of indexing, and we are going to need methods for both randomly accessing elements of the tensor and for performing sequential iteration over the unique entries.  In fact, it is really this last issue which is the most essential, since if we can define an efficient method for enumerating the unique elements of the tensor, then we can simply use the order in which we enumerate the entries of the tensor as the indexes into a flat array.

By definition, the entries of a symmetric tensor are invariant under index permutation.  This leads to two distinct ways to index into a symmetric rank r dimension d tensor:

1. Tensor style indexing – This is the ordinary way that we write the index for a generic tensor using subscripts.  Entries of the tensor are parameterized by a length r vector of integers, $(i_0, i_1, ... i_{r-1})$ where $0 \leq i_k < d$ for all $0 \leq k < r$.
2. Polynomial style (degree) indexing – This style of indexing is unique to symmetric tensors and comes from their close association with polynomials.  Here, we take the index to be the degrees of one of the monomial terms in the corresponding homogeneous polynomial.  These indices are enumerated by length d vectors of integers, $(a_0, a_1, ... a_{d-1})$ where $a_k \geq 0$ for all k, and $\sum \limits_{k=0}^{d-1} a_k = r$.

We can convert between these two types of indexes up to a point.  Converting from tensor indexing to degree indexing is straightforward, where we just pick

$a_k =$ the number of elements in $(i_0, ... i_{r-1})$ that are equal to k
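In code, this conversion is just a counting pass (a small illustrative helper):

```cpp
#include <cassert>
#include <vector>

// Convert a tensor-style index (length r, entries in 0..d-1) to a
// polynomial/degree-style index (length d, entries summing to r):
// a_k = number of entries of the tensor index equal to k.
std::vector<int> tensorToDegree(const std::vector<int>& ti, int d) {
    std::vector<int> a(d, 0);
    for (int i : ti) ++a[i];
    return a;
}
```

For example, tensorToDegree({2,1,0}, 4) gives (1,1,1,0): one factor each of $x_0$, $x_1$ and $x_2$.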

Going the other way is a bit harder, since there are multiple tensor indices which map to the same degree index.  To resolve the ambiguity, we can simply take the lexicographically first item.  This same idea also turns out to be useful for enumerating the entries within our tensor.  For example, suppose that we wanted to enumerate the components of a rank 3, dimension 4 tensor.  Then, we could use the following ordering,

Pos.   Tensor   Degree
-------------------------
0      (0,0,0)  (3,0,0,0)
1      (1,0,0)  (2,1,0,0)
2      (2,0,0)  (2,0,1,0)
3      (3,0,0)  (2,0,0,1)
4      (1,1,0)  (1,2,0,0)
5      (2,1,0)  (1,1,1,0)
6      (3,1,0)  (1,1,0,1)
7      (2,2,0)  (1,0,2,0)
8      (3,2,0)  (1,0,1,1)
9      (3,3,0)  (1,0,0,2)
10     (1,1,1)  (0,3,0,0)
11     (2,1,1)  (0,2,1,0)
12     (3,1,1)  (0,2,0,1)
13     (2,2,1)  (0,1,2,0)
14     (3,2,1)  (0,1,1,1)
15     (3,3,1)  (0,1,0,2)
16     (2,2,2)  (0,0,3,0)
17     (3,2,2)  (0,0,2,1)
18     (3,3,2)  (0,0,1,2)
19     (3,3,3)  (0,0,0,3)

Here, the left hand column represents the order in which the entries of the tensor were visited, while the middle column is the set of unique tensor indices listed lexicographically.  The right hand side is the corresponding degree index, which is remarkably in colexicographic order!  This suggests other interesting patterns as well, such as the following little theorem:

Proposition: Colexicographically sort all the degree indices in a rank r, d-dimensional symmetric tensor.  Then the position of the degree index, $(a_0, a_1, ... a_{d-1})$, is

$\sum \limits_{j=1}^n {{ j-1+\sum \limits_{k=n-j}^n a_k} \choose j}$

I will leave the proof as an exercise to the reader, but offer the hint that it can be proven directly by induction.  It is also interesting to note that the formula for the position does not depend on either the rank of the tensor or the first coefficient.
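Reading the proposition with $n = d - 1$ (my interpretation, which makes the claimed independence from the rank and from $a_0$ explicit), the formula can be checked directly against the enumeration table above:

```cpp
#include <cassert>
#include <vector>

// Binomial coefficient; returns 0 when k > n (empty-sum convention).
long long binom(long long n, long long k) {
    if (k < 0 || k > n) return 0;
    long long r = 1;
    for (long long i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Position of a degree index (a_0, ..., a_{d-1}) in colexicographic order,
// reading the proposition with n = d - 1:
//   pos = sum_{j=1}^{d-1} C(j - 1 + a_{d-j} + ... + a_{d-1}, j)
long long degreePosition(const std::vector<int>& a) {
    int d = (int)a.size();
    long long pos = 0, suffix = 0;
    for (int j = 1; j <= d - 1; ++j) {
        suffix += a[d - j];              // suffix = a_{d-j} + ... + a_{d-1}
        pos += binom(j - 1 + suffix, j);
    }
    return pos;
}
```

Every degree index in the table above lands at its listed position, e.g. (1,1,1,0) at position 5 and (0,0,0,3) at position 19.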

## Putting it all together

Now let us try to work out a method for iterating the tensor index incrementally.  A simple strategy is to use a greedy method:  We just scan over the index until we find a component that can be incremented, then scan backwards and reset all the previous indices.  In pseudo-C++, the code looks something like this:

bool next() {
    for(int i=0; i<rank; ++i) {
        if(tensor_index[i] < dimension - 1) {
            // Found a component that can be incremented
            ++tensor_index[i];
            // Set all previous components equal to it
            for(int j=i-1; j>=0; --j) {
                tensor_index[j] = tensor_index[i];
            }
            return true;
        }
    }
    // Traversal complete
    return false;
}

The tensor_index array contains rank elements, and is initialized to all 0’s before this code is called.  Each call to next() advances tensor_index one position.
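For reference, here is a self-contained sketch of the same iteration wrapped in a tiny class (the names `TensorIndexIterator` and `count_indices` are mine, not from the github code):

```cpp
#include <cassert>
#include <vector>

// Self-contained version of the greedy iteration above.
struct TensorIndexIterator {
    int rank, dimension;
    std::vector<int> tensor_index;

    TensorIndexIterator(int r, int d)
        : rank(r), dimension(d), tensor_index(r, 0) {}

    // Advance to the next index; returns false when the traversal is done.
    bool next() {
        for (int i = 0; i < rank; ++i) {
            if (tensor_index[i] < dimension - 1) {
                ++tensor_index[i];
                for (int j = i - 1; j >= 0; --j)
                    tensor_index[j] = tensor_index[i];
                return true;
            }
        }
        return false;
    }
};

// Count how many distinct indices the iterator visits.
int count_indices(int rank, int dimension) {
    TensorIndexIterator it(rank, dimension);
    int count = 1;                 // the initial all-zeros index
    while (it.next()) ++count;
    return count;
}
```

For rank 3, dimension 4 this visits exactly the 20 indices listed in the table above; in general the count is the number of multisets, $\binom{d+r-1}{r}$.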

Superficially, there are some similarities between this algorithm and the method for incrementing a binary counter.  As a result, I propose the following conjecture (which I have not investigated carefully enough to state with confidence):

Conjecture: Amortized over the course of a full traversal, the cost of next() is constant.

Using all this stuff, I’ve put together a little C++ class which iterates over the coordinates of a symmetric tensor.  I’ve also uploaded it to github, where you can go download and play around with it (though it isn’t much).  I hope to keep up this trend with future updates.  For those who are curious, here is the URL:

https://github.com/mikolalysenko/0fpsBlog

## Next time

In the next episode, I am going to continue on this journey and build an actual symmetric tensor class with a few useful operations in it.  More speculatively, we shall then move on to rational mappings, algebraic surfaces, transfinite interpolation and deformable objects!

## Multidimensional Taylor Series and Symmetric Tensors

As a warm up post, I'm going to talk about an important generalization of something that should be familiar to anyone who has made it through a semester of calculus: Taylor series!  (And if you haven't seen these guys before, or are perhaps feeling a bit rusty, then by all means please head on over to Khan Academy to quickly brush up.  Go right ahead, I'll still be here when you get back.)

Now this probably sounds like an awfully scary title for the first article in this miniseries about mathematics in graphics programming.  Perhaps you’re right!  But I’d like to think that graphics programmers are a tough bunch, and that using this kind of a name may indeed have quite the opposite effect of emboldening those cocky individuals who think they can press on and figure things out in spite of any frightening mathematical jargon.

## Taylor Series

At the very heart of this discussion we are going to deal with two of the most important tasks any graphics programmer needs to worry about:  approximation and bookkeeping.  Taylor series are of course one of the oldest and best known methods for approximating functions.  You can think of Taylor series in a couple of ways.  One possibility is to imagine that they successively approximate a given input function by adding additional polynomial terms to it.  So the idea is that you can start with some arbitrary 1D function, let's call it $f : R^1 \to R$, and suppose that you are allowed the following two operations:

1. You can evaluate the function at 0.
2. You can take a derivative, $\partial_x f(x)$.

Then, we can compute the Taylor series expansion of f about 0 in the usual way,

$f(x) = f(0) + (\partial_x f)(0) x + (\partial_{x} \partial_x f)(0) \frac{x^2}{2!} + (\partial_{x} \partial_x \partial_x f)(0) \frac{x^3}{3!} + ...$ and so on

If we’re really slick, we can save the first $k$ coefficients of these polynomials in a vector, say $a_0, a_1, a_2, \ldots a_{k-1}$, and then we can evaluate some approximation of f by summing up the first k terms in the above expansion:

$f(x) = a_0 + a_1 x + a_2 \frac{x^2}{2!} + a_3 \frac{x^3}{3!} + ... a_{k-1} \frac{x^{k-1}}{(k-1)!} + O(x^{k})$

Here is an animated gif showing the convergence of the Taylor series for the exponential function that I shamelessly ripped off from wikipedia.
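As a tiny worked example (the function name `exp_taylor` is my own), here is the truncated expansion for the exponential, where every coefficient $a_i$ is 1:

```cpp
#include <cassert>
#include <cmath>

// Sum the first k terms of the Taylor series of exp about 0:
//   1 + x + x^2/2! + ... + x^(k-1)/(k-1)!
double exp_taylor(double x, int k) {
    double sum = 0.0;
    double term = 1.0;              // term = x^i / i!
    for (int i = 0; i < k; ++i) {
        sum += term;
        term *= x / (i + 1);        // next term in the series
    }
    return sum;
}
```

With k = 15 terms, the approximation at x = 1 already agrees with std::exp(1.0) to better than 1e-9.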

## Higher Dimensional Taylor Series

It is easy to adapt Taylor series to deal with vector-valued functions of a single variable: you just treat each component separately.  But what if the domain of f is not one dimensional?  This could easily happen, for example, if you were trying to approximate something like a surface, a height field, or an image filter locally.  Well, in this post I will show you a slick way to deal with Taylor series of all sizes, shapes and dimensions!  But before I do that, let me show you what the ugly and naive approach to this problem looks like:

Suppose that $f (x,y) : R^2 \to R$ is a 2D scalar-valued function. Then, let us look for a best quadratic (aka degree 2) approximation to f within the region near $(0,0)$.  By analogy to the 1D case, we want to find some 2nd order polynomial $p(x,y)$ in two variables such that:

$p(x,y) = a_{00} + a_{10} x + a_{01} y + a_{20} x^2 + a_{11} xy + a_{02} y^2$

And:

$p(0,0) = f(0,0)$

$(\partial_x p)(0,0) = (\partial_x f)(0,0)$

$(\partial_y p)(0,0) = (\partial_y f)(0,0)$

$(\partial_{x} \partial_x p)(0,0) = (\partial_{xx} f)(0,0)$

$(\partial_{x} \partial_y p)(0,0) = (\partial_{xy} f)(0,0)$

$(\partial_{y} \partial_y p)(0,0) = (\partial_{yy} f)(0,0)$

Phew!  That’s a lot more equations and coefficients than we got in the 1D case for a second order approximation.  Let’s work through solving for the coefficient $a_{20}$:  To do this, we take p and plug it into the appropriate equation:

$\left( \partial_{x} \partial_x (a_{00} + a_{10} x + a_{01} y + a_{20} x^2 + a_{11} xy + a_{02} y^2) \right) = 2 a_{20} = (\partial_{x} \partial_x f)(0,0)$

If we generalize this idea a bit, then we can see that for an arbitrary Taylor series approximation in 2D, the coefficient $a_{ij}$ is determined by the equation,

$a_{ij} = \frac{1}{i! j!} (\partial_{x^i y^j} f)(0,0)$

All this is well and good, but it has a few problems. First, the summation for p is quite disorganized.  How are we supposed to keep track of which coefficients go where?  If we want to store the coefficients of p on a computer, then we need some less ad-hoc method for naming them and packing them into memory.  Second, it seems like this is going to be a headache to deal with 3 or more dimensions, since we’ll need to basically repeat the same sort of reasoning over and over again.  As a result, if we want to start playing around with higher dimensional Taylor series, we’re going to need a more principled notation for dealing with higher order multivariate polynomials.
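To make the bookkeeping concrete, here is a small sketch of the coefficient formula in action.  The example function $f(x,y) = e^{x+2y}$ and the name `taylor2` are my own; its partial derivatives at the origin are easy to write down by hand:

```cpp
#include <cassert>
#include <cmath>

// Degree 2 Taylor approximation of f(x,y) = exp(x + 2y) about (0,0).
// The partials at the origin are f = 1, f_x = 1, f_y = 2, f_xx = 1,
// f_xy = 2, f_yy = 4, and a_ij = (1/(i! j!)) (d^{i+j} f)(0,0) gives:
double taylor2(double x, double y) {
    return 1.0            // a_00
         + 1.0 * x        // a_10
         + 2.0 * y        // a_01
         + 0.5 * x * x    // a_20 = f_xx / 2!
         + 2.0 * x * y    // a_11 = f_xy
         + 2.0 * y * y;   // a_02 = f_yy / 2!
}
```

Near the origin the error is third order, so at (0.01, 0.01) the approximation matches exp(0.03) to within about 5e-6.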

## Tensors to the Rescue!

One such system is tensor notation!  Though often maligned by mathematical purists, algebraic geometers, and those more modern algebraically savvy category theorists, tensors are a simple and effective notation that dramatically simplifies calculations.  From a very simplistic point of view, a tensor is just a multidimensional array.  The rank of the tensor is the number of indices, each of which has its own dimension.  In C-style syntax, you could declare a tensor in the following way:

float vector[10];                          // A rank 1 tensor, with dimension (10)
float matrix[5][5];                        // A rank 2 tensor, with dimensions (5, 5)
float spreadsheet[ROWS][COLUMNS][PAGES];   // A rank 3 tensor, with dimensions (ROWS, COLUMNS, PAGES)
float crazy_thing[10][16][3][8];           // A rank 4 tensor, with dimensions (10, 16, 3, 8)

We will usually write tensors as upper case letters.  To reference an individual entry in this array, we will use an ordered subscript, like so:

$M_{ij}$

Which we can take to be equivalent to the C-code:

  M[i][j]

For the rest of the article, we are going to stick with this point of view that tensors are just big arrays of numbers.  We aren’t going to bother worrying about things like covariance/contravariance (and if you don’t know what those words are, forget I said anything), nor are we going to mess around too much with operators like tensor products.  There is nothing wrong with doing this, though it can be a bit narrow-minded and it does somewhat limit the applications to which tensors may be applied.  If it bothers you to think about tensors in this way, here is a more algebraic/operational picture of what a tensor does: much as a row vector can represent a scalar-valued linear function, $R^n \to R$ (via the dot product), a tensor can represent a multilinear function, $R^n \times R^m \times \ldots \to R$: that is, it takes in several vectors and spits out a scalar which varies linearly in each of its arguments.

That said, even if we just think about tensors as arrays, there are still a number of useful, purely formal operations that one can perform on them.  For example, you can add them up just like vectors.  If A and B are two tensors with the same dimensions, then we can define their sum componentwise as follows:

$(A + B)_{i} = A_{i} + B_{i}$

Where we take the symbol i to be a generic index here.  The other important operation that we will need to use is called tensor contraction.  You can think of tensor contraction as being something like a generalization of both the dot product for vectors and matrix multiplication.  Here is the idea:  Suppose that you have two tensors $A, B$ with ranks n, m respectively, and that some index of A and some index of B share a common dimension d.  Then we can form a new tensor of rank n + m - 2 by summing over that shared index in A and B simultaneously:

$C_{a_0, \ldots, a_{n-2}, b_0, \ldots, b_{m-2}} = \sum \limits_{i=0}^{d-1} A_{a_0, \ldots, i, \ldots, a_{n-2}} B_{b_0, \ldots, i, \ldots, b_{m-2}}$

Writing all this out is pretty awkward, so mathematicians use a slick short hand called Einstein summation convention to save space.  Here is the idea:  Any time that you see a repeated index in a product of tensor coefficients that are written next to each other, you sum over that index.  For example, you can use Einstein notation to write the dot product of two vectors, $x, y$, as follows:

$x_i y_i = \sum \limits_{i=0}^{d-1} x_i y_i = x \cdot y$

Similarly, suppose that you have a matrix M, then you can write the linear transformation of a vector x by M in the same shorthand,

$y_j = M_{ji} x_i$

Which beats having to remember the rows-by-columns rule for multiplying vectors!  Similarly, you can multiply two matrices using the same type of notation,

$(A B)_{i j} = A_{ik} B_{kj}$

It even works for computing things like the trace of a matrix:

$\mathrm{trace}(M) = M_{ii} = \sum \limits_{i} M_{ii}$
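Since contraction is nothing more than a sum over a repeated index, these examples translate directly into loops.  Here is a quick sketch (the names and the fixed dimension are mine) of $y_j = M_{ji} x_i$ and the trace $M_{ii}$:

```cpp
#include <cassert>

const int D = 2;    // dimension of the repeated index

// y_j = M_{ji} x_i : contract M against x over the index i.
void mat_vec(const double M[D][D], const double x[D], double y[D]) {
    for (int j = 0; j < D; ++j) {
        y[j] = 0.0;
        for (int i = 0; i < D; ++i)   // sum over the repeated index i
            y[j] += M[j][i] * x[i];
    }
}

// trace(M) = M_{ii} : contract a rank 2 tensor against itself.
double trace(const double M[D][D]) {
    double t = 0.0;
    for (int i = 0; i < D; ++i) t += M[i][i];
    return t;
}
```

The Einstein convention is just a compact way of writing exactly these inner loops.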

We can also use tensor notation to deal with things like computing a gradient.  For example, we write the derivative of a function $f : R^n \to R$ with respect to the $i^{th}$ component as $\partial_i f$.  Combined with Einstein’s notation, we can also perform operations such as taking a gradient of a function along a certain direction.  If $v$ is some direction vector, then the derivative of $f$ along $v$ evaluated at the point $x$ is,

$(\partial_i f)(x) v_i$

## Symmetric Tensors and Homogeneous Polynomials

Going back to multidimensional Taylor series, how can we use all this notation to help us deal with polynomials?  Well, let us define a vector $x = (x_0, x_1, ... )$ whose components are just the usual $(x, y, z ...)$ coordinate variables, and pick some rank 1 tensor A with appropriate dimension.  If we just plug in x, then we get the following expression:

$A_i x_i = A_0 x_0 + A_1 x_1 + ...$

Which is of course just a linear function on x! What if we wanted to make a quadratic function?  Again, using tensor contraction this is no big deal:

$A_{i j} x_{i} x_{j} = A_{0 0} x_0 x_0 + A_{0 1} x_0 x_1 + A_{1 0} x_1 x_0 + A_{1 1} x_1 x_1 + ...$

Neat!  This gives us a quadratic multivariate polynomial in x.  Moreover, it is also homogeneous, which means that it doesn’t have any degree 1 or lower terms sitting around.  In fact, generalizing from this pattern, if we wanted to, say, store an arbitrary degree n homogeneous polynomial, we could pack all of its coefficients into a rank n tensor and evaluate by contracting:

$A_{i j k ... l} x_i x_j x_k ... x_l = A_{0 0 .... 0} x_0^n + A_{0 0 .... 1} x_0^{n-1} x_1 + ...$
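For the degree 2 case, the contraction $A_{ij} x_i x_j$ is just the familiar quadratic form, which we can sketch directly in code (the example values and the name `quadratic_form` are my own):

```cpp
#include <cassert>

// Evaluate the homogeneous quadratic A_{ij} x_i x_j by contracting the
// rank 2 tensor A twice against the vector x.
double quadratic_form(const double A[2][2], const double x[2]) {
    double sum = 0.0;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            sum += A[i][j] * x[i] * x[j];
    return sum;
}
```

Note that the off-diagonal entries A[0][1] and A[1][0] both multiply the same monomial $x_0 x_1$, which is exactly the redundancy discussed next.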

But there is some redundancy here.  Notice that in the degree 2 case, we are assuming that the components of x commute, or in other words the terms $x_0 x_1, x_1 x_0$ are really the same, and so we don’t actually need to store both of the coefficients $A_{10}$ and $A_{01}$.  We could make our lives a bit easier if we just assumed that they were equal.  In fact, it would be really nice if whenever we took any index, like say $A_{i j k \ldots}$, and permuted it arbitrarily, for example to $A_{j i k \ldots}$, it always gave us back the same thing.  To see why, let us work out the coefficient of the monomial $x_{i_1} x_{i_2} \cdots x_{i_n}$ in the degree n polynomial given by $A_{i j k \ldots} x_i x_j x_k \cdots$.  Directly expanding using Einstein summation, we get a sum over all permutations of the indices $i_1, \ldots, i_n$:

$\sum \limits_{ \sigma \in \mathrm{perms}(i_1, \ldots, i_n) } A_{\sigma_1 \sigma_2 \ldots \sigma_n} x_{i_1} x_{i_2} \cdots x_{i_n}$

If we assume that all those $A_{\ldots}$ coefficients are identical, then the number of terms in that nasty summation has a pretty simple form:  namely, it is a multinomial coefficient!  As a result, if we wanted to find the coefficient of $x^2_0 x_1$ in $A_{i j k} x_i x_j x_k$, then we could just use the following simple formula:

${3 \choose 2, 1, 0} A_{0 0 1} = \frac{3!}{2! \, 1! \, 0!} A_{0 0 1} = 3 A_{0 0 1}$

A tensor which has the property that its coefficients are invariant under permutation of its indices is called a symmetric tensor, and as we have just seen symmetric tensors provide a particularly efficient method for representing homogeneous polynomials.  But wait!  There’s more…

If we pick $A_{0 0 1} = (\partial_{001} f)(0)$, then the above formula is almost exactly the right formula for the Taylor series coefficient of $x_0^2 x_1$ in the expansion of $f$.  The only thing is that we are off by a factor of $3!$, but no matter, we can just divide that off.  Taking this into account, we get the following striking formula for the Taylor series expansion of a function $f : R^n \to R$ about the origin,

$p(x) = f(0) + (\partial_{i} f)(0) x_i + \frac{1}{2!} (\partial_{ij} f)(0) x_i x_j + \frac{1}{3!} (\partial_{ijk} f)(0) x_i x_j x_k + \ldots$

Which is remarkably close to the formula for 1D Taylor series expansions!

## Next Time

I will show how to actually implement symmetric tensors in C++11, and give some example applications of multidimensional Taylor series in implicit function modeling and non-linear deformations!  I’ll also show how the above expansion can be simplified even further by working in projective coordinates.