Meshing in a Minecraft Game

The last post I wrote on Minecraft-like engines got a lot of attention, and so I figured it might be interesting to write a follow up.  This time I’ll try to tackle a different problem:


One of the main innovations in Infiniminer (and Minecraft) is that they use polygons instead of raycasting to render their volumes.  The main challenge in using polygons is figuring out how to convert the voxels into polygons efficiently.  This process is called meshing, and in this post I’m going to discuss three different algorithms for solving this problem.   But before doing this, we first need to say a bit more precisely what exactly meshing is (in the context of Minecraft maps) and what it means to do meshing well.  For the sake of exposition, we’ll consider meshing in a very simplified setting:

  • Instead of having multiple block types with different textures and lighting conditions, we’ll suppose that our world is just a binary 0-1 voxel map.
  • Similarly we won’t make any assumptions about the position of the camera, nor will we consider any level of detail type optimizations.
  • Finally, we shall suppose that our chunks are stored naively in a flat bitmap (array).

I hope you are convinced that making these assumptions does not appreciably change the core of the meshing problem (and if not, well I’ll try to sketch out in the conclusion how the naive version of these algorithms could be extended to handle these special cases.)

In a typical Minecraft game chunks do not get modified that often compared to how frequently they are drawn.  As a result, it is quite sensible to cache the results from a mesher and only ever call it when the geometry changes.  Thus, over the lifetime of a mesh, the total amount of computational work it consumes is asymptotically dominated by the cost of rendering.

Recall that at a high level polygon rendering breaks down into two parts: 1.) primitive assembly, and 2.) scan conversion (aka polygon filling).  In general, there really isn’t any way to change the number of pixels/fragments which need to be drawn without changing either the resolution, framerate or scene geometry; and so for the purposes of rendering we can regard this as a fixed cost (this is our second assumption). That leaves primitive assembly, and the cost of this operation is strictly a function of the number of faces and vertices in the mesh.  The only way this cost can be reduced is by reducing the total number of polygons in the mesh, and so we make the following blanket statement about Minecraft meshes:

Criteria 1: Smaller meshes are better meshes.

Of course, the above analysis is at least a little naive.  While it is true that chunk updates are rare, when they do happen we would like to respond to them quickly.  It would be unacceptable to wait for more than a frame or two to update the displayed geometry in response to some user change.  Therefore, we come to our second (and final) criteria for assessing mesh algorithms:

Criteria 2: The latency of meshing cannot be too high.

Speed vs. Quality

Intuitively, it seems like there should be some tradeoff here.  One could imagine a super-mesher that does a big brute force search for the best mesh compared to a faster, but dumber method that generates a suboptimal mesh.  Supposing that we were given two such algorithms, it is not a-priori clear which method we should prefer.  Over the long term, maybe we would end up getting a better FPS with the sophisticated method, while in the short term the dumb algorithm might be more responsive (but we’d pay for it down the road).

Fortunately, there is a simple solution to this impasse.  If we have a fast, but dumb, mesher; we can just run that on any meshes that have changed recently, and mark the geometry that changed for remeshing later on.  Then, when we have some idle cycles (or alternatively in another thread) we can run the better mesher on them at some later point.  Once the high quality mesh is built, the low quality mesh can then be replaced.  In this way, one could conceivably achieve the best of both worlds.

Some Meshing Algorithms

The conclusion of this digression is that of the two of these criteria, it is item 1 which is ultimately the most important (and unfortunately also the most difficult) to address.  Thus our main focus in the analysis of meshing algorithms will be on how many quads they spit out.  So with the preliminaries settles, let’s investigate some possible approaches:

The Stupid Method:

The absolute dumbest way to generate a Minecraft mesh is to just iterate over each voxel and generate one 6-sided cube per solid voxel.  For example, here is the mesh you would get using this approach on a 8x8x8 solid region:

The time complexity of this method is linear in the number of voxels.  Similarly, the number of quads used is 6 * number of filled voxels.  For example, in this case the number of quads in the mesh is 8*8*8*6 = 3072.  Clearly this is pretty terrible, and not something you would ever want to do.  In a moment, we’ll see several easy ways to improve on this, though it is perhaps worth noting that for suitably small geometries it can actually be workable.

The only thing good about this approach is that it is almost impossible to screw up.  For example, I was able to code it up correctly in minutes and it worked the first time (compared to the culling method which took me a few iterations to get the edge cases right.)  Even if you don’t plan on using this approach it can still be handy to have it around as a reference implementation for debugging other algorithms.


Clearly the above method is pretty bad.  One obvious improvement is to just cull out the interior faces, leaving only quads on the surface.  For example, here is the mesh you get from culling the same 8x8x8 solid cube:

Practically speaking, this method is pretty much the same as the previous, except we have to do a bit of extra book keeping when we traverse the volume.  The reason for this is that now we not only have to read each voxel, but we we also have to scan through their neighbors.  This requires a little more thought when coding, but time complexity-wise the cost of generating this mesh is asymptotically the same as before.  The real improvement comes in the reduction of quads.  Unlike the stupid method, which generates a number of quads proportional to the boundary, culling produces quads proportional to the surface area.  In this case, the total number of faces generated is 8*8*6 = 384 — a factor of 8x better!

It is not hard to see that the culled mesh is strictly smaller than the stupid mesh (and often by quite a lot!).  We can try to estimate how much smaller using a dimensional analysis argument:  assuming that our geometry is more-or-less spherical, let n be its radius.  Then, in the stupid method we would expect to generate about O(n^3) quads, while in the culled  version we’d get only O(n^2).  This gives an improvement of a factor of about O(n), which is pretty good!  So, if our chunks were say 16x16x16, then we might expect the culled volume to have about 16x fewer faces than the naive method (heuristically).  Of course, spherical geometry is in some sense the best possible case.  If your world is really jagged and has lots of holes, you wouldn’t expect much improvement at all.  In the worst case (namely a checkerboard world) the size of the meshes produced would be identical!  In practice, one would usually expect something somewhere in between.

Greedy Meshing:

The previous two approaches are probably the most frequently cited methods for generating Minecraft-like meshes, but they are quite far from optimal.  The last method that I will talk about today is a greedy algorithm which merges adjacent quads together into larger regions to reduce the total size of the geometry.  For example, we could try to mesh the previous cube by fusing all the faces along each side together:

This gives a mesh with only 6 quads (which is actually optimal in this case!)  Clearly this improvement by a factor of 64 is quite significant.  To explain how this works, we need to use two ideas.  First, observe that that it suffices to consider only the problem of generating a quad mesh for some 2D cross section.  That is we can just sweep the volume across once in each direction and mesh each cross section separately:

This reduces the 3D problem to 2D.  The next step (and the hard one!) is to figure out how to mesh each of these 2D slices.  Doing this optimally (that is with the fewest quads) is quite hard.  So instead, let’s reformulate the problem as a different type of optimization.  The idea is that we are going to define some type of total order on the set of all possible quadrangulations, and then pick the minimal element of this set as our mesh.  To do this, we will first define an order on each quad, which we will then extend to an order on the set of all meshes.  One simple way to order two quads is to sort them top-to-bottom, left-to-right and then compare by their length.

More formally, let’s denote a quad by a tuple (x,y,w,h), where (x,y) is the upper left corner and (w,h) is the width / height of the quad.  Given such a representation of a quad, we denote its underlying set by:

Q_{x,y,w,h} = \{ (s,t) \in \mathbb{R}^2 : x \leq s \leq x + w \: \mathrm{ and } \: y \leq t \leq y+h \} = [x,x+w] \times [y,y+h]

By a 2D quad mesh, here we mean a partition of some set S into a collection of quads Q_i.  Now let’s define a total order on these quads, given (x_0, y_0, w_0, h_0), (x_1, y_1, w_1, h_1) define the relation:

(x_0, y_0, w_0, h_0) \leq (x_1, y_1, w_1, h_1) \Leftrightarrow \left \{ \begin{array}{cc} y_0 < y_1 & \mathrm{if } \: y_0 \neq y_1 \\ x_0 < x_1 & \mathrm{if } \: x_0 \neq x_1 \\ w_0 > w_1 & \mathrm{if } \: w_0 \neq w_1 \\ h_0 \geq h_1 & \mathrm{otherwise} \end{array} \right.

Or if you prefer to read psuedocode, we could write it like this:

   compareQuads( (x0, y0, w0, h0), (x1, y1, w1, h1) ):
        if ( y0 != y1 )   return y0 < y1
        if ( x0 != x1 )   return x0 < x1
        if ( w0 != w1 )   return w0 > w1
        return h0 >= h1

Exercise: Check that this is in fact a total order.

The next thing we want to do is extend this ordering on quads to an ordering on meshes, and we do this in a very simple way.  Given two sorted sequences of quads ( q_0, q_1, ... ), ( p_0, p_1, ... ) such that q_i \leq q_{i+1} and and p_i \leq p_{i+1}, we can simply compare them lexicographically.  Again, this new ordering is in fact a total order, and this means that it has a unique least element.  In greedy meshing, we are going to take this least element as our mesh, which we’ll call the greedy mesh.  Here is an example of what one of these lexicographically minimal meshes looks like:

You may be skeptical that finding the least element in this new order always produces a better quad mesh.  For example, there is no guarantee that the least element in this new order is also the mesh with the fewest quads.  Consider what you might get when quad meshing a T-shape using this method:

Exercise 2:  Find a mesh of this shape which uses fewer quads.

However, we can say a few things.  For example, it always true that each quad in the greedy mesh completely covers at least one edge on the perimeter of the region.  Therefore, we can prove the following:

Proposition: The number of quads in the greedy mesh is strictly less than the number of edges in the perimter of the region.

This means that in the worst case, we should expect the size of the greedy mesh to be roughly proportional to the number of edges.  By a heuristic/dimensional analysis, it would be reasonable to estimate that the greedy mesh is typically about O( \sqrt(n) ) times smaller than the culled mesh for spherical geometries (where n is the radius of the sphere as before).  But we can be more precise about how good the greedy mesh actually is:

Theorem: The greedy mesh has no more than 8x as many quads than the optimal mesh.

That’s within a constant factor of optimality!  To prove this, we’ll need to introduce a bit of nomenclature first.  We’ll call a vertex on the perimeter of a region convex if it turns to the right when walking around the mesh clockwise, and otherwise we’ll call it concave.  Here is a picture, with the convex vertices colored in red and the concave vertices colored in green:

There’s a neat fact about these numbers.  It turns out that if you sum them up, you always get the same thing:

Proposition: For any simply connected region with V_+ convex and V_- concave vertices on the perimeter, V_+ - V_- = 4.

This can be proven using the winding number theorem from calculus.  But we’re going to apply this theorem to get a lower bound on the number of quads in an arbitrary mesh:

Lemma: Any connected genus G region with a perimeter of E edges requires at least \frac{E + 2G + 4}{8} quads to mesh.

Proof:  Let V_+, V_- denote the number of convex/concave vertices on the perimeter respectively.  Because each quad can cover at most 4 convex vertices, the number of quads, N, in the mesh is always at least N \geq \frac{V_+}{4}.  By the winding number theorem, it is true that for simply connected regions V_+ = V_- + 4, and so V_+ = \frac{E}{2} + 2.  Therefore any mesh of a simply connected region requires at least \frac{E + 4}{8} quads.  To extend this to non-simply connected regions, we observe that any quad mesh can be cut into a simply connected one by slicing along some edges in the quad mesh connecting out to the boundary.  Making this cut requires introducing at most two edges per each hole in the object, and so we could just as well treat this as a simply connected region having E + 2 G edges, and applying the previous result gives us the bound in the lemma. \Box

To prove our main theorem, we just combine these two lemmas together, using the fact that G \geq 0.  In summary, we have the following progression:

Optimal \leq Greedy \leq Culled \leq Stupid

So now that I’ve hopefully convinced you that there are some good theoretical reasons why greedy meshing outperforms more naive methods, I’ll try to say a few words about how you would go about implementing it.  The idea (as in all greedy algorithms) is to grow the mesh by making locally optimal decisions.  To do this, we start from an empty mesh and the find the quad which comes lexicographically first.  Once we’ve located this, we remove it from the region and continue.  When the region is empty, we’re done!  The complexity of this method is again linear in the size of the volume, though in practice it is a bit more expensive since we end up doing multiple passes.  If you are the type who understands code better than prose, you can take a look at this javascript implementation which goes through all the details.

Demo and Experiments

To try out some of these different methods, I put together a quick little three.js demo that you can run in your browser:

Click here to try out the Javascript demo!

Here are some pictures comparing naive meshing with the greedy alternative:


Naive (left), 690 quads vs. Greedy (right), 22 quads


4770 quads vs. 2100 quads


2198 quads vs. 1670 quads

As you can see, greedy meshing strictly outperforms naive meshing despite taking asymptotically the same amount of time.  However, the performance gets worse as the curvature and number of components of the data increase.  This is because the number of edges in each slice increases.  In the extreme case, for a collection of disconnected singleton cubes, they are identical (and optimal).

Conclusion and some conjectures:

Well, that’s it for now.  Of course this is only the simplest version of meshing in a Minecraft game, and in practice you’d want to deal with textures, ambient occlusion and so on (which adds a lot of coding headaches, but not much which is too surprising).  Also I would ask the reader to excuse the fairly rough complexity analysis here.  I’d actually conjecture that the true approximation factor for the greedy algorithm much less than 8, but I’ve spent enough time writing this post so I guess I’ll just leave it as is for now.  I think the real weakness in my analysis lies in the computation of the upper bound on the size of greedy mesh.  I’d actually guess the following is true:

Conjecture: The size of the greedy mesh is at most \frac{E}{2}.

If this conjecture holds true, then it would reduce the current bounds on the approximation factor to 4x instead of 8x.  It’s also not clear that this is the best known greedy approximation.  I’d be interested to learn how Minecraft and other voxel engines out there solve this problem.  From my time playing the game, I got the impression that Minecraft was doing some form of mesh optimization beyond naive culling, though it was not clear to me exactly what was going on (maybe they’re already doing some form greedy meshing?)

It is also worth pointing out that the time complexity of each of these algorithms is optimal (ie linear) for a voxel world which is encoded as a bitmap.  However, it would be interesting to investigate using some other type of data structure.  In the previous post, I talked about using run-length encoding as an alternative in-memory storage of voxels.  But greedy meshing seems to suggest that we could do much better:  why not store the voxels themselves as a 3D cuboid mesh?  Doing this could drastically reduce the amount of memory, and it is plausible that such a representation could also be used to speed up meshing substantially.  Of course the additional complexity associated with implementing something like this is likely to be quite high.  I’ll now end this blog post with the following open problem:

Open Problem: Do there exist efficient data structures for dynamically maintaining greedy meshes?

If you have other questions or comments, please post a response or send me an email!


6/30/12: Corrected an error in the definition of the order on each quad.  Was written (x,y,w,h) when it should have been (y,x,w,h) (Thanks Matt!)

An Analysis of Minecraft-like Engines

Voxel engines are everywhere…

…and given the huge success of Minecraft, I think they are going to become a permanent fixture in the landscape of game engines.  Minecraft (and Infiniminer) live in a unique space somewhere between high-resolution voxels and static prefabricated environments.  Modifying a Minecraft world is both intuitive and elegant; you simply pick up and place blocks.  In many ways, it is the natural 3D generalization of older 2D tile based games.  Of course creating a Minecraft engine is much more challenging than making a 2D engine for a number of reasons.  First of all, it is in 3D and second it is generally expected that the environments must be much more dynamic, with interactive lighting, physics and so on.

When building a voxel game, it is important to choose a data structure for representing the world early on.  This decision more than any other has the greatest impact on the overall performance, flexibility, and scale of the game.  This post discusses some of the possible choices that can be made along these lines, and hopefully give a better explanation of what sort of technical considerations are involved.  In fact, I posit that some commonly held performance assumptions about Minecraft-type engines are probably wrong. In particular, the importance of random-access reads/writes for modifications is often overestimated, while the relative cost of iteration is greatly underestimated.  In concluding this article, I introduce a new data structure that exploits this observation, and ultimately gives much faster iteration (for typical Minecraft-esque maps) at the expense of slightly slower random accesses.

Conventional data structures

But before getting ahead of ourselves, let’s review some of the approaches were are currently being used to solve this problem.  We start with the absolute simplest choice, which is the “flat array”.  In the flat array model, you simply store all of your level data in one gigantic array:

As Minecraft and Infiniminer haver shown us this strategy can be viable — providing you limit the number of voxel types and make the world small enough.  Storing blocks in a big array has many advantages, like constant time random access reads and writes, but it is far from perfect. The most obvious downside is that the amount of memory consumed grows with the cube of the linear dimension, and that they are ultimately fixed in size.  These properties make them suitable for small, sand-box worlds (like the old-school Minecraft creative mode), but prevent them from being used in an infinite world (like Minecraft survival mode).

The most naive solution to the perceived disadvantages of flat arrays (and I say this because it is almost invariably the first thing that anyone suggests) is to try putting all the voxels in an octree.  An octree is a tree that recursively subdivides space into equal sized octants:

On the face of it, this does sound like a good idea.  Minecraft maps are mostly empty space (or at least piecewise constant regions), and so removing all those empty voxels ought to simplify things quite a bit.  Unfortunately, for those who have tried octrees, the results are typically not so impressive:  in many situations, they turn out to be orders of magnitude slower than flat arrays.  One commonly cited explanation is that access for neighborhood queries times in octrees are too slow, causing needless cache misses and that (re)allocating nodes leads to memory fragmentation.

However, to understand why octrees are slower for Minecraft games, it isn’t really necessary to invoke such exotic explanations.  The underlying reason for the worse performance is purely algorithmic: each time an arbitrary voxel is touched, either when iterating or performing a random access, it is necessary to traverse to the entire height of the octree.  Since octrees are not necessarily balanced, this can be as much as log(maximum size of game world)!  Assuming the coordinate system is say 32 bit, this brings an added overhead of 32 additional non-coherent memory accesses per each operation that touches the map (as opposed to the flat array, where everything is simply constant time).

A more successful method — and even less sophisticated — method is to use paging or “virtual memory” to expand the size of the flat array.  To do this one breaks down the flat 3D array into a collection of pages or “chunks”, and then uses a hash table to sparsely store only a subset of the pages which are currently required by the game:

Pages/chunks are typically assigned in the same way as any virtual memory system: by reserving the upper k-bits of the coordinate for the chunk ID.  Using a virtual  memory system has the additional advantage that chunks can be lazily initialized, which is great for games with procedural content.  The same concept also allows chunks to be mapped back to disk when they are not needed (that is when there are no players nearby).  Using a hash map to store the chunks allows one to maintain constant time random access, while simultaneously taking advantage of sparsity as in an octree.

It seems like the current direction of Minecraft engines has begun to converge on the “virtualization” approach.  (And I base this claim on numerous blog posts).  But is this really the best solution?  After all, before we jump in and try to solve a problem we should at least try to quantitatively understand what is going on first; otherwise we might as well be doing literary criticism instead of computer science.  So I ask the following basic question:

What factors really determine the performance of a voxel engine?

There is a superificial answer to this question, which is “read/write access times”; after all every operation on a Minecraft map can be reduced to simple voxel queries.  If you accept this claim, then a virtualization is optimal — end of story.  However for reasons which I shall soon explain, this assertion is false.  In fact, I hope to persuade you that really it is the following that is true:

Claim:  The main bottleneck in voxel engines is iteration — not random access reads and writes!

The basic premise hinges on the fact that the vast majority of accesses to the voxel database come in the form of iterations over 3D ranges of voxels.  To demonstrate this point let us first consider some of the high level tasks involving the world map that a voxel engine needs to deal with:

  • Placing and removing blocks
  • Collision detection
  • Picking/raytracing
  • Lighting updates
  • Physics updates
  • Mesh generation
  • Disk serialization
  • Network serialization

The last two items here could be considered optional, but I shall leave them in for the sake of completeness (since most voxel games require some level of persistence, and many aspire at least to one day be multiplayer games).  We shall classify each task in terms of whatever operations are required to implement it within the game loop, which are either read/write and random access/iteration.  To measure cost, we count the number of raw voxel operations (read-writes) that occur over some arbitrary interval of gameplay.  The idea here is that we wait some fixed unit of time (t) and count up how many times we hit the voxels (ie the number of bytes we had to read/write).  Under the assumption that the world is much larger than cache (which is invariably true), these memory operations are effectively the limiting factor in any computation involving the game world.

At a fairly coarse level, we can estimate this complexity in terms of  quantities in the game state, like the number of MOBs and players in the world, or in terms of the number of atomic events that occur in the interval, like elapsed frames, chunk updates etc.  This gives an asymptotic estimate for the cost of each task in terms of intelligible units:

Task Operation Cost (in voxel ops)
Block placement Write number of clicks
Collision check Read (number of MOBs) * (frame count)
Picking Read* (pick distance) * (frame count)
Lighting Iteration (RW) (torch radius)^3 * (number of torches) + (active world size) * (sun position changes)
Physics update Iteration (RW) (active world size) * (frame count)
Mesh generation Iteration (R) (chunk size) * (chunk updates)
Disk serialization Iteration (R) (active world size) * (save events)
Network IO Iteration (R) (active world size) * (number of players) + (chunk size) * (chunk updates)
* Assuming that picking is implemented by ray-marching.
Disclaimer:  The entries in this chart are estimates, and I do not claim that the dimensions for each task are 100% accurate.  If you disagree with anything here, please leave a comment.

Our goal is to pinpoint which operation(s) in this chart is the bottleneck, and to do so we need dimensional analysis to convert all our units into a common set of dimensions.  Somewhat arbitrarily, we choose frame count and number players, and so our units will be in (voxel ops * frames * players).  Going through the list of each item, we now try to estimate some plausible bounds for each dimension:

  1. (number of clicks) is the amount of times the user clicked the mouse in the interval.  Since a user can click at most once per frame (and in practice is likely to click much much less), this value is bounded above by:
    • (number of clicks) < (frame count) * (number of players)
  2. (number of MOBs)is the number of active entities, or movable objects in the game.  Assuming that they are not moving too fast (ie there is a maximum speed of a constant number of blocks per frame), then the number of reads due to a collision check is going to be proportional to the number of MOBs and the number of frames.  We assume that number of MOBs in the world is proportional to the number of players, according to some constant k_{MOB},
    • (number of MOBs) = k_{MOB} * (number of players)
  3. (pick distance) is the maximum range a block can be selected by a player, and we shall assume that it is just a small constant.
  4. Again, (torch radius) is a small number.  In Minecraft, this could theoretically be as high as 16, but we will simply leave it as a tiny constant for now.
  5. (number of torches)is the number of torches placed / removed during the interval.  Torch placement is generally an irregular event, and so we estimate that a player will place a torch somewhat infrequently, at a rate goverened by some constant k_{TORCH},
    • (number of torches) = (number of players) * (frame count) * k_{TORCH}
  6. (sun position changes)is the number of times that the sun changes.  This happens regularly every few frames, and so we estimate that,
    • (sun position changes) = (frame count) * k_{SUN}
  7. (active world size)is the size of the world.  It is basically proportional to the number of chunks visible by each player.  Assuming a Minecraft style world with a height cap, this would vary quadratically with the visible radius, r_V, or cubically for a world with unlimited z-value.
    • (active world size) = (number of players) * (chunk size) * r_V^{2}
  8. (chunk updates) occur whenever a chunk changes.  Assuming this is random, but proportional to the size of the active world, we get
    • (chunk updates) = (active chunk size)  / (chunk size) * k_{UPDATE}
  9. (save events) occur at regular intervals and save the map to disk.  Again, because they are regular we can just bound them by a constant times the number of frames.
    • (save events) = (frame count) * k_{SAVE}

Plugging all that in, we get the following we get the following expression for the total amount of random access voxel ops:

(random voxel ops) = C * (number of players) * (frame_count) 

And for the sequential or iteration voxel ops, we get:

(sequential voxel ops) = C * (number of players) * (frame count) * (chunk size) * r_V^2

In general, the last quantity is much larger, by a factor of (chunk size) * r_V^2, or (chunk size) * r_V^3 if you have unlimited height.  So, if you believe these estimates are reasonable, then you should be convinced that iteration by far dominates the performance of a Minecraft style game.  In fact, this estimate explains a few other things, like why visible radius is such an important performance factor in Minecraft.  Increasing the draw radius linearly, degrades the performance quadratically!

Can we make iteration faster?

The main conclusion to draw from the above estimate is that we should really focus our optimization efforts on iteration.  If we can bring that cost down measurably, then we can expect to see large improvements in the game’s performance.  But the question then becomes: is this even possible?  To show that at least in principle it is, we make use of a simple observation:  In any situation where we are iterating, we only need to consider local neighborhoods around each cell.  For example, if you are computing a physics update (in minecraft) you only consider the cells within your Moore neighborhood.  Moreover, these updates are deterministic:  If you have two voxels with the same Moore neighborhood, then the result of applying a physics update to each cell will be the same.  As a result, if we can iterate over the cells in groups which all have the same neighborhood, then we can apply the same update to all cells in that group at once!

This is the basic idea behind our strategy for speeding up iteration.  We will now try to explain this principle in more detail.  Let B be the set of block types, and let G : \mathbb{Z}^3 \to B be a map representing our world (ie an assignment of block types to integral coordinates on a regular grid).  A radius r-Moore neighborhood is an element of the set B^{2r+1} \times B^{2r + 1} \times B^{2r + 1} \cong \{ f : \mathbb{Z}_{2r + 1}^3 \to B \}.  The Moore neighborhood of a coordinate (i,j,k) in G is determined by a map, M_{r} : \mathbb{Z}^3 \to (\mathbb{Z}_{2 r + 1}^3 \to B) such that,

M_r(i,j,k)(o, p, q) = G(i + o - r, j + p - r, k + q - r)

Then an update rule is a map sending a Moore neighborhood to a block, P : B^{2r+1} \times B^{2r + 1} \times B^{2r + 1} \to B, and the application of P to G is the rule sending,

G \mapsto P \circ M_r

Then the fact that we can group cells corresponds to the observation that:

M_r(i_1, j_1, k_1) = M_r(i_2, j_2, k_2) \implies P(M_r(i_1, j_1, k_1)) = P(M_r(i_2, j_2, k_2))

So, if our world data allows us to group up similar Moore neighborhoods, then we can expect that on average we should be able to reduce the complexity of iteration by a substantial amount.

In fact, this same idea also applies to meshing:  If you have a sequence of adjacent cubes that all have the same 3x3x3 neighborhood, then you can fuse their faces together to make a smaller mesh.  For example, instead of generating a mesh for each cube one at a time:

We can fuse cells with the same Moore neighborhood together:

So if we can iterate over the cells in batches, we should expect that not only will our physics loop get faster, but we should also expect that mesh quality should improve as well!

Run length encoding and interval trees

At this point, we now have some pretty good intuition that we can do iteration faster, so the obvious next step is to figure out exactly how this works.  Here is one strategy that I have used in a simple javascript based Minecraft game (which you can get here).  The idea follows from the simple observation that Minecraft style worlds can be easily run-length-encoded.  In run length encoding, one first flattens the 3D array describing a chunk down to a 1D string.  Then, substrings of repeated characters or “runs” are grouped together.  For example, the string:


Would become:


Which is a size reduction of about 25%.  You can imagine replacing the characters in the string with voxel types, and then compressing them down this way.  Now, applying run-length-encoding to chunks is by no means a new idea.  In fact, it is even reasonable to do so as a preprocessing step before g-zipping the chunks and sending them over the network/spooling to disk.  The fact that run-length encoding does compress chunks works to our advantage:  We can use this fact to iterate over the voxel set in units of runs instead of units of pixels, and can easily be extended to walk runs of Moore neighborhoods as well.

This sounds great for iterations, but what about doing random-access reads and writes?  After all, we still need to handle collision detection, block placement and raytracing somehow.  If we were dumb about it, just accessing a random block from a run-length-encoded chunk could take up to linear time.  It is obvious that we shouldn’t be happy with this performance, and indeed there is a simple solution.  The key observation is that a run-length encoding of a string is formally equivalent to an interval tree representation.

An interval tree is just an ordinary binary tree, where the key of each node is the start of a run and the value is the coordinate of the run.  Finding a greatest lower bound for a coordinate is equivalent to finding the interval containing a node.  To do insertion is a bit trickier, but not by much.  It does involve working through a few special cases by hand, but it is nothing that cannot be done without taking a bit of time and a pad of paper.  If we implement this data structure using some type of self-balancing binary tree (for example a red-black tree or a splay tree), then we can perform random reads and writes in logarithmic time on the number of runs.  Best of all: we also get improved mesh generation and in-memory compression for free!

Now in practice, you would want to combine this idea with virtualization.  Basically, you would store each chunk as an interval tree, then represent the entire world as a hash map.  The reason for doing this would be to integrate paging and chunk generation more seamlessly.  In fact, this is the method I took in the last two minecraft type games that I wrote, and you can find some example code illustrating this data structure right here.

Comparisons and conclusion

Ok, now that we’ve gone through the details, let’s break down our options by data structure and time complexity:

Data structure Random access time Iteration time Size Notes
Flat array O(1) O(n) O(n)
Virtualized array O(1) O(n) O(v) v is the number of occupied (virtual) chunks
Octree O(h) O(n h) O( v h ) h is the height of the octree
Interval tree + Hash table O( \log(r_C) ) O( r ) O( v r_C ) r is the number of runs in a region, r_C is the number of runs a single chunk.

Under our previous estimates, we can suppose quite reasonably that an interval tree should outperform a virtualized array for typical maps.  (Of course this claim goes out the window if your level geometry is really chaotic and not amenable to run-length encoding).  To test this hypothesis, I coded up a pretty simple benchmark.  At the outset the map is initialized to a 256x256x256 array of volume, with 5 different layers of blocks.  Randomly, 2^16 blocks from 32 different types are sprinkled around the domain.  To assess performance, we measure both how long an average random read takes and how long a sequential iteration requires (for Moore radii of 0 and 1).  The results of the benchmark are as follows:

Data structure Avg. random read Avg. sequential read (Moore radius = 0) Avg. sequential read (Moore radius = 1)
Flat array 0.224 μs/read 0.178 μs/read 0.060 μs/read
Virtual array 0.278 μs/read 0.210 μs/read 0.107 μs/read
Octree 2.05 μs/read 0.981 μs/read 0.933 μs/read
Interval tree + hashing 0.571 μs/read 0.003 μs/read 0.006 μs/read

In each column, the best result is underlined in bold.  There are a few things to take home from this.  First, for random access reads and writes, nothing beats a flat array.  Second, octrees are not a very good idea for voxel worlds.  On all accounts, a virtual array is far superior.  Finally, while the access time for random reads is noticeably slower in an interval tree, the speed benefits of improved sequential iteration make them more than worthwhile in practice.  The code for the benchmark can be found here.

Of course, interval trees will not perform well in worlds which are more-or-less random, or if there are not many runs of similar blocks that can be grouped together.  But when your world can be described in this way (which is certainly true of a Minecraft like game), switching to intervals can offer a huge improvement in performance.  I also firmly believe that there is still quite a bit of room for improvement.  If you have any suggestions or ideas, please leave a comment!

Polynomials in C++

Last time, we showed how one can use symmetric tensors to conveniently represent homogeneous polynomials and Taylor series.  Today, I am going to talk about how to actually implement a generic homogeneous polynomial/symmetric tensor class in C++.  The goal of this implementation (for the moment) is not efficiency, but rather generality and correctness.  If there is enough interest we can investigate some optimizations perhaps in future posts.

How to deal with polynomials?

There are quite a few ways to represent polynomials.  For the moment, we aren’t going to assume any type of specialized sparse structure and will instead simply suppose that our polynomials are drawn at random from some uniform distribution on their coefficients.  In some situations, for example in the solution of various differential equations, this is not too far from the truth, and while there may be situations where this assumption is too pessimistic, there are at least equally many situations where it is a good approximation of reality.

In 1D, the most direct method to represent a dense, inhomogeneous polynomial is to just store a big array of coefficients, where the i^{th} entry corresponds to the coefficient for the degree i monomial.  Naively generalizing a bit, the same idea works for multivariate polynomials as well, where you simply make the array multidimensional.  For example, suppose you had the 2D coefficient matrix:

\begin{array}{cc} a_{00} & a_{01} \\ a_{10} & a_{11} \end{array}

Which we could then expand out to give a polynomial like:

a_{00} + a_{01} x + a_{10} y + a_{11} xy

Though it is a bit simplistic, working in this format does have a few advantages.  For example:

  1. Evaluating a polynomial is easy.  Just compute the vectors (1, x, x^2, ...), (1, y, y^2, ...) and the then plug them into the coefficient tensor using ordinary Einstein summation.
  2. Multiplying can also be pretty fast.  Observe that coefficient matrix of the product of two polynomial matrices is just the convolution of their two matrices.  Using fast convolution algorithms (for example an FFT based method would work), this can be computed with only a mere log factor overhead!

The amount of storage for representing a polynomial on d variables with a maximum per-coefficient degree of r using this scheme requires exactly (r+1)^d scalars worth of memory. However, this representation scheme does have some major problems:

  1. It doesn’t capture all the terms above degree r properly.  For example, if you take the term a_{11} xy from the above example, it is actually second order, and so is the polynomial that it represents.  However, we are missing the extra terms for x^2, y^2.
  2. Related to the previous point, it is not closed under linear coordinate transformations.  If you were to say do a rotation in x, y, that a_{11} xy term could change into something like a x^2 + b xy + c y^2, and thus we would need to resize the matrix to keep up with the extra stuff.
  3. If you only care about storing terms with degree at most r, then the above representation is wasteful by at least a factor of about d!.
  4. And if you want to deal with terms that are exactly degree d, then it is even worse with an overhead of O( d! r^{ \frac{d}{d-1} }).

The underlying cause for each of these problems is similar, and ultimately stems from the fact that the function M_{ij...} x^i y^j ... is not multilinear.  As we saw last time it can be useful to deal with multivariate polynomials using tensors.  Similarly, it is also useful to split them up into homogeneous parts (that is grouping all the terms which have the same degree together).  We also showed how this approach greatly simplified the expression of a general multidimensional Taylor series expansion.  So today, we are going to investigate a different, and more “tensor-flavored” method for representing homogeneous polynomials.

Quick Review

Recall that a homogeneous polynomial on d variables, x_0, x_1, ... x_{d-1}, with degree r is a polynomial,

p(x_0, x_1, ...) = A_{0...0} x_0^r + r A_{0...1} x_0^{r-1} x_1 + ...

We also showed that this could be written equivalently using tensor notation:

p(x) = A_{i j k ... } x_i x_j x_k ...

Which allows us to reinterpret p as a “r-multilinear function” on x.  We say that a function f(x_0, x_1, ..., x_{d-1} ) is multilinear if for any of its arguments x_i, it is a linear function, that is:

f(x_0, x_1, ... a x_i + y, .... x_{d-1}) = af(x_0, x_1, ... x_i, .... x_{d-1}) + f(x_0, x_1, ... y, .... x_{d-1})

While it may seem a bit restrictive to work in homogeneous polynomials only, in practice it doesn’t make much difference.  Here is a simple trick that ought to be familiar to anyone in computer graphics who knows a bit about projective matrices.  Let us consider the following 1D example (which trivially generalizes):  given a polynomial in one variable,

p(x) = a_0 + a_1 x + a_2 x^2

Let us create a new variable w, and construct the new polynomial:

p'(x, w) = a_0 w^2 + a_1 xw + a_2 x^2 = w^2 p(\frac{x_0}{w})

Now we observe that:

  1. p'(x,w) is a homogeneous polynomial of degree 2.
  2. If we pick w = 1, then p'(x, 1) = p(x).

And so we have faithfully converted our inhomogeneous degree 2 polynomial in one variable into a new homogeneous degree 2 polynomial in two variables!

In general, converting from an inhomogeneous polynomial to a homogeneous polynomial is a simple matter of adding an extra variable.  The fully generalized version works like this:  Suppose that we have a polynomial p(x_0, x_1, ... x_{d-1}) of degree r in d unknowns.  Then we add a variable, w, and construct the new homogeneous degree r polynomial,

p'(x_0, x_1, ... x_{d-1}, w) = w^r p( \frac{x_0}{w}, \frac{x_1}{w}, \frac{x_2}{w}, ...)

And again, p'(x_0, x_1, ... , 1) = p(x_0, x_1, ...).

Indexing, iteration and combinatorics

So with that stuff above all out of the way, we shall now settle on an implementation: we’re going to use symmetric tensors to represent homogeneous polynomials.  Let us start off by listing a few basic properties about symmetric tensors:

  1. A tensor A_{ijk...} is called symmetric if for all permutations of an index, ijk..., jki... and so on, A_{ijk...} = A_{jki...} = ... .
  2. As a result, all vectors (aka a rank-1 tensors) are trivially symmetric.
  3. Similarly, a matrix is symmetric if and only if M_{ij} = M_{ji}.
  4. A rank 3 tensor is symmetric if M_{ijk} = M_{ikj} = M_{jik} = M_{jki} = M_{kij} = M_{kji} (and so on).
  5. Consequently is the dimension of each of the indices in a symmetric tensor must be equal, and so we shall sometimes refer to this quantity as the “dimension of the tensor”, or when we need to be more specific we shall call it the index dimension.
  6. The dimension of the space of all symmetric tensors with rank r and (index) dimension d is { r + d - 1 \choose d - 1 }.

And so the first problem that we must solve is to figure out how to pack the coordinates of a general rank r, dimension d symmetric tensor into an array of size { r + d - 1 \choose d - 1 }.  Closely related to this issue is the problem of indexing, and we are going to need methods for both randomly accessing elements of the tensor and for performing sequential iteration over the unique entries.  In fact, it is really this last issue which is the most essential, since if we can define an efficient method for enumerating the unique elements of the tensor, then we can simply use the order in which we enumerate the entries of the tensor as the indexes into a flat array.

By definition, the entries of a symmetric tensor are invariant under index permutation.  This leads to two distinct ways to index into a symmetric rank r dimension d tensor:

  1. Tensor style indexing – This is the ordinary way that we write the index for a generic tensor using subscripts.  Entries of the tensor are parameterized by a length r vector of integers, (i_0, i_1, ... i_{r-1}) where 0 \leq i_k< d for all 0 \leq k < r.
  2. Polynomial style (degree) indexing – This is the style of indexing is unique to symmetric tensors and comes from their close association with polynomials.  Here, we take the index to be the degree of one of the monomial terms in the corresponding homogeneous polynomial   These indices are enumerated by length d vectors of integers, (a_0, a_1, ... a_{d-1}) where a_k \geq 0 for all k, and \sum \limits_{k=0}^{d-1} a_k = r.

We can convert between these two types of indexes up to a point.  Converting from tensor indexing to degree indexing is straightforward, where we just pick

a_k = the number of elements in (i_0, ... i_{r-1}) that are equal to k

Going the other way is a bit harder, since there are multiple tensor indices which map to the same degree index.  To resolve the ambiguity, we can simply take the lexicographically first item.  This same idea also turns out to be useful for enumerating the entries within our tensor.  For example, suppose that we wanted to enumerate the components of a rank 3, degree 4 tensor.  Then, we could use the following ordering,

Pos.   Tensor   Degree
0      (0,0,0)  (3,0,0,0)
1      (1,0,0)  (2,1,0,0)
2      (2,0,0)  (2,0,1,0)
3      (3,0,0)  (2,0,0,1)
4      (1,1,0)  (1,2,0,0)
5      (2,1,0)  (1,1,1,0)
6      (3,1,0)  (1,1,0,1)
7      (2,2,0)  (1,0,2,0)
8      (3,2,0)  (1,0,1,1)
9      (3,3,0)  (1,0,0,2)
10     (1,1,1)  (0,3,0,0)
11     (2,1,1)  (0,2,1,0)
12     (3,1,1)  (0,2,0,1)
13     (2,2,1)  (0,1,2,0)
14     (3,2,1)  (0,1,1,1)
15     (3,3,1)  (0,1,0,2)
16     (2,2,2)  (0,0,3,0)
17     (3,2,2)  (0,0,2,1)
18     (3,3,2)  (0,0,1,2)
19     (3,3,3)  (0,0,0,3)

Here, the left hand column represents the order in which the entries of the tensor were visited, while the middle column is the set of unique tensor indices listed lexicographically.  The right hand side is the corresponding degree index, which is remarkably in colexicographic order!  This suggests other interesting patterns as well, such as the following little theorem:

Proposition: Colexicographically sort all the degree indices in a rank r, d-dimensional symmetric tensor.  Then the position of the degree index, (a_0, a_1, ... a_{d-1}), is

\sum \limits_{j=1}^n {{ j-1+\sum \limits_{k=n-j}^n a_k} \choose j}

(That is supposed to be a binomial coefficient, but the parenthesis seem to be a bit messed up in wordpress.)  I will leave this proof as an exercise to the reader, but offer the hint that it can be proven directly by induction.  It is also interesting to note that the formula for the position does not depend on either the rank of the tensor or the first coefficient.

Putting it all together

Now let us try to work out a method for iterating the tensor index incrementally.  A simple strategy is to use a greedy method:  We just scan over the index until we find a component that can be incremented, then scan backwards and reset all the previous indices.  In psuedo-C++, the code looks something like this:

bool next() {
  for(int i=0; i<rank; ++i) {
    if(tensor_index[i] < dimension - 1) {
      for(int j=i-1; j>=0; --j) {
          tensor_index[j] = tensor_index[i];
      return true;
  return false;

The tensor_index array contains rank elements, and is initialized to all 0’s before this code is called.  Each call to next() advances tensor_index one position.

Superficially, there are some similarities between this algorithm and the method for incrementing a binary counter.  As a result, I propose the following conjecture (which I have not investigated carefully enough to say is true confidently):

Conjecture: Amortized over course of a full traversal, the cost  of next is constant.

Using all this stuff, I’ve put together a little C++ class which iterates over the coordinates of a symmetric tensor.  I’ve also uploaded it to github, where you can go download and play around with it (though it isn’t much).  I hope to keep up this trend with future updates.  For those who are curious, here is the URL:

Next time

In the next episode, I am going to continue on this journey and build an actual symmetric tensor class with a few useful operations in it.  More speculatively, we shall then move on to rational mappings, algebraic surfaces, transfinite interpolation and deformable objects!

Multidimensional Taylor Series and Symmetric Tensors

As a warm up post, I’m going to talk about an important generalization of something that should be familiar to anyone’s who has made through a semester of calculus: Taylor series!  (And if you haven’t seen these guys before, or are perhaps feeling a bit rusty, then by all means please head on over to Khan academy to quickly brush up.  Go right ahead, I’ll still be here when you get back.)

Now this probably sounds like an awfully scary title for the first article in this miniseries about mathematics in graphics programming.  Perhaps you’re right!  But I’d like to think that graphics programmers are a tough bunch, and that using this kind of a name may indeed have quite the opposite effect of emboldening those cocky individuals who think they can press on and figure things out in spite of any frightening mathematical jargon.

Taylor Series

At the very heart of this discussion we are going to deal with two of the most important tasks any graphics programmer needs to worry about:  approximation and book keeping.  Taylor series are of course one of the oldest and best known methods for approximating functions.  You can think of Taylor series in a couple of ways.  One possibility is to imagine that they are successively approximating a given input function by adding additional polynomial terms to it.  So the idea is that you can start with some arbitrary 1D function, let’s call it f : R^1 \to R, and suppose that you are allowed the following two operations:

  1. You can evaluate a function at 0.
  2. You can take a derivative, \partial_x f(x)

Then, we can compute the Taylor series expansion of f about 0 in the usual way,

f(x) = f(0) + (\partial_x f)(0) x + (\partial_{x} \partial_x f)(0) \frac{x^2}{2!} + (\partial_{x} \partial_x \partial_x f)(0) \frac{x^3}{3!} + ... and so on

If we’re really slick, we can save the first k coefficients for these polynomials in a vector, call them say a_0, a_1, a_2, ... a_{k-1}., and then we can evaluate some approximation of f by summing up the first k terms in the above approximation:

f(x) = a_0 + a_1 x + a_2 \frac{x^2}{2!} + a_3 \frac{x^3}{3!} + ... a_{k-1} \frac{x^{k-1}}{(k-1)!} + O(x^{k})

Here is an animated gif showing the convergence of the Taylor series for the exponential function that I shamelessly ripped off from wikipedia:

Higher Dimensional Taylor Series

It is easy to adapt Taylor series to deal with vector valued functions in a single variable, you just treat each component separately.  But what if the domain f was not one dimensional?  This could easily happen for example, if you were trying to approximate something like a surface, a height field, or an image filter locally.  Well, in this post I will show you a slick way to deal with Taylor series of all sizes, shapes and dimensions!  But before I do that, let me show you what the ugly and naive approach to this problem looks like:

Suppose that f (x,y) : R^2 \to R is a 2D scalar-valued function. Then, let us look for a best quadratic (aka degree 2) approximation to f within the region near (0,0).  By analogy to the 1D case, we want to find some 2nd order polynomial $p(x,y)$ in two variables such that:

p(x,y) = a_{00} + a_{10} x + a_{01} y + a_{20} x^2 + a_{11} xy + a_{02} y^2


p(0,0) = f(0,0)

(\partial_x p)(0,0) = (\partial_x f)(0,0)

(\partial_y p)(0,0) = (\partial_y f)(0,0)

(\partial_{x} \partial_x p)(0,0) = (\partial_{xx} f)(0,0)

(\partial_{x} \partial_y p)(0,0) = (\partial_{xy} f)(0,0)

(\partial_{y} \partial_y p)(0,0) = (\partial_{yy} f)(0,0)

Phew!  That’s a lot more equations and coefficients than we got in the 1D case for a second order approximation.  Let’s work through solving for the coefficient $a_{20}$:  To do this, we take p and plug it into the appropriate equation:

\left( \partial_{x} \partial_x (a_{00} + a_{10} x + a_{01} y + a_{20} x^2 + a_{11} xy + a_{02} y^2) \right) = 2 a_{20} = (\partial_{x} \partial_x f)(0,0)

If we generalize this idea a bit, then we can see that for an arbitrary Taylor series approximation in 2D, the coefficient a_{ij} is determined by the equation,

a_{ij} = \frac{1}{i! j!} (\partial_{x^i y^j} f)(0,0)

All this is well and good, but it has a few problems. First, the summation for p is quite disorganized.  How are we supposed to keep track of which coefficients go where?  If we want to store the coefficients of p on a computer, then we need some less ad-hoc method for naming them and packing them into memory.  Second, it seems like this is going to be a headache to deal with 3 or more dimensions, since we’ll need to basically repeat the same sort of reasoning over and over again.  As a result, if we want to start playing around with higher dimensional Taylor series, we’re going to need a more principled notation for dealing with higher order multivariate polynomials.

Tensors to the Rescue!

One such system is tensor notation!  Though often maligned by mathematical purists. algebraic geometers, and those more modern algebraically savvy category theorists, tensors are a simple and effective notation that dramatically simplifies calculations.  From a very simplistic point of view, a tensor is just a multidimensional array.  The rank of the tensor is the number of different indices, each of which has a distinct dimension.  In C-style syntax, you could declare a tensor in the following way:

  float vector[10];                          //A rank 1 tensor, with dimension (10)
  float matrix[5][5];                        //A rank 2 tensor, with dimensions (5, 5)
  float spreadsheet[ROWS][COLUMNS][PAGES];   //A rank 3 tensor, with dimensions (ROWS, COLUMNS, PAGES)
  float crazy_thing[10][16][3][8];           //A rank 4 tensor, with dimension (10, 16, 3, 8 )

We will usually write tensors as upper case letters.  To reference an individual entry in this array, we will use an ordered subscript, like so:


Which we can take to be equivalent to the C-code:


For the rest of the article, we are going to stick with this point of view that tensors are just big arrays of numbers.  We aren’t going to bother worrying about things like covariance/contravariance (and if you don’t know what those words are, forget I said anything), nor are we going to mess around too much with operators like tensor products.  There is nothing wrong with doing this, though it can be a bit narrow minded and it does somewhat limit the applications to which tensors may be applied.  If it bothers you to think about tensors in this way, here is a more algebraic/operational picture of what a tensor does: much as how a row vector can represent a scalar-valued linear function, R^n \to R (via the dot-product), a tensor can represents a multi-linear function, R^n \times R^m \times ... \to R: that is, it takes in several vectors and spits out a scalar which varies linearly in each of its arguments.

That said, even if we just think about tensors as arrays, there’s still a number of useful, purely formal, operations that one can perform on tensors.  For example, you can add them up just like vectors.  If A and B are two tensors with the same dimensions, then we can define their sum componentwise as follows:

(A + B)_{i} = A_{i} + B_{i}

Where we take the symbol i to be a generic index here.  The other important operation that we will need to use is called tensor contraction.  You can think of tensor contraction as being something like a generalization of both the dot product for vectors, and matrix multiplication .  Here is the idea:  Suppose that you have two tensors A, B with dimensions (a_0, ... d, ... a_n), (b_0, ... d, ... b_m)  and ranks n,m respectively and some index between them with a common dimension.  Then we can form a new tensor with rank n + m – 1 by summing over that index in A and B simultaneously:

C_{a_0, ..., a_n, b_0, ... b_n} = \sum_{i=0}^d A_{a_0, ..., i ..., a_n } B_{b_0, ... i, ... b_m}

Writing all this out is pretty awkward, so mathematicians use a slick short hand called Einstein summation convention to save space.  Here is the idea:  Any time that you see a repeated index in a product of tensor coefficients that are written next to each other, you sum over that index.  For example, you can use Einstein notation to write the dot product of two vectors, x, y, as follows:

x_i y_i = \sum \limits_{i=0}^d x_i y_i = x \cdot y

Similarly, suppose that you have a matrix M, then you can write the linear transformation of a vector x by M in the same shorthand,

y_j = M_{ji} x_i

Which beats having to remember the rows-by-columns rule for multiplying vectors!  Similarly, you can multiply two matrices using the same type of notation,

(A B)_{i j} = A_{ik} B_{kj}

It even works for computing things like the trace of a matrix:

\mathrm{trace }(M) = M_{ii} = \sum \limits_{i} M_{ii}

We can also use tensor notation to deal with things like computing a gradient.  For example, we write the derivative of a function f : R^n \to R with respect to the i^{th} component as \partial_i f.  Combined with Einstein’s notation, we can also perform operations such as taking a gradient of a function along a certain direction.  If v is some direction vector, then the derivative of f along v evaluated at the point x is,

(\partial_i f)(x) v_i

Symmetric Tensors and Homogeneous Polynomials

Going back to multidimensional Taylor series, how can we use all this notation to help us deal with polynomials?  Well, let us define a vector x = (x_0, x_1, ... ) whose components are just the usual (x, y, z ...) coordinate variables, and pick some rank 1 tensor A with appropriate dimension.  If we just plug in x, then we get the following expression:

A_i x_i = A_0 x_0 + A_1 x_1 + ...

Which is of course just a linear function on x! What if we wanted to make a quadratic function?  Again, using tensor contraction this is no big deal:

A_{i j} x_{i} x_{j} = A_{0 0} x_0 x_0 + A_{0 1} x_0 x_1 + A_{1 0} x_1 x_0 + A_{1 1} x_1 x_1 + ...

Neat!  This gives us a quadratic multivariate polynomial on x.  Moreover, it is also homogeneous, which means that it doesn’t have any degree 1 or lower terms sitting around. In fact, by generalizing from this pattern if we wanted to say store an arbitrary degree n polynomial, we could pack all of its coefficients into a rank n tensor, and evaluate by contracting:

A_{i j k ... l} x_i x_j x_k ... x_l = A_{0 0 .... 0} x_0^n + A_{0 0 .... 1} x_0^{n-1} x_1 + ...

But there is some redundancy here.  Notice in the degree 2 case, we are assuming that the components of x commute, or in other words the terms x_0 x_1, x_1 x_0 are really the same and so we really don’t need to store both the coefficients A_{10} and A_{01}.  We could make our lives a bit easier if we just assumed that they were equal.  In fact, it would be really nice if whenever we took any index, like say A_{i j k ... } and then permuted it to something arbitrary, for example A_{j i k ...}, it always gave us back the same thing.  To see why this is, let us try to work out what the coefficient for  the monomial x_{i_0} x_{i_1} .. x_{i_n} in the degree n polynomial given by A_{i j k ... } x_i x_j x_k ....  Directly expanding using Einstein summation, we get a sum over all permutations of the indices i_0, ... i_n:

\sum \limits_{ \sigma \in \{ \mathrm{permutation of } i_0 ... i_n \} } A_{\sigma_0 \sigma_1 ... \sigma_n} x_{i_0} x_{i_1} ... x_{i_n}

If we assume that all those A_{...} coefficients are identical, then that above nasty summation has a pretty simple form:  namely it is a multinomial coefficient!  As a result, if we wanted to say find the coefficient of x^2_0 x_1 in $A_{i j k} x_i x_j x_k$, then we could just use the following simple formula:

\frac{3!}{2! 1! 1!} A_{0 0 1} x^2_0 x_1 = { 3 \choose 2, 1, 0}

A tensor which has the property that its coefficients are invariant under permutation of its indices is called a symmetric tensor, and as we have just seen symmetric tensors provide a particularly efficient method for representing homogeneous polynomials.  But wait!  There’s more…

If we pick A_{0 0 1} = (\partial_{001} f)(0), then the above formula is almost exactly the right formula for the Taylor series coefficient of x_0^2 x_1 in the expansion of f.  The only thing is that we are off by a factor of 3!, but no matter, we can just divide that off.  Taking this into account, we get the following striking formula for the Taylor series expansion of a function f : R^n \to R about the origin,

p(x) = f(0) + (\partial_{i} f)(0) x_i + \frac{1}{2!} (\partial_{ij} f)(0) x_i x_j + \frac{1}{3!} (\partial_{ijk} f) x_i x_j x_k + ...

Which is remarkably close to the formula for 1D Taylor series expansions!

Next Time

I will show how to actually implement symmetric tensors in C++11, and give some example applications of multidimensional Taylor series in implicit function modeling and non-linear deformations!  I’ll also show how the above expansion can be simplified even further by working in projective coordinates.

BSP Trees and Collision Detection

Recently I published a paper in CAD on BSP tree merging via linear programming.  At a very high level, the idea behind this paper was to replace the tree partitioning step with a linear programming feasibility test.  Doing this leads to a bunch of nice things, such as the removal of the partition function and all 9 of its special cases.  It also has the benefit that that for some suitable linear programming algorithms, the cost of evaluating feasibility can be amortized over the tree traversal giving an optimal linear-time output sensitive algorithm.

Of course as it always turns out with these things, I recently discovered that I was not the first to think of using linear programming for solving this problem.  Just last week Prof. Alberto Paoluzzi visited our lab, and I was lucky enough to get a chance to talk to him about BSP trees.  As it just so happens, he proposed using linear programming for tree partitioning in his book Geometric Programming for Computer-Aided Design (which is actually a very good read.)  Of course, this shouldn’t diminish the contribution of our paper, as we do present a much simpler merge algorithm along with an analysis of both time complexity and robustness.  In any future works, I will definitely add a reference to his book.

Anyway, I wanted to talk about one idea which didn’t make it into that paper, but I still happen to think is rather interesting.  As an accidental side effect of our merging algorithm, we discovered a straightforward method for solving exact continuous collision detection between arbitrary moving non-convex objects.  The key idea to this process was initially suggested to me by my colleague Nicholas Smolinske (who sadly is no longer in graduate school.)  Here’s how it goes:

Suppose you have two convex shapes moving at a constant velocity.  For each shape, the object Minkowski summed with its trajectory traces out a 4D convex set space and time.  If you take these two sets for both shapes and intersect them, then the result is also a convex shape in 4D.  Moreover, the lowest point in time within the overlap region must be the point at which the two objects initially made contact.  This idea is illustrated in the following diagram:


Because all of the sets we are dealing with are convex, it is possible to exactly solve for the point of intersection using linear programming with an objective function c(x,t) = -t.  Now, the next question is how to do this with BSP trees?  Well, it turns out that one side effect from the algorithm proposed in our paper is that it ends up evaluating linear programming on each convex cell contained in the final BSP tree.  We can make use of this by simply picking the same objective function as above, and storing the minimal point over all convex cells as our result.  Alternatively, if we only want to get the point of intersection, then we don’t even need to compute the result of the merge so we may instead use it as a kind of driver for solving linear programming over the cell-complex hierarchically.

The idea of BSP trees representing shapes in space/time is not new, and to the best of my knowledge was first explored by Agarwal et al. (they give some novel tree construction strategies, but don’t really deal with merging or related issues).  Another interesting way to look at this idea is that one could understand the BSP tree as generalizing some of the ideas of polytope hierarchies (such as Dobkin-Kirkpatrick) to non-convex sets.  Of course this still doesn’t answer the big question which is how to go about building decent BSP trees (or hierarchies for that matter) and also how to pick which order to merge them.  Someday I might go back and try refining these ideas to the point where they could make for a nice short conference paper, but for now it is just something fun to think about.