## Geometry without geometric algebra

When I was younger I invested a lot of time into studying geometric algebra.  Geometric algebra is a system where you can add, subtract and multiply oriented linear subspaces like lines and hyperplanes (cf. Grassmanian). These things are pretty important if you’re doing geometry, so it’s worth it to learn many ways to work with them. Geometric algebra emphasizes exterior products as a way to parameterize these primitives (cf. Plücker coordinates).  Proponents claim that it’s simpler and more efficient than using “linear algebra”, but is this really the case?

In this blog post I want to dig into the much maligned linear algebra approach to geometry.  Specifically we’ll see:

1. Two ways to parameterize a linear subspace using matrices
2. How to transform subspaces
3. How to compute the intersection and span of any two subspaces

## Encoding flats

A subspace of a vector space is a subset of vectors which are closed under scalar addition and multiplication.  Geometrically they are points, lines and planes (which pass through the origin unless we use homogeneous coordinates).  We can represent a k-dimensional subspace of an n-dimensional in two ways:

1. As the span of a collection of k vectors.
2. As the solution to a set of n – k linear equations.

These can be written using matrices:

• In the first case, we can interpret the span of k vectors as the image of an n-by-k matrix $M : R^k \to R^n$
• In the second, the solution of a set of n-k linear equations is another way of saying the kernel of an (n – k)-by-n matrix, $A : R^n \to R^{n - k}$.

These two forms are dual to one another in the sense that taking the matrix transpose of one representation gives a different subspace, which happens to be it’s orthogonal complement.

### Intersections and joins

The best parameterization depends on the application and the size of the flat under consideration.  Stuff that’s easy to do in one form may be harder in the other and vice-versa.  To get more specific, let’s consider the problem of intersecting and joining two subspaces.

If we have a pair of flats represented by systems of equations, then computing their intersection is trivial: just concatenate all the equations together.  Similarly we can compute the smallest enclosing subspace of a set of subspaces which are all given by spans: again just concatenate them.  And we can test if a subspace given by a span is contained in one given by equations by plugging in each of the basis vectors and checking that the result is contained in the kernel (ie maps to 0).

### Linear transformations of flats

Depending on the form we pick flats transform differently.  If we want to apply a linear transformation $T : R^n \to R^n$ to a flat, then we need to consider it’s encoding:

1. If the flat is given by the image of a map, $M(V)$, then we can just multiply by $T$
2. And if a flat is a system of equations, ie $A^{-1}(0)$, then we need to multiply by the inverse transpose of $T$.

The well known rule that normal vectors must transform by inverse transposes is a special case of the above.

### Conversion

Finally we can convert between these two forms, but it takes a bit of work.  For example, finding the line determined by the intersection of two planes through the origin in 3D is equivalent to solving a 2×2 linear system.  In the general case one can use Gaussian elimination.

## Is this less intuitive?

I don’t really know.  At this point I’m too far gone to learn something else, but it’s much easier for me to keep these two ideas in my head and just grind through some the same basic matrix algorithm over and over than to work with all the specialized geometric algebra terms.  Converting things into exterior forms and plucker coordinates always seems to slow me down with extra details (is this a vee product, inner product, circle, etc.), but maybe it works for some people.

## A level of detail method for blocky voxels

Large voxel terrains may contain millions of polygons.  Rendering such terrains at a uniform scale is both inefficient and can lead to aliasing of distant objects.  As a result, many game engines choose to implement some form of level of detail based rendering, so that distant terrain is rendered with less geometry.

In this post, I’ll talk about a simple technique based on vertex clustering with some optimizations to improve seam handling.  The specific ideas in this article are applied to Minecraft-like blocky terrain using the same clustering/sorting scheme as POP buffers.

M. Limper, Y. Jung, J. Behr, M. Alexa: “The POP Buffer: Rapid Progressive Clustering by Geometry Quantization“, Computer Graphics Forum (Proceedings of Pacific Graphics 2013)

## Review of POP buffers

Progressively Ordered Primitive (POP) buffers are a special case of vertex clustering, where for each level of detail we round the vertices down to the previous power of two.  The cool thing about them is that unlike other level of detail methods, they are implicit, which means that we don’t have to store multiple meshes for each level detail on the GPU.

When any two vertices of a cell are rounded to the same point (in other words, an edge collapse), then we delete that cell from that level of detail.  This can be illustrated in the following diagram:

Suppose that each vertex $v_j \in V \subset \mathbb{Z}^3$ has integer coordinates.  Define,

$L_i(v) = 2^i \left \lfloor \frac{v}{2^i} \right \rfloor$

This determines a filtration on the vertices,

$V = L_0(V) \supseteq L_1(V) \supseteq ... \supseteq L_n(V) = \{ 0 \}$

Which extends to triangles $(c_0, c_1, c_2) \in C \subseteq V^3$ according to the rule,

$P_i = \left \{ (c_0, c_1, c_2) \in C : L_i(c_0) \neq L_i(c_1), L_i(c_1) \neq L_i(c_2), L_i(c_2) \neq L_i(c_0) \right \}$

And so it follows that,

$C = P_0 \supseteq P_1 \supseteq ... \supseteq P_n = \emptyset$

Each of the sets $P_i$ represents the topology mesh at some level of detail, with $P_0$ being the finest, full detail mesh and $P_n$ the coarsest.   To get the actual geometry at level $i$, we can take any $j \leq i$ and compute,

$M_i = \{ (L_i(c_0), L_i(c_1), L_i(c_2)) : (c_0, c_1, c_2) \in P_j \}$

Using this property, we can encode the different levels of detail by sorting the primitives of the mesh from coarse-to-fine and storing a table of offsets:

To render the mesh at any level of detail we can adjust the vertex count, and round the vertices in the shader.

## Building POP buffers

To construct the POP buffer, we need to sort the quads and count how many quads are in each LOD.  This is an ideal place to use counting sort, which we can do in-place in O(n) time, illustrated in the following psuedo-code:

// Assume MAX_LOD is the total number of levels of detail
const buckets = (new Array(MAX_LOD)).fill(0)
// count number of quads in each LOD
for (let i = 0; i < quads.length; ++i) {
}
// compute prefix sum
let t = 0;
for (let i = 0; i < MAX_LOD; ++i) {
const b = buckets[i]
buckets[i] = t
t += b
}
// partition quads across each LOD
for (let i = quads.length - 1; i >= 0; --i) {
while (true) {
const ptr = buckets[lod]
if (i < ptr) {
break;
}
buckets[lod] += 1
}
}
// buckets now contains the prefixes for each LOD
return buckets
}


The quadLOD() helper function returns the coarsest level of detail where a quad is non-degenerate.  If each quad is an integer unit square (i.e. not the output from a greedy mesh), then we can take the smallest corner and compute the quad LOD in constant time using a call to count-trailing zeroes.  For general quads, the situation is a bit more involved.

### LOD computation

For a general axis-aligned quad, we can compute the level of detail by taking the minimum level of detail along each axis.  So it then suffices to consider the case of one interval, where the level of detail can be computed by brute force using the following algorithm:

function intervalLOD (lo, hi) {
for (let i = 0; i <= 32; ++i) {
if ((lo >> i) === (hi >> i)) {
return i
}
}
}

We can simplify this if our platform supports a fast count-leading-zeroes operation:

function intervalLOD (lo, hi) {
}

### Squashed faces

The last thing to consider is that when we are collapsing faces we can end up with over drawing due to rounding multiple faces to the same level.  We can remove these squashed faces by doing one final pass over the face array and moving these squashed faces up to the next level of detail.  This step is not required but can improve performance if the rendering is fragment processing limited.

## Geomorphing, seams and stable rounding

In a voxel engine we need to handle level of detail transitions between adjacent chunks.  Transitions occur when we switch from one level of detail to another abruptly, giving a discontinuity.  These can edges be hidden using skirts or transition seams at the expense of greater implementation complexity or increased drawing overhead.

In POP buffers, we can avoid the discontinuity by making the level of detail transition continuous, similar to 2D terrain techniques like ClipMaps or CLOD.  Observe that we can interpolate between two levels of detail using vertex morphing,

$L_t(x) = (\lceil t \rceil - t) 2^{\lfloor t \rfloor} \left \lfloor \frac{x}{2^{\lfloor t \rfloor}} \right \rfloor + (t - \lfloor t \rfloor) 2^{\lceil t \rceil} \left \lfloor \frac{x}{2^{\lceil t \rceil}} \right \rfloor$

In the original POP buffer paper, they proposed a simple logarithmic model for selecting the LOD parameter as a function of the vertex coordinate $x$:

$t(x) = b - \log ( \text{viewDist}(x) )$

Where $b$ is a bias parameter (based on the display resolution, FOV and hardware requirements) and viewDist is a function that computes the distance to the vertex from the camera.  Unfortunately, this LOD function is discontinuous across gaps due to rounding.

The authors of the original POP buffer paper proposed modifying their algorithm to place vertices along the boundary at the lowest level of detail.  This removes any cracks in the geometry, but increases LOD generation time and the size of the geometry.

Instead we can solve this problem using a stable version of LOD rounding.  Let $M$ be the maximum LOD value for the chunk and all its neighbors.  Then we compute a fixed point for $t$:

$t_0 = M$

$t_n = t(L_{t_{n - 1}}(x))$

In practice 2-3 iterations is usually sufficient to get a stable solution for most vertices.  This iteration can be implemented in a vertex shader and unrolled, giving a fast seamless level of detail selection.

As an aside, this construction seems fairly generic.  The moral of the story is really that if we have geomorphing, then we don’t need to implement seams or skirts to get crack-free LOD.

## World space texture coordinates

Finally, the last issue we need to think about are texture coordinates.  We can reuse the same texturing idea from greedy meshing.  For more info see the previous post.

## Collision detection (part 3): Benchmarks

Previously in this series we covered the basics of collision detection and discussed some different approaches to finding intersections in sets of boxes:

Today, we’ll see how well this theory squares with reality and put as many algorithms as we can find to the test.  For those who want to follow along with some code, here is a link to the accompanying GitHub repo for this article:

## Finding modules

To get a survey of the different ways people solve this problem in practice, I searched GitHub, Google and npm, and also took several polls via IRC and twitter.  I hope that I managed to cover most of the popular libraries, but if there is anything here that I missed, please leave a comment and let me know.

While it is not an objective measurement, I have also tried to document my subjective experiences working each library in this benchmark.  In terms of effort, some libraries took far less time to install and configure than others.  I also took notes on libraries which I considered, but rejected for various reasons.  These generally could be lumped into 3 categories:

1. Broken: The library did not report correct results.
2. Too much work: Setting up the library took too long.  I was not as rigorous with enforcing a tight bound here, as I tended to give more generous effort to libraries which were popular or well documented.  Libraries with 0 stars and no README I generally skipped over.
3. Irrelevant: While the library may have at first looked like it was relevant, closer inspection revealed that it did not actually solve the problem of box intersection.  This usually happened because the library had a suspicious name, or because it was some framework whose domain appeared to include this problem.

### A word on JavaScript

For the purpose of this benchmark, I limited myself to consider only JavaScript libraries. One major benefit of JavaScript is that it is easier to install and configure JavaScript libraries, which greatly simplifies the task of comparing a large number of systems.  Also, due to the ubiquity of JavaScript, it is easy for anyone to replicate these results or rerun these benchmarks on their own machine.  The disadvantage though is that there are not as many mature geometry processing libraries for JavaScript as there are for older languages like C++ or Java.  Still, JS is popular enough that I had no trouble finding plenty of things to test although the quality of these modules turned out to be wildly varying.

## Implementations surveyed

### Brute force

As a control I implemented a simple brute force $O(n^2)$ algorithm in the obvious way.  While it is not efficient for large problem sizes, it performs pretty well up to few hundred points.

### Bounding volume hierarchy modules

I found many modules implementing bounding volume hierarchies, usually in the form of R-trees.  Here is a short summary of the ones which I tested:

• rbush:  This is one of the fastest libraries for box intersection detection.  It also has a very simple API, though it did take a bit of time tuning the branching factor to get best performance on my machine.  I highly recommend checking this one out!
• rtree: An older rtree module, which appears to have largely been replaced by rbush.  It has more features, is more complicated, and runs a bit slower.  Still pretty easy to use though.
• jsts-strtree:  The jsts library is a JavaScript port of the Java Topology Suite.  Many of the core implementations are solid, but the interfaces are not efficient.  For example, they use the visitor pattern instead of just passing a closure to enumerate objects which requires lots of extra boilerplate.  Overall found it clumsy to use, but reasonably performant.
• lazykdtree: I included this library because it was relatively easy to set up, even though it ended up being very slow.  Also notable for working in any dimension.

Because of their popularity, I decided to make a special section for quad trees.  By a survey of modules on npm/GitHub, I would estimate that quad trees are one of the most commonly implemented data structures in JavaScript. However it is not clear to me how much of this effort is serious.  After much searching, I was only able to find a small handful of libraries that correctly implemented quad trees based rectangle stabbing:

• simple-quadtree: Simple interface, but sluggish performance.
• jsts-quadtree:  Similar problems as jsts-strtree.  Unlike strtree, also requires you to filter out the boxes by further pruning them against your query window.  I do not know why it does this, but at least the behavior is well documented.

Beyond this, there many other libraries which I investigated and rejected for various reasons.  Here is a (non-exhaustive) list of some libraries that didn’t make the cut:

• Google’s Closure Library: This library implements something called “quadtree”, but it doesn’t support any queries other than set membership.  I am still not sure what the point of this data structure is.
• quadtree2: Only supports ball queries, not boxes
• Mike Chamber’s QuadTree: Returns incorrect results
• node-trees: Returns incorrect results
• generic-quadtree:  Only implements point-in-rectangle queries, not rectangle-rectangle (stabbing) queries.
• quadtree: I’m don’t know what this module does, but it is definitely not a quad tree.

### Physics engines

Unfortunately, many of the most mature and robust collision detection codes for JavaScript are not available as independent modules.  Instead, they come bundled as part of some larger “physics framework” which implements multiple algorithms and a number of other features (like a scene graph, constraint solver, integrator, etc.).  This makes it difficult to extract just the collision detection component and apply it to other problems.  Still, in the interest of being comprehensive, I was able to get a couple of these engines working within the benchmark:

• p2.js:  A popular 2D physics engine written from the ground up in JavaScript.  Supports brute force, sweep and prune and grid based collision detection.  If you are looking for a good 2D physics engine, check this one out!
• Box2D: Probably the de-facto 2D physics engine, has been extremely influential in realtime physics and game development.  Unfortunately, the quality of the JS translations are much lower than the original C version.  Supports sweep-and-prune and brute force for broad phase collision detection.
• oimo.js:  This 3D physics engine is very popular in the THREE.js community.  It implements brute force, sweep and prune and bounding volume hierarchies for collision detection.  Its API is also very large and makes heavy use of object-oriented programming, which comes with some performance tradeoffs. However oimo does deserve credit for being one of the few libraries to tackle 3D collision detection in JavaScript.

I also considered the following physics engines, but ended up rejecting them for various reasons:

• cannon.js: To its credit, cannon.js has a very clear API and well documented code.  It is also by the same author as p2.js, so it is probably good.  However, it uses spheres for broad phase collision detection, not boxes, and so it is not eligible for this test.
• GoblinPhysics:  Still at very early stages.  Right now only supports brute force collision detection, but it seems to be progressing quickly.  Probably good to keep an eye on this one.
• PhysicsJS:  I found this framework incredibly difficult to deal with.  I wasted 2 days trying to get it to work before eventually giving up.  The scant API documentation was inconsistent and incomplete.  Also, it did not want to play nice in node.js or with any other library in the browser, hooking event handlers into all nooks and crannies of the DOM, effectively making it impossible to run as a standalone program for benchmarking purposes.  Working with PhysicsJS made me upset.
• Matter.js:  After the fight with PhysicsJS, I didn’t have much patience for dealing with large broken libraries.  Matter.js seems to have many of the same problems, again trying to patch a bunch of weird stuff onto the window/DOM on load, though at least the documentation is better.  I spent about an hour with it before giving up.
• ammo.js/physijs: This is an emscripten generated port of the popular bullet library, however due to the translation process the API is quite mangled.  I couldn’t figure out how to access any of the collision detection methods or make it work in node, so I decided to pass on it.

### Range-trees

Finally, I tried to find an implementation of segment tree based intersection for JavaScript, but searching turned up nothing.  As far as I know, the only widely used implementation of these techniques is contained in CGAL, which comes with some licensing restrictions and also only works in C++.  As a result, I decided to implement of Edelsbrunner & Zomorodian’s algorithm for streaming segment trees myself.  You can find the implementation here:

### box-intersect: Fast and robust d-dimensional box intersection

The code is available under a liberal MIT license and easily installable via npm.  It should work in any modern CommonJS environment including browserify, iojs and node.js.

## Testing procedure

In each of these experiments, a set of $n$ boxes was generated, and then sent to each library to compute a checksum of the set of pairs of intersections.  Because different libraries expect their inputs in different formats, as a preprocessing step the boxes are converted into whatever data type is expected by the library.  This conversion step is not counted towards the total running time.  Note that the preparation phase does not include any time required to build associated data structures like search trees or grids; these timings are counted toward the total run time.

Because algorithms for collision detection are output sensitive, care was taken to ensure that the total number of intersections in each distribution is at most $O(n)$, in order to avoid measuring the reporting time for each method.

A limitation of this protocol is that it favors batched algorithms (as would be typically required in CAD applications), and so it may unfairly penalize iterative algorithms like those used in many physics engines.  To assess the performance of algorithms in the context of dynamic boxes more work would be needed.

## Results

Here is a summary of the results from this benchmark.  For a more in depth analysis of the data, please see the associated GitHub repo.  All figures in this work were made with plot.ly, click on any of the images to get an interactive version.

### Uniform distribution

I began this study by testing each algorithm against the uniform distribution.  To ensure that the number of intersections is at most $O(n)$, I borrowed a trick from Edelsbrunner and Zomorodian; and scaled the side length of each box to be $O(n^{\frac{1}{d}-1})$ while constraining the boxes to remain within the unit hypercube, $[0,1]^d$.  A typical sample from this distribution looks like this,

To save time, I split the trials into phases ordered by number of boxes; algorithms which performed took too long on smaller instances were not tested on larger problem sizes.  One of the first and most shocking results was a test instance I ran with just 500 boxes:

Here two libraries stand out for their incredibly bad performance: Box2D and lazykdtree. I am not sure why this situation is so bad, since I believe the C version of Box2D does not have these problems (though I need to verify this).  Scaling out to 1500 boxes without these two libraries gives the following results:

The next two worst performing libraries were simple-quadtree and rtree.  simple-quadtree appears to have worse than quadratic growth, suggesting fundamental algorithmic flaws.  rtree’s growth is closer to $O(n \:\mathrm{polylog}(n))$, but due to the constants involved is still far too slow.  As it took too long and there were 2 other representatives of the r-tree data structure in the benchmark, I decided to drop it from larger tests.  Moving on to 10k boxes,

At this point, the performance of brute force begins to grow too large, and so it was omitted from the large brute force tests.

Because p2’s sweep method uses an insertion sort, it takes $O(n^2)$ time with a cold start.  This causes it to perform close to brute force in almost all cases which were considered.  I suspect that the results would be better if p2 was used from a warm start or in a dynamic environment.

Both of jsts’ data structures continue to trend at about $O(n \log(n))$ growth, however because the constants involved are so large they were also dropped from the large problem size.

10000 boxes appears to be the cut off for real time simulation (or 60fps) at least on my machine.  Beyond, this point, even the fastest collision detection algorithm (in this case box-intersect) takes at least 16ms.  It is unlikely that any purely JavaScript collision detection library would be able to handle significantly more boxes than this, barring substantial improvements in either CPU hardware or VM technology.

One surprise is that within this regime, box-intersect is consistently 25-50% faster than p2-grid, even though one would expect the $O(n)$ complexity of the grid algorithm to beat the $O(n \log^2(n))$ time of box-intersect.  An explanation for this phenomenon would be that box-intersect enjoys better cache locality (scaling linearly with block size), while hashing causes $O(1)$ indirect main memory accesses for each box.  If the size of a cache line is $B$, then the cross over point should occur when $n \approx \Theta(2^{\sqrt{B}})$ in 2D.  This is illustrated in the following chart which carries out the experiment to 250k boxes,

As expected, somewhere between 10000 and 20000 boxes, p2-grid finally surpasses box-intersect.  This is expected as grids realize $O(n)$ complexity for uniform distributions, which is ultimately faster than $O(n \log^2(n))$ for sufficiently large $n$.

The complexity of rbush on this distribution is more subtle.  For the bulk insertion method, rbush uses the “overlap minimizing tree” (OMT) heuristic of Lee and Lee,

T. Lee, S. Lee. (2003) “OMT: Overlap minimizing top-down bulk loading algorithm for R-Trees” CAiSE

The OMT heuristic partitions the space into an adaptive $R \times R$ grid at each level of the tree along the quantiles.  In the case of a uniform distribution of boxes, this reduces to uniform grid giving a query time of $O( R^2 \log_{R^2}(n) + k)$, which for finite $R$ gives means that rbush will find all intersections in $O(n \log(n) + k)$ time.  As a result, we can expect that once $n \approx \Theta(2^B)$, rbush-bulk should eventually surpass box-intersect in performance (though this did not occur in my benchmarks).  This also suggests another way to interpret the OMT heuristic:  it is basically a hedged version of the uniform grid.  While not quite as fast in the uniform case, it is more adaptive to sparse data.

### Sphere

Of course realistic data is hardly ever uniformly distributed.  In CAD applications, most boxes tend to be concentrated in a lower dimensional sub-manifold, typically on the boundary of some region.  As an example of this case, we consider boxes which are distributed over the surface of a sphere (again with the side lengths scaled to ensure that the expected number of collisions remains at most $O(n)$).

To streamline these benchmarks, I excluded some of the worst performing libraries from the comparison.  Starting with a small case, here are the results:

Again, brute force and p2’s sweep reach similar $O(n^2)$ performance.

More significantly, p2-grid did not perform as well as in the uniform case, running an order of magnitude slower.  This is as theory would predict, so no real surprises.  Continuing the trend out to 50k boxes gives the following results,

Both p2-grid and jsts-quadtree diverge towards $O(n^2)$, as grid based partitioning fails for sparse data.

Both rbush and jsts-strtree show similar growth rates, as they implement nearly identical algorithms, however the constant factors in rbush are much better.  The large gap in their performance can probably be explained by the fact that jsts uses Java style object oriented programming which does not translate to JavaScript well.

One way to understand the OMT heuristic in rbush, is that it is something like a grid, only hedged against sparse cases (like this circle).  In the case of a uniform distribution, it is not quite as fast as a grid, suffering a $O(\log(n))$ penalty, while adding robustness against sparse data.

Again box-intersect delivers consistent $O(n \log^2(n) + k)$ performance, as predicted.

### High aspect ratio

Finally, we come to the challenging case of high aspect ratio boxes.  Here we generate boxes which are uniformly distributed in $x$ and and stretched along the $y$-axis by a factor of $O(n)$:

High aspect ratio boxes absolutely destroy the performance of grids and quad trees, causing exponential divergence in complexity based on the aspect ratio.  Here are some results,

Even for very small $n$grids take almost a second to process all intersections for just 10000 boxes.  This number can be made arbitrarily high by choosing as extreme an aspect ratio as one likes.  Here I select an aspect ratio of $O(n)$, forcing the asymptotic growth of both jsts-quadtree and p2-grid to be $O(n^2)$.  If I had selected the aspect ratio as $O(2^n)$ or $O(\infty)$, then it would have been possible to increase their running times to some arbitrarily large value.  In this benchmark, rbush also grows though by a much slower $O(n^\frac{3}{2})$.  Continuing out to 100k boxes, eventually rbush also fails,

In this case rbush-bulk takes more than 40x slower and rbush-bulk more than 70x. Unlike in the case of grids however, these numbers are only realized by scaling in the number of boxes and so they cannot be made arbitrarily large.  However, it does illustrate that for certain inputs rbush will fail catastrophically.  box-intersect again continues to grow very slowly.

### 3D

The only libraries which I found that implemented 3D box intersection detection were lazykdtree and oimo.js.  As it is very popular, I decided to test out oimo’s implementation on a few small problem sizes.  Results like the following are typical:

## Conclusions

For large, uniform distributions of boxes, grids are still the best.  For everything else, use segment trees.

• Brute force is a good idea up to maybe 500 boxes or so.
• Grids succeed spectacularly for large, uniform distributions, but fail catastrophically for anything more structured.
• Quad trees (at least when properly implemented) realize similar performance as grids.  While not as fast for uniform data, they are slightly hedged against boxes of wildly variable size.
• RTrees with a tuned heuristic can give good performance in many practical cases, but due to theoretical limitations (see the previous post), they will always fail catastrophically in at least some cases, typically when dealing with boxes having a high aspect ratio.
• Zomorodian & Edelsbrunner’s streaming segment tree algorithm gives robust worst case $O(n \log^d(n) + k)$ performance no matter what type of input you throw at it.  It is even faster than grids for uniform distributions at small problem sizes(<10k) due to superior cache performance.

Overall, streaming segment trees are probably the safest option to select as they are fastest in almost every case.  The one exception is if you have a large number of uniformly sized boxes, in which case you might consider using a grid.

## Collision detection (part 2): Box intersection

Last time, we discussed collision detection in general and surveyed some techniques for narrow phase collision detection.  In this article we will go into more detail on broad phase collision detection for closed axis-aligned boxes.  This was a big problem in the 1970’s and early 1980’s in VLSI design, which resulted in many efficient algorithms and data structures being developed around that period.  Here we survey some approaches to this problem and review a few theoretical results.

## Boxes

A box is a cartesian product of intervals, so if we want to represent a d-dimensional box, it is enough to represent a tuple of d 1-dimensional intervals.  There are at least two ways to do this:

• As a point with a length
• As a pair of upper and lower bounds

For example, in 2D the first form is equivalent to representing a box as a corner point together with its width and height (e.g. left, top, width, height), while the second is equivalent to storing a pair of bounds (e.g. $[x_{min}, x_{max}] \times [y_{min}, y_{max}]$).

To test if a pair of boxes intersect, it is enough to check that their projections onto each coordinate axes intersects. This reduces the d-dimensional problem of box intersection testing to the 1D problem of detecting interval overlap.  Again, there are multiple ways to do this depending on how the intervals are represented:

• Given two intervals represented by their center point and radius, $[x_0-r_0, x_0+r_0], [x_1-r_1, x_1+r_1]$,

$[x_0-r_0, x_0+r_0] \cap [x_1-r_1, x_1+r_1] \neq \emptyset \Longleftrightarrow |x_0 - x_1| \leq r_0 + r_1$

• Given two intervals represented by upper and lower bounds, $[l_0, h_0], [l_1, h_1]$,

$[l_0, h_0] \cap [l_1, h_1] \neq \emptyset \Longleftrightarrow l_0 \leq h_1 \wedge l_1 \leq h_0$

In the first predicate, we require two addition operations, one absolute value and one comparison, while the second form just uses two comparisons.  Which version you prefer depends on your application:

1. In my experiments, I found that the first test was about 30-40% faster in Chrome 39 on my MacBook, (though this is probably compiler and architecture dependent so take it with a grain of salt).
2. The second test is more robust as it does not require any arithmetic operations.  This means that it cannot fail due to overflow or rounding errors, making it more suitable for floating point inputs or applications where exact results are necessary.  It also works with unbounded (infinite) intervals, which is useful in many problems.

For applications like games where speed is of the utmost importance, one could make a case for using the first form.  However, in applications where it is more important to get correct results (and not crash!) robustness is a higher priority.  As a result, we will generally prefer to use the second form.

## 1D interval intersection

Before dealing with the general case of box intersections in d-dimensions, it is instructive to look at what happens in 1D.  In the 1D case, there is an efficient sweep line algorithm to report all intersections.  The general idea is to process the end points of each interval in order, keeping track of all the intervals which are currently active.  When we reach the start of a new interval, we report intersections with all currently active intervals and it to the active set, and when we reach the end of an interval we delete the interval from the active set:

In JavaScript, it looks something like this:

function sweepIntervals(intervals) {
var events = [], result = [], active = []
intervals.forEach(function(interval, id) {
events.push(
{ t: interval[0], id: id, create: 1 },
{ t: interval[1], id: id, create: 0 })
})
events.sort(function(a, b) {
return a.t - b.t || b.create - a.create
})
events.forEach(function(ev) {
if(ev.create) {
active.forEach(function(id) {
result.push([ev.id, id])
})
active.push(ev.id)
} else
active.splice(active.indexOf(ev.id), 1)
})
return result
}

If the number of intervals is $n$, then there are $O(n)$ events total, and so sorting them all takes $O(n \log(n))$ time.  Processing event requires a scan through the active set, however for each iteration one intersecting pair is reported.  If the total number of collisions is $k$, then the amortized cost of looping over the events is $O(n + k)$.  Therefore, the total running time of this algorithm is in $O(n \log(n) + k)$.

## Sweep and prune

Sweeping is probably the best solution for finding interval overlaps in 1D.  The challenge is to generalize this to higher dimensions somehow.  One approach is to just run the 1D interval sweep to filter out collisions along some axis, and then use a brute force test to filter these pairs down to an exact set,

In JavaScript, here is an illustration of how it could be implemented in terms of the previous 1D sweep algorithm:

//Assume each box is represented by a list of d intervals
//Each interval is of the form [lo,hi]
function sweepAndPrune(boxes) {
return sweepIntervals(boxes.map(function(box) {
return box[0]
}).filter(function(pair) {
var A = boxes[pair[0]], B = boxes[pair[1]]
for(var i=1; i<A.length; ++i) {
if(B[i][1] < A[i][1] || A[i][1] < B[i][0])
return false
}
return true
})
}

The germ of this idea is contained in Shamos and Hoey’s famous paper on geometric intersection problems,

M. Shamos, D. Hoey (1976) “Geometric intersection problems” FoCS

In the case of rectangles in the plane, one can store the active set in an interval tree (more on this later), giving an optimal $O(n \log(n) + k)$ algorithm for planar intersection of rectangles.  If we just store the active set as an array, then this technique is known as sweep-and-prune collision detection, which is widely used in packages like I-COLLIDE,

J. Cohen, M. Lin, D. Manocha, M. Ponamgi. (1995) “I-COLLIDE: An interactive and exact collision detection system for large-scale environments” Symposium on Interactive 3D Graphics

For objects which are well separated along some axis, the simple sweep-and-prune technique is very effective at speeding up collision detection.  However, if the objects are grouped together, then sweep-and-prune is less effective, realizing a complexity no better than brute force $O(n^2 + k)$.

## Uniform grids

After brute force, grids are one of the simplest techniques for box intersection detection.  While grids have been rediscovered many times, it seems that Franklin was one of the first to write extensively about their use in collision detection,

W. Franklin (1989) “Uniform grids: A technique for intersection detection on serial and parallel machines” Proceedings of Auto-Carto

Today, grids are used for many different problems, from small video games all the way up to enormous physical simulations with millions of bodies.  The grid algorithm for collision detection proceeds in two phases; first we subdivide space into uniformly sized cubes of side length $H$, then insert each of the boxes into the cells they overlap.  Boxes which share a common grid cell are tested for overlaps:

Implementing a grid for collision detection is only just more complicated than sweep and prune:

//Same convention as above, boxes are list of d intervals
// H is the side length for the grid
function gridIntersect2D(boxes, H) {
var grid = {}, result = [], x = [0,0]
boxes.forEach(function(b, id) {
for(x[0]=Math.floor(b[0][0]/H); x[0]<=Math.ceil(b[0][1]/H); ++x[0])
for(x[1]=Math.floor(b[1][0]/H); x[1]<=Math.ceil(b[1][1]/H); ++x[1]) {
var list = grid[x]
if(list) {
list.forEach(function(otherId) {
var a = boxes[otherId]
for(var i=0; i<2; ++i) {
var s = Math.max(a[i][0], b[i][0]),
t = Math.min(a[i][1], b[i][1])
if(t < s || Math.floor(s/H) !== x[i])
return
}
result.push([id, otherId])
})
list.push(id)
} else grid[x] = [id]
}
})
return result
}

Note here how duplicate pairs are handled:  Because in a grid it is possible that we may end up testing the same pair of boxes against each other many times, we need to be careful that we don’t accidentally report multiple pairs of collisions.  One way to prevent this is to check if the current grid cell is the lexicographically smallest cell in their intersection.  If it isn’t, then we skip reporting the pair.

While the basic idea of a grid is quite simple, the details of building efficient implementations are an ongoing topic of research.  Most implementations of grids differ primarily in how they manage the storage of the grid itself.  There are 3 basic approaches:

• Dense array:  Here the grid is encoded as a flat array of memory.  While this can be expensive, for systems with a bounded domain and a dense distribution of objects, the fast access times may make it preferable for small systems or parallel (GPU) simulations.
• Hash table:  For small systems which are more sparse, or which have unbounded domains, a hash table is generally preferred.  Accessing the hash table is still $O(1)$, however because it requires more indirection iterating over the cells covering a box may be slower due to degraded data locality.
• Sorted list: Finally, it is possible to skip storing a grid as such and instead store the grid implicitly.  Here, each box generates a cover of cells which are then appended to a list which is then sorted.  Collisions correspond to duplicate cells which can be detected with a linear scan over the sorted list. This approach is easy to parallelize and has excellent data locality, making it efficient for systems which do not fit in main memory or need to run in parallel.  However, sorting is asymptotically slower than hashing, requiring an extra $O(\log(n))$ overhead, which may make it less suitable for problems small enough to fit in RAM.

Analyzing the performance of a grid is somewhat tricky.  Ultimately, the grid algorithm’s complexity depends on the distribution of boxes.  At a high level, there are 3 basic behaviors for grids:

• Too coarse: If the grid size is too large, then it won’t be effective at pruning out non-intersecting boxes.  As a result, the algorithm will effectively degenerate to brute force, running in $O(n^2 + k)$.
• Too fine: An even worse situation is if we pick a grid size that is too small.  In the limit where the grid is arbitrarily fine, a box can overlap an infinite number of cells, giving the unbounded worst case performance of $O(\infty)$!!!
• Just right:  The best case scenario for the grid is that the objects are uniformly distributed in both space and size.  Ideally, we want each box to intersect at most $O(2^d)$ cells and that each cell contains at most $O(2^d)$ objects. In this case, the performance of a grid becomes $O(2^d n + k)$ (using a grid or hash table), or $O(2^d n \log(n) + k)$ for sorted lists, which for small $d$ is effectively an optimal $O(n + k)$ complexity.

Note that these cases are not mutually exclusive, it is possible for a grid to be both too sparse and too fine at the same time. As a result, there are inputs where grids will always fail, no matter what size you pick.  These difficulties can arise in two situations:

• Size variation:  If the side lengths of the boxes have enormous variability, then we can’t pick just one grid size.  Hierarchical grids or quad trees are a possible solution here, though it remains difficult to tune parameters like the number of levels.
• High aspect ratio: If the ratio of the largest to smallest side of the boxes in the grid is too extreme, then grids will always fail catastrophically.  There is no easy fix or known strategy to avoid this failure mode other than to not use a grid.

While this might sound pessimistic, it is important to remember that when grids work they are effectively optimal.  The trouble is that when they fail, it is catastrophic.  The bottom line is that you should use them only if you know the distribution of objects will be close to uniform in advance.

## Partition based data structures

After grids, the second most widely recommended approach to collision detection are partition based tree data structures.  Sometimes called “bounding volume hierarchies,” partition based data structures recursively split space into smaller regions using trees. Objects are iteratively tested against these trees, and then inserted into the resulting data structure.

In psuedo-JavaScript, here is how this algorithm works:

function bvhIntersect(boxes) {
var tree = createEmptyTree(), result = []
boxes.forEach(function(box, id) {
bvhQuery(tree, box).forEach(function(otherBox) {
result.push([box, otherBox])
})
bvhInsert(tree, box)
})
return result
}

While the insertion procedure is different for each tree-like data structure, in bounding volume hierarchies querying always follows the same general pattern: starting from the root of the tree, recursively test if any children intersect the object.  If so, then visit those children, continuing until the leaves of the tree are reached.  Again, in psuedo-JavaScript:

function bvhQuery(node, box) {
if(!node.bounds.intersects(box))
return []
if(isLeaf(node))
return [ node.item ]
return node.children.reduce(function(child, result) {
return result.concat(bvhQuery(child, box))
}, [])
}

The precise complexity of this meta-algorithm depends on the structure of the tree, the distribution of the boxes and how insertion is implemented.  In the literature, there are many different types of bounding volume hierarchies, which are conventionally classified based on the shape of the partitions they use.  Some common examples include:

Within each of these general types of trees, further classification is possible based on the partition selection strategy and insertion algorithm.  One of the most comprehensive resources on such data structures is Samet’s book,

H. Samet. (2006) “Foundations of multidimensional and metric data structures

### Lower bounds

It would seem like the above description is too vague to extract any meaningful sort of analysis.  However, it turns out that using only a few modest assumptions we can prove a reasonable lower bound on the worst case complexity of bvhIntersect.  Specifically, we will require that:

1. The size of a tree with $n$ items is at most $O(n)$ bits.
2. Each node of the tree is of bounded size and implemented using pointers/references.  (As would be common in Java for example).
3. Querying the tree take $O(Q(n) + k)$ time, where $n$ is the number of items in the tree.

In the case of all the aforementioned trees these conditions hold.  Furthermore, let us assume that insertion into the tree is “cheap”, that is at most polylogarithmic time; then the total complexity of bvhIntersect is in the worst case,

$O \left( n(Q(n)+\mathrm{polylog}(n))+k \right)$.

Now here is the trick:  using the fact that the trees are all made out of a linear number of constant sized objects, we can bound the complexity of querying by a famous result due to Chazelle,

B. Chazelle. (1990) “Lower bounds for orthogonal range search I: The reporting case” Journal of the ACM

More specifically, he proved the following theorem:

Theorem: If a data structure answers box intersection queries in $O(\mathrm{polylog}(n) + k)$ time, then it uses at least $\Omega \left(n \left( \frac{\log(n)}{\log \log(n)} \right)^{d-1} \right)$ bits.

As a corollary of this result, any (reasonable) bounding volume hierarchy takes at least $Q(n) \in \omega(\mathrm{polylog}(n) )$ time per query.  Therefore, the worst case time complexity of bvhIntersect is slower than,

$\omega(n\:\mathrm{polylog}(n) + k)$.

Now this bound does come with a few caveats.  Specifically, we assumed that the query overhead was linear in the number of reported results and neglected interactions with insertion.  If $k$ is practically $O(n)$, then it is at least theoretically possible to do better.  Again, we cite a result due to Chazelle, which shows that it is possible to report rectangle intersection queries in $O( k \log( \frac{n}{k} ))$ time using $O(n)$ space,

B. Chazelle, (1988) “A functional approach to data structures and its use in multidimensional searching” SIAM Journal of Computing

### R-trees

Finally, let us look at one particular type of bounding volume hierarchy in detail; specifically the R-tree.  Originally invented by Guttmann, R-trees have become widely used in GIS databases due to their low storage overhead and ease of implementation,

A. Guttmann, (1984) “R-Trees: A dynamic index structure for spatial searching” SIGMOD

The high level idea behind an R-tree is to group objects together into bounding rectangles of increasing size.  In the original paper, Guttmann gave several heuristics for construction and experimentally validated them.  Agarwal et al. showed that the worst case query time for an R-tree is $\Omega(n^{1-\frac{1}{d}} + k)$ and gave a construction which effectively matches this bound,

P.K. Agarwal, M de Berg, J. Gudmundsson, M. Hammar, H.J. Haverkort. (2002) “Box-trees and R-trees with near optimal query time” Discrete & Computational Geometry

Disappointingly, this means that in the worst case using R-trees as a bounding volume hierarchy gives an overall time complexity that is only slightly better than quadratic. For example in 2D, we get $O(n^\frac{3}{2} + k)$, and for 3D $O(n^\frac{5}{3} + k)$.

Still, R-trees are quite successful in practice.  This is because for cases with smaller query rectangles the overhead of searching in an R-tree approaches $O(\log(n) + k)$.  Situations where the complexity degenerates to $O( n^\frac{d-1}{d} + k)$ tend to be rare, and in practice applications can be designed to avoid them.  Moreover, because R-trees have small space overhead and support fast updates, they are relatively cheap to maintain as an index. This has lead to them being used in many GIS applications, where the problem sizes make conserving memory the highest priority.

## Range tree based algorithms

So far none of the approaches we’ve seen have managed to break the $O(n \: \mathrm{polylog}(n) + k)$ barrier in the worst case (with the exception of 1D interval sweeping and a brief digression into Shamos & Hoey’s algorithm).  To the best of my knowledge, all efficient approaches to this problem make some essential use of range trees.  Range trees were invented by Bentley in 1979, and they solve the orthogonal range query problem for points in $O( \log^d (n) + k)$ time using $O(n \log^{d-1}(n))$ space  (these results can be improved somewhat using fractional cascading and other more sophisticated techniques).  The first application of range tree like ideas to rectangle intersections was done by Bentley and Wood,

J. Bentley, D. Wood. (1980) “An optimal worst case algorithm for reporting intersections of rectangles” IEEE Transactions on Computers

In this paper they, introduced the concept of a segment tree, which solves the problem of interval stabbing queries.  One can improve the space complexity of their result in 2D using an interval tree instead, giving a $O(n \log(n) + k)$ algorithm for detecting rectangle intersections.  These results were generalized to higher dimensions and improved by Edelsbrunner in a series of papers,

H. Edelsbrunner. (1983) “A new approach to rectangle intersections I” International Journal of Computer Math

H. Edelsbrunner. (1983) “A new approach to rectangle intersections II”  International Journal of Computer Math

Amongst the ideas in those papers is the reduction of the intersection test to range searching on end points.  Specifically, it is true that if two 1D intervals intersect, then at least one of them contains the end points of the other box.  The recursive application of this idea allows one to use a range tree to resolve box intersection queries in $d$ dimensions using $O(n \log^{d-1}(n))$ space and $O(n \log^{d-1}(n) + k)$ time.

### Streaming algorithms

The main limitation of range tree based algorithms is that they consume too much space. One way to solve this problem is to use streaming.  In a streaming algorithm, the tree is built lazily as needed instead of being constructed in one huge batch.  If this is done carefully, then the box intersections can be computed using only $O(n)$ extra memory (or even $O(1)$ if we allow in place mutation) instead of $O(n \log^d(n))$.  This approach is described in the following paper by Zomorodian and Edelsbrunner,

A. Zomorodian, H. Edelsbrunner (2000) “A fast software for box intersections” SoCG

In this paper, the authors build a segment tree using the streaming technique, and apply it to resolve bipartite box interactions.  The overall time complexity of the method is $O(n \log^d(n) + k)$, which matches the results for segment trees.

## Next time

In the next part of this series we will look at some actual performance data from real benchmarks for each of these approaches.  Stay tuned!

## Collision detection (part 1): Overview

Collision, or intersection, detection is an important geometric operation with a large number of applications in graphics, CAD and virtual reality including: map overlay operations, constructive solid geometry, physics simulation, and label placement.  It is common to make a distinction between two types of collision detection:

• Narrow phase:  Test if 2 objects intersect
• Broad phase: Find all pairs of intersections in a set of n objects

In this series, I want to focus on the latter (broad phase), though first I want to spend a bit of time surveying the bigger picture and explaining the significance of the problem and some various approaches.

## Narrow phase

The approach to narrow phase collision detection that one adopts depends on the types of shapes involved:

### Constant complexity shapes

While it is true that for simple shapes (like triangles, boxes or spheres) pairwise intersection detection is a constant time operation, because it is frequently used in realtime applications (like VR, robotics or games) an enormous amount of work has been spent on optimizing.  The book “Realtime Collision Detection” by Christer Ericson has a large collection of carefully written subroutines for intersection tests between various shapes which exploit SIMD arithmetic,

C. Ericson, (2004) “Realtime Collision Detection

### Convex polytopes

For more complicated shapes (that is shapes with a description length longer than $O(1)$) detecting intersections becomes algorithmically interesting.  One important class of objects are convex polytopes, which have the property that between any pair of points in the shape the straight line segment connecting them is also contained in the shape.  There are two basic ways to describe a convex polytope:

These two representations are equivalent in their descriptive power (though proving this is a bit tricky).  The process of converting a V-polytope into an H-polytope is called taking the “convex hull” of the points, and the dual algorithm of converting an H-polytope into a V-polytope is called “vertex enumeration.”

The problem of testing if two convex polytopes intersect is a special case of linear programming feasibility.  This is pretty easy to see for H-polytopes; suppose that:

$S = \left \{ x \in \mathbb{R}^d : A x \leq b \right \}$

$T = \left \{ x \in \mathbb{R}^d : C x \leq d \right \}$

Where,

• $A$ is a n-by-d matrix
• $b$ is a n-dimensional vector
• $C$ is a m-by-d matrix
• $d$ is a m-dimensional vector

Then the region $S \cap T$ is equivalent to the feasible region of a system of $n + m$ linear inequalities in $d$ variables:

$A x \leq b$

$C x \leq d$

If this system has a solution (that is it is feasible), then there is a common point $x$ in the interior of both sets which satisfies the above equations.  The case for V-polytopes is similar, and leads to a transposed system of $n + m$ variables in $d$ constraints (that is, it is the asymmetric dual of the above linear program).

Linear programs are a special case of LP-type problems, and for low dimensions can be solved linear time in the number of half spaces or variables. (For those curious about the details, here are some lectures).  For example, using Seidel’s algorithm testing the feasibility of the above system takes $O(d! (n + m))$ time, which for fixed dimension is just $O(n + m)$. The dependence on $d$ can be improved using more advanced techniques like Clarkson’s algorithm.

#### Preprocessing

If we are willing to preprocess the polytopes, it is possible to do exponentially better than $O(n + m)$.  In 2D, the problem of testing intersection between a pair of polygons reduces to calculating a pair of tangent lines between the polygons.  There is a well known algorithm for solving this in $O(\log(n))$ time by binary search (assuming that the vertices/faces of the polygon are stored in an ordered array, which is almost always the case).

The 3D case is a bit trickier, but it can be solved in $O(\log^3(n))$ using a more sophisticated data structure,

B. Chazelle, D. Dobkin. (1988) “Intersection of convex objects in two and three dimensions”  Journal of the ACM

For interactive applications like physics simulations, another important technique is to reuse previous closest points in calculating distances (similar to using a warm restart in the simplex method for linear programming).  This ideas were first applied to collision detection in the now famous Lin-Canny method:

M. Lin, J. Canny, (1991) “A fast algorithm for incremental distance calculation” ICRA

For “temporally coherent” collision tests (that is repeatedly solved problems where the shapes involved do not change much) the complexity of this method is practically constant.  However, because it relies on a good initial guess, the performance of the Lin-Canny method can be somewhat poor if the objects move rapidly.  More recent techniques like H-walk improve on this idea by combining it with fast data structures for linear programming queries, such as the Dobkin-Kirkpatrick hierarchy to get more robust performance:

L. Guibas, D. Hsu, L. Zhang. (1999) “H-Walk: Hierarchical distance computation for moving bodies” SoCG

### Algebraic and semialgebraic sets

Outside of convex polytopes, the situation for resolving narrow phase collisions efficiently and exactly becomes pretty hopeless.  For algebraic sets like NURBS or subdivision surfaces, the fastest known methods all reduce to some form of approximate root finding (usually via bisection or Newton’s method).  Exact techniques like Grobner basis are typically impractical or prohibitively expensive.  In constructive solid geometry working with semialgebraic sets, it is even worse where one must often resort to general nonlinear optimization, or in the most extreme cases fully symbolic Tarski-Seidenberg quantifier elimination (like the cylindrical algebraic decomposition).

### Measure theoretic methods

I guess I can say a few words about some of my own small contributions here.  An alternative to computing the distance between two shapes for testing separation is to compute the volume of their intersection, $\mu(S \cap T)$.  If this volume is $>0$, then the shapes collide.  One way to compute this volume is to rewrite it as an integral.  Let $1_S$ denote the indicator function of $S$, then

$\mu(S \cap T) = \int \limits_{\mathbb{R}^d} 1_S(x) 1_T(x) dx$

This integral is essentially a dot product.  If we perform an expansion of $1_S, 1_T$ in some basis, (for example, as Fourier waves), then we can use that to approximate this volume.  If we do this expansion carefully, then with enough work we can show that the resulting approximation preserves something of the original geometry.  For more details, here is a paper I wrote:

M. Lysenko, (2013) “Fourier Collision Detection” International Journal of Robotics Research

The advantage to this type of approach to collision detection is that it can support any sort of geometry, no matter how complicated. This is because the cost of the testing intersections scales with the accuracy of the collision test in a predictable, well-defined way.  The disadvantage though is that at high accuracies it is slower than other exact techniques.  Whether it is worthwhile or not depends on the desired accuracy, the types of shapes involved and if additional information like a separating axis is needed and so on.

Given a fast narrow phase collision test, we can solve the broad phase collision detection problem for $n$ objects in $O(n^2)$ calls to the underlying test.  As the number of reported collisions could be $O(n^2)$ in the worst case, this would seem optimal.  However, we can get a sharper picture using a more detailed output sensitive analysis.  To do this, define k to be the number of reported intersections, and let us then analyze the time required to do the collision detection as a function of both n and k.  Using output sensitive analysis, there is also a lower bound of $\Omega(n \log(n) + k)$ (for comparison based algorithms) by reduction to the element uniqueness problem.

### Special cases

If we are only allowed to use pairwise intersection tests and know no other property of the shapes, then it is impossible to compute all pairwise intersections any faster than $O(n^2 + k)$.  However, for special types of shapes in low dimensional spaces substantially faster algorithms are known:

#### Line segments

For line segments in the plane, it is possible to report all intersections in $O((n+k) \log(n))$ time using a sweep line algorithm.  If $k \in o( \frac{n^2}{\log(n)} )$, this is a big improvement over naive brute force.  This technique can also be adapted to compute intersections in sets of general convex polygons by decomposing them into segments, and then building a secondary point location data structure to handle the case where a polygon is completely contained in another.

#### Uniformly sized and distributed balls

It is also possible to find all intersections in a collection of balls with constant radii in optimal $O(n + k)$ time, assuming that the number of balls any single ball intersects is at most $O(1)$.  The key to this idea is to use a grid, or hash table to detect collisions. This process is both simple to implement and has robust performance, and so it is used in many simulations and video games.

#### Axis aligned boxes

Finally, it is possible to detect all intersections in a collection of axis aligned boxes in $O(n \log^d(n) + k)$ time, though we will postpone talking about this until next time.

### General objects and bounding volumes

For general objects, no algorithms with running time substantially faster than $O(n^2 + k)$ are known.  However, we can in practice still get a big speed up by using a simpler broad phase collision test to filter out intersections.  The main idea is to cover each object, $S$, with some simpler shape $S' \supseteq S$ called a bounding volume.  If a pair of bounding volumes do not intersect, then the shapes which they are covering can not intersect either.  In this way, bounding volumes can be used to prune down the set of collision tests which must be performed.

In practice, the most common choice for a bounding volume is either a box or a sphere.  The reason for this is that boxes and spheres support efficient broad phase intersection tests, and so they are relatively cheap.

Spheres tend to be more useful if all of the shapes are more or less the same size, but computing tight bounding spheres is slightly more expensive.  For example, if the objects being intersected consist of uniformly subdivided triangle meshes, then spheres can be a good choice of bounding volume.  However, spheres do have some weakness.  Because testing sphere intersection requires multiplication, it is harder to do it exactly than it is for boxes.  Additionally, for spheres of highly variable sizes it is harder to detect intersections efficiently.

Computing intersections in boxes on the other hand tends to be much cheaper, and it is simpler to exactly detect if a pair of boxes intersect.  Also for many shapes boxes tend to give better approximations than spheres, since they can have skewed aspect ratios.  Finally, broad phase box intersection has theoretically more robust performance than sphere intersection for highly variable box sizes.  Perhaps based on these observations, it seems that most modern high performance physics engines and intersection codes have converged on axis-aligned boxes as the preferred primitive for broad phase collision detection.  (See for example, Bullet, Box2D)

### Bipartite vs complete

It is sometimes useful to separate objects for collision detection into different groups.  For example if we are intersecting water-tight meshes, it is useless to test for self intersections.  Or as another example, in a shooter game we only need to test the player’s bullets against all enemies.  These are both examples of bipartite collision detection.  In bipartite collision detection, we have two groups of objects (conventionally colored red and blue), and the goal is to report all pairs of red and blue objects which intersect.

### Range searching and more references

There is a large body of literature on intersection detection and the related problems of range searching.  Agarwal and Erickson give an excellent survey of these results in the following paper,

P.K. Agarwal, J. Erickson. (1997) “Geometric range searching and its relatives

## Next time

In the next article, we will look at broad phase collision detection in more depth, focusing on boxes as a basic primitive.

## Relations are hard to model in category theory

WARNING: This is a somewhat rambling post about category theory.  If half-baked mathematical philosophy is not your thing, please consider navigating away right now.

Anyway, the thing that I want to write about today is the difference between category theory and what I shall call for lack of a better term the older “set theoretic” approach.  My goal is to try to articulate what I see as the main difference is between these two structures, and why I think that while category theory offers many insights and new perspectives it is probably hopeless to try to shoe horn all of mathematics into that symbolism.

## Relation Theory

If you’ve ever taken an advanced math course, you probably already know that set theory is default “programming language” of modern mathematics.  Modern set theory is built upon two basic components:

A set is any well defined unordered collection of objects, the members of which are called its elements.  Sets by themselves are rather boring things, and don’t do much on their own.  What makes set theory interesting and useful, is that in addition to sets we also have relations.  There are many ways to define a relation, but if you already know set theory we can bootstrap the definition using the concept of a Cartesian product:

An n-ary relation amongst the sets $X_1, X_2, ..., X_n$ is a subset of $R \subseteq X_1 \times X_2 \times ... \times X_n$.  A tuple of n elements, $x_1 \in X_1, x_2 \in X_2, ..., x_n \in X_n$ is related (under R) if and only if,

$R(x_1, x_2, ..., x_n) \Leftrightarrow (x_1, x_2, ... ) \in R$

This may seem a bit circular, but it is unavoidable since we need some formal language to define set theory, and without set theory don’t really have any way to talk about formal languages!  The only way out is to take at least something as a given, and for most mathematicians this is the definition of sets and relations.

The main goal of set theory is to define new sets and the relations between them.  A classic example of a relation is a graph, which in the plane visualizes relationships between a pair of variables:

Relations show up all over the place in mathematics.  For example, we have binary relations like =, <, >, etc. that we can use to compare quantities.  It is also possible to think of arithmetic in terms of relations, for example + can be thought of as a ternary relation that takes 3 numbers as input and checks if the 3rd is the sum of the two inputs.

It is possible to build up more complex relations in set theory by combining simple relations using the quantifiers there-exists and for-all. For example using the multiplication relation we can write the statement “2 divides x” using a quantifier:

$\exists y \in \mathbb{N}:\times(2,y,x)$

Where I am using  somewhat non-standard notation here to write multiplication as a ternary relation:

$\times(a, b, c) \cong (a b = c)$

Relational thinking is also the main concept behind tensor calculus, where one replaces all the sets involved with vector spaces and the relations with multilinear forms.  In fact, the whole concept of the Galerkin method in numerical analysis can be thought of as first relaxing a set theoretic problem into a tensor relation; then performing a finite expansion in some basis.

## Category Theory

You can approach category theory in many ways, but coming from set theory the easiest way to understand it is that it is the study of functional relations above all else.  The basic axioms of a category define a mathematical structure in which we can study certain abstract classes of functions.  Briefly a category has three pieces of data:

• A set of objects, $\text{Ob}(C)$
• For every pair of objects $a, b \in \text{Ob}(C)$ set of morphisms, $\text{Mor}_C(a,b)$
• A relation $\circ \subseteq \text{Mor}_C(a,b) \times \text{Mor}_C(b,c) \times \text{Mor}_C(a,c)$ for every triple of objects $a,b,c \in \text{Ob}(C)$

Such that the following conditions are satisfied:

• $\circ$ is a functional relation, $\text{Mor}_C(a,b) \times \text{Mor}_C(b,c) \to \text{Mor}_C(a,c)$
• There exists some $1_a \in \text{Mor}_C(a,a)$ such that $\forall f\in\text{Mor}_C(a,b):f\circ 1_a=f$
• $\circ$ is associative, that is $f \circ (g \circ h) = (f \circ g) \circ h$

Categories play a central role in algebra, where they are used to express transformations between various structures.  Perhaps the place where this is most obviously useful is in the study of groups and their representations.  Also the fact that many common algebraic structures like monoids are secretly just degenerate versions of categories, highlights their central importance.  Unlike relations categories have a great deal of structure, which makes it possible to say much more about them than one can about a relation in general.  It can be difficult to cast a relational problem into the language of categories, but the payoff is generally worth it.  For example, one of the more successful ways to study tensor algebra today is from the perspective of algebraic geometry.

## Categories vs Relations

The main difference between categories and relations is that categories focus on change, while relations express invariance.  Both philosophies are equal in their overall expressive power, but they may be better suited to some problems over others.

The main power of category theory is that it lends itself to explicit calculations and so it is most useful as a language for describing transformations.  This makes it an especially nice language for reasoning about algorithms and programs, and one can see this in languages like Haskell.  On the other hand, relations make minimal assertions about how we might compute something and instead only describe “what” is to be computed.  Languages like SQL or Prolog make heavy use of relations and are good at expressing data processing concepts.

For example, it is trivial to convert any problem in category theory into the language of relations (this is vacuously easy, since the axioms of a category are specified in terms of sets and relations!)  However, going the other way is a more complicated proposition and there is no perfect solution.  Perhaps the most direct way is to “emulate” relational reasoning within category theory, which can be done using the language of monoidal categories.  However, simply shoe horning relational thinking into this framework loses most of the advantage of categorical reasoning.  It is similar to the argument that you can take any program written in BrainFuck and port it to Haskell by simply writing an interpreter in Haskell.  While it would then be strictly true that doing this translates BrainFuck to Haskell, it misses the point since you are still basically coding in BrainFuck (just with an extra layer of abstraction tacked on).

### Categorification is hard

This is really why categorification (and programming in general) is hard: there is always more than one way to do it, and the more general you get the worse the result.  Effectively categorifiying various structures requires thinking deeply about the specific details fo the situation and picking axioms which emphasize the important features of a problem while neglecting the superfluous details.

## Conclusion

In the end, one view point isn’t any better than the other – but they are different and it is worth trying to understand both deeply to appreciate their strengths and weaknesses.  Relational thinking is successful when one needs to reason about very complicated structures, and is pervasive in analysis.  Categories on the other hand bring with them much more structure and precision, and are most useful in describing transformations and syntactic abstraction.

## ndarray modules

In the last two posts I introduced ndarrays, and explained the rationale and implementation details of the library.  In this post I am going to show a few of the libraries that I have written using ndarrays.  All of this code works in both node.js and within any browser that supports typed arrays.  You can run the demos directly in node.js or else test them out in a browser using a bundler like browserify.  You can click on the links for each of them to find out more details about a specific module.

Fourier analysis

## Morphology and miscellaneous stuff

### ball-morphology: Mathematical morphology with ball-shaped structuring elements

Conclusions

This list is by no means exhaustive, and I have been writing more modules as I need them.  One of the nice things about working with CommonJS modules is that it is pretty straight forward to create your own module on npm, and reuse its functionality.  I think that this style of programming could make building large scientific computing projects like SciPy/NumPy much more manageable.  Each function in such a project could be decomposed into a separate module, and it would be easy to experiment with different implementations.

## Cache oblivious array operations

I must confess that secretly the article I wrote last time (in which I introduced ndarrays) was just a pretext to introduce the stuff that I am going to write about today: which is the cwise library for array operations in JavaScript.

## Array operations

Briefly, array operations are component-wise operations that can be applied across multiple multidimensional arrays simultaneously.  Array operations can be used to implement basic vector arithmetic, like addition or scaling and are a fundamental tool in linear algebra/numerical computing in general.  Because they are so ubiquitous, many languages have special syntax for array operations as well as routines for specifically optimizing the performance; for example in MATLAB if you prefix an operator with . then executes component-wise.  In the functional programming world, array operations can be implemented as a sequence of zip /map /reduce higher order functions, while in a language with procedural constructs like C or JavaScript you would probably just use a for-loop like this:

for(i=0; i<shape[0]; ++i) {
for(j=0; j<shape[1]; ++j) {
// ... do operation ...
a[i][j] += b[i][j]
}
}

### Simple algorithm

The simple for-loop algorithm is a good starting point for implementing operations on ndarrays.  However, in an ndarray performing a lookup at each loop iteration introduces an overhead of $O(d)$ (where $d$ is the dimension  of the ndarray) due to the extra indexing and multiplication required.  To avoid doing this, we can compute the index into the underlying array incrementally.  As a sketch of how this works, consider the following psuedocode:

var a_ptr = a.offset
, b_ptr = b.offset

for(var i=0; i<a.shape[0]; ++i) {
for(var j=0; j<a.shape[1]; ++j) {
//... do operation ...
a.data[a_ptr] += b.data[b_ptr]

a_ptr += a.stride[1]
b_ptr += b.stride[1]
}
a_ptr += a.stride[0] - a.stride[1] * a.shape[1]
b_ptr += b.stride[0] - b.stride[1] * a.shape[1]
}

Using this new algorithm, the next index for the arrays can be computed by just adding some constant shift to the pointer instead of performing a series of multiplies.  This is indeed a good thing, and one can easily show that in the conventional RAM model of computation, (which is heavily emphasized in undergraduate CS curricula), this approach is optimal.

But as everyone who has ever tried hard to optimize code for processing arrays knows, this is hardly the full story.  Real computers have hierarchical memory, and execute block IO operations on chunks of cache.  Failing to take this into account when analyzing an algorithm can lead to wildly different (and inaccurate) conclusions about its performance.

### Two-level memory and striding

So if we want to come up with a better algorithm for array operations we need a better theory of how our computer works.  And perhaps the simplest extension of the RAM model which does this is the two-level memory model.  The basic idea in the two-level model is that you are allowed to operate on memory in contiguous chunks of up to size $B$ words at a time.  This is basically a block IO version of the RAM model, and accurately models many realistic algorithms.  In general, the running time of an algorithm on the RAM model is an upper bound on its running time in the two-level model, and the fastest that we can ever speed up an algorithm in the two level model is by a factor of $O(B)$.

This fact is especially important when we are considering array operations and illustrates the importance of striding in our algorithms.  For example, suppose in the above algorithm that the arrays were contiguously packed in row major order, then each of the memory reads could be executed sequentially and the running time would be $O(1 + \frac{\text{shape}[0]\text{shape}[1] }{B})$.  On the other hand, if the arrays were stored in column major order and if we looped over them in the same order, then we would get no advantage from blocked IO operations and so the cost of executing the algorithm would be $O( \text{shape}[0] \text{shape}[1] )$.

To fix this, we should iterate over the indices where the largest strides are in the outer loop and the smallest strides are packed inside.  This makes the most efficient use of each block memory operation, and reduces the total computation time by a factor of $O(B)$ (which is optimal) if the stride orderings are compatibile in each array.

### Cache aware algorithm

So the next question is what happens if the strides are not compatible?  For example, suppose that a was column major while b was row major?  Then in the two level model there is no way to take advantage of block memory operations to speed up our loop, and so it would seem that we are back to RAM model performance.  However, there is an important aspect of hierarchical memory which the two level model neglects: caching.

But before we can understand and exploit this new technology, we need some sort of model to describe how it works.  For this purpose, the minimal extension of the two level  memory model is known as the external memory model.  Like the two level memory model, we can read and write memory in chunks, but in addition there is also a separate (but small and finite) memory called the cache which can store up to $M$ words at a time.  We assume that accessing cache is infinitely fast, and so the time complexity of any algorithm in the external memory model is bounded by the number of reads/writes to the external memory source.

This model may seem slightly absurd since it is possible to execute things like infinite loops for free (as long as they only use the cache), however it turns out to be pretty good accurate for numerical computations and data structures, which are all asymptotically limited by the amount of time they spend moving bits around at the slowest level of the memory hierarchy.

Now it turns out that within the external memory model there is an asymptotically efficient algorithm for array operations that works regardless of their stride.  The first place where this seems to have been written down is in the following paper:

A. Aggarwal, J. Vitter. “The input/output complexity of sorting and related problems” (1987) CACM

The general idea is that we break our arrays down into chunks which have shapes that are on the order of $B$ in each dimension.  Then we iterate over the whole matrix in these chunks and execute the array operation across each chunk independently.  In psuedocode, it works like this:

//Assume shape[0], shape[1] are multiples of B for simplicity
for(var i=0; i<shape[0]; i+=B) {
for(var j=0; j<shape[1]; j+=B) {
var a_ptr = i * a.stride[0] + j * a.stride[1]
var b_ptr = i * b.stride[0] + j * b.stride[1]
for(var k=0; k<B; ++k) {
for(var l=0; l<B; ++l) {
a.data[a_ptr] += b.data[b_ptr]
a_ptr += a.stride[1]
b_ptr += b.stride[1]
}
a_ptr += a.stride[0] - a.stride[1] * B
b_ptr += b.stride[0] - b.stride[1] * B
}
}
}

Now if $M > B^d$, then this approach can be used to speed up any array operation by a factor $B$ – regardless of their stride!  Pretty neat!

### Cache oblivious algorithm

The above approach is awesome, but it only works if we know how big $B$ and $M$ are.  However, how can we figure out what these parameters are?  One possible solution is to run lots of experiments and directly optimize those values.  This benchmark-first approach to performance tuning works pretty well for known hardware configurations, like game consoles or iPhone and can produce awesome results.  However it is obviously not very robust, nor is it even feasible in all situations, like web applications for example which need to perform well across many different browsers and devices.

Similarly, we often can’t count on the values for $M$ and $B$ to remain constant throughout the life time of a program.  Contention for disk, RAM and CPU cache can cause the amount of available cache and bandwidth to fluctuate, and so we really shouldn’t assume these things are constant.

Finally, there is also the small problem that computers typically have more than one level of caching, and optimizing array operations for multiple levels of cache would require introducing one extra loop per layer, which can be quite impractical to implement.

Fortunately there is a single elegant model of computation which solves all of these problems at once.  It is known as the cache oblivious memory model, and it was invented by Harald Prokop:

H. Prokop. “Cache-oblivious algorithms” (1999) MIT Master’s Thesis.

For the amount of problems that it solves, the cache oblivious model of computation is remarkably simple: it is just the external memory model, except we are not allowed to know $B$ and $M$.  But despite being so trivial, the cache oblivious model has a number of remarkable properties:

1. Algorithms that scale in the cache oblivious model scale across all levels of the memory hierarchy simultaneously.
2. Cache oblivious algorithm “just work” across any type of hardware or virtual machine with no fine tuning.
3. Cache oblivious algorithms automatically throttle with hardware availability and scale performance accordingly.
4. Finally, programs written for the cache oblivious model look just like ordinary RAM model programs!  There is no need to introduce extra notation or operations.

These properties make the cache oblivious model the gold standard for analyzing high performance algorithms, and it should always be our goal to design data structures and algorithms that scale in this model.  However, to achieve these benefits we need to do a bit more work up front when we are designing our algorithms.

For array operations, a simple way to do this is to adapt Prokop’s cache oblivious matrix transpose algorithm (see the above thesis for details).  The way it works is that we keep recursively splitting our multidimensional arrays in half along the largest axis until finally we are left with blocks of unit size.  In psuedo-JavaScript using ndarrays, it would look like this:

function recursiveAdd(a, b) {
var rows = a.shape[0]
, cols = a.shape[1]
if(rows === 1 && cols === 1) {
a.data[a.offset] += b.data[b.offset]
} else if(rows > cols) {
rows >>>= 1
} else {
cols >>>= 1
}
}

The above code also works regardless of the stride and always uses an optimal $O(1+\frac{\text{shape}[0]\text{shape}[1]}{B})$ block memory transfers if the cache is at least $M \geq B^2$.  It is also pretty easy to generalize this idea to multiple arrays with higher dimensions, which again performs optimally assuming that $M \geq n B^d$ where is the number of arguments and is the dimension of the arrays.

Array operations for JavaScript

The above cache oblivious algorithm is clearly what we want to implement, but the code in the previous listing is just a sketch and if implemented naively is not likely to be very fast (at least in JavaScript).  There are quite a few important JS-specific micro-optimizations that we ought to apply before seriously implementing cache-oblivious array operations:

• Slicing the arrays creates a small amount of garbage at each level of recursion, and so we should probably flatten out the objects and keep track of their structures as arguments incrementally to avoid unnecessary memory allocation.
• Using a recursive function call is much slower than manually using a stack to keep track of previous calls,.
• Similarly, we should use the pointer tricks from the simple algorithm to speed up array indexing.
• Also if the strides match, we should use the direct pointer arithmetic version as it has lower overhead
• For small array sizes the overhead of recursion is much larger than the savings due to cache improvements, and so we would want to terminate the subdivision sometime before getting all the way down to unit arrays.  (This is not related to the size of the cache blocks since it is a function of the amount of CPU overhead, not memory bandwidth).

## Code!

All of these issues can be pretty tricky to get right, and so it makes sense to try to push all of this complexity into a library.  This is what the cwise library does, and you can get it here:

### https://github.com/mikolalysenko/cwise

That library should work in any JavaScript environment that supports typed arrays, including most current browsers (IE 9+, Firefox, Chrome and Safari).  If you want to try it out in your own projects, you can install it via npm or use browserify (or some similar tool) to stick it in your web page.

### Macros and dynamic code generation

The basic idea behind cwise is similar to libraries like FFTW which dynamically generate tuned code based on their inputs (though because it is much easier to do dynamic code generation in JavaScript the interface for cwise is a bit simpler).  The way it works is that it lazily compiles optimized scans for array operations based on the shape and stride of the input arguments.  You can specify the component-wise operations for cwise using ordinary JavaScript which gets parsed and compiled using esprima/falafel at run time.  This is only done the first time you execute a cwise function, all subsequent calls reuse the same optimized script.

## Tricks

There are a lot of useful things that you can do with ndarrays using cwise.  There are plenty of useful recipes on the github page, but to help get things started here is a quick survey of some stuff you can do:

### Vector arithmetic

The most obvious use of cwise is to implement basic vector arithmetic.  You can grab a handful of prebaked ready-to-go operations from the following github project:

https://github.com/mikolalysenko/ndarray-ops

As a rule, using these sort of helper methods is not as efficient as unpacking your array operations into a cwise data structure, but on the other hand they can simplify your code and for basic tasks they are often fast enough.

### Matrix transpose

A more complicated example of using ndarrays is to perform matrix transpose or cache oblivious reindexing of arrays.  For example, this is pretty easy to do by just changing the stride in the target.  Suppose for example, suppose we want to transpose a square image which is stored in column-major format.  Then we can do this using the following code:

var ops = require("ndarray-ops")
var ndarray = require("ndarray")
function transpose(output, input, size) {
var a = ndarray.ctor(output, [size, size], [size, 1], 0)
, b = ndarray.ctor(input, [size, size], [1, size], 0)
ops.assign(a, b)
}

This is essentially the same thing that Prokop’s cache oblivious matrix transpose algorithm does.

### Array copying

Another neat trick we can do with array operations is make copies of arrays in memory.  For example, if we want to fill an array with a repeating pattern, we can use the following snippet:

var ops = require("ndarray-ops")
var ndarray = require("ndarray")
function fillPattern(output, pattern, count) {
var a = ndarray.ctor(output, [count, pattern.length], [pattern.length, 1], 0)
, b = ndarray.ctor(pattern, [count, pattern.length], [0, 1], 0)
ops.assign(a, b)
}

This generalizes to filling 2D images with repeated tiles or for stuff like slicing volume data apart into a texture.

## Experiments

So how much difference does all of this make?  To add some numbers to all of these different array operations I made a few quick experiments in node.js to try everything out.  You can check out the results for yourself here on github:

https://github.com/mikolalysenko/cwise-experiment

And here are the results summarized in graphical format:

We can draw the following conclusions from these measurements:

• When possible, sequential iteration is the optimal implementation of array operations.
• For non-compatible strides, cache aware array operations are fastest, but they require knowledge of the underlying architecture (which may not always be available)
• Though cache oblivious array operations are almost never the fastest possible algorithm, they come closest across the widest array of possible inputs.

## Conclusion

In summary, we’ve discussed an asymptotically optimal algorithm cache-oblivious array operations across many multidimensional arrays with arbitrary shapes and strides.  We’ve also seen an implementation of these concepts in JavaScript via the ndarray and cwise libraries (that should work in node.js and in any browser that supports typed arrays).  These tools enable a wide variety of interesting image processing and linear algebra operations, but a lot more work is still necessary to realize a full suite of numerical linear algebra tools in JavaScript.  Here are a few possible ideas for future projects:

• Tools for basic linear algebra computations (matrix inverse, iterative/direct solvers, eigenvalue problems)
• Data structures for sparse multidimensional arrays
• Libraries for implementing Einstein tensor summation/higher order map-reduce in a style similar to cwise.

## Implementing Multidimensional Arrays in JavaScript

The past few months I’ve been working to move more of my work into JavaScript, and in particular the node.js ecosystem.  I hope that by doing this I will be able to create better demos and applications that are easier to use as well as develop.

## Numerical Computing in a Hostile Environment

But despite the numerous practical advantages to working in JavaScript (especially from a user experience perspective) there are still many things that JS just sucks at.  For me the most painful of these is the weak support for numerical computing and binary data types.  These problems are due to fundamental limitations of the language, and not something that can be papered over in a library.  For example, in JS there is:

• No pass-by-value for non-primitive data types
• No way to control memory alignment for members of objects
• No way to restrict references to pointers/prevent aliasing
• No way to ensure arrays of objects are allocated in place
• No array operations
• No SIMD instructions and data types
• … etc.

This situation is bad, but the benefits of using JS from an accessibility perspective are just too big to ignore, and so it is at least worth it to try as much as can be done within the limitations of the system.  Moreover, these obstacles are by no means unprecedented in the world of computing.  As an instructive example, consider the Java programming language:  Java suffers from exactly the same design flaws listed above, and yet it is a viable platform for numerical computing with an enthusiastic community (though it is still nowhere near as fast as more efficiently designed languages like Fortran/C99).  So, it is a plausible hypothesis that JavaScript could achieve similar results.

Now, reimplementing say LAPACK or SciPy is a huge project, and it is not reasonable to try doing it in a single article or even within a single library.  To help focus our inquiries, let us first restrict the scope of our search.    Instead, let us study today on the fundamental problem of implementing multidimensional arrays.

## Multidimensional arrays

Multidimensional arrays are a natural first step towards building systems for solving numerical problems, and are a generally useful data type in their own right, with applications in graphics and vision.  In a language like JavaScript, there are multiple ways

### Arrays-of-Arrays

Perhaps the most popular way to represent higher dimensional arrays in JavaScript is as arrays-of-arrays.  For example,

var A = [[1, 0, 0],
[0, 1, 0],
[0, 0, 1]]

This is the approach that numeric.js uses, and it has some merits.  For example, the syntax for accessing elements in an array-of-arrays is quite pleasingly simple:

var x = A[i][j]

And it does not require introducing any extra data structures or conventions into the language.

However, while arrays-of-arrays are by far the simplest solution, from a performance perspective they are disastrous.  The most obvious problem is that to allocate a d-dimensional array-of-arrays with O(n) elements per each element, you need to store $O(n^{d-1})$ extra bytes worth of memory in pointers and intermediate arrays.  Additionally, accessing an element in an array-of-arrays requires $O(d)$ dependent memory dereferences.  These operations can’t be easily pipelined and are terrible from a cache performance perspective.  If performance is even remotely close to being an objective, then we need to try for something better.

Typed Arrays

Fortunately, in both JavaScript and Java, there is a better alternative to arrays-of-arrays: namely typed arrays.  Using a typed array lets us guarantee that the data for our array is arranged contiguously with no wasted space.  For example, here is how we would create a typed array for the same matrix described earlier:

var A = new Float32Array([1, 0, 0, 0, 1, 0, 0, 0, 1])

The downside is that accessing elements from a typed array is a bit more complicated.  To do this we need to agree on some convention up front about how the elements in the array are arranged in memory.  In languages like C, typically this is done in colexicographic (also known as row major order), while other languages like Fortran use a lexicographic ordering (also called column major order).  The choice between the two is  totally arbitrary, but once you pick one you have to stick with it.  Assuming a row major ordering then, we can access an element from the array using the formula:

var x = A[3 * i + j]

Typed arrays can be very fast, but they are a bit clumsy to work with.  You need to keep track of the number of both what convention you are using to store the array as well as what its dimensions are, which is an extra blob of data that needs to be passed around and slows things down substantially.

### Strided Arrays

Strided arrays solve these problems by considering array storage in a more general form. Instead of keeping a fixed convention for indexing, they write the index of an element in array as a general linear function of its coordinate.  For example,

var x = A[ c0 * i + c1 * j + ... ]

The result is that they achieve the same performance as flattened typed arrays, but gain the advantage that they can be easily sliced, transposed or reversed just by changing the stride.  For example, if we wanted to switch the x/y axes of an image, all that we would need to do is swap the strides:

var x = A[ c1 * i + c0 * j + ... ]

Setting a stride to a negative value can be used to reverse an axis, and changing the shape of an array lets us easily crop out subarrays from higher dimensional arrays.

## Introducing ndarray

At least theoretically, strided arrays sound like a great idea and so to see how well they worked in practice I created the following library:

### https://github.com/mikolalysenko/ndarray

Essentially, an ndarray is an object that has 4 pieces of data:

• data – The underlying typed array
• shape – The shape of the ndarray
• stride  – The stride of the ndarray
• offset – A pointer to the first element in the array

To look up an item in an ndarray you use the formula:

data[stride[0] * i[0] + stride[1] * i[1] + ... + offset]

You can download it and try it out yourself using npm, or check out some projects already built on top of ndarray here.

## But is it fast?

To see how each of these multidimensional array representations compare, I decided to compare their performance on a simple component-wise operation for various shapes of arrays.  In pseudocode, this operation looks something like:

function test(A, B) {
for(i in range) {
A[i] += B[i] + 0.1
B[i] -= A[i] * 0.5
}
}

I ran this simple loop using different variations of multidimensional arrays with various shapes.  The following chart summarizes the results:

You can find more detailed results and further experiments (including direct comparisons using numeric.js) at the github page for this project:

https://github.com/mikolalysenko/ndarray-experiments

These results are somewhat typical, indexing in an ndarray is a little slower than directly walking over a typed array, but not by much.  For arrays of arrays, the performance gets progressively worse as the first axes get larger, but even in the best case they trail far behind data structures built on typed arrays.

## Next time

The natural question to ask is can we do even better?  And the answer is yes.  The key to getting further performance improvements is to batch operations together allowing for full array comprehensions.  I’ll discuss this in the next post, but for now you can take a look at my current implementation of bulk operations on ndarrays in the cwise library.

## Comparing Sequences Without Sorting

This is a post about a silly (and mostly pointless) optimization.  To motivate this, consider the following problem which you see all the time in mesh processing:

Given a collection of triangles represented as ordered tuples of indices, find all topological duplicates.

By a topological duplicate, we mean that the two faces have the same vertices.  Now you aren’t allowed to mutate the faces themselves (since orientation matters), but changing their order in the array is ok.  We could also consider a related problem for edges in a graph, or for tetrahedra in mesh.

There are of course many easy ways to solve this, but in the end it basically boils down to finding some function or other for comparing faces up to permutation, and then using either sorting or hashing to check for duplicates.  It is that comparison step that I want to talk about today, or in other words:

How do we check if two faces are the same?

In fact, what I really want to talk about is the following more general question where we let the lengths of our faces become arbitrarily large:

Given a pair of arrays, A and B, find a total order $\leq$ such $A \leq B \: \mathrm{and} B \leq A$ if and only if $A = \pi(B)$ up to some permutation $\pi$.

For example, we should require that:

[0, 1, 2] $\leq$ [2, 0, 1]

[2, 0, 1] $\leq$ [0, 1, 2]

And for two sequences which are not equal, like [0,1,2] and [3, 4, 5] we want the ordering to be invariant under permutation.

## Obvious Solution

An obvious solution to this problem is to sort the arrays element-wise and return the result:

function compareSimple(a, b) {
if(a.length !== b.length) {
return a.length - b.length;
}
var as = a.slice(0)
, bs = b.slice(0);
as.sort();
bs.sort();
for(var i=0; i<a.length; ++i) {
var d = as[i] - bs[i];
if(d) {
return d;
}
}
return 0;
}

This follows the standard C/JavaScript convention for comparison operators, where it returns -1 if a comes before b, +1 if a comes after b and 0 if they are equal.  For large enough sequences, this isn’t a bad solution.  It takes $O(n \log(n))$ time and is pretty straightforward to implement.  However, for smaller sequences it is somewhat suboptimal.  Most notably, it makes a copy of a and b, which introduces some overhead into the computation and in JavaScript may even trigger a garbage collection event (bad news).  If we need to do this on a large mesh, it could slow things down a lot.

An obvious way to fix this would be to try inlining the sorting function for small values of n (which is all we really care about), but doing this yourself is pure punishment.  Here is an optimized version for length 2 sets:

function compare2Brutal(a, b) {
if(a[0] < a[1]) {
if(b[0] < b[1]) {
if(a[0] === b[0]) {
return a[1] - b[1];
}
return a[0] - b[0];
} else {
if(a[0] === b[1]) {
return a[1] - b[0];
}
return a[0] - b[1];
}
} else {
if(b[0] < b[1]) {
if(a[1] === b[0]) {
return a[0] - b[1];
}
return a[1] - b[0];
} else {
if(a[1] === b[1]) {
return a[0] - b[0];
}
return a[1] - b[1];
}
}
}

If you have any aesthetic sensibility at all, then that code probably makes you want to vomit.   And that’s just for length two arrays!  You really don’t want to see what the version for length 3 arrays looks like.  But it is faster by a wide margin.  I ran a benchmark to try comparing how fast these two approaches were at sorting an array of 1000 randomly generated tuples, and here are the results:

• compareSimple:  5864 µs
• compareBrutal: 404 µs

That is a pretty good speed up, but it would be nice if there was a prettier way to do this.

Symmetry

The starting point for our next approach is the following screwy idea:  what if we could find a hash function for each face that was invariant under permutations?  Or even better, if the function was injective up to permutations, then we could just use the symmetrized hash compare two sets.  At first this may seem like a tall order, but if you know a little algebra then there is an obvious way to do this; namely you can use the (elementary) symmetric polynomials.  Here is how they are defined:

Given a collection of $n$ numbers, $a_0, a_1, ..., a_{n-1}$ the kth elementary symmetric polynomial, $S_{n,k}$ is the coefficient of $x^{n-k-1}$ in the polynomial:

$(x + a_0) (x + a_1) ... (x + a_{n-1}) = S_{n,n-1} + S_{n,n-2} x + ... + S_{n,0} x^{n-1} + x^n$

For example, if $n = 2$, then the symmetric polynomials are just:

$S_{2,0} = a_0 + a_1$

$S_{2,1} = a_0 a_1$

And for $n = 3$ we get:

$S_{3,0} = a_0 + a_1 + a_2$

$S_{3,1} = a_0 a_1 + a_0 a_2 + a_1 a_2$

$S_{3,2} = a_0 a_1 a_2$

The neat thing about these functions is that they contain enough data to uniquely determine $a_0, a_1, ..., a_{n-1}$ up to permutation.  This is a consequence of the fundamental theorem for elementary symmetric polynomials, which basically says that these $S_{n,k}$ polynomials form a complete independent basis for the ring of all symmetric polynomials.  Using this trick, we can formulate a simplified version of the sequence comparison for $n=2$:

function compare2Sym(a, b) {
var d = a[0] + a[1] - b[0] - b[1];
if(d) {
return d;
}
return a[0] * a[1] - b[0] * b[1];
}

Not only is this way shorter than the brutal method, but it also turns out to be a little bit faster.  On the same test, I got:

• compare2Sym: 336 µs

Which is about a 25% improvement over brute force inlining.  The same idea extends to higher n, for example here is the result for $n=3$:

function compare3Sym(a, b) {
var d = a[0] + a[1] + a[2] - b[0] - b[1] - b[2];
if(d) {
return d;
}
d = a[0] * a[1] + a[0] * a[2] + a[1] * a[2] - b[0] * b[1] - b[0] * b[2] - b[1] * b[2];
if(d) {
return d;
}
return a[0] * a[1] * a[2] - b[0] * b[1] * b[2];
}

Running the sort-1000-items test against the simple method gives the following results:

• compareSimple: 7637 µs
• compare3Sym:  352 µs

### Computing Symmetric Polynomials

This is all well and good, and it avoids making a copy of the arrays like we used in the basic sorting method.  However, it is also not very efficient.  If one were to compute the coefficients of a symmetric polynomial directly using the naive method we just wrote, then you would quickly end up with $O(2^n)$ terms!  That is because the number of terms in $\# S_{n,k} = { n \choose k+1 }$, and so the binomial theorem tells us that:

$\#S_{n,0} + \#S_{n,1} + \#S_{n,2} + ... + \#S_{n,n-1} = 2^n - 1$

A slightly better way to compute them is to use the polynomial formula and apply the FOIL method.  That is, we just expand the symmetric polynomials using multiplication.  This dynamic programming algorithm speeds up the time complexity to $O(n^2)$.  For example, here is an optimized version of the $n=3$ case:

function compare3SymOpt(a,b) {
var l0 = a[0] + a[1]
, m0 = b[0] + b[1]
, d  = l0 + a[2] - m0 - b[2];
if(d) {
return d;
}
var l1 = a[0] * a[1]
, m1 = b[0] * b[1];
d = l1 * a[2] - m1 * b[2];
if(d) {
return d;
}
return l0 * a[2] + l1 - m0 * b[2] - m1;
}

For comparison, the first version of compare3 used 11 adds and 10 multiplies, while this new version only uses 9 adds and 6 multiplies, and also has the option to early out more often.  This may seem like an improvement, but it turns out that in practice the difference isn’t so great.  Based on my experiments, the reordered version ends up taking about the same amount of time overall, more or less:

• compare3SymOpt: 356 µs

Which isn’t very good.  Part of the reason for the discrepancy most likely has something to do with the way compare3Sym gets optimized.  One possible explanation is that the expressions in compare3Sym might be easier to vectorize than those in compare3SymOpt, though I must admit this is pretty speculative.

But there is also a deeper question of can we do better than $O(n^2)$ asumptotically?  It turns out the answer is yes, and it requires the following observation:

Polynomial multiplication is convolution.

Using a fast convolution algorithm, we can multiply two polynomials together in $O(n \log(n))$ time.  Combined with a basic divide and conquer strategy, this gives an $O(n \log^2(n))$ algorithm for computing all the elementary symmetric functions. However, this is still worse than sorting the sequences!  It feels like we ought to be able to do better, but further progress escapes me.  I’ll pose the following question to you readers:

Question: Can we compute the n elementary symmetric polynomials in $O(n \log(n))$ time or better?

### Overflow

Now there is still a small problem with using symmetric polynomials for this purpose: namely integer overflow.  If you have any indices in your tuple that go over 1000 then you are going to run into this problem once you start multiplying them together.  Fortunately, we can fix this problem by just working in a ring large enough to contain all our elements.  In languages with unsigned 32-bit integers, the natural choice is $\mathbb{Z}_{2^{32}}$, and we get these operations for free just using ordinary arithmetic.

But life is not so simple in the weird world of JavaScript!  It turns out for reasons that can charitably be described as “arbitrary” JavaScript does not support a proper integer data type, and so every number type in the language gets promoted to a double when it overflows whether you like it or not (mostly not).  The net result:  the above approach messes up.  One way to fix this is to try applying a modulus operator at each step, but the results are pretty bad.  Here are the timings for a modified version of compare2Sym that enforces bounds:

That’s more than a 5-fold increase in running time, all because we added a few innocent bit masks!  How frustrating!

Semirings

The weirdness of JavaScript suggests that we need to find a better approach.  But in the world of commutative rings, it doesn’t seem like there are any immediately good alternatives.  And so we must cast the net wider.  One interesting possibility is to extend our approach to include semirings.  A semiring is basically a ring where we don’t require addition to be invertible, hence they are sometimes called “rigs” (which is short for a “ring without negatives”, get it? haha).

Just like a ring, a semiring is basically a set $S$ with a pair of operators $\oplus, \otimes$ that act as generalized addition and multiplication.  You also have a pair of elements, $0,1 \in S$ which act as the additive and multiplicative identity.  These things then have to satisfy a list of axioms which are directly analogous to those of the natural numbers (ie nonnegative integers).  Here are a few examples of semirings, which you may or may not be familiar with:

• The complex numbers are semiring (and more generally, so is every field)
• The integers are a semiring (and so is every other ring)
• The natural numbers are a semiring (but not a ring)
• The set of Boolean truth values, $\mathbb{B} = \{ \mathrm{true}, \mathrm{false} \}$ under the operations OR, AND is a semiring (but is definitely not a ring)
• The set of reals under the operations min,max is a semiring (and more generally so is any distributive lattice)
• The tropical semiring, which is the set of reals under (max, +) is a semiring.

In many ways, semirings are more fundamental than rings.  After all, we learn to count stuff using the natural numbers long before we learn about integers.  But they are also much more diverse, and some of our favorite definitions from ring theory don’t necessarily translate well.  For those who are familiar with algebra, probably the most shocking thing is that the concept of an ideal in a ring does not really have a good analogue in the language of semirings, much like how monoids don’t really have any workable generalization of a normal subgroup.

### Symmetric Polynomials in Semirings

However, for the modest purposes of this blog post, the most important thing is that we can define polynomials in a semiring (just like we do in a ring) and that we therefore have a good notion of an elementary symmetric polynomial.  The way this works is pretty much the same as before:

Let $R$ be a semiring under $\oplus, \otimes$; then we have the symmetric functions.  Then for two variables, we have the symmetric functions:

$S_{2,0} = a_0 \oplus a_1$

$S_{2,1} = a_0 \otimes a_1$

And for $n=3$ we get:

$S_{3,0} = a_0 \oplus a_1 \oplus a_2$

$S_{3,1} = (a_0 \otimes a_1) \oplus (a_0 \otimes a_2) \oplus (a_1 \otimes a_2)$

$S_{3,2} = a_0 \otimes a_1 \otimes a_2$

And so on on…

### Rank Functions and (min,max)

Let’s look at what happens if we fix a particular semiring, say the (min,max) lattice.  This is a semiring over the extended real line $\mathbb{R} \cup \{ - \infty, \infty \}$ where:

• $\oplus \mapsto \mathrm{min}$
• $\otimes \mapsto \mathrm{max}$
• $0 \mapsto \infty$
• $1 \mapsto -\infty$

Now, here is a neat puzzle:

Question: What are the elementary symmetric polynomials in this semiring?

Here is a hint:

$S_{2,0} = \min(a_0, a_1)$

$S_{2,1} = \max(a_0, a_1)$

And…

$S_{3,0} = \min(a_0, a_1, a_2)$

$S_{3,1} = \min( \max(a_0, a_1), \max(a_0, a_2), \max(a_1, a_2) )$

$S_{3,2} = \max(a_0, a_1, a_2)$

Give up?  These are the rank functions!

Theorem: Over the min,max semiring, $S_{n,k}$ is the kth element of the sorted sequence $a_0, a_1, ...$

In other words, evaluating the symmetric polynomials over the min/max semiring is the same thing as sorting.  It also suggests a more streamlined way to do the brutal inlining of a sort:

function compare2Rank(a, b) {
var d = Math.min(a[0],a[1]) - Math.min(b[0],b[1]);
if(d) {
return d;
}
return Math.max(a[0],a[1]) - Math.max(b[0],b[1]);
}

Slick!  We went from 25 lines down to just 5.  Unfortunately, this approach is a bit less efficient since it does the comparison between a and b twice, a fact which is reflected in the timing measurements:

• compare2Rank: 495 µs

And we can also easily extend this technique to triples as well:

function compare3Rank(a,b) {
var l0 = Math.min(a[0], a[1])
, m0 = Math.min(b[0], b[1])
, d  = Math.min(l0, a[2]) - Math.min(m0, b[2]);
if(d) {
return d;
}
var l1 = Math.max(a[0], a[1])
, m1 = Math.max(b[0], b[1]);
d = Math.max(l1, a[2]) - Math.max(m1, b[2]);
if(d) {
return d;
}
return Math.min(Math.max(l0, a[2]), l1) - Math.min(Math.max(m0, b[2]), m1);
}
• compare3Rank: 618.71 microseconds

### The Tropical Alternative

Using rank functions to sort the elements turns out to be much simpler than doing a selection sort, and it is way faster than calling the native JS sort on small arrays while avoiding the overflow problems of integer symmetric functions.  However, it is still quite a bit slower than the integer approach.

The final method which I will now describe (and the one which I believe to be best suited to the vagaries of JS) is to compute the symmetric functions over the (max,+), or tropical semiring.  It is basically a semiring over the half-extended real line, $\mathbb{R} \cup \{ - \infty \}$ with the following operators:

• $\oplus \mapsto \max$
• $\otimes \mapsto +$
• $0 \mapsto -\infty$
• $1 \mapsto 0$

There is a cute story for why the tropical semiring has such a colorful name, which is that it was popularized at an algebraic geometry conference in Brazil.  Of course people have known about (max,+) for quite some time before that and most of the pioneering research into it was done by the mathematician Victor P. Maslov in the 50s.  The (max,+) semiring is actually quite interesting and plays an important role in the algebra of distance transforms, numerical optimization and the transition from quantum systems to classical systems.

This is because the (max,+) semiring works in many ways like the large scale limit of the complex numbers under addition and multiplication.  For example, we all know that:

$\log(a b) = \log(a) + \log(b)$

But did you also know that:

$\log(a + b) = \max(\log(a), \log(b)) + O(1)$

This basically a formalization of that time honored engineering philosophy that once your numbers get big enough, you can start to think about them on a log scale.  If you do this systematically, then you eventually will end up doing arithmetic in the (max,+)-semiring.  Masolv asserts that this is essentially what happens in quantum mechanics when we go from small isolated systems to very big things.  A brief overview of this philosophy that has been translated to English can be found here:

Litvinov, G. “The Maslov dequantization, idempotent and tropical mathematics: a very brief introduction” (2006) arXiv:math/0501038

The more detailed explanations of this are unfortunately all store in thick, inscrutable Russian text books (but you can find an English translation if you look hard enough):

Maslov, V. P.; Fedoriuk, M. V. “Semiclassical approximation in quantum mechanics”

But enough of that digression!  Let’s apply (max,+) to the problem at hand of comparing sequences.  If we expand the symmetric polynomials in the (max,+) world, here is what we get:

$S_{2,0} = \max(a_0, a_1)$

$S_{2,1} = a_0 + a_1$

And for $n = 3$:

$S_{3,0} = \max(a_0, a_1, a_2)$

$S_{3,1} = \max(a_0+a_1, a_0+a_2, a_1+a_2)$

$S_{3,2} = a_0+a_1+a_2$

If you stare at this enough, I am sure you can spot the pattern:

Theorem: The elementary symmetric polynomials on the (max,+) semiring are the partial sums of the sorted sequence.

This means that if we want to compute the (max,+) symmetric polynomials, we can do it in $O(n \log(n))$ by sorting and folding.

Working the (max,+) solves most of our woes about overflow, since adding numbers is much less likely to go beyond INT_MAX.  However, we will tweak just one thing: instead of using max, we’ll flip it around and use min so that the values stay small.  Both theoretically and practically, it doesn’t save much but it gives us a maybe a fraction of a bit of extra address space to use for indices.  Here is an implementation for pairs:

function compare2MinP(a, b) {
var d = a[0]+a[1]-b[0]-b[1];
if(d) {
return d;
}
return Math.min(a[0],a[1]) - Math.min(b[0],b[1]);
}

And it clocks in at:

• compare2MinP: 354 µs

Which is a bit slower than the symmetric functions, but still way faster than ranking.  We can also play the same game for $n=3$:

function compare3MinP(a, b) {
var l1 = a[0]+a[1]
, m1 = b[0]+b[1];
d = l1+a[2] - (m1+b[2]);
if(d) {
return d;
}
var l0 = Math.min(a[0], a[1])
, m0 = Math.min(b[0], b[1])
, d  = Math.min(l0, a[2]) - Math.min(m0, b[2]);
if(d) {
return d;
}
return Math.min(l0+a[2], l1) - Math.min(m0+b[2], m1);
}

Which hits:

• compare3MinP: 382 µs

Again, not quite as fast as integers, but pretty good for JavaScript.

Summary

You can get all the code to run these experiments on github:

https://github.com/mikolalysenko/set-compare

And here are the results that I got from running the experiment, all collated for easy reading:

### Dimension = 2

• compareSimple: 5982 µs
• compare2Brutal: 415 µs
• compare2Sym: 352 µs
• compare2Rank: 498 µs
• compare2MinP: 369 µs

### Dimension = 3

• compareSimple: 7737 µs
• compare3Sym: 361 µs
• compare3Sym_opt: 362 µs
• compare3Rank: 612 µs
• compare3MinP: 377 µs

As you can see, the (min,+) solution is nearly as fast as the symmetric version without having the same overflow problems.

I hope you enjoyed reading this as much as I enjoyed tinkering around!  Of course I still don’t know of an optimal way to compare two lists.  As a final puzzle, I leave you with the following:

Question: Is there any method which can test if two unordered sequences are equal in linear time and at most log space?

Frankly, I don’t know the answer to this question and it may very well be impossible.  If you have any thoughts or opinions on this, please leave a comment!