Cache oblivious array operations

I must confess that the article I wrote last time (in which I introduced ndarrays) was secretly just a pretext for the stuff that I am going to write about today: the cwise library for array operations in JavaScript.

Array operations

Briefly, array operations are component-wise operations that can be applied across multiple multidimensional arrays simultaneously.  Array operations can be used to implement basic vector arithmetic, like addition or scaling, and are a fundamental tool in linear algebra and numerical computing in general.  Because they are so ubiquitous, many languages have special syntax for array operations as well as routines for optimizing their performance; for example, in MATLAB prefixing an operator with . makes it execute component-wise.  In the functional programming world, array operations can be implemented as a sequence of zip/map/reduce higher order functions, while in a language with procedural constructs like C or JavaScript you would probably just use a for-loop like this:

for(var i=0; i<shape[0]; ++i) {
  for(var j=0; j<shape[1]; ++j) {
    // ... do operation ...
    a[i][j] += b[i][j]
  }
}

The goal of this article is to investigate different ways to implement these sorts of array operations across multiple ndarrays of various shapes and strides.

Simple algorithm

The simple for-loop algorithm is a good starting point for implementing operations on ndarrays.  However, performing a full index lookup at each loop iteration introduces an overhead of O(d) (where d is the dimension of the ndarray) due to the extra indexing and multiplication required.  To avoid this, we can compute the index into the underlying array incrementally.  As a sketch of how this works, consider the following pseudocode:

var a_ptr = a.offset
  , b_ptr = b.offset

for(var i=0; i<a.shape[0]; ++i) {
  for(var j=0; j<a.shape[1]; ++j) {
    //... do operation ...
    a.data[a_ptr] += b.data[b_ptr]

    //Advance pointers along axis 1:
    a_ptr += a.stride[1]
    b_ptr += b.stride[1]
  }
  //Advance pointers along axis 0:
  a_ptr += a.stride[0] - a.stride[1] * a.shape[1]
  b_ptr += b.stride[0] - b.stride[1] * a.shape[1]
}

Using this new algorithm, the next index for each array can be computed by just adding a constant shift to the pointer instead of performing a series of multiplies.  This is indeed a good thing, and one can easily show that in the conventional RAM model of computation (which is heavily emphasized in undergraduate CS curricula) this approach is optimal.

But as everyone who has ever tried hard to optimize code for processing arrays knows, this is hardly the full story.  Real computers have hierarchical memory, and execute block IO operations on chunks of cache.  Failing to take this into account when analyzing an algorithm can lead to wildly different (and inaccurate) conclusions about its performance.

Two-level memory and striding

So if we want to come up with a better algorithm for array operations, we need a better theory of how our computer works.  Perhaps the simplest extension of the RAM model which does this is the two-level memory model.  The basic idea of the two-level model is that you are allowed to operate on memory in contiguous chunks of up to B words at a time.  This is basically a block IO version of the RAM model, and it accurately models many realistic algorithms.  In general, the running time of an algorithm in the RAM model is an upper bound on its running time in the two-level model, and the most we can ever speed up an algorithm in the two-level model is by a factor of O(B).

This fact is especially important when we are considering array operations, and it illustrates the importance of striding in our algorithms.  For example, suppose that in the above algorithm the arrays were contiguously packed in row major order; then each of the memory reads could be executed sequentially and the running time would be O(1 + \frac{\text{shape}[0]\text{shape}[1]}{B}).  On the other hand, if the arrays were stored in column major order and we looped over them in the same order as before, then we would get no advantage from blocked IO operations and the cost of executing the algorithm would be O( \text{shape}[0] \text{shape}[1] ).

To fix this, we should order the loops so that the axes with the largest strides are iterated in the outer loops and the axes with the smallest strides are iterated innermost.  This makes the most efficient use of each block memory operation, and it reduces the total computation time by a factor of O(B) (which is optimal) provided the stride orderings are compatible across the arrays.
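
As a concrete sketch (for the 2D case, and assuming both arrays have the same shape and a compatible stride ordering), we could pick the loop order at run time by comparing the magnitudes of the strides:

function addCompatible(a, b) {
  //Put the axis with the larger stride in the outer loop
  var outer = 0, inner = 1
  if(Math.abs(a.stride[0]) < Math.abs(a.stride[1])) {
    outer = 1
    inner = 0
  }
  var a_ptr = a.offset
    , b_ptr = b.offset
  for(var i=0; i<a.shape[outer]; ++i) {
    for(var j=0; j<a.shape[inner]; ++j) {
      a.data[a_ptr] += b.data[b_ptr]
      a_ptr += a.stride[inner]
      b_ptr += b.stride[inner]
    }
    a_ptr += a.stride[outer] - a.stride[inner] * a.shape[inner]
    b_ptr += b.stride[outer] - b.stride[inner] * a.shape[inner]
  }
}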

Cache aware algorithm

So the next question is what happens if the strides are not compatible?  For example, suppose that a was column major while b was row major?  Then in the two level model there is no way to take advantage of block memory operations to speed up our loop, and so it would seem that we are back to RAM model performance.  However, there is an important aspect of hierarchical memory which the two level model neglects: caching.

But before we can understand and exploit caching, we need some sort of model to describe how it works.  For this purpose, the minimal extension of the two-level memory model is known as the external memory model.  As in the two-level memory model, we can read and write memory in chunks, but in addition there is a separate (small and finite) memory called the cache which can store up to M words at a time.  We assume that accessing the cache is infinitely fast, and so the time complexity of any algorithm in the external memory model is bounded by the number of reads/writes to the external memory source.

This model may seem slightly absurd, since it is possible to execute things like infinite loops for free (as long as they only touch the cache).  However, it turns out to be pretty accurate for numerical computations and data structures, which are all asymptotically limited by the amount of time they spend moving bits around at the slowest level of the memory hierarchy.

Now it turns out that within the external memory model there is an asymptotically efficient algorithm for array operations that works regardless of their stride.  The first place where this seems to have been written down is in the following paper:

A. Aggarwal, J. Vitter. “The input/output complexity of sorting and related problems” (1987) CACM

The general idea is that we break our arrays down into chunks whose shapes are on the order of B in each dimension.  Then we iterate over the whole array in these chunks and execute the array operation on each chunk independently.  In pseudocode, it works like this:

//Assume shape[0], shape[1] are multiples of B for simplicity
for(var i=0; i<shape[0]; i+=B) {
  for(var j=0; j<shape[1]; j+=B) {
    var a_ptr = a.offset + i * a.stride[0] + j * a.stride[1]
    var b_ptr = b.offset + i * b.stride[0] + j * b.stride[1]
    for(var k=0; k<B; ++k) {
      for(var l=0; l<B; ++l) {
        a.data[a_ptr] += b.data[b_ptr]
        a_ptr += a.stride[1]
        b_ptr += b.stride[1]
      }
      a_ptr += a.stride[0] - a.stride[1] * B
      b_ptr += b.stride[0] - b.stride[1] * B
    }
  }
}

Now if M > B^d, then this approach can be used to speed up any array operation by a factor of B, regardless of the strides!  Pretty neat!

Cache oblivious algorithm

The above approach is awesome, but it only works if we know how big B and M are.  So how can we figure out what these parameters are?  One possible solution is to run lots of experiments and directly optimize those values.  This benchmark-first approach to performance tuning works pretty well for known hardware configurations, like game consoles or iPhones, and can produce awesome results.  However, it is obviously not very robust, nor is it even feasible in all situations: web applications, for example, need to perform well across many different browsers and devices.

Similarly, we often can’t count on the values of M and B remaining constant throughout the lifetime of a program.  Contention for disk, RAM and CPU cache can cause the amount of available cache and bandwidth to fluctuate, and so we really shouldn’t assume these quantities are constant.

Finally, there is also the small problem that computers typically have more than one level of caching, and optimizing array operations for multiple levels of cache would require introducing one extra loop per layer, which can be quite impractical to implement.

Fortunately there is a single elegant model of computation which solves all of these problems at once.  It is known as the cache oblivious memory model, and it was invented by Harald Prokop:

H. Prokop. “Cache-oblivious algorithms” (1999) MIT Master’s Thesis.

For the number of problems that it solves, the cache oblivious model of computation is remarkably simple: it is just the external memory model, except that we are not allowed to know B and M.  But despite this seemingly trivial change, the cache oblivious model has a number of remarkable properties:

  1. Algorithms that scale in the cache oblivious model scale across all levels of the memory hierarchy simultaneously.
  2. Cache oblivious algorithms “just work” across any type of hardware or virtual machine with no fine tuning.
  3. Cache oblivious algorithms automatically throttle with hardware availability and scale performance accordingly.
  4. Finally, programs written for the cache oblivious model look just like ordinary RAM model programs!  There is no need to introduce extra notation or operations.

These properties make the cache oblivious model the gold standard for analyzing high performance algorithms, and it should always be our goal to design data structures and algorithms that scale in this model.  However, to achieve these benefits we need to do a bit more work up front when we are designing our algorithms.

For array operations, a simple way to do this is to adapt Prokop’s cache oblivious matrix transpose algorithm (see the above thesis for details).  The way it works is that we keep recursively splitting our multidimensional arrays in half along their largest axis until we are finally left with blocks of unit size.  In pseudo-JavaScript using ndarrays, it would look like this:

function recursiveAdd(a, b) {
  var rows = a.shape[0]
    , cols = a.shape[1]
  if(rows === 1 && cols === 1) {
    a.data[a.offset] += b.data[b.offset]
  } else if(rows > cols) {
    rows >>>= 1
    recursiveAdd(a.hi(rows, cols), b.hi(rows, cols))
    recursiveAdd(a.lo(rows, 0), b.lo(rows, 0))
  } else {
    cols >>>= 1
    recursiveAdd(a.hi(rows, cols), b.hi(rows, cols))
    recursiveAdd(a.lo(0, cols), b.lo(0, cols))
  }
}

The above code works regardless of the strides and always uses an optimal O(1+\frac{\text{shape}[0]\text{shape}[1]}{B}) block memory transfers, provided the cache is at least M \geq B^2.  It is also pretty easy to generalize this idea to multiple arrays with higher dimensions, which again performs optimally assuming that M \geq n B^d, where n is the number of array arguments and d is the dimension of the arrays.
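
As a rough sketch of how the higher dimensional version might look (written against the same hi/lo slicing interface as above, and only for a pair of arrays; a real implementation would accept any number of arguments), the recursion just splits along whichever axis of the current block is largest:

function recursiveAddND(a, b) {
  //Find the largest axis of the current block
  var axis = 0
  for(var i=1; i<a.shape.length; ++i) {
    if(a.shape[i] > a.shape[axis]) {
      axis = i
    }
  }
  if(a.shape[axis] === 1) {
    //All axes have length 1, so apply the operation to the single cell
    a.data[a.offset] += b.data[b.offset]
    return
  }
  //Split the block in half along the chosen axis
  var mid = a.shape[axis] >>> 1
  var hiArgs = [], loArgs = []
  for(var j=0; j<a.shape.length; ++j) {
    hiArgs.push(j === axis ? mid : a.shape[j])
    loArgs.push(j === axis ? mid : 0)
  }
  recursiveAddND(a.hi.apply(a, hiArgs), b.hi.apply(b, hiArgs))
  recursiveAddND(a.lo.apply(a, loArgs), b.lo.apply(b, loArgs))
}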

Array operations for JavaScript

The above cache oblivious algorithm is clearly what we want to implement, but the code in the previous listing is just a sketch, and if implemented naively it is not likely to be very fast (at least in JavaScript).  There are quite a few important JS-specific micro-optimizations that we ought to apply before seriously implementing cache-oblivious array operations:

  • Slicing the arrays creates a small amount of garbage at each level of recursion, so we should flatten out the objects and track their structure incrementally in the arguments to avoid unnecessary memory allocation.
  • Using a recursive function call is much slower than manually keeping track of the pending calls on an explicit stack.
  • Similarly, we should use the pointer tricks from the simple algorithm to speed up array indexing.
  • Also, if the strides match, we should use the direct pointer arithmetic version, as it has lower overhead.
  • For small array sizes the overhead of recursion is much larger than the savings due to improved cache behavior, so we want to terminate the subdivision some time before getting all the way down to unit arrays.  (This is not related to the size of the cache blocks, since it is a function of the amount of CPU overhead, not memory bandwidth.)  A sketch combining this cutoff with the pointer tricks above follows this list.
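
To illustrate just the pointer tricks and the recursion cutoff from the list above, here is a rough sketch of the 2D version with a base case that falls back to the direct pointer loop.  The cutoff of 32 is only a placeholder, not a tuned constant:

var CUTOFF = 32   //placeholder block size; in practice this would be tuned

function recursiveAdd2(a, b) {
  var rows = a.shape[0]
    , cols = a.shape[1]
  if(rows <= CUTOFF && cols <= CUTOFF) {
    //Base case: the block is small enough, so use the direct pointer loop
    var a_ptr = a.offset
      , b_ptr = b.offset
    for(var i=0; i<rows; ++i) {
      for(var j=0; j<cols; ++j) {
        a.data[a_ptr] += b.data[b_ptr]
        a_ptr += a.stride[1]
        b_ptr += b.stride[1]
      }
      a_ptr += a.stride[0] - a.stride[1] * cols
      b_ptr += b.stride[0] - b.stride[1] * cols
    }
  } else if(rows > cols) {
    rows >>>= 1
    recursiveAdd2(a.hi(rows, cols), b.hi(rows, cols))
    recursiveAdd2(a.lo(rows, 0), b.lo(rows, 0))
  } else {
    cols >>>= 1
    recursiveAdd2(a.hi(rows, cols), b.hi(rows, cols))
    recursiveAdd2(a.lo(0, cols), b.lo(0, cols))
  }
}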

Code!

All of these issues can be pretty tricky to get right, and so it makes sense to try to push all of this complexity into a library.  This is what the cwise library does, and you can get it here:

https://github.com/mikolalysenko/cwise

That library should work in any JavaScript environment that supports typed arrays, including most current browsers (IE 9+, Firefox, Chrome and Safari).  If you want to try it out in your own projects, you can install it via npm or use browserify (or some similar tool) to stick it in your web page.

Macros and dynamic code generation

The basic idea behind cwise is similar to libraries like FFTW, which dynamically generate tuned code based on their inputs (though because it is much easier to do dynamic code generation in JavaScript, the interface for cwise is a bit simpler).  The way it works is that cwise lazily compiles optimized scans for array operations based on the shape and stride of the input arguments.  You specify the component-wise operation using ordinary JavaScript, which gets parsed and compiled using esprima/falafel at run time.  This is only done the first time you execute a cwise function; all subsequent calls reuse the same optimized script.
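
As a rough sketch of what this looks like in practice (following cwise's args/body style of definition; see the project README for the precise interface), a component-wise add-and-assign could be set up like this:

var cwise = require("cwise")

//Define a compiled component-wise a += b operation.  The body is ordinary
//JavaScript that cwise parses and inlines into the generated loop.
var addeq = cwise({
  args: ["array", "array"],
  body: function(a, b) {
    a += b
  }
})

//Later, apply it to any pair of ndarrays with matching shapes:
//addeq(X, Y)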

Tricks

There are a lot of useful things that you can do with ndarrays using cwise.  There are plenty of recipes on the GitHub page, but to help get things started here is a quick survey of some of the stuff you can do:

Vector arithmetic

The most obvious use of cwise is to implement basic vector arithmetic.  You can grab a handful of prebaked ready-to-go operations from the following github project:

https://github.com/mikolalysenko/ndarray-ops

As a rule, using these sorts of helper methods is not as efficient as writing a dedicated cwise operation, but on the other hand they can simplify your code, and for basic tasks they are often fast enough.
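
For example, something along the following lines should add one array into another and then scale it (the addeq/mulseq names follow ndarray-ops' naming scheme as I understand it; check the project page for the definitive list of operations):

var ops = require("ndarray-ops")

//Add b into a component-wise, then scale a by 0.5
function addAndScale(a, b) {
  ops.addeq(a, b)      //a += b
  ops.mulseq(a, 0.5)   //a *= 0.5
}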

Matrix transpose

A more complicated example of using ndarrays is matrix transposition, or more generally cache oblivious reindexing of arrays, which is pretty easy to do by just changing the strides in the target.  For example, suppose we want to transpose a square image which is stored in column-major format.  Then we can do this using the following code:

var ops = require("ndarray-ops")
var ndarray = require("ndarray")
function transpose(output, input, size) {
  var a = ndarray.ctor(output, [size, size], [size, 1], 0)
    , b = ndarray.ctor(input, [size, size], [1, size], 0)
  ops.assign(a, b)
}

This is essentially the same thing that Prokop’s cache oblivious matrix transpose algorithm does.

Array copying

Another neat trick we can do with array operations is make copies of arrays in memory.  For example, if we want to fill an array with a repeating pattern, we can use the following snippet:

var ops = require("ndarray-ops")
var ndarray = require("ndarray")
function fillPattern(output, pattern, count) {
  var a = ndarray.ctor(output, [count, pattern.length], [pattern.length, 1], 0)
    , b = ndarray.ctor(pattern, [count, pattern.length], [0, 1], 0)
  ops.assign(a, b)
}

This generalizes to filling 2D images with repeated tiles or for stuff like slicing volume data apart into a texture.
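
Along the same lines, here is a hedged sketch of how the zero stride trick could tile a 2D row-major image with a smaller tile, reusing the ndarray.ctor interface from the snippets above (and assuming the image dimensions are exact multiples of the tile dimensions):

var ops = require("ndarray-ops")
var ndarray = require("ndarray")
function tileImage(output, tile, tilesY, tilesX, th, tw) {
  var W = tilesX * tw   //width of the output image in pixels
  //View the output as a 4D array indexed by (tileRow, tileCol, row, col),
  //and view the tile the same way with zero strides along the tile axes
  var a = ndarray.ctor(output, [tilesY, tilesX, th, tw], [th * W, tw, W, 1], 0)
    , b = ndarray.ctor(tile, [tilesY, tilesX, th, tw], [0, 0, tw, 1], 0)
  ops.assign(a, b)
}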

Experiments

So how much difference does all of this make?  To put some numbers on all of these different array operations, I ran a few quick experiments in node.js to try everything out.  You can check out the results for yourself on GitHub:

https://github.com/mikolalysenko/cwise-experiment

And here are the results summarized in graphical format:

Comparison of various algorithms for 2D array operations for different memory layouts.

We can draw the following conclusions from these measurements:

  • When possible, sequential iteration is the optimal implementation of array operations.
  • For non-compatible strides, cache aware array operations are fastest, but they require knowledge of the underlying architecture (which may not always be available).
  • Though cache oblivious array operations are almost never the fastest possible algorithm, they come closest across the widest array of possible inputs.

Conclusion

In summary, we’ve discussed an asymptotically optimal algorithm for cache-oblivious array operations across many multidimensional arrays with arbitrary shapes and strides.  We’ve also seen an implementation of these concepts in JavaScript via the ndarray and cwise libraries (which should work in node.js and in any browser that supports typed arrays).  These tools enable a wide variety of interesting image processing and linear algebra operations, but a lot more work is still necessary to realize a full suite of numerical linear algebra tools in JavaScript.  Here are a few possible ideas for future projects:

  • Tools for basic linear algebra computations (matrix inverse, iterative/direct solvers, eigenvalue problems)
  • Data structures for sparse multidimensional arrays
  • Libraries for implementing Einstein tensor summation/higher order map-reduce in a style similar to cwise.

Implementing Multidimensional Arrays in JavaScript

The past few months I’ve been working to move more of my work into JavaScript, and in particular the node.js ecosystem.  I hope that by doing this I will be able to create better demos and applications that are easier to use as well as develop.

Numerical Computing in a Hostile Environment

But despite the numerous practical advantages of working in JavaScript (especially from a user experience perspective), there are still many things that JS just sucks at.  For me the most painful of these is the weak support for numerical computing and binary data types.  These problems are due to fundamental limitations of the language, and are not something that can be papered over in a library.  For example, in JS there is:

  • No pass-by-value for non-primitive data types
  • No way to control memory alignment for members of objects
  • No way to restrict references to pointers/prevent aliasing
  • No way to ensure arrays of objects are allocated in place
  • No operator overloading
  • No array operations
  • No SIMD instructions and data types
  • … etc.

This situation is bad, but the benefits of using JS from an accessibility perspective are just too big to ignore, so it is at least worth trying to do as much as can be done within the limitations of the system.  Moreover, these obstacles are by no means unprecedented in the world of computing.  As an instructive example, consider the Java programming language: Java suffers from exactly the same design flaws listed above, and yet it is a viable platform for numerical computing with an enthusiastic community (though it is still nowhere near as fast as more efficiently designed languages like Fortran or C99).  So it is a plausible hypothesis that JavaScript could achieve similar results.

Now, reimplementing, say, LAPACK or SciPy is a huge project, and it is not reasonable to attempt it in a single article or even a single library.  To keep things focused, let us instead study the fundamental problem of implementing multidimensional arrays.

Multidimensional arrays

Multidimensional arrays are a natural first step towards building systems for solving numerical problems, and they are a generally useful data type in their own right, with applications in graphics and vision.  In a language like JavaScript, there are multiple ways to implement them.

Arrays-of-Arrays

Perhaps the most popular way to represent higher dimensional arrays in JavaScript is as arrays-of-arrays.  For example,

var A = [[1, 0, 0],
         [0, 1, 0],
         [0, 0, 1]]

This is the approach that numeric.js uses, and it has some merits.  For example, the syntax for accessing elements in an array-of-arrays is quite pleasingly simple:

var x = A[i][j]

And it does not require introducing any extra data structures or conventions into the language.

However, while arrays-of-arrays are by far the simplest solution, from a performance perspective they are disastrous.  The most obvious problem is that to allocate a d-dimensional array-of-arrays with O(n) elements along each axis, you need to store O(n^{d-1}) extra bytes worth of memory in pointers and intermediate arrays.  Additionally, accessing an element in an array-of-arrays requires O(d) dependent memory dereferences.  These operations can’t be easily pipelined and are terrible from a cache performance perspective.  If performance is even remotely close to being an objective, then we need to try for something better.

Typed Arrays

Fortunately, in both JavaScript and Java, there is a better alternative to arrays-of-arrays: namely typed arrays.  Using a typed array lets us guarantee that the data for our array is arranged contiguously with no wasted space.  For example, here is how we would create a typed array for the same matrix described earlier:

var A = new Float32Array([1, 0, 0, 0, 1, 0, 0, 0, 1])

The downside is that accessing elements from a typed array is a bit more complicated.  To do this we need to agree on some convention up front about how the elements in the array are arranged in memory.  In languages like C, this is typically done in lexicographic (also known as row major) order, while other languages like Fortran use a colexicographic (also called column major) ordering.  The choice between the two is totally arbitrary, but once you pick one you have to stick with it.  Assuming a row major ordering, we can then access an element from the array using the formula:

var x = A[3 * i + j]

Typed arrays can be very fast, but they are a bit clumsy to work with.  You need to keep track of both what convention you are using to store the array and what its dimensions are, which is an extra blob of data that needs to be passed around and can slow things down substantially.

Strided Arrays

Strided arrays solve these problems by treating array storage in a more general form.  Instead of fixing an indexing convention, they compute the index of an element in the array as a general linear function of its coordinates.  For example,

var x = A[ c0 * i + c1 * j + ... ]

The result is that they achieve the same performance as flattened typed arrays, but gain the advantage that they can be easily sliced, transposed or reversed just by changing the stride.  For example, if we wanted to switch the x/y axes of an image, all that we would need to do is swap the strides:

var x = A[ c1 * i + c0 * j + ... ]

Setting a stride to a negative value can be used to reverse an axis, and changing the shape of an array lets us easily crop out subarrays from higher dimensional arrays.
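
As a small sketch of the reversal trick, here is how a lookup into an n x n row-major matrix could visit the rows in reverse order, purely by negating the row stride and shifting the offset to the last row:

//Read element (i, j) of an n x n row-major matrix A with the rows reversed
function getRowReversed(A, n, i, j) {
  var stride0 = -n           //negative row stride walks the rows backwards
    , stride1 = 1
    , offset  = n * (n - 1)  //start at the beginning of the last row
  return A[offset + stride0 * i + stride1 * j]
}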

Introducing ndarray

At least in theory, strided arrays sound like a great idea, and so to see how well they work in practice I created the following library:

https://github.com/mikolalysenko/ndarray

Essentially, an ndarray is an object that has 4 pieces of data:

  • data – The underlying typed array
  • shape – The shape of the ndarray
  • stride  – The stride of the ndarray
  • offset – A pointer to the first element in the array

To look up an item in an ndarray you use the formula:

data[stride[0] * i[0] + stride[1] * i[1] + ... + offset]
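
As a tiny illustration (this is just the formula above written out by hand, not the library's actual accessor), the lookup for an arbitrary number of dimensions might look like:

//Compute the position in the underlying typed array for a multi-index,
//given an ndarray-like object with data, shape, stride and offset fields
function flatIndex(arr, index) {
  var ptr = arr.offset
  for(var i=0; i<arr.stride.length; ++i) {
    ptr += arr.stride[i] * index[i]
  }
  return ptr
}

//Then a lookup is just:
//var x = arr.data[flatIndex(arr, [i, j])]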

You can download it and try it out yourself using npm, or check out some of the projects that are already built on top of ndarray.

But is it fast?

To see how each of these multidimensional array representations compares, I decided to measure their performance on a simple component-wise operation for various shapes of arrays.  In pseudocode, this operation looks something like:

function test(A, B) {
  for(i in range) {
    A[i] += B[i] + 0.1
    B[i] -= A[i] * 0.5
  }
}

I ran this simple loop using different variations of multidimensional arrays with various shapes.  The following chart summarizes the results:

Summary of array algorithm performance for various implementations.

You can find more detailed results and further experiments (including direct comparisons using numeric.js) at the github page for this project:

https://github.com/mikolalysenko/ndarray-experiments

These results are fairly typical: indexing into an ndarray is a little slower than directly walking over a typed array, but not by much.  For arrays-of-arrays, the performance gets progressively worse as the leading axes get larger, and even in the best case they trail far behind the data structures built on typed arrays.

Next time

The natural question to ask is: can we do even better?  And the answer is yes.  The key to getting further performance improvements is to batch operations together, allowing for full array comprehensions.  I’ll discuss this in the next post, but for now you can take a look at my current implementation of bulk operations on ndarrays in the cwise library.