Wednesday, August 17, 2005

Persistent Datastructures

Many moons ago I wrote a short tutorial on pthreads. It was adopted (with permission) by a number of universities to provide a brief introduction for their students doing pthread based coursework. It strikes me as bizzare, but the tutorial remains the top google hit for 'pthreads tutorial'. Consequently I periodically get requests for help with threading issues.

When I wrote the tutorial I was still enamoured with threads, an infatuation that didn't last very long :). Since then, my experience working with pthreads, auto-parallelising compilers, java, erlang, kowari, and 10 years of professional programming, have fundamentally changed my perspective on how to approach concurrency.

  1. Concurrency bad
  2. See 1
  3. Shared-memory concurrency very very bad
  4. see 3
However we can't always avoid concurrency so:
  • Process isolation good
  • Message-Passing/Rendevous good
  • Co-routine/Continuation good
Again, sometimes we can't avoid shared-memory concurrency if we want to make good use of some modern cpu architectures so:
  • Locks are bad - you've accepted all the pain and suffering associated with shared-memory concurrency, and now you're going to SERIALISE operations!!

Assuming you are stuck using shared-memory threads, what do you do now? Well first hope like hell you have a garbage collector. If you don't it is going to hurt enough I would suggest you give up now and go back and reconsider the reasons why you discounted the other options above -- anything is better than this. If you do have a garbage collector, start reading up on persistent datastructures.

Persistent datastructures are very important in safe efficient thread-based design because they don't require locking. They are called persistant because they are immutable. Once you have a reference to a structure that structure is guaranteed const. What this means is that any method that would mutate an empherial (ie, what you are used to) structure returns a new structure.

As an example consider a list

x = []              | x = null 
y = x.prepend(1)    | y = { 1, x }
z = y.prepend(2)    { z = { 2, y }
Now if you need the rest of your program to see the updated list you use some sort of reference cell. The simplest is just a one element queue. This is a classic reader-writer problem; except you don't have to worry about evicting readers before the writer does his thing.
  init(T data);   // Initialises the cell with data.  All subsequent access to data must originate with this cell.
  T read();       // Returns the current value of data (internally access is protected by the 'access lock').
  T writeLock();  // Obtains the 'write lock' (or blocks), and returns the current value of data.
  unLock(T data); // Updates the current value of data under the 'access lock', and releases the 'write lock'.
Note there is almost no contention for readers, no book-keeping, and once a reader has a reference to data, all further access to the structure is lock-free.

However take another look at it. Step back and think about what this is and you will realise that what we have written here is a transaction based datastore. What this means is that you can now go examine all your transaction management theory and introduce multiple-writers, lock-stealing, etc, et al. Now as long as T is persistent, you are atomic, consistent, and isolated. If you ensure unLock flushes the structure to disk you now have an ACID datastore (surprised? :). Add consistency checks to the mutation operations and you can call it a database :).

Granted, without a garbage collector, memory management is a bitch. The standard reference on persistent structures is Chris Okasaki. A quick google will get you a fair way; however citeseer is the motherlode. If you have access to a university library, I would lookup his book (Purely Functional Data Structures), which will save you alot of searching on citeseer :).

Another thing worth noting is that it is worth noting is that once you leave the ephemeral space, you need to start making decisions beyond the simple "Vector/List/Hash/Tree". There are alot more tradeoffs to consider. Do you really need the efficient random-access of a vector, or are you actually using it as a queue, or dqueue? A persistent queue can be implemented with O(1) amortized operations, but the best I have been able to find (quickly) for a true 'vector' is log(i) access (i = index) [citeseer]. Just as importantly, a persistant queue is very simple, almost identical in fact to an efficient ephemeral queue; whereas a persistant random-access-list much more complicated than a vector.