Introduction to lock-free/wait-free and the ABA problem

I recently read a paper on Hazard Pointers[2] and thought I would share what I learned and some general information on lock-free/wait-free programming. I hope it is useful to others getting started.

I'm not very clever, so keep in mind this is mostly regurgitating knowledge I've found in various places. If you find anything wrong or that could use more/better explanation, leave a comment and we can cooperatively improve this.

Traditional problems in Multi-Processor programming

As we reached the speed-limit of serialized CPU instructions we started to add more cores. Obviously, adding more cores means we have to change how we process information. Instead of optimizing for the fastest serial processing of data, we look for ways to split the work into multiple tasks. That way, we can put tasks on different cores and efficiently use the resources available.

So now we have lots of individual tasks. It would be most excellent if they could communicate. Oftentimes we end up using a Mutex or other synchronization primitive to lock access to some shared state. Unfortunately, this doesn't scale well as you add more concurrent workers: the added lock-contention reduces the overall benefit of splitting the problem into tasks.

What if we could communicate to manage shared state without introducing locks? That is precisely what some very smart folks have been designing for years. Now the rest of us are catching up.

What are lock-free algorithms

Lock-free algorithms provide a way to perform operations on shared state without the need to perform costly synchronization between threads. Think about the times you've used a mutex to protect access to shared state such as a queue. In highly-threaded scenarios you probably noticed CPU time wasted in lock-contention (often "time in kernel"). There are many types of synchronization, but many of these result in the same perception by users: inefficient use of their computing resources. That is not very green. The polar bears are crying.

Let's think for a moment about why so much CPU time is wasted under traditional locking. For the sake of simplicity, let's build a very simple spinlock.

typedef struct {
    gint s;
} spinlock_t;

void
spinlock_lock(spinlock_t *l)
{
    while (!g_atomic_int_compare_and_exchange(&l->s, FALSE, TRUE));
}

void
spinlock_unlock(spinlock_t *l)
{
    /* atomic set so the release is visible to other CPUs */
    g_atomic_int_set(&l->s, FALSE);
}

Experienced readers will forgive me the naivete. Essentially, we try to set the lock bit of the structure until we succeed, potentially blocking for long periods of time (in terms of CPU cycles). Not to mention that the CPU will spin at full speed and waste power in the meantime.

Why is this generally a bad approach? Well, to begin with, this is just a synchronization primitive to acquire our lock. Once acquired, the critical section is performed. Meanwhile, all those other threads are still spinning! If only they were doing something constructive! In addition, each Compare and Swap (CAS from here on) implies a Memory Barrier, and under contention the cache line containing the structure bounces between CPU caches rather than staying local. That alone can take hundreds of CPU cycles, and each thread is doing it repeatedly.

Now, that doesn't mean that CAS is a scary or bad thing! We just need to use it more appropriately.

For good measure I should mention that there are times when spin-locks and variants thereof are the right tool. For example, many locks nowadays are complex and involve multiple steps, such as spinning for a short period of time followed by a sleep back-off upon failure.

This non-deterministic behavior is precisely what lock-free algorithms are designed to avoid. Lock-free algorithms are carefully designed data-structures and functions that allow multiple threads to attempt to make progress independently of one another. This means that you do not try to acquire a lock before performing your critical region. Instead, you independently update a local copy of a portion of the data-structure and then apply it atomically to the shared structure with a CAS.

Wait-Free, a better Lock-Free

Those experienced with lock-free programming will likely know of a stronger guarantee called wait-free. Wait-free algorithms are a subset of lock-free ones: where lock-free only guarantees that some thread is making progress, wait-free guarantees that every thread completes its operation in a bounded number of steps. This is an important property because wait-free designs use concurrent access to a data-structure to cooperatively progress the data-structure as a whole.

Constructing Lock-Free/Wait-Free algorithms

Let's take a look at a lock-free/wait-free implementation of a Queue. This is a fairly straightforward implementation of the design by M.M. Michael and M.L. Scott[1].

typedef struct _Node Node;
typedef struct _Queue Queue;

struct _Node {
    void *data;
    Node *next;
};

struct _Queue {
    Node *head;
    Node *tail;
};

Queue*
queue_new(void)
{
    Queue *q = g_slice_new(Queue);
    q->head = q->tail = g_slice_new0(Node);
    return q;
}

void
queue_enqueue(Queue *q,
          gpointer data)
{
    Node *node, *tail, *next;

    node = g_slice_new(Node);
    node->data = data;
    node->next = NULL;
    while (TRUE) {
        tail = q->tail;
        next = tail->next;
        if (tail != q->tail)
            continue;
        if (next != NULL) {
            CAS(&q->tail, tail, next);
            continue;
        }
        if (CAS(&tail->next, NULL, node))
            break;
    }
    CAS(&q->tail, tail, node);
}

gpointer
queue_dequeue(Queue *q)
{
    Node *head, *tail, *next;
    gpointer data;

    while (TRUE) {
        head = q->head;
        tail = q->tail;
        next = head->next;
        if (head != q->head)
            continue;
        if (next == NULL)
            return NULL; // Empty
        if (head == tail) {
            CAS(&q->tail, tail, next);
            continue;
        }
        data = next->data;
        if (CAS(&q->head, head, next))
            break;
    }
    g_slice_free(Node, head); // This isn't safe, we'll discuss why.
    return data;
}

The FIFO queue is constructed using a singly-linked list. We keep track of the head and tail of the queue. Because the structure uses a linked-list, we can CAS the node's next pointer to atomically perform updates.

I won't go into too much detail on the implementation of the queue. The gist is this. To enqueue a new node, we work locally to prepare the new node and try to apply it atomically to the tail. If we detect an incomplete operation we help clean up after it. To dequeue, we try to take the first item off the queue atomically. Again, if we observe an inconsistent state, we help complete the pending operation.

The ABA Problem

There is an interesting problem that arises in this algorithm. What happens if a node is removed, de-allocated (through free() or similar), re-allocated (through malloc() or similar) and added back; all within the time in which another thread is paused? If that other thread is about to perform a CAS using that pointer it could still succeed even though the state has changed! This is what is known as the ABA problem in lock-free programming.

In garbage-collected languages this isn't a problem. Why? Because the node's memory cannot be reclaimed for a new object while any thread still holds a pointer to it.

In C, however, we don't typically have the luxury of a garbage collector. Historically, there have been a few approaches to work around this. The original approach, by IBM, was to use a tag next to the pointer to be swapped. They would increment that tag counter and use a double-word CAS. While many 32-bit machines provide a 64-bit CAS, most 64-bit machines do not provide a 128-bit CAS. This prevents it from being a truly useful solution for general use.

My concurrency library, libiris[4], currently does something similar to this. By aligning the pointers properly you can use portions of the pointer itself for tagging. This isn't scalable as you add workers, but it was a start.

But, ever the pragmatist that I am, I thought it was time to research the proper way to handle the situation. Turns out there is a fantastic paper written in 2004 on a methodology called Hazard Pointers[2]. Hazard pointers are a way of notifying cooperative threads that you are removing some potentially unsafe structure. Later, a reclamation step is performed which can free the memory once it is safe to do so.

To make these lock-free/wait-free algorithms ABA safe we can notify cooperative threads using these hazard pointers. I've implemented the hazard pointer methodology from the paper here. We can include that and alter the algorithm slightly.

void
queue_enqueue(Queue *q,
          gpointer data)
{
    Node *node, *tail, *next;

    node = g_slice_new(Node);
    node->data = data;
    node->next = NULL;
    while (TRUE) {
        tail = q->tail;
        LF_HAZARD_SET(0, tail);       // Mark tail as hazardous
        if (tail != q->tail)          // Check tail hasn't changed
            continue;
        next = tail->next;
        if (tail != q->tail)
            continue;
        if (next != NULL) {
            CAS(&q->tail, tail, next);
            continue;
        }
        if (CAS(&tail->next, NULL, node))
            break;
    }
    CAS(&q->tail, tail, node);
}

gpointer
queue_dequeue(Queue *q)
{
    Node *tail, *next, *head;
    gpointer data;

    while (TRUE) {
        head = q->head;
        LF_HAZARD_SET(0, head);    // Mark head as hazardous
        if (head != q->head)       // Check head hasn't changed
            continue;
        tail = q->tail;
        next = head->next;
        LF_HAZARD_SET(1, next);    // Mark next as hazardous
        if (head != q->head)
            continue;
        if (next == NULL)
            return NULL; // Empty
        if (head == tail) {
            CAS(&q->tail, tail, next);
            continue;
        }
        data = next->data;
        if (CAS(&q->head, head, next))
            break;
    }
    LF_HAZARD_UNSET(head);        // Retire head, and perform
                                  // reclamation if needed.
    return data;
}

The paper discusses a few other algorithms which I'd like to add to the repository, including a Set and a LIFO Stack. After that, I hope to do some merging and refactoring of this into libiris[5]. I want to make it easy for gtk+ applications to be fully asynchronous from within the main-loop. Doing synchronous IO or other blocking operations from the main-loop is unacceptable as far as I'm concerned.

--

-- Christian Hergert 2009-12-25
