# I. Introduction

The problem of memory reclamation in data structures is one of releasing ownership of nodes that will no longer be accessed. This may be automated with garbage collection, but it is not always possible or efficient to do so. Hence manual memory reclamation schemes are sometimes more desirable. A popular example is reference counting, where each node $N$ stores a count $N.rc$ of the number of incoming pointers; when $N.rc=0$, node $N$ is unreachable and may be reclaimed.

Suppose that multiple processes operate on a data structure, executing at various speeds. Reference counting is nontrivial in this model: A process $P$ at node $N$ intending to increment the reference count of node $M$, pointed to by $N.next$, must read the pointer value before incrementing the count. During the interval between these two instructions, another process $Q$ may modify $N.next$ and decrement $M$’s reference count. If $Q$ resets $M.rc$ to 0, then it may reclaim $M$ before $P$ can increment the count.

There are many memory reclamation schemes, each with advantages and disadvantages. An excellent survey can be found in [1]. A distributed variant of reference counting was introduced by Lee [2] for a particular data structure, and the goal of this research was to generalize it to arbitrary shared data structures. The result is efficient and applicable in the same contexts as ordinary reference counting.

function fetch-and-add $(\mathcal{O}, i)$

$x \gets \mathcal{O}$

$\mathcal{O} \gets x + i$

return $x$

function fetch-and-store $(\mathcal{O}, y)$

$x \gets \mathcal{O}$

$\mathcal{O} \gets y$

return $x$

function compare-and-swap $(\mathcal{O}, old, new)$

if $\mathcal{O} = old$ then

$\mathcal{O} \gets new$

return $T$

return $F$

# II. The asynchronous shared memory model

We assume a model in which $n$ processes operate on a data structure, communicating only using shared memory. Processes may be asynchronous, meaning they can be delayed arbitrarily after any instruction and are prone to crash failure, possibly leaving their operation uncompleted.

Data structures in this model use atomic primitives: instructions that consist of multiple steps, but that the hardware performs all at once. For example, fetch-and-add (see Figure 1) adds to the value in a word and returns its previous value. Suppose that two processes try to concurrently increment a counter $C$ by reading its value and then writing to it. It is possible that both processes read 0, then both write 1. Using fetch-and-add($C,1$), it is guaranteed that each process increments the value, one writing 1 and the other writing 2.

We will also consider fetch-and-store, which writes a value to a word and returns its old value, and compare-and-swap (CAS), which writes $new$ to a word if its current value is $old$ (see Figure 1); we shall always assume that $new \neq old$.

We also make the distinction between pointers stored in shared memory, which we call references, and local pointers, which are stored in processes’ local memory. A node is safe for process $P$ if the reclamation scheme guarantees that the node will not be reclaimed before $P$ takes its next step. A local pointer is safe for $P$ if the node it points to is safe for $P$.

# III. Distributed reference counting

Detlefs et al. [3] gave a simple reference counting scheme for our model using the double compare-and-swap (DCAS) primitive, which atomically invokes CAS on two separate memory locations. This is used to increment the reference count of a node, provided that a particular reference continues pointing to it during the operation. However, DCAS is not commonly provided in hardware. Herlihy et al. [4] modified this scheme so it does not use DCAS, but their solution also requires another memory reclamation scheme.

Lee [2] provides an efficient alternative. Given a node $N$, its original reference count, $N.orc$, maintains the number of incoming references. Let $R_N$ be the set of all references pointing to $N$. Every $r \in R_N$ is augmented with a proactive reference count, $r.prc$, which counts safe local pointers. The location pointed to by the reference is stored in $r.ptr$. To make $N$ safe, a process atomically increments $r.prc$ and reads $r.ptr$. This is implemented by storing $r.prc$ and $r.ptr$ in the same word and invoking fetch-and-add(r, 1), which returns $(r.ptr, r.prc)$. Lastly, the node maintains a distributed reference count, $N.drc$, such that $N.drc + \sum_{r \in R_N} r.prc = \text{# of safe local pointers to } N.$ When local pointers and references to $N$ are changed, this invariant must be maintained. When a process releases a safe local pointer to $N$, it decrements $N.drc$. When it deletes a reference to $r$, it atomically adds $r.prc$ to $N.drc$ and decrements $N.orc$ (to satisfy the definition of $N.orc$). To do this atomically, $N.drc$ and $N.orc$ are stored in the same word, and modified using fetch-and-add. It follows that when $N.orc = N.drc = 0$, node $N$ will never again be accessed and may be reclaimed.

Lee [2] uses this approach for an in-tree that updates references only using fetch-and-store. In the full version of this paper, we prove that the scheme generalizes to arbitrary data structures using fetch-and-store or write operations.

# IV. A further generalization

Lee’s method does not work for data structures that apply CAS to references. Suppose a process intends to change a reference $r$ from $oldptr$ to $newptr$. After reading $(oldptr,oldprc)$ from $r$, it would then execute CAS$(r,(oldptr,oldprc),(newptr,0))$. However, this would fail if another process incremented $r.prc$ between the read and CAS steps. This violates the specification that original CAS fails only if the reference changes value. One countermeasure is to repeatedly read $r$ and then execute CAS until either $r.prc$ remains fixed or $r.ptr$ is changed. However, it is not guaranteed that the operation will ever complete.

To ensure that the operation always completes, we equip each reference with an additional $pid$ variable stored in a separate word. Let processes $P_i,P_j$ have identifiers $i,j$ respectively. Suppose $P_i$ intends to CAS $r.ptr$ from $old$ to $new$. If it reads that $r.ptr \neq old$, it may return F immediately. Otherwise it invokes CAS$(r.pid,NULL,i)$. If this succeeds, then every other process that wants to access $r$ must help $P_i$ complete its CAS before continuing with their own operations.

Suppose instead that CAS$(r.pid,NULL,i)$ fails with $r.pid=j$. Then after helping $P_j$, process $P_i$ may conclude that $r.ptr \neq old$, and therefore that its operation will inevitably fail. Hence it may return F without trying to set $r.pid$ again. Consequently, $P_i$’s operation always terminates.

#### Acknowledgements

I owe great thanks to my supervisor Professor Faith Ellen, and to Kenneth D. Hoover, Yuanhao Wei, and Alexey Shablygin for fruitful discussions. This work was supported by NSERC.

#### References

[1] T. Brown, "Reclaiming memory for lock-free data structures: There has to be a better way," Proceedings of the 2015 ACM Symposium on Principles of Distributed Computing (PODC), pp. 261–270, 2015.

[2] H. Lee, "Fast local-spin abortable mutual exclusion with bounded space," Proceedings of the 14th International Conference on Principles of Distributed Systems (OPODIS), pp. 364–379, 2010.

[3] D. L. Detlefs, P. A. Martin, M. Moir, and G. L. Steele Jr, "Lock-free reference counting," Distributed Computing, vol. 15, no. 4, pp. 255–271, 2002.

[4] M. Herlihy, V. Luchangco, P. Martin, and M. Moir, "Nonblocking memory management support for dynamic-sized data structures," ACM Transactions on Computer Systems (TOCS), vol. 23, no. 2, pp. 146–196, 2005.