[RFC] Garbage Collector Elimination

This commit is contained in:
trinity-1686a
2026-01-24 14:21:54 +01:00
parent 0412013229
commit 3b81fcf880


# [RFC] Garbage Collector Elimination
## Statement of problem and prior art
Currently, Garage's garbage collector has a few identified issues.
Namely, it can only run when all nodes in a partition are currently online,
it may not be correct in the face of a rebalance (this is partially mitigated by a 24h delay added to tombstone deletion),
and it isn't resilient to a subset of nodes being restored from snapshots.
It's not clear if it is possible to implement a garbage collection process that can eliminate tombstones, but also support
a node rollback to a point in time where a key existed.
This problem, perhaps unsurprisingly, maps very well onto the general abstraction of CRDTs for sets, where a whole partition
would be a single CRDT. The semantics required in Garage demand reinsertion of deleted keys, which excludes the simplest
forms of sets such as G-Sets and 2P-Sets. As the goal is to not handle garbage collection of tombstones, a standard ORSet
is also unfitting. In fact, it could be argued that Garage already uses something akin to an ORSet with a garbage collector today.
There exist CRDTs supporting this feature set, one of which is the OptORSet[^1].
It only needs set-wide metadata proportional to the number of writers, and per-key metadata, for alive keys only, proportional to the number
of writers to that key (no metadata is needed for dead keys).
This metadata is akin to DVVs[^2]. An element is considered new if its DVV comes causally after the current DVV of the set.
The correctness of this algorithm, however, depends on causal delivery, which isn't guaranteed in Garage, neither in general nor in the
presence of snapshot restoration.
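To make the "causally after" test concrete, here is a minimal sketch of the OptORSet novelty check under causal delivery. All names are hypothetical, not Garage code; the set's state is reduced to a plain version vector (replica → highest version seen), which is exactly what stops working once delivery is no longer causal.

```rust
use std::collections::HashMap;

// Hypothetical sketch: the set keeps a version vector mapping each
// replica id to the highest version observed from it, and an incoming
// element carries a dot (replica, version). The element is new iff its
// dot is not already covered by the set's vector.
fn is_new(set_vv: &HashMap<u64, u64>, dot: (u64, u64)) -> bool {
    set_vv.get(&dot.0).copied().unwrap_or(0) < dot.1
}
```

A single counter per replica is only sound if that replica's versions arrive in order; out-of-order or replayed deliveries make this check accept stale writes, which is what motivates the Seen Vectors of the proposal.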
## Proposal
We devise a new kind of CRDT based on the core ideas of the OptORSet, but replacing each version inside its DVVs with a list of
ranges of observed updates, which we name a Seen Vector (SV). Such metadata can in the worst case grow linearly with the number of
insertions. In practice, assuming all elements are eventually known to every replica, the storage requirement of an SV is equivalent
to that of a standard DVV. Under causal delivery, an SV degenerates into a standard DVV.
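As an illustration, the per-replica part of a Seen Vector could be kept as a sorted list of inclusive ranges of version numbers, merged whenever they become contiguous. This is a hypothetical sketch, not Garage's data structure:

```rust
// Hypothetical per-replica Seen Vector: a sorted list of inclusive
// ranges [lo, hi] of version numbers observed from one replica.
#[derive(Debug, Default, Clone, PartialEq)]
struct SeenVector {
    ranges: Vec<(u64, u64)>, // sorted, non-overlapping, non-adjacent
}

impl SeenVector {
    /// Has this version already been observed?
    fn contains(&self, v: u64) -> bool {
        self.ranges.iter().any(|&(lo, hi)| lo <= v && v <= hi)
    }

    /// Record one observed version, merging contiguous ranges, so that
    /// under causal delivery the vector collapses into a single range
    /// [1, n], i.e. the storage cost of a plain DVV entry.
    fn insert(&mut self, v: u64) {
        if self.contains(v) {
            return;
        }
        self.ranges.push((v, v));
        self.ranges.sort_unstable();
        let mut merged: Vec<(u64, u64)> = Vec::new();
        for (lo, hi) in self.ranges.drain(..) {
            match merged.last_mut() {
                // extend the previous range when contiguous or overlapping
                Some(last) if lo <= last.1 + 1 => last.1 = last.1.max(hi),
                _ => merged.push((lo, hi)),
            }
        }
        self.ranges = merged;
    }
}
```

Holes in a replica's sequence show up directly as gaps between ranges, which is what the hole-filling process below the storage evaluation targets.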
To add an element to the set, a node increments its own version counter and sends an update containing (element, replica, version).
On receiving such an update, a node checks whether it has already seen this particular (replica, version) pair, and if so, ignores it.
If it hasn't seen that update, it saves the new element and updates its SV to include the new (replica, version).
TODO: describe the algorithm formally
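Pending the formal description, the two steps above can be sketched as follows. All names are hypothetical; for brevity the set of observed dots is a plain `HashSet` rather than range-compressed Seen Vectors, and deletion is left out:

```rust
use std::collections::{HashMap, HashSet};

type ReplicaId = u64;
type Version = u64;
type Dot = (ReplicaId, Version);

#[derive(Clone)]
struct Update {
    element: String,
    dot: Dot,
}

// Hypothetical sketch of the add / deliver steps of the proposed set.
#[derive(Default)]
struct RangedOrSet {
    me: ReplicaId,
    counter: Version,
    seen: HashSet<Dot>,             // every (replica, version) ever observed
    elements: HashMap<String, Dot>, // alive elements and the dot that wrote them
}

impl RangedOrSet {
    // To add an element: increment our own version counter, apply the
    // update locally, and return it for broadcasting to other replicas.
    fn add(&mut self, element: &str) -> Update {
        self.counter += 1;
        let up = Update {
            element: element.to_string(),
            dot: (self.me, self.counter),
        };
        self.deliver(up.clone());
        up
    }

    // On receiving an update: ignore it if the (replica, version) pair
    // was already seen, otherwise store the element and record the dot.
    fn deliver(&mut self, up: Update) {
        if self.seen.contains(&up.dot) {
            return; // duplicate delivery: safe to drop, no tombstone needed
        }
        self.seen.insert(up.dot);
        self.elements.insert(up.element, up.dot);
    }
}
```

Note that the idempotence check on the dot is what replaces tombstones: a deleted element's dot would stay in `seen`, so a late redelivery of the original update is silently dropped.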
## Storage evaluation
As stated, if all updates are received, the SV is similar in size to a standard DVV. It may however happen that a node creates and
immediately deletes an element, creating holes in its sequence from the point of view of other replicas. These holes could be filled
through an interactive process where the replica observing holes asks the node to scan over the whole set and, for each version in these
holes, reply whether no element carries that exact version number. Holes caused by existing elements should eventually be fixed by an
anti-entropy process, so replying with these elements appears unnecessary.
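The hole-filling exchange could look like the following sketch (hypothetical names; real messages would be batched and scoped per partition). The asker computes the gaps in what it has seen from a replica; the queried node scans the versions it assigned to alive elements and confirms which holes carry no element, so the asker can mark those versions as seen without waiting for anti-entropy:

```rust
use std::collections::BTreeSet;

// Versions observed from some remote replica are in `seen`; any gap
// below `max` (the highest version we know that replica issued) is a hole.
fn find_holes(seen: &BTreeSet<u64>, max: u64) -> Vec<u64> {
    (1..=max).filter(|v| !seen.contains(v)).collect()
}

// The queried node answers with the holes for which it holds no alive
// element: those versions belonged to already-deleted elements and can
// safely be marked as seen by the asker.
fn confirm_dead(holes: &[u64], alive_versions: &BTreeSet<u64>) -> Vec<u64> {
    holes
        .iter()
        .copied()
        .filter(|v| !alive_versions.contains(v))
        .collect()
}
```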
The per-element storage requirement is proportional to the number of replicas having modified that element, even outside steady state.
This holds because, as whole elements are always exchanged, causal delivery holds for individual keys.
## Replica version rollback
This scheme assumes the same node won't issue the same version twice, which isn't a given when a node might be rolled back to a previous state.
The author proposes that on initialization, a replica asks all other replicas for the highest version number they know for it.
If all replicas reply with a number less than or equal to the current version, it is safe to reuse the currently known number.
If some replicas reply with a higher number, the node increases its version to that number.
If at least one replica doesn't reply, the node can't make any assumption about its actual version number.
It then increments a generation number, which is made part of its replica id, and starts a new sequence from zero.
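The decision rule above can be sketched as a small pure function (hypothetical name; `None` in the replies models a replica that didn't answer):

```rust
/// Hypothetical sketch of the startup negotiation: given our locally
/// persisted version and each peer's reply (the highest version it
/// knows for us, or `None` if it didn't answer), return the version to
/// resume from, or `None` if no safe assumption can be made and the
/// caller must bump its generation and restart its sequence from zero.
fn recover_version(current: u64, replies: &[Option<u64>]) -> Option<u64> {
    if replies.iter().any(|r| r.is_none()) {
        return None; // a silent replica may know a higher version
    }
    let max_known = replies.iter().flatten().copied().max().unwrap_or(0);
    Some(current.max(max_known))
}
```

Taking the maximum covers both cases at once: peers that lag behind are ignored, and a peer that remembers a higher version than our rolled-back state forces the jump forward.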
## Appendix: providing SV to the underlying elements
Some more complex elements may want access to a version id and the Seen Vector to perform their own internal merge operations.
The author reckons this may help implement S3 versioning, by giving Objects a simple way to know whether an ObjectVersion was yet
unknown, or is known and already deleted.
## References
[^1]: An optimized conflict-free replicated set, https://doi.org/10.48550/arXiv.1210.3368
[^2]: Dotted Version Vectors: Logical Clocks for Optimistic Replication, https://doi.org/10.48550/arXiv.1011.5808