1 of 17

An Efficient Approach to Move Elements in a Geo-Replicated Tree

Parwat Singh Anjana^*, Adithya Rajesh Chandrassery⁺, Sathya Peri^*

^*Indian Institute of Technology, Hyderabad, India

⁺National Institute of Technology Karnataka, Surathkal, India

IEEE CLOUD�2022

2 of 17

Introduction

Replicated tree data structures are extensively used in collaborative applications and distributed file systems, where clients often perform move operations.
However when multiple clients perform move operations concurrently that may not be safe.
Our work proposes an algorithm to perform move operations in an efficient manner while ensuring eventual consistency.
The proposed technique is highly available and can work under network partitions.

3 of 17

How Concurrent Updates Cause Cycles

4 of 17

Problem Formulation

A concurrent move operation for the replicated tree that does not require

Continuous synchronization or centralized coordination is problematic because

Two operations that are individually safe at their local replicas, when combined, might produce a cycle.

Popular file systems suffer from these problems:

Dropbox duplicates the data.
In Google Drive one of the replica falls out of sync due to cycle error.

We formulate a system that is highly available, convergent, works without coordination and can function even when offline.

5 of 17

Previous Work

Many of the previous approaches have only supported two operations on trees:

Existing algorithms are based on Operational Transformation (OT) requires a central server, and hence is

Is not highly available and
Can not work when offline.

Some approaches uses locks for synchronization between replicas.
Most recent work proposed by Kleppmann et al. [1] provides a solution for move operation on Tree CRDT but it is computationally intensive.

[1] Martin Kleppmann, Dominic P. Mulligan, Victor B. F. Gomes, and Alastair Beresford. A highly-available move operation for replicated trees. IEEE Transactions on Parallel and Distributed Systems, pages 1–1, 2021.

Delete

Insert

6 of 17

System Model

Assume there are n replicas that communicate asynchronously via messages in a peer to peer fashion.

Replicas can crash or go offline or fail unexpectedly.

Every operation performed on a replica is propagated to other replicas asynchronously.

It is possible that messages be delayed or be delivered out of order.

Consider a tree structure that is stored locally in all of these replicas.

There is no distributed shared memory.

Our model supports three operations:

Move

Delete

Insert

7 of 17

An operation consist of three parameters:

n - node to be moved
p - going to be parent of node n
ts - which is the timestamp of the operation

In every operation we are moving the node along with it’s subtree to the new parent.

We have special nodes that always remain unchanged and have no parents:

Each node also maintains a timestamp of the last operation applied on it.

In case the operation timestamp is lesser than the timestamp of the last operation applied on that node, then the operation is ignored.

We use lamport clock for timestamps and replica-id is used as tiebreaker in case of conflict.

Conflict

Trash

Root

8 of 17

Algorithm to Detect Cycle

When an operation (ts, n, p) is received at a replica,

Need to check if applying this operation will lead to a cycle or not.

It only leads to a cycle if p (i.e., parent) is in the subtree of n.

It means n to p is an ancestor to descendant relationship.

We traverse from p all the way up to the ROOT node and if n is not found then ⇒ n is not an ancestor of p and the operation is safe to apply.

9 of 17

Algorithm to Resolve Conflict

If a cycle is detected,

traverse from p to n in order to find a node belonging to the cycle which has the latest moved timestamp.

Such a node will be move back to its previous position in order to break the cycle.

For each node we store a fixed number of its previous (let say k) parents in a list.

To move a node back

It is necessary to find from the list a previous parent that is safe to move the node.
This implies that it will not result in a cycle, so the previous parent should not be in the subtree of the original node to be moved.

If such a previous parent does not exist then

The node will be moved to be a child of CONFLICT node.

Apply the operation to move it back locally and propagate this operation to other replicas

After that the received operation can be applied..

11 of 17

Proof for Convergence

Two replicas that have seen the same set of operations would have be in the same state. Why?

All nodes will have the same timestamp for the last applied operation.
In case it differs between replicas then it implies they have not seen the same set of operations since otherwise the latest timestamp operation would always have been applied.
So if all the nodes have the same parent it is attached to then all the tree edges are same in each replica.
Which implies the tree structure is same.

12 of 17

Experiments

The first experiment consisted of generating 5000 operations by each replica at varied rates per second and then measuring the average time taken to apply it.

The size of the tree was kept fixed in this experiment.

The operations are split into two types:

Local Operation

The operations locally generated by the replicas.

Remote Operation

The operations received by the replica that were generated by other replicas.

In the second experiment we varied the size of the tree.

The operations generation rate was kept fixed to 500/sec.

Using Microsoft Azure E2s_v3 VM instances, the algorithm was executed on three replicas in different geographical locations.

13 of 17

Experiment 1:�Varying Rate of Operations

Experiment 2:�Varying Tree Size

The proposed approach achieves an effective speedup between 14.6× to 68.19× over the model we compared against.

As rates increase there are a lot more operations in flight leading to more interleaving.

Varying tree size does not seem to cause much difference. �

14 of 17

Number of Conflicts v/s Tree Size

When tree size increases, number of conflicts will decrease since there are less operations per node in the tree.
The operations in our experiment were generated randomly, so in most real world cases the number of conflicts would be much lesser than this.
Here it varies from about 1-10 % of total operations.

15 of 17

Analysis

As we noticed conflicting operations are an extremely small percentage of the total operations.
Hence handling them separately would be much better than requiring every operation to follow a certain procedure.
This was the main reason that our model did better in terms of time of execution.
The model we compared against required a lot of compensation operations to be applied for each operation as their system was based on each replica maintaining the same total order of operations.

16 of 17

Conclusion and Future work

We have proposed a novel approach for performing an efficient move operation in a geo-replicated tree.

Our model does not rely on any coordination mechanisms for synchronization.
It is highly available and also works when offline.
There is no metadata transferred in communication messages.
We have followed a last-write-wins policy for conflict resolution.
It also achieves a significant speedup over the state of art approach.�

Identifying the optimal number of previous parents to store is left as future work.
Trying to apply move operations in other CRDTs could also be an interesting direction.