A CRDT for contact synchronization

Background

Some things are too important in technology to not standardize. Contacts and calendars, for example. They can be represented as vCard and iCal respectively. These work fine for simple import/export operations, but they provide little aid in supporting users across devices.

CardDAV and CalDAV are the synchronization protocols built on top of vCard and iCal. Thanks to them, you can use whatever apps you want to on your phone or desktop, and it all synchronizes through the cloud.

As long as it goes through the central server.

That central server thing has some problems:

You have to trust someone to host your data
Devices can only communicate through the central server
Your protocols can be less robust since the server can fix issues for you

Email is similar. IMAP synchronization lets everyone use whatever client they want to while the central server keeps your data.

Except with email, and instant messaging in general, there is the desire to replace it with a fully distributed system. Matrix is the main effort to do so. The vision is every application or (potentially static) website-based client acting as a peer. Servers may exist, too, but serve mainly as peers with higher-than-usual availability. It’s a pure distributed/p2p/web3 solution.

It’s not the first to do so. Source code management did this a long time ago. Flat files are your data format. CVS/Subversion was the synchronization solution. Then git/mercurial made decentralization popular.

Note that removing the central server isn’t the only benefit. You get operational benefits in cloud hosting since you can run multiple less-reliable peers. You can benefit from p2p and mesh networks, so nearby devices could sync even when neither can connect to the internet. Users also get the benefit of a more generalized merging algorithm; it’s naturally more robust.

Contacts and calendars may eventually go this way too; even though the need is less pressing.

A distributed vCard list

Unlike source code management, where conflicts between versions are all but guaranteed, Matrix uses Conflict-free replicated data types.

Each person has their own copy of a data structure, and there is a method for merging any two data structures together whereby any order of merges results in the same output. This allows for each peer in a system to have a copy and use gossip to propagate changes. And the gossip that it’s sending is just its current state. It doesn’t need a push mechanism.

Consider two servers hosting a file over HTTP. By periodically reading each other’s values (à la RSS) and merging their data, they can propagate their changes across arbitrary topologies.

A single vCard entry

There are a lot of things you can put in a vCard file. Luckily, for this purpose, only cardinality matters. It has four different classes of labels:

Exactly 1
0 or 1
1 or more
0 or more

A different CRDT can be chosen for each of these based on when you find losing data acceptable. This is the less interesting part of the problem; as long as some logical CRDT is used, it doesn’t matter.

I’m going to use a Last-writer-wins register for exactly 1, and the nullable version of this for the 0 or 1 case. For the remaining two, an observed removed set (with adds taking precedence over removes) will suffice.

A very simple example:

struct ContactEntry {
    name: Lww<String>,
    tel: ORSet<String>,
    // insert the remaining thousand vCard keys here
}

A list of vCards

Synchronizing contacts isn’t as simple as synchronizing a single contact. Contacts can be added and deleted. They can also be merged. Mergers are just an update of one and the deletion of another, but since we want changes that haven’t propagated to the current peer to be reflected, a merger needs to be a first-class operation.

Mergers are hierarchical. They form a tree. You can store a tree as a CRDT. But that’s not necessary here.

Rather, we can store one logical “contact” as a set of contact entries. To differentiate between contact entries, each is assigned a random UUID. Each contact starts with one entry. When it’s merged with another, they form a new set with the union of their entries. When merging two of these sets, any entry in two separate sets will replace both with their union.

This is a Disjoint-set data structure (i.e., a set of disjoint sets). It’s a CRDT, or can be implemented as one. I can’t find anyone else who has shown this, so that might be new.

A list can be represented as such:

struct ContactList {
    entries: Map<Uuid, ContactEntry>,
    contacts: Set<Set<Uuid>>, // uuids point to entries
}

This takes advantage of the mergeable property of CRDTs: from the user perspective, a contact is the merger of each of its entries. It does mergers for the logical representation, not just the physical representation (as we explicitly don’t want to merge different entries). I can’t find this used elsewhere in the ecosystem either.

We can put a tombstone boolean on each of these entry sets to implement deletion. However, if one person deletes a contact while on another device merging it with another contact, it will delete the other once propagated. Rather, we can delete at the entry level:

struct ContactEntry {
    name: Lww<String>,
    tel: ORSet<String>,
    tombstone: TrueWinsBool,
}

A contact is only considered deleted once all of its entries have a tombstone bit set. When a client wants to delete a contact, it sets this bit on each of the entries. The contact will only ever be undeleted if another peer had merged this contact with another non-deleted contact without observing the deletion to protect the new entry from instant deletion.

New things you can now do

Collaborative editor CRDTs are the most developed. They allow for distributed text editing of a single field. One can imagine embedding one of these for each note associated with a contact, or similar use cases.

Also, one useful feature many devices have is showing the most recent or most frequently contacted people first in lists. This could be embedded in their contact to work across platforms. The simple case of the most recently contacted could be implemented with a timestamp. A total interaction count could be done with a counter. To be even more accurate, count-min sketches and HyperLogLog structures can be used to probabilistically keep track of frequencies without embedding too much sensitive data within the contacts themselves.

Many clients show the user when they last synchronized with the cloud. The “no single source of truth” model may seem opposed to this; however, it provides a superset of functionality. You can embed propagation observability of a CRDT into itself; you can pick a node to be the “main” one if you want. Or, you can be more creative and use any arbitrary standard for consensus. For example, you could host cloud peers on three different providers and only consider a version “up to date” when two of the three have observed it.

While the improvements of switching to a distributed architecture are likely marginal, they would be a win for decentralization and user experience. Similar solutions are desirable for any product that has become a near-commodity. Currently, that would be contacts, calendars, and photo sync. In the future, playlists, credential/password stores, browser settings/history/bookmarks, and even user identity could benefit from elegant applications of distributed systems concepts.