Collaborative Editor System Design: The Algorithm That Syncs Two Cursors

May 27, 202615 min read
interview-prepcareersystem-designalgorithms
Collaborative Editor System Design: The Algorithm That Syncs Two Cursors
TL;DR
  • Operational transformation (OT) is the algorithm Google Docs uses: each operation carries a base_revision, the server transforms concurrent ops before applying them, and all clients converge to the same state.
  • OT requires a central server to serialize operations, which sidesteps the hard TP2 convergence property; CRDTs avoid the server but grow metadata up to 10x per character.
  • The data model is event sourcing: an append-only operations table plus periodic snapshots to S3; version history and restore fall out for free with no extra work.
  • Route all editors of a document to the same Collaboration Service instance by hashing doc_id; consistent hashing limits rebalancing to K/(N+1) documents when a node is added.
  • Presence is deliberately ephemeral: cursor positions live in Redis with a TTL, fully separated from the durable OT engine so latency stays low and cleanup is automatic.
  • The OT transform step is the interview pivot: candidates who can trace the transform example precisely separate from those who only draw architecture boxes.

Two people open the same document. Alice inserts "Hello" at position 0. Bob, at the same instant, deletes the character at position 0. Neither knows what the other is doing. Applying Alice first gives "Hello". Applying Bob first gives nothing. Both users need to see the same final document.

That is the whole problem. And it is harder than it looks.

This walkthrough covers requirements, the synchronization algorithm, architecture, data model, scaling, and the tradeoffs your interviewer will push on.

Start With Requirements, Not Architecture

Interviewers reward candidates who scope before they draw boxes. Nail this in the first five minutes.

Functional requirements:

  • Multiple users edit the same document simultaneously
  • Edits appear on all connected clients in near-real-time (under 100ms within a region)
  • Cursor positions and presence ("Alice is editing") are visible to collaborators
  • Documents persist; sessions survive page refreshes
  • Version history allows viewing and restoring past states

Non-functional requirements:

  • 1M concurrent documents, up to 50 simultaneous editors per doc
  • Sub-100ms edit propagation within a region
  • Eventual consistency is acceptable for presence; document content requires strong consistency
  • 99.9% availability for the collaboration path

One scope decision matters most: are you handling plain text or rich text? Rich text adds document-tree operations (bold, list nesting, embedded images) that multiply the algorithm's complexity. Start with plain text and name the rich-text extension. Your interviewer will be impressed you thought of it. If they ask you to go deeper, look them in the eyes and explain why that path goes dark very quickly.

Why Naive Approaches Fail

Two orderly rows of metal grate converging into an absolute tangle in the middle Two concurrent edits, no conflict resolution. Left side: Alice. Right side: Bob. Middle: your document.

You cannot just send snapshots and apply the last one. Alice and Bob each type independently. Last write wins means one person always loses work. You've turned a collaborative document into a very disappointing game of last-key-wins.

Locks do not work either. Lock the document when Alice starts typing and nobody else can edit. Technically correct, completely unusable. This is the CS term for "your users will hate you." Google Docs has millions of concurrent sessions. Nobody is waiting for Alice to finish her sentence.

The real problem is that concurrent operations are not commutative. Insert-then-delete and delete-then-insert produce different documents. Any correct solution must address commutativity explicitly.

Operational Transformation: How the Algorithm Works

Operational Transformation (OT) was first described by Ellis and Gibbs in their 1989 SIGMOD paper and is what Google Docs uses today. Instead of sending snapshots, clients send operations: insert("X", position=3) or delete(position=5, length=1). The server maintains a linear sequence of accepted operations, each with a revision number.

Here is the classic conflict scenario:

Initial document: "abc"    (revision 5)

Alice: insert "X" at position 1  →  "aXbc"
Bob:   delete at position 2      →  "ac"

Both see revision 5. Both send their operations. The server receives Alice's first, applies it (revision 6: "aXbc"), and then Bob's operation arrives. Bob's delete(position=2) was written against revision 5, where position 2 was "c". After Alice's insert, position 2 is now "b". Applying Bob's operation verbatim deletes the wrong character.

The server transforms Bob's operation against Alice's before applying it. Alice inserted one character at position 1, which is before Bob's delete at position 2. So Bob's delete shifts right by one: delete(position=3). Both operations converge correctly to "aXc".

def transform_delete_against_insert(delete_op, insert_op): # delete_op = {"pos": 2, "len": 1} # insert_op = {"pos": 1, "text": "X"} ins_pos = insert_op["pos"] del_pos = delete_op["pos"] ins_len = len(insert_op["text"]) if ins_pos <= del_pos: # Insert is before delete; shift delete right return {**delete_op, "pos": del_pos + ins_len} else: # Insert is after delete; delete position unchanged return delete_op

The full set of transform functions covers insert-vs-insert, insert-vs-delete, delete-vs-insert, and delete-vs-delete. For rich text, you also need format-vs-insert, format-vs-delete, and format-vs-format, and the document is a tree rather than a string, so positions become tree paths. Google's full implementation for rich text is not public. Given what you've just read about the plain-text version, that is probably for the best.

OT revision timeline showing Alice and Bob both starting at rev 5, server transforming Bob's operation, and both clients converging to the same rev 7 state The server serializes operations. Bob's delete was written against rev 5 where "c" was at position 2. After Alice's insert, it moves to position 3. Both clients land on "aXc" at rev 7.

OT must satisfy two convergence properties. TP1 says that if two concurrent operations are applied in different orders, the results converge. TP2 extends this to three or more concurrent operations. TP2 is notoriously hard to satisfy for arbitrary operation types, which is why most production systems use a central server to serialize all operations and simply sidestep TP2 entirely. The server is the source of truth for ordering. Clients transform their locally-applied operations against the server's accepted operations when an acknowledgement arrives. Distributed systems academics still argue about TP2. You are not going to solve it in this interview.

What About CRDTs?

CRDTs (Conflict-free Replicated Data Types) take the opposite approach. Every character gets a globally unique identifier with a fractional position encoding its intended order. Because identifiers are immutable, concurrent inserts and deletes always converge without transformation. No central server required for ordering.

The cost is metadata. Every character carries an author ID, a logical timestamp, and a tombstone flag when deleted. A 10,000-character document might carry 10x the content in CRDT metadata. Garbage collection and undo are expensive. Your RAM bill grows along with your user's prose.

The practical comparison: OT requires a central server and is O(n²) to merge concurrent histories, but metadata stays small. CRDTs are peer-to-peer friendly and mathematically elegant, but memory-hungry. Google Docs uses OT with a central server. Notion uses a hybrid: CRDT for block structure, OT for text within blocks.

Collaborative Editor Architecture: Four Paths Through the System

System architecture: client browser connects through load balancer to WebSocket gateway, which routes to collaboration service, operation log (Kafka), and document store (PostgreSQL plus S3). A separate dashed path goes from gateway to presence service (Redis). The durable edit path (solid lines) and the ephemeral presence path (dashed) are separated deliberately. Cursor positions are not worth the same durability guarantees as document content.

Edit path: Client sends an operation over WebSocket to the Gateway. Gateway routes to the Collaboration Service responsible for that document. The service holds the document's in-memory state, applies the transform, writes the accepted operation to the Operation Log (Kafka), and broadcasts the result to all connected clients.

Presence path: Client sends cursor position pings every 2 seconds. Gateway writes to the Presence Service (Redis with TTL). Presence fans out cursor updates to other clients in the same document session.

Read path: Client opens a document. It fetches the latest snapshot from S3 plus any delta operations from PostgreSQL since that snapshot. Cold loads hit this path; hot loads use the in-memory state in the Collaboration Service if the document is already active.

Version history path: REST call to list or restore revisions. Served from the Operation Log directly.

The Data Model: Three Tables

documents ( doc_id UUID PRIMARY KEY, owner_id UUID, title TEXT, snapshot JSONB, -- compressed latest state snapshot_rev INTEGER, -- revision number of snapshot created_at TIMESTAMPTZ, updated_at TIMESTAMPTZ ) operations ( op_id UUID PRIMARY KEY, doc_id UUID REFERENCES documents, revision INTEGER NOT NULL, author_id UUID, op_type TEXT, -- "insert" | "delete" | "format" op_data JSONB, -- position, text, attributes applied_at TIMESTAMPTZ, UNIQUE (doc_id, revision) ) permissions ( doc_id UUID, user_id UUID, role TEXT, -- "owner" | "editor" | "viewer" PRIMARY KEY (doc_id, user_id) )

The operations table is append-only. Document state at any revision is: load the nearest snapshot, then replay operations WHERE doc_id = $1 AND revision > snapshot_rev ORDER BY revision. This is event sourcing applied to documents. Every edit ever made is in there. Nothing gets deleted. Auditors love it. Storage costs creep up on you.

Snapshot compaction is not optional. Replaying 50,000 raw operations on every document open is too slow. A background job snapshots the full document state every 100 operations and writes it to S3. Cold opens load the S3 snapshot and apply a small delta from PostgreSQL. Snapshots are cheap relative to cold-load latency. Skip them and your document open time will tell you so.

The WebSocket Protocol

POST   /docs                      -- create document
GET    /docs/:id                  -- load (latest snapshot + delta)
GET    /docs/:id/revisions        -- list versions
POST   /docs/:id/revisions/:rev   -- restore to revision

WS     /docs/:id/collaborate
  client → server:  { type: "op", base_revision: 42, op: { ... } }
  server → client:  { type: "op", revision: 43, op: { ... }, author_id: "bob" }
  server → client:  { type: "ack", accepted_revision: 43 }
  client → server:  { type: "presence", cursor_pos: 1547, selection: null }
  server → client:  { type: "presence", user_id: "bob", cursor_pos: 1547 }

The base_revision field in each outgoing operation tells the server what state the operation was written against. If the server is at revision 47 and the client sent an operation against revision 44, the server transforms the operation against revisions 45, 46, and 47 before applying it. The client receives an ack with the accepted revision number and advances its local revision counter.

The base_revision is the reason this works. Without it, the server has no idea how stale the client's view is or which operations it needs to transform against. Get the protocol wrong and you get the tangle from the photo above.

Presence: Keep It Ephemeral

Cursor positions are not part of the document. Losing a cursor update does not corrupt anything. If presence is lost, the cursor disappears for a moment and reappears on the next ping. Keeping presence out of the OT engine is the right call.

The implementation: clients send a presence message over the existing WebSocket every 2 seconds. The Gateway writes cursor:{doc_id}:{user_id} to Redis with a 5-second TTL. Redis pub/sub fans the update to all other Gateway nodes serving that document. Each Gateway forwards the update to its connected clients. Stale cursors evaporate automatically when the TTL expires. No cleanup code required. No zombie cursors haunting a document after someone rage-quits.

The real-time push model over WebSockets is faster than polling for this use case. For a broader discussion of the push vs pull tradeoff, the push vs pull tradeoff piece covers the decision framework.

Redis sorted sets (which use a skip list internally) are a natural fit for presence queries like "give me all active editors on document X sorted by last-seen time."

Scaling: The Stateful Problem

Collaboration services are inherently stateful. Every client editing document A must connect to the same Collaboration Service instance, because that instance holds the in-memory document state. Most distributed systems love to be stateless. This one cannot be. Everyone is very sorry about that.

Routing by doc_id. The load balancer hashes doc_id and routes all clients editing that document to the same Gateway node, which routes to the same Collaboration Service. When a node goes down, its documents reload from PostgreSQL and resume on a healthy node. In-flight operations are replayed from the Operation Log.

Consistent hashing keeps rebalancing cheap. When you add a server, only K/(N+1) documents move, where K is total active documents and N is the existing server count. A mod-N scheme would move nearly all documents and cause a thundering herd of reloads. The routing table lives in Redis so Gateways can look up the correct shard in O(1).

Consistent hash ring with doc_id hash points mapped to collaboration shards A through E, with Redis routing table and gateway nodes pointing to the correct shards Each doc_id hashes to a position on the ring and gets assigned to the next clockwise shard. Adding shard F would only steal the arc between it and its predecessor. Everything else stays put.

Connection capacity math matters. A single server handles roughly 50,000 to 100,000 concurrent WebSocket connections (the limiting factor is file descriptors and memory per connection, not CPU). With 50 editors per document and 1M active documents, the system needs up to 50M concurrent connections across 500 to 1,000 Gateway nodes. The math is fine. Horizontal scaling handles it. Name the numbers out loud in the interview anyway: it signals you thought about the scale.

Version History Falls Out for Free

Version history is a direct consequence of the append-only operation log. The UI timeline is a query over the operations table grouped by applied_at. Restoring a revision means: read the nearest snapshot before the target revision, replay operations forward to the target, write the result as a new snapshot. The original operations are preserved. The log never mutates.

One edge case worth naming out loud: two users can issue concurrent restores. User A restores to revision 10; User B simultaneously restores to revision 8. Both restores are themselves operations in the OT system. The server serializes them. The later one wins. Both are recorded. Correct behavior, automatically. You get this for free because the operation log is your source of truth and not just a cache of the current state.

The Tradeoffs to Name Out Loud

Your interviewer wants to hear these articulated, not left implicit.

OT vs CRDT. OT requires a central server for operation ordering, which simplifies the algorithm and keeps metadata small, but creates a coordination bottleneck and complicates offline editing. CRDTs avoid centralization and work offline naturally, but metadata growth is real and garbage collection is hard. Name the tradeoff. Name your choice. Name why. "It depends" without the follow-up is a non-answer.

Snapshot frequency. More frequent snapshots mean faster cold loads and smaller deltas to replay, but more S3 writes and storage cost. Less frequent means cheaper storage but slower reopens for cold documents. 100 operations per snapshot is a reasonable starting point. Tune based on median cold-load latency once you have data.

Presence vs correctness. Separating presence (Redis, best-effort) from edits (OT engine, durable) is the correct architectural decision. Conflating them would add durability overhead to the latency-sensitive presence path, and TTL-driven cursor cleanup is simpler than coordinating cursor deletes through the OT engine. This is a genuinely good design choice, not just a performance optimization.

Rich text complexity. Plain-text OT is solved. Rich text is not. Operations on nested document trees require transform functions for every operation type pair. Formatting operations that span ranges interact with insertions in non-obvious ways. Naming this complexity in the interview is more valuable than pretending it away with a hand wave. Interviewers can tell the difference.

The 45-Minute Clock

TimeFocus
0-5 minClarify requirements. Plain text or rich text? How many editors per doc? Version history?
5-12 minExplain the synchronization problem. Draw the OT transform example. Name TP1 and why central server avoids TP2.
12-22 minHigh-level architecture: Gateway, Collaboration Service, Operation Log, DB, Presence
22-32 minData model: documents, operations, permissions. Event sourcing. Snapshot compaction.
32-40 minScaling: doc_id routing, consistent hashing, connection capacity math
40-45 minTradeoffs: OT vs CRDT, presence separation, snapshot policy, rich text

The synchronization algorithm is the pivot. Strong candidates separate from average ones by whether they can trace the OT transform step precisely, not by how many database tables they draw. You can have the most elegant schema in the room and still look like you read a blog post if you cannot walk through the transform function when asked.

Quick Recap

  • Concurrent edits require OT or CRDTs. Both solve commutativity. They trade metadata overhead for centralization.
  • OT uses a central server to serialize operations and transforms incoming ops against the concurrent history before applying them.
  • Clients include a base_revision with each operation. The server transforms forward and broadcasts.
  • Architecture: WebSocket Gateway routes by doc_id. Collaboration Service holds in-memory doc state. Operation Log is append-only.
  • Presence is ephemeral: Redis with TTL, separated entirely from the OT durability path.
  • Snapshot plus delta replay gives version history for free, with no extra work.
  • Rich text is the unsolved hard part. Name it. Don't hand-wave it.

Explaining the transform step fluently under time pressure is harder than it looks. SpaceComplexity runs voice-based system design mock interviews where you practice the OT walkthrough out loud until the explanation flows naturally, not just until you can read it back on paper.

Further Reading