What Is a Disjoint Set? The Data Structure Behind Connectivity

You have a grid of a million cells. Cells get marked as land one at a time, and after each insertion someone asks: how many islands are there now? You could run a BFS after every update. You won't. Not at that scale. That's the kind of answer that gets your query cancelled mid-flight.

The disjoint set data structure, also called union-find, was built for exactly this. It tracks which elements belong to the same group and merges groups in nearly constant time, no matter how many elements you have. Two operations. Zero graph traversal.

What BFS Gets Wrong at Scale

Union-find builds a partition dynamically: start with every element in its own group, then merge groups as connections arrive. No element lives in two groups at once.

The friend-group model makes this concrete. Five users: 0, 1, 2, 3, 4. Initially each is their own group, living the solitary life.

Groups: {0}  {1}  {2}  {3}  {4}

Then three connections arrive:

union(0, 1): users 0 and 1 are friends. One group forms.
union(2, 3): users 2 and 3 are friends. Another group forms.
union(1, 2): user 1 and user 2 connect. The two groups merge.

Groups: {0, 1, 2, 3}  {4}

Now you can ask "are 0 and 3 connected?" (yes) or "are 0 and 4 connected?" (no). Union-find answers both in effectively O(1) time. BFS would walk the whole graph again. Every time. For every query.

Two Operations, One Forest

Union-find stores a forest of trees, one tree per group. Each node holds a pointer to its parent. The root of each tree is the group's representative, identified by the fact that it points to itself.

parent = [0, 1, 2, 3, 4]
#         0  1  2  3  4   (initially everyone is their own root)

The two operations are:

find(x): follow parent pointers until you reach the root. If find(x) == find(y), x and y are in the same group.
union(x, y): find both roots, then make one root the parent of the other.

After the three unions above, the parent array looks like this:

parent = [0, 0, 0, 2, 4]

Reading it: node 1 points to 0, node 2 points to 0, node 3 points to 2. Following 3's chain: 3 → 2 → 0. Root of group {0,1,2,3} is 0. Node 4 is still its own root, still alone at the party.

Why the Naive Find Is O(n)

The naive implementation just walks the chain:

def find(parent, x):
    while parent[x] != x:
        x = parent[x]
    return x

Without any optimization, this is O(n) in the worst case. If you always union by attaching one long chain to another, your tree degenerates into a linked list. find has to traverse the whole thing. You've essentially reinvented the thing you were trying to replace.

Path Compression Fixes It

Every call to find reveals a path from a node to the root. Path compression makes every node on that path point directly to the root. The next find call on any of those nodes takes O(1).

def find(parent, x):
    if parent[x] != x:
        parent[x] = find(parent, parent[x])
    return parent[x]

The recursive call compresses the path on the way back up. Before compression, node 3's chain is 3 → 2 → 0. After calling find(3), node 3's parent becomes 0 directly. Free shortcut. Every future find(3) is a one-hop answer.

Path compression: before and after calling find(3), node 3 jumps directly to the root

One find call, and suddenly node 3 remembers where the boss sits.

An iterative two-pass version avoids deep recursion on very large inputs:

def find(parent, x):
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root

Union by Size Keeps Trees Shallow

Path compression handles find. But without guidance, union might attach a tall tree under a short one over and over. You'd fix it on every find, but the tree grows back during unions. It's like mopping a floor while someone else is still spilling.

Union by size attaches the smaller tree under the larger one. This keeps tree height bounded at O(log n) even before path compression.

def union(parent, size, x, y):
    rx, ry = find(parent, x), find(parent, y)
    if rx == ry:
        return False
    if size[rx] < size[ry]:
        rx, ry = ry, rx
    parent[ry] = rx
    size[rx] += size[ry]
    return True

Union by rank works too, tracking an upper bound on height instead of exact size. Same asymptotic result. Union by size is easier when you need to query group sizes directly.

How Fast Is Near-Constant, Actually?

With both path compression and union by size:

Operation	No optimization	Union by size only	Path compression only	Both
`find`	O(n)	O(log n)	O(log n) amortized	O(α(n)) amortized
`union`	O(n)	O(log n)	O(log n) amortized	O(α(n)) amortized

α is the inverse Ackermann function. It grows so slowly that α(n) ≤ 4 for any n you could encounter in practice (in fact, for any n below 2^65536). Treat O(α(n)) as O(1). If you need to explain this in an interview and they ask you to prove it, politely mention that the proof takes about 30 pages of dense mathematics and suggest you move on.

Space complexity is O(n) for the parent and size arrays. No adjacency lists, no visited sets, no recursion stack beyond the find call.

This is proven optimal. No data structure can answer both find and union faster in the comparison model. You've hit the ceiling.

Where the Disjoint Set Shows Up in Interviews

The signal for union-find is dynamic connectivity: edges arrive one at a time and you need connectivity answers in between. BFS re-explores from scratch every time. Union-find processes each edge in O(α(n)). That gap matters a lot once your input hits five or six figures.

Reach for union-find when you see:

"How many connected components?"
"Are these two nodes in the same group?"
"Does adding this edge create a cycle?"
Merge operations on groups (accounts, clusters, islands)

Classic problems:

Number of Connected Components (LeetCode 323): Initialize one component per node, union on each edge, return components at the end.

Redundant Connection (LeetCode 684): Process edges one at a time. The first edge where both endpoints already share a root is your answer.

Accounts Merge (LeetCode 721): Treat email strings as nodes. Union all emails belonging to the same account entry. Then gather accounts by root. Sounds weird. Works perfectly.

Number of Islands II (LeetCode 305): The online version. Add land cells one at a time. On each addition, try to union with each adjacent land cell. The components counter gives the island count after each step. This is the problem that justifies the entire data structure.

Kruskal's MST algorithm also depends on union-find directly. You can read more about that in the Kruskal's algorithm guide. For pattern recognition, the guide to when union-find applies covers the full set of signals.

One Structure, Every Language

All implementations use path compression and union by size. They expose find, union, and a components counter.

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))
        self.size = [1] * n
        self.components = n

    def find(self, x: int) -> int:
        if self.parent[x] != x:
            self.parent[x] = self.find(self.parent[x])
        return self.parent[x]

    def union(self, x: int, y: int) -> bool:
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return False
        if self.size[rx] < self.size[ry]:
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        self.components -= 1
        return True

    def connected(self, x: int, y: int) -> bool:
        return self.find(x) == self.find(y)

What to Say When They Push on Complexity

In an interview, you'll rarely need to build union-find from scratch on the whiteboard. What you do need is to recognize when the problem wants it and explain the two optimizations clearly.

"Can you make this faster?" Path compression. "What's the worst case for find?" The degenerate tree, fixed by union by size. "What's the time complexity?" O(α(n)) amortized per operation. You can call it O(1) for any practical input because α(n) never exceeds 4 for any n you'll encounter.

The mistake most candidates make is confusing union-find with BFS. They're both connectivity checks. BFS is better when you need the actual path or the graph is static. Union-find is better when you're building the graph incrementally and only need yes-or-no connectivity. If you want to practice explaining that distinction out loud under interview conditions, SpaceComplexity runs voice-based DSA mock interviews with rubric-based feedback on exactly this kind of reasoning.

More union-find problem patterns are in the full algorithm walkthrough and the pattern recognition guide. For a broader look at graph problems where union-find competes with BFS and DFS, see the graph data structure interview guide.