What Is Bitmask DP? State Compression for Subset Problems

June 12, 20269 min read
dsaalgorithmsinterview-prepdynamic-programming
What Is Bitmask DP? State Compression for Subset Problems
TL;DR
  • Bitmask DP (state compression) encodes a subset of n elements as a single integer, collapsing factorial permutation search into O(n·2^n)
  • Three signals: n ≤ 20, you need to track which elements were used (not just how many), and the problem has optimal substructure
  • The DP table is a flat array of size 2^n — each index is a bitmask representing a distinct visited-element subset, no hashing needed
  • Guard for INF before transitioning: propagating INF + cost corrupts unreachable states and produces wrong answers silently
  • Iterate masks in increasing integer order: more bits set means a later sub-problem, so ascending order naturally resolves dependencies first
  • Frozenset memoization is 5-10x slower due to hashing overhead; the bitmask array wins on cache locality and constant factors

Most DP problems have a clean state. "Maximum profit from the first i items with capacity j" maps to a 2D table. The dimensions are numbers. The table is just an array. You feel, briefly, like a competent engineer.

Then someone asks you to track which elements from a set you've already used. That's not a number. That's a subset. The number of distinct subsets of an n-element set is 2^n. Your nicely indexed array just became a very awkward dictionary, and your O(polynomial) algorithm quietly packed its bags.

State compression encodes that subset as a single integer, then runs DP over all integers from 0 to 2^n - 1. You'll also hear it called bitmask DP. Same idea, two names: one says what you're compressing, the other says how.


Why Subsets Break Ordinary DP

Consider the Traveling Salesman Problem. You have n cities. You want the shortest round trip that visits every city exactly once.

The state you need is: "I'm at city X, and I've already visited these specific cities." That "already visited" part is a subset of {0, 1, ..., n-1}. You can't shove it into an array index without encoding which cities are actually in there.

The naive approach tries all permutations. That's O(n!) routes. For n=20, roughly 2.4 × 10^18 operations. Your laptop will still be computing when the heat death of the universe arrives. Not as a dramatic exaggeration. Literally.

Bitmask DP reduces this to O(n^2 · 2^n). For n=20, about 400 million operations. Still large, but it finishes in seconds. The universe gets to survive another day.

There are exactly 2^n distinct subsets of an n-element set, and each maps to a unique integer from 0 to 2^n - 1. A flat array of size 2^n holds every sub-problem exactly once, no hashing required, no crying required.


How One Integer Holds an Entire Set

If bitmasks are new to you: bit i of an integer represents whether element i belongs to the set. One bit. One element. That's it.

# Represent the set {0, 2, 3} as a bitmask mask = (1 << 0) | (1 << 2) | (1 << 3) # 0b1101 = 13 # Check if element 2 is in the set present = (mask >> 2) & 1 # 1 (yes) # Add element 1 expanded = mask | (1 << 1) # 0b1111 = 15 # Remove element 0 shrunk = mask & ~(1 << 0) # 0b1100 = 12 # Count elements in the set size = bin(mask).count('1') # 3

Every operation is O(1). Membership check, add, remove, count: all single bitwise operations. No data structure overhead, no collision buckets, no amortized-anything asterisks in the complexity analysis.

Diagram showing how the integer 13 (0b1101) encodes the set {0, 1, 3}: each lit bit maps to an element in the set, unlit bits represent absent elements

Each bit is one slot in the set. Flip it on, the element's in. Flip it off, it's out. The integer does all the bookkeeping.

For enumerating all subsets of an n-element set, you loop from 0 to 2^n - 1. Each integer is a subset. Your DP table is a plain array indexed by these integers.

In practice, bitmask DP problems cap n at around 20. At n=20, 2^20 is about one million states. At n=25, 33 million. Past n=30, even allocating the table starts requiring a conversation about RAM budgets.


A Clean Bitmask DP: The Assignment Problem

You have n workers and n tasks. cost[i][j] is the cost for worker i to do task j. Each worker gets exactly one task, each task goes to exactly one worker. Minimize total cost.

Trying all n! assignments fails past n=12. The bitmask DP:

  • mask = bitmask of which tasks are assigned so far
  • The number of set bits in mask equals the number of workers assigned, which tells you which worker is next
  • dp[mask] = minimum cost to assign the first popcount(mask) workers to exactly the tasks in mask

Transition: for each task not yet in mask, assign it to the next worker and update dp[mask | (1 << task)].

def min_cost_assignment(cost: list[list[int]]) -> int: n = len(cost) INF = float('inf') dp = [INF] * (1 << n) dp[0] = 0 for mask in range(1 << n): if dp[mask] == INF: continue worker = bin(mask).count('1') if worker == n: continue for task in range(n): if mask & (1 << task): continue new_mask = mask | (1 << task) dp[new_mask] = min(dp[new_mask], dp[mask] + cost[worker][task]) return dp[(1 << n) - 1]

Walk through it for n=3:

  • dp[0b000] = 0: no tasks assigned, zero cost
  • Worker 0 tries tasks 0, 1, 2. We fill dp[0b001], dp[0b010], dp[0b100]
  • Worker 1 picks from the remaining tasks in each of those states
  • Worker 2 picks whatever's left
  • dp[0b111] holds the minimum over all valid complete assignments

The outer loop runs 2^n times. The inner loop runs at most n times. Total: O(n · 2^n) time, O(2^n) space. For n=20, that's 20 million operations versus a factorial version that will never finish. Not "takes a while." Never.


O(n · 2^n) Is the Price. Here's Why It's Worth It.

State compression trades factorial time for exponential time. Sounds like a dubious bargain until you see what "factorial" actually means for any non-trivial n:

nn! (naive)n · 2^n (bitmask DP)
103.6 million~10 thousand
151.3 trillion~500 thousand
202.4 × 10^18~20 million

1.3 trillion operations for n=15. That's not slow. That's cosmologically wrong. Bitmask DP does the same problem in half a million operations. This is not a close comparison.

The crossover happens early. Past n=12, bitmask DP is the only approach that actually completes during your lifetime.

Space follows the same pattern. For n=20, the DP table is about 4 MB at 4 bytes per entry. When the state also tracks current position (as in TSP, where you need both visited set and current city), the table grows to n · 2^n entries, around 80 MB. That's real memory. The factorial version cannot even write its answer down in the age of the universe, so "uses 80 MB" feels like a reasonable trade.

The flat integer array beats frozenset-keyed memoization by 5-10x on cache locality and constant factors, even when asymptotic complexity is identical. Hashing a Python frozenset is not cheap. The bitmask table pays none of that overhead.


Three Signals That Tell You to Compress State

State compression applies when all three of these hold:

  1. n is small, typically n ≤ 20. This is nearly a flashing neon sign. When you see that constraint, immediately ask whether the problem involves tracking a subset.
  2. You need to track which elements you've used, not just how many. A count fits in a 1D index. Specific membership requires a bitmask. If swapping two elements gives a different state, you need the bitmask.
  3. The problem has optimal substructure. Bitmask DP is still DP. An integer index doesn't conjure overlapping subproblems where none exist.

Common interview appearances:

  • "Visit every node at least once" (complete-coverage graph problems)
  • "Assign every element" (matching problems)
  • "Cover every position" (set-cover variants)
  • "Minimum path visiting all cities" (TSP variants)

The Traveling Salesman Problem is the canonical example. LeetCode 847 (Shortest Path Visiting All Nodes) is one of the cleanest interview versions. Some "minimum cost to hire k workers" problems also hide bitmask DP behind the k-selection constraint.

When in doubt: ask whether the state is a subset of some small set. If yes, try a bitmask.


What Breaks Without the Bitmask

Skipping state compression on a bitmask DP problem usually ends one of two ways.

Option 1: Brute force over permutations. Recursive code that tries every ordering. Correct for n ≤ 10, dead in the water after that. If you don't see the bitmask DP pattern, you're stuck when the interviewer asks you to optimize. They will ask.

Option 2: Memoization with a frozenset key. Logically correct. Also 10x slower, because Python's frozenset hashing is genuinely expensive. "It works, it's just slow" is not a great answer when the follow-up question is "can you improve the time complexity?"

Two specific bugs to avoid when you do use bitmask DP:

Guard for infinity before transitioning. If dp[mask] is still INF, that state hasn't been reached. Propagating INF + cost fills the table with garbage values that will silently corrupt your answer.

Iterate masks in increasing integer order. A mask with more bits set represents a later sub-problem. Since a bitwise OR only adds bits, processing 0 to 2^n - 1 in order naturally ensures sub-problems are solved before the states that depend on them. Violate this and you're reading values before they're computed.


You Need to Explain It, Not Just Code It

State compression sits at the harder end of the DSA spectrum. It rarely shows up in phone screens. Expect it in onsite rounds at competitive companies or in problems labeled hard on major practice platforms.

The recognition skill matters as much as the implementation. Spotting "n ≤ 20" and "visit all elements" and immediately thinking "bitmask DP" is what separates prepared candidates from candidates who solved it once, forgot the shape, and now stare at the constraint box like it said something rude.

The verbal explanation matters equally. Walking an interviewer through why factorial is too slow, why 2^n subsets map cleanly to array indices, and why iteration order matters demonstrates understanding that correct code alone cannot. That's exactly the kind of fluency SpaceComplexity is built to train: voice-based mock interviews where you explain your approach under real conditions, with rubric-based feedback on your reasoning quality, not just whether the output is correct.


Further Reading