What Is Square Root Decomposition? O(√n) Queries, Explained

June 5, 20269 min read
dsaalgorithmsinterview-prepdata-structures
What Is Square Root Decomposition? O(√n) Queries, Explained
TL;DR
  • Square root decomposition divides an array into blocks of size √n and precomputes one aggregate per block in O(n) build time.
  • Range queries cost O(√n): two partial edge scans plus one pass over complete middle blocks using precomputed aggregates.
  • Point updates cost O(1): only the single block containing the changed element needs its aggregate adjusted.
  • √n is the optimal block size because it minimizes B + n/B, balancing edge scan cost against block count so both terms equal √n.
  • Choose it over a segment tree when O(√n) fits constraints and a correct 25-line implementation under pressure beats a 60-line tree with off-by-one traps.
  • Mo's algorithm extends sqrt decomposition to offline queries in O((n+Q)√n), handling distinct-count and mode problems that merge-based structures can't.

You have an array of a million numbers. Someone asks for the sum of elements 50,000 through 800,000. Fine, you scan it. Then again with different indices. Then 500 more times.

Scanning every query costs O(n). A prefix sum answers in O(1) but turns into O(n) rubble the moment someone updates a value. Square root decomposition lives in the middle: O(√n) per query, O(1) per update, and the implementation fits in about 25 lines of code you can actually write under pressure.


The Core Idea Is Simple Enough to Draw

Take an array of n elements. Divide it into contiguous blocks, each of size approximately √n. For each block, precompute one aggregate: the sum, minimum, or maximum. That preprocessing pass takes O(n) and only happens once.

You now have a two-level structure: individual elements on the bottom, block aggregates on top.

When a query arrives for range [L, R], you handle three segments:

  • Scan element by element through the partial block at the left edge
  • Jump over complete middle blocks in one step using the precomputed aggregate
  • Scan element by element through the partial block at the right edge

Each partial edge touches at most √n elements. The middle blocks contribute at most √n lookups. Total per query: O(√n). It is embarrassingly straightforward. The kind of idea that makes you feel smart for learning it and slightly stupid for not inventing it yourself.


Let's Build One by Hand

Array of 9 elements, block size 3:

index:  0   1   2   3   4   5   6   7   8
value: [3,  2,  1,  4,  5,  8,  2,  7,  9]

Block 0: [3, 2, 1]  → block_sum[0] = 6
Block 1: [4, 5, 8]  → block_sum[1] = 17
Block 2: [2, 7, 9]  → block_sum[2] = 18

Query: sum of indices 1 through 7.

Step 1: Index 1 is in Block 0 (spans indices 0 to 2). The query starts at 1, so scan indices 1 and 2. Partial left sum: 2 + 1 = 3.

Step 2: Block 1 (indices 3 to 5) is fully inside the range. Add block_sum[1] = 17. Running sum: 20.

Step 3: Index 7 is in Block 2 (spans indices 6 to 8). The query ends at 7, so scan indices 6 and 7. Partial right sum: 9. Final answer: 29.

Four individual elements touched. One block value read. Seven total accesses for a seven-element range. Unimpressive at this scale. Push n to 1,000,000 with a block size of 1,000, and any query touches at most 1,000 elements on each edge plus at most 1,000 block values. Three thousand operations instead of a million.

Array divided into blocks of size √n, with a range query spanning a partial left block, two full middle blocks, and a partial right block


Twenty-Five Lines, No Tricks

import math class SqrtDecomposition: def __init__(self, nums: list[int]): self.n = len(nums) self.block_size = max(1, math.isqrt(self.n)) self.nums = nums[:] num_blocks = math.ceil(self.n / self.block_size) self.block_sums = [0] * num_blocks for i, val in enumerate(self.nums): self.block_sums[i // self.block_size] += val def update(self, index: int, val: int) -> None: self.block_sums[index // self.block_size] += val - self.nums[index] self.nums[index] = val def query(self, left: int, right: int) -> int: total = 0 left_block = left // self.block_size right_block = right // self.block_size if left_block == right_block: return sum(self.nums[left:right + 1]) for i in range(left, (left_block + 1) * self.block_size): total += self.nums[i] for b in range(left_block + 1, right_block): total += self.block_sums[b] for i in range(right_block * self.block_size, right + 1): total += self.nums[i] return total

The update method changes one element and adjusts one block sum. O(1). The query method runs two partial loops and one block loop. O(√n). The early return when L and R land in the same block is the one edge case worth thinking about: no middle blocks exist, so just scan directly.

That's it. Read it twice and you basically know it.


Why √n Is the Right Block Size

Worth internalizing. Interviewers ask this directly, and the answer has a satisfying shape.

If you use a block size of B, each query costs at most B steps for each partial edge and n/B steps for the middle blocks. Total per query: O(B + n/B).

You want to minimize B + n/B. If you took calculus in high school, you already know what comes next. Take the derivative with respect to B, set it to zero: 1 - n/B² = 0, so B = √n.

At B = √n, both terms equal √n, the sum is 2√n, and neither term can shrink without making the other grow.

Set B too large and you're scanning huge edge blocks. Set B too small and you're reading too many block aggregates. A block the size of the whole array means one fast middle lookup but O(n) edge work. A block of size 1 means no blocks at all, just a plain scan. The optimum is exactly in the middle, and the math tells you where.


Updates Are O(1), and That's the Whole Point

A prefix sum answers range queries in O(1). Update a single element and every prefix sum to its right goes stale. Rebuilding costs O(n). This is fine for static data. It is less fine when your array is getting updated 500 times a second.

Sqrt decomposition breaks that coupling. When element i changes, only one block aggregate changes: the one containing i. Everything else stays valid. The update is three lines of code.

This O(1) update is the reason sqrt decomposition exists as a technique. The query being O(√n) rather than O(log n) is the price you pay for it. Usually that price is worth it.


How Square Root Decomposition Compares to the Alternatives

StructureBuildQueryPoint UpdateExtra SpaceCode Complexity
Prefix sumO(n)O(1)O(n) rebuildO(n)5 lines
Sqrt decompositionO(n)O(√n)O(1)O(√n)25 lines
Fenwick treeO(n)O(log n)O(log n)O(n)20 lines
Segment treeO(n)O(log n)O(log n)O(n)60+ lines

Fenwick and segment trees are both faster: O(log n) versus O(√n). For n = 10^6, that's about 20 operations versus 1,000. If you're on a tight time limit, that gap matters.

But sqrt decomposition is the structure you can actually implement correctly in ten minutes under interview pressure.

A segment tree with lazy propagation is 60 lines of careful off-by-one arithmetic. One indexing mistake and it silently returns wrong answers. You won't notice until your fourth example fails and you have twelve minutes left. Sqrt decomposition has almost no subtle edge cases. If O(√n) fits within the problem's constraints (and it usually does up to n = 10^5 or so) the boring implementation is often the right call.

There's also a category where sqrt decomposition isn't just easier but outright better: queries where the aggregate is hard to merge. Counting distinct values in a range. Finding the mode. Computing the median. A segment tree node needs to merge its children in O(1) or O(log n). Sqrt decomposition just needs a block-level aggregate. For these queries, it's often the only clean approach.

Divide and Conquer meme from r/ProgrammerHumor

Divide the array into blocks. Conquer each query in O(√n). O(√n) isn't a consolation prize, it's the design.


Mo's Algorithm Extends This Idea

Square root decomposition underlies one of the more elegant offline query techniques: Mo's algorithm. The trick sounds like it shouldn't work.

Sort your queries by (left_block_index, right endpoint). Process them in that order. When you move from one query to the next, the boundaries shift by at most O(√n) on average across all queries. Total cost for Q queries over an array of size n: O((n + Q)√n).

That's it. The whole algorithm is a sort and a walk.

Mo's algorithm handles "count distinct elements in a range" and similar frequency queries that no standard online structure handles cleanly. If you see an offline problem with aggregates that don't merge, Mo's is almost certainly what the problem setter had in mind. The block-sorting trick is the same √n intuition applied to query ordering instead of array partitioning.

Look for the simple solution meme from r/ProgrammerHumor

Sort by block index, walk the pointers, update the count. Sometimes the elegant solution is just "sort it first."


Watch for These Edge Cases in Interviews

Non-square array sizes. If n is not a perfect square, the last block is smaller than the rest. Use math.ceil(n / block_size) for the block count and clamp the right edge when iterating. The code above handles this by iterating to right + 1 rather than a computed block end.

The same-block query. When L and R both land inside a single block, the three-part split breaks. Handle it as a special case before the main logic. Easy to miss, easy to test.

Block size of 1. For very small arrays (n = 1, 2, 3), math.isqrt(n) returns 1, which is fine. The code still works. You'll probably never hit this on a real problem, but you might in a test case designed to break you.

Explaining your block sizing and handling these cases out loud is exactly what a real interviewer probes. On SpaceComplexity, you can work through data structure problems in voice-based mock interviews where you talk through tradeoffs live, the part LeetCode never actually trains you for.


Before You Move On

  • Divide the array into blocks of size √n. Precompute one aggregate per block in O(n).
  • Range queries cost O(√n): two partial edge scans plus one pass over middle block aggregates.
  • Point updates cost O(1): change one element, update one block aggregate.
  • √n is optimal because it minimizes B + n/B. Any other block size worsens one term.
  • Use prefix sums for static arrays. Use sqrt decomposition when updates happen and you need something you can implement in 25 lines under pressure without hallucinating a tree node structure.
  • Mo's algorithm extends sqrt decomposition to offline query processing in O((n + Q)√n). It handles distinct-count and mode queries that merge-based structures cannot.

Further Reading