Top 15 LeetCode Problems for SRE and DevOps Interviews

- Heaps cover four SRE patterns in one family: top-K frequency, streaming median, task scheduling, and concurrent capacity planning.
- Sliding window maximum is the O(n) time-series anomaly detector; the monotonic deque is the mechanism, not a sorted structure.
- Binary search on the answer generalizes to any "minimum resource to meet a deadline" problem, from LC 1011 to rate limit sizing.
- Kahn's topological sort handles service dependency ordering and detects circular deps for free via processed-count check.
- LRU Cache (LC 146) appears in nearly every SRE loop; sentinel head and tail nodes are required, not optional.
- The follow-up question is the real test: memory constraints, concurrency, and approximate structures (Count-Min Sketch, HyperLogLog) signal production readiness.
- Prefix sum plus hash map solves sustained-anomaly window detection in O(n); initializing with
{0: 1}is the one bug everyone misses.
SRE coding interviews feel like SWE interviews until the follow-up hits. You solve the problem. The interviewer nods. Then: "How would you run this against a 100GB log file without blowing up memory?" The coding problem was the entry ticket. The ops question is what the loop has been building toward the whole time, and it does not care that you memorized seven DP patterns.
SRE and DevOps loops use the same LeetCode patterns as general software engineering, but the problems are chosen for their infrastructure analogues. You will see more heaps, interval problems, and graph problems than bit manipulation or tree rebalancing. Almost no dynamic programming outside of Google-level roles.
This guide covers the 15 problems that actually recur, grouped by the five pattern families that dominate the distribution. Each one maps to a real ops scenario with a production follow-up. Assumes LeetCode medium difficulty, Python or Go preferred.
Which LeetCode Patterns Actually Show Up in SRE Interviews
The problem set is not random. Each pattern maps to something SREs actually do in production.
Heaps show up because SREs track ranked lists: top-K errors by frequency, highest-latency endpoints, priority queues of alerts. Interval problems map to scheduling: maintenance windows, on-call rotations, deploy slots. Graphs model infrastructure: dependency ordering, network topology, connected components after failures. Sliding windows capture time-windowed metrics. Binary search on the answer is the canonical pattern for any "minimum capacity to meet this SLA" question.
Cover these five families and you have covered most of what you will see.
Heaps: Because Everything Worth Monitoring Has a Ranking
1. Top K Frequent Elements (LC 347)
The ops frame: Find the 10 error types that fired most in the last hour across your log stream.
Counter plus min-heap of size K gives O(n log K). Count frequencies, then maintain a heap that evicts the smallest frequency when it exceeds size K. The follow-up: "What if the stream is infinite and only the last five minutes matter?" That is a sliding window variant. Under memory constraints, a Count-Min Sketch gives approximate top-K.
2. Find Median from Data Stream (LC 295)
The ops frame: Track p50 latency on a live service. New data points arrive every 100ms and you need the current median at any time.
Two heaps: a max-heap for the lower half, a min-heap for the upper half. After every insert, rebalance so their sizes differ by at most one. The median is either the top of the larger heap or the average of both tops. Change the rebalance threshold and you can track p90 or p99 from a stream without storing every data point.
3. Task Scheduler (LC 621)
The ops frame: N job types, each requiring a cooldown period before it can run again. How many time slots does the full queue need?
The greedy insight: always schedule the most-frequent remaining job not in cooldown. There is a closed-form: max(total_tasks, (max_freq - 1) * (n + 1) + count_of_max_freq). Know the formula to verify your heap simulation, but derive it from the simulation in the interview. Idle slots only appear when the highest-frequency job type is cooling down. In a CI system, those idle slots are the gap where every other test suite waits while the slow integration test runs. The formula is real. The pain is also real.
from collections import Counter def least_interval(tasks: list[str], n: int) -> int: freq = list(Counter(tasks).values()) max_freq = max(freq) max_count = freq.count(max_freq) return max(len(tasks), (max_freq - 1) * (n + 1) + max_count)
4. Meeting Rooms II (LC 253)
The ops frame: N deploys with start and end times. What is the minimum number of parallel runners needed so no deploy queues behind another?
Sort by start time. Use a min-heap of end times. If the earliest-ending job finishes before the next one starts, reuse its slot. Otherwise open a new one. The heap size at the end equals the maximum concurrent overlap, which is your required capacity. Your PM calls this capacity planning. Your on-call rotation calls it the incident where everything queued for two hours on a Friday afternoon.
Sliding Window: When Your Metric Has a Memory
5. Sliding Window Maximum (LC 239)
The ops frame: CPU utilization at one-second intervals. You need the maximum reading in every five-minute window for anomaly detection. O(nk) is too slow for a production metrics pipeline.
A monotonic deque maintains indices in decreasing value order. On each step, pop indices outside the window from the left. Pop smaller values from the right before pushing the current index. The front is always the current window's maximum.
from collections import deque def max_sliding_window(nums: list[int], k: int) -> list[int]: dq = deque() result = [] for i, val in enumerate(nums): while dq and nums[dq[-1]] < val: dq.pop() dq.append(i) if dq[0] == i - k: dq.popleft() if i >= k - 1: result.append(nums[dq[0]]) return result
O(n) total because each index is pushed and popped at most once. The follow-up: "How does this change for rolling p99 instead of max?" That is Find Median from Data Stream applied to a fixed window. The pager will ask you this at 3am before the interview does.
6. Minimum Window Substring (LC 76)
The ops frame: Scanning log entries for lines that contain all of a set of required error codes. Find the shortest log window containing every required code at least once.
Two-pointer shrink-and-expand. Track a "have" counter against "need". When all required items are covered, shrink from the left until the window becomes invalid. Only increment "have" when a character's count hits exactly the required threshold, not on every occurrence. Missing this produces subtle overcounting.
Intervals: The Pattern Behind Every Maintenance Window
7. Merge Intervals (LC 56)
The ops frame: Multiple teams submit maintenance windows. Merge overlapping ones to compute the total scheduled downtime.
Sort by start time. Merge when the current start is at or before the last merged end. The non-obvious bug: use max() on the end time in the merge step. A fully contained interval would otherwise silently truncate the merged window to a shorter one. Silent correctness bugs in scheduling code are how you end up with overlapping maintenance windows in production even though the system said everything was fine.
8. Capacity to Ship Packages Within D Days (LC 1011)
The ops frame: Given N jobs and a deadline of D days, find the minimum throughput that processes all jobs on time.
Binary search on the answer. Lower bound is max(weights) since you cannot split a job. Upper bound is sum(weights). The feasibility check is a greedy scan: simulate processing with capacity mid, count days needed, return needed <= D.
This pattern generalizes to every "minimize the maximum resource to meet a deadline" question in ops. Server bin packing, batch job sizing, rate limit calculation. Learn the template once and recognize it everywhere.
Graphs: Your Microservices Have a Topology Problem
9. Course Schedule II (LC 210)
The ops frame: Service A depends on B, B depends on C. Find a valid deployment order. Detect circular dependencies before you push to prod.
Topological sort via Kahn's algorithm: compute in-degrees, push zero-in-degree nodes to a queue, process and decrement neighbors. Cycle detection is free: if len(result) != num_courses, a cycle exists. Kahn's is easier to implement correctly under interview pressure than DFS-based topological sort, because it does not require managing a visit-state per node.
from collections import deque def find_order(num_courses: int, prerequisites: list[list[int]]) -> list[int]: graph = [[] for _ in range(num_courses)] in_degree = [0] * num_courses for a, b in prerequisites: graph[b].append(a) in_degree[a] += 1 queue = deque(i for i in range(num_courses) if in_degree[i] == 0) order = [] while queue: node = queue.popleft() order.append(node) for neighbor in graph[node]: in_degree[neighbor] -= 1 if in_degree[neighbor] == 0: queue.append(neighbor) return order if len(order) == num_courses else []
10. Number of Islands (LC 200)
The ops frame: A grid of '1's (healthy nodes) and '0's (down nodes). Count the isolated healthy network segments.
Standard BFS or DFS. The outer loop over all cells is mandatory to handle disconnected components. This is the SRE interviewer's favorite graph warmup because it directly models network segmentation and cluster isolation. The expected follow-up: "Label each island with a region ID," which requires storing group assignments in a hash map alongside the traversal.
11. Redundant Connection (LC 684)
The ops frame: Parsing an infrastructure-as-code manifest processed edge by edge. One edge creates a dependency cycle. Find it.
Union-Find. Process edges in order. If two nodes are already in the same component when you process an edge, that edge is redundant. Union-Find is more natural here than DFS cycle detection because you are streaming edges and want the first cycle-forming one, not just any cycle. Path compression and union by rank keep everything near O(1) per operation.
12. Network Delay Time (LC 743)
The ops frame: A network where edge weights represent latency. How long until a signal from node K has reached every other node?
Dijkstra's algorithm. The answer is max(dist.values()). If any node is unreachable, return -1. That unreachable case maps directly to split-brain detection. Know why Dijkstra fails with negative edge weights and what replaces it (Bellman-Ford, covered in the graph algorithm section).
Time-Series Math: Two Patterns Your Dashboards Already Use
13. Subarray Sum Equals K (LC 560)
The ops frame: A time series of per-minute request counts. How many contiguous windows sum to exactly K? This is your "periods of sustained anomalous load" detector.
Prefix sums plus hash map. For each prefix sum P, check how many times P - k appeared in previous prefix sums. Initialize the map with {0: 1} to handle subarrays starting at index zero. O(n) time and space.
14. Daily Temperatures (LC 739)
The ops frame: For each alert threshold breach in a time series, how many time steps until the next recovery?
Monotonic stack. Maintain a decreasing stack of indices. When you encounter a value larger than the stack's top, pop indices and record the gap as the answer for that index. The stack represents a queue of unresolved incidents waiting for their recovery event. Each element is pushed and popped at most once, so this is O(n) despite looking quadratic.
LRU and the Cache Problem That Never Goes Away
15. LRU Cache (LC 146)
The ops frame: Every CDN, service mesh, proxy cache, and DNS resolver uses LRU eviction. Get and put must run in O(1).
Doubly linked list plus hash map. The list tracks recency order; the map gives O(1) access to any node. On get, move the accessed node to the front. On put, evict the tail if you are over capacity. Sentinel head and tail nodes eliminate every edge case in node removal and are not optional. Without them, null pointer bugs appear on every eviction.
Know this one cold before anything else on this list. If you have shipped to production in the last few years, there is a reasonable chance you have hit an LRU-adjacent bug without knowing it. Now you get to implement one from scratch. LFU Cache (LC 460) adds a frequency dimension and comes up at Google and Datadog. The LRU implementation deep dive covers the full sentinel node approach.
The Follow-Up Is the Actual Test
Nearly every problem above will come with a production-scale follow-up. The structure of good answers is consistent.
| Follow-up type | Signal they want | Common answer |
|---|---|---|
| "What if input is 100GB?" | Memory awareness | External sort, streaming, chunked processing |
| "What if writes are concurrent?" | Concurrency awareness | Read-write locks, atomic operations, lock-free structures |
| "What if you need approximate results?" | Probabilistic data structures | Count-Min Sketch, HyperLogLog, Bloom filter |
| "How would you test this?" | Operational rigor | Edge cases, load testing, chaos injection |
The coding problem gets you in the room. The follow-up gets you the offer. It is also where you signal that you have thought about running code in production and not just on a laptop pointed at example inputs. Most candidates tank the follow-up not because they do not know the answer but because they have never been asked the question before. Now you have been.
To practice both sides under time pressure, SpaceComplexity runs voice-based mock interviews covering SRE and systems patterns, with rubric feedback on the code and the ops reasoning separately.
For the pattern deep dives, start with BFS pattern recognition, merge intervals, and binary search on the answer before working through the heap and graph problems.
Further Reading
- Google SRE Interview Guide, IGotAnOffer's breakdown of the process and rounds
- SRE Interview Prep Guide, community-maintained GitHub guide covering Linux, networking, and coding
- Google SRE Book, the canonical reference for what SREs actually do in production
- LeetCode Patterns, searchable table of problems sorted by pattern family