Top 15 Data Engineering Interview Problems: SQL and Python

- ROW_NUMBER() OVER (PARTITION BY ...) is the canonical SQL deduplication pattern and handles partial duplicates correctly
- SQL window functions (LAG/LEAD, DENSE_RANK, PERCENTILE_CONT) appear in at least half of all data engineering coding rounds
- Gaps-and-islands detection via date-minus-row-number is the most DE-specific SQL pattern you will encounter
- Generator-based file streaming in Python is the correct default for any file that could exceed available RAM
- SCD Type 2 tracks history by expiring old rows, and knowing when to choose it over Type 1 is a scored signal
- Exponential backoff with jitter prevents thundering-herd failures when retrying pipeline API calls
- Reconciliation logic requires indexing both sides first to avoid O(n*m) comparison on million-row datasets
The fastest way to fail a data engineering interview is to prep like it is a software engineering loop. Go ahead, spend four weeks grinding LeetCode graphs and dynamic programming. Then walk into your DE loop and get handed a SQL deduplication question. The hashmap was not going to save you.
The core of most DE coding rounds is SQL fluency plus practical Python, not algorithmic puzzles. Expect one to three medium-difficulty problems per round, grounded in realistic data scenarios: deduplication, aggregation, streaming, and transformation.
These 15 problems cover roughly 80% of what actually shows up. Work through all of them.
SQL Is the Core Skill (Problems 1-8)
DE interviews weight SQL more heavily than any other skill. SQL window functions alone appear in at least half of all rounds. The SWE instinct is to reach for a hashmap. The DE instinct is to reach for PARTITION BY. These eight problems cover the families that appear most often.
The interview question that's haunted every engineer who skipped SQL and went straight to Pandas.
1. Deduplication: Keep the Latest Row Per Key
The setup: a table has duplicate rows for the same entity, and you need to keep only the most recent one.
WITH ranked AS ( SELECT *, ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn FROM customers ) SELECT * FROM ranked WHERE rn = 1;
ROW_NUMBER() with PARTITION BY is the canonical dedup pattern. DISTINCT and GROUP BY min-max tricks both fall apart on partial duplicates. The follow-up: how do you delete the duplicates instead of selecting around them? Wrap the CTE in a DELETE WHERE rn > 1.
2. Running Total and Cumulative Sum
Compute a running total of revenue by day.
SELECT order_date, daily_revenue, SUM(daily_revenue) OVER (ORDER BY order_date) AS running_total FROM daily_sales;
SUM() OVER (ORDER BY ...) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Interviewers often ask about the difference between RANGE and ROWS frame modes. ROWS is precise; RANGE includes all rows with the same ORDER BY value, which creates surprising behavior on ties.
3. Top-N Per Group
Return the top 3 products by revenue for each category.
WITH ranked AS ( SELECT category, product, revenue, DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dr FROM products ) SELECT * FROM ranked WHERE dr <= 3;
Know when to use ROW_NUMBER vs RANK vs DENSE_RANK. For top-N with ties included, DENSE_RANK is correct. For top-N breaking ties arbitrarily, use ROW_NUMBER. Getting this wrong is common because all three look identical in simple examples. They only diverge when you have duplicates, which is exactly when correctness matters.
4. Period-over-Period Comparison with LAG and LEAD
Compute month-over-month revenue growth.
SELECT month, revenue, LAG(revenue) OVER (ORDER BY month) AS prev_month, revenue - LAG(revenue) OVER (ORDER BY month) AS change, ROUND( 100.0 * (revenue - LAG(revenue) OVER (ORDER BY month)) / NULLIF(LAG(revenue) OVER (ORDER BY month), 0), 2) AS pct_change FROM monthly_revenue;
The NULLIF guard on the denominator is not optional. Division by zero crashes the query and the interviewer is watching for it. First-period rows return NULL from LAG, which propagates cleanly through the arithmetic.
5. Gaps and Islands: Consecutive Activity Detection
Find date ranges where a user was active on consecutive days.
The key insight: subtracting a row number from a date produces the same constant for every consecutive run. Group by that constant to identify each island.
WITH numbered AS ( SELECT user_id, activity_date, activity_date - (ROW_NUMBER() OVER ( PARTITION BY user_id ORDER BY activity_date ) * INTERVAL '1 day') AS grp FROM user_activity ) SELECT user_id, MIN(activity_date) AS streak_start, MAX(activity_date) AS streak_end, COUNT(*) AS streak_length FROM numbered GROUP BY user_id, grp;
Gaps-and-islands is the most DE-specific SQL problem on this list. Pure SWE loops almost never ask it. Expect it at Databricks, Snowflake, or any company running analytics on event streams.
6. Median and Percentile per Group
Standard aggregate functions skip the median. Use PERCENTILE_CONT.
SELECT department, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary, PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY salary) AS p90_salary FROM employees GROUP BY department;
On MySQL pre-8.0, which lacks PERCENTILE_CONT, you simulate it with ROW_NUMBER and COUNT. The simulation uses ROW_NUMBER() / COUNT(*) OVER (PARTITION BY ...) to identify the middle row index. Know both paths.
7. Consecutive Days Active: Self-Join Approach
Find all users active on at least 3 consecutive days.
SELECT DISTINCT a.user_id FROM user_activity a JOIN user_activity b ON a.user_id = b.user_id AND b.activity_date = a.activity_date + INTERVAL '1 day' JOIN user_activity c ON a.user_id = c.user_id AND c.activity_date = a.activity_date + INTERVAL '2 days';
The self-join is simple for small N. For anything beyond N=3, or to find the longest streak, gaps-and-islands from problem 5 scales better. Know both approaches and be ready to say when you would switch.
8. Slowly Changing Dimensions: Find the Active Record
SCD Type 2 questions come up at companies with a data warehouse background. The table tracks history with an is_current flag or a NULL end_date.
-- Read the current record SELECT * FROM dim_customer WHERE is_current = TRUE; -- Expire the old row and insert the new one (idempotent upsert) UPDATE dim_customer SET end_date = CURRENT_DATE, is_current = FALSE WHERE customer_id = 42 AND is_current = TRUE; INSERT INTO dim_customer (customer_id, name, email, start_date, end_date, is_current) VALUES (42, 'New Name', '[email protected]', CURRENT_DATE, NULL, TRUE);
Know what SCD Type 2 is and why you would choose it over Type 1. Type 1 overwrites. Type 2 preserves history. The interviewer cares about the tradeoff, not just the SQL.
Python Tests ETL Instincts (Problems 9-15)
The Python section is not about algorithms. Nobody is asking you to invert a binary tree on a data team. They want to know whether you reach for the right primitive, whether you know that loading a 10 GB file with readlines() is a bad idea, and whether you handle malformed input like a professional instead of hoping it does not exist.
The DE skills landscape. Knowing which of these to reach for in an interview is the whole game.
9. Stream Large Files Without Loading Into Memory
Process a 10 GB log file and compute a count per event type.
from collections import Counter def count_events(filepath: str) -> Counter: counts: Counter = Counter() with open(filepath) as f: for line in f: event_type = line.split("\t")[0].strip() counts[event_type] += 1 return counts
Iterating a file object is already lazy. The mistake to avoid is wrapping the file in list() or calling readlines() before processing. For JSON-lines files, parse each line inside the loop rather than passing the whole file to json.load.
10. Group-By Aggregation Without Pandas
Sum revenue per region from a list of transaction dicts.
from collections import defaultdict def revenue_by_region(transactions: list[dict]) -> dict: totals: defaultdict = defaultdict(float) for t in transactions: totals[t["region"]] += t["amount"] return dict(totals)
Interviewers want to see you reach for defaultdict or Counter before importing Pandas. It signals that you understand the underlying structure, not just the library API. Pandas is a fine answer in a second pass.
11. Flatten Nested JSON Records
Convert records with nested fields into flat rows for loading into a warehouse.
def flatten(record: dict, prefix: str = "") -> dict: flat: dict = {} for key, value in record.items(): full_key = f"{prefix}_{key}" if prefix else key if isinstance(value, dict): flat.update(flatten(value, full_key)) else: flat[full_key] = value return flat # Input: {"user": {"id": 1, "name": "Alice"}, "amount": 99.0} # Output: {"user_id": 1, "user_name": "Alice", "amount": 99.0}
The recursion is simple. Interviewers extend it: handle lists of nested objects, cap depth at N levels, handle None values. Know how to add a depth guard.
12. Parse a Messy CSV with Error Handling
import csv from typing import Iterator def parse_transactions(filepath: str) -> Iterator[dict]: with open(filepath, newline="", encoding="utf-8") as f: reader = csv.DictReader(f) for row_num, row in enumerate(reader, start=2): try: yield { "id": int(row["id"]), "amount": float(row["amount"]), "region": row["region"].strip().lower(), } except (ValueError, KeyError) as e: print(f"Row {row_num} skipped: {e}")
yield makes this a generator, keeping memory constant regardless of file size. The row_num starts at 2 to account for the header row in error messages. Catching both ValueError (bad type cast) and KeyError (missing column) is a detail the interviewer notices.
13. Sliding Window Over Timestamped Events
Count events in the last 60 seconds at each point in a sorted event stream. This is the same sliding window pattern you have seen in array problems, applied to a real streaming use case.
from collections import deque def rolling_count( events: list[tuple[int, str]], window_sec: int = 60 ) -> list[int]: queue: deque = deque() result: list[int] = [] for timestamp, _ in events: queue.append(timestamp) while queue and timestamp - queue[0] > window_sec: queue.popleft() result.append(len(queue)) return result
For sorted input, this runs in O(n) total. Each event is enqueued and dequeued at most once. The wrong approach is a nested loop at O(n times window_size).
14. Reconcile Two Datasets for Discrepancies
Given records from a source system and a destination, find what is missing, extra, or mismatched.
def reconcile( source: list[dict], dest: list[dict], key: str ) -> dict: source_map = {r[key]: r for r in source} dest_map = {r[key]: r for r in dest} missing = [source_map[k] for k in source_map if k not in dest_map] extra = [dest_map[k] for k in dest_map if k not in source_map] mismatch = [ {"key": k, "source": source_map[k], "dest": dest_map[k]} for k in source_map if k in dest_map and source_map[k] != dest_map[k] ] return {"missing": missing, "extra": extra, "mismatch": mismatch}
Build an index on both sides first. The hash map lookup drops this from O(n times m) to O(n + m), which matters when both sides are millions of rows.
15. Retry with Exponential Backoff
Production pipelines call external APIs that occasionally fail. This tests whether you can write reliable pipeline code.
import time import random import requests def fetch_with_retry(url: str, max_retries: int = 5) -> dict: for attempt in range(max_retries): try: response = requests.get(url, timeout=10) response.raise_for_status() return response.json() except requests.RequestException as e: if attempt == max_retries - 1: raise wait = (2 ** attempt) + random.uniform(0, 1) print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait:.1f}s") time.sleep(wait)
The random.uniform jitter is not decoration. Without it, all retrying workers hit the same second together, which can bring down the external service. Interviewers often test whether you know this. If you skip the jitter, you will hear about it.
How Data Engineering Interview Problems Are Scored
Correctness is necessary but not sufficient. The rubric in most DE loops includes:
- Edge case handling: nulls, empty input, duplicate keys, encoding errors
- Memory awareness: do you read everything into RAM, or stream it?
- SQL idiom choices: window functions vs correlated subqueries vs application-side logic
- Error handling: do you silently discard bad rows, or surface them with context?
The problems above map directly onto those four dimensions. Working through them gives you a feel for where the interviewer is looking, not just what the correct answer is.
Thirty to Forty Problems Is Enough
| Domain | Count | Focus areas |
|---|---|---|
| SQL window functions | 10-12 | ROW_NUMBER, LAG/LEAD, frames |
| SQL aggregation and CTEs | 8-10 | GROUP BY extensions, subqueries |
| Python data manipulation | 8-10 | generators, dicts, error handling |
| SQL schema and upserts | 4-6 | SCD Type 2, idempotent writes |
LeetCode's SQL 50 study plan covers the SQL side well. For a broader picture of what DSA knowledge DE roles actually require, see DSA for Backend Engineers and the Data Engineer Interview Prep guide.
One thing practice problems cannot train is explaining your choices under pressure. If the loop includes a live coding component (most do), you will be scored on whether you can say "I am streaming this file because it could be multi-GB and loading it into memory would cause OOM" while typing. Running voice-based mock interviews on SpaceComplexity is a direct way to build that habit before the real thing.
Further Reading
- LeetCode SQL 50 Study Plan, the most focused SQL practice set for analyst and DE roles
- SQL Window Functions on Wikipedia, formal definition and history
- GeeksforGeeks: Data Engineering Interview Questions, broad coverage of system design and pipeline questions
- Real Python: Data Engineer Interview Questions, Python-specific walkthrough with worked examples
- DataLemur: SQL Window Functions Interview Questions, curated practice problems with explanations