Top 15 Data Engineering Interview Problems: SQL and Python

The fastest way to fail a data engineering interview is to prep like it is a software engineering loop. Go ahead, spend four weeks grinding LeetCode graphs and dynamic programming. Then walk into your DE loop and get handed a SQL deduplication question. The hashmap was not going to save you.

The core of most DE coding rounds is SQL fluency plus practical Python, not algorithmic puzzles. Expect one to three medium-difficulty problems per round, grounded in realistic data scenarios: deduplication, aggregation, streaming, and transformation.

These 15 problems cover roughly 80% of what actually shows up. Work through all of them.

SQL Is the Core Skill (Problems 1-8)

DE interviews weight SQL more heavily than any other skill. SQL window functions alone appear in at least half of all rounds. The SWE instinct is to reach for a hashmap. The DE instinct is to reach for PARTITION BY. These eight problems cover the families that appear most often.

The interview question that's haunted every engineer who skipped SQL and went straight to Pandas.

1. Deduplication: Keep the Latest Row Per Key

The setup: a table has duplicate rows for the same entity, and you need to keep only the most recent one.

WITH ranked AS (
  SELECT *,
    ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY updated_at DESC) AS rn
  FROM customers
)
SELECT * FROM ranked WHERE rn = 1;

ROW_NUMBER() with PARTITION BY is the canonical dedup pattern. DISTINCT and GROUP BY min-max tricks both fall apart on partial duplicates. The follow-up: how do you delete the duplicates instead of selecting around them? Wrap the CTE in a DELETE WHERE rn > 1.

2. Running Total and Cumulative Sum

Compute a running total of revenue by day.

SELECT
  order_date,
  daily_revenue,
  SUM(daily_revenue) OVER (ORDER BY order_date) AS running_total
FROM daily_sales;

SUM() OVER (ORDER BY ...) defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Interviewers often ask about the difference between RANGE and ROWS frame modes. ROWS is precise; RANGE includes all rows with the same ORDER BY value, which creates surprising behavior on ties.

3. Top-N Per Group

Return the top 3 products by revenue for each category.

WITH ranked AS (
  SELECT
    category,
    product,
    revenue,
    DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS dr
  FROM products
)
SELECT * FROM ranked WHERE dr <= 3;

Know when to use ROW_NUMBER vs RANK vs DENSE_RANK. For top-N with ties included, DENSE_RANK is correct. For top-N breaking ties arbitrarily, use ROW_NUMBER. Getting this wrong is common because all three look identical in simple examples. They only diverge when you have duplicates, which is exactly when correctness matters.

4. Period-over-Period Comparison with LAG and LEAD

Compute month-over-month revenue growth.

SELECT
  month,
  revenue,
  LAG(revenue) OVER (ORDER BY month) AS prev_month,
  revenue - LAG(revenue) OVER (ORDER BY month) AS change,
  ROUND(
    100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
    / NULLIF(LAG(revenue) OVER (ORDER BY month), 0),
  2) AS pct_change
FROM monthly_revenue;

The NULLIF guard on the denominator is not optional. Division by zero crashes the query and the interviewer is watching for it. First-period rows return NULL from LAG, which propagates cleanly through the arithmetic.

5. Gaps and Islands: Consecutive Activity Detection

Find date ranges where a user was active on consecutive days.

The key insight: subtracting a row number from a date produces the same constant for every consecutive run. Group by that constant to identify each island.

WITH numbered AS (
  SELECT
    user_id,
    activity_date,
    activity_date - (ROW_NUMBER() OVER (
      PARTITION BY user_id ORDER BY activity_date
    ) * INTERVAL '1 day') AS grp
  FROM user_activity
)
SELECT
  user_id,
  MIN(activity_date) AS streak_start,
  MAX(activity_date) AS streak_end,
  COUNT(*)           AS streak_length
FROM numbered
GROUP BY user_id, grp;

Gaps-and-islands is the most DE-specific SQL problem on this list. Pure SWE loops almost never ask it. Expect it at Databricks, Snowflake, or any company running analytics on event streams.

6. Median and Percentile per Group

Standard aggregate functions skip the median. Use PERCENTILE_CONT.

SELECT
  department,
  PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary,
  PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY salary) AS p90_salary
FROM employees
GROUP BY department;

On MySQL pre-8.0, which lacks PERCENTILE_CONT, you simulate it with ROW_NUMBER and COUNT. The simulation uses ROW_NUMBER() / COUNT(*) OVER (PARTITION BY ...) to identify the middle row index. Know both paths.

7. Consecutive Days Active: Self-Join Approach

Find all users active on at least 3 consecutive days.

SELECT DISTINCT a.user_id
FROM user_activity a
JOIN user_activity b
  ON a.user_id = b.user_id
  AND b.activity_date = a.activity_date + INTERVAL '1 day'
JOIN user_activity c
  ON a.user_id = c.user_id
  AND c.activity_date = a.activity_date + INTERVAL '2 days';

The self-join is simple for small N. For anything beyond N=3, or to find the longest streak, gaps-and-islands from problem 5 scales better. Know both approaches and be ready to say when you would switch.

8. Slowly Changing Dimensions: Find the Active Record

SCD Type 2 questions come up at companies with a data warehouse background. The table tracks history with an is_current flag or a NULL end_date.

-- Read the current record
SELECT * FROM dim_customer WHERE is_current = TRUE;

-- Expire the old row and insert the new one (idempotent upsert)
UPDATE dim_customer
  SET end_date = CURRENT_DATE, is_current = FALSE
  WHERE customer_id = 42 AND is_current = TRUE;

INSERT INTO dim_customer
  (customer_id, name, email, start_date, end_date, is_current)
  VALUES (42, 'New Name', '[email protected]', CURRENT_DATE, NULL, TRUE);

Know what SCD Type 2 is and why you would choose it over Type 1. Type 1 overwrites. Type 2 preserves history. The interviewer cares about the tradeoff, not just the SQL.

Python Tests ETL Instincts (Problems 9-15)

The Python section is not about algorithms. Nobody is asking you to invert a binary tree on a data team. They want to know whether you reach for the right primitive, whether you know that loading a 10 GB file with readlines() is a bad idea, and whether you handle malformed input like a professional instead of hoping it does not exist.

LinkedIn profile slide: list of data tools mixed with Pokémon names. Caption: "I typically ask recruiters to point out which of these are Pokémon." The DE skills landscape. Knowing which of these to reach for in an interview is the whole game.

9. Stream Large Files Without Loading Into Memory

Process a 10 GB log file and compute a count per event type.

from collections import Counter

def count_events(filepath: str) -> Counter:
    counts: Counter = Counter()
    with open(filepath) as f:
        for line in f:
            event_type = line.split("\t")[0].strip()
            counts[event_type] += 1
    return counts

Iterating a file object is already lazy. The mistake to avoid is wrapping the file in list() or calling readlines() before processing. For JSON-lines files, parse each line inside the loop rather than passing the whole file to json.load.

10. Group-By Aggregation Without Pandas

Sum revenue per region from a list of transaction dicts.

from collections import defaultdict

def revenue_by_region(transactions: list[dict]) -> dict:
    totals: defaultdict = defaultdict(float)
    for t in transactions:
        totals[t["region"]] += t["amount"]
    return dict(totals)

Interviewers want to see you reach for defaultdict or Counter before importing Pandas. It signals that you understand the underlying structure, not just the library API. Pandas is a fine answer in a second pass.

11. Flatten Nested JSON Records

Convert records with nested fields into flat rows for loading into a warehouse.

def flatten(record: dict, prefix: str = "") -> dict:
    flat: dict = {}
    for key, value in record.items():
        full_key = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key))
        else:
            flat[full_key] = value
    return flat

# Input:  {"user": {"id": 1, "name": "Alice"}, "amount": 99.0}
# Output: {"user_id": 1, "user_name": "Alice", "amount": 99.0}

The recursion is simple. Interviewers extend it: handle lists of nested objects, cap depth at N levels, handle None values. Know how to add a depth guard.

12. Parse a Messy CSV with Error Handling

import csv
from typing import Iterator

def parse_transactions(filepath: str) -> Iterator[dict]:
    with open(filepath, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        for row_num, row in enumerate(reader, start=2):
            try:
                yield {
                    "id":     int(row["id"]),
                    "amount": float(row["amount"]),
                    "region": row["region"].strip().lower(),
                }
            except (ValueError, KeyError) as e:
                print(f"Row {row_num} skipped: {e}")

yield makes this a generator, keeping memory constant regardless of file size. The row_num starts at 2 to account for the header row in error messages. Catching both ValueError (bad type cast) and KeyError (missing column) is a detail the interviewer notices.

13. Sliding Window Over Timestamped Events

Count events in the last 60 seconds at each point in a sorted event stream. This is the same sliding window pattern you have seen in array problems, applied to a real streaming use case.

from collections import deque

def rolling_count(
    events: list[tuple[int, str]], window_sec: int = 60
) -> list[int]:
    queue: deque = deque()
    result: list[int] = []
    for timestamp, _ in events:
        queue.append(timestamp)
        while queue and timestamp - queue[0] > window_sec:
            queue.popleft()
        result.append(len(queue))
    return result

For sorted input, this runs in O(n) total. Each event is enqueued and dequeued at most once. The wrong approach is a nested loop at O(n times window_size).

14. Reconcile Two Datasets for Discrepancies

Given records from a source system and a destination, find what is missing, extra, or mismatched.

def reconcile(
    source: list[dict], dest: list[dict], key: str
) -> dict:
    source_map = {r[key]: r for r in source}
    dest_map   = {r[key]: r for r in dest}

    missing  = [source_map[k] for k in source_map if k not in dest_map]
    extra    = [dest_map[k]   for k in dest_map   if k not in source_map]
    mismatch = [
        {"key": k, "source": source_map[k], "dest": dest_map[k]}
        for k in source_map
        if k in dest_map and source_map[k] != dest_map[k]
    ]
    return {"missing": missing, "extra": extra, "mismatch": mismatch}

Build an index on both sides first. The hash map lookup drops this from O(n times m) to O(n + m), which matters when both sides are millions of rows.

15. Retry with Exponential Backoff

Production pipelines call external APIs that occasionally fail. This tests whether you can write reliable pipeline code.

import time
import random
import requests

def fetch_with_retry(url: str, max_retries: int = 5) -> dict:
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as e:
            if attempt == max_retries - 1:
                raise
            wait = (2 ** attempt) + random.uniform(0, 1)
            print(f"Attempt {attempt + 1} failed: {e}. Retrying in {wait:.1f}s")
            time.sleep(wait)

The random.uniform jitter is not decoration. Without it, all retrying workers hit the same second together, which can bring down the external service. Interviewers often test whether you know this. If you skip the jitter, you will hear about it.

How Data Engineering Interview Problems Are Scored

Correctness is necessary but not sufficient. The rubric in most DE loops includes:

Edge case handling: nulls, empty input, duplicate keys, encoding errors
Memory awareness: do you read everything into RAM, or stream it?
SQL idiom choices: window functions vs correlated subqueries vs application-side logic
Error handling: do you silently discard bad rows, or surface them with context?

The problems above map directly onto those four dimensions. Working through them gives you a feel for where the interviewer is looking, not just what the correct answer is.

Thirty to Forty Problems Is Enough

Domain	Count	Focus areas
SQL window functions	10-12	ROW_NUMBER, LAG/LEAD, frames
SQL aggregation and CTEs	8-10	GROUP BY extensions, subqueries
Python data manipulation	8-10	generators, dicts, error handling
SQL schema and upserts	4-6	SCD Type 2, idempotent writes

LeetCode's SQL 50 study plan covers the SQL side well. For a broader picture of what DSA knowledge DE roles actually require, see DSA for Backend Engineers and the Data Engineer Interview Prep guide.

One thing practice problems cannot train is explaining your choices under pressure. If the loop includes a live coding component (most do), you will be scored on whether you can say "I am streaming this file because it could be multi-GB and loading it into memory would cause OOM" while typing. Running voice-based mock interviews on SpaceComplexity is a direct way to build that habit before the real thing.