DSA for Data Scientists: What Your Coding Interview Actually Tests

You have a data scientist interview on the calendar. You know pandas. You can write a SQL window function without checking the docs. You have opinions about Jupyter notebooks. Things are going well.

Then someone sends you a LeetCode link.

The patterns that actually show up in data scientist coding rounds are specific, learnable in a few weeks, and map directly to things you already do with data. This guide breaks down exactly what those patterns are, what you can safely skip, and how to build a prep plan that matches your actual role and timeline.

You're Not Interviewing for SWE. But Which DS Track Are You?

Get this wrong and you'll either over-prepare for months or walk into a harder loop than expected. Neither is fun.

Product and analytics data scientists (Meta Core Data Science, Google Analytics DS, most business-facing DS roles) sit closer to the business. Their coding rounds lean on SQL, Python data manipulation, A/B testing design, and statistical reasoning. One or two DSA problems usually show up in a phone screen, but they're LeetCode easy-to-medium: frequency counting, grouping, filtering, basic sorting. Nothing exotic.

Machine learning engineers and applied scientists (Amazon Applied Science, Google Research, any role with "ML Engineering" in the title) are much closer to software engineering. Expect a near-full SWE coding loop. Two medium problems in 35 to 70 minutes. Hash maps, trees, graphs, complexity analysis. The bar is close to SDE-1, not DS.

Full-stack data scientists at mid-size companies and startups fall in between. One to two coding screens, medium difficulty, often with a data-flavored context layered on top of the algorithm.

Read the job description before planning your prep. "Strong engineering skills" means MLE-adjacent. "Experience with SQL and Python" means analytics-adjacent. The difference in prep time can be a month or more.

What the Major Companies Actually Do

Formats shift by team and over time. These are the recognizable patterns as of mid-2026.

Company	Coding Difficulty	Format	Other Rounds
Google DS	Easy to Medium	2 problems, 70 min, no code execution	Stats, ML design, product sense
Meta DS	Medium to Hard	2 problems, 35 min, live interviewer	Product sense, SQL, experimentation
Amazon DS	Medium	1-2 coding + ML + behavioral	LP questions woven throughout
Microsoft DS	Easy to Medium	1-2 coding screens	SQL, case studies
Startups	Easy to Medium	Take-home or 1 live screen	SQL-heavy, stats-heavy

Google's coding round uses a shared Google Doc with syntax highlighting. No compiler. You type, you explain, you mentally trace. That changes what "debugging" means: you have to narrate your way through it.

Meta's 35-minute window for two problems is genuinely fast. Solving both optimally requires pattern fluency, not just familiarity. Amazon's loop is the broadest: ML implementation, statistics simulations, SQL queries, and LeetCode-style coding can all appear in the same loop. Breadth matters more than depth.

The Four Patterns That Actually Show Up

You don't need to master 15 algorithm families. Most DS coding questions cluster around four patterns, and you probably know more of the underlying logic than you think.

Arrays and Hash Maps Are Your Most-Used Tools

The majority of data scientist coding questions reduce to: count something, find duplicates, group things, or look something up fast.

Frequency counting is the canonical pattern. Given a list of user events, return the most common action. Given product purchases, find all items bought more than N times. In Python you'd reach for collections.Counter. The interview removes the library and asks you to show the underlying logic.

A hash map makes this O(n) instead of the O(n²) you get from rescanning the list for every element. That distinction is the whole test.
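Here is a minimal sketch of the frequency-counting logic without collections.Counter. The function names most_common_action and items_bought_more_than are hypothetical, chosen only to mirror the two examples above, and ties are broken by whichever action was seen first.

```python
def most_common_action(events):
    """Return the most frequent action in `events`, or None if it is empty.

    Hypothetical helper for illustration; ties go to the action seen first.
    """
    counts = {}
    for action in events:
        # One dictionary update per event, so a single O(n) pass.
        counts[action] = counts.get(action, 0) + 1

    best_action, best_count = None, 0
    for action, count in counts.items():
        if count > best_count:
            best_action, best_count = action, count
    return best_action


def items_bought_more_than(purchases, n):
    """Return the items that appear more than `n` times in `purchases`."""
    counts = {}
    for item in purchases:
        counts[item] = counts.get(item, 0) + 1
    return [item for item, count in counts.items() if count > n]
```

Both functions make one pass to build the counts and one pass over the distinct keys, so the total work stays linear in the input size.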

Duplicate detection, two-sum variants, anagram grouping, first non-repeating element: all hash map problems. If you've written a pandas groupby, you're already thinking in hash maps. The interview just asks you to implement what the library does.

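from collections import defaultdict

def group_by_category(records):
    # records: an iterable of (item, category) pairs
    groups = defaultdict(list)  # a missing category starts as an empty list
    for item, category in records:
        groups[category].append(item)
    return dict(groups)  # convert back to a plain dict for the caller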

That's conceptually what df.groupby("category")["item"].apply(list) does: bucket rows by key, then collect each bucket. The interview removes the abstraction layer. That's it.
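The two-sum variant mentioned earlier uses the same hash map idea for lookup instead of grouping. The sketch below is one common way to write it, not the only one; the name two_sum and the choice to return a pair of indices (or None) are assumptions for illustration.

```python
def two_sum(values, target):
    """Return indices (i, j) with i < j and values[i] + values[j] == target.

    Returns None when no such pair exists. Hypothetical helper for illustration.
    """
    seen = {}  # value -> most recent index where that value appeared
    for i, value in enumerate(values):
        complement = target - value
        if complement in seen:
            # The complement was seen earlier, so its index is smaller than i.
            return seen[complement], i
        seen[value] = i
    return None
```

Each element triggers one dictionary lookup and one insert, so the pass is O(n) on average instead of the O(n²) of checking every pair.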

Sorting: Not Just `list.sort()`

Sorting comes up in DS interviews because ordering data is everywhere: ranking recommendations, binning features, merge operations. The interview tests whether you can sort on custom keys and whether you understand why O(n log n) is the floor for a general comparison sort.

Common patterns: sort by multiple keys, custom comparator, sort then binary search as a combined approach, and median of a stream.

# Sort users by tier descending, then signup date ascending
users.sort(key=lambda u: (-u.tier, u.signup_date))

Python uses Timsort, which is O(n log n) worst case and O(n) for nearly-sorted data. Worth knowing because interviewers ask about it and because real data is often nearly-sorted.
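The "sort, then binary search" combination can be sketched with the standard library's bisect module. The function name count_in_range and the sample scores are made up for illustration; the idea is to pay O(n log n) once for the sort and then answer each range query in O(log n).

```python
from bisect import bisect_left, bisect_right


def count_in_range(sorted_values, low, high):
    """Count values v with low <= v <= high in an already-sorted list.

    Hypothetical helper; it assumes the caller sorted the list first.
    """
    # bisect_left finds the first position where `low` could be inserted,
    # bisect_right finds the position just past the last `high`.
    return bisect_right(sorted_values, high) - bisect_left(sorted_values, low)


# Illustrative data: sort once, then query as many times as needed.
scores = [91, 15, 47, 47, 78, 33]
scores.sort()
in_band = count_in_range(scores, 40, 80)
```

If you only need a single query, a linear scan is simpler and just as fast. The sort pays off when the same data is queried repeatedly.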

Matrix Traversal: The NumPy Interview Round

If you're targeting an MLE or applied scientist role, 2D array problems show up more than you'd expect. Spiral traversal, rotating a matrix in-place, island counting in a grid, image transformations.

These map directly to how you think about tensors, embeddings, and feature matrices in production. Rotating a 2D array is what np.rot90 does, although it rotates counterclockwise by default and returns a rotated view rather than modifying the array in place. The interview asks for the in-place loop logic. The insight for a clockwise quarter turn of a square matrix: transpose, then reverse each row. That same spatial reasoning matters when you're reshaping a (batch, features) tensor or flattening a convolutional layer's output.

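def rotate_90_clockwise(matrix):
    # Rotates a square n x n matrix in place and returns None, like list.sort().
    n = len(matrix)
    # Step 1: transpose by swapping each element above the diagonal with its mirror.
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
    # Step 2: reverse each row, which turns the transpose into a clockwise rotation.
    for row in matrix:
        row.reverse()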

For analytics DS roles, matrix problems are less common. But if the role has any ML engineering flavor, practice a few grid problems.
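Island counting, mentioned above, is a representative grid problem. The following is a minimal breadth-first sketch, assuming the grid is a list of equal-length rows containing 0s and 1s and that cells connect only up, down, left, and right. The name count_islands is hypothetical.

```python
from collections import deque


def count_islands(grid):
    """Count groups of 1s connected horizontally or vertically.

    Assumes `grid` is a list of equal-length lists of 0s and 1s.
    """
    if not grid or not grid[0]:
        return 0

    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]
    islands = 0

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or visited[r][c]:
                continue
            # An unvisited land cell starts a new island; flood-fill it with BFS.
            islands += 1
            visited[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    in_bounds = 0 <= nr < rows and 0 <= nc < cols
                    if in_bounds and grid[nr][nc] == 1 and not visited[nr][nc]:
                        visited[nr][nc] = True
                        queue.append((nr, nc))
    return islands
```

Every cell is enqueued at most once, so the run time is O(rows × cols), and the visited grid costs the same in extra space.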

Complexity Reasoning: This Is Where You Differentiate

Interviewers testing a data scientist specifically want to know that you think about scale: what happens when the dataset is 10x larger?

This isn't abstract. You've felt it.

Picture a Jupyter notebook that takes 20 minutes on an 11K-row sample when the actual data has 400K rows. Everything looks fine on the sample, and then you see the full dataset size. If the slow step is quadratic, roughly 36 times the rows means on the order of 1,300 times the runtime.

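A naive nested-loop join on two dataframes with millions of rows compares every row with every other row, which is O(n × m), and it will OOM or time out. A hash-based join is O(n + m) on average, plus the size of the output. The interview just asks you to articulate what you already do intuitively.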
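A hash-based join can be sketched in a few lines. This is an illustration of the idea under simplifying assumptions, not how any particular dataframe library implements it: rows are plain (key, value) tuples, the join is an inner join, and the name hash_join is hypothetical.

```python
from collections import defaultdict


def hash_join(left_rows, right_rows):
    """Inner-join two lists of (key, value) tuples on key.

    Hypothetical helper: builds an index on the right side, then probes it.
    """
    # Build phase: one pass over the right side.
    index = defaultdict(list)
    for key, right_value in right_rows:
        index[key].append(right_value)

    # Probe phase: one pass over the left side, constant-time lookup per row.
    joined = []
    for key, left_value in left_rows:
        for right_value in index.get(key, []):
            joined.append((key, left_value, right_value))
    return joined
```

In practice you would usually build the index on the smaller side, since that is the part that has to fit in memory.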

Two moves that consistently land well. State your complexity before you code: "I'll use a hash map for O(1) lookup, so the overall pass is O(n)." Then check with the interviewer whether they want optimization after the brute force works. DS interviewers often care as much about the reasoning as the optimal code, because they're evaluating whether you'll write scalable data pipelines, not just clever one-liners.

What You Can Skip

Here's the part nobody tells you.

Technical interview: Godzilla fighting Kong. Actual job: tiny cute stuffed dinosaurs. The interview looks terrifying. The job is mostly groupbys and window functions.

Unless you're targeting an MLE or applied scientist role, these rarely appear in DS coding rounds:

Complex dynamic programming (coin change, edit distance, longest common subsequence)
Advanced graph algorithms (Dijkstra, Bellman-Ford, topological sort)
Tries, segment trees, Fenwick trees, union-find
Bit manipulation
Recursion-heavy divide and conquer

A software engineer interviewing at Google needs all of these. A data scientist at Google Analytics probably doesn't. For machine learning engineer roles, the calculus flips. Treat the coding prep like an SWE loop and stack the stats and ML design on top.

The biggest mistake DS candidates make is preparing for a different job. Know which track you're on before week one.

These Patterns Are Already in Your Data Work

This is the part that should genuinely make you feel better.

Feature engineering is frequency counting. Counting event types per user, computing frequency ratios, finding co-occurring items across sessions: all hash map problems in code form. When you're building features from raw event logs, you're writing frequency maps.

Merging datasets is two pointers. When you join two sorted streams without loading both into memory, that's the two-pointer technique. Streaming ETL pipelines use this exact structure.
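A minimal sketch of that two-pointer merge, written as a generator so neither input has to be fully loaded. It assumes both inputs are already sorted in ascending order; the name merge_sorted is hypothetical, and the standard library's heapq.merge offers a similar lazy merge.

```python
def merge_sorted(left, right):
    """Lazily merge two ascending iterables into one ascending stream.

    Hypothetical helper; assumes both inputs are already sorted.
    """
    left_iter, right_iter = iter(left), iter(right)
    done = object()  # unique marker meaning "this stream is exhausted"

    a = next(left_iter, done)
    b = next(right_iter, done)

    # Advance whichever pointer currently holds the smaller value.
    while a is not done and b is not done:
        if a <= b:
            yield a
            a = next(left_iter, done)
        else:
            yield b
            b = next(right_iter, done)

    # At most one of the two streams still has items; drain it.
    while a is not done:
        yield a
        a = next(left_iter, done)
    while b is not done:
        yield b
        b = next(right_iter, done)
```

Only one pending item per stream is held at a time, which is why this structure works for inputs larger than memory.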

Recommendation ranking is custom-key sorting. Sort by predicted score, then by recency, then break ties by item ID: that's multi-key sort, exactly what the interview tests.

Rolling accuracy and drift detection are sliding windows. Computing a rolling average over the last N predictions, or detecting distribution shift in a streaming inference pipeline, uses the sliding window pattern directly.
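A rolling average over the last N values can be sketched with a deque and a running sum. The name rolling_mean is hypothetical, and the choice to average over fewer than N values until the window fills is an assumption; some libraries emit missing values there instead.

```python
from collections import deque


def rolling_mean(values, window):
    """Return the mean of the most recent `window` values at each step.

    Hypothetical helper; early steps average over however many values exist.
    """
    if window <= 0:
        raise ValueError("window must be positive")

    recent = deque()    # holds at most `window` values
    running_sum = 0.0
    means = []
    for value in values:
        recent.append(value)
        running_sum += value
        if len(recent) > window:
            # Slide the window: drop the oldest value from both deque and sum.
            running_sum -= recent.popleft()
        means.append(running_sum / len(recent))
    return means
```

For example, feeding in 1 for a correct prediction and 0 for an incorrect one gives a rolling accuracy. Each step does constant work, so the whole pass is O(n) regardless of the window size.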

The interview removes the library and asks whether you understand what the library is doing. You already do. You just haven't been asked to narrate it in 35 minutes with someone watching.

A Focused Prep Plan

Product and Analytics DS Roles (4 to 5 weeks)

Week 1-2: Arrays, strings, hash maps. 20 to 25 LeetCode easy-medium problems. Focus on frequency counting, duplicate detection, grouping, and two-sum variants.

Week 3: Sorting with custom keys, two pointers, sliding window. 15 to 20 problems.

Week 4: For every solution you write, state the time and space complexity before you code and after. Make it automatic. If you can't explain it out loud, practice until you can.

Week 5: SQL, A/B testing design, and product sense prep. For analytics DS roles, these rounds are at least as weighted as the coding round.

MLE and Applied Scientist Roles (8 to 10 weeks)

Start with weeks 1 through 4 of the plan above, then add:

Week 5-6: Trees, BFS/DFS, basic graphs. 20 to 25 problems. The most common topics across DS and SWE loops overlap here.

Week 7-8: Dynamic programming, classic patterns only. 1D DP, 2D DP, knapsack family. 15 problems.

Week 9-10: Full mock interview loops, timed and spoken aloud. Communication is commonly scored as its own dimension in coding interviews, and DS interviewers especially care about how you explain your reasoning around scale and complexity.

In both tracks, prioritize mediums over hards. The DS coding interview rarely features a hard problem. The medium is where the signal lives: can you identify the right pattern, implement it cleanly, and explain the tradeoffs out loud?

For spoken practice specifically, SpaceComplexity runs voice-based mock interviews with rubric feedback across all four dimensions (algorithms, coding, communication, problem-solving) that interviewers actually score. Running 10 sessions before your loop will expose gaps that grinding LeetCode silently won't.