DSA for Data Scientists: What Your Coding Interview Actually Tests

- Data scientist coding interviews split into two bars: analytics DS (easy-medium, four patterns) vs MLE/applied scientist (near full SWE loop)
- Hash maps are your workhorse: frequency counting, duplicate detection, grouping, and two-sum variants all reduce to the same O(n) structure
- Sorting, matrix traversal, and complexity reasoning round out the four patterns worth drilling before any DS coding round
- Skip advanced DP, graph algorithms, and tries unless you're targeting an MLE or applied scientist role
- The patterns map directly to your data work: groupby is a hash map, rolling averages are sliding windows, merge joins are two pointers
- Stating complexity before you code is the move that consistently differentiates data scientists in coding interviews
You have a data scientist interview on the calendar. You know pandas. You can write a SQL window function without checking the docs. You have opinions about Jupyter notebooks. Things are going well.
Then someone sends you a LeetCode link.
The patterns that actually show up in data scientist coding rounds are specific, learnable in a few weeks, and map directly to things you already do with data. This guide breaks down exactly what those patterns are, what you can safely skip, and how to build a prep plan that matches your actual role and timeline.
You're Not Interviewing for SWE. But Which DS Track Are You?
Get this wrong and you'll either over-prepare for months or walk into a harder loop than expected. Neither is fun.
Product and analytics data scientists (Meta Core Data Science, Google Analytics DS, most business-facing DS roles) sit closer to the business. Their coding rounds lean on SQL, Python data manipulation, A/B testing design, and statistical reasoning. One or two DSA problems usually show up in a phone screen, but they're LeetCode easy-to-medium: frequency counting, grouping, filtering, basic sorting. Nothing exotic.
Machine learning engineers and applied scientists (Amazon Applied Science, Google Research, any role with "ML Engineering" in the title) are much closer to software engineering. Expect a near-full SWE coding loop. Two medium problems in 35 to 70 minutes. Hash maps, trees, graphs, complexity analysis. The bar is close to SDE-1, not DS.
Full-stack data scientists at mid-size companies and startups fall in between. One to two coding screens, medium difficulty, often with a data-flavored context layered on top of the algorithm.
Read the job description before planning your prep. "Strong engineering skills" means MLE-adjacent. "Experience with SQL and Python" means analytics-adjacent. The difference in prep time is literally months.
What the Major Companies Actually Do
Formats shift by team and over time. These are the recognizable patterns as of mid-2026.
| Company | Coding Difficulty | Format | Other Rounds |
|---|---|---|---|
| Google DS | Easy to Medium | 2 problems, 70 min, no code execution | Stats, ML design, product sense |
| Meta DS | Medium to Hard | 2 problems, 35 min, live interviewer | Product sense, SQL, experimentation |
| Amazon DS | Medium | 1-2 coding + ML + behavioral | LP questions woven throughout |
| Microsoft DS | Easy to Medium | 1-2 coding screens | SQL, case studies |
| Startups | Easy to Medium | Take-home or 1 live screen | SQL-heavy, stats-heavy |
Google's coding round uses a shared Google Doc with syntax highlighting. No compiler. You type, you explain, you mentally trace. That changes what "debugging" means: you have to narrate your way through it.
Meta's 35-minute window for two problems is genuinely fast. Solving both optimally requires pattern fluency, not just familiarity. Amazon's loop is the broadest: ML implementation, statistics simulations, SQL queries, and LeetCode-style coding can all appear in the same loop. Breadth matters more than depth.
The Four Patterns That Actually Show Up
You don't need to master 15 algorithm families. Most DS coding questions cluster around four patterns, and you probably know more of the underlying logic than you think.
Arrays and Hash Maps Are Your Most-Used Tools
The majority of data scientist coding questions reduce to: count something, find duplicates, group things, or look something up fast.
Frequency counting is the canonical pattern. Given a list of user events, return the most common action. Given product purchases, find all items bought more than N times. In Python you'd reach for collections.Counter. The interview removes the library and asks you to show the underlying logic.
A hash map makes this O(n) instead of O(n²). That distinction is the whole test.
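As a rough illustration of the "no library" version, here is a minimal sketch of both examples above using a plain dict. The function names and the list-of-strings input are assumptions made for the example, not a fixed interview format.

```python
def most_common_action(events):
    """Return the most frequent action in a list of event strings (None if empty)."""
    counts = {}  # action -> number of occurrences
    for action in events:
        # dict lookup and update are O(1) on average, so the pass is O(n)
        counts[action] = counts.get(action, 0) + 1

    best_action, best_count = None, 0
    for action, count in counts.items():
        if count > best_count:  # strict ">" keeps the first-seen action on ties
            best_action, best_count = action, count
    return best_action


def items_bought_more_than(purchases, n):
    """Return the set of items that appear more than n times."""
    counts = {}
    for item in purchases:
        counts[item] = counts.get(item, 0) + 1
    return {item for item, count in counts.items() if count > n}
```

The brute-force alternative would call `events.count(action)` for every action, which rescans the list each time and is where the O(n²) comes from.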
Duplicate detection, two-sum variants, anagram grouping, first non-repeating element: all hash map problems. If you've written a pandas groupby, you're already thinking in hash maps. The interview just asks you to implement what the library does.
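```python
from collections import defaultdict


def group_by_category(records):
    # records: iterable of (item, category) pairs
    groups = defaultdict(list)
    for item, category in records:
        # Missing categories start as an empty list automatically
        groups[category].append(item)
    return dict(groups)
```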
That's essentially the same grouping df.groupby("category")["item"].apply(list) gives you (pandas returns a Series rather than a dict). The interview removes the abstraction layer. That's it.
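The two-sum and first-non-repeating variants mentioned above lean on the same lookup structure. The sketch below is one common way to write them. The function names and return conventions (a pair of indices, or None when nothing matches) are assumptions for the example.

```python
def two_sum(nums, target):
    """Return indices (i, j), i < j, with nums[i] + nums[j] == target, else None."""
    seen = {}  # value -> index where it was first seen
    for j, value in enumerate(nums):
        complement = target - value
        if complement in seen:  # O(1) average lookup instead of a second loop
            return seen[complement], j
        seen.setdefault(value, j)
    return None


def first_non_repeating(items):
    """Return the first element that occurs exactly once, else None."""
    counts = {}
    for item in items:
        counts[item] = counts.get(item, 0) + 1
    for item in items:  # second pass preserves the original order
        if counts[item] == 1:
            return item
    return None
```

Both run in O(n) time with O(n) extra space, which is the tradeoff worth saying out loud.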
Sorting: Not Just list.sort()
Sorting comes up in DS interviews because ordering data is everywhere: ranking recommendations, binning features, merge operations. The interview tests whether you can sort on custom keys and whether you understand why O(n log n) is the floor for a general comparison sort.
Common patterns: sort by multiple keys, custom comparator, sort then binary search as a combined approach, and median of a stream.
```python
# Sort users by tier descending, then signup date ascending
users.sort(key=lambda u: (-u.tier, u.signup_date))
```
Python uses Timsort, which is O(n log n) worst case and O(n) for nearly-sorted data. Worth knowing because interviewers ask about it and because real data is often nearly-sorted.
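Of the patterns listed above, median of a stream is the one where re-sorting on every insert gets expensive. A common alternative is to keep two heaps, one for each half of the data. The sketch below assumes numeric inputs, and the class name `RunningMedian` is made up for the example.

```python
import heapq


class RunningMedian:
    """Track the median of a growing stream with two heaps."""

    def __init__(self):
        self._low = []   # max-heap of the smaller half (values stored negated)
        self._high = []  # min-heap of the larger half

    def add(self, value):
        # Push onto low, then move low's largest into high so that
        # every value in low is <= every value in high.
        heapq.heappush(self._low, -value)
        heapq.heappush(self._high, -heapq.heappop(self._low))
        # Rebalance so low holds the same number of items as high, or one more.
        if len(self._high) > len(self._low):
            heapq.heappush(self._low, -heapq.heappop(self._high))

    def median(self):
        if not self._low:
            raise ValueError("no values added yet")
        if len(self._low) > len(self._high):
            return -self._low[0]
        return (-self._low[0] + self._high[0]) / 2
```

Each `add` is O(log n) and each `median` is O(1), compared with O(n log n) per query if you re-sort the whole list every time.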
Matrix Traversal: The NumPy Interview Round
If you're targeting an MLE or applied scientist role, 2D array problems show up more than you'd expect. Spiral traversal, rotating a matrix in-place, island counting in a grid, image transformations.
These map directly to how you think about tensors, embeddings, and feature matrices in production. In NumPy you'd call np.rot90, which returns a rotated view (counterclockwise by default) rather than modifying the array in place. The interview asks for the in-place loop logic. The insight for a clockwise rotation of a square matrix: transpose, then reverse each row. That same spatial reasoning matters when you're reshaping a (batch, features) tensor or flattening a convolutional layer's output.
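```python
def rotate_90_clockwise(matrix):
    # Rotates a square n x n matrix in place.
    n = len(matrix)
    # Transpose: swap each element above the diagonal with its mirror.
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j], matrix[j][i] = matrix[j][i], matrix[i][j]
    # Reverse each row to finish the clockwise rotation.
    for row in matrix:
        row.reverse()
```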
For analytics DS roles, matrix problems are less common. But if the role has any ML engineering flavor, practice a few grid problems.
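For a sense of what a grid problem looks like, here is a minimal sketch of island counting using breadth-first search. It assumes the grid is a rectangular list of lists holding 1 for land and 0 for water, and that cells connect only up, down, left, and right. Problem statements vary on both points.

```python
from collections import deque


def count_islands(grid):
    """Count groups of 1-cells connected horizontally or vertically."""
    if not grid or not grid[0]:
        return 0
    rows, cols = len(grid), len(grid[0])
    visited = [[False] * cols for _ in range(rows)]
    islands = 0

    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != 1 or visited[r][c]:
                continue
            # Unvisited land: a new island. Flood it with BFS.
            islands += 1
            visited[r][c] = True
            queue = deque([(r, c)])
            while queue:
                cr, cc = queue.popleft()
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and grid[nr][nc] == 1 and not visited[nr][nc]):
                        visited[nr][nc] = True
                        queue.append((nr, nc))
    return islands
```

Every cell is enqueued at most once, so the whole thing is O(rows × cols) time and space.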
Complexity Reasoning: This Is Where You Differentiate
Interviewers testing a data scientist specifically want to know that you think about scale: what happens when the dataset is 10x larger?
This isn't abstract. You've felt it.
You test on a sample and everything looks fine. Then you see the full dataset size.
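A naive pairwise join on two dataframes with millions of rows does O(n × m) work, which is O(n²) when the tables are similar in size, and it will time out (or OOM if you materialize the cross product first). A hash-based join is O(n + m) on average. The interview just asks you to articulate what you already do intuitively.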
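To make the difference concrete, here is a small sketch of both joins over lists of (key, value) pairs. The function names and the pair-based row format are assumptions for the example, and this is not how pandas implements merge internally.

```python
def nested_loop_join(left, right):
    """Inner join on key by comparing every pair: len(left) * len(right) checks."""
    return [(key_l, val_l, val_r)
            for key_l, val_l in left
            for key_r, val_r in right
            if key_l == key_r]


def hash_join(left, right):
    """Inner join on key using a lookup table built from the right side."""
    index = {}  # key -> list of right-side values with that key
    for key, val in right:
        index.setdefault(key, []).append(val)

    joined = []
    for key, val_l in left:
        # One average-O(1) lookup per left row instead of a full scan of right
        for val_r in index.get(key, []):
            joined.append((key, val_l, val_r))
    return joined
```

Both return the same rows. The hash version pays O(m) extra memory for the index, which is the tradeoff to mention when you state the complexity.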
Two moves that consistently land well. State your complexity before you code: "I'll use a hash map for O(1) lookup, so the overall pass is O(n)." Then check with the interviewer whether they want optimization after the brute force works. DS interviewers often care as much about the reasoning as the optimal code, because they're evaluating whether you'll write scalable data pipelines, not just clever one-liners.
What You Can Skip
Here's the part nobody tells you.
The interview looks terrifying. The job is mostly groupbys and window functions.
Unless you're targeting an MLE or applied scientist role, these rarely appear in DS coding rounds:
- Complex dynamic programming (coin change, edit distance, longest common subsequence)
- Advanced graph algorithms (Dijkstra, Bellman-Ford, topological sort)
- Tries, segment trees, Fenwick trees, union-find
- Bit manipulation
- Recursion-heavy divide and conquer
A software engineer interviewing at Google needs all of these. A data scientist at Google Analytics probably doesn't. For machine learning engineer roles, the calculus flips. Treat the coding prep like an SWE loop and stack the stats and ML design on top.
The biggest mistake DS candidates make is preparing for a different job. Know which track you're on before week one.
These Patterns Are Already in Your Data Work
This is the part that should genuinely make you feel better.
Feature engineering is frequency counting. Counting event types per user, computing frequency ratios, finding co-occurring items across sessions: all hash map problems in code form. When you're building features from raw event logs, you're writing frequency maps.
Merging datasets is two pointers. When you join two sorted streams without loading both into memory, that's the two-pointer technique. Streaming ETL pipelines use this exact structure.
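A minimal sketch of that two-pointer structure, assuming both inputs are ascending lists of unique keys (the function name is made up for the example):

```python
def sorted_key_intersection(left_keys, right_keys):
    """Return keys present in both ascending, duplicate-free lists."""
    i = j = 0
    matches = []
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] == right_keys[j]:
            matches.append(left_keys[i])
            i += 1
            j += 1
        elif left_keys[i] < right_keys[j]:
            i += 1  # left key is too small to ever match; advance left
        else:
            j += 1  # right key is too small to ever match; advance right
    return matches
```

Each pointer only moves forward, so the work is O(n + m) with no lookup table. That is why the approach suits inputs that are already sorted.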
Recommendation ranking is custom-key sorting. Sort by predicted score, then by recency, then break ties by item ID: that's multi-key sort, exactly what the interview tests.
Rolling accuracy and drift detection are sliding windows. Computing a rolling average over the last N predictions, or detecting distribution shift in a streaming inference pipeline, uses the sliding window pattern directly.
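As a sketch of the sliding window idea, the function below keeps a running total over the last `window` values instead of re-summing on every step. The name `rolling_mean` and the choice to emit partial-window averages at the start are assumptions for the example.

```python
from collections import deque


def rolling_mean(values, window):
    """Return the mean of the most recent `window` values after each new value."""
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque()  # holds at most `window` values
    total = 0.0
    means = []
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()  # drop the value that left the window
        means.append(total / len(recent))
    return means
```

Feeding it 1 for a correct prediction and 0 for an incorrect one gives rolling accuracy. Each step is O(1), so the pass is O(n) regardless of window size.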
The interview removes the library and asks whether you understand what the library is doing. You already do. You just haven't been asked to narrate it in 35 minutes with someone watching.
A Focused Prep Plan
Product and Analytics DS Roles (4 to 5 weeks)
Week 1-2: Arrays, strings, hash maps. 20 to 25 LeetCode easy-medium problems. Focus on frequency counting, duplicate detection, grouping, and two-sum variants.
Week 3: Sorting with custom keys, two pointers, sliding window. 15 to 20 problems.
Week 4: For every solution you write, state the time and space complexity before you code and after. Make it automatic. If you can't explain it out loud, practice until you can.
Week 5: SQL, A/B testing design, and product sense prep. For analytics DS roles, these rounds are at least as weighted as the coding round.
MLE and Applied Scientist Roles (8 to 10 weeks)
Start with the plan above, then add:
Week 5-6: Trees, BFS/DFS, basic graphs. 20 to 25 problems. The most common topics across DS and SWE loops overlap here.
Week 7-8: Dynamic programming, classic patterns only. 1D DP, 2D DP, knapsack family. 15 problems.
Week 9-10: Full mock interview loops, timed and spoken aloud. Coding interviewers at every company score communication as an explicit dimension, and DS interviewers especially care about how you explain your reasoning around scale and complexity.
In both tracks, prioritize mediums over hards. The DS coding interview rarely features a hard problem. The medium is where the signal lives: can you identify the right pattern, implement it cleanly, and explain the tradeoffs out loud?
For spoken practice specifically, SpaceComplexity runs voice-based mock interviews with rubric feedback across all four dimensions (algorithms, coding, communication, problem-solving) that interviewers actually score. Running 10 sessions before your loop will expose gaps that grinding LeetCode silently won't.
Further Reading
- Python collections module documentation, Counter, defaultdict, and the other tools that abstract over the patterns you'll be asked to implement
- Timsort on Wikipedia, the sorting algorithm Python and Java use, and why it's O(n) on nearly-sorted data
- Hash table on Wikipedia, the formal treatment of why average O(1) lookup holds and what breaks it
- NumPy documentation, the reference for matrix operations that MLE coding questions often strip back to raw loops
- LeetCode problem explorer, filter by tag (Array, Hash Table, Sorting, Sliding Window) and difficulty (Easy, Medium) to build your problem list