Comments and Threads System Design Interview: The Data Model Decides Everything

- Adjacency list beats closure table for shallow-tree, write-heavy comment systems; upgrade to materialized path for deeply nested forums
- Cursor-based pagination always beats offset — encode position as
base64(comment_id + timestamp)so the query hits the index directly - Redis
INCRwith async flush outperforms database column updates for counters under high concurrent write load - SSE is the right choice for read-only live comment counts; WebSockets add bidirectional overhead you don't need for viewers
- Async moderation pipeline via Kafka; use a
visibilityenum withshadow_banso spammers never learn they're filtered - Hot partition fix: read replicas + Redis sorted-set cache (60s TTL) + CDN snapshots for viral posts reduces DB load 95%+
You have a post. Users want to comment on it. Some of those users want to reply to those comments. Some of those users want to reply to the replies. And somewhere, eight levels deep into a thread about JavaScript, two people are still arguing at 2am about whether TypeScript is "actually just JavaScript" using your infrastructure. This is your problem now.
Except comments form a tree, and trees are annoying to store in relational databases efficiently. Add concurrent writes on a viral post, live counters ticking up in real time, and a moderation pipeline that needs to catch spam before it spreads, and what looked like a weekend project turns into one of the better comment system design interviews out there. Here is how to walk through it in 45 minutes.
Start With Clarifying Questions
Before touching the whiteboard, nail down scope. The answer to each of these changes the design substantially.
- How deep does nesting go? YouTube caps at two levels (comments and replies, no sub-replies). Reddit allows arbitrarily deep nesting, which is where humanity's most consequential arguments about topics that genuinely do not matter take place. Facebook stops at two. The answer determines your data model.
- What scale? A community forum is different from a platform with 500 million daily active users. Ask for DAU and write frequency.
- Do you need real-time updates? A blog comment section can refresh every 30 seconds. A live stream chat cannot.
- What sort orders? Newest, oldest, highest-score, "best" (Wilson score). Each changes indexing and pagination strategy.
- Soft or hard delete? Almost always soft delete. A deleted comment with replies has to leave a tombstone, or you break thread continuity. Every
[deleted]placeholder you have ever seen on Reddit is that tombstone doing exactly one job.
For this walkthrough, assume: 500 million DAU, two or three levels of nesting max, a sort-by-new default with an optional sort-by-top, real-time comment counts but not real-time streaming of every comment, and soft delete.
Capacity Math (Two Minutes, Not Ten)
With 500 million DAU posting roughly 0.1 comments per user per day:
- Write QPS: 500M x 0.1 / 86,400 = ~580 writes/second
- Read QPS: 100:1 read-to-write ratio = ~58,000 reads/second
- Storage: average comment is 250 bytes of content plus metadata. At 50 million comments/day, with 3x replication, you are writing about 37 GB per day. A year of data fits comfortably in a few hundred terabytes.
Write QPS of ~600 is very manageable for a single Postgres primary. The read throughput is what requires a caching layer.
High-Level Architecture
Client
|
v
API Gateway / Load Balancer
|
v
Comment Service (stateless, horizontally scaled)
|
+---> PostgreSQL (primary write, sharded by post_id)
| |---> Read Replicas (read traffic)
|
+---> Redis (hot comment cache, counters)
|
+---> Message Queue (Kafka)
| |---> Notification Service
| |---> Moderation Service (async toxicity scoring)
| |---> Analytics/Search Indexer
|
+---> CDN (static assets, snapshot caches for viral posts)
The comment service is stateless. Any instance can handle any request. State lives in Postgres and Redis. Kafka fans out asynchronously so slow consumers (moderation, analytics) never block a comment from being saved.
The comment service is stateless. Any instance can handle any request. State lives in Postgres and Redis. This makes horizontal scaling trivial.
The Data Model Is Where a Comment System Design Interview Gets Interesting
You have four options for storing tree-structured data in a relational database. Three of them are interesting tradeoffs. One of them is a trap.
Adjacency list is the right call for most comment systems. Materialized path is the upgrade for deep threads. Nested sets: look, don't touch.
Option 1: Adjacency list. Each comment stores its own parent_id.
CREATE TABLE comments ( id BIGSERIAL PRIMARY KEY, post_id BIGINT NOT NULL, parent_id BIGINT REFERENCES comments(id), -- NULL = top-level user_id BIGINT NOT NULL, content TEXT NOT NULL, score INT DEFAULT 0, reply_count INT DEFAULT 0, -- cached, updated on write deleted BOOLEAN DEFAULT false, created_at TIMESTAMPTZ DEFAULT now() ); CREATE INDEX idx_comments_post_id ON comments(post_id, created_at); CREATE INDEX idx_comments_parent_id ON comments(parent_id);
Fetching an entire thread requires a recursive CTE:
WITH RECURSIVE thread AS ( SELECT *, 0 AS depth FROM comments WHERE id = $root_id UNION ALL SELECT c.*, t.depth + 1 FROM comments c JOIN thread t ON c.parent_id = t.id ) SELECT * FROM thread ORDER BY depth, score DESC;
Option 2: Closure table. A second table stores every ancestor-descendant pair, including the node pointing to itself at depth 0. Reads become a single fast join with no recursion. But every comment insert requires writing O(depth) rows into the closure table, and moving a subtree is painful.
Option 3: Materialized path. Each comment stores its full path from root: /post:42/comment:100/comment:101. Descendants of any node are a single prefix query (WHERE path LIKE '/post:42/comment:100/%') with no recursion. Reddit uses this with base36-encoded IDs so that lexicographic sort order produces correct tree traversal.
Option 4: Nested sets. Each node gets a left/right integer bounding its subtree. Beautiful for reads. Adding one comment anywhere in a large tree requires recomputing every subsequent left/right value, which means a single insert can touch the entire table. This is not an exaggeration. Never use nested sets for user-generated content. Never.
For most interview answers, choose the adjacency list. Writes are O(1), which matters at 600 writes/second. Comment threads in practice are shallow, typically two to five levels deep, so recursive CTE cost is bounded. The schema is easy to shard by post_id. You can also denormalize reply_count as a cached counter to avoid expensive COUNT(*) queries.
If the interviewer specifically asks about deeply nested forums (think phpBB-style discussion boards with 20-level threads), mention the materialized path as the upgrade, and explain Reddit's base36 encoding trick.
API Design
Keep it RESTful and pagination-first.
POST /posts/:post_id/comments -- create top-level comment
POST /comments/:comment_id/replies -- create a reply
GET /posts/:post_id/comments?sort=new&limit=20&cursor=<cursor>
GET /comments/:comment_id/replies?limit=10&cursor=<cursor>
PUT /comments/:comment_id -- edit
DELETE /comments/:comment_id -- soft delete
POST /comments/:comment_id/vote -- { direction: "up" | "down" }
Always use cursor-based pagination, never offset. LIMIT 20 OFFSET 10000 forces the database to scan 10,020 rows and discard 10,000 of them. This is morally equivalent to driving to the grocery store, filling your cart, paying, and then leaving everything on the conveyor belt. A cursor encodes the position of the last item as base64(comment_id + created_at). The next page query becomes WHERE created_at < cursor_time ORDER BY created_at DESC LIMIT 20, which uses the index directly.
The one wrinkle: score-sorted comments. Scores change constantly, so a score-based cursor can skip or repeat items as scores shift. Acceptable for most products. If you need exact consistency, paginate by ID with a join to the score, or accept that "top" sort has slightly fuzzy pagination.
Soft delete means the row stays in the database. Set deleted = true, wipe the content to [deleted], null out user_id. The children remain orphan-free. The API returns the tombstone record so the frontend can render [deleted] in the correct position in the thread.
Three Scaling Bottlenecks to Mention
1. Hot partitions from viral posts. If you shard by post_id, a viral post sends all reads and writes to one shard. The first fix is read replicas; Reddit ran 12 read replicas per comment database primary. The second fix is caching: store the top 50 comments for any post in Redis with a 60-second TTL. Serving hot threads from Redis reduces database load by 95% or more. For extreme spikes (live sports, breaking news), pre-render a static snapshot of the comment page and serve it from CDN. Accept 1-2 seconds of staleness.
Also mention Redis key replication for the hot partition problem: store the same cached comment list across multiple Redis nodes and load balance reads across them.
2. Counter updates under high concurrency. Naive UPDATE comments SET reply_count = reply_count + 1 WHERE id = $parent_id serializes all concurrent writes to the same row. For most systems, that is fine. For the top comment on a viral post, that is a queue of transactions lined up around the block waiting to increment the same integer. For systems with millions of concurrent replies to a single comment, use Redis INCR as the primary counter store with periodic async flush to the database. If you need the database-only path, use sharded counters: maintain N counter rows per comment in a counter_shards table, pick a random shard on each increment, and sum across shards on read.
3. Soft delete and thread integrity. You cannot hard-delete a comment that has children without either orphaning the children or deleting the entire subtree. Soft delete avoids this. But over time your database accumulates many [deleted] tombstone rows. Periodically audit whether a deleted comment's entire subtree is also deleted. If so, those rows are safe to archive and remove.
Real-Time: When to Use SSE vs WebSocket
For live comment counts (a number ticking up), SSE is the right choice. The client only reads. The server pushes delta events. SSE reconnects automatically, works through most corporate proxies, and multiplexes cleanly over HTTP/2. WebSockets are bidirectional, meaning you pay connection-management overhead you do not need if users are just watching. Route viewers of the same post to the same server cluster using consistent hashing on post_id, so pub/sub fan-out stays local to one machine.
For actual live comment streaming, the flow is:
SSE not WebSocket: clients only read. Consistent hashing on post_id keeps pub/sub fan-out local to one server cluster.
- New comment written to database
- Kafka event published with
{ post_id, comment_id, ... } - Notification service pushes via Redis pub/sub to the server cluster handling that
post_id - Those servers push the comment to connected SSE clients
See the deeper breakdown in WebSockets vs Long Polling vs SSE for connection management details.
Moderation Pipeline
Run moderation asynchronously. Accept the comment, persist it, return 201. Then score it for toxicity in the background.
Comment written → published to Kafka
→ Moderation consumer: score toxicity
→ score < 0.3: mark approved, publish to feed
→ score 0.3-0.7: queue for human review
→ score ≥ 0.7: auto-reject or shadow ban
Shadow banning is subtle but effective. The spammer's comment is accepted, saved, and looks completely normal to them. They might even check back to defend their position to people who cannot see it. Nobody else sees it. It is Schrödinger's spam: simultaneously posted and not posted, observed only by the person who wrote it. A visibility enum (public, shadow_banned, deleted) on the comments table handles this without the user knowing they are filtered.
Key Tradeoffs to Articulate Out Loud
| Decision | Option A | Option B | When to pick A |
|---|---|---|---|
| Data model | Adjacency list | Materialized path | Shallow trees, high write rate |
| Counter storage | DB column | Redis + async flush | DB column fine unless >10K concurrent writes to same comment |
| Real-time | SSE | WebSocket | Viewers only, no client writes |
| Moderation | Sync (block before write) | Async (write then score) | Async almost always. Sync adds latency to every comment post. |
| Pagination | Cursor | Offset | Cursor always for user-facing feeds |
The 45-Minute Clock
Spend roughly this much time on each piece:
- 0-5 min: clarifying questions, scope agreement
- 5-10 min: capacity estimates, state the read/write ratio
- 10-15 min: draw the high-level architecture diagram
- 15-25 min: data model in depth (this is where depth wins you points)
- 25-35 min: API endpoints + pagination
- 35-42 min: scaling bottlenecks (hot partitions, counters, real-time)
- 42-45 min: tradeoffs, what you would do differently at 10x scale
If you run short on time, compress the API section. The data model and the scaling discussion are what interviewers remember. An interviewer who hears you explain adjacency list vs closure table, then correctly pivot to Redis for counters and sharded reads for hot posts, has everything they need to write a strong hire. Nobody gets a strong hire for correctly remembering the cursor encoding format.
If you want to practice this kind of system design question with real-time feedback under interview conditions, SpaceComplexity runs voice-based mock interviews with rubric scoring on exactly these dimensions: requirements gathering, architecture, tradeoffs, and communication.
Recap
- Data model: adjacency list with
parent_idwins for write-heavy, shallow-tree comment systems. Upgrade to materialized path for deeper threads. - Denormalize counters: store
reply_countandlike_counton the row rather than computing at read time. Maintain with Redis INCR + async flush. - Pagination: cursor-based always. Never use offset for feeds.
- Caching: Redis sorted set for top comments per post, short TTL. CDN snapshots for viral posts.
- Real-time: SSE for read-only count updates, Kafka + Redis pub/sub for live comment streams.
- Moderation: async pipeline; soft delete to preserve thread integrity; shadow ban over hard ban for spam.
- Hot partitions: read replicas + Redis cache + consistent hashing for live-stream routing.