Content Moderation System Design: The 45-Minute Walkthrough

Every minute, users upload 500 hours of video to YouTube. Facebook processes billions of posts per day. Somewhere in that stream: child sexual abuse material, terrorist recruitment, spam, deepfakes, and content whose legality depends entirely on which country the viewer sits in. Your job in the interview is to design the system that catches it. All of it. Without throttling the product.

No pressure.

Content moderation system design is one of the best interview questions because it hits every dimension at once: async pipelines, ML inference, data modeling, threshold tuning, and human review under time pressure. This is how to walk through it clearly in 45 minutes.

Start With Requirements. This One Really Rewards Clarification.

Before drawing a single box, nail down scope. Candidates who skip this step design a system that answers a question nobody asked. (Specifically, a question the interviewer has never asked and will not give credit for answering.)

Ask what content types are in scope. Text only simplifies things dramatically. Adding images forces you to handle binary payloads, perceptual hashing, and a different model stack. Adding video adds frame sampling and pipeline latency that text never faces. Adding live streams creates sub-100ms latency constraints that change the architecture entirely.

Ask which violation categories matter. Spam tolerates false negatives because a missed spam comment is a nuisance, while wrongly filtering a legitimate user is the costlier mistake. CSAM (child sexual abuse material) demands near-zero false negatives and triggers legal reporting obligations in most jurisdictions. These categories need different thresholds and cannot share a single knob. Treating spam and CSAM as the same moderation problem is a category error, and interviewers notice.

Ask about the visibility model: does content publish immediately and get removed if bad (optimistic), or does it wait for clearance (pessimistic)? Most platforms use optimistic for most content, pessimistic for new accounts and flagged categories.

Assume, unless told otherwise: 10 million submissions per day (~120/second average, 3x peak), across text, images, and short video. That is the scale that forces interesting architectural choices.
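The throughput assumption is worth checking on paper before building on it. A minimal sketch of the arithmetic, where the function name is illustrative and the 3x peak multiplier is the one assumed above:

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def submissions_per_second(per_day: int, peak_multiplier: float = 3.0) -> tuple[float, float]:
    """Return (average, peak) submissions per second for a daily volume."""
    average = per_day / SECONDS_PER_DAY
    return average, average * peak_multiplier


average, peak = submissions_per_second(10_000_000)
print(f"average ~{average:.0f}/s, peak ~{peak:.0f}/s")  # average ~116/s, peak ~347/s
```

The exact average is about 116 per second, which rounds to the ~120 used in this walkthrough. The peak figure is the one to size workers against.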

[Figure: Requirements clarification funnel, showing raw submissions scoped down to content types and violation categories. Caption: From stream to scope. Every box you draw after this inherits these decisions.]

The Architecture Is a Funnel, Not a Single Model

Here is the insight that separates good answers from great ones.

Every content moderation system is a cost-optimizing funnel: apply cheap tests first, route only the uncertain fraction to expensive ones. The interviewer who hears "so we'll have an ML model classify everything" is already mentally moving on. The one who hears "we apply cheap tests first and push the uncertain 5% to the expensive model" sits forward.

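Stage 1 handles known-bad copies in microseconds. Stage 2 handles everything else in milliseconds, escalating only the ambiguous fraction to a heavier model. Stage 3 (human review) handles what the models cannot settle. Most content is resolved cheaply and never reaches the heavy model or a human.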

Hash Matching: The Free Lunch

Before running any ML, check whether the content is a known-bad copy.

For exact duplicates, compute a SHA-256 hash and look it up in a Redis set. That takes under 1 millisecond even with hundreds of millions of entries. Redis is doing more work before your coffee brews than most databases do in a day.
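The exact-match path can be sketched in a few lines. This is an illustration only: a Python set stands in for the Redis set (where membership would be an SISMEMBER call), and the names `known_bad_digests` and `is_known_bad` are hypothetical.

```python
import hashlib

# Stand-in for the Redis set of known-bad digests (an assumption for this sketch).
known_bad_digests: set[bytes] = set()


def sha256_digest(payload: bytes) -> bytes:
    """Hash the raw bytes of the upload; any single-byte change yields a new digest."""
    return hashlib.sha256(payload).digest()


def is_known_bad(payload: bytes) -> bool:
    """Exact-duplicate check: constant-time set membership on the digest."""
    return sha256_digest(payload) in known_bad_digests


known_bad_digests.add(sha256_digest(b"previously banned file bytes"))
print(is_known_bad(b"previously banned file bytes"))   # True
print(is_known_bad(b"previously banned file bytes!"))  # False
```

The second call shows the limit of exact hashing: one extra byte defeats it, which is why a near-duplicate check is also needed.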

For near-duplicates, use perceptual hashing. PhotoDNA generates a hash that stays similar under resizing, compression, and minor edits. You compare hashes by distance (Hamming distance, for bit-string hashes): a distance below a threshold indicates a near-match. PhotoDNA has a reported false positive rate around 1 in 10 billion, which makes it safe to auto-reject on a match alone. That number is so good it almost sounds made up. It isn't. For video, sample keyframes and hash each one independently.
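A minimal sketch of the distance comparison, assuming a generic 64-bit perceptual hash stored as an integer. PhotoDNA's actual hash format and matching are proprietary, so the hash width, the `max_distance` of 8, and the function names here are all assumptions for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_match(candidate: int, known_bad: list[int], max_distance: int = 8) -> bool:
    """True if the candidate is within max_distance bits of any known-bad hash.

    A linear scan is fine for a sketch; a production index would need
    something faster than comparing against every entry.
    """
    return any(hamming_distance(candidate, h) <= max_distance for h in known_bad)


known = [0xF0F0F0F0F0F0F0F0]
slightly_altered = known[0] ^ 0b101   # same hash with 2 bits flipped
unrelated = 0x0F0F0F0F0F0F0F0F        # differs in all 64 bits

print(hamming_distance(known[0], slightly_altered))  # 2
print(is_near_match(slightly_altered, known))        # True
print(is_near_match(unrelated, known))               # False
```

The threshold is the tuning knob: too tight and re-encoded copies slip through, too loose and unrelated images start to collide.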

This stage handles re-uploaded CSAM and known-bad content at almost zero compute cost. It should run synchronously before anything else touches the content.

[Figure: Hash matching stage showing SHA-256 exact match via Redis and perceptual hash near-match via Hamming distance, both running under 1ms. Caption: Both tracks run before any classifier sees the content. The most powerful safety check is also the cheapest one.]

Stage 2: The Cascade

Content that passes hash matching enters a cascade of ML classifiers. The cascade's job is to push as much content as possible to a cheap decision and reserve expensive compute for genuinely uncertain cases.

The lightweight model (a distilled text classifier or a small CNN for images) runs in under 5ms and produces a confidence score per category. Anything above a high-confidence threshold (say, 0.95 for most categories) is auto-rejected. Anything below a low-confidence threshold (say, 0.10) is auto-approved. The middle band, typically a few percent of all content, escalates to the heavy model.
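The two-threshold routing rule is small enough to write out. A minimal sketch using the example thresholds from the paragraph above; the `Route` enum and `route` function are hypothetical names, and treating a score exactly at the upper threshold as a reject is an assumption.

```python
from enum import Enum


class Route(Enum):
    AUTO_APPROVE = "auto_approve"
    AUTO_REJECT = "auto_reject"
    ESCALATE = "escalate"


def route(score: float, reject_at: float = 0.95, approve_below: float = 0.10) -> Route:
    """Route one lightweight-model score for a single violation category."""
    if score >= reject_at:
        return Route.AUTO_REJECT
    if score < approve_below:
        return Route.AUTO_APPROVE
    return Route.ESCALATE  # uncertain middle band goes to the heavy model


print(route(0.99))  # Route.AUTO_REJECT
print(route(0.03))  # Route.AUTO_APPROVE
print(route(0.50))  # Route.ESCALATE
```

Widening the gap between the two thresholds sends more traffic to the heavy model; narrowing it saves compute at the cost of more automated mistakes.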

Production cascade systems that send only the top 2.5% of submissions to a large model reduce inference cost to roughly 1.5% of naive full-model deployment, while actually improving F1 on hard cases. The routing decision is where most of the engineering lives. Not the model itself. The routing.

The heavy model can be a vision transformer, a multimodal LLM, or an ensemble. For images, you also run OCR to catch offensive text overlaid on the image. A meme conveys meaning that neither the image nor the caption carries alone.

[Figure: Three-stage cascade pipeline showing hash check (microseconds, 5% rejection), lightweight classifier (5ms, 92% clearance), and heavy GPU model (100-500ms, 2.5% of total volume), with auto-approve and auto-reject branches. Caption: Most content is resolved before the GPU stage. What makes it to the GPU pool is genuinely ambiguous.]

Stage 3: Human Review

Whatever the heavy model still cannot decide with confidence goes into a priority queue for human review.

Priority score drives queue ordering, not submission time. Severity matters most: CSAM and terrorism content jump the queue regardless of confidence. After that come confidence gap (the delta between the two most likely categories; a small gap signals genuine ambiguity), report count (how many users flagged the post), and account age (new accounts get higher scrutiny).
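One way to encode that ordering is a tuple sort key on a heap. This is a sketch under stated assumptions: the severity weights, the strict lexicographic order of the four signals, and the names `priority_key` and `enqueue` are illustrative choices, not a prescribed formula.

```python
import heapq

# Assumed severity weights; higher means more urgent.
SEVERITY = {"csam": 3, "terrorism": 3, "hate_speech": 2, "spam": 1}


def priority_key(category: str, confidence_gap: float, report_count: int, account_age_days: int):
    """Smaller tuples pop first: high severity, small gap, many reports, young account."""
    return (-SEVERITY.get(category, 0), confidence_gap, -report_count, account_age_days)


review_queue: list = []


def enqueue(content_id: str, category: str, confidence_gap: float,
            report_count: int, account_age_days: int) -> None:
    key = priority_key(category, confidence_gap, report_count, account_age_days)
    heapq.heappush(review_queue, (key, content_id))


enqueue("post-a", "spam", 0.05, 10, 400)
enqueue("post-b", "csam", 0.40, 0, 2000)
enqueue("post-c", "hate_speech", 0.10, 3, 5)

order = [heapq.heappop(review_queue)[1] for _ in range(3)]
print(order)  # ['post-b', 'post-c', 'post-a']
```

Note that post-b wins despite being submitted second and having zero user reports: severity dominates every other signal.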

Moderators see the queue sorted by priority, with the AI's confidence scores, a brief explanation of why the content was flagged, and any prior violations on the account. They make a binary decision. Every decision feeds back into model retraining, so the human review queue is also your data labeling pipeline. It is doing two jobs at once, and you should say so in the interview.

[Figure: Human review queue interface showing priority-sorted list, content preview panel, AI context panel with per-category confidence bars and account history, and approve/reject buttons. Caption: The context panel on the right is what separates a moderator who can make a good decision from one guessing blind.]

Data Model

Four tables cover the core lifecycle:

CREATE TABLE content_items (
  content_id       UUID PRIMARY KEY,
  user_id          UUID NOT NULL,
  content_type     TEXT,           -- 'text', 'image', 'video'
  storage_url      TEXT,
  sha256_hash      BYTEA,          -- exact-duplicate lookup key
  perceptual_hash  BYTEA,          -- near-duplicate lookup key
  status           TEXT,           -- 'pending', 'approved', 'rejected'
  created_at       TIMESTAMPTZ,
  region           TEXT
);

CREATE TABLE moderation_jobs (
  job_id           UUID PRIMARY KEY,
  content_id       UUID REFERENCES content_items (content_id),
  pipeline_version TEXT,
  outcome          TEXT,           -- 'approved', 'rejected', 'escalated'
  confidence       FLOAT,
  category_scores  JSONB,          -- { "spam": 0.12, "hate_speech": 0.78, ... }
  decided_at       TIMESTAMPTZ
);

CREATE TABLE review_tasks (
  task_id          UUID PRIMARY KEY,
  job_id           UUID REFERENCES moderation_jobs (job_id),
  reviewer_id      UUID,
  priority         INT,
  assigned_at      TIMESTAMPTZ,
  completed_at     TIMESTAMPTZ,
  decision         TEXT,
  notes            TEXT
);

CREATE TABLE appeals (
  appeal_id        UUID PRIMARY KEY,
  job_id           UUID REFERENCES moderation_jobs (job_id),
  user_id          UUID,
  submitted_at     TIMESTAMPTZ,
  reason           TEXT,
  resolved_at      TIMESTAMPTZ,
  outcome          TEXT
);

Store category_scores as JSONB, not as typed columns. Violation categories change when policy changes. New categories get added by legal, old ones pruned after model evaluation. A column-per-category turns every policy update into a schema migration. If you have ever done that migration at 2am because legal added a new content category effective Monday, you know exactly why JSONB is the right call here.

Index content_items on sha256_hash for the hash lookup path. Add a composite index on (status, created_at) for queue workers polling pending items. Partition moderation_jobs by month: at 10 million decisions per day at ~500 bytes each, you accumulate roughly 1.8 TB per year, and you only need the hot month for operational queries.
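The storage estimate for moderation_jobs can be checked with a few lines of arithmetic. A minimal sketch that counts raw row bytes only; indexes and table overhead are deliberately ignored, and the variable names are illustrative.

```python
DECISIONS_PER_DAY = 10_000_000
BYTES_PER_ROW = 500  # rough row size used in this walkthrough

bytes_per_day = DECISIONS_PER_DAY * BYTES_PER_ROW   # 5 GB of new rows per day
bytes_per_year = bytes_per_day * 365
bytes_per_month = bytes_per_year / 12

print(f"per day:   {bytes_per_day / 1e9:.1f} GB")
print(f"per month: {bytes_per_month / 1e9:.0f} GB")
print(f"per year:  {bytes_per_year / 1e12:.2f} TB")
```

The monthly figure is the one that matters for partitioning: a single hot partition of this size is small enough to keep operational queries fast.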

API Design

Two surfaces: the internal submission pipeline and the user-facing appeals API.

# Submit content for moderation (async)
POST /v1/moderation/submit
{
  "content_type": "image",
  "storage_url": "s3://bucket/key",
  "user_id": "abc123",
  "region": "EU"
}
→ { "job_id": "xyz789", "status": "queued" }

# Poll for result (or receive via webhook)
GET /v1/moderation/jobs/{job_id}
→ { "status": "rejected", "reason": "hate_speech", "confidence": 0.91 }

# User appeal
POST /v1/appeals
{
  "job_id": "xyz789",
  "reason": "This post was incorrectly removed."
}
→ { "appeal_id": "...", "expected_resolution": "24-72 hours" }

The submit endpoint is async: it returns a job ID immediately and processes in the background. For optimistic publishing, the content goes live and gets pulled if rejected. For pessimistic publishing, a separate gating check polls the job status before allowing the content to appear.

Note the region field on submission. You will need it. More on that shortly.

Where This System Breaks Under Load

Three bottlenecks show up at scale. Interviewers at companies with real content problems will probe exactly these three.

GPU Inference Throughput

The lightweight model is CPU-friendly, but the heavy model needs GPU workers. Batch requests before sending: waiting 20ms to accumulate a batch of 64 images can increase GPU throughput by 10x. Auto-scale GPU pools on queue depth, not CPU utilization. (CPU utilization will look fine right up until the queue is 30 minutes deep.)
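The batching rule (flush at 64 items or after 20ms, whichever comes first) can be sketched with the standard library queue. The function name `collect_batch` and the single-consumer setup are assumptions; a real inference server would usually provide its own batching.

```python
import queue
import time


def collect_batch(requests: queue.Queue, max_batch: int = 64, max_wait_s: float = 0.020) -> list:
    """Block for the first request, then gather more until the batch is
    full or the wait budget is spent."""
    batch = [requests.get()]  # blocks until at least one request exists
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except queue.Empty:
            break  # wait budget ran out with no new arrivals
    return batch


pending: queue.Queue = queue.Queue()
for image_id in range(100):
    pending.put(image_id)

first = collect_batch(pending)
second = collect_batch(pending)
print(len(first), len(second))  # 64 36
```

The first batch fills immediately; the second returns a partial batch once the 20ms budget expires, which bounds the latency added by batching.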

Human Review Queue Depth

A news event or a coordinated upload campaign can spike submission volume 10x in minutes. You cannot hire moderators fast enough to absorb that spike. The operational safety valve is threshold tuning: temporarily lower the auto-reject confidence threshold, which pushes more content to auto-action and reduces queue depth. This trades precision for queue stability. It should be a deliberate, logged, reversible operator decision, not an automatic one. If it happens automatically, nobody is accountable when it goes wrong.
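A sketch of what "deliberate, logged, reversible" might look like in code. The `ThresholdController` class, its field names, and the operator handle are hypothetical; the point is only that every change records who made it and why, and that the original value is kept for rollback.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ThresholdController:
    thresholds: dict[str, float]
    audit_log: list[dict] = field(default_factory=list)
    _originals: dict[str, float] = field(default_factory=dict)

    def override(self, category: str, new_value: float, operator: str, reason: str) -> None:
        # Keep the pre-incident value only once, so repeated overrides still revert cleanly.
        self._originals.setdefault(category, self.thresholds[category])
        self._record("override", category, self.thresholds[category], new_value, operator, reason)
        self.thresholds[category] = new_value

    def revert(self, category: str, operator: str) -> None:
        # Raises KeyError if there is no active override for this category.
        original = self._originals.pop(category)
        self._record("revert", category, self.thresholds[category], original, operator, "revert")
        self.thresholds[category] = original

    def _record(self, action, category, old, new, operator, reason) -> None:
        self.audit_log.append({"at": time.time(), "action": action, "category": category,
                               "old": old, "new": new, "operator": operator, "reason": reason})


controller = ThresholdController({"spam": 0.95})
controller.override("spam", 0.85, operator="oncall-alice", reason="queue depth alarm")
print(controller.thresholds["spam"])  # 0.85
controller.revert("spam", operator="oncall-alice")
print(controller.thresholds["spam"])  # 0.95
```

There is no code path here that changes a threshold without an operator name, which is the accountability property the paragraph above asks for.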

Hash Database Size

The known-bad hash database can grow to hundreds of millions of entries. A full PostgreSQL lookup per submission will not survive peak load. Keep the hot set in Redis and place a Bloom filter in front of it: 100 million hashes fit in under 200MB of memory at a false positive rate of about 0.1%, and a lookup takes microseconds. A Bloom filter never misses a stored hash, so a negative answer skips Redis entirely; a positive answer may be a false positive and must be confirmed against the real set. You can tune the false positive rate to whatever extra lookup cost you can tolerate. Two hundred megabytes doing the work of a database shard. It is a beautiful thing.
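The memory figure follows from the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. A minimal sketch of that calculation; the function names are illustrative and the 0.1% target rate is an assumption chosen to show where the "under 200MB" figure lands.

```python
import math


def bloom_size_bits(n_items: int, false_positive_rate: float) -> int:
    """Bits needed for n_items at the target false positive rate."""
    return math.ceil(-n_items * math.log(false_positive_rate) / (math.log(2) ** 2))


def optimal_hash_count(size_bits: int, n_items: int) -> int:
    """Number of hash functions that minimizes false positives for this size."""
    return max(1, round(size_bits / n_items * math.log(2)))


n = 100_000_000
bits = bloom_size_bits(n, 0.001)
megabytes = bits / 8 / 1e6
print(f"{megabytes:.0f} MB with {optimal_hash_count(bits, n)} hash functions")
```

Relaxing the target to 1% drops the requirement to roughly 120 MB, at the price of ten times as many wasted confirmation lookups.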

[Figure: Scaling architecture showing load balancer feeding preprocessing fleet (Bloom filter plus Redis), inference queue, lightweight and heavy GPU worker pools, human review priority queue, depth alarm, and threshold controller feedback loop. Caption: The threshold controller is the safety valve nobody draws until they've had a queue back up on them.]

The Tradeoffs That Actually Matter

Precision vs. recall is per-category, not global. Removing a legitimate creator's video is a false positive. Leaving up a CSAM post is a false negative. These errors are not symmetric, and the asymmetry differs by category. CSAM demands aggressive recall (accept many false positives to catch every true positive). Spam moderation demands aggressive precision (accepting missed spam is cheaper than wrongly silencing users). You need per-category threshold controls. One global slider is the wrong answer, and if an interviewer asks how you'd tune the system, the word "global" should not appear in your response.
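Per-category controls can be as simple as a lookup table consulted for each score. A sketch with made-up threshold values: the numbers below are placeholders to show the shape, not recommendations, and `decide` is a hypothetical name. The outcome strings match the values used in moderation_jobs.outcome.

```python
# Illustrative values only: recall-heavy for CSAM, precision-heavy for spam.
THRESHOLDS = {
    "csam": {"reject_at": 0.50, "approve_below": 0.01},
    "spam": {"reject_at": 0.98, "approve_below": 0.30},
}
DEFAULT = {"reject_at": 0.95, "approve_below": 0.10}


def decide(category_scores: dict[str, float]) -> str:
    """Combine per-category scores into one outcome; the strictest result wins."""
    needs_escalation = False
    for category, score in category_scores.items():
        limits = THRESHOLDS.get(category, DEFAULT)
        if score >= limits["reject_at"]:
            return "rejected"
        if score >= limits["approve_below"]:
            needs_escalation = True
    return "escalated" if needs_escalation else "approved"


print(decide({"csam": 0.60}))  # rejected
print(decide({"spam": 0.60}))  # escalated
print(decide({"spam": 0.10}))  # approved
```

The same 0.60 score produces a rejection in one category and an escalation in the other, which is exactly what a single global threshold cannot express.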

Optimistic vs. pessimistic publishing is a risk segmentation problem. Holding all content until cleared gives better safety but terrible UX for the 99.9% of posts that are fine. Publishing everything immediately lets viral hate content spread before removal. The real answer is segmentation: new accounts, previously banned users, and certain content types use pessimistic gating. Everyone else gets optimistic publishing with a short SLA on the review. This is not a single architectural decision; it is a business policy that gets encoded into the submission endpoint.
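Encoded at the submission endpoint, that segmentation is a short policy function. A sketch under assumptions: the 7-day cutoff for a "new" account, the choice of video as a gated content type, and the names `Submitter` and `publishing_mode` are all illustrative.

```python
from dataclasses import dataclass

NEW_ACCOUNT_DAYS = 7              # assumed cutoff for "new account"
GATED_CONTENT_TYPES = {"video"}   # assumed set of always-gated content types


@dataclass
class Submitter:
    account_age_days: int
    previously_banned: bool


def publishing_mode(submitter: Submitter, content_type: str) -> str:
    """Return 'pessimistic' (hold until cleared) or 'optimistic' (publish, then review)."""
    if submitter.previously_banned:
        return "pessimistic"
    if submitter.account_age_days < NEW_ACCOUNT_DAYS:
        return "pessimistic"
    if content_type in GATED_CONTENT_TYPES:
        return "pessimistic"
    return "optimistic"


print(publishing_mode(Submitter(account_age_days=900, previously_banned=False), "text"))  # optimistic
print(publishing_mode(Submitter(account_age_days=2, previously_banned=False), "text"))    # pessimistic
```

Because the rules live in one function, changing the policy is a configuration review rather than an architecture change.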

A single global model cannot handle regional law. Germany's NetzDG, the EU's Digital Services Act, and India's IT Rules impose different removal obligations. Some content is legal in the US and illegal in Germany. You can handle this two ways: regional policy rules that post-process the ML decision, or fine-tuned per-region models with different training data. Policy rules are cheaper and faster to update when law changes. Per-region models are more accurate but require separate training pipelines and labeling budgets for each jurisdiction. The region field on every submission is what makes this tractable.
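The policy-rules option can be sketched as a post-processing step over the ML outcome. Everything specific here is hypothetical: the "DE" region code, the 0.70 limit, and the `apply_regional_policy` name are placeholders, and real rules would come from legal review rather than a hardcoded table.

```python
# Hypothetical per-region overrides: category -> score at which removal is required.
REGION_RULES = {
    "DE": {"hate_speech": 0.70},
}


def apply_regional_policy(outcome: str, category_scores: dict[str, float], region: str) -> str:
    """Tighten the global ML outcome where a region's rules demand removal.

    This sketch only ever makes the outcome stricter; it never un-rejects content.
    """
    if outcome == "rejected":
        return outcome
    for category, limit in REGION_RULES.get(region, {}).items():
        if category_scores.get(category, 0.0) >= limit:
            return "rejected"
    return outcome


scores = {"hate_speech": 0.78, "spam": 0.12}
print(apply_regional_policy("escalated", scores, "DE"))  # rejected
print(apply_regional_policy("escalated", scores, "US"))  # escalated
```

Updating the table when a law changes requires no retraining, which is the cost advantage of rules over per-region models.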

How to Walk Through This in 45 Minutes

Spend your first 5 minutes on requirements. You will immediately sound more senior than every candidate who opens with "so we'll need Kafka."

Sketch the funnel (hash check, cascade, human review) in the first 10 minutes of architecture. It is the thesis of the entire system. Once the interviewer agrees on the funnel shape, fill in the components.

Bring up the data model explicitly. Most candidates skip it. The JSONB for category_scores is the kind of specific call that interviewers remember, because it shows you have actually operated systems where policy changes outpace schema migrations.

For your deep dive, volunteer the most interesting tradeoff: "The hardest tension here is precision vs. recall per category, and I think the interesting case is what to do when the human review queue is backing up. Can I walk through the threshold-tuning safety valve?" That signals operational maturity, not just paper architecture.

Do not forget appeals. Almost nobody mentions them. Seriously, almost nobody. They are a legal requirement under the EU's Digital Services Act and a direct feedback loop into model quality. Mentioning them signals that you think about the full product lifecycle, not just the happy path where content goes in, gets a score, and disappears.

[Figure: 45-minute interview timeline divided into six phases: requirements (0-5), back-of-envelope (5-15), high-level architecture (15-25), data model and API (25-35), deep dive (35-42), tradeoffs and appeals (42-45). Caption: The clock is not your enemy. The candidate who opens with Kafka is.]

Recap

The system is a cost-optimizing funnel: hash matching, cascade ML, human review.
Hash matching (exact SHA-256 and perceptual hashing) catches known-bad content in under 1ms.
The cascade sends ~2.5% of content to the expensive model, cutting inference cost to roughly 1.5% of naive full-model deployment.
Store category scores as JSONB so policy changes don't require schema migrations.
The priority queue orders by severity and confidence gap, not submission time.
Precision and recall tradeoffs are per-category. One global threshold is wrong.
Optimistic vs. pessimistic publishing is a risk segmentation policy, not a binary choice.
Regional policy layers handle legal variation without retraining the core model.
Always mention appeals. Legal requirement, feedback loop, and proof you think beyond the happy path.

If you want to practice saying all of this out loud under actual time pressure, SpaceComplexity runs voice-based mock system design interviews with rubric feedback on both your architecture and how you communicate under constraints. Reading this post gets you the framework. Practicing it out loud is what passes the interview.