Design a Real-Time Analytics Dashboard: The 45-Minute Walkthrough

June 11, 202611 min read
interview-prepcareersystem-designalgorithms
Design a Real-Time Analytics Dashboard: The 45-Minute Walkthrough
TL;DR
  • Separate write and read paths: Kafka + Flink handle ingestion continuously; Redis + ClickHouse serve queries; the two paths share only storage
  • Pre-aggregate at four granularities (1m, 5m, 1h, 1d) in the stream processor to cut query cost by 99%+ vs scanning raw events
  • ClickHouse columnar storage scans 100-200M rows/second per node with 60-98% compression, making sub-100ms analytics queries realistic at scale
  • HyperLogLog estimates unique user counts with 2% error and only 1.5 KB regardless of cardinality; exact counting is only needed for billing or compliance
  • Redis TTL jitter (±20%) prevents thundering herd when hundreds of dashboards expire the same cache key simultaneously
  • SSE over WebSockets for server-to-client dashboard updates; WebSockets are bidirectional overhead you don't need when data flows one way
  • Kappa vs Lambda: streaming-only Kappa is sufficient for product analytics; Lambda adds a batch layer only when financial reconciliation requires exact correctness

Most engineers walk into a real-time analytics dashboard system design interview expecting to draw a database and a chart. A happy little database. Some arrows. Done. Then the interviewer asks: "How do you handle 200,000 events per second while keeping dashboard queries under 100ms?" The marker goes back on the tray.

The core tension is that writes never stop, but reads need to feel instant. Solving that is what the question actually tests. This walkthrough covers every layer, the numbers that matter, and how to pace the 45 minutes before the clock runs out.


Clarify Before You Draw

Before touching the whiteboard, ask three questions. Yes, before the whiteboard. Step away from the marker.

What does "real-time" mean here? "Real-time" in engineering means anything from "within a millisecond" to "by end of day, roughly." Data visible within 1 second is genuinely hard. Within 5 seconds is manageable. Within 1 minute is honestly just fast batch. The answer determines your entire stream processing layer.

What are the query patterns? Are users filtering by arbitrary dimensions (product, region, device) or viewing fixed pre-defined charts? Arbitrary filters require a general-purpose OLAP engine. Fixed charts let you pre-compute everything.

What's the scale? A product analytics tool serving 50 dashboards and a financial platform serving 50,000 concurrent analysts are completely different problems. Get a number: events per second ingested, concurrent viewers, data retention period.

For this walkthrough, assume: 200,000 events per second at peak, 10-second refresh intervals, data visible within 5 seconds of ingestion, and 90-day retention.


The Two Pipelines You're Actually Designing

Every analytics system has a write path and a read path. Your job is to make sure they never block each other.

Write path: Events → Queue → Stream Processor → Storage
Read path:  Client → API → Cache → Storage

These are logically separate. The write path ingests continuously and must never stall. The read path serves queries and must not wait on ingestion. A queue in the middle is the foundational design decision. Two pipelines that share nothing except the storage layer is not a design flaw. It's the entire point.

Write path flowing Events through Kafka and Flink into ClickHouse; read path flowing Dashboard Client through API Server and Redis into the same ClickHouse instance Write path and read path share only the storage layer. They never block each other.


The Ingestion Layer: Kafka Keeps the Writes Honest

Events arrive from clients, servers, and SDKs. They need a buffer that absorbs traffic spikes and decouples producers from consumers.

Kafka is the standard answer. A single cluster handles millions of events per second and persists events for replay. Partition by tenant_id or user_id so consumers can process in parallel without coordination overhead. At 200,000 events per second and 800 bytes per event, you're looking at 160 MB/sec. That's well within what a properly tuned cluster handles.

One detail worth raising in the interview: delivery semantics. For analytics, at-least-once delivery is acceptable because a few duplicate pageview counts don't matter, but missing events do. Your VP of Marketing will never notice if a pageview was counted twice. They will notice if 50,000 of them went missing. Exactly-once adds 10-20% throughput overhead and is necessary only when you're counting revenue or transactions.

Keep the Kafka retention window at 24-48 hours. Long enough to replay if a consumer falls behind, short enough that storage stays manageable.


Stream Processing: Where the Math Happens

Raw events land in Kafka. Before they hit storage, you need to aggregate them. A stream processor (Apache Flink or Kafka Streams for simpler cases) reads from Kafka, applies windowed aggregations, and writes results.

Tumbling windows are fixed, non-overlapping time buckets. Every 60 seconds, close the window, compute aggregates (count, sum, unique users via HyperLogLog), and write the result. That's your 1-minute rollup.

Timeline showing three consecutive 60-second tumbling windows with event dots; closed windows emit one aggregated row with count and unique user estimate Each bucket closes independently and emits one row. No window overlaps with its neighbor.

Speaking of HyperLogLog: when an interviewer asks "how do you count unique users across millions of events?" the wrong answer is a hash set (it grows linearly with cardinality). HyperLogLog estimates cardinality with roughly 2% error using only 1.5 KB regardless of how many distinct users you've seen. For a dashboard showing "1.24M unique visitors," nobody cares about the 2% imprecision. Your CEO will round it to 1.2M in their next all-hands anyway.

Pre-aggregate at multiple time granularities during stream processing: 1-minute, 5-minute, 1-hour, and 1-day buckets. Storing all four means a query for "last 7 days" can read 168 hourly rows instead of 604,800 per-minute rows. Query cost drops by 99.97%.


The Storage Layer: Columnar Is Not Optional

Relational databases fail here. They store data row by row. An analytics query that says "sum event_count for the last 30 days" needs to read that one column across millions of rows. A row-oriented store reads every other column on the way there and burns I/O on data you never asked for.

ClickHouse is the go-to choice for this workload. It stores data column by column, compresses each column independently (numeric deltas compress at 60-98%), and scans at 100-200 million rows per second per server. Write throughput exceeds 1 million rows per second with proper batching.

Your storage schema has two layers:

-- Raw events (queryable, 7-day hot retention) CREATE TABLE events ( tenant_id UInt32, event_name LowCardinality(String), timestamp DateTime, user_id String, properties String -- JSON blob ) ENGINE = MergeTree() PARTITION BY toYYYYMMDD(timestamp) ORDER BY (tenant_id, event_name, timestamp); -- Pre-aggregated rollups (90-day retention, fast queries) CREATE TABLE rollups_1m ( tenant_id UInt32, event_name LowCardinality(String), window_start DateTime, event_count UInt64, unique_users AggregateFunction(uniq, String) ) ENGINE = AggregatingMergeTree() ORDER BY (tenant_id, event_name, window_start);

The pre-aggregated table is what your dashboards query 95% of the time. Raw events exist for ad-hoc exploration and backfill.

API Server querying 1m, 1h, and 1d rollup tables as the primary path; a dashed arrow falls back to the raw events table for the remaining 5 percent of ad-hoc queries Rollups absorb 95% of queries. Raw events stay around for the cases when someone inevitably asks "can we break this down by browser version?"


Serving Queries: The Cache Is Not Optional Either

Pre-aggregated ClickHouse queries run in under 100ms. But 50,000 concurrent dashboards all querying ClickHouse directly would still be a problem. You need a caching layer.

Redis sits in front. When a dashboard loads, the API server computes a cache key from the query parameters (tenant, metric, time range, filters) and checks Redis first. Cache hit: respond in under 5ms. Cache miss: query ClickHouse, store result in Redis with a 60-second TTL, respond.

A 60-second TTL means dashboard data is at most 60 seconds stale, which is entirely acceptable for a tool refreshing every 10 seconds. For historical queries (last 7 days, last month), bump the TTL to 5 minutes since those numbers barely move.

Beware the thundering herd. If 500 dashboards have the same cache expiry for the same query, they all miss at once and hammer ClickHouse simultaneously. ClickHouse receives 500 identical requests at the same millisecond. It is fine. Until it isn't. Add random jitter to TTL (±20%) so expirations stagger.


Pushing Updates to the Client

Dashboards are not static pages. Users expect numbers to update. Two options:

Polling: client calls /api/metrics every 10 seconds. Simple. Works everywhere. No persistent connections. The cost is a 10-second latency floor and redundant requests even when nothing changed.

Server-Sent Events (SSE): server pushes updates as events occur. One persistent HTTP connection per client. Automatic reconnection built into the browser. Works through most proxies.

For 90% of dashboards, SSE is sufficient and simpler than WebSockets. WebSockets are bidirectional. Dashboards only need data flowing one way: server to client. Reserve WebSockets for collaborative tools where the client also sends state back to the server.

SSE connections cost a few KB each. At 50,000 concurrent viewers, that overhead is trivial.

Browser client sending one SSE subscribe request to the right; API Server sending three metric_update events back to the left at 5s, 10s, and 15s intervals over a single persistent connection One subscribe, then the server drives. The browser EventSource API handles reconnection automatically.


Scaling and the Bottlenecks You Should Name

Three places this breaks under load:

Hot partitions in Kafka. If you partition by event_name and one event (say, "page_view") is 80% of traffic, one partition gets 80% of the writes and its consumer falls behind. Partition by a more uniform key like user_id or hash tenant_id + event_name together.

Write amplification in stream processing. Flink writing 1-minute rollups to ClickHouse at high frequency creates small batches, and small-batch inserts are expensive in MergeTree. Batch writes to at least 100,000 rows per insert. Buffer in the stream processor for 10-30 seconds before flushing.

Fan-out for multi-tenant rollups. If one tenant generates 80% of events, their rollup computation dominates your stream processor. Partition Flink tasks by tenant so hot tenants don't slow down everyone else.


The Tradeoffs Worth Saying Out Loud

The interviewer wants to hear you acknowledge what you gave up.

Approximate vs exact cardinality. HyperLogLog is fast and small but introduces 2% error. If the product requires exact unique user counts (billing, compliance), you need exact counting, which means either a sorted set in Redis (expensive at scale) or pre-computing cardinality at write time with a bounded key space.

Pre-aggregation rigidity. Your rollup tables are fast because they pre-answer specific questions. Adding a new dimension (say, "browser version") requires backfilling all historical rollups. Product managers will inevitably ask for that dimension the week after launch. Keeping raw events around for 7 days gives you a reprocessing window, but it is still operational work.

Lambda vs Kappa architecture. A pure streaming Kappa architecture is simpler to maintain (one pipeline, not two). Lambda adds a separate batch pipeline that recomputes correct results after late data arrives. For a product analytics dashboard, Kappa with watermarks handling late data is usually sufficient. For financial dashboards where every cent must reconcile, Lambda's correctness guarantees are worth the complexity.


How to Run the 45-Minute Clock

Here is a time budget that consistently works:

PhaseTimeWhat you cover
Requirements5 minScale, latency SLAs, query patterns
High-level architecture8 minDraw the two-pipeline diagram, name the layers
Write path deep dive10 minKafka partitioning, Flink windowing, ClickHouse schema
Read path deep dive8 minRollup tables, Redis cache, TTL strategy
Client push4 minSSE vs WebSocket decision
Scaling and bottlenecks7 minHot partitions, batch sizing, fan-out
Tradeoffs and wrap-up3 minApproximate vs exact, Lambda vs Kappa

State the bottleneck before the interviewer asks about it. The worst version of this interview is you getting there two seconds after they prompted you. Saying "one thing I'd want to watch is hot partition lag in Kafka if a single event type dominates" is a stronger signal than answering the same concern defensively when prompted.


Recap

  • Clarify "real-time" latency target before drawing anything
  • Separate write path (Kafka, Flink, ClickHouse) from read path (API, Redis, ClickHouse) so they never block each other
  • Pre-aggregate at 1m, 5m, 1h, 1d granularities in your stream processor, not at query time
  • Use ClickHouse columnar storage for aggregations; it scans 100-200M rows/sec per server
  • Use HyperLogLog for unique user counts (2% error, 1.5 KB, unlimited cardinality)
  • Cache pre-aggregated results in Redis with TTL jitter to prevent thundering herd
  • SSE over WebSockets for server-to-client dashboard updates
  • Name the tradeoffs (approximate vs exact, Kappa vs Lambda) before the interviewer asks

To practice explaining this under real interview pressure, SpaceComplexity runs voice-based system design mock interviews with rubric-based feedback. It tells you which parts of your explanation were unclear, which numbers you forgot, and whether your tradeoff discussion landed.

The ad click aggregator walkthrough covers a nearly identical write-heavy architecture focused on the aggregation pipeline. Hot partition handling goes deeper on the Kafka scaling problems above. The caching strategies guide covers Redis TTL tradeoffs in more depth.


Further Reading