Feature Flag System Design Interview: The 45-Minute Walkthrough

Feature flags are on every engineering team's stack, but most candidates treat the design as trivial. "Just a key-value store with a boolean." That answer gets you a soft no faster than "it's basically just a linked list" gets you a wrong answer. Flag evaluation sits on the hot path of every request your system handles, so the design has to absorb thousands of evaluations per second at modest scale, and far more at large scale, without adding latency.

This feature flag system design interview walkthrough covers the full 45-minute loop: requirements, architecture, data model, API design, scaling, and the tradeoffs that separate strong hires from average ones.

Why "Just a Key-Value Store" Fails

Feature flag systems look simple from the outside. Under the hood they involve distributed systems tradeoffs (consistency vs availability), read-heavy workloads at massive scale, streaming protocols, SDK design, and the fan-out problem. There are also real-world reference points (LaunchDarkly, Unleash, AWS AppConfig), so the interviewer can probe your depth. And they will.

Saying "just a key-value store" is the system design equivalent of saying "just deploy it." Technically not wrong. Catastrophically incomplete.

Clarify Requirements First

When you design a feature flag system, spend the first five minutes here. The scope changes dramatically depending on the answers. Skip this step and you'll spend 35 minutes confidently explaining the wrong system to someone taking notes.

Functional requirements to nail down:

What types of flags? Boolean kill switches only, or multivariate flags with string/number variants?
What targeting? Simple on/off per environment, or percentage rollouts and per-user rules?
A/B testing and experimentation, or just deployment control?
Admin UI, or API-only?
Audit log of changes?

Non-functional requirements:

Scale: how many services and users will evaluate flags? (Assume 10M DAU, ~100M evaluations/day for this walkthrough.)
Latency: flag evaluation must not meaningfully slow down application requests. Target: sub-millisecond evaluation.
Availability: the application must keep working even when the flag service is unreachable.
Propagation: how quickly must a flag change reach all services? (Target: under 30 seconds.)

With those answers, scope the system: boolean and multivariate flags, percentage rollouts, user targeting, kill-switch support, audit logging, SDK-based evaluation, no A/B statistics engine in scope.

The First Decision: Two Planes, Not One

The most important structural decision in any feature flag architecture is separating the control plane from the data plane.

Two planes, two traffic profiles. Don't let your intern's afternoon flag toggle share a connection pool with your production eval path.

Control plane handles writes: creating and updating flags, targeting rules, and segments. Traffic is low (tens of changes per day). It writes to Postgres and publishes a change event.

Data plane handles reads: serving flag definitions to SDKs and evaluating flags for users. Traffic is enormous by comparison (thousands of evaluations per second at the scale assumed here, and far more in larger deployments). It reads from Redis first and falls back to Postgres.

Separating them lets you scale each independently. The control plane lives comfortably on a single Postgres instance and three API servers. The data plane needs a Redis cluster and a fleet of evaluation servers. These workloads differ by several orders of magnitude, which is another way of saying they should not share infrastructure any more than you'd share a database connection pool between your admin dashboard and your checkout flow.

The Three Tables That Drive It All

Three core entities:

flags
  id, key (unique slug), name, description
  enabled (boolean, master toggle)
  environments: ["production", "staging", "dev"]
  created_by, created_at, updated_at

targeting_rules
  id, flag_id, priority (int, lower = higher priority)
  conditions: [{attribute, operator, value}]
  variation (which variant to serve)
  percentage (0-100, for gradual rollout)

segments
  id, key, name
  conditions: [{attribute, operator, value}]  -- reusable user groups
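
As a rough illustration, the three entities could be modeled like this in Python. The field names follow the listing above; the Condition type, the operator names ("eq", "in"), and every default value are assumptions added for the sketch, not the schema of any particular product.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


def _now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Condition:
    attribute: str   # e.g. "plan" or "country"
    operator: str    # assumed operator names such as "eq" or "in"
    value: Any


@dataclass
class TargetingRule:
    id: int
    flag_id: int
    priority: int                 # lower number = higher priority
    conditions: list[Condition]
    variation: str                # which variant to serve on a match
    percentage: int = 100         # 0-100, for gradual rollout


@dataclass
class Segment:
    id: int
    key: str
    name: str
    conditions: list[Condition]   # reusable user group


@dataclass
class Flag:
    id: int
    key: str                      # unique slug
    name: str
    description: str = ""
    enabled: bool = False         # master toggle, off until someone turns it on
    environments: list[str] = field(
        default_factory=lambda: ["production", "staging", "dev"])
    created_by: str = ""
    created_at: datetime = field(default_factory=_now)
    updated_at: datetime = field(default_factory=_now)


flag = Flag(id=1, key="new_checkout", name="New checkout")
enterprise_rule = TargetingRule(
    id=1, flag_id=flag.id, priority=0,
    conditions=[Condition("plan", "eq", "enterprise")],
    variation="on", percentage=25,
)
```

Defaulting enabled to False in the model is one way to make "off" the state a flag is born in; that choice is an assumption of this sketch.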

Rule evaluation order matters. When an SDK evaluates a flag for a user:

1. Check individual user overrides (highest priority)
2. Walk targeting rules top-to-bottom, return the first match
3. Apply the default percentage rollout to remaining users
4. Fall back to the flag's default variation

First match wins. The evaluation stops the moment a rule fires.
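
The four-step order above can be sketched as a single function. This is a minimal illustration, not a reference implementation: the dictionary keys (overrides, rollout_percentage, rollout_variation, default_variation), the operator names, and the reason strings are all hypothetical, and the rollout bucket is passed in rather than computed so the ordering logic stays visible.

```python
from typing import Any

# Assumed operator set; real systems usually support more.
OPERATORS = {
    "eq": lambda actual, expected: actual == expected,
    "neq": lambda actual, expected: actual != expected,
    "in": lambda actual, expected: actual in expected,
}


def rule_matches(rule: dict, context: dict) -> bool:
    """A rule matches only if every one of its conditions holds."""
    return all(
        cond["attribute"] in context
        and OPERATORS[cond["operator"]](context[cond["attribute"]], cond["value"])
        for cond in rule["conditions"]
    )


def evaluate(flag: dict, context: dict, bucket: int) -> tuple[Any, str]:
    """Return (variation, reason). `bucket` is the user's 0-9999 rollout bucket."""
    # 1. Individual user overrides have the highest priority.
    user = context.get("targetingKey")
    if user in flag.get("overrides", {}):
        return flag["overrides"][user], "override"

    # 2. Walk rules by ascending priority number; the first match wins.
    for rule in sorted(flag.get("rules", []), key=lambda r: r["priority"]):
        if rule_matches(rule, context):
            return rule["variation"], "rule_match"

    # 3. Default percentage rollout for everyone else.
    if bucket < flag.get("rollout_percentage", 0) * 100:
        return flag["rollout_variation"], "rollout"

    # 4. Nothing applied: fall back to the flag's default variation.
    return flag["default_variation"], "default"


flag = {
    "overrides": {"user-42": "on"},
    "rules": [
        {"priority": 2, "variation": "off",
         "conditions": [{"attribute": "country", "operator": "in", "value": ["DE", "FR"]}]},
        {"priority": 1, "variation": "on",
         "conditions": [{"attribute": "plan", "operator": "eq", "value": "enterprise"}]},
    ],
    "rollout_percentage": 25,
    "rollout_variation": "on",
    "default_variation": "off",
}

# An enterprise user in Germany matches both rules; priority 1 is checked first.
print(evaluate(flag, {"targetingKey": "u1", "plan": "enterprise", "country": "DE"}, 9999))
```

Returning a reason alongside the value costs little and makes "why did this user see that?" answerable from logs.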

Percentage rollout uses deterministic hashing. Hash user_id + flag_key with MurmurHash, then % 10000 to get a bucket from 0 to 9999. A 25% rollout serves the variant to buckets 0 to 2499. The same user always lands in the same bucket, so the experience is consistent across sessions and requests.
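
Here is a small sketch of that bucketing. Python's standard library has no MurmurHash, so SHA-256 stands in for it; the bucketing logic is identical whichever hash is used. The function names and the ":" separator between user ID and flag key are assumptions of this example.

```python
import hashlib


def bucket_for(user_id: str, flag_key: str, buckets: int = 10_000) -> int:
    """Map (user, flag) to a stable bucket in [0, buckets)."""
    # SHA-256 is a stand-in; a MurmurHash call would slot in the same way.
    digest = hashlib.sha256(f"{user_id}:{flag_key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def in_rollout(user_id: str, flag_key: str, percentage: float) -> bool:
    """True if the user falls inside a `percentage` (0-100) rollout."""
    return bucket_for(user_id, flag_key) < percentage * 100


users = [f"user-{i}" for i in range(10_000)]
at_10 = {u for u in users if in_rollout(u, "new_checkout", 10)}
at_25 = {u for u in users if in_rollout(u, "new_checkout", 25)}

print(len(at_25) / len(users))   # close to 0.25, though not exactly
print(at_10 <= at_25)            # prints True: widening a rollout never drops a user
```

Including the flag key in the hash input means a user who lands in the first 10% for one flag is not automatically in the first 10% for every other flag.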

Two APIs, Two Traffic Patterns

Two distinct APIs, each optimized for its workload.

Management API (low throughput, admin operations):

POST   /api/v1/flags                    -- create flag
GET    /api/v1/flags/{key}              -- get flag definition
PUT    /api/v1/flags/{key}              -- update flag or rules
DELETE /api/v1/flags/{key}              -- delete flag
GET    /api/v1/flags/{key}/history      -- audit log
POST   /api/v1/flags/{key}/rollback/{version}

Every mutation writes to Postgres and publishes a change event to Redis Pub/Sub. The change propagates to SDKs in under a second via streaming.
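
A minimal sketch of that write path, with in-memory stand-ins: a dictionary plays the role of Postgres and a callback plays the role of Redis Pub/Sub. The ControlPlane class, the event payload shape, and the choice to implement rollback as a new version are assumptions of this example.

```python
import json
import time
from typing import Callable


class ControlPlane:
    """`self.versions` stands in for Postgres, `publish` for Redis Pub/Sub."""

    def __init__(self, publish: Callable[[str, str], None]):
        self.versions: dict[str, list[dict]] = {}   # flag key -> append-only history
        self.publish = publish

    def update_flag(self, key: str, definition: dict, actor: str) -> int:
        history = self.versions.setdefault(key, [])
        version = len(history) + 1
        # Durable write first; the history doubles as the audit log.
        history.append({"version": version, "definition": definition,
                        "actor": actor, "at": time.time()})
        # Then notify the data plane. The event carries only key and version,
        # so subscribers re-read the definition instead of trusting the message.
        self.publish("flag_changes", json.dumps({"key": key, "version": version}))
        return version

    def rollback(self, key: str, version: int, actor: str) -> int:
        # A rollback copies an old definition forward as a new version,
        # so the audit trail is never rewritten.
        old = self.versions[key][version - 1]["definition"]
        return self.update_flag(key, old, actor)


events: list[tuple[str, str]] = []
cp = ControlPlane(publish=lambda channel, message: events.append((channel, message)))
cp.update_flag("new_checkout", {"enabled": True, "percentage": 10}, actor="alice")
cp.update_flag("new_checkout", {"enabled": True, "percentage": 50}, actor="bob")
cp.rollback("new_checkout", version=1, actor="alice")
print(len(cp.versions["new_checkout"]))   # prints 3
```

Writing before publishing matters: if the publish fails, polling still converges on the stored state, whereas publishing first could announce a change that was never saved.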

Evaluation API (high throughput, SDK polling):

GET /api/v1/flags?environment=production
-- Returns full flag + rule set for server-side SDKs to cache locally

POST /api/v1/flags/evaluate
Body: { flag_key, context: { targetingKey, plan, country, ... } }
-- Returns { value, variation, reason } to client-side SDKs (evaluated on the server)

The flag-set response from the GET endpoint is aggressively cached. A Cache-Control: public, max-age=30 header is enough for most CDN setups. Server-side SDKs call that endpoint on a background polling loop, not per-request.

The Evaluation Path: Where Latency Lives

Most candidates lose the interview here. If your design requires a network call for every flag evaluation, you have built a latency tax. Every single API request your services handle now owes 50ms in tribute to your flag service. Congratulations: the feature flag is now the feature.

The fix is local evaluation. Server-side SDKs fetch the full flag and rule set once, cache it in memory, and evaluate entirely in-process. No network call on the hot path. Evaluation time is under one millisecond.

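# What SDK evaluation looks like in-process
import mmh3  # third-party MurmurHash3 binding (pip install mmh3)

local_cache: dict = {}   # flag_key -> flag definition, refreshed in the background
DEFAULT_VALUE = False    # fail-safe default

def is_enabled(flag_key: str, user_context: dict) -> bool:
    flag = local_cache.get(flag_key)
    if flag is None:
        return DEFAULT_VALUE  # flag not in cache (e.g. never fetched), fail safe
    if not flag.enabled:
        return False  # master toggle is off

    # First match wins; a lower priority number is checked first
    for rule in sorted(flag.rules, key=lambda r: r.priority):
        if rule.matches(user_context):
            return rule.variation  # True or False for a boolean flag

    # Deterministic bucket in 0..9999; percentage_rollout is 0-100
    key = user_context["targetingKey"] + flag_key
    bucket = mmh3.hash(key, signed=False) % 10000
    return bucket < flag.percentage_rollout * 100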

The SDK refreshes the local cache two ways:

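Polling (default): A background thread fetches the full flag set every 30 to 60 seconds. Simple, reliable, universally supported. Worst-case propagation lag equals the polling interval (about half that on average), so only the 30-second end of that range meets a 30-second propagation target on polling alone.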

Streaming via SSE (faster propagation): The SDK holds a persistent HTTP connection to the streaming service. When a flag changes, the control plane publishes an event, the streaming service pushes it to all connected SDKs, and the local cache is invalidated immediately. End-to-end propagation under one second.

Use polling as the fallback, streaming as the fast path. Polling ensures eventual consistency if the SSE connection drops.
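
One way the two refresh paths might share a cache is sketched below. The FlagCache class and the fetch_all callback (standing in for an HTTP GET of the full flag set) are hypothetical; the point is that a failed poll keeps the stale snapshot, and a streamed change updates one entry in place.

```python
import threading
from typing import Callable, Optional


class FlagCache:
    """Thread-safe local flag store refreshed by polling and by pushed changes."""

    def __init__(self, fetch_all: Callable[[], dict], poll_seconds: float = 30.0):
        self._fetch_all = fetch_all      # hypothetical: fetches the full flag set
        self._poll_seconds = poll_seconds
        self._flags: dict = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()

    def get(self, key: str) -> Optional[dict]:
        with self._lock:
            return self._flags.get(key)

    def refresh(self) -> bool:
        """One polling cycle. On failure, keep serving the stale snapshot."""
        try:
            snapshot = self._fetch_all()
        except Exception:
            return False
        with self._lock:
            self._flags = dict(snapshot)
        return True

    def apply_change(self, key: str, definition: dict) -> None:
        """Fast path: a streamed event updates a single flag in place."""
        with self._lock:
            self._flags[key] = definition

    def start(self) -> threading.Thread:
        def loop() -> None:
            self.refresh()
            # Event.wait doubles as an interruptible sleep between polls.
            while not self._stop.wait(self._poll_seconds):
                self.refresh()
        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def stop(self) -> None:
        self._stop.set()


service_up = {"ok": True}

def fetch_all() -> dict:
    if not service_up["ok"]:
        raise ConnectionError("flag service unreachable")
    return {"new_checkout": {"enabled": True}}

cache = FlagCache(fetch_all)
cache.refresh()                      # initial fetch succeeds
service_up["ok"] = False
cache.refresh()                      # poll fails, stale data is kept
print(cache.get("new_checkout"))     # prints {'enabled': True}
```

Because the poll replaces the whole snapshot, it also repairs any streamed event that was missed while the connection was down.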

Where the Load Actually Hits

The read workload sounds extreme until you do the arithmetic. 100M evaluations/day is roughly 1,200 evaluations/second on average and perhaps 3,000 at peak. But those are evaluations across all applications, and with local evaluation none of them reach the flag service. The load the service actually sees is SDK polling: if you have 5,000 SDK instances polling every 60 seconds, that's about 83 requests/second to fetch the full flag set. Small. Manageable with a Redis cache in front of Postgres and a few evaluation API servers.
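
The back-of-envelope numbers are easy to reproduce. The 2.5x peak factor below is an assumption chosen to land near the 3,000/second figure; the other inputs come from the requirements.

```python
SECONDS_PER_DAY = 24 * 60 * 60

evaluations_per_day = 100_000_000
avg_eval_per_sec = evaluations_per_day / SECONDS_PER_DAY
peak_eval_per_sec = avg_eval_per_sec * 2.5          # assumed peak-to-average ratio

sdk_instances = 5_000
poll_interval_sec = 60
poll_requests_per_sec = sdk_instances / poll_interval_sec

print(round(avg_eval_per_sec), round(peak_eval_per_sec), round(poll_requests_per_sec))
# prints: 1157 2894 83
```

Halving the polling interval to 30 seconds doubles the polling load to about 167 requests/second, which is still small.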

The fan-out problem. One flag change must notify thousands of connected SDK instances. Redis Pub/Sub handles this well at this scale: the streaming service subscribes to the flag_changes channel, and when an event arrives, it sends an SSE event to every connected SDK. At 5,000 connections, this is fine. At 500,000 connections, you're basically running a small notification platform and need the streaming layer behind a load balancer with sticky sessions or a purpose-built push service. At 5 million connections, you have a different job title.
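
The fan-out loop itself is short. In this sketch a bounded in-memory queue stands in for each SSE connection; the StreamingService class, the flag_change event name, and the policy of dropping slow consumers are assumptions, not a description of any specific product.

```python
import json
import queue


class StreamingService:
    """Fans one change event out to every connected SDK (one queue per connection)."""

    def __init__(self, max_pending: int = 100):
        self._connections: dict[str, queue.Queue] = {}
        self._max_pending = max_pending

    def connect(self, sdk_id: str) -> queue.Queue:
        q: queue.Queue = queue.Queue(maxsize=self._max_pending)
        self._connections[sdk_id] = q
        return q

    def connected(self) -> list[str]:
        return sorted(self._connections)

    def on_flag_change(self, message: str) -> int:
        """Handle one message from the flag_changes channel; return deliveries."""
        frame = f"event: flag_change\ndata: {message}\n\n"   # SSE wire format
        delivered = 0
        for sdk_id, q in list(self._connections.items()):
            try:
                q.put_nowait(frame)
                delivered += 1
            except queue.Full:
                # A slow consumer must not block the others. Drop it and
                # let its polling loop catch up.
                del self._connections[sdk_id]
        return delivered


service = StreamingService(max_pending=1)
fast, slow = service.connect("sdk-fast"), service.connect("sdk-slow")
first = service.on_flag_change(json.dumps({"key": "new_checkout", "version": 2}))
fast.get_nowait()                       # only the fast SDK drains its queue
second = service.on_flag_change(json.dumps({"key": "new_checkout", "version": 3}))
print(first, second)                    # prints: 2 1
```

The work per change is linear in the number of connections, which is why the same loop that is fine at 5,000 connections needs sharding across many streaming nodes at 500,000.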

Client-side SDKs (browsers, mobile) should not do local evaluation: the full rule set would have to ship to the device, and flag rules may contain sensitive business logic or internal email patterns you cannot expose to the client. Instead, client-side SDKs call the evaluation API with the user's context, get back evaluated values, and cache them for the session. This is a read path: add a CDN in front of the evaluation API and use edge computing (Cloudflare Workers, Fastly VCL) to evaluate flags at the edge and cache results by user segment.

Failure mode: flag service is unreachable.

SDKs evaluate from stale local cache indefinitely. Code must provide a default value for every flag evaluation: client.get_bool("new_checkout", default=False). This is not optional. I know it feels like a detail. It is not a detail. It is what keeps your service running at 2am when your flag infrastructure has an outage. The application must degrade gracefully, not crash. Design your flags with fail-safe defaults: a kill switch for a new feature should default to false (feature off), not true.
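
A sketch of what that contract can look like inside an SDK. FlagClient is a hypothetical facade; the only part taken from the text is the get_bool call shape with a required default.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("flags")


class FlagClient:
    """Hypothetical SDK facade: every read takes a caller-supplied default."""

    def __init__(self, evaluate: Callable[[str, dict], object]):
        self._evaluate = evaluate   # in-process evaluation against the local cache

    def get_bool(self, flag_key: str, default: bool,
                 context: Optional[dict] = None) -> bool:
        try:
            value = self._evaluate(flag_key, context or {})
        except Exception:
            # Missing flag, empty cache, malformed rule: never let it escape.
            logger.warning("flag %s failed to evaluate, using default", flag_key)
            return default
        # A non-boolean result is treated as a misconfiguration.
        return value if isinstance(value, bool) else default


def broken_evaluator(flag_key: str, context: dict) -> object:
    raise KeyError(flag_key)        # simulates an empty cache during an outage


client = FlagClient(broken_evaluator)
print(client.get_bool("new_checkout", default=False))   # prints False
```

Making default a required argument, rather than an optional one, forces the decision about failure behavior to be made at every call site.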

The Tradeoffs Worth Arguing Over

Eventual consistency is the right choice here. Strict consistency would require a synchronous round-trip to a central service for every evaluation, which violates your latency requirement. Accept that different SDK instances may see a flag change at different times during the propagation window (up to 60 seconds with polling, under 1 second with SSE).

One consequence: during a rollout, a single user's request may touch multiple services, and they may evaluate the same flag differently if one has the new state and another does not. For critical flows (e.g., payment processing), evaluate the flag once at the entry point and pass the evaluated decision downstream as a header, rather than re-evaluating at each service.
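
A minimal sketch of that hand-off, using plain dictionaries for headers. The X-Flag-Decisions header name and the JSON encoding are assumptions; a real system would also need to strip this header from external traffic so clients cannot forge decisions.

```python
import json
from typing import Callable

DECISIONS_HEADER = "X-Flag-Decisions"   # hypothetical header name


def attach_decisions(headers: dict, decisions: dict) -> dict:
    """Entry point: evaluate once, then forward the decisions downstream."""
    out = dict(headers)
    out[DECISIONS_HEADER] = json.dumps(decisions, sort_keys=True)
    return out


def decision_for(headers: dict, flag_key: str,
                 local_eval: Callable[[str], bool]) -> bool:
    """Downstream service: prefer the forwarded decision over re-evaluating."""
    try:
        forwarded = json.loads(headers.get(DECISIONS_HEADER, "{}"))
    except ValueError:
        forwarded = {}
    if isinstance(forwarded, dict) and isinstance(forwarded.get(flag_key), bool):
        return forwarded[flag_key]
    return local_eval(flag_key)   # nothing usable was forwarded


gateway_headers = attach_decisions({"X-Request-Id": "abc"}, {"new_checkout": True})
# The payment service has not seen the flag change yet, so locally it says False.
print(decision_for(gateway_headers, "new_checkout", lambda key: False))   # prints True
```

The downstream fallback to local evaluation keeps requests that bypass the gateway, such as internal batch jobs, working without special cases.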

Redis vs. Postgres as primary storage. Use Postgres as the source of truth and Redis as the read-through cache. Postgres gives you transactions, relational joins for complex rule queries, and audit log integrity. Redis gives you low-latency flag lookups for the polling endpoint. This is the same pattern as caching strategies for system design: write to the database, warm the cache on read.
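
The read path can be sketched with a dictionary standing in for Redis and a callback standing in for the Postgres query. The ReadThroughCache class, its 30-second TTL, and the invalidate-on-write hook are assumptions of this example.

```python
import time
from typing import Callable, Optional


class ReadThroughCache:
    """Dict-backed stand-in for Redis in front of a slower source of truth."""

    def __init__(self, load: Callable[[str], Optional[dict]],
                 ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self._load = load            # stand-in for a Postgres query
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._entries.get(key)
        if entry and entry[0] > self._clock():
            return entry[1]                          # cache hit
        value = self._load(key)                      # miss or expired: hit the database
        if value is not None:
            self._entries[key] = (self._clock() + self._ttl, value)   # warm on read
        return value

    def invalidate(self, key: str) -> None:
        """Call after a write so the next read reloads from the source of truth."""
        self._entries.pop(key, None)


db_reads: list[str] = []

def load_from_db(key: str) -> Optional[dict]:
    db_reads.append(key)
    return {"key": key, "enabled": True}

cache = ReadThroughCache(load_from_db)
cache.get("new_checkout")            # miss: loads from the database
cache.get("new_checkout")            # hit: served from the cache
cache.invalidate("new_checkout")     # a flag write happened
cache.get("new_checkout")            # miss again: reloads
print(len(db_reads))                 # prints 2
```

The TTL bounds staleness even if an invalidation is lost, which is the same belt-and-braces idea as keeping polling behind streaming.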

SSE vs. WebSockets for streaming. Feature flags are server-to-client communication. SSE is the right tool: standard HTTP, natively supported by browsers, simple to scale. WebSockets are bidirectional and add complexity you do not need. See WebSockets vs Long Polling vs SSE for the full tradeoff breakdown.

The 45-Minute Plan

Rough allocation that covers all the ground. The clock moves faster than it feels in the room.

0-5 min: Clarify requirements, scope the system, state assumptions out loud
5-15 min: High-level architecture, control vs data plane, draw the boxes
15-25 min: Data model, targeting rule evaluation, deterministic bucketing
25-35 min: API design, local vs remote evaluation, SDK lifecycle
35-42 min: Scaling, fan-out, failure modes, client-side
42-45 min: Tradeoffs, what you would do differently at 10x scale

If you hit 40 minutes and have not mentioned local evaluation, streaming updates, and the failure mode, you are missing the depth the interviewer is looking for. These three topics separate the "knows feature flags exist" answer from the "could build this" answer.

Before You Stop Drawing

Separate the control plane (low-volume writes) from the data plane (high-volume reads) for independent scaling
Server-side SDKs use local evaluation: download the full rule set, evaluate in-process, sub-millisecond latency
Polling every 30 to 60 seconds for eventual consistency; SSE streaming for sub-second propagation
Deterministic hashing (MurmurHash + modulo) for consistent, reproducible percentage rollouts
Default values on every flag evaluation are not optional. They are your resilience strategy.
Accept eventual consistency: evaluate flags once at the request entry point for consistency-critical flows
Redis cache in front of Postgres for the SDK polling endpoint; CDN + edge evaluation for client-side
