Spotify System Design Interview: Audio at 600 GB/s and the Licensing Trap

Most engineers walk into a Spotify system design interview assuming it's just a slower Netflix. Same problem, different content type. So they design the same architecture: upload video, transcode it, serve it from a CDN, done.

The analogy falls apart quickly. Audio files are tiny compared to video. The catalog is effectively append-only: a song's audio never changes after upload. Peak concurrent streams run into the tens of millions. And the genuinely hard problems aren't about delivery at all. They are metadata at scale, discovery, regional licensing, and keeping latency under 200ms when someone hits play from São Paulo at 11pm on a Friday. (If you want the video version of this problem, see our YouTube system design walkthrough. The storage and licensing pieces diverge significantly.)

Here is how to walk through this in 45 minutes and say something worth listening to.

Figure: A Google recruiter describes watching a candidate panic-close a Spotify tab and 14 Stack Overflow windows when they realized their screen was already being shared. The most relatable Spotify interview prep story on the internet.

Don't be this candidate. Actually, be this candidate if it means you recovered and aced the round.

Pin Your Scope Before You Touch Architecture (Minutes 0-5)

Start with what you're actually building. Spotify has podcasts, audiobooks, video clips, social features, and a creator platform. None of that matters in a 45-minute interview unless your interviewer asks for it.

Pin your scope to five things:

Users can stream music on demand
Users can browse catalogs, search, and create playlists
Users get personalized recommendations
Content is available or blocked by region based on licensing
Premium users get higher-quality audio; free users get ads

What you are NOT designing: upload/ingestion pipelines, podcast hosting, billing, ads serving. Call these out and move on.

The Numbers That Shape Every Design Decision (Minutes 5-10)

Spotify's public numbers from Q3 2025: 713 million monthly active users, 281 million premium subscribers, 100 million+ tracks, 180+ markets.

For the interview, work with these:

Metric	Number
MAU	700M
DAU (assume ~15%)	100M
Peak concurrent streams	30M
Catalog size	100M tracks
Avg track size (160 kbps, 4 min)	~5 MB
Audio storage (3 quality tiers)	~1.5 PB
Peak streaming throughput	~600 GB/s
Metadata reads (peak)	500K-1M/sec

The dominant load is reads, not writes. Song metadata rarely changes. A track's audio file never changes after upload. Optimize your design for read throughput above everything else.

One back-of-envelope that interviewers love: if 30 million users stream simultaneously at 160 kbps, that's 30M × 160,000 bits/sec = 4.8 terabits per second, or about 600 GB/s. No single origin handles that. The CDN is doing the real work. The origin server is somewhere taking a nap.
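
The arithmetic above, plus the table's storage row, can be reproduced in a few lines of Python. This is a sketch in decimal units; the 4-minute average track and the 96/160/320 kbps tier set are assumptions, not Spotify figures. With those three tiers, storage lands near 1.7 PB rather than the table's ~1.5 PB, which appears to treat every tier as roughly the size of the 160 kbps encode.

```python
# Back-of-envelope sizing in decimal units (1 GB = 10**9 bytes).
# Assumptions: 4-minute average track; three stored tiers at 96/160/320 kbps.
PEAK_CONCURRENT_STREAMS = 30_000_000
STREAM_BITRATE_BPS = 160_000
TRACK_SECONDS = 4 * 60
CATALOG_TRACKS = 100_000_000


def peak_egress_bits_per_sec(streams: int = PEAK_CONCURRENT_STREAMS,
                             bitrate_bps: int = STREAM_BITRATE_BPS) -> int:
    """Aggregate CDN egress if every concurrent stream plays at one bitrate."""
    return streams * bitrate_bps


def track_size_bytes(bitrate_bps: int = STREAM_BITRATE_BPS,
                     seconds: int = TRACK_SECONDS) -> int:
    """Size of one encoded track: bits per second times duration, over 8."""
    return bitrate_bps * seconds // 8


def catalog_storage_bytes(tier_bitrates_bps: list,
                          tracks: int = CATALOG_TRACKS) -> int:
    """Total audio storage if every track is stored once per bitrate tier."""
    return tracks * sum(track_size_bytes(b) for b in tier_bitrates_bps)


if __name__ == "__main__":
    egress = peak_egress_bits_per_sec()
    print(f"peak egress: {egress / 1e12:.1f} Tbps = {egress / 8 / 1e9:.0f} GB/s")
    print(f"one 160 kbps track: {track_size_bytes() / 1e6:.1f} MB")
    tiers = [96_000, 160_000, 320_000]
    print(f"storage, 3 tiers: {catalog_storage_bytes(tiers) / 1e15:.2f} PB")
```

Running it prints 4.8 Tbps = 600 GB/s, a 4.8 MB track, and 1.73 PB for three tiers.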

Six Services, One Event Bus (Minutes 10-20)

Figure: Spotify system architecture showing six microservices fanning out from an API gateway, connected by a Kafka event bus to four storage layers and a Fastly CDN. Six services, one event bus. The Streaming Service (amber) is the only one that touches the CDN URL path.

Start with six services. You can decompose further if asked.

User Service handles authentication, profiles, and subscription tier. Spotify ran on PostgreSQL here for years, then migrated to Cassandra in 2015 when the user table grew past what a single Postgres cluster could handle.

Catalog Service owns track metadata, albums, and artists. This is a read-heavy, write-rare service. A song's metadata changes maybe once every few months (if a label re-tags it). Cache aggressively: Redis for common metadata queries, the CDN for cover art. The caching strategy here follows the same cache-aside pattern described in the distributed cache system design guide.
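
A minimal cache-aside sketch in Python, assuming a plain dict in place of Redis and a hypothetical `loader` callback that reads the catalog database. The application checks the cache first, reads through to the database on a miss, and populates the cache itself; the TTL bounds staleness for the rare metadata edit.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class CacheAside:
    """Cache-aside reads with a TTL. A plain dict stands in for Redis."""

    def __init__(self, loader: Callable[[str], Optional[dict]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader        # reads the source of truth (the catalog database)
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._store: Dict[str, Tuple[float, dict]] = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[dict]:
        entry = self._store.get(key)
        if entry is not None and entry[0] > self._clock():
            return entry[1]                       # fresh hit: no database read
        value = self._loader(key)                 # miss or expired: read through
        if value is not None:
            self._store[key] = (self._clock() + self._ttl, value)
        return value

    def invalidate(self, key: str) -> None:
        """Drop a key, e.g. when a catalog-update event arrives."""
        self._store.pop(key, None)
```

On a hit the database is never touched; on a miss the loaded value is kept until the TTL lapses or `invalidate()` is called by whatever consumes catalog-update events (not shown here).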

Search Service runs on Elasticsearch. When you type "Daft Punk" you're hitting a pre-indexed inverted index, not querying the Catalog Service live.
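
The idea of hitting a pre-built index instead of the Catalog Service can be sketched with a toy inverted index in Python. Elasticsearch (via Lucene) layers analyzers, relevance scoring, and sharding on top of this structure; the class below illustrates the concept and is not its API.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase and split into word tokens (Unicode-aware, so 'Beyoncé' stays whole)."""
    return re.findall(r"\w+", text.lower())


class InvertedIndex:
    """Toy inverted index mapping each token to the set of track IDs containing it."""

    def __init__(self) -> None:
        self._postings: Dict[str, Set[str]] = defaultdict(set)

    def add(self, track_id: str, *fields: str) -> None:
        """Index a track's searchable fields (title, artist, album, ...)."""
        for field in fields:
            for token in tokenize(field):
                self._postings[token].add(track_id)

    def search(self, query: str) -> Set[str]:
        """AND semantics: tracks whose indexed fields contain every query token."""
        tokens = tokenize(query)
        if not tokens:
            return set()
        result = set(self._postings.get(tokens[0], ()))
        for token in tokens[1:]:
            result &= self._postings.get(token, set())
        return result
```

A query for "Daft Punk" becomes two set lookups and an intersection, with no scan of the catalog.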

Playlist Service handles create, read, update, delete for playlists. High write rate relative to Catalog, but still read-dominated. Cassandra works well here given its write throughput and flexible schema.

Streaming Service is the coordinator. It doesn't actually deliver bytes. It tells the client where to fetch audio from. The CDN does the delivery.

Recommendations Service is the most computationally complex. It runs offline batch jobs (Apache Beam/Scio on GCP) to build taste profiles, and a real-time layer fed by Kafka events.

All services run on Google Cloud Platform via GKE. Spotify migrated from on-premises infrastructure and a homegrown container orchestration system called Helios to Kubernetes. Today they run 1,600+ production services on Kubernetes across the fleet.

The Data Model Hides the Licensing Trick (Minutes 20-27)

Figure: Spotify data model entity-relationship diagram showing users, artists, albums, tracks, playlists, playlist_tracks, and play_events tables, with available_markets highlighted on tracks. Seven tables. The amber available_markets field on tracks is doing more regulatory heavy lifting than it looks.

Keep it simple. Expand on request.

users (user_id UUID, email TEXT, country TEXT, tier ENUM, created_at TIMESTAMP)

artists (artist_id UUID, name TEXT, genres TEXT[], bio TEXT)

albums (album_id UUID, artist_id UUID, title TEXT, release_date DATE, type ENUM)

tracks (track_id UUID, album_id UUID, title TEXT, duration_ms INT,
        audio_url TEXT, explicit BOOL, available_markets TEXT[])

playlists (playlist_id UUID, owner_id UUID, name TEXT, privacy ENUM,
           created_at TIMESTAMP, updated_at TIMESTAMP)

playlist_tracks (playlist_id UUID, track_id UUID, position INT,
                 added_at TIMESTAMP, added_by UUID)

play_events (user_id UUID, track_id UUID, played_at TIMESTAMP,
             ms_played INT, context TEXT)

Three decisions worth calling out:

available_markets on the track is a denormalized array. When a user in Germany hits play, the Streaming Service checks whether "DE" is in that array before returning a CDN URL. In this design, regional licensing enforcement is two steps: IP geolocation at the API gateway yields the user's country code, then a filter on the track runs before any URL is returned. Simple, cheap, and legally defensible. A join to a separate licensing table on every play, under 30 million concurrent streams, would be neither simple nor cheap.
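
A sketch of the play-time check in Python. `Track`, `authorize_play`, and `NotAvailableInMarket` are hypothetical names, and loading available_markets into a frozenset when the track record is read is a design assumption.

```python
from dataclasses import dataclass
from typing import FrozenSet


class NotAvailableInMarket(Exception):
    """Raised when a track is not licensed in the listener's country."""


@dataclass(frozen=True)
class Track:
    track_id: str
    # Denormalized licensing data: ISO 3166-1 alpha-2 country codes such as "DE".
    available_markets: FrozenSet[str]


def authorize_play(track: Track, user_country: str) -> None:
    """Raise unless the track is licensed where the user is.

    user_country is assumed to come from IP geolocation at the API gateway.
    A frozenset makes the membership check O(1); even a list scan over
    ~180 codes would be cheap, and neither needs a join.
    """
    if user_country.upper() not in track.available_markets:
        raise NotAvailableInMarket(
            f"{track.track_id} is not licensed in {user_country.upper()}")
```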

play_events goes to Kafka first, then lands in Cassandra for the recommendations pipeline. You do not write individual plays directly to PostgreSQL under 30 million concurrent streams. Kafka absorbs the write spike and fans out to analytics, recommendations, and the play count aggregator.
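
The write rate is worth stating: if each of 30 million concurrent streams finishes a roughly 4-minute track, that is 30,000,000 / 240 ≈ 125,000 play events per second before skips and saves are counted. Kafka spreads that load across partitions. The Python sketch below shows keying events by user_id so one listener's history stays ordered for downstream consumers; the partition count, the event shape, and the crc32 hash are assumptions.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

NUM_PARTITIONS = 64  # hypothetical partition count for a play-events topic


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Stable key -> partition mapping.

    Kafka's Java producer hashes record keys with murmur2; crc32 is a stand-in
    with the property that matters here: the same user always maps to the same
    partition, so that user's events stay in order.
    """
    return zlib.crc32(user_id.encode("utf-8")) % num_partitions


def route(events: List[dict], num_partitions: int = NUM_PARTITIONS) -> Dict[int, List[dict]]:
    """Group play events by partition, preserving arrival order within each one."""
    partitions: Dict[int, List[dict]] = defaultdict(list)
    for event in events:
        partitions[partition_for(event["user_id"], num_partitions)].append(event)
    return dict(partitions)
```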

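audio_url stores the track's storage location, not a link handed to clients. At play time the Streaming Service turns it into a signed, time-limited CDN URL that the client uses directly. The origin servers never serve audio bytes to end users.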

How Audio Actually Gets to Your Ears (Minutes 27-35)

Most candidates gloss over this section. Walk through the flow precisely.

Figure: Spotify audio play request sequence diagram showing seven steps from client click through the API gateway and Streaming Service to a CDN edge delivering 512 KB audio chunks. Seven steps, click to sound. Notice that after step 5, the Streaming Service is completely out of the picture.

When a user clicks play, the client sends a request to the Streaming Service. The service validates the user's subscription, checks the track's available_markets against the user's country, generates a signed CDN URL (time-limited, tied to user ID and track ID), and returns it. The client never sees the raw storage URL.
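
A sketch of the signing step in Python, using an HMAC-SHA256 token with an expiry. This is a generic scheme for illustration: production CDNs, Fastly included, define their own token formats, and the key, the parameter names (`u`, `t`, `exp`, `sig`), and the 300-second TTL here are assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical secret shared between the Streaming Service and the CDN edge.
SIGNING_KEY = b"replace-with-a-managed-secret"


def _signature(path: str, user_id: str, track_id: str, expires: int) -> str:
    payload = f"{path}|{user_id}|{track_id}|{expires}".encode("utf-8")
    return hmac.new(SIGNING_KEY, payload, hashlib.sha256).hexdigest()


def sign_audio_url(cdn_origin: str, storage_path: str, user_id: str, track_id: str,
                   ttl_seconds: int = 300, now: Optional[float] = None) -> str:
    """Mint a time-limited URL bound to one user and one track.

    cdn_origin is scheme + host only (e.g. "https://audio.example.net");
    storage_path is the stored audio_url path and must start with "/".
    """
    expires = int((time.time() if now is None else now) + ttl_seconds)
    sig = _signature(storage_path, user_id, track_id, expires)
    query = urlencode({"u": user_id, "t": track_id, "exp": expires, "sig": sig})
    return f"{cdn_origin}{storage_path}?{query}"


def verify_audio_url(url: str, now: Optional[float] = None) -> bool:
    """Edge-side check: reject expired, malformed, or tampered URLs."""
    parsed = urlparse(url)
    params = {k: v[0] for k, v in parse_qs(parsed.query).items()}
    try:
        expires = int(params["exp"])
        expected = _signature(parsed.path, params["u"], params["t"], expires)
    except (KeyError, ValueError):
        return False
    if (time.time() if now is None else now) > expires:
        return False
    # Constant-time comparison avoids leaking how many signature bytes matched.
    return hmac.compare_digest(expected.encode(), params.get("sig", "").encode())
```

One operational consequence: because every signed URL is unique, the CDN should verify the token and then drop the query string from its cache key. Otherwise each listener's URL becomes a separate cache entry and the hit rate collapses.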

The client then fetches audio directly from the nearest CDN edge using HTTP range requests. A typical chunk is 512 KB. At 160 kbps, that's about 26 seconds of audio. The client fetches the first chunk, starts playback, and pre-fetches the next chunk in the background. If the user skips forward, the client requests a different byte range from the CDN. Once the edge has the file cached, the origin is not hit again for that content.
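
The chunk arithmetic, sketched in Python. It assumes a constant bitrate so that byte offset is proportional to playback time; Ogg Vorbis is variable-bitrate in practice, so a real client would consult a seek table. Snapping requests to 512 KB boundaries is also an assumption, chosen so that different listeners ask the CDN for identical ranges.

```python
CHUNK_BYTES = 512 * 1024  # 512 KB chunks, as in the text


def seconds_per_chunk(bitrate_bps: int, chunk_bytes: int = CHUNK_BYTES) -> float:
    """Playback time covered by one chunk at a constant bitrate."""
    return chunk_bytes * 8 / bitrate_bps


def range_header_for(position_s: float, bitrate_bps: int, file_size: int,
                     chunk_bytes: int = CHUNK_BYTES) -> str:
    """HTTP Range header for the chunk that contains playback position position_s."""
    offset = int(position_s * bitrate_bps / 8)         # CBR: bytes grow linearly with time
    start = (offset // chunk_bytes) * chunk_bytes      # snap to a chunk boundary
    if start >= file_size:
        raise ValueError("position is past the end of the file")
    end = min(start + chunk_bytes, file_size) - 1      # Range end offsets are inclusive
    return f"bytes={start}-{end}"


if __name__ == "__main__":
    print(round(seconds_per_chunk(160_000), 1))        # 26.2
    print(range_header_for(30, 160_000, 4_800_000))    # bytes=524288-1048575
```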

Spotify uses Ogg Vorbis encoding at 96/128/160/320 kbps depending on platform and subscription tier. Since September 2025, premium users also get lossless FLAC at up to 24-bit/44.1 kHz. The web player uses AAC. Each track is stored pre-encoded at multiple bitrates. The client asks for a quality level, and the Streaming Service caps it at what the user's subscription allows before signing the CDN link.
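
One way to express that cap, sketched in Python. The quality ladder is simplified (the 128 kbps and AAC encodes are omitted), and both the free-tier ceiling of 160 kbps and the encoding labels are assumptions for illustration.

```python
# Simplified quality ladder, lowest to highest (labels are hypothetical).
QUALITY_LADDER = ["vorbis_96k", "vorbis_160k", "vorbis_320k", "flac_24bit_44k"]
# Assumed ceilings: free listeners top out at 160 kbps; premium can reach lossless.
MAX_QUALITY = {"free": "vorbis_160k", "premium": "flac_24bit_44k"}


def select_encoding(tier: str, requested: str) -> str:
    """Return the requested encoding, capped at the best one the tier allows."""
    if requested not in QUALITY_LADDER:
        raise ValueError(f"unknown encoding {requested!r}")
    if tier not in MAX_QUALITY:
        raise ValueError(f"unknown subscription tier {tier!r}")
    cap = QUALITY_LADDER.index(MAX_QUALITY[tier])
    return QUALITY_LADDER[min(QUALITY_LADDER.index(requested), cap)]
```

For example, `select_encoding("free", "vorbis_320k")` returns `"vorbis_160k"`, and that capped choice is what would then be signed.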

The CDN strategy runs through Fastly. Spotify standardized their multi-CDN delivery behind Fastly's edge cloud for audio, images, and client update packages. Over 60 engineering squads use this shared CDN configuration.

Two metrics Spotify tracks obsessively: playback latency (click-to-sound, target under 200ms) and stutter rate (buffer underruns during playback). Spotify tracks stutter rate the way you track open PRs: constantly, and with quiet dread. They reduced stutter by adopting BBR (Bottleneck Bandwidth and Round-trip propagation time) congestion control at the transport layer, as detailed in a 2018 engineering post.

Keep Recommendations Decoupled (Minutes 35-40)

Figure: Spotify recommendations event pipeline showing Kafka fan-out to a real-time Cassandra consumer and a batch Apache Beam/Scio consumer on GCP producing ML models. One Kafka topic, two consumers, three features. If the batch job falls behind, nobody goes silent.

When you listen to, skip, replay, or save a track, that event hits Kafka within milliseconds. Two parallel pipelines consume from that topic.

The real-time pipeline updates your taste profile in Cassandra immediately. That profile feeds features on different cadences: the radio feature reads it in real time, while Daily Mix (regenerated daily) and Discover Weekly (generated once per week) are produced by batch jobs.
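
What "updates your taste profile immediately" might look like for one user, sketched in Python as decayed genre affinities updated one event at a time. Every weight, the one-week half-life, and the class itself are hypothetical; a real consumer would write the result to Cassandra after each event.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Hypothetical signal weights: saves and replays say more than a play; skips count against.
EVENT_WEIGHTS = {"play": 1.0, "replay": 2.0, "save": 3.0, "skip": -1.0}
HALF_LIFE_S = 7 * 24 * 3600  # hypothetical: a signal loses half its weight per week


class TasteProfile:
    """Per-user genre affinities with exponential time decay."""

    def __init__(self) -> None:
        self.scores: Dict[str, float] = defaultdict(float)
        self.last_update: Optional[float] = None

    def apply(self, event_type: str, genres: List[str], ts: float) -> None:
        """Decay existing scores to time ts, then add this event's weight."""
        if self.last_update is not None and ts > self.last_update:
            decay = 0.5 ** ((ts - self.last_update) / HALF_LIFE_S)
            for genre in self.scores:
                self.scores[genre] *= decay
        # Late (out-of-order) events are added at full weight: a simplification.
        if self.last_update is None or ts > self.last_update:
            self.last_update = ts
        for genre in genres:
            self.scores[genre] += EVENT_WEIGHTS[event_type]

    def top(self, n: int) -> List[str]:
        """The n genres with the highest current affinity."""
        return sorted(self.scores, key=lambda g: self.scores[g], reverse=True)[:n]
```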

The batch pipeline runs on Apache Beam via Scio (Spotify's Scala API for Beam) on GCP. This computes collaborative filtering models, content-based embeddings from audio features and lyrics, and the genre/mood taxonomy. The recommendation engine is the most computationally expensive part of Spotify's stack, maintained by more than 2,000 engineers across multiple ML teams.
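
Collaborative filtering is the easiest of these to show in miniature. The Python sketch below computes item-to-item cosine similarity over binary listening data, one classic form of the technique. It is all-pairs and in-memory, whereas a batch job at catalog scale would be distributed (for example as a Beam pipeline) and would typically use matrix factorization or approximate nearest-neighbor search instead. Function names and the input shape are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def item_similarities(plays: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Cosine similarity between tracks over binary user-by-track play vectors.

    plays maps user_id -> set of track_ids that user played.
    sim(a, b) = |listeners(a) & listeners(b)| / sqrt(|listeners(a)| * |listeners(b)|)
    """
    listeners: Dict[str, Set[str]] = defaultdict(set)
    for user, tracks in plays.items():
        for track in tracks:
            listeners[track].add(user)
    ordered = sorted(listeners)
    sims: Dict[Tuple[str, str], float] = {}
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            overlap = len(listeners[a] & listeners[b])
            if overlap:  # skip pairs with no shared listeners
                score = overlap / math.sqrt(len(listeners[a]) * len(listeners[b]))
                sims[(a, b)] = sims[(b, a)] = score
    return sims


def similar_tracks(sims: Dict[Tuple[str, str], float], track: str, n: int) -> List[str]:
    """The n tracks most similar to `track`, ties broken alphabetically."""
    scored = [(score, b) for (a, b), score in sims.items() if a == track]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [b for _, b in scored[:n]]
```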

The key point to land is the separation between the event bus (Kafka), the real-time profile store (Cassandra), and the batch compute layer. These are decoupled on purpose. If the recommendation batch job falls behind, users still get music. The system degrades gracefully.

Where the Design Actually Breaks (Minutes 40-45)

Figure: Consistency trade-off spectrum for Spotify services, showing CP services like auth and licensing on the left and AP services like search and playlists on the right. CP where the cost of being wrong is a legal fine. AP where the cost of being wrong is a mildly awkward playlist.

CDN hit rate is the most important operational metric. If the CDN miss rate climbs, audio requests fall back to origin. With 30 million concurrent streams, even a 0.1% miss rate means 30,000 streams pulling audio from origin at once, roughly 4.8 Gbps of origin egress at 160 kbps. That is a wake-the-on-call moment, not a log-a-ticket moment. Warm the CDN for popular content proactively. The top 1% of tracks by play count account for a disproportionate share of all streams. Pre-populate those at the CDN edge on a schedule.
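
The miss-rate arithmetic and the pre-warm list both fit in a short Python sketch. Treating every missed stream as a steady 160 kbps pull on origin is a simplification, the 1% cutoff mirrors the text rather than any tuned value, and how the list is pushed to edges is CDN-specific and not shown.

```python
import heapq
from typing import Dict, List, Tuple

STREAM_BITRATE_BPS = 160_000


def origin_load(concurrent_streams: int, miss_rate: float,
                bitrate_bps: int = STREAM_BITRATE_BPS) -> Tuple[int, float]:
    """Streams falling through to origin, and the origin egress they add in Gbps."""
    missed = round(concurrent_streams * miss_rate)
    return missed, missed * bitrate_bps / 1e9


def tracks_to_prewarm(play_counts: Dict[str, int], fraction: float = 0.01) -> List[str]:
    """Top `fraction` of tracks by play count, most-played first."""
    k = max(1, int(len(play_counts) * fraction))
    top = heapq.nlargest(k, play_counts.items(), key=lambda kv: kv[1])
    return [track_id for track_id, _ in top]
```

With the table's numbers, `origin_load(30_000_000, 0.001)` returns `(30000, 4.8)`.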

Playlists are AP by design. If two clients add a track to a shared playlist simultaneously, you might end up with both tracks, or one write might silently overwrite the other under Cassandra's last-write-wins, depending on how the rows are keyed. This is acceptable: a playlist with a duplicate or missing track is not a catastrophic failure. You do not want synchronous consensus for every playlist write at this scale. Call this out and explain why. The push vs. pull and AP vs. CP consistency framing in The Tradeoff Maze gives you the vocabulary to do that cleanly.
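
A Python sketch of why the row key decides between "one" and "both". Cassandra resolves concurrent writes to the same cell by write timestamp; the tie-break below (greater value wins) approximates its behavior and should be read as a stand-in. The row layouts are hypothetical: keying playlist entries by position lets concurrent adds collide, while adding a unique entry ID keeps both.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List


@dataclass(frozen=True)
class Cell:
    value: str      # the track_id stored in this playlist slot
    write_ts: int   # write timestamp in microseconds, as Cassandra uses


def lww_merge(a: Cell, b: Cell) -> Cell:
    """Last-write-wins: the later timestamp survives.

    On an exact tie, the greater value wins here, approximating Cassandra's
    deterministic tie-break so every replica converges on the same cell.
    """
    if a.write_ts != b.write_ts:
        return a if a.write_ts > b.write_ts else b
    return a if a.value >= b.value else b


def apply_writes(writes: List[dict], row_key: Callable[[dict], Hashable]) -> Dict[Hashable, Cell]:
    """Replay writes into a table keyed by row_key, merging collisions with LWW."""
    table: Dict[Hashable, Cell] = {}
    for w in writes:
        key, cell = row_key(w), Cell(w["track_id"], w["ts"])
        table[key] = lww_merge(table[key], cell) if key in table else cell
    return table
```

Because the merge depends only on timestamps and values, replaying the same writes in any order yields the same table, which is what lets replicas converge without coordination.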

Audio and metadata have different cache TTLs for a reason. Audio files are immutable after upload, so cache them at the CDN with very long TTLs. Metadata (track title, artist name, album art) changes rarely, but it does change. Use shorter TTLs, and build a cache invalidation pathway through Kafka events when a Catalog Service update happens.
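
A sketch of the two policies as HTTP Cache-Control values, in Python. The one-year and five-minute TTLs are assumptions; `immutable` is a standard Cache-Control extension that tells clients not to revalidate. The event shape and cache-key names in `keys_to_invalidate` are hypothetical, standing in for whatever a Catalog Service update event would carry.

```python
from typing import List

AUDIO_TTL_S = 365 * 24 * 3600   # hypothetical: effectively forever for immutable files
METADATA_TTL_S = 300            # hypothetical: bounds staleness if an invalidation is missed


def cache_control_for(content_class: str) -> str:
    """Cache-Control header value per content class."""
    if content_class == "audio":
        return f"public, max-age={AUDIO_TTL_S}, immutable"
    if content_class == "metadata":
        return f"public, max-age={METADATA_TTL_S}"
    raise ValueError(f"unknown content class {content_class!r}")


def keys_to_invalidate(event: dict) -> List[str]:
    """Cache keys to purge for a (hypothetical) catalog-update event."""
    keys = [f"track:{event['track_id']}"]
    if event.get("album_id"):
        keys.append(f"album:{event['album_id']}")   # album pages list track titles
    return keys
```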

Search can be eventually consistent. Elasticsearch indexes are rebuilt from the Catalog Service on a schedule, not in real time. A new track might take a few minutes to appear in search after ingestion. Frame this as an intentional choice: eventual consistency on search is fine because the user isn't searching for content they just uploaded.

Licensing enforcement touches every play request, so keep it cheap. Every track query that reaches the client goes through a country filter. That filter runs on every play request, forever. Keep available_markets denormalized on the track record so this is a single array lookup, not a join to a licensing table under load.

The Spotify System Design Interview Clock

Block	Minutes	Deliverable
Scope + requirements	0-5	Pinned features, explicit exclusions
Scale estimation	5-10	Key numbers, read-dominated conclusion
High-level architecture	10-20	Six services, storage choices, Kafka
Data model	20-27	Seven tables, licensing via markets array
Audio streaming deep dive	27-35	CDN flow, range requests, encoding tiers
Recommendations pipeline	35-40	Kafka, real-time vs batch, Cassandra
Bottlenecks + tradeoffs	40-45	CDN hit rate, AP/CP decisions, search lag

The instinct to rush to architecture is common. Spend the first five minutes scoping. Interviewers are evaluating whether you ask clarifying questions before designing, and whether you can identify that the licensing constraint shapes half your data model.

If you want to practice explaining this out loud under time pressure and get rubric-based feedback on your communication, SpaceComplexity runs voice-based system design mocks with instant scoring on clarity, completeness, and tradeoff reasoning.

What to Take Away

Spotify's dominant challenge is read throughput, not write throughput. Design for that.
Clients never fetch audio from origin. Your services hand the client a signed URL and step aside. The CDN does the actual work.
Licensing enforcement lives as a country code filter on every track record, checked at play time.
Kafka decouples user events from storage. It absorbs the play-event firehose from 30M concurrent streams so your databases can ingest at their own pace.
Playlists are AP by design. Eventual consistency is the right call there.
The recommendations system is batch-plus-real-time, not one or the other.
Two metrics that matter operationally: playback latency and stutter rate.