Design an Image and Video Upload Service: The 45-Minute Walkthrough

Uploading a photo feels simple. You've done it a thousand times on your phone. You tap, you wait two seconds, done. The system design interview version is much harder than it looks, because uploading a video to YouTube is nothing like uploading a photo to Instagram. One is a fast write followed by a CDN serve. The other kicks off a processing job involving GPU clusters, adaptive streaming manifests, and a moderation pipeline. Treat them as the same problem and you'll design the wrong system.

This is the complete walkthrough: requirements, architecture, data model, API design, bottlenecks, and how to pace the 45 minutes.

Scope Before You Draw Anything

The question hides enormous ambiguity. Before touching the whiteboard, clarify two things.

Upload only, or full end-to-end playback? Upload gets you to object storage. Playback adds CDN, transcoding, and adaptive bitrate. They're different scopes.

What media types? Images can be served as-is the moment the upload finishes. Videos need transcoding before they're playable. If the interviewer wants both, you've got two separate write pipelines to design.

For this walkthrough, scope it as: users upload images and videos; other users can view them. Think Instagram for photos, YouTube for videos. 100M users, 1M uploads per day, 5% video.

Back-of-napkin numbers: 1M uploads per day. 950K images at 3MB average = 2.85TB of new image data daily. 50K videos at 300MB average = 15TB of new video. Storage grows fast. CDN egress grows faster. Those numbers will anchor every tradeoff you make.
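The napkin math is worth being able to redo on the spot when the interviewer changes an input. A minimal sketch using only the figures stated above and decimal units (1TB = 1,000,000MB); the function name is made up for illustration:

```python
def daily_storage_tb(uploads_per_day, video_percent, image_mb, video_mb):
    """Return (image_tb, video_tb) of new data per day, in decimal terabytes."""
    videos = uploads_per_day * video_percent // 100
    images = uploads_per_day - videos
    mb_per_tb = 1_000_000
    return images * image_mb / mb_per_tb, videos * video_mb / mb_per_tb


image_tb, video_tb = daily_storage_tb(1_000_000, 5, image_mb=3, video_mb=300)
print(f"images: {image_tb:.2f} TB/day, video: {video_tb:.2f} TB/day")
```

Video is 5% of uploads but more than 80% of the bytes, which is why the rest of the design spends most of its effort there.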

The Architecture Has Two Separate Pipelines

Draw two boxes first: write path and read path. Every design decision flows from which path a component sits on.

For images, the write path is: client uploads to S3, metadata lands in Postgres, thumbnails generate async. The read path is CDN serving the image directly, with no app server involved.

For video, the write path never completes in a single request. That's the key difference that trips up most candidates. The client uploads a raw file, transcoding workers convert it to multiple resolutions, the CDN pre-warms segments, and only then does the video become visible. The API returns 202 Accepted when the upload finishes. The video shows "processing" for minutes or hours after that. (You know that spinner. You've watched it. Now you're going to design it.)

Two-pipeline architecture: write path on the left (Client → API → S3 → Queue → Workers → CDN), read path on the right (Viewer → CDN directly)

The write and read paths are completely separate. Your app server never touches the read path.

Client
  │
  ├── POST /initiate-upload  →  API Server  →  presigned URL
  │
  └── PUT file  ─────────────────────────────────────→  S3 (raw)
                                                            │
                          ┌─────────────────────────────────┘
                          ▼
                   Message Queue (SQS/Kafka)
                          │
               ┌──────────┴──────────┐
               │                     │
          Image workers         Video workers
          (thumbnails)         (transcoding)
               │                     │
         S3 (thumbnails)     S3 (HLS segments)
               │                     │
              CDN                   CDN

Never Let File Bytes Touch Your App Servers

This is the single most important architecture decision, and half of interview candidates miss it.

If you route file uploads through your API server, that server becomes a bottleneck. A 500MB video upload holds an HTTP connection open for minutes. Imagine your API layer is a hotel receptionist. Fine for check-ins. Not fine if you ask her to personally carry every guest's suitcase to the room. At 50,000 concurrent uploads, you've brought down your API layer with legitimate traffic.

The solution is presigned URLs. The client asks your server for a time-limited upload credential, then uploads directly to S3. Your server never sees the file bytes.

Sequence diagram: Client asks API for presigned URLs, API creates a media record and gets upload credentials from S3, returns presigned URLs to client, client PUTs chunks directly to S3 bypassing the API server entirely

Steps 1-4 and 6-8 are small metadata calls to your API server. Step 5, the only one that moves file bytes, goes straight to S3. Your app server is sipping coffee.

1. Client → POST /initiate-upload
2. Server validates auth, creates media record (status: "uploading")
3. Server calls S3 CreateMultipartUpload → gets uploadId
4. Server returns { uploadId, presignedUrls[] } to client
5. Client PUTs each chunk directly to S3 via presigned URLs
6. Client → POST /complete-upload with { uploadId, ETags[] }
7. Server calls S3 CompleteMultipartUpload
8. Server publishes processing job to queue

The app server handles metadata and coordination. S3 handles bytes. Direct-to-object-storage upload is the standard pattern for large photo and file services.
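The server side of steps 3, 4, and 7 can be sketched in a few lines. This assumes `s3` is a boto3 S3 client (for example `boto3.client("s3")`) passed in by the caller; the function names, bucket, key, and the one-hour TTL are illustrative choices, not part of any real service:

```python
def initiate_upload(s3, bucket, key, part_count, ttl_seconds=3600):
    """Steps 3-4: open a multipart upload and presign one URL per part."""
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)["UploadId"]
    urls = [
        s3.generate_presigned_url(
            "upload_part",
            Params={"Bucket": bucket, "Key": key,
                    "UploadId": upload_id, "PartNumber": n},
            ExpiresIn=ttl_seconds,  # assumed TTL; shorter is safer
        )
        for n in range(1, part_count + 1)  # S3 part numbers start at 1
    ]
    return {"uploadId": upload_id, "presignedUrls": urls}


def complete_upload(s3, bucket, key, upload_id, etags):
    """Step 7: ask S3 to stitch the parts together, in part-number order.

    etags maps part number -> ETag string reported by the client.
    """
    parts = [{"PartNumber": n, "ETag": etag}
             for n, etag in sorted(etags.items())]
    return s3.complete_multipart_upload(
        Bucket=bucket, Key=key, UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
```

Neither function ever reads the file. The only thing that crosses the API server is a handful of URLs on the way out and a list of ETags on the way back.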

Chunked Upload Is Not Optional for Video

A 1GB video over a mobile connection will get interrupted. Without chunking, the user restarts from zero. This will happen. Mobile networks are like that friend who always says they're "five minutes away."

Use multipart upload: split the file into 10MB chunks, upload in parallel, and resume from the last successful chunk. S3 requires every part except the last to be at least 5 MiB; 10MB is a practical default. A 500MB video becomes 50 parts, and the client can push 4 at a time.

The client tracks { partNumber, ETag } pairs as parts complete. When the network drops during part 34 of 50, the client re-sends part 34 and any other part it has no ETag for, then carries on; parts 1-33 are never re-uploaded. S3 holds the confirmed parts (ListParts reports them) and combines them only at CompleteMultipartUpload.
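The resume logic comes down to two small functions: one that cuts the file into numbered byte ranges, and one that works out which parts still lack an ETag. A sketch, assuming 10 MiB parts and hypothetical function names:

```python
PART_SIZE = 10 * 1024 * 1024  # 10 MiB parts, above S3's 5 MiB minimum


def plan_parts(file_size, part_size=PART_SIZE):
    """Return {part_number: (start_byte, end_byte_exclusive)} for a file."""
    parts = {}
    number, start = 1, 0
    while start < file_size:
        end = min(start + part_size, file_size)  # last part may be short
        parts[number] = (start, end)
        number, start = number + 1, end
    return parts


def parts_to_upload(plan, confirmed_etags):
    """Parts with no recorded ETag still need to be (re)sent after a drop."""
    return [n for n in sorted(plan) if n not in confirmed_etags]


plan = plan_parts(500 * 1024 * 1024)               # a 500 MiB video
done = {n: f'"etag-{n}"' for n in range(1, 34)}    # parts 1-33 confirmed
remaining = parts_to_upload(plan, done)
print(len(plan), remaining[0], len(remaining))     # 50 34 17
```

Because the check is "no ETag yet" rather than "higher than the last number seen", it also handles parallel uploads, where parts can finish out of order.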

Client libraries such as AWS Amplify, or implementations of the tus resumable-upload protocol (tus.io), handle this for you. You still need to explain the mechanism in an interview.

Video Transcoding Is an Async Job, Not a Request

After the raw file lands in S3, your API returns 202 Accepted immediately. Real work happens on a separate worker fleet. "Your video is processing" is not a cop-out. It's the honest status of an async pipeline.

The message queue receives a job: { videoId, s3Key, userId }. Transcoding workers pick it up. Transcode in parallel across multiple workers, not sequentially. You don't encode 360p, then 480p, then 720p, then 1080p on one machine. That takes the sum of all four encode times when you only need to wait for the slowest one. Fan the job out to four workers simultaneously.

Each worker runs FFmpeg against the same source file and outputs one resolution. Each resolution gets segmented into 2-6 second chunks. A 10-minute video cut into 6-second segments becomes 100 segment files per resolution (300 at 2 seconds). Each worker also writes an HLS playlist (.m3u8) listing its segment URLs, and a master playlist lists every quality tier.
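What one worker actually runs can be sketched as an FFmpeg argument list. The flags are standard options of FFmpeg's HLS muxer; the bitrate ladder, the codec choices (libx264, aac), the 6-second segment length, and the output layout are assumptions for illustration. The function only builds the command, so a worker would still have to hand it to something like `subprocess.run(cmd, check=True)`:

```python
# Hypothetical rendition ladder: name -> (height in pixels, video bitrate).
LADDER = {
    "1080p": (1080, "5000k"),
    "720p": (720, "2800k"),
    "480p": (480, "1400k"),
    "360p": (360, "800k"),
}


def ffmpeg_hls_command(source, out_dir, rendition, segment_secs=6):
    """Build the FFmpeg invocation for one rendition of one source file."""
    height, bitrate = LADDER[rendition]
    return [
        "ffmpeg", "-i", source,
        "-vf", f"scale=-2:{height}",        # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", bitrate,
        "-c:a", "aac",
        "-f", "hls",
        "-hls_time", str(segment_secs),     # target segment length in seconds
        "-hls_playlist_type", "vod",
        "-hls_segment_filename", f"{out_dir}/{rendition}/seg_%03d.ts",
        f"{out_dir}/{rendition}/index.m3u8",  # this rendition's playlist
    ]


cmd = ffmpeg_hls_command("raw_video.mp4", "out", "720p")
```

The segment length is a target, not a guarantee: FFmpeg can only cut at keyframes, so the encoder's keyframe interval has to line up with it.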

Transcoding fan-out: raw video in S3 triggers a message queue job, which fans out to four workers simultaneously (1080p, 720p, 480p, 360p), each outputting HLS segments to S3, all four running in parallel

Sequential encoding is a classic interview trap. Fan it out. Four workers, and the wall-clock time is set by the slowest one.

raw_video.mp4  →  Worker A  →  1080p/seg_001.ts ... seg_100.ts
               →  Worker B  →   720p/seg_001.ts ... seg_100.ts
               →  Worker C  →   480p/seg_001.ts ... seg_100.ts
               →  Worker D  →   360p/seg_001.ts ... seg_100.ts

All workers complete  →  master.m3u8 generated  →  metadata.status = "published"

The video is not visible to viewers until the manifest is uploaded and the metadata record flips to published. Before that, the UI shows a progress state.
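The fan-in step is the part candidates tend to wave away, so here is a sketch of it: record each finished rendition, and only when the set is complete write the master playlist and flip the status. The playlist tags are standard HLS; the bandwidth figures, frame sizes, and per-rendition `index.m3u8` paths are assumptions:

```python
RENDITIONS = {  # hypothetical bandwidth (bits/s) and frame size per tier
    "1080p": (5_000_000, "1920x1080"),
    "720p": (2_800_000, "1280x720"),
    "480p": (1_400_000, "854x480"),
    "360p": (800_000, "640x360"),
}


def master_playlist(renditions=RENDITIONS):
    """Render master.m3u8: one STREAM-INF entry per quality tier."""
    lines = ["#EXTM3U"]
    for name, (bandwidth, resolution) in renditions.items():
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{name}/index.m3u8")
    return "\n".join(lines) + "\n"


def on_rendition_done(finished, rendition, expected=RENDITIONS):
    """Record one finished rendition; return the resulting media status."""
    finished.add(rendition)
    return "published" if finished >= set(expected) else "processing"
```

In a real pipeline the `finished` set would live in the database and be updated atomically, since four workers can report completion at nearly the same moment.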

The Data Model Is Two Tables

Keep metadata lean. Binary data lives in S3. Your database stores pointers and state.

Media table (images and videos):

media_id      UUID    primary key
user_id       UUID    shard key
type          ENUM    ('image', 'video')
status        ENUM    ('uploading', 'processing', 'published', 'failed')
s3_key        TEXT
size_bytes    BIGINT
created_at    TIMESTAMP

Renditions table (video only):

rendition_id     UUID
media_id         UUID    foreign key
resolution       TEXT    ('1080p', '720p', '480p', '360p')
s3_manifest_key  TEXT
duration_secs    INT
bitrate_kbps     INT

For images, a thumbnails table does the same job as renditions, mapping sizes to S3 keys. A 4000px original gets processed to 1200px, 600px, and 150px. The original goes to Glacier after processing. Users never get the raw file; they get the resized version from CDN.

Shard on user_id. User uploads cluster on one shard, which makes "show all my media" queries fast without cross-shard joins.
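To make the media table concrete, here is a sketch that builds it in an in-memory SQLite database. SQLite stands in for Postgres purely so the example runs anywhere: it has no ENUM or UUID types, so CHECK constraints and TEXT columns are substitutes, and the index name is an assumption:

```python
import sqlite3
import uuid

SCHEMA = """
CREATE TABLE media (
    media_id   TEXT PRIMARY KEY,
    user_id    TEXT NOT NULL,
    type       TEXT NOT NULL CHECK (type IN ('image', 'video')),
    status     TEXT NOT NULL CHECK (status IN
               ('uploading', 'processing', 'published', 'failed')),
    s3_key     TEXT NOT NULL,
    size_bytes INTEGER,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Supports "show all my media" without scanning other users' rows.
CREATE INDEX media_by_user ON media (user_id, created_at);
"""

db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)

user_id, media_id = str(uuid.uuid4()), str(uuid.uuid4())
db.execute(
    "INSERT INTO media (media_id, user_id, type, status, s3_key) "
    "VALUES (?, ?, 'video', 'uploading', ?)",
    (media_id, user_id, f"raw/{media_id}.mp4"),
)
# The status column is the only thing that changes as the pipeline advances.
db.execute("UPDATE media SET status = 'published' WHERE media_id = ?",
           (media_id,))
rows = db.execute(
    "SELECT media_id, status FROM media WHERE user_id = ? "
    "ORDER BY created_at DESC",
    (user_id,),
).fetchall()
```

Every query the product needs filters on user_id first, which is what makes it a workable shard key.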

Storage Tiering Is Free Cost Optimization

Every upload lives in S3 forever unless you write lifecycle rules. At 15TB/day of new video, you're adding about 5.5PB per year (15TB × 365). At $0.023/GB for S3 Standard, that's roughly $126K/month after year one, before counting images or transcoded renditions. The math gets grim fast.

Use S3 lifecycle policies to move cold content to cheaper storage automatically. No code. Just config.

S3 storage tier ladder: Standard (0-90 days, $0.023/GB), Infrequent Access (90d, $0.0125/GB), Glacier Instant (1yr, $0.004/GB), Glacier Deep Archive (raw originals immediately after transcoding, $0.00099/GB)

100TB of raw originals: $2,300/month on Standard vs $99/month on Deep Archive. A lifecycle rule is six lines of config.

New content: S3 Standard ($0.023/GB/month)
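Content older than 90 days, fewer than 10 views: S3 Infrequent Access ($0.0125/GB). Lifecycle rules filter on age, prefix, tags, and size, not view counts, so a periodic job has to tag the low-view objects.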
Content older than one year: Glacier Instant Retrieval ($0.004/GB)
Original raw uploads after transcoding: Glacier Deep Archive ($0.00099/GB)

The originals move to Deep Archive as soon as transcoding completes. You only need them again if re-transcoding for a new codec, and a batch re-transcode can tolerate the hours Deep Archive takes to restore an object. Deep Archive costs about $1/TB/month versus $23/TB for Standard.
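As a rough picture of what that config looks like, here is a lifecycle document in the shape boto3's `put_bucket_lifecycle_configuration` accepts, plus the cost arithmetic from this section. The `raw/` and `published/` prefixes, the rule IDs, and the 1-day delay for originals are assumptions; the rules are age-only, and the 1-day delay presumes transcoding finishes within a day (a tag written by the worker would be a stricter trigger):

```python
LIFECYCLE = {
    "Rules": [
        {
            "ID": "raw-originals-to-deep-archive",
            "Filter": {"Prefix": "raw/"},          # assumed key layout
            "Status": "Enabled",
            "Transitions": [{"Days": 1, "StorageClass": "DEEP_ARCHIVE"}],
        },
        {
            "ID": "age-out-published-media",
            "Filter": {"Prefix": "published/"},    # assumed key layout
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER_IR"},
            ],
        },
    ]
}

# Per-GB monthly prices quoted in this section.
PRICE_PER_GB_MONTH = {"STANDARD": 0.023, "STANDARD_IA": 0.0125,
                      "GLACIER_IR": 0.004, "DEEP_ARCHIVE": 0.00099}


def monthly_cost_usd(terabytes, storage_class):
    """Monthly storage cost in USD, using decimal TB (1 TB = 1000 GB)."""
    return round(terabytes * 1000 * PRICE_PER_GB_MONTH[storage_class], 2)


print(monthly_cost_usd(100, "STANDARD"), monthly_cost_usd(100, "DEEP_ARCHIVE"))
```

Applying it would be a single call along the lines of `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=LIFECYCLE)`, with the bucket name supplied by you.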

CDN Is the Entire Read Architecture

Your app server has no role on the read path. Every image, thumbnail, video segment, and manifest goes through the CDN edge.

A user in Berlin requesting a video from us-east-1 sees 150ms latency per segment. That makes video buffer. A user in Berlin hitting a Frankfurt edge node sees 3ms. That is the difference between smooth playback and a rage-quit. Nobody rages at Netflix. This is why.

The player requests master.m3u8 from the CDN edge. The manifest lists all quality tiers; the player picks the tier matching its current bandwidth and fetches segments one at a time. If bandwidth drops mid-stream, the player downgrades to 480p at the next segment boundary without rebuffering from scratch. This is adaptive bitrate streaming. The player makes every decision, and once the segments are cached the CDN serves all of it without touching your servers.
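The tier choice itself is simple enough to sketch. This is a deliberately naive version of what a player does after each segment: take the measured throughput, keep some headroom, and pick the best tier that fits. The bitrates and the 0.8 headroom factor are assumptions, and real players also weigh buffer level and recent history:

```python
# Hypothetical bitrate ladder in kilobits per second.
TIERS_KBPS = {"1080p": 5000, "720p": 2800, "480p": 1400, "360p": 800}


def pick_tier(measured_kbps, tiers=TIERS_KBPS, headroom=0.8):
    """Highest-bitrate tier that fits in a fraction of measured bandwidth."""
    budget = measured_kbps * headroom
    fitting = [name for name, kbps in tiers.items() if kbps <= budget]
    if not fitting:
        return min(tiers, key=tiers.get)  # nothing fits: use the lowest tier
    return max(fitting, key=tiers.get)


for kbps in (8000, 4000, 2000, 500):
    print(kbps, pick_tier(kbps))
```

Short segments matter here because the player can only change tiers at a segment boundary.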

CDN adaptive bitrate: player in Berlin requests master.m3u8 from Frankfurt CDN edge (3ms), edge serves cached segments on hit, fetches from S3 us-east-1 origin on miss (150ms). Player measures bandwidth per segment and switches quality tier automatically.

Cache hit = 3ms. Cache miss = 150ms round-trip to origin. CDN offload at YouTube scale reduces origin bandwidth by 90% or more.

Popular segments stay in the edge cache for as long as they keep getting requested. Less popular content is evicted within hours. Your S3 origin serves only the long tail and first-time misses.

For a more detailed look at CDN architecture in system design interviews, see the CDN system design guide.

Three Things Break First at Scale

Call out bottlenecks before the interviewer asks.

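Transcoding capacity. GPU workers are expensive. At 50,000 videos per day (about 35 per minute on average, more at peak), the fleet's combined throughput has to keep up with the arrival rate or the queue grows without bound. Use spot instances to cut GPU costs by 60-70%, and make jobs safe to retry because spot workers can be reclaimed mid-encode. When the queue backs up, status stays processing longer. Users tolerate that as long as you show a progress indicator.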
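Fleet sizing is a one-line estimate once you name the inputs. A sketch with loudly assumed numbers: a 5-minute average video, four renditions, each worker encoding at 2x real time, and an optional peak multiplier. Only the 50,000 videos per day comes from this walkthrough:

```python
import math


def workers_needed(videos_per_day, avg_video_secs, renditions, speed_factor,
                   peak_multiplier=1.0):
    """Workers required so encode throughput keeps up with arrivals.

    speed_factor: seconds of video one worker encodes per wall-clock second.
    """
    encode_secs_per_day = (
        videos_per_day * avg_video_secs * renditions / speed_factor)
    return math.ceil(encode_secs_per_day * peak_multiplier / 86_400)


average = workers_needed(50_000, 300, renditions=4, speed_factor=2.0)
peak = workers_needed(50_000, 300, renditions=4, speed_factor=2.0,
                      peak_multiplier=3.0)
print(average, peak)
```

The gap between the average and peak figures is the argument for autoscaling on queue depth rather than running a fixed fleet.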

Hot content at the CDN origin. S3 can hold anything, but when 500,000 users simultaneously request a viral video before the CDN has cached it, every request goes to origin. Pre-warm the CDN for popular content by triggering a cache-fill request immediately after publishing.
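Pre-warming mostly means requesting the first few seconds of every tier, since that is what every new viewer fetches first. A sketch that only builds the URL list; the CDN hostname, the path layout, and the choice of three segments per tier are assumptions, with segment names following the seg_001.ts pattern used earlier:

```python
def prewarm_urls(cdn_base, video_id, renditions, first_segments=3):
    """URLs to request right after publishing: manifest + opening segments."""
    urls = [f"{cdn_base}/{video_id}/master.m3u8"]
    for rendition in renditions:
        urls += [
            f"{cdn_base}/{video_id}/{rendition}/seg_{i:03d}.ts"
            for i in range(1, first_segments + 1)
        ]
    return urls


urls = prewarm_urls("https://cdn.example.com", "abc123",
                    ["1080p", "720p", "480p", "360p"])
print(len(urls))
```

A request like this typically warms only the edge or shield location that receives it, so a real job would need to hit each region it cares about.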

Metadata write storms at peak upload time. When 50,000 users start uploading at once, 50,000 rows need to be written to your media table. If the table is on a single unsharded Postgres instance, writes queue up. At high scale, move metadata to a write-optimized store like DynamoDB or Cassandra, sharded by user_id. For the message queue backing the processing pipeline, see the message queue system design.

How to Pace the 45 Minutes

Most candidates spend 30 minutes on the upload flow because it feels concrete, then scramble through everything else. The tradeoffs at the end are what separate a strong hire from a hire. Spending 35 minutes drawing boxes and 10 minutes on tradeoffs is backwards.

Stick to this split:

0-5 min: Scope. Images vs video. Scale. Upload-only or playback?
5-15 min: Upload flow. Presigned URLs, multipart, chunked resume. Draw the diagram.
15-25 min: Video processing. Async job, message queue, parallel transcoding, HLS output.
25-35 min: Storage and metadata. Two tables, shard by user_id, lifecycle policies.
35-42 min: CDN delivery. Adaptive bitrate, edge caching, origin behavior.
42-45 min: Tradeoffs you'd revisit. Sync vs async thumbnails. Glacier retrieval latency tradeoff. Presigned URL TTL.

The question "what would you change if you had more time?" is really "what tradeoffs did you consciously accept?" Name them explicitly. That's what gets written into the feedback doc.


What Goes on Your Whiteboard

Two separate write pipelines: images are viewable right after upload, video only after async transcoding.
Presigned URLs bypass app servers entirely. Never route bytes through your API.
Multipart upload: 10MB chunks, parallel, resumable on network failure.
Transcoding fans out to parallel workers by resolution, segments to HLS.
Two tables: media (with status) and renditions. Shard by user_id.
Lifecycle rules move originals to Glacier Deep Archive after transcoding.
CDN handles every read. Adaptive bitrate needs 2-6 second segments to switch quality.
Three bottlenecks: transcoding capacity, CDN cache warming, metadata write storms.