
Case Study

10 min read · Updated May 12, 2026

Architecture under contract

How three integrations stay online when their upstreams don't.

Three pages, three integrations that look nothing alike, one architectural rule that keeps all three online when an upstream breaks. The rule: the page never calls the upstream at request time. Every render reads from a snapshot file on disk; the snapshot is refreshed on a separate path, on a separate cadence, with its own failure mode. This is what lets /music, /films, and /television survive a rate-limit incident, a CSV no-show, and an unofficial endpoint going dark, without the page ever knowing which upstream broke. A sequel to Building this site: same posture, harder upstreams.


If you're skimming

After migrating the refresh crons from GitHub Actions (a 91% miss rate) to Vercel Cron (0%), there have been zero render-path incidents since launch, because the page never calls the upstream at request time. Three completely different upstreams agreed to that contract.

  • The render contract shouldn't know how the data got there.
  • Editorial intent belongs upstream of the writer, not the reader.
  • Unofficial integrations call for politeness, not workarounds.

01 · The Brief

Three pages. Three upstreams that look nothing alike.

Three pages—/music, /films, /television—needed data from three completely different upstreams, plus TMDB for metadata enrichment on two of them. The brief: ship them under one rendering pattern so a visitor lands on a page that doesn't know—and doesn't need to know—which upstream fed it. Spotify is the easy case structurally (real OAuth API). Letterboxd has no API. Serializd has a publicly-callable internal endpoint that no one promised to keep stable. TMDB is the easy supporting case (public API, API-key auth), but it only matters if the upstream it's enriching landed cleanly.

Building this site covers Spotify; this study works the harder two.


02 · Three Core Sources, One Enrichment Source

comparative architecture

Architecture frames product UX.

Three pages that look the same to a visitor sit on three architectures that look nothing alike. Likeness at the page level doesn't require likeness at the integration level—and forcing it would have meant compromising either the page or one of the upstreams.

Source      Public API     Auth               Path                                            Cadence
Spotify     Yes            OAuth (Auth Code)  Live HTTP                                       Manual ritual
Letterboxd  No             None               CSV export + RSS feed                           Hourly cron (RSS)
Serializd   No (internal)  None (anon)        Polite paginate of internal endpoint            Hourly cron (offset)
TMDB        Yes            API key            Per-item enrichment during films + TV refresh   Inherited (no own cron)

Spotify is the easy case structurally. There's an API, the auth flow is documented, and the data shapes are generally stable. The hard part was operational—rate limits, deprecations—and is covered in the previous case study.

Letterboxd is the hard case. No API at all. The site is read-friendly to humans and hostile to scrapers. But Letterboxd does publish two parseable surfaces: a CSV export a user can request by hand from account settings (ground truth, complete history, manual, ZIPped), and an RSS feed of recent activity (last ~50 entries, public, fast, machine-readable, but truncated to the recent window). This site's integration treats a CSV as the seed and the RSS as the delta. The CSV bootstraps the catalog; RSS keeps it warm. The two fix each other's blind spots.

Serializd is the awkward middle. No documented API, but the site itself is a React app that talks to a public-by-design endpoint at serializd.onrender.com. Anyone with browser DevTools can see the URL. The integration uses it the same way the site's own frontend does, with a few extra courtesies. More on that a couple of sections below.

TMDB is the reference layer. A stateless catalog API with an API-key auth model and rate limits generous enough to never matter at this scale. It supplies the metadata the other three sources don't return themselves—poster art, genre taxonomy, episode counts, and the show-type classification downstream editorial logic depends on. TMDB has no cron of its own; it rides whichever pipeline is enriching new items. The other three sources are about behavior; TMDB is about catalog.


03 · One Rendering Contract

shared shape

Rendering convergence.

Every Vercel environment runs with SPOTIFY_OFFLINE=1 and equivalents for the other feeds. Pages render from lib/feeds/_fixtures/<service>-snapshot.json. Zero live API calls per render; deterministic latency. The rate-limit incident that almost broke /music couldn't break it again, because the page never talks to Spotify in production. The same property protects /films from Letterboxd outages and /television from a Serializd 5xx.

Each snapshot envelope carries the same four things: a capturedAt ISO timestamp (used for “data as of” UI captions and freshness diagnostics), a summary block with the headline numbers pre-tallied so the page doesn't have to total them up on every visit, the full content list already in display order, and a lookup map so any /[slug] detail page can find its item instantly. Each reader implements the same three things: a schema-shape guard that fails loud at module load instead of cryptic-undefined deep in a render, a module-scoped cache that builds derived indices once and shares a lifetime with the snapshot itself, and a /api/<service>/health probe that answers “would a refresh work right now?” in one call.
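The envelope and its guard can be sketched in a few lines. The four fields come from the text above; the exact field names beyond capturedAt (items, bySlug, the summary keys) are illustrative assumptions, not the site's real schema:

```javascript
// Hypothetical envelope shape; field names beyond capturedAt are illustrative.
const exampleSnapshot = {
  capturedAt: "2026-05-12T06:00:00.000Z", // ISO timestamp for "data as of" captions
  summary: { total: 745, thisYear: 104 }, // headline numbers, pre-tallied
  items: [{ slug: "example-film", title: "Example Film" }], // already in display order
  bySlug: { "example-film": 0 },          // slug -> index into items
};

// Shape guard: fail loud at module load, not as a cryptic undefined deep in a render.
function assertSnapshotShape(snap) {
  const fail = (msg) => { throw new Error(`snapshot shape: ${msg}`); };
  if (typeof snap?.capturedAt !== "string" || Number.isNaN(Date.parse(snap.capturedAt)))
    fail("capturedAt must be a parseable ISO timestamp");
  if (typeof snap.summary !== "object" || snap.summary === null) fail("summary must be an object");
  if (!Array.isArray(snap.items)) fail("items must be an array");
  if (typeof snap.bySlug !== "object" || snap.bySlug === null) fail("bySlug must be a lookup map");
  return snap;
}
```

Because every reader runs the guard at import time, a malformed snapshot breaks the build, not the visitor's render.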

The shared-lifetime detail is worth flagging. The cache holds the snapshot, the slug map, and the chronological position map for prev/next neighbor nav as one object. The alternative—evict the snapshot but keep the slug map—is a class of bug that's easy to write, impossible to debug from the symptom, and trivially prevented by making them rise and fall together.
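A minimal sketch of the shared-lifetime cache, assuming a simplified snapshot shape (items with a slug field, display order doubling as chronological order). The point is structural: the snapshot and every derived index live in one object, so they can only be replaced together:

```javascript
// One cache object: snapshot plus every derived index rises and falls together.
let cache = null; // module-scoped; shares a lifetime with the snapshot itself

function buildCache(snapshot) {
  const slugToIndex = new Map(snapshot.items.map((item, i) => [item.slug, i]));
  // prev/next neighbor nav: position map over the same ordered list
  const neighbors = (slug) => {
    const i = slugToIndex.get(slug);
    if (i === undefined) return null;
    return {
      prev: snapshot.items[i - 1]?.slug ?? null,
      next: snapshot.items[i + 1]?.slug ?? null,
    };
  };
  return { snapshot, slugToIndex, neighbors };
}

function getCache(snapshot) {
  // Rebuild atomically on a new snapshot; never evict one index without the others.
  if (!cache || cache.snapshot !== snapshot) cache = buildCache(snapshot);
  return cache;
}
```

The bug class the text describes (stale slug map over a fresh snapshot) is unrepresentable here, because there is no code path that replaces one field of the cache without rebuilding all of them.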

The cost: every render is reading data that is, on average, half the cron interval old. For /music it could be days old. The pages are editorial, not realtime, so the freshness budget is generous. The trade-off is that there's no moment-of-truth retrieval; a play-history change at T+0 doesn't appear at T+0. None of these surfaces need it.


04 · Three Refresh Models

scheduling and posture

Refresh divergence.

Spotify is human-in-the-loop. npm run music:refresh is a guarded ritual: kill any running dev, spawn dev:online, probe /api/spotify/health for rate-limit clearance on both /me and /me/playlists buckets, call /api/spotify/snapshot, diff old vs new, write the new fixture. No cron. The 21-hour penalty box from case study #1 was earned in dev, and Spotify's rate limiter is sticky and per-app—a cron firing during a dev iteration could chain into it. Manual gating means the only refreshes that happen are ones I'm ready for.
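The gating step of that ritual reduces to one decision: proceed only when both probed buckets are clear. A sketch, assuming a health payload shaped like the probe's answer (the field names `me`, `mePlaylists`, `ok`, and `rateLimited` are illustrative, not Spotify's):

```javascript
// Gate for the manual refresh ritual: refresh only when both rate-limit
// buckets report clear. Payload field names are illustrative assumptions.
function canRefresh(health) {
  const buckets = ["me", "mePlaylists"]; // the two buckets the probe checks
  return buckets.every((b) => health?.[b]?.ok === true && !health[b].rateLimited);
}
```

A missing or partial health payload fails closed, which is the right default for a sticky, per-app rate limiter.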

Letterboxd is hourly, RSS-driven. /api/cron/films-refresh runs at 0 * * * * UTC. It fetches the RSS feed, parses the last ~50 entries, diffs against the snapshot, enriches any new films via TMDB's /movie/{id} endpoint, merges, re-aggregates, and commits the new snapshot back to the repo via the GitHub contents API. Vercel rebuilds on push. The CSV bootstrap path is separate: when a fresh export lands, parse-letterboxd-export.mjs reads the unzipped folder, joins diary.csv with reviews.csv on (Date, Letterboxd URI), applies the prose-only scope filter (rating-only watches don't qualify a film for /films), and rebuilds the snapshot from scratch. RSS catches edits inside its ~50-entry window but goes blind on older edits and on deletions; the CSV bootstrap is the ground-truth corrective for both.
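The CSV bootstrap's join-and-filter step can be sketched as a pure function. Column names follow Letterboxd's export headers as described above; the row shapes are simplified to plain objects, and the real script does more than this:

```javascript
// Join diary rows with review rows on the (Date, Letterboxd URI) pair, then
// apply the prose-only scope filter: rating-only watches don't qualify.
function buildFilmCatalog(diaryRows, reviewRows) {
  const key = (row) => `${row["Date"]}|${row["Letterboxd URI"]}`;
  const reviewsByKey = new Map(reviewRows.map((r) => [key(r), r]));
  return diaryRows
    .map((d) => ({ ...d, Review: reviewsByKey.get(key(d))?.["Review"] ?? "" }))
    .filter((row) => row.Review.trim().length > 0); // prose-only filter
}
```

Keying on the (Date, URI) pair rather than the film name is what keeps rewatches distinct: the same film watched twice produces two diary rows with different dates.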

Serializd is hourly, paginated, with a thirty-minute offset. /api/cron/television-refresh runs at 30 * * * * UTC. The offset is a race-guard so the films and television crons don't push to main simultaneously and collide on the GitHub contents API SHA. The pipeline paginates /api/user/malxavi/diary?page=1..N with a 500ms gap between pages, groups reviews by show, enriches each unique show via TMDB's /tv/{id} endpoint, runs an editorial-cleaning pass (next section), aggregates the summary, and commits.
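On Vercel, the two schedules and their race-guard offset would live in vercel.json; a sketch using the route paths and cron expressions from the text (the file contents beyond those are an assumption):

```json
{
  "crons": [
    { "path": "/api/cron/films-refresh", "schedule": "0 * * * *" },
    { "path": "/api/cron/television-refresh", "schedule": "30 * * * *" }
  ]
}
```

The thirty-minute stagger is the whole race-guard: two writers to the same branch never fire in the same minute, so neither commit fails on a stale contents-API SHA.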

On building in public, at the commit level

Both routes commit via the GitHub contents API rather than writing to Vercel Blob. Blob would be faster (no rebuild) and quieter (no commit chatter). The integration uses commit-via-API anyway because the chatter is a feature during the building-in-public phase—every refresh produces a real commit on main, which keeps the public GitHub history visible. A different posture (mature product, no audience for the commit log) would pick Blob.


05 · The Polite Client

ethics and posture

It's easier to ask forgiveness...

The Serializd integration calls an endpoint that no public documentation describes. The endpoint is the same one Serializd's own React frontend calls; it's discoverable in two minutes with browser DevTools. The integration includes three deliberate courtesies. First, an identifying User-Agent: not a browser-spoof, but "malxavi.com /television cluster - read-only, snapshot-driven, hourly (https://malxavi.com)"—a name, a scope, a cadence, and a contact path. Second, an X-Requested-With header that matches their own frontend's; the integration looks like the client their service expects to answer. Third, a 500ms gap between paginated requests, so a full bootstrap spreads ~28 requests over ~14 seconds rather than hammering. The cron fires once an hour and almost always pulls a single incremental page anyway.
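The three courtesies translate directly into code. A sketch, with the page fetcher injected so the pacing logic is visible on its own; the X-Requested-With value, endpoint path, and page shape are illustrative assumptions (the real values come from observing their frontend):

```javascript
// The three courtesies: identifying UA, frontend-matching header, fixed gap.
const POLITE_HEADERS = {
  "User-Agent":
    "malxavi.com /television cluster - read-only, snapshot-driven, hourly (https://malxavi.com)",
  "X-Requested-With": "XMLHttpRequest", // must match what their frontend sends; illustrative here
};

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage(page, headers) -> { reviews, hasMore }; injected for testability.
async function paginateDiary(fetchPage, { gapMs = 500 } = {}) {
  const all = [];
  for (let page = 1; ; page++) {
    const { reviews, hasMore } = await fetchPage(page, POLITE_HEADERS);
    all.push(...reviews);
    if (!hasMore) break;
    await sleep(gapMs); // one page per gap; a full bootstrap spreads out, never hammers
  }
  return all;
}
```

With the default 500ms gap, a ~28-page bootstrap takes roughly 14 seconds, matching the numbers above; the hourly incremental run usually exits after page one.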

This is closer in posture to scraping a public website than to consuming an API. The site isn't doing anything Serializd's own frontend doesn't do; it's doing it less often, more slowly, and identifying itself. There's no auth bypass—the endpoint is anonymous-by-design. There's no volume that would resemble an attack vector. There's no data resale, no aggregation product, no exposure of any user other than mine. If Serializd publishes a public API tomorrow, the integration switches to it tomorrow. If they ask the integration to stop, it stops. The disclosure path is real; the User-Agent is the address.

Identifying User-Agent, low volume, documented fallback. The default for any unofficial integration, not a special case.
the polite-client rule

06 · Automated Editorial

quality as plumbing

Automating editorial intent.

The Serializd bootstrap script runs an editorial-cleaning pass with seven categories—miniseries detection, show-vs-season ambiguity, in-progress vs completed shows, posterless entries, and a few others. Each is either INFORMATIONAL or BLOCKING. If any BLOCKING category has unresolved entries, the script exits non-zero and refuses to write the snapshot. The cron run fails noisily; the existing snapshot stays in production. The page never silently regresses to a misclassified state.
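The gate itself is small. A sketch under stated assumptions: the category names echo the text, but which categories are BLOCKING versus INFORMATIONAL, and the finding shape, are illustrative:

```javascript
// Severity per category; the assignment here is an assumption for illustration.
const SEVERITY = {
  miniseries: "BLOCKING",
  showVsSeason: "BLOCKING",
  inProgress: "INFORMATIONAL",
  posterless: "INFORMATIONAL",
};

// Refuse to write the snapshot while any BLOCKING category has unresolved entries.
function gate(findings) {
  const blocking = findings.filter((f) => SEVERITY[f.category] === "BLOCKING" && !f.resolved);
  return {
    ok: blocking.length === 0,
    exitCode: blocking.length === 0 ? 0 : 1, // non-zero: cron fails noisily, old snapshot stays
    blocking,
  };
}
```

The non-zero exit is the mechanism behind "fails noisily": the cron run is marked failed, nothing is committed, and production keeps serving the last good snapshot.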

The miniseries double-count rule is the touchstone example. A miniseries occupies one season but reads editorially as a complete show—so a show-level review on a miniseries also counts in season totals, and a season-level review on a miniseries also counts in show totals. The rule lives in lib/feeds/serializd-mode-counts.mjs; both the bootstrap script and the runtime page consume it. Skipping it produces counts that are technically right and editorially wrong.
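The rule fits in one shared function. This is a reconstruction from the description above, not the contents of serializd-mode-counts.mjs, and the review/show shapes are illustrative:

```javascript
// Miniseries double-count rule: one season, but editorially a complete show,
// so show-level and season-level reviews each count at both levels.
function modeCounts(review, show) {
  const counts = { show: 0, season: 0, episode: 0 };
  counts[review.level] += 1; // every review counts at its own level
  if (show.isMiniseries) {
    if (review.level === "show") counts.season += 1;
    if (review.level === "season") counts.show += 1;
  }
  return counts;
}
```

Having both the bootstrap script and the runtime page call the same function is what keeps the totals "editorially right" in both places; duplicating the rule would let them drift.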

Most data pipelines treat editorial quality as a downstream review concern. Pulling it left into the writer makes the snapshot file the contract: if it's on disk, it has already passed the gate. Pipelines that don't enforce editorial intent eventually surface that gap to the user.


07 · What's Live

What shipped and what's next.

As of writing:

  • /music—39 owned public Spotify playlists, OAuth-fetched at refresh time, snapshot-cached, manually refreshed. Snapshot captured 2026-05-12.
  • /films—745 reviewed films, 104 in 2026 so far, sourced from a Letterboxd CSV export and topped up hourly via RSS, all enriched through TMDB. Snapshot captured 2026-05-12.
  • /television—153 shows across 37 show-level, 230 season-level, and 489 episode-level reviews, sourced from Serializd's internal API under a polite-client posture and enriched through TMDB. Snapshot captured 2026-05-12.

All three pages render from disk. None of them call an upstream during a request. The next public API change in any of the three sources will move the refresh script, not the page.

What's next. Build the cleanup gate before the parser on any future integration of this shape; building it after the parser meant a round of avoidable manual classification. Smooth the Letterboxd CSV re-seed into a films:reseed script that takes a path arg. With six months of clean Spotify health data, reconsider moving /music to a daily cron with a pre-flight rate-limit probe. And document the snapshot envelope shapes in a README so each fixture file doubles as a tiny public dataset.

The render contract should not know how the data got there.
the lesson that travels

Three takeaways that survive any specific upstream. First, the render contract should not know how the data got there. /music, /films, /television all read from a snapshot file with a guarded shape. The ingestion pipeline can change underneath without touching the page. Three completely different upstreams, one rendering model, no special cases at the page level.

Second, editorial intent belongs upstream of the writer, not the reader. The miniseries double-count rule is a one-line editorial decision and a thirty-line bug if it lives in the page instead of the snapshot. Pipelines that treat editorial quality as a downstream concern eventually ship the gap.

Third, the polite client is a posture, not a workaround. Identifying User-Agent, low volume, documented fallback. That's the default for any unofficial integration, not a special case.