
Case Study

8 min read · Updated May 18, 2026

Content and data platforms: quality over quantity

Translating data quality work into the volume metric a sales-led org trusted

From my time at Muck Rack · Technical Product Manager, Content and Data Ingestion

Quality was the lever; quantity was the language. Behind the “missing mentions” framing sat two simpler problems: we were capturing mentions late and labelling them wrong. A year and a half at Muck Rack as the technical PM for content and data ingestion meant translating accuracy, relevance, and connectivity into the volume metric leadership trusted. The path there was decomposing a single ETL pipeline into a fleet of microservices, first by content type, then by source, even as leadership sold the product on parity with legacy competitors.

01 · Context

Sales-led PR SaaS with a volume-parity narrative.

Muck Rack was a roughly 150-person SaaS reporting tool for PR professionals when I joined in late 2022. Product was organized into three teams: Content (search and monitoring plus the journalist and outlet database), PRM (the reporting features customers used to track their coverage), and Platform (the shared infrastructure underneath). I was hired into Content and carved out the data-ingestion scope inside it: the pipeline that turned a fragmented set of upstream sources into the content objects every other team composed against and the platform that would scale that pipeline's operations. Leadership was hands-on; outside the product team, I worked day-to-day with the CEO, chief of staff, chief partnerships officer, and head of legal.

I worked with an engineering manager and four to five engineers. There was no embedded designer (we rarely touched the frontend) and no embedded data scientist: the data-science function had been operationalized as data engineering, and I had to influence their development priorities project by project. We built out observability in Grafana because the organization did not prioritize connecting sales, operations, and product data; our success was measured against each of these datasets individually, not against a blended measure we could use as a leading indicator or north star metric.


02 · Opportunity

Leadership measured volume. The actual opportunity was timeliness and accuracy.

The institutional rallying cry was “never miss a mention.” GTM sold on content-volume parity with legacy competitors; customers and prospects reported missed mentions as the most visible quality failure of the product. The instinct across the leadership team was that the answer was more sources—more partnerships, more integrations, more scraping. We did pursue those in parallel. But the diagnostic that emerged from the data told a different story.

We weren't missing mentions. We were capturing them late. Every source funnelled into the same ETL pipeline and stalled mid-stage during peak ingestion windows. From a user's perspective, a mention that arrived four hours late was indistinguishable from one that never arrived. A second failure mode lived alongside it: parsing errors and content-type misclassification dropped perceived volume because users experienced misclassified content as missing content. The system saw it. The system just didn't surface it where the user expected.

Two failure modes—one structural (the monolith) and one qualitative (parsing and classification accuracy)—read to users and leadership as a single symptom: low volume. But the symptom is not the disease. The primary opportunity wasn't to source more content. It was to move what we already had through the pipeline more efficiently, and to label it more accurately when it arrived.


03 · Discovery

Observability turned 'low volume' into 'stalled stages.'

Two strands of evidence converged on the diagnostic. One we built; one we listened to.

Quantitative

Stage-by-stage Grafana dashboards made the bottleneck visible.

We instrumented every stage of the pipeline (detected, core processing, enrichment, downloaded, supplemental enrichment) and counted content objects sitting in each stage at any given moment, with alerting on thresholds that historically preceded incidents. The dashboards reframed the conversation from “we need more sources” to “we need to move what we already have through faster.” The metric I translated for leadership was the simplest one available: content objects hitting downloaded per day. It was also the metric marketing and sales were already using.
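A minimal sketch of the per-stage counting and threshold alerting described above. The stage names come from this case study; the `ContentObject` shape, the function names, and the threshold values are hypothetical, and the real dashboards were built in Grafana rather than in application code like this.

```python
from collections import Counter
from dataclasses import dataclass

# Stage names follow the case study; everything else here is illustrative.
STAGES = [
    "detected",
    "core_processing",
    "enrichment",
    "downloaded",
    "supplemental_enrichment",
]


@dataclass(frozen=True)
class ContentObject:
    object_id: str
    stage: str  # the stage the object is currently sitting in


def stage_counts(objects):
    """Count how many content objects sit in each stage right now."""
    counts = Counter(obj.stage for obj in objects)
    return {stage: counts.get(stage, 0) for stage in STAGES}


def breached_stages(counts, thresholds):
    """Return the stages whose backlog exceeds its alert threshold."""
    return [s for s in STAGES if s in thresholds and counts[s] > thresholds[s]]


# Hypothetical snapshot: three objects are stuck in enrichment.
snapshot = [
    ContentObject("a1", "detected"),
    ContentObject("a2", "enrichment"),
    ContentObject("a3", "enrichment"),
    ContentObject("a4", "enrichment"),
    ContentObject("a5", "downloaded"),
]
counts = stage_counts(snapshot)
alerts = breached_stages(counts, {"detected": 5, "enrichment": 2})
print(counts)
print(alerts)
```

Counting objects per stage, rather than only counting what reaches the end, is what lets a stalled stage show up as a backlog instead of as "low volume."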

Qualitative

Missed-mention complaints weren't always about content we never saw.

Feature requests aggregated through Productboard, sales escalations, and customer-success triage all pointed at the same surface symptom: users reported missing mentions. Working with data engineering, we matched a sample of these complaints against the actual ingestion record. A meaningful share of them were about content the system had seen and processed, but had labelled in a way that pushed it out of the user's expected retrieval path. That insight is what shifted the second workstream from a pure quality-improvement effort to a volume-perception effort.
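The complaint-matching exercise can be sketched roughly as follows. This is an illustration under assumptions, not the analysis we actually ran: the record fields, the URL-based join, the category names, and the four-hour lateness cutoff (borrowed from the example earlier in this case study) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IngestionRecord:
    url: str
    published_at: datetime
    downloaded_at: datetime
    content_type: str  # the label the pipeline assigned


@dataclass(frozen=True)
class Complaint:
    url: str
    expected_type: str  # where the user expected to find the mention


LATE_AFTER = timedelta(hours=4)  # illustrative cutoff, not a measured SLA


def classify(complaint, records_by_url):
    """Sort a missed-mention complaint into one of four buckets."""
    record = records_by_url.get(complaint.url)
    if record is None:
        return "never_ingested"
    if record.content_type != complaint.expected_type:
        return "mislabelled"
    if record.downloaded_at - record.published_at > LATE_AFTER:
        return "late"
    return "ingested_ok"


published = datetime(2023, 6, 1, 9, 0)
records = {
    "https://example.com/a": IngestionRecord(
        "https://example.com/a", published, published + timedelta(hours=6), "article"
    ),
    "https://example.com/b": IngestionRecord(
        "https://example.com/b", published, published + timedelta(minutes=20), "broadcast"
    ),
}
results = [
    classify(Complaint("https://example.com/a", "article"), records),
    classify(Complaint("https://example.com/b", "article"), records),
    classify(Complaint("https://example.com/c", "article"), records),
]
print(results)
```

Only the `never_ingested` bucket argues for more sources; the other buckets are pipeline and labelling problems.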

Both strands pointed at the same conclusion: the monolith's shape was the constraint, and accuracy was a quieter half of the same problem. The technical investment that followed was justified to leadership on the volume math; the team itself understood it as quality work.


04 · Strategy

Decompose along content-type and source boundaries.

The bet was a two-axis decomposition. We split the monolith first by content type—article, broadcast, and so on—so each type ran its own pipeline; then by source within each pipeline, so our proprietary scraping technology, a LexisNexis integration, and a TVEyes broadcast content feed each became its own microservice. Each microservice had its own detection, extraction, and initial transformation setup. The third move was standardizing the processing stages of each pipeline post-core-processing as composable nanoservices (enrichment, download, and supplemental enrichment) so we could iterate per stage without rebuilding the surrounding pipeline.

Two figures follow: the architecture, before and after. The thread continues in prose below.

Before

One pipeline carried everything.

[Figure: Monolithic ingestion pipeline. Three sources (proprietary scraping, LexisNexis, and TVEyes) all feed into a single ETL pipeline whose five serial stages (detect, process, enrich, download, supplemental enrichment) run on one instance. The pipeline output fans out to three downstream consumers: search, monitoring, and reporting.]
Three sources funnelled every content object through one shared ETL pipeline. Detection, extraction and transformation, enrichment, download, and supplemental enrichment ran serially on a single instance; a backlog at any stage held up every consumer downstream.

After

A fleet of end-to-end pipelines, one per source.

[Figure: Decomposed ingestion fleet. Three sources (proprietary scraping, LexisNexis, and TVEyes) each get their own end-to-end pipeline, grouped by content type. The Article swim-lane holds two pipelines (scraping, LexisNexis) and the Broadcast swim-lane holds one (TVEyes). Every pipeline runs detect, extract, and transform in its own microservice, then runs its own copy of three nanoservices (enrichment, download, supplemental enrichment) configured for its content type. The nanoservices are duplicated per source and share only their shape across pipelines. Each pipeline's output fans out independently to the same three consumers: search, monitoring, and reporting.]
Each source got its own end-to-end pipeline. Inside the Article and Broadcast swim-lanes, every source ran its own microservice (detection, extraction, transformation) followed by its own copy of three nanoservices—enrichment, download, supplemental enrichment—configured for its content type. The nanoservices share a shape across pipelines, not an instance. Every pipeline's output fans out independently to the same three consumers. Enrichment and supplemental enrichment also drew on external audience and viewership data sources outside the content pipelines, not detailed here.
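The decomposed shape can be sketched as one pipeline object per source, each holding its own copies of the three standardized stages. This is a toy model under assumptions: the class and function names are hypothetical, the stages only record that they ran, and the per-source detection, extraction, and transformation microservice is reduced to tagging the object with its source and content type.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

ContentObject = Dict[str, Any]
Stage = Callable[[ContentObject], ContentObject]

NANOSERVICES = ["enrichment", "download", "supplemental_enrichment"]


def make_stage(name: str, content_type: str) -> Stage:
    """Stand-in nanoservice: same shape everywhere, configured per content type."""
    def stage(obj: ContentObject) -> ContentObject:
        history = list(obj.get("stages", []))
        history.append(f"{name}:{content_type}")
        return {**obj, "stages": history}
    return stage


@dataclass
class Pipeline:
    source: str
    content_type: str
    stages: List[Stage] = field(default_factory=list)

    def run(self, raw: ContentObject) -> ContentObject:
        # Stand-in for the source microservice (detect, extract, transform).
        obj = {**raw, "source": self.source, "content_type": self.content_type}
        for stage in self.stages:
            obj = stage(obj)
        return obj


def build_pipeline(source: str, content_type: str) -> Pipeline:
    """Each pipeline gets its own stage instances, not shared ones."""
    stages = [make_stage(name, content_type) for name in NANOSERVICES]
    return Pipeline(source, content_type, stages)


# One pipeline per source, grouped by content type as in the figure.
fleet = {
    "scraping": build_pipeline("scraping", "article"),
    "lexisnexis": build_pipeline("lexisnexis", "article"),
    "tveyes": build_pipeline("tveyes", "broadcast"),
}
result = fleet["tveyes"].run({"id": "b1"})
print(result)
```

Because each pipeline owns its stage list, swapping the enrichment stage for one source leaves every other pipeline untouched.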

The architecture earned its keep on the second-order effects. With sources decomposed, we could clearly identify duplicates and use them to enrich previously ingested content objects; if an article was already downloaded via the LexisNexis microservice and was then detected by the scraping microservice, we enriched the former with new data from the latter and presented the duplicate mention to users in their reporting (a sellable feature). Cross-content enrichment was one of two inputs to the enrichment nanoservices; separate audience and viewership data feeds drove the same stages from outside the content pipelines, though their mechanics sit outside this case study. With processing stages decomposed, we caught processing errors earlier in the pipeline, before storage and re-processing piled up with under-processed content. Per-source observability let us pinpoint which integration was responsible when volume on a given content type dipped.
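A rough sketch of the duplicate-enrichment idea: when a second source detects an article that is already stored, the stored object keeps its data, gains any fields it lacked, and records the extra source. The URL-based key, the fill-missing-fields merge rule, and all names here are assumptions for illustration; the production matching logic is not described in this case study.

```python
from typing import Dict


def canonical_key(url: str) -> str:
    """Hypothetical dedup key: drop the query string and trailing slash, lowercase."""
    return url.split("?", 1)[0].rstrip("/").lower()


def ingest(store: Dict[str, dict], obj: dict) -> dict:
    """Store a new object, or enrich the existing one if it is a duplicate."""
    key = canonical_key(obj["url"])
    existing = store.get(key)
    if existing is None:
        store[key] = {**obj, "seen_via": [obj["source"]]}
        return store[key]
    # Duplicate: keep the original record and fill only the fields it lacks.
    for field_name, value in obj.items():
        if existing.get(field_name) is None and value is not None:
            existing[field_name] = value
    if obj["source"] not in existing["seen_via"]:
        existing["seen_via"].append(obj["source"])
    return existing


store: Dict[str, dict] = {}
ingest(store, {
    "url": "https://example.com/story?utm=1",
    "source": "lexisnexis",
    "title": "Original title",
    "author": None,
})
merged = ingest(store, {
    "url": "https://Example.com/story/",
    "source": "scraping",
    "title": "Scraped title",
    "author": "A. Writer",
})
print(merged)
```

The `seen_via` list is what would let reporting surface the duplicate mention to users rather than silently discarding it.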

Sequencing came down to two forcing functions. Articles were the majority of total volume, so peeling article processing off first generated the largest legible improvement to the metric users and leadership cared about. LexisNexis was the critical partnership integration on the source side, so isolating it as its own service was both architecturally clean and politically necessary—it let us iterate stage-by-stage on the most contractually sensitive source without coupling that work to the rest of the pipeline. The sell-in to leadership was the volume math, not the architecture; every technical investment had to map back to downloaded-per-day before it earned engineering cycles.

Long view: Negotiating leverage

I'm of the opinion that the standardized content schema positioned the platform, over time, to exert more negotiating leverage on vendors: the more it proved its value to users, the more we could demand a richer content schema and appropriate maintenance terms from any single vendor (or a cheaper rate if a vendor could not meet these demands). This was an underexploited moat, when paired with advanced reporting features that relied upon the uniqueness of the platform's enriched data.

05 · Execution

Three workstreams in parallel, defended against constant injection.

The theory of roadmapping is always cleaner than the practice. Some roadmaps are more modular than others and responsive to real-time influence from the market and stakeholders. Ours was defended quarter-by-quarter against pet projects and incoming requests with sales-cycle deadlines attached—requests that didn't map to our stated team goal of driving database volume growth. We absorbed some of them anyway, and implemented a few outright. My EM and I worked closely to shield the engineering team from whiplash as much as possible—cushioning timeline estimates, partnering on stakeholder management, and relying on him to present the technical roadmap because leadership respected his title's authority on architecture. Inside that collaboration, we moved work forward on three workstreams in parallel.

Phase 01

Observability first, to earn the technical investment.

Before any decomposition shipped, we built the stage-by-stage Grafana dashboards and the alerting thresholds that made pipeline health legible. This is what bought us leadership's patience for structural work—not the architectural argument, but the single chart that tied every technical investment back to downloaded-per-day. The dashboards stayed live through the rest of the project as both a measurement instrument and a communication artifact.

Phase 02

Content-type decomposition first, then per-source microservices.

Articles came off first, since they carried the majority of total volume. LexisNexis followed as the critical source-side peel, then TVEyes for broadcast. We brought in an additional engineer roughly halfway through this phase, dedicated to the decomposition itself, while the rest of the team absorbed day-to-day fires and new integrations. Within each pipeline, we iterated stage-by-stage on the nanoservices so a fix to enrichment didn't require a rebuild of detection.

Phase 03

Quality models in parallel with the structural work.

Article parsing and content-type detection (new vs. evergreen) ran as a parallel track with the data engineering team, prioritized through internal influence rather than formal roadmapping. The accuracy gains landed in Q4 2023. Evergreen detection tightened what counted as a substantive update, which made current monitoring notifications and reporting counts more accurate.

Running alongside all of this, roughly half of my week was the partnerships track: vendor evaluation, integration scoping, contract construction, and content-compliance work with the head of legal. A broadcast vendor contract-expiration deadline led to evaluation of multiple replacements and reprioritization of the broadcast pipeline work. A social monitoring vendor evaluation became a partnership that later became a working integration and outright acquisition after my tenure. This work was consistently framed as separate from the platform decomposition; in practice it was the same job: control over the data, expressed contractually on one side and architecturally on the other.


06 · Outcomes

Volume rose. The quality dimensions underneath it rose more.

The headline outcome was on the metric the org sold and marketed on: average daily content-object volume rose 350% year-over-year, with parsing-error complaints down 45% YoY and historical coverage of the database expanded fivefold via a backfill that landed in Q1 2024. Each number is real and each carries a measurement caveat worth naming.

Ingestion

+350%

Average daily content objects processed, year-over-year. The single metric leadership marketed and sold on, and the one every technical investment had to translate into. This includes article and broadcast content.

Parsing errors

−45%

Year-over-year reduction in user-reported parsing errors. Measured via complaint volume because unknown errors aren't countable; once we identified a parsing failure, we fixed it.

Historical coverage

+500%

Database expansion via historical backfill from 2 to 5 years of content in Q1 2024. Closed the volume-parity gap against legacy competitors and cooled missed mention pressure. The steady decrease in published content meant these prior years had more valuable data for users.

The decomposition enabled three downstream products (search and discovery, monitoring, and reporting) built across two teams: search and monitoring within Content, and reporting within PRM. The teams behind them now had faster, more accurate, and cleaner data to build against. This work improved reliability and accuracy beneath features that already existed, but it did not generally result in net new user features. This is most of what senior platform PM work actually is: invisible to a screenshot, visible in the metric the org sold and marketed on.

The “never miss a mention” pressure cooled as we progressed towards the Q1 2024 backfill but never fully resolved. It couldn't: “more volume” without a numerator or denominator isn't a target you can satisfy. The social monitoring vendor evaluation surfaced this tension explicitly: choosing among candidates required a definition of “a mention” that the org hadn't committed to. Did we mean a hashtag? Keyword? Tag? The PR industry (our users) wasn't terribly clear on the definition, but we, as the platform org, needed a sharper one.


07 · Reflection

Quality was the lever; quantity was the language.

The sharpest thing I learned at Muck Rack is about the metric itself. More volume, without a numerator or denominator, isn't a target—it's a strategy gap. There's no defensible answer to “are we done” because the goal isn't grounded in the universe of content actually published, nor in what users consider a mention in the first place. The social vendor evaluation made this visible: choosing a partner required defining the moat, and the org didn't have one. Platform work can solve a lot of things, but it can't replace strategy clarity. PMs working under unbounded growth targets should be alert to this—you can ship excellent work and still be chasing a ghost.

The companion lesson is about how platform work moves in a sales-led organization: by translation, not education. In a high-conviction leadership room, translation is stakeholder management; the two collapse into one craft. The leadership room never flipped on the diagnosis. It accepted, gradually, that I was driving results in the metric it already marketed and sold on. Every structural improvement I shipped had to prove itself by estimated impact on the downloaded-per-day metric before it earned the cycles to develop. Even the qualitative wins (parsing accuracy, evergreen detection) were largely marketed to users as part of the volume narrative rather than as their own stories. Translation went all the way down, and so did the work of absorbing shifting priorities into structured technical investment. That is a discipline my EM and I built together, and one I'd carry into a network-scale role at People Inc., where translation across 40+ brands and their stakeholders was critical to success.

The third thread is the connection to the framework I published recently on data quality. In a LinkedIn article on data practice for growth and personalization, I argued that data programs can be evaluated across six dimensions: completeness, accuracy, relevance, connectivity, legibility, and privacy. The Muck Rack work is where those dimensions came from. The historical backfill was a completeness move. The parsing and content-type models were accuracy work. The perceived-mention insight was a relevance reframing. The ETL decomposition itself was connectivity. The stage-by-stage dashboards were legibility: a system state any non-technical stakeholder could read. And the ongoing partnership and content-compliance work with legal was the privacy column doing its quiet job. The article codified what Muck Rack taught me; the case study is the practice underneath the theory.

Two next steps, if this is the kind of PM work you're hiring for: review my resume.

Or, if you're ready to talk, book a 30-min product chat.