Problem
A community storytelling nonprofit wanted to scale long-form member-success stories without losing the founder’s voice, fabricating details, or burying editors in unstructured drafts. The bottleneck wasn’t the writing — it was grading and gating quality at volume.
Approach
A pipeline in which every draft is scored before a human sees it. Scoring combines an 8-point pass/fail check (all story blocks present and in order, self-reported figures labeled, no fabricated claims, public data points verified) with a five-criterion rubric, each criterion scored out of 5 (voice fidelity, narrative clarity, grounding, dignity, movement significance). Drafts below threshold are kicked back with the criteria they failed. The pipeline never publishes on its own: an editor approves, then the member explicitly approves, before anything goes live.
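A minimal sketch of the scoring gate, in TypeScript to match the Next.js stack. It assumes the judge's output has already been parsed into one boolean per pass/fail check and one score out of 5 per rubric criterion. The check IDs, the `JudgeVerdict` and `GateResult` shapes, and the per-criterion minimum of 4 are hypothetical. The source doesn't define "below threshold", so this uses one plausible rule: every check must pass and every rubric score must reach the minimum.

```typescript
// Hypothetical shapes: the source names the rubric criteria but not the data model.
type RubricCriterion =
  | "voiceFidelity"
  | "narrativeClarity"
  | "grounding"
  | "dignity"
  | "movementSignificance";

interface JudgeVerdict {
  // One boolean per pass/fail check (eight in the described pipeline).
  checks: Record<string, boolean>;
  // One score out of 5 per rubric criterion.
  rubric: Record<RubricCriterion, number>;
}

type GateResult =
  | { status: "ready_for_editor" }
  | { status: "kicked_back"; failing: string[] };

// Assumed threshold: each rubric criterion must score at least this.
const MIN_RUBRIC_SCORE = 4;

function gateDraft(verdict: JudgeVerdict): GateResult {
  const failing: string[] = [];

  // Any failed pass/fail check blocks the draft outright.
  for (const [name, passed] of Object.entries(verdict.checks)) {
    if (!passed) failing.push(name);
  }

  // Any rubric criterion under the threshold also blocks it.
  for (const [name, score] of Object.entries(verdict.rubric)) {
    if (score < MIN_RUBRIC_SCORE) failing.push(`${name} (${score}/5)`);
  }

  // Clearing the gate only routes the draft to an editor; going live
  // still requires editor approval and then explicit member approval.
  return failing.length === 0
    ? { status: "ready_for_editor" }
    : { status: "kicked_back", failing };
}

// Usage with hypothetical check IDs:
const example = gateDraft({
  checks: {
    blocksPresentAndOrdered: true,
    selfReportedFiguresLabeled: false,
    noFabricatedClaims: true,
  },
  rubric: {
    voiceFidelity: 5,
    narrativeClarity: 4,
    grounding: 3,
    dignity: 5,
    movementSignificance: 4,
  },
});
// example is { status: "kicked_back", failing: ["selfReportedFiguresLabeled", "grounding (3/5)"] }
```

Returning the failing criteria, rather than a bare boolean, is what lets a kicked-back draft carry its reasons back to the next generation attempt.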
Stack
A custom Next.js app on Supabase (Postgres + Auth) with Google login, using Anthropic Claude as both the generator and the eval judge, deployed on Vercel. The pass/fail and rubric criteria live in version control alongside the prompts, and a CLI regression runner grades the entire submission bank end-to-end.
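A sketch of the aggregation half of such a regression runner, assuming each submission in the bank has already been graded by the judge into the same check/rubric shape used at the gate. The model call itself is left out rather than guessing at a client API. `summarizeBank`, `regressions`, the stored-baseline comparison, and the 0.2 tolerance are illustrative assumptions, not the project's actual runner.

```typescript
// Hypothetical shape of one graded submission from the bank.
interface GradedSubmission {
  id: string;
  checks: Record<string, boolean>;
  rubric: Record<string, number>;
}

interface BankSummary {
  total: number;
  fullyPassing: number; // submissions that pass every pass/fail check
  rubricAverages: Record<string, number>; // rounded to one decimal
}

function summarizeBank(results: GradedSubmission[]): BankSummary {
  const sums: Record<string, number> = {};
  const counts: Record<string, number> = {};
  let fullyPassing = 0;

  for (const r of results) {
    if (Object.values(r.checks).every(Boolean)) fullyPassing++;
    for (const [name, score] of Object.entries(r.rubric)) {
      sums[name] = (sums[name] ?? 0) + score;
      counts[name] = (counts[name] ?? 0) + 1;
    }
  }

  const rubricAverages: Record<string, number> = {};
  for (const name of Object.keys(sums)) {
    rubricAverages[name] = Math.round((sums[name] / counts[name]) * 10) / 10;
  }
  return { total: results.length, fullyPassing, rubricAverages };
}

// Criteria whose average fell more than `tolerance` below a stored baseline.
function regressions(current: BankSummary, baseline: BankSummary, tolerance = 0.2): string[] {
  const dropped = Object.entries(baseline.rubricAverages)
    .filter(([name, base]) => (current.rubricAverages[name] ?? 0) < base - tolerance)
    .map(([name]) => name);

  // A lower share of drafts clearing every pass/fail check also counts.
  const passRate = (s: BankSummary) => (s.total === 0 ? 0 : s.fullyPassing / s.total);
  if (passRate(current) < passRate(baseline)) dropped.push("passFailRate");
  return dropped;
}
```

A CLI wrapper would grade the bank, summarize it, compare the result against a saved baseline, and exit non-zero when `regressions` returns anything, so a prompt or criteria change that lowers scores fails loudly instead of slipping through.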
What shipped
A beta batch of eleven member stories ran end-to-end: voice fidelity averaged 4.7/5, opening scenes landed in ten of eleven, and there were zero fabricated metrics across the batch, even where intake data was missing. Those missing-intake cases surfaced the real risk: the model reconstructs plausibly from gaps, so a hard intake gate and explicit member approval became mandatory. The client green-lit a pilot conditional on four gates: intake validation, automated quantitative checks, member approval, and press-tier source traceability.
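One possible form for the automated quantitative checks, sketched under assumptions: flag any number in a draft that doesn't appear anywhere in the member's intake text, so every figure either traces back to the intake or goes to an editor for review. The function names and the regex are hypothetical, and the check is deliberately crude: it ignores numbers written as words ("ten of eleven") and will flag dates the intake happens to phrase differently.

```typescript
// Extract numeric tokens such as "40", "1,200", or "3.5"; "40%" yields "40".
function extractNumbers(text: string): string[] {
  const matches = text.match(/\d[\d,]*(?:\.\d+)?/g) ?? [];
  // Strip thousands separators (and any trailing comma) so "1,200" and "1200" match.
  return matches.map((m) => m.replace(/,/g, ""));
}

// Return each distinct number in the draft that has no match in the intake.
function untraceableFigures(draft: string, intakeText: string): string[] {
  const known = new Set(extractNumbers(intakeText));
  const flagged = extractNumbers(draft).filter((n) => !known.has(n));
  return Array.from(new Set(flagged));
}

// Usage with made-up text:
const flaggedFigures = untraceableFigures(
  "She grew the pantry from 40 families to 1,200 in 3 years.",
  "Started with 40 families; now serves 1200.",
);
// flaggedFigures is ["3"]: the "3 years" figure has no source in the intake.
```

Flagging rather than rejecting keeps the check cheap and explainable; an editor decides whether a flagged figure is a harmless restatement or a reconstruction from a gap.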
What’s next
Build the four gates, then move from beta to the live pilot.