Past the Render
A model can render almost anything now. A photorealistic product shot in a kitchen at golden hour. A nine-by-sixteen video of a woman lifting a package out of a box. A voiceover in any accent, clean room tone, lip-sync that holds. Every signal the model can verify says the output is correct.
Then the ad goes live and gets killed in three days.
The render is the easy mile. The ad is the hard mile. What a foundation model can verify, that an image looks like a plausible face, has almost nothing to do with what a buyer cares about, which is whether the creative converts at a CAC the business can survive, against an audience that has seen a thousand variations of this angle already this quarter. The signal that decides the ad's fate sits outside the pretraining data and outside what the model can grade itself on.
Ad creative is a vertical, and the reason is structural. Each ecommerce brand is its own distribution. The ICP is specific: a 34-year-old new mom buying clean baby products is not a 22-year-old gym bro buying preworkout, and a prompt cannot collapse the difference. So is the brand book, the typography and palette and voice and do-not-do list a junior designer learns in a week and a model has never seen. So is the product catalog, with the actual SKU rather than a stock photo of a similar one. So is the customer language, found in reviews and DMs and support transcripts no scraper has touched. So is the performance history of what has and hasn't printed in the account, attached to creative the model has never been graded against. Web-scale pretraining starts in the wrong place to produce work for any single one of these distributions, and Kojo goes deep on this vertical instead of wide across the rest.
Where the render breaks
The first mile of generative video and image is whether the model can produce a plausible image of anything, and that mile is largely solved. The last mile is whether the model can produce a specific ad for a specific brand, with the actual product, the actual brand voice, and the actual customer-facing message, that performs against a specific audience. That mile is wide open, and a buyer sees it the moment a frame loads.
Product fidelity breaks first: the cap on the bottle is the wrong geometry, the fabric weight reads heavier than the product is, the dropper sits at an angle the brand has never shot. Identity consistency breaks next, inside a single video, where the hand holding the product at 0.4 seconds is left-handed and the hand at 1.2 seconds is right-handed and the wedding ring has migrated fingers between them. Text rendering on a sticker, a label, or a price card collapses into legible-looking gibberish that survives a thumbnail and dies on a hold.
Brand fidelity drifts in a direction the brand book would catch in five seconds, a Pinterest-board taste the founder can name and the model cannot. Strategic fidelity is the worst of them: the brief asked for the founder-to-camera angle that has been printing for the category this quarter, and the model produced the lifestyle UGC angle that died for skincare in February. Every one of these slips past a horizontal grader and lands hard on a buyer.
Grading what a buyer would catch
We build the grading layer first, an internal harness that runs inside generation rather than after. A single learned judge does not work here, because categories drift, the criteria for a sleep brand are not the criteria for a cookware brand, and a single rubric goes stale inside a quarter. The harness runs on multiple axes, with independent graders sitting on each and aggregating into a verdict the generator trains against.
Reference-conditioned vision models grade visual fidelity against the brand's own assets, the SKU geometry and palette and typography, rather than against generic aesthetic priors. Identity consistency tracks frame-to-frame across a sequence: hand position, garment, ring, lighting, the continuity variables a buyer would catch. Brief adherence checks the output against a structured spec the strategy layer emits, which makes the question mechanical: did the creative execute the angle.
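A minimal sketch of what that structured spec could look like, with illustrative field names rather than our production schema:

```python
# Illustrative shape for the spec the strategy layer emits; field names
# here are hypothetical, not the production schema.
from dataclasses import dataclass, field


@dataclass
class BriefSpec:
    brand: str
    sku: str                       # the actual catalog SKU, not a stand-in
    angle: str                     # e.g. "founder-to-camera", "lifestyle-ugc"
    format: str                    # e.g. "9x16-video", "static-image"
    hook_pattern: str              # the opening pattern the brief asks for
    must_include: list[str] = field(default_factory=list)  # brand-book requirements
    must_avoid: list[str] = field(default_factory=list)    # do-not-do list


# With a spec like this, brief adherence stops being a matter of taste:
# every field is a check the grader either passes or fails.
brief = BriefSpec(
    brand="acme-sleep",
    sku="ACME-PILLOW-STD",
    angle="founder-to-camera",
    format="9x16-video",
    hook_pattern="problem-first",
    must_include=["price card in final frame"],
    must_avoid=["competitor mentions"],
)
```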
Hook quality retrieves category-specific patterns from a peer set and grades against those, because what holds attention for cookware does not hold attention for supplements. Pacing checks against the cut rhythm the format rewards on platform, itself a moving target. A learned head predicts performance from historical creative-to-CAC trajectories in the brand and its category, and outputs a posterior rather than a point estimate.
Some graders we learn, some we hand-write, some we model from buyer review. The harness is our internal benchmark, because no public benchmark for ad creative exists, and the public ones for image and video do not measure anything a buyer cares about.
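To make the shape concrete, here is a minimal sketch of the harness described above, with hypothetical grader and field names and a deliberately simple aggregation rule standing in for the real one:

```python
# Sketch of the multi-axis harness: independent graders each score one axis,
# a learned head returns a performance posterior, and the results aggregate
# into a single verdict. Names and the aggregation rule are illustrative.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class AxisScore:
    axis: str                # e.g. "product-fidelity", "brief-adherence", "hook"
    score: float             # 0.0 (clear fail) to 1.0 (clear pass)
    evidence: str            # what the grader saw, kept for buyer review


@dataclass
class PerformancePosterior:
    mean_cac: float          # predicted acquisition cost for this creative
    low: float               # credible interval, not a point estimate
    high: float


class Grader(Protocol):
    axis: str

    def grade(self, creative: dict, brief: dict) -> AxisScore:
        ...


@dataclass
class Verdict:
    scores: list[AxisScore]
    posterior: PerformancePosterior
    passed: bool


def aggregate(scores: list[AxisScore],
              posterior: PerformancePosterior,
              target_cac: float) -> Verdict:
    # Illustrative rule: every axis clears its bar and the upper end of the
    # predicted CAC range sits below the brand's target before anything ships.
    passed = all(s.score >= 0.8 for s in scores) and posterior.high <= target_cac
    return Verdict(scores=scores, posterior=posterior, passed=passed)
```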
Moving the asymptote
The second piece is per-brand post-training, rather than prompting. We tag the brand's own creative with what printed and what got killed, so the signal weights performance instead of treating every piece equally. A peer reference set inside the category goes in as retrieval, because the patterns that print for cookware in May are denser inside the category than across the web. Product photography goes in as identity reference, so SKU geometry survives generation. We pull customer voice from reviews, DMs, and support transcripts, because the words a buyer uses are not the words a copywriter guesses at. The brand book goes in as constraint, and trajectory data, the buyer's accept-reject choices on prior generations, goes in as preference.
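An illustrative shape for that per-brand corpus, with hypothetical field names, each one mapping to a signal named above:

```python
# Hypothetical schema for the per-brand corpus; every field maps to one of
# the signals described in the text, none of this is the production layout.
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PRINTED = "printed"       # creative that performed in the account
    KILLED = "killed"         # creative that got turned off


@dataclass
class PastCreative:
    asset_path: str
    outcome: Outcome          # performance tag, so training weights results


@dataclass
class BrandCorpus:
    creative_history: list[PastCreative]    # tagged with what printed and what got killed
    peer_references: list[str]              # category peer set, used as retrieval
    product_shots: list[str]                # identity references so SKU geometry survives
    customer_voice: list[str]               # reviews, DMs, support transcripts
    brand_book: dict                        # palette, typography, voice, do-not-do list
    buyer_choices: list[tuple[str, bool]]   # (candidate id, accepted) trajectory data
```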
From all of this, adapter weights specialize generation for the brand, retrieval indices condition on the peer corpus and the product catalog, learned prompts encode brief structure, and a preference model trained on the buyer's own accept-reject signal scores candidates before we surface them. Training shifts the output distribution where prompting is bounded by it. Prompting toward an asymptote is not the same as moving the asymptote, and for a vertical absent from pretraining, the work has to move it.
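A sketch of that last filter, with a stand-in linear scorer where the learned preference model would sit:

```python
# Sketch of the final gate: rank candidates with a preference model fit to
# the buyer's own accept-reject history, surface only the top of the ranking.
# The linear scorer below is a placeholder for the learned model.
from dataclasses import dataclass


@dataclass
class Candidate:
    candidate_id: str
    features: list[float]     # whatever representation the preference model scores


def preference_score(candidate: Candidate, weights: list[float]) -> float:
    # Placeholder for the learned model: a linear score over candidate
    # features, with weights fit elsewhere on (candidate, accepted) pairs.
    return sum(w * f for w, f in zip(weights, candidate.features))


def surface(candidates: list[Candidate], weights: list[float], k: int = 3) -> list[Candidate]:
    ranked = sorted(candidates, key=lambda c: preference_score(c, weights), reverse=True)
    return ranked[:k]
```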
Why this only works at platform scale
Our flywheel runs across tenants, which is what a single-brand tool and a web-scale lab both miss. Trajectory data at scale, the buyer accept-reject choices across hundreds of brands, yields a preference dataset no individual brand can produce. Category-aware reward models trained on that data sharpen with every account, because the patterns that print for cookware in May are visible only when many cookware accounts ship in May. Peer reference corpora thicken by category as brands join, and retrieval sharpens for every member. The harness itself tightens, because every disagreement between the harness and the buyer is a labeled example for the next version. Each brand we train gets a sharper generator, a denser reference set, a harder grader, and a better preference model than it would have had a quarter earlier. The loop is the asset.
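A sketch of how that tightening step could be wired, with illustrative names: every candidate where the harness and the buyer split becomes a labeled example for the next grader:

```python
# Sketch of the loop: collect every case where the harness verdict and the
# buyer's decision disagree, keyed by category so reward models stay
# category-aware. Names are illustrative.
from dataclasses import dataclass


@dataclass
class Disagreement:
    candidate_id: str
    harness_passed: bool      # what the harness said
    buyer_accepted: bool      # what the buyer actually did
    category: str             # e.g. "cookware", "supplements"


def collect_disagreements(harness_verdicts: dict[str, bool],
                          buyer_decisions: dict[str, bool],
                          category: str) -> list[Disagreement]:
    labels = []
    for candidate_id, passed in harness_verdicts.items():
        accepted = buyer_decisions.get(candidate_id)
        if accepted is not None and accepted != passed:
            labels.append(Disagreement(candidate_id, passed, accepted, category))
    return labels
```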
What we're building
Performance creative is a research discipline, with rules a buyer learns over years of shipping work, patterns that print and patterns that flop and patterns that shift by category and by quarter, and a craft that lives in the gap between what a model renders and what a brand ships. Kojo ships two things at once: a creative function that, for a given brand, runs sharper than any general-purpose tool, because we trained it on that brand's distribution and graded it against the patterns of its category, and a body of research about how high-performing ad creative actually gets made, encoded into harnesses and reward models and adapter weights instead of left in the heads of a few veteran buyers.
The render starts the work. The decisions that keep an ad alive happen after.