The Inbox Is the Training Set
The most valuable training signal for ad creative does not come from the prompt the marketer types. It comes from the accept-reject decision the buyer makes on a queue of pre-generated work. The surface where that decision happens is the research project. The inbox is not a feature. It is the lab bench.
The prompt is the wrong label
A prompt labels the input. It records what was asked, in what order, under what constraint. It does not say whether the output was right. In a vertical absent from pretraining, where what counts as right is local to a brand and a category and a quarter, the prompt is the cheapest possible label and the least informative. The expensive label is the buyer holding a finished candidate next to the account it would ship into and saying yes or no. Pretraining does not have it. Public benchmarks do not have it. It exists only in the moment a buyer decides whether to ship.
Most generative work in this space optimizes the cheap label. The shipped creative the buyer would have killed and the killed creative the buyer would have shipped never enter the loss. The model gets fluent at the prompt and stays naive about the account.
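The gap between the two labels can be sketched as data records. The field names below are illustrative assumptions, not any real schema: the point is only that the prompt record describes the input, while the decision record is the one that can enter a loss.

```python
from dataclasses import dataclass, field

@dataclass
class PromptRecord:
    """The cheap label: what was asked. Says nothing about
    whether the output was right for the account."""
    brand_id: str
    prompt_text: str
    constraints: list = field(default_factory=list)

@dataclass
class DecisionRecord:
    """The expensive label: a finished candidate held next to
    the account, and the buyer's yes or no."""
    brand_id: str
    candidate_id: str
    shipped: bool   # the buyer's accept or reject
    surface: str    # "chat" or "inbox": where the decision happened

# Only the decision record carries an outcome the model can learn from.
example = DecisionRecord("acme", "cand-001", shipped=False, surface="inbox")
```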
What the chat surface alone misses
Marketer-initiated work is bounded by what the marketer thinks to ask. It is selection-biased toward briefs the team already has language for, formats they have shipped before, angles that survived the last quarter. The long tail, the angles a buyer would have dismissed in three seconds without being able to say why, is invisible to a chat-only system because a chat-only system never proposes them. A model trained only on the work the marketer requests learns the work the marketer already knows how to do. It cannot widen the brand's distribution past the buyer's prior, because the buyer's prior is the only thing it ever sees.
What the autopilot proposes
An autopilot makes the proposals the marketer would not have made. It generates against angles the strategy planner identified as uncovered, formats the peer corpus is rewarding this quarter, hooks the brand has never been graded against. Most of these proposals are wrong. That is the point. A buyer rejects most of what the inbox surfaces. The rejected output is the densest label in the vertical, because the rejection sits exactly at the boundary between what the brand will and will not ship, which is the boundary the model needs to learn. You only see that boundary by making proposals on both sides of it.
An accepted candidate from the inbox carries more information than an accepted candidate from the chat, for the same reason a correct guess carries more information than a correct copy. The model proposed something the buyer did not ask for, and the buyer shipped it anyway. The accept becomes a positive label in a region of the distribution the chat surface would not have reached.
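One minimal way to harvest that boundary, as a sketch: pair every accept in a day's batch with every reject from the same batch, so each training pair straddles the ship/no-ship line. The function and the all-pairs rule are assumptions for illustration, not a description of a particular pipeline.

```python
from itertools import product

def preference_pairs(decisions):
    """decisions: list of (candidate_id, accepted: bool) from one batch.
    Returns (accepted, rejected) pairs, each one crossing the boundary
    between what the brand will and will not ship."""
    accepted = [c for c, ok in decisions if ok]
    rejected = [c for c, ok in decisions if not ok]
    return list(product(accepted, rejected))

# A toy batch: two accepts, two rejects -> four boundary-straddling pairs.
batch = [("a1", True), ("a2", False), ("a3", False), ("a4", True)]
pairs = preference_pairs(batch)
```

Because rejects dominate a typical inbox batch, most of the pairs inherit their information from the rejections, which is exactly the density claim above.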
The order is the curriculum
The inbox cannot exist for a brand on day one. It earns its place. Week one a buyer asks for individual ads in chat. Week four they review a daily batch and the chat is reserved for the harder calls. Month three they open the inbox first. The product flow is the data flow. The chat sessions seed the preference signal. The inbox accumulates that signal at scale. Trust passes from one surface to the other the way a junior buyer earns autonomy from a senior one: on the strength of small correct calls, until the supervisor gets out of the way.
An autopilot-only product skips both stages of that handoff at once. It tries to label the buyer's preferences without first having shown the buyer it understands them. The reject rate stays high forever, the accepts get noisier rather than sharper, and the model learns nothing about why it was wrong, because the buyer never had a calibration phase to teach it. The copilot is the calibration phase.
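The handoff can be pictured as a gating rule: the autopilot's daily batch stays at zero until the buyer's recent decisions show the model is calibrated, then grows in steps as small correct calls accumulate. The window, thresholds, and batch sizes below are invented for illustration.

```python
def inbox_batch_size(recent_decisions, base=0, step=5, cap=40):
    """recent_decisions: list of bools (accepted?) from the last window.
    Returns how many autopilot proposals to surface tomorrow."""
    if len(recent_decisions) < 20:   # still in chat calibration
        return base
    accept_rate = sum(recent_decisions) / len(recent_decisions)
    if accept_rate < 0.10:           # model not yet earning trust
        return base
    # autonomy grows in steps with demonstrated accept rate
    return min(cap, step * int(accept_rate * 20))
```

On this rule a brand with too few decisions, or a floor-level accept rate, gets no autopilot batch at all: the chat surface has to earn the inbox into existence.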
Why this is structural
The argument is not specific to ad creative, but it is sharper there than almost anywhere else. Most generative-AI verticals have an outcome signal that lives outside pretraining and outside what the model can grade itself on. Code has a compiler. Math has a checker. Ad creative has a buyer, and there is no substitute. The product surface that produces the buyer's judgment at scale is the product surface that produces the dataset.
What we're building
Kojo is built around the surface that produces the label. The chat is where a buyer asks for individual ads and, in the asking and the editing and the shipping, seeds the preference signal one decision at a time. The inbox is where the autopilot proposes work the buyer would not have asked for: angles the strategy planner flagged as uncovered, formats the peer corpus is rewarding this quarter, hooks the brand has never been graded against. A pipeline run drops a batch into the inbox each morning. A buyer accepts a few, rejects most, and every one of those decisions is a labeled example sitting on the boundary between what the brand will and will not ship.
That signal is the asset. The harness scores candidates against brand and category fidelity before they ever reach the inbox, so the work the buyer sees is already filtered for the slips a horizontal grader would miss. A preference model trained on the buyer's accept-reject feeds back into generation as reward, and the per-brand adapter weights specialize against it. The chat is calibration. The inbox is scale. Together they produce a dataset no single-brand tool collects and no web-scale lab can synthesize.
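A minimal sketch of how an accept-reject preference model could feed back as reward, under stated assumptions: each candidate reduces to a small feature vector (say, harness fidelity scores), and a Bradley-Terry model is fit on the buyer's pairs. Everything here, from the features to the training loop, is illustrative rather than Kojo's actual setup.

```python
import math
import random

def score(w, x):
    """Linear reward: higher means closer to what the brand ships."""
    return sum(wi * xi for wi, xi in zip(w, x))

def train(pairs, dim, lr=0.1, steps=500, seed=0):
    """pairs: list of (accepted_features, rejected_features).
    Gradient ascent on the Bradley-Terry log-likelihood."""
    random.seed(seed)
    w = [0.0] * dim
    for _ in range(steps):
        xa, xr = random.choice(pairs)
        # probability the accepted candidate outranks the rejected one
        p = 1 / (1 + math.exp(-(score(w, xa) - score(w, xr))))
        g = 1 - p  # gradient scale of the log-likelihood
        w = [wi + lr * g * (a - r) for wi, a, r in zip(w, xa, xr)]
    return w

# Toy data: accepted candidates run higher on the first feature.
pairs = [([1.0, 0.2], [0.1, 0.3]), ([0.9, 0.1], [0.2, 0.2])]
w = train(pairs, dim=2)
```

After fitting, `score(w, x)` prefers the accepted region of feature space, which is the shape of signal a generator can then be tuned against.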
The render starts the work. The buyer's yes or no is what trains the next one.