How we tested whether agents can take on real spec validation work
Sure, models have gotten way better at producing good quality frontend code that even follows design systems. Still, design QA has quietly become one of the most expensive parts of shipping frontend work. Even in well run product teams, it still requires a human to open the ticket, open the design file, open staging, and reconcile all three by hand. Product and design end up doing another pass over changes that already “passed review,” just to answer a simple question:
Does this actually match what we agreed to build?
Spec Agents are our attempt to move that work into the agent layer. Assuming you’ve built the required testing infrastructure, the goal is to have an agent that reads the spec, walks the product, and returns a clear yes or no with evidence.
We’ve been running this model across a cohort of early users and watching how the agents behaved end to end.
In the agent’s own eyes - Spec Reviewer performing a code review on our own web UI
What a Spec Agent actually does
A Spec Agent treats a requirement as a contract.
It starts by reading the input the way a PM would: ticket text, acceptance criteria, attached design notes, any discussion that clarifies scope. From that, it extracts what must be true in the product when the work is done. Not "someone touched this component," but "this flow should no longer exist," or "this control should behave in a specific way," or "this label and behavior should be generalized beyond a single narrow case."
From that understanding it writes its own checklist. That checklist might include visiting a particular screen, inspecting a specific control, confirming a layout change, or checking that a flow no longer appears in navigation. It may also include a quick look at the underlying route, component or event type when the requirement implies deeper change than text and pixels.
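To make that concrete, here is a minimal sketch of what an extracted contract and its checklist could look like as data. Every name and check kind below is an illustrative assumption for this post, not the agent’s actual internal format.

```typescript
// Hypothetical shape for an extracted spec contract and its checklist.
// Field names and check kinds are illustrative, not a real internal format.

type CheckKind =
  | "visit_screen"     // open a particular screen and confirm it renders
  | "inspect_control"  // confirm a specific control's state or behavior
  | "confirm_layout"   // confirm a described layout change
  | "flow_absent"      // confirm a retired flow no longer appears in navigation
  | "inspect_code";    // look at the underlying route, component, or event type

interface SpecCheck {
  kind: CheckKind;
  target: string;      // e.g. a route, component name, or menu path
  expectation: string; // plain-language statement of what must be true
}

interface SpecContract {
  source: string;         // the ticket or design note the contract came from
  mustHold: string[];     // what must be true in the product when the work is done
  checklist: SpecCheck[]; // the checks the agent plans to run
}
```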
Side note: we don’t envision a world where these recipes are manually maintained. We’ve done manual and automated QA in the past and we’re tired of endlessly defining critical flows. That’s never where regressions and bugs really surface.
Then it drives the product. The agent navigates through the UI, waits for data to load, opens menus and dialogs, and captures snapshots at the relevant points. Finally it compares what it saw to the contract it extracted and issues a verdict, with short natural language reasoning and links back to the captured evidence.
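For illustration, a stripped-down version of that loop might look like the Playwright sketch below. The URL, selectors, and return shape are assumptions made for this post; the real agent’s implementation is more involved.

```typescript
import { chromium } from "playwright";

// Minimal sketch of "drive the product, capture evidence, report back",
// using the hypothetical SpecCheck shape from the earlier sketch.
async function runCheck(check: SpecCheck) {
  const browser = await chromium.launch();
  const page = await browser.newPage();

  // Navigate to the screen the check targets and let async data settle.
  await page.goto("https://staging.example.com" + check.target);
  await page.waitForLoadState("networkidle");

  // Open whatever menu or dialog the check cares about, e.g.:
  // await page.getByRole("button", { name: "Settings" }).click();

  // Capture a snapshot as evidence at the relevant point.
  const evidencePath = `evidence-${Date.now()}.png`;
  await page.screenshot({ path: evidencePath, fullPage: true });

  await browser.close();

  // In the real flow, the model then compares what it saw against the
  // contract and writes short natural-language reasoning with links to
  // the captured evidence; here we just hand back the evidence pointer.
  return { check, evidencePath };
}
```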
How we evaluated Spec Agents
To keep this grounded, we picked a small set of real changes instead of synthetic examples. Each represented a type of work that usually pulls human PMs and designers back into hands on QA.
One was a broken UI behavior that had supposedly been fixed. Another was a cleanup task where a retired flow was meant to be fully removed, not just hidden. A third was a terminology and behavior change where a feature that used to apply to a narrow audience now needed to apply more generally across “all users,” with names and triggers updated accordingly.
For each change we gave the agent the same context a reviewer would have. We let it read the requirement, then turned it loose with the ability to browse and inspect relevant components. We also recorded the sessions and then watched them like National Geographic savanna exploration footage: did it go to the right places, look at the right things, and reach a sensible conclusion?
Bottom line, we judged performance on three axes (a rough scorecard shape is sketched after the list):
Whether the extracted contract actually matched the human understanding of the requirement.
Whether the navigation and inspection behavior looked like a plausible design review rather than a brittle script.
Whether the final verdict and explanation matched what an experienced PM or designer would have decided after walking the product.
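As a rough illustration of how we scored runs against those axes, a per-run record along these lines is enough; the field names are ours for this post, not a formal benchmark schema.

```typescript
// Hypothetical per-run scorecard covering the three axes above.
interface RunEvaluation {
  runId: string;
  contractMatchedHumanReading: boolean;    // extracted contract vs. our reading of the requirement
  navigationLookedLikeRealReview: boolean; // plausible design review, not a brittle script
  verdictMatchedExpert: boolean;           // same call an experienced PM or designer would make
  notes?: string;                          // anything worth flagging from watching the session
}
```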
Standard Spec Reviewer report - we start from outputs and work back to logic
Capabilities the agent demonstrated
Across all runs, certain capabilities showed up consistently.
First, the agent was able to reconstruct intent, not just restate text. It distinguished between “this control should now work,” “this entire flow should be gone,” and “this concept should be generalized beyond a special case.” That difference matters, because it changes what needs to be checked. A simple component fix demands one path through the UI; a retired flow demands that navigation, routes, and backing logic are gone; a generalization change demands that old, narrow labels and event types no longer surface anywhere the feature is configured.
Traced agentic workflow - each session is systematically logged, metered and traced
Second, the navigation was robust. The agent did not rely on hard coded selectors or a fixed script. It scrolled, clicked through multi step builders, opened dropdowns deep inside configuration screens, and waited for asynchronous loading states to complete. In one of the harder sequences it used an accessibility snapshot of a menu to understand all available options without depending on pixel perfect layout. That is the level of resilience you need if this is going to stand in for a human run through staging.
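As a concrete, hedged example of that pattern: Playwright can snapshot the accessibility tree, which lets a script enumerate a menu’s options without depending on pixel-perfect layout. The page URL and menu trigger below are made up.

```typescript
import { chromium } from "playwright";

// Enumerate an open menu's options from the accessibility tree rather than
// from visual layout. The URL and trigger button are hypothetical.
async function listMenuOptions(): Promise<string[]> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto("https://staging.example.com/settings");

  await page.getByRole("button", { name: "More options" }).click();

  // Walk the accessibility snapshot and collect menu item names.
  const tree = await page.accessibility.snapshot({ interestingOnly: true });
  const items: string[] = [];
  const walk = (node: any): void => {
    if (!node) return;
    if (node.role === "menuitem" && node.name) items.push(node.name);
    (node.children ?? []).forEach(walk);
  };
  walk(tree);

  await browser.close();
  return items;
}
```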
Third, the evidence was usable. Each verdict came with concrete artifacts: screen states that showed the new behavior working, or still broken; navigation views that clearly revealed whether a legacy flow was still discoverable; configuration views that either did or did not expose the generalized concept. The written summaries were short and specific. A human reviewer could skim the conclusion, glance at the captured states, and decide whether they trusted the call, instead of repeating the journey themselves.
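To picture what such a report could contain, here is an assumed shape for a verdict and its evidence; the real report format is not shown in this post, so treat this purely as a sketch.

```typescript
// Hypothetical structure for a Spec Agent verdict with its evidence.
interface SpecVerdict {
  requirement: string;   // the contract statement being judged
  verdict: "meets_spec" | "incomplete" | "does_not_meet_spec";
  reasoning: string;     // short, specific natural-language explanation
  evidence: Array<{
    description: string;    // e.g. "navigation no longer lists the legacy flow"
    screenshotPath: string; // captured screen state backing the claim
  }>;
}
```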
Finally, the verdicts were correct. In the runs we evaluated, the agent agreed with expert human reviewers on what was truly done and what was not. Fixes that worked were marked as such, partially completed cleanups were called out as incomplete, and generalizations that only changed text but not behavior were correctly treated as “not yet meeting the spec.”
Takeaways for future evals
What matters here is less the individual tickets and more the pattern.
Feed the agent the real input your PMs and designers are already using. Let it extract its own contract for what “done” means. Allow it to navigate the same product surfaces they would touch: navigation, screens, configuration flows, and where needed the code that powers them. Evaluate it not just on pass or fail, but on whether it looked in the right places, produced convincing evidence, and reached the same conclusion a human expert would.
Once you can trust an agent to run that loop reliably, you can start moving design QA out of calendar time and into the pipeline. Instead of PMs and designers spending another afternoon reopening staging and tracing flows by hand, they review a set of agent verdicts that already come with the necessary proof.
The long term goal is simple: a change is not “ready” just because tests pass. It is ready because a spec aware agent has walked the product, compared it to the original intent, and produced a clear, reviewable answer.