Skip to content

Evals Are the New Unit Tests

You would not ship code with no tests. Shipping an agent with no evals is the same bet.

The short version

An agent is only as good as the loop it runs. Perception, plan, act, verify — the failure usually hides in the verify step, where the model that wrote the code also grades it. Independent checks are not optional; they are the architecture.

Context is the scarce resource. The teams shipping real agentic work treat the context window like a budget: what goes in, what gets summarized, what gets dropped. Most agent failures are context failures wearing a different costume.

Reliability compounds badly. An agent that is right ninety-five percent of the time across a ten-step task succeeds barely sixty percent of the time. The math is unforgiving, which is why narrow, verifiable steps beat ambitious ones.

The human moved up a level. You no longer write the fix; you specify the goal and judge the result. The scarce skill now is a crisp success criterion and an honest review of the diff.

Revision history

  • 2026-04-29 Reworked the eval-coverage section after Discord feedback.

Related

Search

Esc

Search is available on the live site.