Companion repo: inctasoft/setpoint-evals-for-agentic-engineering-example — clone it, run it, copy the pattern.
Companion theory: The Setpoint Problem — why a setpoint-shaped thing matters at all, recast in control-theory terms.
The problem nobody quite admits
AI coding agents are great at the next decision. They are mediocre, and often outright bad, at the next hundred decisions in a row.
Hand a modern agent a one-step task — “rename this function,” “add this validator” — and it nails it. Hand it a multi-hour task — “implement this feature end-to-end across these six files, make the integration test pass, don’t break the others” — and it tends to wander. Halfway through, it forgets the original goal. It optimises for plausibility over correctness. It declares victory while three obvious things are broken.
The standard answer to this is “better prompting.” More context. Sharper instructions. A grandiose CLAUDE.md. We’ve all tried it. It helps a little. It does not solve the actual problem, which is that the agent has no horizon to aim at — nothing it can run, read, and use to know whether it is closer to done.
This post is about a pattern we use to give the agent that horizon. It’s not a framework. It’s not a library. It’s a convention plus a few hundred lines of bash. We call it Setpoint Evals (SE) and we’ve open-sourced a real codebase that uses them so you can copy the pattern wholesale.
What an SE actually is
An SE is a shell script. It does four things:
- Posts a request to a real running system (an HTTP API, a queue, whatever the entrypoint is).
- Polls the system’s state (database rows, log lines, queue depth).
- Asserts the state matches what is expected, via plain old shell conditions.
- Prints one line of result: `PASS:<seconds>`, `FAIL:<seconds>`, or `TIMEOUT:<seconds>`.
That’s it. No test framework. No mocks. No DI containers. No fixtures DSL. The orchestration glue is a `helpers.sh` file with `initiate_job`, `poll_job`, `verify_job_status`, and `verify_step_status`. The whole infrastructure for a serious eval suite can be built in a long afternoon.
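For a sense of scale, here is a minimal sketch of what such a `helpers.sh` could look like. Only the four function names come from the paragraph above; the `API_BASE_URL` variable, the `/jobs` endpoints, the `.jobId` and `.status` response fields, and the use of `curl` and `jq` are illustrative assumptions, not the repo's actual contract.

```bash
#!/bin/bash
# helpers.sh (sketch): the four primitives, written against a hypothetical HTTP API.
# Assumes API_BASE_URL points at the running system and that responses are JSON.

initiate_job() {            # POST a payload to the entrypoint, echo the new job id
  curl -sf -X POST "${API_BASE_URL}/jobs" \
    -H 'Content-Type: application/json' \
    -d "$1" | jq -r '.jobId'
}

poll_job() {                # usage: poll_job <job-id> [--timeout <seconds>]
  local job_id="$1" timeout="${3:-180}"
  local start
  start=$(date +%s)
  while true; do
    local status
    status=$(curl -sf "${API_BASE_URL}/jobs/${job_id}" | jq -r '.status')
    case "$status" in
      completed|failed) return 0 ;;   # terminal state reached; assertions run next
    esac
    if (( $(date +%s) - start > timeout )); then
      echo "TIMEOUT:$(( $(date +%s) - start ))"
      return 1
    fi
    sleep 2
  done
}

verify_job_status() {       # assert the job's terminal status matches what the SE expects
  local job_id="$1" expected="$2" actual
  actual=$(curl -sf "${API_BASE_URL}/jobs/${job_id}" | jq -r '.status')
  if [[ "$actual" != "$expected" ]]; then
    echo "FAIL: job ${job_id} status is '${actual}', expected '${expected}'"
    return 1
  fi
}

# verify_step_status follows the same pattern against per-step state (status, retry count, and so on).
```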
Here’s an actual SE from the public repo, `setpoint-evals/01-retry-transient-failure/test.sh` (trimmed):
```bash
#!/bin/bash
set -e
source "${REPO_ROOT}/workflows/order-processing/setpoint-evals/shared/helpers.sh"
# Submit a job that's CONFIGURED to fail twice, then succeed
PAYLOAD=$(cat <<EOF
{
"variant": "quick-order",
"payload": { "customerId": 1, "orderId": 1, "entityId": "${EXTERNAL_SYSTEM_ID}" },
"testOptions": {
"ValidateCustomer": { "failOnAttempts": [1, 2] }
}
}
EOF
)
JOB_ID=$(initiate_job "$PAYLOAD")
poll_job "$JOB_ID" --timeout 180
verify_job_status "$JOB_ID" "completed"
verify_step_status "$JOB_ID" "ValidateCustomer" "completed" --min-retries 2
```
Read that again. It is not a unit test. It is not an integration test in the Spring/Jest sense. It is a description of a behaviour the running system is supposed to exhibit — “if a step fails twice, it should retry and the job should still complete.” If that property holds, the script exits 0. If it doesn’t, the script exits non-zero with a diagnosable log.
The repo currently ships 13 such core engine SEs and ~16 workflow-specific ones, every one of which is just a bash file with a sibling `README.md`.
The move that changes everything: write the SEs first
Here is the pattern I want you to actually steal.
When you are about to ask an agent to implement a non-trivial feature, write three to seven SEs first, before writing a single line of source code. Each SE has:
- A short `README.md` describing the user-visible behaviour it pins down.
- A `test.sh` that, when the feature works, returns `PASS`. When it doesn’t, returns `FAIL` with logs explaining what was wrong.
You can write these with the agent itself — they’re shell, not rocket science. The READMEs are the part you, the human, actually craft. They are the spec.
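To make "the directory" concrete before handing it over, a hypothetical layout might look like this. The feature and eval names are made up for illustration; only the per-eval `README.md` plus `test.sh` pairing and the `shared/helpers.sh` location mirror what the repo actually does.

```
setpoint-evals/
  feature-x/
    01-happy-path/
      README.md    # one paragraph: when X happens, the system ends up in state Y
      test.sh      # prints PASS:<seconds> / FAIL:<seconds> / TIMEOUT:<seconds>
    02-rejects-invalid-input/
      README.md
      test.sh
  shared/
    helpers.sh     # initiate_job, poll_job, verify_job_status, verify_step_status
```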
Then you point the agent at the directory and say:
“All evals in `setpoint-evals/<feature>/` should pass. Currently they all fail. Implement whatever is needed to make them pass. Run them. If any fail, diagnose why, fix, re-run. Repeat until all PASS. Do not stop. Do not ask me for confirmation. The evals are the acceptance criteria.”
And then you walk away.
```mermaid
sequenceDiagram
participant H as Human
participant A as Agent
participant S as System Under Test
participant E as SE Suite
H->>E: Write SE READMEs + test.sh stubs (the spec)
H->>A: "Make all of setpoint-evals/feature-x/ pass"
H-->>H: Walks away
loop Until all PASS
A->>S: Implement / modify code
A->>E: Run SE suite
E->>S: Submit jobs, poll state
E->>A: PASS:30 / FAIL:90 / TIMEOUT:600 + logs
A->>A: Read failure logs, plan next change
end
A->>H: All evals PASS — feature complete
```

What happens during that loop is striking. The agent has a deterministic, machine-readable, system-level definition of done. It is not asking itself “does this look right?” — it is asking the system “did it actually work?” When an SE fails, the failure log tells the agent which assertion broke and where. When it succeeds, success is unambiguous.
We have run agents for eight to twelve hours against an SE suite, occasionally peeking in to make sure they hadn’t gotten stuck, otherwise just leaving them. They get the job done. Not always elegantly. Not always with the cleanest code on first pass. But the acceptance criteria are met, and that is enormously more than you can say for an agent given a vague natural-language prompt and no horizon.
The other thing nobody quite builds: log compaction
A suite of 30 evals, each producing dozens of lines of output, is useless to an agent in raw form. It will blow the context window, drown the signal in noise, and leave the agent staring at a wall of green ANSI escape codes.
What you need — and what most teams skip — is a compactor. Something that reads the full eval output and produces a single screen describing the state of the system right now. In the public repo this is `setpoint-evals/analyze-results.sh`. You point it at a run directory and you get something like:
```
=== SE Suite Results — 2026-04-30T22:14:08 ===
Total: 29 evals
Passed: 27 (93%)
Failed: 2
Timed out: 0
Total time: 4m 37s
FAILURES:
❌ 04-ack-delays FAIL:42s
Last assertion: expected step status WAITING_FOR_ACK, got COMPLETED
Log: .results/parallel/2026-04-30T22:14:08/04-ack-delays.log:142
❌ 09-orphaned-job-recovery FAIL:90s
Last assertion: expected job status FAILED, got PROCESSING
Log: .results/parallel/2026-04-30T22:14:08/09-orphaned-job-recovery.log:88
FLAKINESS WATCH:
⚠ 06-stuck-in-progress-detection passed in 4/5 recent runs
```
This output is the agent’s perception of reality. It’s a compressed, structured, holistic state report. The agent reads it, decides what to fix, fixes it, runs the suite again, reads the new compaction. Loop closed.
Without this step, the agent drowns. With it, the agent has the equivalent of a sysadmin’s morning dashboard — the kind of single-screen view of “what’s currently broken” that every good operator builds intuitively. We just made it structured enough that an LLM can consume it directly.
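To show how little machinery the compactor itself needs, here is a stripped-down sketch. It assumes the convention from earlier in the post, where each eval's log ends with a single `PASS:<seconds>`, `FAIL:<seconds>`, or `TIMEOUT:<seconds>` line; the flakiness watch and per-assertion detail shown above are left out, and the repo's real `analyze-results.sh` does more.

```bash
#!/bin/bash
# analyze-results.sh (sketch): compact a run directory of per-eval logs into a one-screen report.
# Assumes each <eval>.log ends with a single PASS:<s>, FAIL:<s>, or TIMEOUT:<s> line.
RUN_DIR="${1:?usage: analyze-results.sh <run-dir>}"

total=0; passed=0; failed=0; timed_out=0
failures=()

for log in "$RUN_DIR"/*.log; do
  name=$(basename "$log" .log)
  result=$(grep -Eo '^(PASS|FAIL|TIMEOUT):[0-9]+' "$log" | tail -n 1)
  total=$((total + 1))
  case "$result" in
    PASS:*)    passed=$((passed + 1)) ;;
    TIMEOUT:*) timed_out=$((timed_out + 1)); failures+=("⏱  $name $result  ($log)") ;;
    *)         failed=$((failed + 1));       failures+=("❌ $name ${result:-NO_RESULT}  ($log)") ;;
  esac
done

echo "=== SE Suite Results: $(date -u +%FT%TZ) ==="
echo "Total: $total   Passed: $passed   Failed: $failed   Timed out: $timed_out"
if (( failed + timed_out > 0 )); then
  echo "FAILURES:"
  printf '  %s\n' "${failures[@]}"
fi
```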
`testOptions`: fault injection as a first-class feature
You may have noticed that the SE example above included this:

```
"testOptions": {
  "ValidateCustomer": { "failOnAttempts": [1, 2] }
}
```
That is the system being told, at runtime, “please inject a transient failure on attempts one and two of the `ValidateCustomer` step.” No mocks. No test doubles. No conditional code paths sprinkled through production logic. The real worker, talking to the real database, simply behaves badly on demand because the request asked it to.
This is fault injection as a first-class feature, baked into the workers. The supported knobs include simulated delays (`simDelay`), failure-after-attempt (`failureAfter`), failure-on-specific-attempts (`failOnAttempts`), failure-for-specific-items (`failForItemIds`), ACK delays (`ackDelay`), ACK skip (`skipAck`), pre-ACK crash (`crashBeforeAck`), and arbitrary ACK payload override (`ackPayload`).
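As a usage sketch, several knobs can ride along in one request, exactly like the `failOnAttempts` example above. The second step name and the field values below are illustrative guesses based on the knob list, not a schema taken from the repo.

```bash
# Hypothetical payload: slow one step down and delay the ACK of another (ReserveInventory is a made-up step).
PAYLOAD=$(cat <<EOF
{
  "variant": "quick-order",
  "payload": { "customerId": 1, "orderId": 2 },
  "testOptions": {
    "ValidateCustomer": { "simDelay": 5000 },
    "ReserveInventory": { "ackDelay": 30000 }
  }
}
EOF
)
```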
```mermaid
graph LR
A[SE submits job<br/>with testOptions] --> B{Worker env<br/>guard enabled?}
B -->|No| C[Ignore testOptions<br/>process normally]
B -->|Yes| D{testOptions<br/>in payload?}
D -->|No| E[Use defaults]
D -->|Yes| F[Apply requested<br/>delay/failure/crash]
C --> G[Complete normally]
E --> G
F --> G
```

The two-layer guard matters. `testOptions` only takes effect if both the deployment-side environment variable `ENABLE_REQUESTS_FOR_SIMULATED_DELAYS=true` is set on the worker and the payload contains a `testOptions` block. Production workers ship with the env var off; even a malicious request can’t induce delays or failures. Demo and CI environments ship with it on; SEs can drive arbitrary fault scenarios.
This is the move that lets SEs cover ugly, important behaviours like “if a worker crashes between completing its work and acknowledging, the job should recover within the visibility timeout.” You can describe that scenario as a request and assert on the recovery, end-to-end, with no mocks anywhere. Try writing that test in your favourite mocking framework and notice how much code, abstraction, and lying it requires.
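For illustration, an SE for that crash-recovery scenario can be as short as the retry one above. The shape follows the earlier example; the `crashBeforeAck` placement, the timeout value, and the expected terminal status are assumptions, and the repo's own orphan-recovery eval may assert something different.

```bash
#!/bin/bash
set -e
source "${REPO_ROOT}/workflows/order-processing/setpoint-evals/shared/helpers.sh"

# Ask the real worker to do its work and then crash before acknowledging it.
PAYLOAD=$(cat <<EOF
{
  "variant": "quick-order",
  "payload": { "customerId": 1, "orderId": 3, "entityId": "${EXTERNAL_SYSTEM_ID}" },
  "testOptions": {
    "ValidateCustomer": { "crashBeforeAck": true }
  }
}
EOF
)

JOB_ID=$(initiate_job "$PAYLOAD")

# After the visibility timeout the message should be redelivered and the job should still finish.
poll_job "$JOB_ID" --timeout 600
verify_job_status "$JOB_ID" "completed"
```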
Why this fits agentic engineering specifically
Conventional unit tests are useful, but they have two properties that make them poor acceptance criteria for agent work:
- They live next to the code. When the agent changes the code, it tends to also change the tests, sometimes in ways that drain the assertion of meaning. A unit test the agent itself just edited is not strong evidence of correctness.
- They test units, not behaviours. “Does the function return the right value?” is a weaker claim than “Does the system, as a whole, do the right thing when this happens?” Agents are particularly good at making unit tests pass while subtly breaking the integrated behaviour the user actually cares about.
SEs sidestep both. They live in `setpoint-evals/`, separately from the source. Their `README.md` is the contract; the `test.sh` is the executable form of that contract. The agent can edit the source freely, but if it touches the SE, you notice — and the SE doesn’t talk about functions, it talks about user-observable system state.
This makes SEs the natural unit of acceptance for agent work. You write down what the system should do. You walk away. You come back to a system that does it.
A short history of writing down what done means
Setpoint Evals are not a new idea. They sit at the end of a long lineage, and being honest about that is part of what makes the recipe credible.
| Year | Idea | Who | What was new |
|---|---|---|---|
| ~1969 | Pre/post-conditions in Hoare logic | Tony Hoare | The mathematical statement: “assert what’s true before and after an operation” |
| ~1984 | Literate programming | Donald Knuth (WEB) | Code + prose as one document; documentation is the source of truth |
| ~1998 | Test-Driven Development & xUnit | Kent Beck (JUnit) | “Write the test first” as a discipline, not just a tool |
| ~1999 | Property-based testing | Koen Claessen, John Hughes (QuickCheck, Haskell) | Specify properties the system should satisfy; the framework generates inputs |
| ~2001 | doctest in Python | Tim Peters | Examples in docstrings that execute as tests — living documentation |
| ~2002 | Fit / FitNesse | Ward Cunningham | Acceptance tests as wiki tables, executable. The closest spiritual ancestor of Setpoint Evals. |
| ~2007 | Cucumber / Gherkin (BDD) | Aslak Hellesøy | Plain-English Given/When/Then that runs as tests |
| ~2011 | Specification by Example | Gojko Adzic | A methodology — concrete, runnable examples that double as the spec |
| decades | SystemVerilog Assertions / PSL | Hardware verification community | Formal behavioural specs that run continuously alongside the device under test |
Reading this lineage carefully, what’s actually new about Setpoint Evals?
Every prior pattern in this list was designed for human readability. The wiki tables, the Gherkin sentences, the doctest snippets — all are documents a human can sit down with, read top-to-bottom, and reason about. Their result format is structured for human review.
Setpoint Evals are designed for agent consumption. The compactor (analyze-results.sh) is the part that makes the eval suite a state-of-the-system report rather than a wall of green ANSI for a human’s review screen. That single design shift changes what the result format has to look like, and reframes the entire feedback loop.
So we’re not inventing a new category. We’re taking a 40-year tradition — acceptance criteria written first, in a form a machine can verify — and adapting it for a consumer that didn’t exist when any of these tools were built. The pattern is older than the tools you’d recognise. The constraint is brand new.
The practical recipe
If you want to adopt this pattern in your own codebase:
- Pick a feature. Something concrete enough to fit on one whiteboard.
- Write three to seven SE READMEs. One paragraph each. “When X happens, the system should end up in state Y.” Don’t write the code yet.
- Stub out the `test.sh` files. Even if they all `exit 1` for now. The READMEs are the spec; the scripts are the executable form.
- Build a `helpers.sh` with the four primitives you’ll need everywhere: submit, poll, assert-status, dump-logs-on-failure. Keep it tiny.
- Build a `run-all.sh` that runs your SEs (parallel where safe, serial where destructive) and writes per-test logs to a timestamped directory (a minimal sketch follows this list).
- Build an `analyze-results.sh` that compacts a run directory into a one-screen state report. This is the unsexy step everyone skips. Don’t skip it.
- Hand it all to an agent with the prompt: “Make every SE in `setpoint-evals/feature-x/` pass. The READMEs are the spec. The eval results are the truth. Don’t stop until all are green.”
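As promised above, a minimal `run-all.sh` can be this small. It runs evals serially, captures each one's output into a timestamped run directory, and hands that directory to the compactor; parallel execution, serial-only flags, and flakiness tracking are left out, and every path here is an assumption rather than the repo's exact layout.

```bash
#!/bin/bash
# run-all.sh (sketch): run every SE and write one log per eval into a timestamped run directory.
# Assumes each eval prints its own PASS:<s> / FAIL:<s> / TIMEOUT:<s> line, as described above.
set -u

EVAL_ROOT="$(cd "$(dirname "$0")" && pwd)"
RUN_DIR="${EVAL_ROOT}/.results/$(date -u +%Y-%m-%dT%H-%M-%S)"
mkdir -p "$RUN_DIR"

for eval_dir in "$EVAL_ROOT"/[0-9][0-9]-*/; do
  name=$(basename "$eval_dir")
  echo "Running ${name}..."
  bash "${eval_dir}test.sh" >"${RUN_DIR}/${name}.log" 2>&1 || true   # keep going; the compactor reports failures
done

# Compact the run into the one-screen report the agent will actually read.
"${EVAL_ROOT}/analyze-results.sh" "$RUN_DIR"
```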
That’s the whole pattern.
What you can take from the open-source repo
The companion repository — setpoint-evals-for-agentic-engineering-example — is a fully working codebase implementing the DTM (Distributed Task Manager) with:
- A real NestJS task orchestrator (Postgres + Kafka + SQS via LocalStack)
- Four pluggable example workflows (order processing, IoT pipeline, infrastructure provisioning, and a dynamic-step plan executor)
- 13 core engine SEs covering retry, DLQ, deduplication, concurrency, stuck-state recovery, health metrics, and orphan recovery
- ~16 workflow-specific SEs covering happy paths and domain-specific edge cases
- The full `testOptions` fault-injection mechanism with the two-layer production-safety guard
- A Preact monitor dashboard (so a human can watch the agent work in real time, if they want)
- A development ACK simulator so you can run the whole stack locally without a real Kafka consumer
You don’t need to care about DTM. The orchestrator is honestly incidental — what we want you to leave with is the pattern: shell evals + a compactor + a fault-injection mechanism + the discipline to write the eval before the implementation.
Clone it. Run `./scripts/local-env.sh start --standalone --orchestrator && ./setpoint-evals/run-all.sh`. Watch 29 SEs run. Then go look at one of the simpler ones — `setpoint-evals/01-retry-transient-failure/test.sh` is a good start — and notice how little machinery there actually is.
Closing: the horizon is the unlock
The agent does not need a smarter brain. It needs a longer horizon. SEs give it one — written in shell, runnable on demand, compactable into a single screen. Once the agent has that horizon, you can hand it work that you would not previously have trusted any AI to attempt without supervision, walk away, and come back to a system that meets the spec.
If you build one thing this quarter, build a tiny setpoint-evals/ directory with five evals and a compactor. Put your most ambitious agent-driven feature behind those evals. Then leave the room.
Tell us what happens.
The codebase used in this article is a one-time public snapshot from a private DTM codebase that we use internally and continue to evolve toward integration with our products. The public version is intentionally a static example — clone it, fork it, copy the pattern, but expect us to maintain it as a teaching artifact rather than as a living open-source project.
Questions, war stories, or pull requests welcome at the repo.
🤖 Built with Valko — voice-driven AI coding.
This article was drafted inside a Valko voice session — push-to-talk → Whisper → Claude → markdown — then edited by hand. If you’d like to try the same workflow, reach out at valko.ai.