Setpoint Evals — Slide Edition

Setpoint Evals

Giving AI coding agents a long horizon

Krasimir Atanasov & Claude Opus 4.7 · Inctasoft · May 2026

Companion repo: inctasoft/setpoint-evals-for-agentic-engineering-example


The problem

AI coding agents are great at the next decision.

They are mediocre at the next 100 decisions in a row.


The standard answer is wrong

“Just write a better prompt.” “Just write a better CLAUDE.md.”

It helps. It does not solve the problem.

The agent has no horizon to aim at.


What an agent actually needs

Something it can:

  • Run on demand
  • Read the result of
  • Trust as ground truth

Not text. Executable acceptance criteria.


Enter the SE

A Setpoint Eval (SE) is a shell script.

It does four things:

  1. Submits a request to a real running system
  2. Polls the system’s state (DB, queue, logs)
  3. Asserts the state matches expectations
  4. Prints one line: PASS, FAIL, or TIMEOUT

That’s the whole abstraction.


A real SE — 12 lines

# Submit a request to the real running system
JOB_ID=$(initiate_job "$PAYLOAD_WITH_TEST_OPTIONS")
# Poll until the job reaches a terminal state (or time out)
poll_job "$JOB_ID" --timeout 180
# Assert system-observable state, down to per-step retry counts
verify_job_status "$JOB_ID" "completed"
verify_step_status "$JOB_ID" "ValidateCustomer" "completed" --min-retries 2

Pin a system-level behaviour, not a function call.

PASS exits 0. FAIL exits non-zero with a diagnosable log.
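
A sketch of the verdict plumbing, assuming helpers.sh defines it (SE_NAME and LOG_FILE are hypothetical variables, not the repo's):

# Sourced by every test.sh: print the one verdict line, set the exit code.
pass()      { echo "PASS: ${SE_NAME}"; exit 0; }
fail()      { echo "FAIL: ${SE_NAME} (see ${LOG_FILE})"; exit 1; }
timed_out() { echo "TIMEOUT: ${SE_NAME} (see ${LOG_FILE})"; exit 2; }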


The big move

Write the SEs first.

Before any source code.

The READMEs become the spec. The scripts become the executable spec.
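
One hypothetical layout (eval names illustrative; the three helper scripts match the recipe later in this deck):

setpoint-evals/
  helpers.sh              # the 4 primitives
  run-all.sh              # runs every eval, parallel where safe
  analyze-results.sh      # the compactor
  feature-x/
    01-happy-path/
      README.md           # one paragraph: the behaviour being pinned
      test.sh             # the executable spec
    02-retry-on-failure/
      README.md
      test.sh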


What you say to the agent

“All evals in setpoint-evals/feature-x/ should pass. Currently they all fail. Implement whatever is needed to make them pass. Run them. Diagnose failures. Re-run. Repeat. Don’t stop. Don’t ask for confirmation.”

Then walk away.


The loop

sequenceDiagram
    participant H as Human
    participant A as Agent
    participant E as SE Suite

    H->>E: Write spec (READMEs + test.sh)
    H->>A: "Make all evals pass"
    H-->>H: Walks away

    loop Until all PASS
        A->>A: Implement / modify code
        A->>E: Run SE suite
        E->>A: PASS / FAIL + logs
        A->>A: Diagnose, plan next change
    end

    A->>H: All evals PASS — done

What we’ve actually seen

Agents run 8–12 hours against an SE suite.

Occasionally peeked at. Mostly left alone.

They get the job done.

Not always elegantly. But the criteria are always met.


The unsung hero: log compaction

30 evals × dozens of log lines each = useless to an agent.

You need a compactor.

analyze-results.sh → one screen of state.
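
A minimal sketch of one, assuming each eval's log under .results/ ends with its one-line verdict (the layout is an assumption, not the repo's):

#!/usr/bin/env bash
# analyze-results.sh: compress a directory of eval logs into one screen.
total=0 passed=0 failures=""
for log in .results/*.log; do
  total=$((total + 1))
  verdict=$(tail -n 1 "$log")   # "PASS: ..." / "FAIL: ..." / "TIMEOUT: ..."
  if [[ $verdict == PASS* ]]; then
    passed=$((passed + 1))
  else
    failures+="  ❌ $(basename "$log" .log)"$'\n'"     $verdict  (log: $log)"$'\n'
  fi
done
echo "=== SE Suite — $passed/$total PASS ==="
[[ -n $failures ]] && printf '\nFAILURES:\n%s' "$failures"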


What the agent sees

=== SE Suite — 27/29 PASS in 4m 37s ===

FAILURES:
  ❌ 04-ack-delays
     expected WAITING_FOR_ACK, got COMPLETED
     log: .results/.../04-ack-delays.log:142

  ❌ 09-orphaned-job-recovery
     expected FAILED, got PROCESSING
     log: .results/.../09-orphaned-job-recovery.log:88

FLAKINESS WATCH:
  ⚠ 06-stuck-in-progress passed 4/5 recent runs

The agent reads this. Decides. Fixes. Re-runs.


testOptions — fault injection as a feature

"testOptions": {
  "ValidateCustomer": { "failOnAttempts": [1, 2] }
}

The real worker, talking to the real DB, behaves badly on demand.

No mocks. No test doubles. No conditional code paths in production.
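
In the SE itself, the options just ride along in the request body. A sketch of how $PAYLOAD_WITH_TEST_OPTIONS from the earlier example might be built (the workflow and input fields are illustrative; only the testOptions shape is from above):

PAYLOAD_WITH_TEST_OPTIONS=$(jq -n '{
  workflow: "customer-onboarding",
  input: { customerId: "se-test-001" },
  testOptions: {
    ValidateCustomer: { failOnAttempts: [1, 2] }
  }
}')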


The two-layer guard

graph LR
    A[Request with<br/>testOptions] --> B{Worker env<br/>guard?}
    B -->|Off| C[Ignore — process normally]
    B -->|On| D[Apply requested<br/>delay / failure / crash]

ENABLE_REQUESTS_FOR_SIMULATED_DELAYS=true only on dev/CI workers.

Production cannot be coerced into bad behaviour by a payload.
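
The flag lives in the worker's environment, never in the payload. A dev/CI launch sketch (the start command is an assumption):

# dev/CI worker only; production manifests never set this variable
export ENABLE_REQUESTS_FOR_SIMULATED_DELAYS=true
npm run start:worker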


Why this beats unit tests as agent acceptance

Unit tests:

  • Live next to the code (agent edits them away)
  • Test units, not behaviours

SEs:

  • Live in setpoint-evals/, separate
  • Test system-observable state

Acceptance criteria the agent can’t game by editing the tests away.


The recipe

  1. Pick a feature
  2. Write 3–7 SE READMEs (one paragraph each)
  3. Stub the test.sh files
  4. Build a tiny helpers.sh (4 primitives, sketched below)
  5. Build run-all.sh (parallel where safe; a serial sketch below)
  6. Build analyze-results.sh (the compactor)
  7. Hand it to an agent. Walk away.
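
For step 4, a helpers.sh sketch. The API paths, response fields, and polling cadence are all assumptions for illustration; only the four primitive names come from the example SE above:

#!/usr/bin/env bash
# helpers.sh: the four primitives, sketched against a hypothetical HTTP API.

initiate_job() {                    # submit a request, echo the new job id
  curl -s -X POST "$API_URL/jobs" \
    -H 'Content-Type: application/json' -d "$1" | jq -r '.jobId'
}

poll_job() {                        # wait for a terminal state or time out
  local job_id=$1 timeout=180 elapsed=0 status
  [ "$2" = "--timeout" ] && timeout=$3
  while [ "$elapsed" -lt "$timeout" ]; do
    status=$(curl -s "$API_URL/jobs/$job_id" | jq -r '.status')
    case "$status" in completed|failed) return 0 ;; esac
    sleep 5; elapsed=$((elapsed + 5))
  done
  echo "TIMEOUT: job $job_id still '$status' after ${timeout}s"; return 1
}

verify_job_status() {               # assert system-observable job state
  local actual
  actual=$(curl -s "$API_URL/jobs/$1" | jq -r '.status')
  [ "$actual" = "$2" ] && return 0
  echo "FAIL: job $1 expected '$2', got '$actual'"; return 1
}

verify_step_status() {              # same idea, one step deep
  local actual                      # (--min-retries handling omitted here)
  actual=$(curl -s "$API_URL/jobs/$1/steps/$2" | jq -r '.status')
  [ "$actual" = "$3" ] && return 0
  echo "FAIL: step $2 expected '$3', got '$actual'"; return 1
}

And run-all.sh (step 5) can start as a plain serial loop that feeds the compactor:

mkdir -p .results
for t in setpoint-evals/*/*/test.sh; do
  name=$(basename "$(dirname "$t")")   # e.g. 01-happy-path
  bash "$t" > ".results/$name.log" 2>&1 || true
done
setpoint-evals/analyze-results.sh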

The repo

inctasoft/setpoint-evals-for-agentic-engineering-example

  • Full NestJS task orchestrator (DTM)
  • 4 example workflows
  • 13 core SEs + ~16 workflow SEs
  • Full testOptions mechanism with prod-safety guard
  • Monitor dashboard
  • Dev ACK simulator

Forget the orchestrator. Steal the pattern.


The unlock

The agent does not need a smarter brain.

It needs a longer horizon.

SEs give it one.


Try it

Build a tiny setpoint-evals/ directory this quarter.

Five evals. A compactor.

Put your most ambitious agent-driven feature behind those evals.

Leave the room.

Tell us what happens.


🤖 Built with Valko — voice-driven AI coding.

This deck and its prose-form sibling were drafted in a Valko voice session, then edited by hand. If you’d like to try the same workflow, reach out at valko.ai.
