Setpoint Evals
Giving AI coding agents a long horizon
Krasimir Atanasov & Claude Opus 4.7 · Inctasoft · May 2026
Companion repo:
inctasoft/setpoint-evals-for-agentic-engineering-example
The problem
AI coding agents are great at the next decision.
They are mediocre at the next 100 decisions in a row.
The standard answer is wrong
“Just write a better prompt.” “Just write a better CLAUDE.md.”
It helps. It does not solve the problem.
The agent has no horizon to aim at.
What an agent actually needs
Something it can:
- Run on demand
- Read the result of
- Trust as ground truth
Not text. Executable acceptance criteria.
Enter the Setpoint Eval (SE)
A Setpoint Eval is a shell script.
It does four things:
- Submits a request to a real running system
- Polls the system’s state (DB, queue, logs)
- Asserts the state matches expectations
- Prints one line: PASS, FAIL, or TIMEOUT
That’s the whole abstraction.
A real SE — 12 lines
JOB_ID=$(initiate_job "$PAYLOAD_WITH_TEST_OPTIONS")
poll_job "$JOB_ID" --timeout 180
verify_job_status "$JOB_ID" "completed"
verify_step_status "$JOB_ID" "ValidateCustomer" "completed" --min-retries 2
Pin a system-level behaviour, not a function call.
PASS exits 0. FAIL exits non-zero with a diagnosable log.
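Those four helpers can be tiny. Here is a minimal sketch of poll_job, assuming a hypothetical JSON API at $API_URL where GET /jobs/<id> returns a status field (the repo's real helper will differ):

# Sketch only: endpoint shape and terminal states are assumptions.
poll_job() {
  local job_id=$1; shift
  local timeout=180
  [ "${1:-}" = "--timeout" ] && timeout=$2
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    local status
    status=$(curl -s "$API_URL/jobs/$job_id" | jq -r '.status')
    # Stop polling at any terminal state; the verify_* helpers
    # assert which state the job actually reached.
    case "$status" in
      completed|failed) return 0 ;;
    esac
    sleep 2
  done
  echo "TIMEOUT"
  return 1
}

The other primitives are the same size: one request or one query, one assertion.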
The big move
Write the SEs first.
Before any source code.
The READMEs become the spec. The scripts become the executable spec.
What you say to the agent
“All evals in setpoint-evals/feature-x/ should pass. Currently they all fail. Implement whatever is needed to make them pass. Run them. Diagnose failures. Re-run. Repeat. Don’t stop. Don’t ask for confirmation.”
Then walk away.
The loop
sequenceDiagram
participant H as Human
participant A as Agent
participant E as SE Suite
H->>E: Write spec (READMEs + test.sh)
H->>A: "Make all evals pass"
H-->>H: Walks away
loop Until all PASS
A->>A: Implement / modify code
A->>E: Run SE suite
E->>A: PASS / FAIL + logs
A->>A: Diagnose, plan next change
end
A->>H: All evals PASS — done
What we’ve actually seen
Agents run 8–12 hours against an SE suite.
Occasionally peeked at. Mostly left alone.
They get the job done.
Not always elegantly. But the criteria are always met.
The unsung hero: log compaction
30 evals × dozens of log lines = useless to an agent.
You need a compactor.
analyze-results.sh → one screen of state.
What the agent sees
=== SE Suite — 27/29 PASS in 4m 37s ===
FAILURES:
❌ 04-ack-delays
expected WAITING_FOR_ACK, got COMPLETED
log: .results/.../04-ack-delays.log:142
❌ 09-orphaned-job-recovery
expected FAILED, got PROCESSING
log: .results/.../09-orphaned-job-recovery.log:88
FLAKINESS WATCH:
⚠ 06-stuck-in-progress passed 4/5 recent runs
The agent reads this. Decides. Fixes. Re-runs.
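The compactor does not need to be clever. A stripped-down sketch (no timing, no flakiness tracking), assuming each eval writes .results/<name>.log and prints PASS, FAIL, or TIMEOUT as its final line; the file layout is illustrative, not the repo's exact scheme:

#!/usr/bin/env bash
# analyze-results.sh (sketch): collapse N eval logs into one screen.
set -euo pipefail

pass=0 fail=0
failures=()
for log in .results/*.log; do
  name=$(basename "$log" .log)
  verdict=$(tail -n 1 "$log")
  if [ "$verdict" = "PASS" ]; then
    pass=$((pass + 1))
  else
    fail=$((fail + 1))
    # Keep the first assertion failure as the one-line reason.
    reason=$(grep -m1 -E 'expected|TIMEOUT' "$log" || echo "$verdict")
    failures+=("❌ $name"$'\n'"   $reason"$'\n'"   log: $log")
  fi
done

echo "=== SE Suite: $pass/$((pass + fail)) PASS ==="
if [ "$fail" -gt 0 ]; then
  echo "FAILURES:"
  printf '%s\n' "${failures[@]}"
fi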
testOptions — fault injection as a feature
"testOptions": {
"ValidateCustomer": { "failOnAttempts": [1, 2] }
}
The real worker, talking to the real DB, behaves badly on demand.
No mocks. No test doubles. No conditional code paths in production.
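The options ride in on an otherwise normal request. A hedged example; the endpoint and surrounding payload are illustrative, not the repo's exact API:

# Endpoint and payload shape are assumptions, not the repo's API.
curl -s -X POST "$API_URL/jobs" \
  -H 'Content-Type: application/json' \
  -d '{
    "workflow": "order-processing",
    "input": { "customerId": "c-123" },
    "testOptions": {
      "ValidateCustomer": { "failOnAttempts": [1, 2] }
    }
  }'

Strip testOptions and this is exactly what production traffic looks like.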
The two-layer guard
graph LR
A[Request with<br/>testOptions] --> B{Worker env<br/>guard?}
B -->|Off| C[Ignore — process normally]
B -->|On| D[Apply requested<br/>delay / failure / crash]
ENABLE_REQUESTS_FOR_SIMULATED_DELAYS=true only on dev/CI workers.
Production cannot be coerced into bad behaviour by a payload.
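The guard itself is one environment variable, set only where faults are allowed. A sketch (the start command is an assumption):

# Dev/CI workers only: the one place the flag is ever set.
export ENABLE_REQUESTS_FOR_SIMULATED_DELAYS=true
npm run start:worker

# Production workers: flag unset, so testOptions in payloads are ignored.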
Why this beats unit tests as agent acceptance
Unit tests:
- Live next to the code (agent edits them away)
- Test units, not behaviours
SEs:
- Live in setpoint-evals/, separate from the code
- Test system-observable state
Acceptance criteria the agent can’t game by editing alone.
The recipe
- Pick a feature
- Write 3–7 SE READMEs (one paragraph each)
- Stub the test.sh files
- Build a tiny helpers.sh (4 primitives)
- Build run-all.sh (parallel where safe; sketched below)
- Build analyze-results.sh (the compactor)
- Hand it to an agent. Walk away.
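run-all.sh can stay just as small. A sketch that launches every eval in parallel and collects exit codes, assuming a setpoint-evals/<name>/test.sh layout (the repo's version also gates which evals are safe to run concurrently):

#!/usr/bin/env bash
# run-all.sh (sketch): run every eval in parallel, then compact.
set -uo pipefail

mkdir -p .results
pids=()
for dir in setpoint-evals/*/; do
  name=$(basename "$dir")
  "$dir/test.sh" > ".results/$name.log" 2>&1 &
  pids+=($!)
done

rc=0
for pid in "${pids[@]}"; do
  wait "$pid" || rc=1   # any FAIL or TIMEOUT fails the suite
done

./analyze-results.sh
exit $rc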
The repo
inctasoft/setpoint-evals-for-agentic-engineering-example
- Full NestJS task orchestrator (DTM)
- 4 example workflows
- 13 core SEs + ~16 workflow SEs
- Full testOptions mechanism with prod-safety guard
- Monitor dashboard
- Dev ACK simulator
Forget the orchestrator. Steal the pattern.
The unlock
The agent does not need a smarter brain.
It needs a longer horizon.
SEs give it one.
Try it
Build a tiny setpoint-evals/ directory this quarter.
Five evals. A compactor.
Put your most ambitious agent-driven feature behind those evals.
Leave the room.
Tell us what happens.
🤖 Built with Valko — voice-driven AI coding.
This deck and its prose-form sibling were drafted in a Valko voice session, then edited by hand. If you’d like to try the same workflow, reach out at valko.ai.