Situational — felt-sense judgment over rule-following
Status: Candidate — awaiting founder verification. Why this page exists: Pillar #6. The most subjective pillar — and the one that separates good agent behavior from rote scripts.
TL;DR
Agents detect when work is done by outcome, not by retry counts. The signal is "did the thing I wanted to happen actually happen?" — not "did I run the loop three times without crashing?" Felt-sense done-detection is calibrated by persistent memory of past sessions.
The deeper read
The dumbest version of an agent loops a tool call until it returns success or until a retry counter expires. The smarter version checks the world: is the file written? Does the deploy URL respond with the expected content? Did the page render without a horizontal scrollbar at 375px?
Situational judgment is the gap between those two postures. It's the thing that makes the agent say "this technically returned 200 but the page is broken — I'm not done."
This is hard for current models out of the box. They tend to optimize for "the tool call succeeded" because that's a clear, machine-readable signal. Apiary trains the substrate to expect an outcome and to keep going until the outcome is verified.
The four moves of situational judgment
1. STATE THE OUTCOME
Before the work starts, the agent writes down what "done" looks like.
Not "I will run npm build" — "the build will pass, /docs will
render, the candidate pages will show status: candidate."
2. VERIFY THE OUTCOME
When the tool work appears finished, the agent CHECKS THE OUTCOME.
curl the URL. Take a screenshot. Read the file back. Compare against
the stated outcome.
3. NOTICE THE MISMATCH
If the verification disagrees with the expectation, the agent does
NOT mark the task done. It notices the gap, names the gap, and
plans the next move.
4. STOP WHEN THE OUTCOME IS REAL
Not when the retry counter runs out. Not when the loop has executed
three times. When the outcome — the actual world-state — matches
what was specified.Calibration via memory
Felt-sense isn't innate to the LLM; it's learned across sessions. Persistent memory stores:
- What "done" has historically looked like for similar tasks. The agent has a library of completed outcomes to compare against.
- What false-positives have looked like. "Last time the API returned 200 but the data was empty. Verify the data, not just the status code."
- What the founder has overridden as not-done. When the founder rejects "I'm done," the rejection becomes a memory.
Over enough sessions, the system develops a sense of what "actually finished" feels like in this codebase, with this founder, on this kind of task.
Why retry counts are the wrong signal
Retry counts measure effort, not completion. An agent that retries three times and still hasn't deployed the thing is not done — it's stuck. An agent that succeeded on the first try but never verified the deploy URL is not done — it's lucky.
Situational replaces "effort exhausted" with "outcome confirmed."
What this looks like in practice
A founder asks: "Build /docs and seed it with 15 candidate pages."
- Rule-following agent. Generates 15 files, runs npm build, sees no errors, declares done. (May still ship broken pages with mojibake or malformed frontmatter.)
- Situational agent. Generates 15 files, runs npm build, then opens /docs in dev mode, walks through each page, verifies the status pill renders, confirms no candidate page accidentally got
status: verified. Declares done only when the verification matches the specification.
The second agent is more expensive. It is also the one you actually want.
Related
Source quotes
"Agents detect when the work is done by outcome, not by retry counts. Calibrated by persistent memory."