15. Evaluation, Debugging, and the Capstone Project

The biggest illusion in agent development is "that last run looked fine, so it must be correct." Model output is nondeterministic, real repositories are complex, and tools and context carry a lot of state. Without an evaluation and debugging system, you will re-test by hand after every prompt or tool-description change, and you still won't be able to tell where a regression came from.

Three layers of testing

A teaching project needs at least three layers of tests:

Unit tests: tool parameter validation, path resolution, output truncation, the write queue.
Loop tests: the faux provider drives tool use, error feedback, steering, and compaction.
Replay tests: read a session log, rebuild the context, and verify the projection result.

End-to-end tests against a real model are smoke tests only. They can tell you the system is wired up, but their output varies from run to run and every run costs time and money, so they are unsuitable as your primary regression suite. The primary regression suite must be deterministic, cheap, and repeatable.
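
As a minimal sketch of a loop test, the following drives a tiny agent loop with a scripted faux provider and checks that a tool error is fed back and then corrected. Every name here (FauxProvider, runLoop, readTool, the message shapes) is an assumption made for illustration, not a fixed API, and the loop is synchronous only to keep the example short; a real provider call would be asynchronous.

```typescript
// Hypothetical names throughout: FauxProvider, runLoop, readTool.
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type AssistantTurn =
  | { stopReason: "tool_use"; toolCalls: ToolCall[] }
  | { stopReason: "end_turn"; text: string };
type Message =
  | { role: "user"; text: string }
  | { role: "assistant"; turn: AssistantTurn }
  | { role: "tool"; callId: string; isError: boolean; text: string };
type ToolResult = { isError: boolean; text: string };
type Tool = (args: Record<string, unknown>) => ToolResult;

interface Provider {
  complete(context: Message[]): AssistantTurn;
}

// Replays a scripted list of turns and records every context it receives.
class FauxProvider implements Provider {
  readonly seen: Message[][] = [];
  private readonly script: AssistantTurn[];
  private cursor = 0;

  constructor(script: AssistantTurn[]) {
    this.script = script;
  }

  complete(context: Message[]): AssistantTurn {
    this.seen.push(context.slice());
    const turn = this.script[this.cursor];
    if (turn === undefined) throw new Error("faux provider script exhausted");
    this.cursor += 1;
    return turn;
  }
}

// In-memory stand-in for the file system so the test stays deterministic.
const files: Record<string, string> = { "src/index.ts": "export const x = 1;" };

const readTool: Tool = (args) => {
  const path = args["path"];
  if (typeof path !== "string") {
    return { isError: true, text: 'Invalid arguments: "path" must be a string.' };
  }
  const body = files[path];
  if (body === undefined) return { isError: true, text: `File not found: ${path}` };
  return { isError: false, text: body };
};

// The loop continues only while the stop reason asks for tools.
function runLoop(provider: Provider, tools: Record<string, Tool>, goal: string, maxTurns = 8): Message[] {
  const context: Message[] = [{ role: "user", text: goal }];
  for (let i = 0; i < maxTurns; i++) {
    const turn = provider.complete(context);
    context.push({ role: "assistant", turn });
    if (turn.stopReason !== "tool_use") return context;
    for (const call of turn.toolCalls) {
      const tool = tools[call.name];
      const result: ToolResult = tool
        ? tool(call.args)
        : { isError: true, text: `Unknown tool: ${call.name}` };
      // Errors go back into the context instead of aborting the loop.
      context.push({ role: "tool", callId: call.id, isError: result.isError, text: result.text });
    }
  }
  throw new Error("turn limit reached");
}

function expectTrue(cond: boolean, label: string): void {
  if (!cond) throw new Error(`failed: ${label}`);
}

// Script: wrong parameter name first, corrected call second, then a final answer.
const provider = new FauxProvider([
  { stopReason: "tool_use", toolCalls: [{ id: "c1", name: "read", args: { file: "src/index.ts" } }] },
  { stopReason: "tool_use", toolCalls: [{ id: "c2", name: "read", args: { path: "src/index.ts" } }] },
  { stopReason: "end_turn", text: "src/index.ts exports x = 1." },
]);
const transcript = runLoop(provider, { read: readTool }, "What does src/index.ts export?");

const secondRequest = provider.seen[1] ?? [];
const lastSeen = secondRequest[secondRequest.length - 1];
expectTrue(lastSeen !== undefined && lastSeen.role === "tool" && lastSeen.isError, "error was fed back to the model");
const corrected = transcript[4];
expectTrue(corrected !== undefined && corrected.role === "tool" && !corrected.isError, "second call succeeded");
```

Because the provider is scripted, the test costs nothing to run and fails the same way every time. Recording each context in seen also lets you assert on exactly what the model would have received.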

Session replay

The session log is natural debugging material. A failed task can be exported as JSONL; the test harness reads it, rebuilds the active branch, and asserts:

Whether the context contains the system rules it should.
Whether the compaction summary preserves the key facts.
Whether the last tool result correctly enters the next turn.
Whether a model switch affects subsequent requests.
Whether the branch leaf points at the expected entry.

This is far more reliable than screenshots or a hand-written bug description.
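
A replay test along these lines could look like the sketch below. It assumes a log format in which every JSONL line carries an id and a parentId, so the active branch is recovered by walking parent links from the leaf; the entry shapes, the model_change entry, and the function names are all hypothetical, and your own log schema will differ.

```typescript
// Assumed entry shape: each JSONL line has an id and a parentId (null at the root).
type Entry =
  | { id: string; parentId: string | null; type: "message"; role: "system" | "user" | "assistant" | "tool"; text: string }
  | { id: string; parentId: string | null; type: "model_change"; model: string };

function parseJsonl(raw: string): Entry[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as Entry);
}

// Walk parent links from the leaf back to the root, then reverse into log order.
function activeBranch(entries: Entry[], leafId: string): Entry[] {
  const byId = new Map<string, Entry>();
  for (const entry of entries) byId.set(entry.id, entry);
  const branch: Entry[] = [];
  let cursor: string | null = leafId;
  while (cursor !== null) {
    const entry: Entry | undefined = byId.get(cursor);
    if (entry === undefined) throw new Error(`missing entry ${cursor}`);
    branch.push(entry);
    cursor = entry.parentId;
  }
  return branch.reverse();
}

// Project the branch into what the next request would contain.
function projectRequest(branch: Entry[], defaultModel: string): { model: string; messages: { role: string; text: string }[] } {
  let model = defaultModel;
  const messages: { role: string; text: string }[] = [];
  for (const entry of branch) {
    if (entry.type === "model_change") model = entry.model;
    else messages.push({ role: entry.role, text: entry.text });
  }
  return { model, messages };
}

// A small exported log with one abandoned branch (e3) and a model switch (e4).
const exported = [
  { id: "e1", parentId: null, type: "message", role: "system", text: "Only edit files inside the workspace." },
  { id: "e2", parentId: "e1", type: "message", role: "user", text: "Rename foo to bar." },
  { id: "e3", parentId: "e2", type: "message", role: "assistant", text: "Abandoned attempt." },
  { id: "e4", parentId: "e2", type: "model_change", model: "model-b" },
  { id: "e5", parentId: "e4", type: "message", role: "assistant", text: "Calling edit." },
  { id: "e6", parentId: "e5", type: "message", role: "tool", text: "edit: 3 replacements" },
].map((e) => JSON.stringify(e)).join("\n");

const branch = activeBranch(parseJsonl(exported), "e6");
const request = projectRequest(branch, "model-a");
```

From here the assertions map directly onto the list above: the first projected message is the system rule, the abandoned entry e3 is absent, the last message is the tool result, and request.model reflects the switch.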

Cost and performance

The agent should record token usage, latency, tool time, and retry counts for every model request. Without this data, you can't answer "why did this task take so long" or "which model is the most expensive." Cost information can be recorded as events and session entries; it doesn't have to enter the model context on every request.

Common metrics:

Input tokens.
Output tokens.
Cache read/write tokens.
Provider latency.
Tool latency.
Retry count.
Compaction count.
Estimated cost.
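
One way to record these metrics is a flat record per model request plus a small aggregator, sketched below. The field names, the summarize helper, and especially the prices are assumptions for illustration; the prices are made-up numbers in USD per million tokens, not real provider rates. Compaction count is a per-session figure and is left out of this per-request record.

```typescript
type RequestMetrics = {
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
  providerLatencyMs: number;
  toolLatencyMs: number;
  retries: number;
};

// Hypothetical prices in USD per million tokens.
type Price = { input: number; output: number; cacheRead: number; cacheWrite: number };
type ModelSummary = { requests: number; retries: number; latencyMs: number; costUsd: number };

function estimateCostUsd(m: RequestMetrics, p: Price): number {
  const weighted =
    m.inputTokens * p.input +
    m.outputTokens * p.output +
    m.cacheReadTokens * p.cacheRead +
    m.cacheWriteTokens * p.cacheWrite;
  return weighted / 1000000;
}

// Group records by model so "which model is the most expensive" has an answer.
function summarize(records: RequestMetrics[], prices: Record<string, Price>): Record<string, ModelSummary> {
  const out: Record<string, ModelSummary> = {};
  for (const m of records) {
    const price = prices[m.model];
    if (price === undefined) throw new Error(`no price configured for ${m.model}`);
    const s = out[m.model] ?? { requests: 0, retries: 0, latencyMs: 0, costUsd: 0 };
    s.requests += 1;
    s.retries += m.retries;
    s.latencyMs += m.providerLatencyMs + m.toolLatencyMs;
    s.costUsd += estimateCostUsd(m, price);
    out[m.model] = s;
  }
  return out;
}

const prices: Record<string, Price> = {
  "model-a": { input: 2, output: 10, cacheRead: 1, cacheWrite: 4 },
  "model-b": { input: 1, output: 2, cacheRead: 0, cacheWrite: 0 },
};

const records: RequestMetrics[] = [
  { model: "model-a", inputTokens: 200000, outputTokens: 10000, cacheReadTokens: 100000, cacheWriteTokens: 0, providerLatencyMs: 1200, toolLatencyMs: 300, retries: 0 },
  { model: "model-a", inputTokens: 300000, outputTokens: 20000, cacheReadTokens: 0, cacheWriteTokens: 50000, providerLatencyMs: 2000, toolLatencyMs: 500, retries: 1 },
  { model: "model-b", inputTokens: 1000000, outputTokens: 500000, cacheReadTokens: 0, cacheWriteTokens: 0, providerLatencyMs: 800, toolLatencyMs: 0, retries: 0 },
];

const summary = summarize(records, prices);
```

Keeping the records outside the model context means they can be emitted as events or appended to the session log without spending tokens on them.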

The capstone project: tiny-agent

The capstone is not a clone of some existing product; it is proof that you understand the engineering boundaries of an agent. Pick a small TypeScript repository and have tiny-agent complete one real change:

Read the user's goal.
Search for relevant files.
Read the target files.
Implement the change with precise edits.
Run the specified check command.
Keep fixing based on failures.
Produce a final summary.
Write a complete session log.
After exit, resume the session and explain what was just done.

When grading, don't just check whether the final code is correct; check whether the process is auditable.

Capstone acceptance checklist

Your tiny-agent must satisfy:

Providers are swappable, with at least a real provider and a faux provider.
Tool calls are driven by the stop reason, not by hard-coded steps.
Failed tool parameter validation is fed back to the model.
read/edit/write/bash all have path boundaries and truncation policies.
Write operations separate diff details from the model-facing summary.
The session log is append-only JSONL.
Resuming does not re-execute historical tools.
Compaction never deletes history; it only changes the context projection.
Steering and follow-up tasks have independent queues.
In JSON mode, stdout emits machine-readable events only.
The permission gate applies consistently across all shells.
The faux provider tests cover at least one tool-error self-correction scenario.
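
To make one of these requirements concrete, here is a sketch of compaction as a projection over an append-only log. It assumes a compaction is itself a log entry that stores a summary and the sequence number of the first entry to keep verbatim; the entry shapes and the projectContext name are hypothetical, and presenting the summary as a user message is just one possible choice.

```typescript
type MessageEntry = { seq: number; type: "message"; role: "user" | "assistant" | "tool"; text: string };
// Assumed shape: a compaction records a summary and where verbatim history resumes.
type CompactionEntry = { seq: number; type: "compaction"; summary: string; firstKeptSeq: number };
type LogEntry = MessageEntry | CompactionEntry;
type ContextMessage = { role: string; text: string };

// Read-only view of the log: projection can never delete or rewrite history.
function projectContext(log: readonly LogEntry[]): ContextMessage[] {
  let latest: CompactionEntry | undefined;
  for (const entry of log) {
    if (entry.type === "compaction") latest = entry;
  }
  const context: ContextMessage[] = [];
  if (latest) {
    context.push({ role: "user", text: `Summary of earlier work: ${latest.summary}` });
  }
  const cutoff = latest ? latest.firstKeptSeq : 0;
  for (const entry of log) {
    if (entry.type === "message" && entry.seq >= cutoff) {
      context.push({ role: entry.role, text: entry.text });
    }
  }
  return context;
}

const log: LogEntry[] = [
  { seq: 1, type: "message", role: "user", text: "Fix the failing test in parser.ts." },
  { seq: 2, type: "message", role: "assistant", text: "Reading parser.ts." },
  { seq: 3, type: "message", role: "tool", text: "parser.ts contents (long)" },
  { seq: 4, type: "message", role: "assistant", text: "The bug is an off-by-one in nextToken." },
];
const before = projectContext(log);

// Compaction is appended like any other entry; nothing above it is removed.
log.push({ seq: 5, type: "compaction", summary: "Goal: fix parser test. Finding: off-by-one in nextToken.", firstKeptSeq: 4 });
log.push({ seq: 6, type: "message", role: "user", text: "Go ahead and patch it." });
const after = projectContext(log);
```

The log still holds all six entries after compaction, so a replay test can rebuild either projection, and an audit can always see what the summary replaced.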

Debugging handbook

When the agent misbehaves, investigate in this order:

Check the session log to confirm the facts were written.
Check the context projection to confirm what was sent to the model is correct.
Check the provider adapter to confirm the stop reason and tool calls were converted correctly.
Check the tool result to confirm the error message tells the model what to fix.
Check the system prompt to confirm rule ordering and conflicts.
Only then suspect the model itself.

This order matters. Many "the model won't listen" problems are actually a constraint missing from the context projection, or a tool error written in a way the model can't act on.
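
As an illustration of a tool error the model can act on, the sketch below validates arguments against a deliberately simple schema and produces a message that names each problem and says what to do next. The schema format, validateArgs, toolErrorText, and the edit tool's parameters are all assumptions for this example; a real project would more likely use a JSON Schema validator.

```typescript
type ParamType = "string" | "number" | "boolean";
// Hypothetical, simplified schema: flat parameters with primitive types only.
type ToolSchema = { name: string; params: Record<string, { type: ParamType; required: boolean }> };

function validateArgs(schema: ToolSchema, args: Record<string, unknown>): string[] {
  const problems: string[] = [];
  const known = Object.keys(schema.params);
  for (const key of known) {
    const spec = schema.params[key];
    if (spec === undefined) continue;
    const value = args[key];
    if (value === undefined) {
      if (spec.required) problems.push(`missing required parameter "${key}" (${spec.type})`);
    } else if (typeof value !== spec.type) {
      problems.push(`parameter "${key}" must be ${spec.type}, got ${typeof value}`);
    }
  }
  for (const key of Object.keys(args)) {
    if (known.indexOf(key) === -1) {
      problems.push(`unknown parameter "${key}"; expected one of: ${known.join(", ")}`);
    }
  }
  return problems;
}

// Name the tool, list each problem, and end with the next step for the model.
function toolErrorText(schema: ToolSchema, problems: string[]): string {
  const lines = [`Tool "${schema.name}" was not executed because its arguments are invalid:`];
  for (const p of problems) lines.push(`- ${p}`);
  lines.push("Fix the arguments and call the tool again.");
  return lines.join("\n");
}

const editSchema: ToolSchema = {
  name: "edit",
  params: {
    path: { type: "string", required: true },
    oldText: { type: "string", required: true },
    newText: { type: "string", required: true },
  },
};

const problems = validateArgs(editSchema, { file: "a.ts", oldText: 1 });
const errorText = toolErrorText(editSchema, problems);
```

Compare this with a bare "invalid arguments" string: the model gets the parameter names, the expected types, and the list of valid keys, which is usually enough for it to correct the call on the next turn.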

Closing words

The heart of building a coding agent is not finding a magic prompt; it is placing a nondeterministic model inside deterministic engineering boundaries: clear protocols, controlled tools, recoverable state, observable events, auditable permissions, and bounded extensions. Get these right, and gains in model capability naturally become gains in system capability; get them wrong, and the stronger the model, the harder the system is to control.