Build a Coding Agent from Scratch

08. The Session Log and Recovery

The source of truth for an agent runtime should be the session log, not the in-memory messages array. Memory gets lost, UIs get refreshed, processes crash, and users come back tomorrow to pick up where they left off. As long as the log is complete enough, you can rebuild the context, display history, audit tool calls, resume unfinished tasks, and even fork from any intermediate node.

Why saving only the prompt is not enough

Many minimal implementations only save the current messages on exit. This has several problems:

  • Tool execution progress and errors are lost.
  • Non-message events like model switches, compaction, and user steering are lost.
  • Branching cannot be expressed; you can only overwrite linearly.
  • On a crash, the last chunk of state may never have been written at all.

An agent session is not a chat transcript; it is an event log. The chat UI is just one projection of the log.

A JSONL append-only log

For a teaching project, JSONL works well: one entry per line, appended as you go. On a crash, at most the last line is corrupted, and everything before it can still be parsed. Every entry needs at least an id, timestamp, and type; every entry except the root session header also needs a parentId.

type SessionEntry =
  | { type: "session"; id: string; timestamp: string; cwd: string; version: number }
  | { type: "message"; id: string; parentId: string; timestamp: string; message: Message }
  | { type: "model_change"; id: string; parentId: string; timestamp: string; model: string }
  | { type: "compaction"; id: string; parentId: string; timestamp: string; summary: string; firstKeptEntryId: string }
  | { type: "custom"; id: string; parentId: string; timestamp: string; customType: string; data: unknown };

parentId turns the session into a tree rather than just an array. When the user resumes from some point in history, you can create a new branch; the original follow-up records are still preserved. The currently active branch can be identified by a leaf id.
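As a rough illustration of appending entries that are linked by parentId, here is a minimal sketch assuming Node.js and its built-in fs module. The SessionWriter class, its append method, the counter-based ids, and the use of null as the root's parentId are all hypothetical choices made for this example, not part of the chapter's design.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Simplified entry shape for this sketch; type-specific fields ride along as extras.
type Entry = {
  type: string;
  id: string;
  parentId: string | null; // null stands in for "no parent" on the root session entry
  timestamp: string;
  [extra: string]: unknown;
};

// Hypothetical helper: appends entries to a JSONL file and remembers the active leaf.
class SessionWriter {
  private readonly filePath: string;
  private leafId: string | null = null;
  private counter = 0;

  constructor(filePath: string) {
    this.filePath = filePath;
  }

  // Write one entry as one line. Its parent is the current leaf, so
  // consecutive appends form a chain that starts at the root.
  append(type: string, fields: Record<string, unknown> = {}): Entry {
    this.counter += 1;
    const entry: Entry = {
      ...fields,
      type,
      id: `e${this.counter}`, // placeholder id scheme; a real log would want unique ids
      parentId: this.leafId,
      timestamp: new Date().toISOString(),
    };
    // appendFileSync only adds bytes at the end; earlier lines are never rewritten.
    fs.appendFileSync(this.filePath, JSON.stringify(entry) + "\n", "utf8");
    this.leafId = entry.id;
    return entry;
  }
}

// Usage: write three entries into a temporary file and read the lines back.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "session-"));
const logPath = path.join(dir, "session.jsonl");
const writer = new SessionWriter(logPath);

const header = writer.append("session", { cwd: "/repo", version: 1 });
const request = writer.append("message", { message: { role: "user", content: "Fix the tests" } });
const modelSwitch = writer.append("model_change", { model: "example" });

const lines = fs.readFileSync(logPath, "utf8").trimEnd().split("\n");
console.log(lines.length); // one line per appended entry
```

Writing each entry in a single append call keeps the failure mode simple: a crash can only damage the line being written, never the lines before it.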

The log is the source of truth; the context is a projection

When building the model context, you do not read the whole log and send it to the model. Instead, you walk from the current leaf back to the root to get the active branch, then project the entries into LLM messages. The projection rules can be:

  • message entries go into the context.
  • model_change does not go into the context, but it determines which model subsequent requests use.
  • custom stays out of the context by default, unless an extension declares it to be a message.
  • compaction enters the context as a single summary message that stands in for everything before firstKeptEntryId; entries from firstKeptEntryId onward stay in the context unchanged.

This way the log records the complete facts, while the context contains only what the model needs to keep working.
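The projection rules above can be sketched as two small functions: one that walks from the leaf to the root, and one that turns the resulting branch into a model name plus a message list. This is only a sketch under stated assumptions: the entry types are simplified (no timestamps, plain-string content), the function names activeBranch and projectContext are hypothetical, the summary is injected as a user-role message, and firstKeptEntryId is read as "keep this entry and everything after it".

```typescript
// Simplified stand-ins for the chapter's types; timestamps are omitted for brevity.
type Message = { role: string; content: string };
type SessionEntry =
  | { type: "session"; id: string }
  | { type: "message"; id: string; parentId: string; message: Message }
  | { type: "model_change"; id: string; parentId: string; model: string }
  | { type: "compaction"; id: string; parentId: string; summary: string; firstKeptEntryId: string }
  | { type: "custom"; id: string; parentId: string; customType: string; data: unknown };
type Compaction = Extract<SessionEntry, { type: "compaction" }>;

// Follow parentId links from the leaf to the root, then reverse into root-to-leaf order.
function activeBranch(entries: SessionEntry[], leafId: string): SessionEntry[] {
  const byId = new Map<string, SessionEntry>();
  for (const entry of entries) byId.set(entry.id, entry);
  const branch: SessionEntry[] = [];
  let current: SessionEntry | undefined = byId.get(leafId);
  while (current !== undefined) {
    branch.push(current);
    current = current.type === "session" ? undefined : byId.get(current.parentId);
  }
  return branch.reverse();
}

// Apply the projection rules to a root-to-leaf branch.
function projectContext(branch: SessionEntry[], defaultModel: string): { model: string; messages: Message[] } {
  let compaction: Compaction | undefined;
  for (const entry of branch) {
    if (entry.type === "compaction") compaction = entry; // the latest compaction wins
  }
  const messages: Message[] = [];
  if (compaction !== undefined) {
    // Assumption: the summary is injected as a user-role message.
    messages.push({ role: "user", content: `Summary of earlier work: ${compaction.summary}` });
  }
  let model = defaultModel;
  let kept = compaction === undefined; // without a compaction, every message is kept
  for (const entry of branch) {
    if (compaction !== undefined && (entry.id === compaction.firstKeptEntryId || entry.id === compaction.id)) {
      kept = true;
    }
    if (entry.type === "model_change") model = entry.model; // affects requests, never the messages
    else if (entry.type === "message" && kept) messages.push(entry.message);
    // "custom" entries fall through here: they stay out of the context by default.
  }
  return { model, messages };
}

const entries: SessionEntry[] = [
  { type: "session", id: "s1" },
  { type: "message", id: "m1", parentId: "s1", message: { role: "user", content: "Fix the tests" } },
  { type: "model_change", id: "mc1", parentId: "m1", model: "model-b" },
  { type: "message", id: "m2", parentId: "mc1", message: { role: "assistant", content: "Running npm test" } },
  { type: "compaction", id: "c1", parentId: "m2", summary: "User asked to fix the tests", firstKeptEntryId: "m2" },
  { type: "message", id: "m3", parentId: "c1", message: { role: "user", content: "Continue" } },
];
const context = projectContext(activeBranch(entries, "m3"), "model-a");
console.log(context.model, context.messages.map((m) => m.content));
```

In this example m1 is replaced by the summary, m2 and m3 survive, and the model_change still takes effect even though it sits in the summarized part of the branch.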

The recovery flow

When recovering a session, the runtime should:

  1. Parse the JSONL, skipping or reporting corrupted lines.
  2. Validate the session version and migrate old entries if necessary.
  3. Find the current leaf and build the active branch.
  4. Rebuild the messages, model, compaction state, and queue state from the entries.
  5. Re-create the tools and the provider.
  6. Let the UI project the history from the log, instead of asking the model to recite it.

Recovery should not automatically re-execute historical tools. Tool results are already facts. Unless the user explicitly asks for a rerun (of the tests, say), recovery only rebuilds state.
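The first three steps of the recovery flow might be sketched as follows. This is a minimal example, not a complete recovery routine: the names recoverSession, isRawEntry, and SUPPORTED_VERSION are hypothetical, the function works on the file contents as a string so it stays independent of any file API, and it assumes the current leaf is simply the last entry that parsed successfully. Note that it only reads the log; no tool is executed.

```typescript
// Loosely typed entry: these steps only need the fields every entry shares.
type RawEntry = { type: string; id: string; parentId?: string; [extra: string]: unknown };

const SUPPORTED_VERSION = 1; // assumption: this runtime understands log version 1 only

// Minimal shape check so that valid JSON that is not an entry is also reported.
function isRawEntry(value: unknown): value is RawEntry {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return typeof record["type"] === "string" && typeof record["id"] === "string";
}

function recoverSession(text: string): { entries: RawEntry[]; leafId: string; corruptedLines: number[] } {
  const entries: RawEntry[] = [];
  const corruptedLines: number[] = [];

  // Step 1: parse line by line; a bad line is recorded and skipped, not fatal.
  text.split("\n").forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      const value: unknown = JSON.parse(line);
      if (!isRawEntry(value)) throw new Error("not a session entry");
      entries.push(value);
    } catch {
      corruptedLines.push(index + 1); // 1-based line numbers for reporting
    }
  });

  // Step 2: validate the header; a real runtime would migrate old versions here.
  const header = entries[0];
  if (header === undefined || header.type !== "session") throw new Error("missing session header");
  if (header["version"] !== SUPPORTED_VERSION) throw new Error("unsupported session version");

  // Step 3 (assumption): the active leaf is the most recently appended valid entry.
  const leafId = entries[entries.length - 1].id;
  return { entries, leafId, corruptedLines };
}

const savedLog = [
  '{"type":"session","id":"s1","cwd":"/repo","version":1}',
  '{"type":"message","id":"m1","parentId":"s1","message":{"role":"user","content":"Fix the tests"}}',
  '{"type":"message","id":"m2","parentId":"m1","mess', // cut off mid-write by a crash
].join("\n");

const recovered = recoverSession(savedLog);
console.log(recovered.leafId, recovered.corruptedLines);
```

Here the truncated third line is reported rather than thrown, and the session resumes from m1, the last entry known to be complete.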

Branching and fork

Branching is not an advanced feature; it is a natural need for an agent. A user might have the agent try approach A, decide they are not happy with it, and switch to approach B from a midpoint. If the log is a tree, all you have to do is point the leaf at some historical entry and append new messages. The old branch remains viewable.
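To make "point the leaf at some historical entry and append" concrete, here is a small in-memory sketch. The SessionTree class and its append, fork, and branch methods are hypothetical names for this example, and entries are reduced to an id, a parentId, and a text field.

```typescript
// Reduced entry shape for this sketch: just enough to form a tree.
type TreeEntry = { id: string; parentId: string | null; text: string };

class SessionTree {
  private entries = new Map<string, TreeEntry>();
  private leafId: string | null = null;

  // Append under the current leaf and make the new entry the leaf.
  append(id: string, text: string): void {
    this.entries.set(id, { id, parentId: this.leafId, text });
    this.leafId = id;
  }

  // Fork: move the leaf to an earlier entry. Nothing is deleted or rewritten.
  fork(fromId: string): void {
    if (!this.entries.has(fromId)) throw new Error(`unknown entry: ${fromId}`);
    this.leafId = fromId;
  }

  // Ids from the root down to the given leaf (defaults to the active leaf).
  branch(leafId: string | null = this.leafId): string[] {
    const ids: string[] = [];
    let id = leafId;
    while (id !== null) {
      ids.push(id);
      id = this.entries.get(id)?.parentId ?? null;
    }
    return ids.reverse();
  }
}

const tree = new SessionTree();
tree.append("m1", "Fix the tests");
tree.append("a1", "Try approach A");
tree.append("a2", "Approach A result");

tree.fork("m1"); // go back to the midpoint
tree.append("b1", "Try approach B");

console.log(tree.branch()); // active branch: m1 -> b1
console.log(tree.branch("a2")); // old branch is still reachable: m1 -> a1 -> a2
```

Because a fork only moves a pointer, the cost of branching is the same as the cost of appending, which is why the chapter treats it as a basic capability rather than an add-on.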

Note the relationship between branching and compaction. Compaction should not destroy the old log; it merely tells the context builder "from this point on, represent everything earlier with a summary." If compaction deleted history outright, you could never return to a pre-compaction branch, nor audit early tool calls.

Observing it in action

An excerpt from a session log might look like this:

{"type":"session","id":"s1","timestamp":"2026-01-01T10:00:00.000Z","cwd":"/repo","version":1}
{"type":"message","id":"m1","parentId":"s1","timestamp":"2026-01-01T10:00:01.000Z","message":{"role":"user","content":[{"type":"text","text":"Fix the tests"}]}}
{"type":"message","id":"m2","parentId":"m1","timestamp":"2026-01-01T10:00:03.000Z","message":{"role":"assistant","stopReason":"toolUse","content":[{"type":"toolCall","id":"t1","name":"bash","input":{"command":"npm test"}}],"model":"example","usage":{"inputTokens":500,"outputTokens":40}}}
{"type":"message","id":"m3","parentId":"m2","timestamp":"2026-01-01T10:00:06.000Z","message":{"role":"toolResult","toolCallId":"t1","toolName":"bash","isError":true,"content":[{"type":"text","text":"1 test failed"}]}}

These few lines are already enough to recover the user request, the model's action, the tool result, and the context for the next step.

Exercises

Implement JSONL session storage.

Acceptance criteria:

  • Each entry is appended as a new line; the existing file is never overwritten.
  • After a process restart, the current active branch can be recovered.
  • model_change affects the default model after recovery, but does not enter the LLM messages.
  • After forking from a historical entry, the old branch remains readable.
  • A compaction entry does not delete old entries; it only changes the context projection.