Replayable Trace Runs

Context

Traces are most valuable when they are not just “what happened,” but also “something you can replay.” In autonomous-kernel systems, runs often fail in ways that are hard to reproduce: nondeterministic tool outputs, shifting dependencies, or missing context about what the model saw.

Replayable trace runs extend the evaluation-and-traces discipline: a trace is treated as an executable record. Given a trace and a pinned environment, you can rerun the same sequence of steps (or a safe approximation) and verify that outcomes match.

Problem

How do you make debugging and drift analysis reliable when a run’s behavior depends on many moving parts (repo state, tools, prompts, budgets, and external systems)?

Without replay:

Incident review is guesswork.
Regression testing for the harness is weak because you cannot deterministically re-execute past failures.
“It worked yesterday” is not actionable.

Forces

Determinism vs. reality: real runs include nondeterministic elements; replay wants determinism.
Side effects: replaying a run must not re-trigger unsafe mutations.
Completeness vs. size: capturing everything makes traces huge; capturing too little makes replay impossible.
Version drift: tools and policies change; replay must record versions or provide compatibility shims.
Privacy: traces can contain sensitive outputs; replay storage must support redaction.

Solution

Treat a trace as a sequence of typed events with enough information to support a “replay mode.” In replay mode:

Tool calls are stubbed from recorded results (safe and deterministic).
Verification commands can be rerun in a pinned environment (where possible).
Outcomes are compared against recorded evidence (diff hashes, exit codes, normalized errors).

The diagram below distinguishes a live run (real tool router) from a replay (stubbed tool router); both share the same kernel logic.

flowchart LR
  K["Kernel loop"] --> TR["Trace writer"]
  K --> R1["Live tool router"]
  R1 --> SE["Side effects"]
  TR --> TS["Trace store"]
  TS --> RR["Replay runner"]
  RR --> K2["Kernel loop"]
  K2 --> R2["Stubbed tool router"]
  R2 --> CMP["Compare outputs<br/>hashes, exit codes, signatures"]
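The live/stubbed split can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical router interface with a single call(tool, args) method and trace events that carry "action" and "result" fields; StubbedToolRouter, ReplayMismatch, and toy_kernel are illustrative names, not part of any existing harness.

```python
from dataclasses import dataclass
from typing import Any


class ReplayMismatch(Exception):
    """Raised when the kernel requests a call the trace did not record."""


@dataclass
class StubbedToolRouter:
    """Serves recorded results in order instead of executing tools."""

    events: list[dict[str, Any]]
    cursor: int = 0

    def call(self, tool: str, args: dict[str, Any]) -> dict[str, Any]:
        if self.cursor >= len(self.events):
            raise ReplayMismatch(f"unrecorded call to {tool!r}")
        recorded = self.events[self.cursor]["action"]
        # Divergence in tool name or args means the kernel no longer
        # follows the recorded path, so fail loudly instead of guessing.
        if recorded["tool"] != tool or recorded["args"] != args:
            raise ReplayMismatch(
                f"step {self.cursor}: expected {recorded['tool']!r}, got {tool!r}"
            )
        result = self.events[self.cursor]["result"]
        self.cursor += 1
        return result


def toy_kernel(router) -> str:
    """Stand-in kernel loop (assumption): run tests once, then stop."""
    result = router.call("run_tests", {"command": "pytest -q"})
    return "verified" if result["exit_code"] == 0 else "blocked"


trace = [
    {
        "action": {"type": "tool", "tool": "run_tests", "args": {"command": "pytest -q"}},
        "result": {"exit_code": 1, "stderr_excerpt": "E AssertionError: ..."},
    }
]
stop_reason = toy_kernel(StubbedToolRouter(trace))
print(stop_reason)  # blocked
```

Because the kernel only sees the router interface, the same loop can run against a live router or this stub; no tool is executed during replay, so no side effects can occur.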

Implementation sketch

Trace event requirements:

event_id, step, timestamp
action: typed tool call or stop, including tool name and args
result: tool exit status, normalized error, and bounded outputs
side_effect_summary: changed files, created ids/urls, or “none”
environment: repo SHA, harness version, tool versions

Example event (conceptual):

{
  "step": 4,
  "action": {"type": "tool", "tool": "run_tests", "args": {"command": "pytest -q"}},
  "result": {"exit_code": 1, "stderr_excerpt": "E AssertionError: ..."},
  "normalized_error": {"kind": "validation", "signature_id": "sha256:..."},
  "environment": {"repo_sha": "...", "python": "3.12.2"}
}
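The requirement list can be checked mechanically before a trace is accepted for replay. The sketch below assumes events arrive as JSON objects; REQUIRED_FIELDS mirrors the requirement list in this section, and missing_fields is a hypothetical helper name. Run against the conceptual example, it reports the fields that the example leaves out.

```python
import json

# Top-level fields taken from the trace event requirements in this section.
REQUIRED_FIELDS = (
    "event_id", "step", "timestamp", "action",
    "result", "side_effect_summary", "environment",
)


def missing_fields(event: dict) -> list[str]:
    """Return the required fields that are absent from one trace event."""
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    action = event.get("action", {})
    # A tool action can only be stubbed if both its name and args were recorded.
    if action.get("type") == "tool":
        missing += [f"action.{key}" for key in ("tool", "args") if key not in action]
    return missing


raw = """
{
  "step": 4,
  "action": {"type": "tool", "tool": "run_tests", "args": {"command": "pytest -q"}},
  "result": {"exit_code": 1, "stderr_excerpt": "E AssertionError: ..."},
  "normalized_error": {"kind": "validation", "signature_id": "sha256:..."},
  "environment": {"repo_sha": "...", "python": "3.12.2"}
}
"""
event = json.loads(raw)
print(missing_fields(event))
# ['event_id', 'timestamp', 'side_effect_summary']
```

A check like this is one way to enforce a minimum replay contract at write time rather than discovering gaps during an incident.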

Replay modes (practical):

Strict replay (stub tools): tools return recorded outputs; the kernel must produce the same next actions and stop reason.
Hybrid replay (rerun verifiers): rerun verification commands in a pinned environment, but keep side-effectful tools stubbed.
Dry-run replay: do not apply patches; verify that the trace is internally consistent and the evidence bundle can be reconstructed.
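One way to encode the three modes is a small dispatch function that decides, per recorded tool call, whether to stub, rerun, or skip. This is a sketch under assumptions: the tool names in VERIFIER_TOOLS are hypothetical, and a real harness would more likely declare the verifier/side-effect classification per tool.

```python
from enum import Enum


class ReplayMode(Enum):
    STRICT = "strict"
    HYBRID = "hybrid"
    DRY_RUN = "dry_run"


# Hypothetical classification of tools that are safe to rerun.
VERIFIER_TOOLS = frozenset({"run_tests", "build_docs"})


def dispatch(mode: ReplayMode, tool: str) -> str:
    """Decide how a recorded tool call is handled in a given replay mode.

    Returns "stub" (serve the recorded result), "rerun" (execute in the
    pinned environment), or "skip" (check the record only, execute nothing).
    """
    if mode is ReplayMode.DRY_RUN:
        return "skip"
    if mode is ReplayMode.HYBRID and tool in VERIFIER_TOOLS:
        return "rerun"
    # Strict mode, plus every side-effectful tool in hybrid mode.
    return "stub"


for mode in ReplayMode:
    print(mode.value, dispatch(mode, "run_tests"), dispatch(mode, "apply_patch"))
# strict stub stub
# hybrid rerun stub
# dry_run skip skip
```

Defaulting to "stub" means an unclassified tool is never executed during replay, which matches the safe-by-default stance on side effects.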

Core comparisons that make replay useful:

Patch identity: compare diff hash (or file checksums).
Verification: compare exit codes and normalized failure signatures.
Control flow: compare step count and stop reason (or, across a batch of traces, the distribution of stop reasons).
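The comparisons above reduce to a field-by-field check between a recorded summary and a replayed one. The sketch below assumes a hypothetical RunSummary record and uses SHA-256 over the diff text as the patch identity; the field names are illustrative.

```python
import hashlib
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class RunSummary:
    diff_hash: str               # patch identity
    exit_code: int               # verification outcome
    signature_id: Optional[str]  # normalized failure signature, if any
    step_count: int              # control flow
    stop_reason: str             # control flow


def diff_hash(diff_text: str) -> str:
    """Hash the patch text so runs can be compared without storing the diff."""
    return "sha256:" + hashlib.sha256(diff_text.encode("utf-8")).hexdigest()


def compare(recorded: RunSummary, replayed: RunSummary) -> list[str]:
    """Return one human-readable line per field that differs."""
    mismatches = []
    for f in fields(RunSummary):
        a, b = getattr(recorded, f.name), getattr(replayed, f.name)
        if a != b:
            mismatches.append(f"{f.name}: recorded={a!r} replayed={b!r}")
    return mismatches


patch = "--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2\n"
recorded = RunSummary(diff_hash(patch), 0, None, 3, "verified")
replayed = RunSummary(diff_hash(patch), 0, None, 2, "verified")
print(compare(recorded, replayed))
# ['step_count: recorded=3 replayed=2']
```

Whether a step-count difference counts as a failure depends on the replay's purpose: in strict replay it signals divergence, while in a cross-version comparison it may be the improvement being measured.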

Concrete examples

Example 1: Incident reproduction for a flaky check

A nightly job started failing intermittently. The trace shows a timeout in mkdocs build.

Replay approach:

Strict replay stubs tool outputs to confirm the kernel behavior (it correctly classifies the error as timeout and stops “blocked” with repro steps).
Hybrid replay reruns the verification command (uv run mkdocs build) in a pinned environment to see whether the timeout reproduces.

Outcome: you can distinguish “harness misclassified” from “real tool flakiness” without re-triggering unrelated mutations.
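The hybrid step in this example can be sketched with the standard library alone. The helper name rerun_verifier and its parameters are assumptions, and environment pinning (container, lockfile) is taken to happen outside the function; the demo uses the current Python interpreter as a stand-in command.

```python
import subprocess
import sys


def rerun_verifier(command: list[str], timeout_s: float, attempts: int = 3) -> list[str]:
    """Rerun a verification command and classify each attempt.

    Each attempt is labeled "pass", "fail", or "timeout". A mix of labels
    across attempts points at tool flakiness rather than harness logic.
    """
    outcomes = []
    for _ in range(attempts):
        try:
            proc = subprocess.run(
                command, capture_output=True, text=True, timeout=timeout_s
            )
            outcomes.append("pass" if proc.returncode == 0 else "fail")
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before raising TimeoutExpired.
            outcomes.append("timeout")
    return outcomes


# Stand-in for the real verifier; inside the pinned environment this could be
# rerun_verifier(["uv", "run", "mkdocs", "build"], timeout_s=300.0).
print(rerun_verifier([sys.executable, "-c", "pass"], timeout_s=30.0))
```

If strict replay shows the kernel classifying the recorded timeout correctly and the reruns time out only intermittently, the evidence points at the tool rather than the harness.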

Example 2: Regression test for a past bugfix workflow

A bugfix run from last month succeeded after three steps. A new harness release claims to reduce steps.

Replay approach:

Run strict replay on the historical trace to ensure the harness still reaches “verified” with the same diff hash.
Compare metrics (steps, gate failures) across harness versions.

Outcome: the trace becomes a reusable unit test for harness changes.

Failure modes

Insufficient capture: the trace omits tool args or outputs; replay cannot stub deterministically.
- Mitigation: define a minimum replay contract per tool.
Unsafe replay: replay triggers real side effects (publishes, ticket creation).
- Mitigation: default replay to stub side-effect tools; require explicit opt-in to “live” mode.
Environment mismatch: repo/tool versions differ; hybrid replay diverges.
- Mitigation: record versions and provide a container/lockfile-based pinned environment.
Replay brittleness: minor log differences cause false mismatches.
- Mitigation: compare normalized outputs (hashes, signatures) rather than raw logs.
Privacy leakage: traces contain sensitive stdout/stderr.
- Mitigation: redaction and bounded capture, same as evidence bundles.
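The brittleness and privacy mitigations can share one normalization pass. The sketch below is illustrative only: the regular expressions, the 2000-character bound, and the names redact and signature_id are assumptions, and a real harness would tune the patterns to its own tools.

```python
import hashlib
import re

MAX_EXCERPT_CHARS = 2000  # assumed capture bound

# Volatile fragments that differ between otherwise identical failures.
_VOLATILE = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<TIMESTAMP>"),
    (re.compile(r"0x[0-9a-fA-F]+"), "<ADDR>"),
    (re.compile(r"/tmp/[^\s:]+"), "<TMPPATH>"),
    (re.compile(r"\b\d+(?:\.\d+)?s\b"), "<DURATION>"),
]
# Simple key=value secrets; real redaction rules would be broader.
_SECRETS = [
    (re.compile(r"(?i)\b(token|api[_-]?key|password)=\S+"), r"\1=<REDACTED>"),
]


def redact(text: str) -> str:
    """Mask obvious secrets, then truncate to the capture bound."""
    for pattern, replacement in _SECRETS:
        text = pattern.sub(replacement, text)
    return text[:MAX_EXCERPT_CHARS]


def signature_id(text: str) -> str:
    """Hash the log after replacing volatile fragments and collapsing whitespace."""
    for pattern, replacement in _VOLATILE:
        text = pattern.sub(replacement, text)
    normalized = " ".join(text.split())
    return "sha256:" + hashlib.sha256(normalized.encode("utf-8")).hexdigest()


run_a = "2024-05-01T02:00:03Z build timed out after 300.2s in /tmp/build-a1b2"
run_b = "2024-05-02T02:00:07Z build timed out after 301.9s in /tmp/build-c3d4"
print(signature_id(run_a) == signature_id(run_b))  # True
print(redact("login failed: token=abc123 user=ci"))
# login failed: token=<REDACTED> user=ci
```

Comparing signature_id values instead of raw logs keeps replay from failing on timestamps or temporary paths, while redact limits what the trace store retains.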

When not to use

Tasks dominated by external, non-stubbable systems (payments, production deploys) where safe replay is not feasible.
Extremely lightweight workflows where trace capture cost exceeds the value.
Teams without artifact retention discipline; replay depends on keeping traces and bundles available.