Evidence Bundle Standard (One Artifact, Always)

Context

In an autonomous-kernel workflow, “done” is not a narrative claim. It is a claim backed by evidence: patches applied, checks run, and outputs captured. In practice, evidence tends to fragment across terminal scrollback, CI logs, local artifacts, and ad-hoc notes.

This pattern standardizes a single, always-produced artifact that captures the minimum evidence needed to audit, reproduce, and compare runs. It is designed to align with evaluation/traces: every run emits a bundle, even when it fails.

Problem

How do you make every run auditable and replay-friendly without teaching every engineer (and every agent) a different “where did you put the logs?” ritual?

When evidence is scattered:

You cannot reliably answer: what changed, what checks ran, and what passed.
You cannot compare runs over time (drift) because the record format varies.
You cannot debug failures quickly because the first step is reconstructing what happened.

Forces

One artifact vs. many artifacts: a single bundle improves discoverability, but risks becoming large and noisy.
Completeness vs. cost: capturing more data improves debugging, but increases runtime, storage, and redaction effort.
Local runs vs. CI runs: local evidence is easiest to collect, but CI evidence is often the source of truth for merge gates.
Privacy and secrets: logs frequently contain tokens, paths, or customer data; bundles must be redactable.
Standardization vs. flexibility: a rigid schema improves tooling, but must allow per-repo variations in checks.

Solution

Define an “evidence bundle” as a single directory (or archive) that always contains:

A small manifest (machine-readable) describing the run.
A patch/diff representation of changes (or a pointer to the diff).
The verification record: commands, exit codes, and captured outputs.
A minimal environment snapshot: versions and identifiers needed for replay.

Prefer a rule that is easy to enforce: a run cannot stop “success” unless it has produced an evidence bundle id. Also produce a bundle on failure/blocked outcomes, so incident review doesn’t depend on intent.

A diagram helps because the core idea is consolidation: multiple evidence streams become one bundle with a manifest.

flowchart LR D["Diff / patch"] --> B["Evidence bundle"] T["Tool calls"] --> B V["Verification logs"] --> B E["Env snapshot"] --> B B --> M["Manifest (bundle.json)"]

Implementation sketch

Minimum required files (conceptual):

bundle.json (manifest)
diff.patch (or changed_files.json + checksums)
checks.jsonl (one record per check)
stdout/ and stderr/ captures (bounded)

Recommended storage layout:

artifacts/runs/<run_id>/bundle/...

Minimal manifest schema (example):

{
  "run_id": "2026-02-23T19-55-12Z_8f3c",
  "repo": {
    "root": "/path/to/repo",
    "commit": "<git sha>",
    "dirty": true
  },
  "kernel": {
    "version": "0.1",
    "budgets": {"max_steps": 20, "max_minutes": 5}
  },
  "outcome": "verified",
  "changed_files": [
    {"path": "book/patterns/x.md", "sha256_after": "..."}
  ],
  "verification": {
    "required_checks": ["markdownlint", "mkdocs_build"],
    "status": "pass"
  },
  "redaction": {
    "policy": "default",
    "notes": "stdout/stderr truncated to 64KB per file"
  }
}

Verification record format (append-only JSONL):

{"check":"markdownlint","command":"npx markdownlint-cli2 --fix --config .markdownlint-cli2.jsonc book/","exit_code":0,"duration_ms":1823}
{"check":"mkdocs_build","command":"uv run mkdocs build","exit_code":0,"duration_ms":9145}

Operational rules that make the standard enforceable:

Always emit: bundle creation is not conditional on success.
Bounded capture: cap stdout/stderr per command; store full logs only when explicitly enabled.
Stable naming: run_id must be unique and sortable; include timestamp + short random.
Redaction pass: run a configurable redact step (regex-based at minimum) before persisting bundles.
Pointer-friendly: if a diff is huge, store a pointer to a git range or artifact URL, but keep checksums in the manifest.

Concrete examples

Example 1: Updating a pattern page with formatting gates

A documentation agent adds a new section to a pattern page and must prove formatting and build integrity.

Evidence bundle contents:

diff.patch: shows edits to one markdown file.
checks.jsonl:
- npx markdownlint-cli2 --fix --config .markdownlint-cli2.jsonc book/ exit 0
- uv run mkdocs build exit 0
bundle.json: records outcome: verified and file checksum(s).

Result: review is trivial because the bundle contains the patch and the exact two acceptance commands with exit codes.

Example 2: Bugfix with a failing-then-passing test

A code change fixes a crash. The first test run fails; after a patch, tests pass.

Evidence bundle contents:

checks.jsonl includes two entries for the same check:
- pytest -q exit 1 with truncated stderr excerpt showing the failing assertion.
- pytest -q exit 0.
stdout/pytest_1.txt and stderr/pytest_1.txt capture the failure signature.
diff.patch contains both the code fix and the updated test.

Result: later investigators can see the precise failure signature, the repair, and the passing verification in one place.

Failure modes

Bundle inflation: bundles grow until they are unusable.
- Mitigation: strict size budgets; store only summaries by default and keep pointers to full logs.
“Manifest only” bundles: teams create bundles that omit diffs or check outputs, defeating the purpose.
- Mitigation: enforce a minimum bundle contract (diff/pointer + check records + exit codes).
Secret leakage: stdout/stderr captures contain tokens.
- Mitigation: redaction pass + allowlist what is captured + configurable “no-log” checks.
Non-comparable runs: fields differ across runs, preventing drift analysis.
- Mitigation: keep a stable schema and version it (schema_version) so tooling can evolve safely.
Evidence without provenance: bundle lacks repo SHA, tool versions, or command lines.
- Mitigation: require a minimal environment snapshot and record exact commands.

When not to use

Extremely low-risk, short-lived drafts where no verification is expected and storage overhead dominates.
High-sensitivity domains where storing logs is prohibited and you cannot reliably redact.
Environments without a stable artifact store; if bundles are routinely lost, a “standard” becomes a false promise.