Failure-Signature Index

Context

Autonomous kernels generate traces: tool calls, errors, and verification outcomes. Over time, most operational pain comes from a small number of recurring failures (missing dependencies, policy denials, flaky checks, environment drift), but these failures are usually recorded as raw logs.

A failure-signature index converts raw failures into a stable, searchable layer: a normalized signature paired with triage guidance, known causes, and the smallest next action. This pattern complements the evaluation-and-traces chapter by turning “logs” into a durable debugging interface.

Problem

How do you make failures diagnosable and attributable across many runs without relying on tribal knowledge and ad-hoc grep?

Without an index:

The same failure is re-triaged repeatedly.
Root causes are misattributed (model error vs tool error vs harness policy vs flaky test).
Teams cannot measure whether the system is improving because failure categories are unstable.

Forces

Stability vs. specificity: signatures must be stable across minor log changes, but specific enough to be actionable.
Privacy vs. usefulness: raw stderr can contain secrets or customer data; normalization must avoid persisting sensitive text.
Coverage vs. effort: indexing every failure is expensive; focusing on “top N” recurring failures yields better ROI.
Collisions: over-normalization can cause unrelated failures to map to one signature.
Evolution: signatures and fixes change as code and tooling change; the index must be versioned.

Solution

Normalize each failure into a compact signature and maintain an index entry that answers:

What category is this? (validation, timeout, policy, runtime)
What layer likely caused it? (model, tool, harness, eval)
How do you reproduce it?
What is the smallest next action?
What evidence should the run include next time?

A diagram helps because the pattern has two distinct flows: recording from traces and lookup during triage.

flowchart TB R["Run fails"] --> N["Normalize error"] N --> S["Signature"] S --> I["Index lookup"] I -->|hit| G["Guidance + next action"] I -->|miss| C["Create new entry"] C --> U["Update index"] U --> I

Implementation sketch

Signature computation should be deterministic and bounded. A practical approach is to store a short normalized excerpt plus a hash.

Normalization heuristics (example):

Strip timestamps, absolute paths, ANSI color codes.
Keep only the first ~20 lines of stderr.
Extract a “headline” (exception type + key message).
Optionally include top stack frame file + line (if stable).

Signature payload (conceptual):

kind=runtime
layer=tool
headline=ModuleNotFoundError: No module named 'mkdocs'
frame=python:-m mkdocs

Then compute signature_id = sha256(payload).

Index storage format (YAML example):

schema_version: 1
signatures:
  - signature_id: "sha256:..."
    headline: "ModuleNotFoundError: No module named 'mkdocs'"
    kind: "runtime"
    layer: "tool"
    severity: "blocking"
    repro:
      command: "uv run mkdocs build"
      notes: "Ensure .venv exists and dependencies installed"
    next_action: "Run `uv sync` then retry"
    evidence_to_capture:
      - "python version"
      - "uv lock state"
    owners: ["tooling"]
    last_seen_after: "2026-02-23"

Integration points:

Trace writer: on failure, emit normalized_error fields (kind/layer/headline/signature_id).
Verifier: attach signature_id to failing checks.
Triage tooling: provide lookup(signature_id) that prints the index entry.
Index maintenance: update entries when root causes change; keep supersedes links when signatures evolve.

A minimal governance rule keeps the index reliable:

New entries must include repro.command and next_action.
Entries must be reviewable (PR) and versioned.
Retire stale entries by date (last_seen_after) or when superseded.

Concrete examples

Example 1: Policy denial on protected paths

A run attempts to edit a protected configuration file and the tool router rejects it.

Normalized payload:

kind: policy
layer: harness
headline: PolicyDenied: protected path requires approval

Index entry guidance:

Repro: rerun the same patch attempt.
Next action: request an approval artifact (or move edit to an allowed file).
Evidence to capture: the exact file path(s) blocked and the policy name.

Outcome: the agent stops “blocked” with the smallest next action instead of attempting risky workarounds.

Example 2: Repeated markdownlint failure signature

A documentation run fails consistently with a formatting rule.

Normalized payload:

kind: validation
layer: eval
headline: MD013/line-length: Line length exceeds 120

Index entry guidance:

Repro: npx markdownlint-cli2 --config .markdownlint-cli2.jsonc book/.
Next action: apply --fix or reflow the specific paragraph; avoid adding long unbroken URLs.
Evidence to capture: the file path and rule id.

Outcome: repeated formatting failures become easy to fix and easy to measure.

Failure modes

Over-normalization collisions: unrelated errors map to the same signature.
- Mitigation: include one stable frame or rule id; keep an escape hatch for “split signature.”
Under-normalization churn: small log changes create new signatures.
- Mitigation: strip paths/timestamps and hash only normalized payload.
Sensitive data leakage: headlines include secrets.
- Mitigation: redaction before indexing; store hashes and controlled excerpts.
Stale guidance: the “next action” no longer works after tooling changes.
- Mitigation: require last_verified_after or periodic re-verification via canary tasks.
Index becomes a dumping ground: too many low-value entries.
- Mitigation: focus on high-frequency failures and “blocking” severities; prune aggressively.

When not to use

Very small projects with low run volume where ad-hoc triage is cheaper.
Systems where logs cannot be stored even in normalized form.
Environments where failures are dominated by unique, one-off issues (the index won’t amortize).