
Chapter 01 — Paradigm Shift

Thesis

AI-first software engineering is an architectural inversion. Machine reasoning becomes a primary execution substrate. The harness—tools, constraints, evaluation, and traceability—becomes the primary design surface.

This inversion is practical: reliability comes from constraints, evaluations, and traces that turn generated changes into a repeatable loop.

Model capability is what the model can do given fixed tools and gates. Harness capability is what the system can reliably produce given a fixed model.

A concrete, testable implication (holding the model constant):

  1. A stronger harness should reduce iterations-to-pass.
  2. A stronger harness should reduce time-to-green.
  3. A stronger harness should increase attribution rate (more failures have a primary cause you can act on).

Operational definition:

  • Model capability changes when you swap models while holding tools, constraints, and evaluation constant.
  • Harness capability changes when you keep the model constant but alter tools, policies, evaluation gates, or trace capture.

Rule of thumb: if you cannot hold one layer constant while varying the other, you are not measuring “capability”—you are measuring an entangled system.
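The rule of thumb can be made executable as a tiny run matrix: vary exactly one layer per comparison. A minimal sketch (the run data, labels, and helper are hypothetical, not from any specific tool):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Run:
    model: str            # model identifier
    harness: str          # tools + gates + trace config, collapsed to one label
    iterations_to_pass: int

def capability_delta(runs, fixed_layer, fixed_value):
    """Compare iterations-to-pass while holding one layer constant.

    Returns {varied_value: iterations_to_pass}, so any difference can be
    attributed to the varied layer alone.
    """
    varied = "harness" if fixed_layer == "model" else "model"
    return {
        getattr(r, varied): r.iterations_to_pass
        for r in runs
        if getattr(r, fixed_layer) == fixed_value
    }

runs = [
    Run("model-a", "baseline-harness", 6),
    Run("model-a", "strict-harness", 3),    # same model, stronger harness
    Run("model-b", "baseline-harness", 5),  # same harness, different model
]

# Harness capability: model held constant, harness varied.
harness_effect = capability_delta(runs, "model", "model-a")
# Model capability: harness held constant, model varied.
model_effect = capability_delta(runs, "harness", "baseline-harness")
```

If you cannot produce one of these two dictionaries from your logs, you are measuring the entangled system, not a layer.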

In this framing, attribution rate is a harness outcome. It depends on what evidence you capture and which evaluation gates you run. It is not just a function of model fluency.

This chapter’s claim is a hypothesis: some observed “capability” gains in practice are attributable to harness engineering rather than model changes.

Why This Matters

  • Without a clear boundary between model capability and harness capability, teams misattribute failures and waste effort.
  • Reliability depends on reproducible loops (plan → act → verify) rather than isolated prompts.
  • Production constraints (auditability, security, cost, regression control) require system design, not “prompting.”

System Breakdown

  • Actors: human governor, agent loop, tools/runtime, evaluation/CI.
  • Artifacts: specs, plans, diffs, traces, eval results, decision records.
  • Invariants (hypotheses to test):
    • Every non-trivial change is traceable to a plan and verified by checks.
    • The system can attribute regressions to a layer (prompt, tool, code, eval).
    • Autonomy is gated by evaluations and budgets.

A diagram makes the evidence path explicit; focus on where the trace is recorded, and where it is later used to choose the next plan.

    flowchart TB
      P["Plan<br/>(spec + intent)"]
      A["Act<br/>(propose patch)"]
      T["Tools<br/>(apply + run)"]
      V["Verify<br/>(evals/CI)"]
      D{"Checks pass?"}
      S["Stop<br/>(ship/merge)"]
      X["Attribute<br/>(root cause)"]
      TR[("Trace (record)")]
      P --> A
      A --> T
      T --> V
      V --> D
      D -- yes --> S
      D -- no --> X
      X --> P
      A -. record .-> TR
      T -. record .-> TR
      V -. record .-> TR
      X -. use .-> TR

Legend:

  • Solid arrows are the operational loop (plan → act → verify).
  • Dashed arrows are trace capture and trace usage.

Takeaway: trace is recorded during act/tools/verify, then used during attribution to decide the next plan. Without a trace, attribution is guesswork. You will not know whether to clarify the spec, fix a tool issue, change product code, or correct an evaluation.

Pseudo-code goal: make the ordering and evidence capture explicit, especially where decisions depend on recorded outputs.

Pseudo-code example (loop + trace + attribution):

    // Goal: run plan→act→verify with trace, then pick the next plan based on attribution
    Spec -> "docs/specs/parse-date.md"
    Gates -> ["tests", "lint", "typecheck"]
    Trace -> []

    Loop MaxIterations Times
      Plan(Spec)
      Patch -> Act(propose_change)
      Results -> Run(Gates)
      Trace -> Record(Trace, {Patch, Results})

      If AllPass(Results) Then
        Stop(ship)
      Else
        Cause -> Attribute(Results, Trace)
        Plan(update_plan_using=Cause)

How the pseudo-code maps to the system:

  • Plan(Spec) matches the spec/intent artifact that anchors what “correct” means.
  • Run(Gates) is the evaluation surface (tests/CI) that turns claims into evidence.
  • Record(...) is the harness behavior that makes failures attributable, not just observable.
  • Attribute(...) uses the checklist buckets (spec/prompt, tool/runtime, code, eval/CI) to choose the next plan instead of making an unstructured guess.
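The pseudo-code above can be made runnable. A minimal sketch (the callables `plan`, `act`, `run_gates`, and `attribute` are caller-supplied placeholders; the loop only fixes ordering and what gets recorded):

```python
def agent_loop(plan, act, run_gates, attribute, max_iterations=5):
    """Plan -> act -> verify loop with trace capture and attribution."""
    trace = []
    spec = plan(None)  # initial plan from the spec; no prior cause yet
    for _ in range(max_iterations):
        patch = act(spec)
        results = run_gates(patch)                           # gate name -> passed?
        trace.append({"patch": patch, "results": results})   # record evidence
        if all(results.values()):
            return {"status": "ship", "trace": trace}
        cause = attribute(results, trace)  # decision uses recorded evidence
        spec = plan(cause)                 # next plan depends on the cause
    return {"status": "budget_exhausted", "trace": trace}

# Hypothetical stand-ins to show the control flow: the gate passes on the
# second proposed patch.
attempts = {"n": 0}
def fake_act(spec):
    attempts["n"] += 1
    return f"patch-{attempts['n']}"
def fake_gates(patch):
    return {"tests": patch == "patch-2", "lint": True}
def fake_attribute(results, trace):
    return "code"

out = agent_loop(lambda cause: "spec", fake_act, fake_gates, fake_attribute)
```

Note that the budget (`max_iterations`) is part of the loop signature, not an afterthought: autonomy is gated by construction.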

When to prefer a code-like example here (use cases):

  • Complex workflows with branching retries (e.g., flaky CI, permissioned tools) where the next action depends on which gate failed.
  • Multi-stage delivery loops (plan → implement → validate → approve) where you must prove ordering and traceability.
  • Debugging playbooks where you need a repeatable decision procedure, not just narrative guidance.
Measurable signals (to separate model vs harness effects):

  • Iterations-to-pass: the number of propose→verify cycles until all required checks pass.
  • Time-to-green: the wall-clock time from the first attempt to passing all evaluation gates.
  • Attribution rate: the fraction of failures with a clear primary cause you can act on, bucketed as spec/prompt, tool/runtime, code, or eval/CI.
Attribution checklist (what evidence makes a failure “belong” to a layer):

  • Spec/prompt:
    • The requirement is ambiguous, contradictory, or incomplete.
    • Two reasonable interpretations produce different expected outputs.
    • Clarifying text changes the expected outcome without any code changes.
  • Tool/runtime:
    • Tool errors, timeouts, missing permissions, or a flaky environment.
    • Reruns on the same commit produce different outcomes.
    • The failure depends on machine state (filesystem, network, credentials, resources).
  • Code:
    • Deterministic failing tests or typechecks tied to a specific diff.
    • Reverting the diff restores the previous behavior.
    • The failure reproduces across environments given the same inputs.
  • Eval/CI:
    • The asserted behavior does not match the intended behavior.
    • The check is incorrect, overly strict, or missing a required case.
    • Fixing the test changes outcomes without changing product behavior.
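The three signals fall directly out of the per-iteration trace. A minimal sketch (the record schema — `passed`, `started`, `finished`, `cause` — is a hypothetical choice, not a standard):

```python
VALID_BUCKETS = {"spec/prompt", "tool/runtime", "code", "eval/CI"}

def loop_metrics(trace):
    """Compute iterations-to-pass, time-to-green, and attribution rate
    from a list of per-iteration records."""
    iterations_to_pass = next(
        (i + 1 for i, r in enumerate(trace) if r["passed"]), None
    )
    time_to_green = (
        trace[iterations_to_pass - 1]["finished"] - trace[0]["started"]
        if iterations_to_pass else None
    )
    failures = [r for r in trace if not r["passed"]]
    attributed = [r for r in failures if r.get("cause") in VALID_BUCKETS]
    attribution_rate = len(attributed) / len(failures) if failures else 1.0
    return {
        "iterations_to_pass": iterations_to_pass,
        "time_to_green": time_to_green,
        "attribution_rate": attribution_rate,
    }

# Two iterations: one attributed failure, then a pass.
sample = [
    {"passed": False, "started": 0.0, "finished": 5.0, "cause": "code"},
    {"passed": True, "started": 5.0, "finished": 9.0, "cause": None},
]
metrics = loop_metrics(sample)
```

Because `cause` must land in one of the four checklist buckets to count, attribution rate is measurably a harness outcome: it drops to zero the moment evidence capture stops.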

Concrete Example 1

Refactor a small library function using an agent loop.

  • Inputs: failing unit test + desired behavior specification (e.g., a short “Given/When/Then” note checked into the repo).
  • Loop: propose patch → run tests → inspect diff → record trace (commands + outputs) → stop on pass.

Minimal trace record (copyable). Keep it short so you can paste it into an issue or PR. Record only what you need to reproduce the failure and explain the next step.

Inputs:

  • Spec note path: docs/specs/parse-date.md (example)
  • Failing test: tests/test_parse_date.py::test_rejects_empty

Run evidence:

  • Commands (in order): pytest -q, then ruff check . (example)
  • Diff identifier: commit SHA or patch ID (e.g., abc1234)
  • Evaluation outputs: failing test names, exit codes, and the first failing assertion (or a minimal log excerpt)

Attribution:

  • Attribution decision: one of {spec/prompt, tool/runtime, code, eval/CI}
  • Evidence: 1–2 sentences tied to the outputs above

Example attribution:

  • code — the same test fails deterministically after the diff; reverting the diff restores pass.
Measured outputs:

  • Iterations-to-pass.
  • Time-to-green.
  • Diff size (files touched, lines changed).
  • Locality: changes stay within the intended function/surface area.
  • Attribution per iteration using the checklist above (recorded in the trace).

Stop rule:

  • Stop when the original failing test passes and the full unit test suite passes.
  • Also require the diff to stay constrained to the intended surface area.
  • If you hit a fixed budget (N iterations or T minutes), stop and hand off:
    • the trace record
    • the smallest reproducible failing case
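The trace record and stop rule above can be expressed as data plus one decision function. A minimal sketch (field names, paths, and the SHA are the illustrative placeholders from the tables above; `should_stop` and its parameters are hypothetical):

```python
# Trace record mirroring the fields above.
trace_record = {
    "spec_note": "docs/specs/parse-date.md",
    "failing_test": "tests/test_parse_date.py::test_rejects_empty",
    "commands": ["pytest -q", "ruff check ."],
    "diff_id": "abc1234",
    "outputs": {"failing_tests": [], "exit_codes": {"pytest -q": 0}},
    "attribution": {
        "bucket": "code",
        "evidence": "same test fails deterministically after the diff; "
                    "reverting the diff restores pass",
    },
}

def should_stop(record, suite_passed, files_touched, allowed_paths,
                iteration, max_iterations):
    """Stop rule: ship when the target test and full suite are green and
    the diff stayed in scope; hand off when the budget is exhausted."""
    in_scope = all(
        any(f.startswith(p) for p in allowed_paths) for f in files_touched
    )
    if suite_passed and not record["outputs"]["failing_tests"] and in_scope:
        return "ship"
    if iteration >= max_iterations:
        return "hand_off"   # escalate with the trace record + minimal repro
    return "continue"
```

Encoding the stop rule this way makes the budget and the locality constraint checkable facts rather than reviewer judgment calls.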

Concrete Example 2

Ship a minor API change in a production service.

  • Inputs: API contract + backward-compat constraints + staging environment + a defined rollout/rollback policy.
  • Loop: generate migration plan → implement → run contract tests → produce trace report (diff + commands + results) → human approve.

Minimal trace report (copyable). Keep the structure stable so diffs across iterations are easy to scan. Prefer short bullet points for outputs; link to longer logs when needed.

Contract + constraints:

  • Contract/version: openapi.yaml (example)
  • Compatibility window: “compatible within v1.x”
  • Backward-compat constraints: “no required fields added”; “no behavior change on existing endpoints”
  • Staging target: staging-us-east-1 (example)

Evidence + outputs:

  • Commands (in order): make contract-test, npm run lint, npm run typecheck, ./scripts/staging-smoke.sh
  • Diff identifier: PR number + commit SHA (e.g., PR #482, def5678)
  • Evaluation outputs: failing checks; log paths/links; timestamps (supports time-to-green)
  • Attribution decisions: per failure, {spec/prompt, tool/runtime, code, eval/CI} + 1–2 sentences of evidence

Example attribution:

  • eval/CI — contract test rejects an allowed optional field; fixing the assertion changes outcomes without changing API behavior.
Measured outputs:

  • Iterations-to-pass and time-to-green (from first migration-plan draft to all required checks passing in staging).
  • Attribution rate per iteration using the checklist above (spec/prompt vs tool/runtime vs code vs eval/CI).
  • Backward-compat outcomes:
    • contract-test failures introduced (required gate: 0 new failures in required checks)
    • rollback verification in staging (required gate: exercise rollback successfully at least once)
    • time-to-green (default target: ≤ 30 minutes from first implementation attempt to all required checks passing in staging; set per service)

Guardrails:

  • Protected paths or modules that require explicit human review before edits (e.g., auth, billing, infra).
  • Required checks (contract tests, integration tests, lint/typecheck, and a staging smoke test).
  • Rollback plan defined up front (feature flag, config switch, or revert procedure) and verified in staging.
  • Approval gate: no deploy until a human reviews the migration plan, the diff, and the evaluation results.
  • Mapping: guardrails define what must be protected, required checks define what must be proven, and the approval gate defines who must accept the evidence before deploy.

Stop rule:

  • Stop when all required checks pass in staging and the migration plan matches backward-compat constraints.
  • Require that the trace report can explain every material change.
  • If you hit a fixed budget (N iterations or T minutes), pause the rollout and escalate with:
    • the trace report
    • the smallest reproducible failing case
  • If any guardrail is violated (protected file touched, required test skipped, rollback unclear), stop immediately and require human intervention.
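Guardrails are only enforceable if each iteration is checked mechanically before the approval gate. A minimal sketch (the path prefixes, check names, and function are illustrative assumptions, not a real CI API):

```python
PROTECTED_PATHS = ("auth/", "billing/", "infra/")  # example protected modules
REQUIRED_CHECKS = {"contract-test", "lint", "typecheck", "staging-smoke"}

def guardrail_violations(files_touched, checks_run, rollback_verified):
    """Return all guardrail violations for one iteration; any non-empty
    result means stop immediately and require human intervention."""
    violations = []
    for f in files_touched:
        if f.startswith(PROTECTED_PATHS):  # protected paths need human review
            violations.append(f"protected path touched: {f}")
    missing = REQUIRED_CHECKS - set(checks_run)
    if missing:
        violations.append(f"required checks skipped: {sorted(missing)}")
    if not rollback_verified:
        violations.append("rollback not verified in staging")
    return violations
```

Running this per iteration, and attaching its output to the trace report, gives the human approver a yes/no evidence summary instead of a diff to re-audit from scratch.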

Trade-offs

  • Strong harness constraints reduce freedom (and sometimes speed) but increase reproducibility.
  • More evaluation gates reduce regressions but add compute and latency.
  • Trace-heavy workflows improve debugging but increase storage and privacy considerations.

Failure Modes

  • Illusion of capability: improvements credited to the model when they come from better tooling/evals.
  • Unbounded autonomy: loops run without budgets, causing tool thrash and unclear outcomes.
  • Non-attributable failures: missing traces make regressions un-debuggable.

Use the attribution checklist above to bucket each failure before you change the system. If you skip bucketing, you will often “fix” the wrong layer.

Quick next-step mapping (use the checklist buckets):

  • If failures change on rerun or depend on machine state, treat as tool/runtime and capture the missing trace before changing code.
  • If failures are deterministic and tied to a diff, treat as code and constrain the patch surface before iterating.
  • If a check rejects intended behavior, treat as eval/CI and fix the assertion (with evidence) before touching product behavior.

Synthesis: treat machine reasoning as an execution substrate, and treat the harness as the primary lever for reliability. Track the loop metrics above to separate harness effects from model effects and to make failures actionable.

Research Directions

  • Metrics that separate model improvements from harness improvements (including confidence bounds and cost).
  • Minimal trace schema that supports attribution and replay without over-collecting sensitive data.
  • Formal definitions of autonomy envelopes and stop conditions that can be enforced as gates.