Chapter 07 — Production AI Infrastructure

Thesis

Production AI-first systems are distributed systems: they require orchestration, isolation, observability, caching, cost control, and reproducible environments.

Hypothesis: operational reliability depends more on the tool/runtime plane than on the model prompt. If you can reliably run tools and record what happened, you can reproduce runs. That lets you improve outcomes even when model behavior varies.

Why This Matters

  • Without isolation, tool execution becomes a security and reliability risk.
  • Without observability, failures cannot be attributed or fixed systematically.
  • Without cost controls, autonomy can become economically unstable.
  • Operational signals include:
    • tool-failure rate
    • replay success rate
    • mean tool latency
    • retry rate
    • spend per successful task

Example targets and alerts (illustrative, not mandates). Use them to seed dashboards and error budgets.

| Metric | Signal to watch | Illustrative alert |
| --- | --- | --- |
| Tool-failure rate | Tool exits non-zero or returns invalid output | Alert if >2% over 1 hour for a repo; or if one tool burns its daily error budget |
| Replay success rate | “Green” runs fail to replay from recorded inputs | Alert if <95% on a weekly replay audit sample |
| Mean tool latency | Step duration inflation for stable workloads | Alert if p95 step duration doubles week-over-week |
| Retry rate | Rising retries indicate flakiness or degraded infra | Alert if retries exceed 1.2× baseline for two consecutive days |
| Spend per successful task | Cost-to-merge and wasted tokens trend upward | Alert if median cost-to-merge exceeds a cap; or wasted spend exceeds a daily share |
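These thresholds can be encoded directly as alert predicates; a minimal sketch using the illustrative values from the table:

```python
def tool_failure_alert(failures: int, total: int, threshold: float = 0.02) -> bool:
    """Alert when the tool-failure rate over the window exceeds the threshold (>2%)."""
    if total == 0:
        return False
    return failures / total > threshold

def replay_audit_alert(replayed_ok: int, sampled: int, floor: float = 0.95) -> bool:
    """Alert when the weekly replay audit success rate drops below the floor (<95%)."""
    return sampled > 0 and replayed_ok / sampled < floor

def latency_alert(p95_now: float, p95_last_week: float) -> bool:
    """Alert when p95 step duration doubles week-over-week."""
    return p95_now > 2 * p95_last_week
```

Wiring these into a dashboard is then a matter of feeding windowed counts from the observability plane.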

System Breakdown

A diagram helps here because the tool/runtime plane is not a single service; it is a set of coupled components. Focus on the contracts between the boxes: each box should emit stable, versioned signals for replay and debugging.

The point of the diagram is the boundaries. In practice, reliability tends to hinge on three contracts:

  • run id propagation across every step
  • versioned, structured tool outputs
  • a replay bundle that is complete enough to reproduce failures

Legend: solid arrows show work/data flow between planes; dotted arrows show governance or metadata (policy, tagging, and run id storage).

flowchart LR
  M[Model prompt] --> O[Orchestration<br/>Queues • Concurrency • Retries • Idempotency]
  O --> E[Execution<br/>Sandbox • Pinned deps • Timeouts]
  E --> T[Tool services<br/>Tests • Build • Browser • Repo API]
  T --> OB[Observability<br/>Traces • Metrics • Logs • Correlation IDs]
  T --> A[Artifacts<br/>Diffs • Reports • Replay bundle]
  E --> S[Security<br/>Allowlists • Least privilege • Secret injection]
  OB --> A
  S -. governs .-> E
  S -. governs .-> T
  O -. tags .-> OB
  O -. stores run id .-> A

Takeaway: reliability comes from strict contracts at each boundary. Record the environment and tool versions, constrain execution, and connect every step to a run id. Then you can replay, attribute failures, and control spend.

Minimum replay bundle (per run id):

  • run id (and attempt number)
  • environment identity (image hash + lockfile)
  • tool name + tool version + output schema version
  • commands and working directory per tool call
  • structured outcomes + pointers to logs and artifacts
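Such a bundle can be sketched as a small Python structure; the field names are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class ToolCall:
    tool: str              # tool name, e.g. a test runner
    tool_version: str
    schema_version: str    # version of the structured-output schema
    command: list          # argv for the tool call
    workdir: str
    exit_code: int
    log_ref: str           # pointer to the full log artifact

@dataclass
class ReplayBundle:
    run_id: str
    attempt: int
    image_hash: str        # environment identity
    lockfile_hash: str
    tool_calls: list = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the bundle for storage alongside the run's artifacts."""
        return json.dumps(asdict(self), indent=2)

# Hypothetical run: one green test step recorded for later replay.
bundle = ReplayBundle(run_id="run-001", attempt=1,
                      image_hash="sha256:aaaa", lockfile_hash="sha256:bbbb")
bundle.tool_calls.append(ToolCall(
    tool="pytest", tool_version="8.1.0", schema_version="1",
    command=["pytest", "-q"], workdir="/repo",
    exit_code=0, log_ref="artifacts/run-001/pytest.log"))
```

Anything missing from this record is a reproducibility gap, which is why the later "artifact bundle correctness" direction treats missing fields as build-breaking.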

  • Execution: sandboxes/containers, dependency pinning, deterministic runners. Contract: identical inputs produce the same tool environment (image hash + lockfile), with a hard wall-clock timeout per step.
    • Checklist: image hash, lockfile hash, timeout, runner version.
  • Tool services: test runners, build systems, browsers, repo APIs. Contract: every tool call is versioned and returns structured output (exit code, stdout/stderr, and a machine-readable summary).
    • Checklist: tool version, schema version, exit code, error class.
  • Orchestration: queues, concurrency limits, backpressure. Contract: max concurrency is enforced (per repo/org), retries are bounded (count + backoff), and each task carries an idempotency key.
    • Checklist: run id propagation, idempotency key, retry policy, concurrency cap.
  • Observability: traces, metrics, logs; correlation ids. Contract: every task/run has a run id and spans for each tool call, with outcome and duration recorded.
    • Checklist: trace id, span per tool call, duration, outcome.
  • Artifacts: build outputs, diffs, evaluation reports, replay bundles. Contract: store a bundle per run (inputs, tool versions, command lines, logs, diffs, and evaluation result) with a retention policy.
    • Checklist: replay manifest, checksums, retention policy, access controls.
  • Security: secrets handling, network egress controls, least privilege. Contract: allowlists cover tools, filesystem paths, and network egress; secrets are injected only at execution time and never written to artifacts.
    • Checklist: tool/path/egress allowlists, credential scope, redaction scan.
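The orchestration contract above (an idempotency key per task, bounded retries with backoff) can be sketched as follows; the key derivation and retry shape are illustrative assumptions:

```python
import hashlib
import time

def idempotency_key(repo: str, task_spec: str, base_sha: str) -> str:
    """Derive a stable key so a re-enqueued task deduplicates to the same work item."""
    material = f"{repo}\n{task_spec}\n{base_sha}".encode()
    return hashlib.sha256(material).hexdigest()

def run_with_retries(step, max_retries: int = 2, base_delay: float = 1.0):
    """Run a step with a bounded retry count and exponential backoff."""
    for attempt in range(max_retries + 1):
        result = step()  # step returns a structured outcome dict
        if result.get("exit_code") == 0:
            return result
        if attempt < max_retries:
            time.sleep(base_delay * (2 ** attempt))
    return result  # last failure, surfaced to the caller
```

Because the key depends only on the task inputs, duplicate submissions collapse to one unit of work, and the retry bound keeps a flaky tool from consuming the error budget unboundedly.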

Concrete Example 1

Sandboxed tool execution for code changes.

  • Trigger: a proposed patch (diff) plus a task spec (e.g., “fix failing test X”). Include the target branch SHA and a pinned environment (container image + lockfile).
  • Sandbox: start an isolated runner with no ambient credentials. Mount the repo read-write and restrict filesystem + network egress to an allowlist.
  • Tool calls:
    • Run a fixed sequence (format/lint → unit tests → build).
    • Enforce step timeouts (e.g., 5m/unit, 15m/build) and bounded retries for flaky steps (e.g., 2 retries with exponential backoff).
  • Artifact bundle (stored per run id):
    • Persist the patch and tool call transcripts (commands, versions, exit codes).
    • Persist logs, test reports (JUnit/JSON), build outputs, and a replay manifest for the same inputs.
  • Evaluation gate:
    • Promote only if required checks pass (e.g., all tests green, no new lints, diff applies cleanly).
    • Require reproducibility: a replay succeeds at least once, or the environment hash matches a known-good cache entry.
  • On failure, generate a human-facing summary with: run id link, failed step, and top error class. Include a short “what to try next” hint (e.g., rerun without cache, or inspect a specific log).
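The fixed tool sequence with hard per-step timeouts can be sketched with Python's standard subprocess module; the commands below are placeholders, and a real runner would execute inside the sandbox:

```python
import subprocess

# Illustrative step sequence with per-step wall-clock limits (seconds).
STEPS = [
    ("lint", ["echo", "lint ok"], 60),
    ("unit", ["echo", "tests ok"], 300),
    ("build", ["echo", "build ok"], 900),
]

def run_steps(steps):
    """Run each step under a hard timeout; stop at the first failure or timeout."""
    transcript = []
    for name, cmd, timeout in steps:
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
            outcome = "pass" if proc.returncode == 0 else "fail"
            transcript.append({"step": name, "exit_code": proc.returncode,
                               "outcome": outcome})
            if proc.returncode != 0:
                break
        except subprocess.TimeoutExpired:
            # A timed-out step has no exit code; record it as its own error class.
            transcript.append({"step": name, "exit_code": None, "outcome": "timeout"})
            break
    return transcript
```

The transcript is exactly the per-call record the replay bundle needs: step name, exit code, and outcome, in execution order.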

Concrete Example 2

Cost-aware autonomy for a batch of maintenance tasks.

  • Budget: per-task token/cost ceilings (e.g., $0.50 and 20k tokens) plus a batch budget (e.g., $50/day), enforced by the orchestrator.
  • Strategy: fail fast on low-signal tasks and escalate to human review when confidence is low or blast radius is high.
  • Decision policy:
    1. Treat a task as “low-signal” when there is no progress after N tool steps (e.g., 6).
    2. Treat a task as “low-signal” when the same error repeats (e.g., the same stack trace twice).
    3. Treat a task as “low-signal” when predicted cost-to-complete exceeds remaining budget.
    4. Escalate when the change touches production config, security-sensitive files, or the diff exceeds a size threshold (e.g., >200 lines changed).

Pseudo-code makes the budget controller explicit (ordering, stop conditions, and escalation points):

// Budget controller for one task inside a batch
State -> {task_budget, batch_budget, step_count: 0, last_error_class: None, repeats: 0}

Loop N Times
  If task_budget <= 0 Then Escalate("Task budget exceeded") Else Continue()
  If batch_budget <= 0 Then Escalate("Batch budget exceeded") Else Continue()

  Step -> RunTool()
  step_count -> step_count + 1

  cost -> EstimateCost(Step)
  task_budget -> task_budget - cost
  batch_budget -> batch_budget - cost

  error -> ClassifyError(Step)
  If error != None And error == last_error_class Then repeats -> repeats + 1 Else repeats -> 0
  last_error_class -> error

  // repeats >= 1 means the same error class occurred twice in a row
  If repeats >= 1 Then Escalate("Repeated error") Else Continue()
  If DetectProgress(Step) == false And step_count >= 6 Then Escalate("No progress") Else Continue()
  If PredictCostToComplete() > task_budget Then Escalate("Over budget") Else Continue()
End
  • If the per-task or batch budget is exceeded:
    • Stop further tool calls.
    • Write a short spend-and-status summary (last step, last error, run id).
    • Escalate for human review.
  • Measure (throughput): cost per successful task and time-to-merge.
  • Measure (quality): regression rate (e.g., rollback or test failures within 24h) and “wasted spend” (tokens spent on tasks that are abandoned or escalated).
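The same controller can be made runnable in Python; the tool, cost, and error functions are stand-in stubs injected by the caller, not a real API:

```python
def budget_controller(run_tool, estimate_cost, classify_error,
                      detect_progress, predict_cost_to_complete,
                      task_budget, batch_budget, max_steps=6):
    """Return ("done", steps) on completion or ("escalate", reason) per the policy."""
    step_count, last_error, repeats = 0, None, 0
    while True:
        if task_budget <= 0:
            return ("escalate", "Task budget exceeded")
        if batch_budget <= 0:
            return ("escalate", "Batch budget exceeded")

        step = run_tool()
        step_count += 1

        cost = estimate_cost(step)
        task_budget -= cost
        batch_budget -= cost

        error = classify_error(step)
        repeats = repeats + 1 if (error is not None and error == last_error) else 0
        last_error = error

        if repeats >= 1:                       # same error class twice in a row
            return ("escalate", "Repeated error")
        if not detect_progress(step) and step_count >= max_steps:
            return ("escalate", "No progress")
        if predict_cost_to_complete(step) > task_budget:
            return ("escalate", "Over budget")
        if step.get("status") == "complete":
            return ("done", step_count)

# Simulated run: the same error class appears twice, triggering escalation.
steps = iter([{"status": "running", "error": "E1"},
              {"status": "running", "error": "E1"}])
result = budget_controller(
    run_tool=lambda: next(steps),
    estimate_cost=lambda s: 0.05,
    classify_error=lambda s: s["error"],
    detect_progress=lambda s: True,
    predict_cost_to_complete=lambda s: 0.0,
    task_budget=0.50, batch_budget=50.0)
```

Injecting the helpers keeps the policy testable in isolation, so the stop conditions can be exercised without spending real tokens.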

Trade-offs

  • Isolation increases safety but adds operational complexity. Default: start with containerized execution + allowlists; revisit if tool latency dominates (e.g., repeated cold starts) and you can prove tighter scoping by repo/path.
  • Strong observability increases insight but raises data retention requirements. Default: use structured logs + traces with short retention for raw logs and longer retention for summaries; revisit if incident analysis regularly needs deeper raw context.
  • Caching and replay improve speed but can mask nondeterminism if misused. Default: cache only deterministic steps (dependency installs keyed by lockfile, build outputs keyed by inputs). Periodically force no-cache replays; revisit if you observe drift or flaky tests that caching hides.
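The "cache only deterministic steps" default amounts to keying the cache on a content hash of the step's exact inputs; a minimal sketch:

```python
import hashlib

def cache_key(step_name: str, *input_blobs: bytes) -> str:
    """Key a cached step on the exact bytes of its inputs (e.g., the lockfile)."""
    h = hashlib.sha256()
    h.update(step_name.encode())
    for blob in input_blobs:
        h.update(b"\x00")   # separator so different concatenations cannot collide
        h.update(blob)
    return h.hexdigest()

# Identical lockfile bytes -> identical key; any change invalidates the cache.
k1 = cache_key("deps-install", b"lockfile-v1")
k2 = cache_key("deps-install", b"lockfile-v1")
k3 = cache_key("deps-install", b"lockfile-v2")
```

If a step's outputs are not a pure function of the hashed inputs, this key masks nondeterminism, which is why periodic no-cache replays are the companion control.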

Failure Modes

  • Non-reproducible runs: environment drift makes traces hard to replay.
    • Detection: replay fails with different dependency resolutions; tool versions differ from the recorded manifest; repeated “works on runner A but not runner B” incidents.
    • Mitigation: pin images and dependencies; record tool versions and hashes in the replay manifest; run periodic “replay audits” that re-execute a sample of recent runs.
  • Leaky permissions: tool plane has broader access than intended.
    • Detection: outbound network calls to unexpected domains; tools reading/writing outside approved paths; secrets appearing in logs or artifacts.
    • Mitigation: enforce network egress allowlists; run tools with least-privilege credentials scoped to a repo/task; add secret redaction and artifact scanning before persistence.
  • Noisy observability: too much unstructured logging reduces signal.
    • Detection: high log volume with low queryability; incident timelines require manual grepping; key metrics (duration, retries, error class) missing from dashboards.
    • Mitigation: emit structured events for each tool call; standardize error classes and outcome codes; sample verbose logs while keeping full traces for failed runs only.

Research Directions

  • Deterministic replay scoring: define a replay score per run (e.g., environment match, tool-version match, output match) and track it over time.
  • Replay audit sampling: pick a weekly sample of “green” runs, replay them with no cache, and record the delta (time, outputs, flake rate).
  • Artifact bundle correctness: treat missing inputs (patch, lockfiles, tool versions, output schema versions, command lines) as a build-breaking error for the infrastructure.
  • Redaction guarantees: measure secret leakage rates by scanning logs and artifacts before persistence, then track false positives/negatives of redaction.
  • Cost-to-quality controllers: test policies that trade off retries, model choice, and tool parallelism against regression rate and cost-to-merge.
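The replay score in the first bullet could be as simple as a weighted match across facets of the recorded manifest; the facet names and weights below are arbitrary assumptions to make the idea concrete:

```python
def replay_score(recorded: dict, replayed: dict, weights=None) -> float:
    """Score a replay in [0, 1] by the weighted share of matching facets."""
    if weights is None:
        weights = {"image_hash": 0.4, "tool_versions": 0.3, "outputs": 0.3}
    return sum(w for key, w in weights.items()
               if recorded.get(key) == replayed.get(key))

# A replay that reproduces the environment but not the outputs scores partially.
recorded = {"image_hash": "sha256:aaaa", "tool_versions": "pytest=8.1.0",
            "outputs": "all-green"}
drifted = {"image_hash": "sha256:aaaa", "tool_versions": "pytest=8.1.0",
           "outputs": "one-red"}
```

Tracking this score over time turns "can we replay?" from a binary audit result into a trend a dashboard can watch.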