Essays · Jun 8, 2026 · 11 min read

The Harness Is the Product

The roles an agent harness must play if you want agents that improve

A practical playbook for building agent harnesses: operating contracts, context, tools, authority, model routing, traces, evals, admin telemetry, and self-improvement loops.

The Agentic Economy

Assume the reader already knows the basic definition: an agent harness is the runtime around a model. It gives the model context, tools, permissions, memory, traces, evals, and a way to recover from failure.

The useful question is not "what is a harness?"

The useful question is: what jobs must the harness perform so an agent can become more reliable, cheaper, safer, and more autonomous over time?

For us, the answer is this: a harness is the product layer that turns model intelligence into accountable work. It decides what the agent can know, which model it should use, which tools it can call, which actions require approval, what counts as done, and how each failure becomes a better future run.

If a harness does not improve success per dollar, success per minute, or success per unit of authority, it is not a serious harness yet.

The Roles of a Good Harness

A production harness has at least ten roles.

1. Operating contract. It defines the agent's identity, capabilities, channels, tool surface, authority policy, routing policy, output contract, budget rules, memory rules, and review expectations.

2. Context manager. It decides what enters the context window, what stays in durable state, what gets compacted, and what must survive a handoff.

3. Tool broker. It exposes tools as typed, retry-safe, agent-native actions rather than human dashboards or vague API calls.

4. Authority system. It separates read, write, external message, spend, customer-impacting, destructive, and admin actions.

5. Model router. It sends each task to the cheapest model that can do it reliably, then escalates when uncertainty or risk rises.

6. Trace store. It records model calls, tool calls, observations, artifacts, approvals, costs, duration, failures, retries, and final outcomes.

7. Evaluator. It scores the trace and artifacts, not just the final answer.

8. Reviewer loop. It adds planner, executor, verifier, critic, domain reviewer, or final approver roles only where extra review improves quality per dollar.

9. Admin surface. It shows humans whether the harness is improving: pass rate, recovery rate, cost per successful task, stuck runs, model routes, authority events, and baseline deltas.

10. Self-improvement loop. It turns production failures into evals, evals into scoped fixes, and fixes into reviewed baseline changes.

Most agent products over-invest in the model call and under-invest in these roles. That is backwards.

Start With an Operating Contract

The prompt should not be the only source of truth. The harness needs a versioned operating contract that compiles into prompt sections, runtime policy, tool permissions, and eval scaffolds.

The contract should answer:

What kind of agent is this?
Which user-visible and internal channels exist?
Which tools are available?
Which actions can run automatically?
Which actions require approval?
Which model routes are allowed?
Which memories are durable?
Which outputs are valid?
Which costs and latencies are acceptable?
Which events must be logged?
Which evals must pass before a new version is promoted?

This is where harness design becomes concrete. If the contract says customer emails require approval, the harness enforces it and the eval suite checks it. If the contract says low-risk extraction should use a cheap model, the router enforces it and telemetry measures overuse. If the contract says a long-running task must leave a checkpoint, the trace should prove that it did.

The contract is the part you can diff, review, version, and test.
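As a minimal sketch of what "diffable and testable" could mean, the contract can be a small versioned data structure that compiles into a prompt section and answers policy questions at runtime. Every name here (`OperatingContract`, its fields, the example agent and tools) is a hypothetical illustration, not an existing schema.

```python
from __future__ import annotations
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingContract:
    """Hypothetical contract shape; all field names are assumptions."""
    agent: str
    version: str
    tools: tuple[str, ...]
    approval_required: frozenset[str]  # action classes that need a human
    allowed_routes: tuple[str, ...]
    max_cost_usd: float
    required_evals: tuple[str, ...]

    def requires_approval(self, action_class: str) -> bool:
        """Runtime policy check the harness enforces and the eval suite can assert on."""
        return action_class in self.approval_required

    def prompt_section(self) -> str:
        """Compile the part of the contract the model should see."""
        return "\n".join([
            f"Agent: {self.agent} (contract {self.version})",
            "Tools: " + ", ".join(self.tools),
            "Ask for approval before: " + ", ".join(sorted(self.approval_required)),
        ])


contract = OperatingContract(
    agent="support-agent",
    version="1.2.0",
    tools=("search_customers", "draft_email", "send_email"),
    approval_required=frozenset({"external_message", "spend"}),
    allowed_routes=("cheap", "balanced", "strong"),
    max_cost_usd=0.50,
    required_evals=("authority", "routing"),
)
print(contract.prompt_section())
```

Because the same object drives both the prompt text and the approval check, a change to `approval_required` shows up in a diff and can be covered by a test.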

Context Is a Product Surface

Long-running agents fail when they start each session with a foggy idea of what happened before. Anthropic's work on effective harnesses for long-running agents is useful because it treats this as an environment problem, not a motivational prompt problem: initialize the workspace, keep a progress file, use structured feature lists, work incrementally, verify before marking done, and leave the next run in a clean state.

The general rule is simple: if a human would need a handoff note, the agent needs a checkpoint.

For a company agent, the checkpoint should include current goals, revenue baselines, active risks, recent decisions, pending approvals, integration status, and the next recommended observation.

For a studio agent, it should include story state, character continuity, visual references, asset lineage, shot history, render failures, feedback decisions, and open creative questions.

Do not put the whole company or the whole film into every prompt. Give the agent a compact state package plus navigable references. Project instructions should be a map, not an encyclopedia.
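A checkpoint of this kind can be sketched as a small serializable record: compact state inline, plus references the agent can follow on demand. The field names and the JSON format are assumptions for illustration only.

```python
from __future__ import annotations
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical handoff note for the next run."""
    goals: list[str]
    recent_decisions: list[str] = field(default_factory=list)
    pending_approvals: list[str] = field(default_factory=list)
    next_observation: str = ""
    # name -> path or ID; the referenced content stays out of the prompt
    references: dict[str, str] = field(default_factory=dict)


def dump_checkpoint(cp: Checkpoint) -> str:
    """Serialize with sorted keys so successive checkpoints diff cleanly."""
    return json.dumps(asdict(cp), indent=2, sort_keys=True)


def load_checkpoint(text: str) -> Checkpoint:
    return Checkpoint(**json.loads(text))


cp = Checkpoint(
    goals=["stabilize MRR"],
    recent_decisions=["paused pricing experiment"],
    pending_approvals=["email to churned customers"],
    next_observation="churn by signup cohort",
    references={"revenue_baseline": "metrics/revenue_baseline.json"},
)
restored = load_checkpoint(dump_checkpoint(cp))
```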

Tools Should Be Agent-Native

A human tool says: "Open this dashboard and figure it out."

An agent-native tool says: "Here is a typed operation with stable IDs, structured output, a dry-run mode, a recovery hint, an authority class, and an idempotency key."

Every serious mutation tool should have:

typed inputs
structured outputs
stable resource IDs
dry-run support
authority class
preview artifact
idempotency key
rollback or compensation path
corrective error messages
independent verification

Bad tool error:

Invalid request.

Good tool error:

customer_id is missing. Search customers by email first, then retry with the returned customer ID.

That one change can turn a failed run into a self-correcting run. Tool design is not plumbing. It is part of the agent's intelligence.
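A minimal sketch of a mutation tool with several of these properties: typed inputs, dry-run by default, an idempotency key, an authority class in the output, and a corrective error. The refund operation, `ToolError`, and the in-memory store are hypothetical stand-ins, not a real API.

```python
from __future__ import annotations
from dataclasses import dataclass


class ToolError(Exception):
    """Error that carries a hint telling the agent what to do next."""
    def __init__(self, message: str, hint: str):
        super().__init__(f"{message} {hint}")
        self.hint = hint


@dataclass(frozen=True)
class RefundRequest:
    customer_id: str
    amount_cents: int
    idempotency_key: str
    dry_run: bool = True  # mutations preview by default


_applied: dict[str, dict] = {}  # stands in for a durable idempotency store


def issue_refund(req: RefundRequest) -> dict:
    if not req.customer_id:
        raise ToolError(
            "customer_id is missing.",
            "Search customers by email first, then retry with the returned customer ID.",
        )
    if req.amount_cents <= 0:
        raise ToolError("amount_cents must be positive.", "Retry with the amount in cents.")
    preview = {
        "customer_id": req.customer_id,
        "amount_cents": req.amount_cents,
        "authority_class": "spend",
    }
    if req.dry_run:
        return {"status": "preview", **preview}
    if req.idempotency_key in _applied:
        return _applied[req.idempotency_key]  # a retry returns the original result
    result = {"status": "applied", "refund_id": f"rf_{len(_applied) + 1}", **preview}
    _applied[req.idempotency_key] = result
    return result
```

Retrying a non-dry-run request with the same `idempotency_key` returns the stored result instead of issuing a second refund, which is what makes the tool retry-safe.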

Authority Is How Agents Earn Trust

Autonomy should not be a binary switch. It should be a portfolio of permissions earned through evidence.

Start with action classes:

read
write draft
write internal state
external message
spend
customer-impacting action
pricing change
production deploy
destructive/admin operation

Then assign each class a policy:

execute automatically
dry-run only
request approval
block

The harness should record where the agent asked when it could have acted, where it acted when it should have asked, which proposals were approved, which were rejected, and what happened after approval.

This is not only safety infrastructure. It is the trust ladder. If an agent repeatedly proposes good reversible actions, some of those actions can become autonomous. If it fails a class of action, the harness can reduce authority and create an eval.
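The class-to-policy mapping and the trust ladder can be sketched as a lookup table plus a promotion rule. The specific assignments and the promotion threshold (20 approvals, zero rejections) are assumptions chosen for illustration; real thresholds would come from your own evidence.

```python
from __future__ import annotations
from enum import Enum


class Policy(Enum):
    AUTO = "execute automatically"
    DRY_RUN = "dry-run only"
    APPROVAL = "request approval"
    BLOCK = "block"


# Hypothetical starting assignments for the action classes listed above.
POLICIES = {
    "read": Policy.AUTO,
    "write_draft": Policy.AUTO,
    "write_internal_state": Policy.AUTO,
    "external_message": Policy.APPROVAL,
    "spend": Policy.APPROVAL,
    "customer_impacting": Policy.APPROVAL,
    "pricing_change": Policy.DRY_RUN,
    "production_deploy": Policy.APPROVAL,
    "destructive_admin": Policy.BLOCK,
}


def decide(action_class: str, policies: dict = POLICIES) -> Policy:
    """Unknown action classes fail closed."""
    return policies.get(action_class, Policy.BLOCK)


def promote(policies: dict, action_class: str, approved: int, rejected: int,
            min_approved: int = 20) -> dict:
    """Return a new policy table; promote APPROVAL to AUTO only on a clean record."""
    updated = dict(policies)
    clean = approved >= min_approved and rejected == 0
    if policies.get(action_class) is Policy.APPROVAL and clean:
        updated[action_class] = Policy.AUTO
    return updated
```

`promote` returns a copy rather than mutating the table, so a promotion can be reviewed as a diff before it takes effect.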

Model Routing Is Core Harness Logic

One agent should not mean one model.

A harness should route by task class, evidence requirement, risk, context size, latency target, and cost target.

A practical routing policy looks like this:

cheap model for extraction, classification, summaries, formatting, and tool-argument repair
balanced model for normal planning, routine tool use, and status updates
strong model for ambiguous diagnosis, strategy, high-risk actions, and failure recovery
specialist model for code, visual generation, video, audio, OCR, or domain-specific judgment
judge route for calibrated eval grading, separated from the acting route

The harness should measure routing directly:

cost per task
cost per successful task
median wall-clock time
expensive-model overuse
underpowered-model failure rate
escalation success rate
recovery rate after tool errors
quality delta by route

This is one of the highest-leverage parts of the system. A better harness can make weaker models useful by giving them clearer context, narrower tools, stronger validation, and a safe escalation path.
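A routing policy like the one above can start as a plain function, with cost per successful task computed from run records. The task-class names, the three-tier ladder, and the metric definition (total spend, failed runs included, divided by successes) are assumptions in this sketch.

```python
from __future__ import annotations

ROUTES = ("cheap", "balanced", "strong")  # escalation order
CHEAP_TASKS = {"extraction", "classification", "summary", "formatting", "tool_argument_repair"}
STRONG_TASKS = {"ambiguous_diagnosis", "strategy", "failure_recovery"}


def choose_route(task_class: str, high_risk: bool = False) -> str:
    """Risk overrides task class; anything unlisted goes to the balanced route."""
    if high_risk or task_class in STRONG_TASKS:
        return "strong"
    if task_class in CHEAP_TASKS:
        return "cheap"
    return "balanced"


def escalate(route: str) -> str:
    """Move one step up the ladder, staying at the top once there."""
    index = ROUTES.index(route)
    return ROUTES[min(index + 1, len(ROUTES) - 1)]


def cost_per_successful_task(runs: list[tuple[float, bool]]) -> float | None:
    """runs holds (cost_usd, succeeded) pairs; failed runs still count toward cost."""
    successes = sum(1 for _, ok in runs if ok)
    if successes == 0:
        return None
    return sum(cost for cost, _ in runs) / successes
```

Counting failed runs in the numerator is the point of the metric: a cheap route that fails often can cost more per success than an expensive route that rarely fails.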

Methods Sit Between Skills and Tools

Prompts are flexible. Tools are precise. Skills are reusable guidance. But production work often needs a middle layer: methods.

A method is a repeatable domain workflow that can be read by humans and executed by software. It defines inputs, domain concepts, steps, model routes, tool calls, validations, outputs, and traces.

For Win.sh, a method might be:

input: company state, metrics, active risks, founder goal
classify the business situation
choose the next observation
gather required evidence
propose reversible actions
check authority policy
execute safe actions or request approval
output: decision memo, action plan, updated memory, eval candidate

For Melies, a method might be:

input: script excerpt, style bible, character bible, render budget
decompose into scenes
generate shot list
route each shot to the right model
verify character and location continuity
produce prompts and asset manifest
output: storyboard, render plan, cost estimate, continuity report

The distinction matters:

a prompt expresses intent
a skill teaches an approach
a method executes a repeatable workflow
a tool changes the world
an eval decides whether behavior improved

The best harnesses will make methods first-class.
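One way to make a method readable by humans and executable by software is an ordered list of named steps over shared state, where the runner records a trace. This is a sketch under that assumption; the step functions, state keys, and the three steps chosen from the Win.sh example are hypothetical.

```python
from __future__ import annotations
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[dict], dict]  # takes the state, returns fields to merge in


def run_method(steps: list[Step], state: dict) -> tuple[dict, list[str]]:
    """Run steps in order, merging each result into the state and recording a trace."""
    trace = []
    for step in steps:
        state = {**state, **step.run(state)}
        trace.append(step.name)
    return state, trace


def classify(state: dict) -> dict:
    return {"situation": "mrr_drop" if state["mrr_change"] < 0 else "stable"}


def choose_observation(state: dict) -> dict:
    target = "churn_by_cohort" if state["situation"] == "mrr_drop" else "none"
    return {"next_observation": target}


def check_authority(state: dict) -> dict:
    return {"needs_approval": state["proposed_action_class"] != "read"}


METHOD = [
    Step("classify the business situation", classify),
    Step("choose the next observation", choose_observation),
    Step("check authority policy", check_authority),
]

final_state, trace = run_method(
    METHOD, {"mrr_change": -0.08, "proposed_action_class": "external_message"}
)
```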

Score the Trace, Not Just the Answer

The most common eval mistake is grading only the final response.

For agents, the trace is where the behavior lives. Did the agent observe before acting? Did it call the required tools? Did it avoid forbidden tools? Did it use the right model route? Did it ask for approval? Did it recover from malformed tool arguments? Did it verify the result? Did it leave a checkpoint? Did it stay inside budget?

OpenAI's guide to testing agent skills with evals frames the eval unit well: prompt, captured run, trace plus artifacts, checks, and a score that can be compared over time.

Use an eval ladder:

1. Deterministic checks. Schema validity, required tools, forbidden tools, authority events, routing, cost ceilings, checkpoints, redaction.

2. Artifact checks. Did the report render? Did the video include the required scenes? Did the code build? Did the data match the expected query?

3. Rubric checks. Did the recommendation use the right evidence? Was uncertainty calibrated? Was the creative direction coherent?

4. Human review. Sampled traces, adjudication, failure labels, baseline approval.

5. Online/shadow evals. Production traces, delayed outcomes, user corrections, A/B comparisons.

Synthetic traces are useful, but they are not enough. A real business harness needs simulated worlds and production outcome loops.
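The first rung of the ladder is the cheapest to build. A minimal sketch of deterministic checks over a trace, assuming each event is a dict with a `type` field and optional `tool` and `cost_usd` fields (an illustrative event shape, not a standard):

```python
from __future__ import annotations


def check_trace(events: list[dict], required_tools: list[str],
                forbidden_tools: list[str], cost_ceiling_usd: float) -> list[str]:
    """Return a list of failure strings; an empty list means the run passed."""
    failures = []
    called = [e["tool"] for e in events if e["type"] == "tool_call"]
    for tool in required_tools:
        if tool not in called:
            failures.append(f"missing required tool: {tool}")
    for tool in forbidden_tools:
        if tool in called:
            failures.append(f"forbidden tool called: {tool}")
    total_cost = sum(e.get("cost_usd", 0.0) for e in events)
    if total_cost > cost_ceiling_usd:
        failures.append(f"cost {total_cost:.2f} exceeds ceiling {cost_ceiling_usd:.2f}")
    if not any(e["type"] == "checkpoint" for e in events):
        failures.append("no checkpoint written")
    return failures


good_run = [
    {"type": "tool_call", "tool": "search_customers", "cost_usd": 0.01},
    {"type": "model_call", "cost_usd": 0.04},
    {"type": "checkpoint"},
]
bad_run = [{"type": "tool_call", "tool": "delete_customer", "cost_usd": 0.50}]
```

Returning failure strings rather than a single boolean keeps the result usable as a failure label later in the improvement loop.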

Simulate the World the Agent Acts In

Win.sh is not just a workflow tool. It is an agent operating a company from incomplete evidence.

The important facts are often indirect: customer intent, churn risk, product quality, market timing, pricing sensitivity, competitor pressure, and founder constraints. The signals come from Stripe, analytics, email, support, GitHub, search traffic, customer interviews, and founder feedback. The possible actions are observe, diagnose, ask, recommend, execute, roll back, update memory, or create an eval.

So the Win.sh eval suite should include scenarios like:

ambiguous MRR drop
churn spike after a product change
traffic drop with stable revenue
support issue with customer harm
missing integration permission
budget exhausted mid-task
risky external action above authority
pricing rollback decision
opportunity validation where not building is correct
repeated failure becoming a new eval

Melies has the same structure in a creative domain. The important facts are taste, continuity, model quirks, asset constraints, and audience response. The signals are scripts, renders, feedback, references, failed generations, and asset metadata. The possible actions are decompose, prompt, render, compare, revise, ask for signoff, update the project bible, and create variants.

The Melies eval suite should include:

script-to-shot decomposition
character continuity
location continuity
style adherence
prompt repair
render budget control
asset provenance
feedback incorporation
edit continuity
human signoff for subjective decisions

The shared principle: do not only test whether the agent can write a plausible memo. Test whether it chooses the right next observation or action with incomplete information.

Humans Need to See the Harness

OpenAI's harness engineering post is interesting because it makes the environment legible to agents: repositories, tests, logs, metrics, browser state, traces, and review loops become things agents can inspect and act on.

The inverse is equally important: the harness must be legible to humans.

An admin surface should show:

installed harnesses
contract versions
active methods
active workspaces
integration health
latest eval results
pass rate by suite
cost per successful task
median task time
recovery rate
stuck-run rate
model route distribution
approval and rejection rates
baseline deltas
top failure clusters

Without this surface, improvement becomes anecdotal. With it, every harness release can answer: did success rate improve, did cost fall, did recovery improve, did autonomy increase safely, and did regressions stay contained?
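A few of these numbers can be sketched directly from run records. The record shape and the metric definitions are assumptions here; in particular, "recovery rate" is taken to mean the share of runs that hit a tool error and still passed.

```python
from __future__ import annotations


def summarize(runs: list[dict]) -> dict:
    """runs: dicts with boolean 'passed', 'had_tool_error', and 'stuck' fields."""
    total = len(runs)
    if total == 0:
        return {"pass_rate": None, "recovery_rate": None, "stuck_rate": None}
    errored = [r for r in runs if r["had_tool_error"]]
    recovered = sum(1 for r in errored if r["passed"])
    return {
        "pass_rate": sum(1 for r in runs if r["passed"]) / total,
        "recovery_rate": recovered / len(errored) if errored else None,
        "stuck_rate": sum(1 for r in runs if r["stuck"]) / total,
    }


def baseline_delta(current: dict, baseline: dict) -> dict:
    """Per-metric change against the baseline, skipping metrics missing on either side."""
    return {
        key: current[key] - baseline[key]
        for key in current
        if current[key] is not None and baseline.get(key) is not None
    }


runs = [
    {"passed": True, "had_tool_error": False, "stuck": False},
    {"passed": True, "had_tool_error": True, "stuck": False},
    {"passed": False, "had_tool_error": True, "stuck": False},
    {"passed": False, "had_tool_error": False, "stuck": True},
]
summary = summarize(runs)
```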

Self-Improvement Needs Gates

"Self-improving agent" should not mean "the agent rewrites its prompt after a bad run."

A safe loop looks like this:

Capture the trace.
Capture the outcome.
Capture human correction.
Label the failure.
Cluster repeated failures.
Promote the cluster into an eval.
Create a scoped improvement task.
Change the prompt, contract, tool, router, method, verifier, or docs.
Run targeted evals.
Run regression evals.
Require review before updating the baseline.

Self-improvement without evals is mutation. Self-improvement with traces, evals, regression gates, and review is engineering.
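Two of the gates in that loop can be sketched in a few lines: promoting repeated failure labels into eval candidates, and refusing a baseline update unless every check has passed. The threshold of three occurrences and the function names are assumptions for illustration.

```python
from __future__ import annotations
from collections import Counter


def promote_clusters(failure_labels: list[str], min_count: int = 3) -> list[str]:
    """Labels seen at least min_count times become eval candidates."""
    counts = Counter(failure_labels)
    return sorted(label for label, n in counts.items() if n >= min_count)


def may_update_baseline(targeted_passed: bool, regression_passed: bool,
                        reviewer_approved: bool) -> bool:
    """All three gates must hold; a passing fix alone is not enough."""
    return targeted_passed and regression_passed and reviewer_approved


labels = [
    "skipped_approval", "malformed_tool_args", "skipped_approval",
    "no_checkpoint", "skipped_approval", "malformed_tool_args",
]
candidates = promote_clusters(labels)
```

With these labels, only `skipped_approval` reaches the threshold, so one-off failures do not each spawn an eval.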

What We Should Build

The open-source harness should stay independent. It should contain the primitives that are safe to share:

operating contracts
method definitions
typed tool adapters
authority policies
model routing policies
trace capture
eval runners
metric schemas
admin telemetry
export formats

Win.sh and Melies should integrate it through private adapters.

Win.sh keeps private company data, business methods, founder preferences, customer traces, revenue data, and production outcomes.

Melies keeps private scripts, assets, references, taste memory, render history, licensing metadata, and creative feedback.

The shared harness should know how to run, measure, route, review, and improve. The product-specific adapters should know what a company or a film is.

That boundary matters if we ever open source the harness. The public project should demonstrate the pattern without leaking business state.

The Build Order

If we want the best possible harness, I would build in this order:

1. Contract runtime. Versioned agent contracts compiled into prompt sections, policy, tool permissions, and eval expectations.

2. Trace store. Every run emits structured events for model calls, tools, authority, artifacts, costs, retries, failures, and outcomes.

3. Typed tool layer. Dry-run, authority class, idempotency, structured errors, preview artifacts, and verification hooks.

4. Model router. Task-class routes, cost/latency budgets, escalation policy, and route-level metrics.

5. Deterministic evals. Required tools, forbidden tools, authority, routing, checkpoints, redaction, cost ceilings.

6. Method layer. Reusable workflows for company operations and studio production.

7. Admin page. Evals, traces, model routes, costs, recovery, stuck runs, failures, approvals, and baseline deltas.

8. Simulated worlds. Win.sh company scenarios and Melies production scenarios with delayed outcomes.

9. Reviewer loops. Planner, executor, verifier, critic, domain reviewer, final approver, measured against single-agent baselines.

10. Self-improvement pipeline. Production failure to eval to scoped fix to regression run to reviewed baseline.

The goal is not to make agents look magical.

The goal is to make them accountable enough that we can safely give them more work.

That is why the harness is the product.