The Harness Is the Product

What we learned building agents that can actually run companies

The durable advantage in AI agents is shifting from prompts and models to the harness: the operating environment that gives agents tools, memory, authority, evals, routing, and a self-improvement loop.

21 min read · The Agentic Economy

The model is not the agent.

This sounds obvious until you watch teams build agents. Most start by asking which model to use, which prompt to write, which tool list to expose. Then the agent fails in the same predictable ways: it loses context, calls the wrong tool, burns the expensive model on trivial work, declares victory too early, forgets what it learned, and cannot prove whether it improved.

The missing thing is the harness.

Not a thin wrapper around the API. Not a prompt template. Not a chat UI. A real harness is the operating environment around the model: state, tools, memory, authority, routing, evals, traces, budgets, artifacts, recovery loops, and human review.

The last few weeks made this clearer than anything else we have built in the AI CEO experiment.

We started with a simple question: what should a good agent harness do? The answer became much larger. If you want an agent to create videos, operate a company, resolve disputes, run finance workflows, or improve itself over time, the harness becomes the product. The model is only one component inside it.

This is the lesson from public writing on harnesses and production evals, a recent transcript, and the messy reality of trying to build something like Win.sh.

The future of agents will not be won by the best prompt.

It will be won by the best harness.

Prompting Was the First Layer

The first generation of agent engineering was prompt engineering.

You wrote a long system prompt. You listed rules. You told the model its role. You gave examples. You added "think step by step" or "be concise" or "never do X." When something failed, you added another sentence to the prompt.

This worked for assistants. It does not scale to operators.

A prompt can describe a rule. A harness can enforce it.

A prompt can tell an agent to ask before spending money. A harness can classify a tool call as spend, check the authority matrix, convert the action into an approval request, log the event, and fail the eval if the agent bypasses approval.

A prompt can tell an agent to use a cheap model for summaries. A harness can route the task, record the model call, compare cost per successful task against baseline, and alert when expensive routes are overused.

A prompt can tell an agent to remember what it learned. A harness can create a checkpoint, update memory, attach artifacts, and make the next run read the durable state before acting.

The difference is testability.

Prompts are instructions. Harnesses are contracts.

The Contract Is the New Prompt

One artifact changed how I think about this. It was not an official paper. It was one of those leaked or semi-public prompt artifacts people pass around. It should not be treated as truth. It is useful only as a product specimen.

What mattered was not the secret content. What mattered was the structure.

The prompt separated identity, message protocol, capabilities, behavior, memory, monetization, platform UI rules, voice, product facts, and formatting. In other words, it was not just telling the model what to say. It was defining an operating contract.

That is the right abstraction.

An agent should have a versioned operating contract with sections like:

  • identity
  • message channels
  • capabilities
  • tool catalog
  • authority policy
  • model-routing policy
  • memory and context policy
  • output contract
  • budget guardrails
  • UI affordances
  • product facts
  • eval expectations

The contract should compile into prompt sections, runtime policy, and eval scaffolds.

If the contract says internal specialist agents should be hidden from the user, we can eval that. If it says external customer messages require approval, we can eval that. If it says low-risk summaries should use the cheap route, we can eval that. If it says long-running work needs a checkpoint, we can eval that.
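To make "the contract compiles" concrete, here is a minimal sketch in Python. It is an illustration, not the Win.sh implementation: the `Rule` and `Contract` classes, their field names, and the policy keywords are all assumptions made up for this example. The point is only that one versioned object can emit the prompt text, the runtime policy, and the eval scaffold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One contract clause. All field names are hypothetical."""
    rule_id: str
    section: str     # e.g. "authority policy"
    text: str        # wording that ends up in the prompt
    enforce: str     # runtime policy keyword, e.g. "require_approval"
    eval_check: str  # name of the eval that verifies this clause


@dataclass(frozen=True)
class Contract:
    version: str
    rules: tuple

    def to_prompt(self) -> str:
        """Render clauses in a stable order so the prompt does not churn."""
        lines = [f"Operating contract v{self.version}"]
        for rule in sorted(self.rules, key=lambda r: (r.section, r.rule_id)):
            lines.append(f"[{rule.section}] {rule.text}")
        return "\n".join(lines)

    def to_policy(self) -> dict:
        """Map each clause to the keyword the runtime enforces."""
        return {r.rule_id: r.enforce for r in self.rules}

    def to_eval_scaffold(self) -> list:
        """One eval stub per clause, pinned to the contract version."""
        return [
            {"rule_id": r.rule_id, "check": r.eval_check,
             "contract_version": self.version}
            for r in self.rules
        ]
```

Because all three outputs come from the same object, a clause cannot exist in the prompt without also having an enforcement keyword and an eval name.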

The prompt is no longer the source of truth. The contract is.

Codex Makes the Harness Concrete

Codex is useful to study because it is not just a coding model. It is a harness around a coding model.

The public shape is simple: the same underlying harness can sit behind the desktop app, terminal, IDE, and cloud task experience. The interface changes. The runtime loop does not.

That loop looks roughly like this:

  1. initialize context from conversation history, user request, system instructions, project instructions, tools, and loaded skills
  2. let the model reason
  3. catch function calls
  4. execute tools
  5. inject tool results back into context
  6. repeat until the task is complete
  7. apply the resulting changes through controlled mutation surfaces
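Steps 1 through 6 of that loop can be sketched in a few lines. This is not Codex's code. It is a toy version under stated assumptions: `model` is any callable that returns either a tool request or a final answer, the message dictionaries are a made-up shape, and step 7 (controlled mutation) is left out.

```python
from typing import Callable


def run_loop(model: Callable[[list], dict],
             tools: dict,
             context: list,
             max_turns: int = 10):
    """Run a reason/act loop until the model returns a final answer.

    Assumed protocol: model(context) returns {"tool": name, "args": {...}}
    to request a tool call, or {"final": text} when the task is complete.
    """
    context = list(context)  # do not mutate the caller's history
    for _ in range(max_turns):
        reply = model(context)
        if "final" in reply:
            return reply["final"], context
        name = reply["tool"]
        args = reply.get("args", {})
        if name not in tools:
            # Corrective error: tell the model what it can call instead.
            result = f"error: unknown tool {name!r}; available: {sorted(tools)}"
        else:
            result = tools[name](**args)
        # Inject the tool result back into context for the next turn.
        context.append({"role": "tool", "name": name, "content": result})
    raise RuntimeError("turn budget exhausted before the task completed")
```

The turn cap is the smallest possible budget guardrail. A real harness would also cap cost and wall-clock time.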

Two Codex product decisions are worth copying.

First, command execution is interactive and session-based. Instead of treating every terminal call as an isolated one-shot, the harness can keep a live session and write back into it. This matters because real work is stateful. Servers keep running. Test output streams. A long install continues. A browser session has a URL, cookies, errors, and logs.

Second, file editing is separate from arbitrary shell commands. The model is naturally good at structured diffs, so the harness should let it produce patches directly and enforce permissions around those patches. That is cleaner than making every edit happen through ad hoc shell text manipulation.

The general lesson is bigger than coding. A Win.sh harness should not let an agent mutate business state only through generic API calls. A Melies harness should not let an agent rewrite a project only through loose natural-language instructions. The harness should expose dedicated mutation tools: propose price change, draft customer email, update memory, render shot, revise character bible, create checkpoint, open approval request, apply patch.

Each mutation tool should have:

  • typed inputs
  • explicit authority class
  • dry-run mode
  • preview artifact
  • approval policy
  • idempotency key
  • rollback or compensation path
  • trace entry

That is how the harness turns "the agent did something" into "the agent made a controlled change we can inspect, approve, replay, and evaluate."
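A minimal sketch of such a mutation tool, assuming hypothetical names throughout (`MutationTool`, `preview_fn`, `apply_fn`). It covers dry-run, preview, approval, idempotency, and trace. Typed input validation and rollback are omitted to keep it short.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MutationResult:
    status: str   # "preview", "duplicate", "needs_approval", or "applied"
    preview: str  # human-readable preview artifact


class MutationTool:
    """A dedicated mutation surface. Every name here is illustrative."""

    def __init__(self, name: str, authority_class: str,
                 preview_fn: Callable[[dict], str],
                 apply_fn: Callable[[dict], None],
                 needs_approval: bool):
        self.name = name
        self.authority_class = authority_class
        self.preview_fn = preview_fn
        self.apply_fn = apply_fn
        self.needs_approval = needs_approval
        self._applied_keys = set()
        self.trace = []

    def call(self, args: dict, idempotency_key: str,
             dry_run: bool = True, approved: bool = False) -> MutationResult:
        preview = self.preview_fn(args)
        if dry_run:
            status = "preview"              # default: show, do not change
        elif idempotency_key in self._applied_keys:
            status = "duplicate"            # retry-safe: never apply twice
        elif self.needs_approval and not approved:
            status = "needs_approval"       # becomes an approval request
        else:
            self.apply_fn(args)
            self._applied_keys.add(idempotency_key)
            status = "applied"
        # Every call leaves a trace entry, whether or not it mutated state.
        self.trace.append({"tool": self.name,
                           "authority_class": self.authority_class,
                           "key": idempotency_key, "status": status})
        return MutationResult(status, preview)
```

Defaulting `dry_run` to `True` is a deliberate choice in this sketch: the agent has to opt in to changing the world.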

Codex also makes context hygiene feel less abstract. Long-running agents need compaction, but compaction can silently erase the wrong thing. Configuration changes made halfway through a task have to survive summarization. Images and screenshots consume more context than people expect. Tool lists need stable ordering if you want prompt caching to work. Changing the initial prompt every turn can destroy cache hits; appending config changes as later messages can preserve them.

So a serious harness needs context policy, not just memory:

  • keep initial prompt sections stable
  • sort tool definitions deterministically
  • treat config changes as explicit events
  • preserve the latest config after compaction
  • estimate image and artifact token cost
  • launch subagents for expensive exploration
  • return only distilled findings to the main thread
  • start a fresh run when the context has drifted beyond repair
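Two of those rules are easy to show in code: deterministic tool ordering and config that survives compaction. The sketch below is an assumption-heavy illustration. The message shape (`role`, `content`), the `"config"` role, and the `summarize` callback are invented for the example and do not describe any particular vendor's API.

```python
import json


def render_tools(tools: list) -> str:
    """Serialize tool definitions in a deterministic order.

    Sorting by name and by key keeps the rendered prefix byte-stable,
    which is what a prompt cache needs in order to hit.
    """
    ordered = sorted(tools, key=lambda t: t["name"])
    return json.dumps(ordered, sort_keys=True)


def compact(messages: list, keep_last: int, summarize) -> list:
    """Summarize old messages but carry the latest config event forward."""
    if len(messages) <= keep_last:
        return list(messages)
    cut = len(messages) - keep_last
    old, recent = messages[:cut], messages[cut:]
    compacted = [{"role": "summary", "content": summarize(old)}]
    configs = [m for m in old if m["role"] == "config"]
    # If the newest config change would be summarized away, re-append it.
    if configs and not any(m["role"] == "config" for m in recent):
        compacted.append(configs[-1])
    return compacted + recent
```

Note that config changes are appended as later messages rather than edited into the initial prompt, so the stable prefix stays intact.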

This also changes how we should write project instructions. A giant AGENTS.md is not wisdom. It is a landfill.

The better pattern is to use project instructions as a map: where to find product facts, coding standards, eval definitions, integration docs, domain methods, and examples. The agent should load what it needs when it needs it. If everything is in context, nothing is important.

This is directly applicable outside code. A company agent should not receive the entire company history in every run. It should receive a compact operating state, then links to the deeper artifacts it can inspect: Stripe baselines, customer segments, recent decisions, active risks, playbooks, failed experiments, eval results. A studio agent should receive the current creative state, then links to story bible, character sheets, shot history, visual references, asset lineage, and feedback decisions.

The harness should teach the agent how to navigate the organization, not drown it in the organization.

A Good Harness Has an Authority System

Authority is where agent demos become real systems.

Most agents are impressive until they can do damage. Then everyone quietly turns off the dangerous tools. That is not autonomy. It is a toy with a keyboard.

A useful harness needs action classes:

  • read
  • write
  • external message
  • spend
  • customer-impacting action
  • pricing change
  • production deploy
  • admin/destructive operation

And it needs policies:

  • execute automatically
  • dry-run only
  • ask for approval
  • block

This is not just for safety. It is how an agent earns trust.

In the AI CEO experiment, the agent should not start with full authority. It should begin by reading, diagnosing, documenting, and recommending. If its recommendations are consistently good, some actions move into autonomous territory. The authority matrix becomes a learning instrument.

The harness should track this:

  • what actions were proposed
  • what actions were approved
  • what actions were rejected
  • what happened after approval
  • where the agent asked when it could have acted
  • where it acted when it should have asked

Autonomy is not a binary switch. It is a portfolio of permissions earned through evidence.
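The action classes and policies above fit in a small lookup table. The sketch below uses the class and policy names from this section. The starting assignments and the promotion thresholds (20 decisions, 95% approval) are assumptions picked for illustration, not measured values.

```python
from enum import Enum


class Policy(Enum):
    AUTO = "execute automatically"
    DRY_RUN = "dry-run only"
    ASK = "ask for approval"
    BLOCK = "block"


# A hypothetical starting matrix: read freely, ask for anything that matters.
AUTHORITY = {
    "read": Policy.AUTO,
    "write": Policy.DRY_RUN,
    "external message": Policy.ASK,
    "spend": Policy.ASK,
    "customer-impacting action": Policy.ASK,
    "pricing change": Policy.ASK,
    "production deploy": Policy.ASK,
    "admin/destructive operation": Policy.BLOCK,
}


def decide(action_class: str, matrix: dict = AUTHORITY) -> Policy:
    """Unknown action classes are blocked rather than guessed."""
    return matrix.get(action_class, Policy.BLOCK)


def maybe_promote(matrix: dict, action_class: str, approved: int,
                  rejected: int, min_decisions: int = 20,
                  min_approval_rate: float = 0.95) -> dict:
    """Return a new matrix where an ASK class becomes AUTO once the
    approval history supports it. Blocked classes are never promoted."""
    total = approved + rejected
    new_matrix = dict(matrix)
    if (matrix.get(action_class) is Policy.ASK
            and total >= min_decisions
            and approved / total >= min_approval_rate):
        new_matrix[action_class] = Policy.AUTO
    return new_matrix
```

Promotion returns a new matrix instead of mutating the old one, so each change in authority can be versioned and reviewed like any other release.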

Model Routing Is Part of the Harness

Most teams still talk about models as if one agent uses one model.

That is already wrong.

A serious harness should route every task to the cheapest model that can do it reliably, then escalate when risk or uncertainty increases.

Summarizing yesterday's Stripe transactions does not need the same model as deciding whether to roll back a pricing experiment. Classifying a support email does not need the same model as diagnosing a churn spike with three plausible causes. Repairing malformed tool arguments does not need the same model as writing a board memo.

The harness should route by task class:

  • cheap model for extraction, summaries, classification, status updates
  • balanced model for normal tool use and routine planning
  • strong model for ambiguous strategy, high-risk decisions, eval judging, and failure recovery
  • specialist model for creative, coding, visual, or domain-specific work

Then it should measure the routing policy directly:

  • cost per task
  • cost per successful task
  • median wall-clock time
  • recovery rate after tool errors
  • expensive-model overuse
  • underpowered-model failure rate
  • route escalation effectiveness

This is one of the most important points in the whole system. Model routing is not an optimization after the agent works. It is part of why the agent works.
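A rough sketch of route-then-escalate, plus the one metric that keeps routing honest. The route names and task classes come from the lists above. The escalation rule (one tier up per prior failure, straight to the strong model on high risk) is an assumption, and the specialist route is left out for brevity.

```python
ROUTES = ["cheap", "balanced", "strong"]  # ordered from cheapest to strongest

# Hypothetical default route per task class.
TASK_ROUTE = {
    "extraction": "cheap",
    "summary": "cheap",
    "classification": "cheap",
    "status update": "cheap",
    "tool use": "balanced",
    "routine planning": "balanced",
    "strategy": "strong",
    "high-risk decision": "strong",
    "eval judging": "strong",
    "failure recovery": "strong",
}


def choose_route(task_class: str, high_risk: bool = False,
                 prior_failures: int = 0) -> str:
    """Start at the cheapest reliable route, escalate on risk or failure."""
    if high_risk:
        return "strong"
    base = TASK_ROUTE.get(task_class, "balanced")
    index = min(ROUTES.index(base) + prior_failures, len(ROUTES) - 1)
    return ROUTES[index]


def cost_per_successful_task(runs: list):
    """Total spend divided by successes. Failed runs still cost money,
    so they count in the numerator. Returns None with no successes."""
    successes = sum(1 for r in runs if r["success"])
    if successes == 0:
        return None
    return sum(r["cost"] for r in runs) / successes
```

Cost per successful task, not cost per task, is the number to watch: a cheap route that fails half the time can be the expensive one.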

A better harness can make a cheaper model behave like a more expensive one for many tasks. We saw the same pattern in the scaffolding benchmark: structured context made weaker models recall institutional knowledge far better. The lesson generalizes. Good environment design is worth model tiers.

Memory Is Not a Vector Store

Agents do not fail only because they lack information. They fail because they do not know what information matters, what is durable, what changed, and what must be carried forward.

Memory needs structure.

For a company agent, memory should include:

  • current goals
  • business model
  • active integrations
  • revenue and cost baselines
  • customer segments
  • recent decisions
  • known risks
  • authority level
  • failed experiments
  • founder preferences
  • mistakes to avoid

For a creative studio agent, memory should include:

  • story bible
  • character continuity
  • location continuity
  • visual style
  • asset lineage
  • shot history
  • feedback decisions
  • render constraints
  • licensing constraints

A vector search over old chats is not enough. The agent needs a maintained state package: short summaries, deep references, artifacts, checkpoints, and freshness checks.

Anthropic's long-running agent work points in this direction. Long tasks need handoffs. The next agent or next context window has to know what happened, what remains, what failed, what was verified, and what end state is required.

The simple rule: if a human would need a handoff note, the agent needs a checkpoint.
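What a checkpoint might contain, as a sketch. The fields mirror the handoff questions above (what happened, what remains, what failed, what was verified, what end state is required). The class name, the JSON serialization, and the freshness check are illustrative assumptions.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Checkpoint:
    """A handoff note for the next run. All field names are hypothetical."""
    goal: str
    done: list = field(default_factory=list)
    remaining: list = field(default_factory=list)
    failed: list = field(default_factory=list)
    verified: list = field(default_factory=list)
    required_end_state: str = ""
    created_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))

    def is_fresh(self, max_age: timedelta,
                 now: Optional[datetime] = None) -> bool:
        """A stale checkpoint should trigger re-observation, not action."""
        now = now or datetime.now(timezone.utc)
        return now - datetime.fromisoformat(self.created_at) <= max_age
```

The next run would load the checkpoint first, check `is_fresh`, and only then decide whether to continue or observe again.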

Tools Need to Be Agent-Native

Most software tools were built for humans.

An agent does not want a dashboard. It wants a structured interface. It wants stable IDs, typed inputs, predictable outputs, retry-safe operations, dry-runs, and error messages that say how to recover.

Bad tool error:

Invalid request.

Good tool error:

customer_id is missing. Search customers by email first, then retry with the returned customer ID.

That difference is the difference between a stuck task and a self-correcting loop.
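In code, the good version is an error that carries its own recovery hint as structured data. This is a minimal sketch: `ToolError`, `get_customer`, and the in-memory `CUSTOMERS` table are stand-ins invented for the example.

```python
class ToolError(Exception):
    """An error the agent can act on. The field names are illustrative."""

    def __init__(self, code: str, message: str, recovery: str):
        super().__init__(message)
        self.code = code
        self.recovery = recovery

    def to_dict(self) -> dict:
        return {"error": self.code, "message": str(self),
                "recovery": self.recovery}


CUSTOMERS = {"cus_001": {"email": "a@example.com"}}  # stand-in data store


def get_customer(customer_id=None) -> dict:
    if not customer_id:
        raise ToolError(
            "missing_argument",
            "customer_id is missing.",
            "Search customers by email first, then retry with the "
            "returned customer ID.")
    if customer_id not in CUSTOMERS:
        raise ToolError(
            "not_found",
            f"No customer with ID {customer_id!r}.",
            "Search customers by email to find the current ID.")
    return CUSTOMERS[customer_id]


def call_tool(fn, **kwargs) -> dict:
    """Wrap a tool so failures come back as structured, corrective output."""
    try:
        return {"ok": True, "result": fn(**kwargs)}
    except ToolError as err:
        return {"ok": False, **err.to_dict()}
```

The harness can then feed the `recovery` field straight back into context, and count how often the next call succeeds.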

The harness should make tool quality measurable:

  • malformed tool call rate
  • repair success rate
  • retry count
  • tool-call F1
  • required tool coverage
  • forbidden tool avoidance
  • tool-order correctness
  • dry-run usage before risky action
  • independent verification of claimed success
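Two of these metrics are simple enough to sketch. The version of tool-call F1 below is one possible definition, assumed for illustration: it compares the set of expected tool names to the set of tools actually called, ignoring arguments and repeat calls. Tool-order correctness is checked as a subsequence match.

```python
def tool_call_f1(expected: list, actual: list) -> float:
    """F1 over sets of tool names (arguments and duplicates ignored)."""
    exp, act = set(expected), set(actual)
    if not exp and not act:
        return 1.0  # nothing expected, nothing called
    true_positives = len(exp & act)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(act)
    recall = true_positives / len(exp)
    return 2 * precision * recall / (precision + recall)


def in_required_order(required_order: list, actual: list) -> bool:
    """True if the required tools appear in `actual` in the given order,
    possibly with other calls in between (a subsequence check)."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the first match,
    # so each later tool must be found after the previous one.
    return all(tool in remaining for tool in required_order)
```

For example, expecting `search`, `get`, `refund` and observing `search`, `get`, `email` gives precision 2/3 and recall 2/3, so F1 is 2/3.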

This is why agent-native interfaces matter. A GUI is a translation layer for humans. A harness is a translation layer for agents.

Skills Are Not Enough

Skills are powerful. They package judgment. They tell the agent when to use a technique, what references to read, what scripts to run, what constraints matter.

But a skill is still interpreted by an agent at runtime.

That means a skill can be skipped, partially followed, rephrased, over-applied, or reinterpreted differently on the tenth run than on the first. For exploratory work, that flexibility is useful. For production work, it is dangerous.

The missing layer is a method.

A method is a repeatable domain workflow between natural language and code. It should be readable by humans and agents, but executed by software. It should define the domain vocabulary, inputs, steps, models, outputs, validations, and traces.

For example, a Melies method could be:

  • input: script excerpt, visual style, character bible, render budget
  • step: decompose into scenes
  • step: generate shot list
  • step: verify character continuity
  • step: route each shot to the right image or video model
  • step: compose edit plan
  • output: storyboard, prompts, asset manifest, cost estimate, continuity report

A Win.sh method could be:

  • input: company state, latest metrics, active risks, founder goal
  • step: classify the business situation
  • step: choose the next observation
  • step: diagnose using required data sources
  • step: propose reversible actions
  • step: check authority policy
  • step: create approval request or execute safe action
  • output: decision memo, action plan, trace, updated memory, eval candidate

The agent can help author and improve these methods. But once a method is accepted, the harness should run it deterministically.
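"Readable by agents, executed by software" can be as plain as a list of named steps with declared inputs and outputs. The runner below is a sketch under assumptions: `Step`, `run_method`, and the two toy steps are invented here, and the steps are pure Python stand-ins for what would be model or tool calls in a real Melies method.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[dict], dict]  # reads state, returns new keys
    requires: tuple = ()         # keys that must exist before the step
    produces: tuple = ()         # keys the step must return


def run_method(steps: list, inputs: dict):
    """Run steps in the declared order, validating inputs and outputs."""
    state = dict(inputs)
    trace = []
    for step in steps:
        missing = [k for k in step.requires if k not in state]
        if missing:
            raise ValueError(f"step {step.name!r} is missing inputs: {missing}")
        output = step.run(state)
        absent = [k for k in step.produces if k not in output]
        if absent:
            raise ValueError(f"step {step.name!r} did not produce: {absent}")
        state.update(output)
        trace.append({"step": step.name, "outputs": sorted(output)})
    return state, trace


# A toy two-step method: scenes are blank-line-separated blocks of a script.
METHOD = [
    Step("decompose into scenes",
         lambda s: {"scenes": [p.strip() for p in s["script"].split("\n\n")
                               if p.strip()]},
         requires=("script",), produces=("scenes",)),
    Step("generate shot list",
         lambda s: {"shots": [f"shot {i + 1}: {scene}"
                              for i, scene in enumerate(s["scenes"])]},
         requires=("scenes",), produces=("shots",)),
]
```

The agent cannot skip a step or reorder them, because the agent is not the one running the list.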

This is where many non-code agent systems are currently weaker than software agents. Code has languages, compilers, tests, linters, diffs, runtime logs, and review workflows. Business and creative work mostly have documents, slides, dashboards, and vibes. That is why coding agents feel ahead: software already has a machine-checkable substrate.

The way to close the gap is not to turn every founder or filmmaker into a programmer. It is to give business and creative work their own machine-checkable substrate:

  • typed inputs
  • domain concepts
  • explicit steps
  • model routes
  • structured outputs
  • validation rules
  • audit traces
  • editable visual flowcharts
  • reusable method packages

This is the bridge between "prompt the agent" and "write software."

For Win.sh, methods become company operating playbooks. For Melies, methods become production pipelines. For the open-source harness, methods become a public extension format: safe to share, easy to inspect, and specific enough to evaluate.

The distinction matters:

  • a prompt tells the agent what you want
  • a skill teaches the agent how to approach a class of work
  • a method executes a repeatable workflow
  • a tool mutates the world
  • an eval decides whether the behavior improved

A good harness needs all five.

Evals Must Score the Trace

The biggest mistake in agent evals is scoring only the final answer.

For agents, the final answer is often the least important part. The trace is where the real behavior lives.

Did the agent observe before acting? Did it call the right tools? Did it avoid forbidden tools? Did it use the right model route? Did it ask approval before an external action? Did it recover from tool failure? Did it leave a checkpoint? Did it update memory? Did it stay inside budget? Did it leak private data into the trace?

Those are harness questions.

A useful eval ladder looks like this:

  1. Deterministic checks. Schema validity, required tools, forbidden tools, authority events, routing, cost ceilings, checkpoints, redaction.
  2. Artifact checks. Did the report render? Did the video include the required scenes? Did the code build? Did the data table match the expected query?
  3. Rubric/model-assisted checks. Did the recommendation consider alternatives? Was uncertainty calibrated? Was the creative output coherent? Did the business decision use the right evidence?
  4. Human review. Sampled traces, adjudication, failure labels, baseline approval.
  5. Online/shadow evals. Production traces, delayed outcomes, user corrections, A/B comparisons.

This matches the best public eval advice: start with cheap deterministic checks, add model judges only where necessary, calibrate judges against human labels, and close the loop from production failures back into evals.
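The first rung of that ladder needs no model at all. Here is a sketch of deterministic trace checks, assuming a made-up event shape (`type`, `tool`, `authority`, `cost`) and a simple rule that one approval covers one external message.

```python
def check_trace(trace: list, required_tools: list,
                forbidden_tools: list, cost_ceiling: float) -> list:
    """Return a list of failure strings. An empty list means the trace passed.

    Assumed event shape: {"type": "tool_call", "tool": str,
    "authority": str, "cost": float}, plus {"type": "approval_granted"}
    and {"type": "checkpoint"} marker events.
    """
    used = {e["tool"] for e in trace if e["type"] == "tool_call"}
    failures = []
    for tool in sorted(set(required_tools) - used):
        failures.append(f"required tool not called: {tool}")
    for tool in sorted(set(forbidden_tools) & used):
        failures.append(f"forbidden tool called: {tool}")

    approved = False
    for event in trace:
        if event["type"] == "approval_granted":
            approved = True
        elif (event["type"] == "tool_call"
              and event.get("authority") == "external message"):
            if not approved:
                failures.append(
                    f"external message without approval: {event['tool']}")
            approved = False  # one approval covers one external action

    total_cost = sum(e.get("cost", 0.0) for e in trace)
    if total_cost > cost_ceiling:
        failures.append(
            f"cost {total_cost:.2f} exceeds ceiling {cost_ceiling:.2f}")
    if not any(e["type"] == "checkpoint" for e in trace):
        failures.append("no checkpoint recorded")
    return failures
```

Checks like these are cheap enough to run on every trace, which is why they belong at the bottom of the ladder.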

The key sentence from our work: a synthetic trace suite is not a real business eval.

It is still useful. It proves the eval contract. It tells you whether the harness captured the right events and enforced the right policies. But it does not prove the agent can operate a business.

To test that, you need simulated worlds and delayed outcomes.

Review Loops Are Production Infrastructure

One reason coding agents are improving so quickly is that code review already exists. The agent can make a change, another agent can review it, tests can run, a human can sample the diff, and the result can be merged or rejected.

That pattern should move into every serious harness.

The harness should support explicit reviewer roles:

  • planner
  • executor
  • verifier
  • critic
  • security reviewer
  • cost reviewer
  • domain reviewer
  • final approver

This does not mean every task needs eight agents. It means the harness should be able to add review pressure where uncertainty or risk justifies it.

The best shape is often not "many agents in a chat." It is a publication and review system.

One agent investigates and publishes a finding. Another reviews it. A third tries to reproduce it. The harness records accepts, rejects, objections, confidence, and evidence. Over time, the best-supported claim rises to the top.

This matters for hard problems: security research, math, product strategy, company diagnosis, creative direction. In those domains, the correct answer is not always produced by one linear pass. The agent needs to search, publish, review, revise, and converge.

But the transcript also made one uncomfortable point clear: adding more agents can make results worse.

At a fixed budget, two agents may work well, four may work better, and eight may collapse into noise, duplicated work, or shallow review. The harness cannot assume collaboration scales linearly. It has to measure the curve.

So multi-agent orchestration needs evals of its own:

  • result quality by number of agents
  • cost by number of agents
  • duplicate-work rate
  • useful-review rate
  • time to first valid finding
  • false-positive rate
  • convergence rate
  • reviewer disagreement
  • cross-agent contamination
  • best single-agent baseline

For Win.sh, this suggests a CEO agent should not directly become a swarm. It should have a small executive loop: operator, analyst, verifier, and board reviewer. For Melies, it suggests a small studio loop: director, continuity reviewer, production manager, and taste critic.

The harness should increase or decrease collaboration based on measured value, not aesthetic excitement.
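"Measure the curve" can start very small. The sketch below assumes you already have per-run quality and cost scores grouped by team size. The function names and the `min_gain` threshold are hypothetical. It keeps a running best team size and only moves to a larger one when mean quality improves by at least `min_gain`.

```python
from statistics import mean


def summarize(results: dict) -> dict:
    """results maps team size to a list of runs with quality and cost."""
    return {
        size: {"quality": mean(r["quality"] for r in runs),
               "cost": mean(r["cost"] for r in runs)}
        for size, runs in results.items()
    }


def pick_team_size(results: dict, min_gain: float = 0.02) -> int:
    """Prefer the smaller team unless a larger one is measurably better."""
    summary = summarize(results)
    sizes = sorted(summary)
    best = sizes[0]
    for size in sizes[1:]:
        gain = summary[size]["quality"] - summary[best]["quality"]
        if gain >= min_gain:
            best = size
    return best
```

The single-agent row doubles as the baseline: if no larger team clears `min_gain`, the answer stays at one agent.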

The Company Agent Is Solving a POMDP

Entrepreneurship is a partially observable Markov decision process.

That sounds like overkill until you map it.

The hidden state:

  • customer intent
  • churn risk
  • product quality
  • founder constraints
  • competitor pressure
  • market timing
  • pricing sensitivity
  • trust level

The observations:

  • Stripe metrics
  • analytics
  • support inbox
  • customer calls
  • GitHub issues
  • email
  • product events
  • search traffic
  • founder feedback

The actions:

  • observe more
  • ask a clarifying question
  • diagnose
  • recommend
  • execute reversible change
  • request approval
  • roll back
  • update memory
  • create an eval

The reward:

  • revenue
  • retention
  • lower risk
  • validated learning
  • lower operator burden
  • fewer repeated mistakes

This is why final-answer evals are so inadequate. The hard part is not writing a plausible business memo. The hard part is choosing the next observation or action under uncertainty.

Sometimes the correct action is to do nothing. Sometimes it is to ask for more data. Sometimes it is to block the agent from sending the customer email it drafted. Sometimes it is to roll back a feature even though the metric anomaly has three possible causes.

The harness has to evaluate those choices.
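To show what "choosing the next observation or action under uncertainty" means mechanically, here is a toy belief update. It is a heavy simplification, stated plainly: the hidden state is a single static cause (no transition model), the likelihoods are supplied by hand, and the 0.8 action threshold is arbitrary. A real company agent would not have clean probabilities like these.

```python
def update_belief(prior: dict, likelihood: dict, observation: str) -> dict:
    """Bayes update over hidden causes.

    likelihood[cause][observation] is P(observation | cause).
    """
    unnormalized = {
        cause: p * likelihood[cause].get(observation, 0.0)
        for cause, p in prior.items()
    }
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError(
            "observation has zero probability under every hypothesis")
    return {cause: value / total for cause, value in unnormalized.items()}


def next_action(belief: dict, act_threshold: float = 0.8):
    """Act only when one cause is likely enough. Otherwise keep observing."""
    cause, probability = max(belief.items(), key=lambda item: item[1])
    if probability >= act_threshold:
        return ("act", cause)
    return ("observe more", None)
```

An eval built on this shape scores the policy, not the memo: given this belief, was "observe more" the right call?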

For Win.sh, the eval suite should include:

  • ambiguous MRR drop
  • churn spike after product change
  • traffic drop with stable revenue
  • support issue with customer harm
  • pricing experiment needing rollback
  • malformed tool argument recovery
  • context reset recovery
  • missing integration permission
  • budget exhausted during heartbeat
  • risky external action above authority
  • opportunity validation where not building is correct
  • repeated failure becoming a new eval

This is not a normal SaaS dashboard. It is an operating system for business agents.

The Creative Studio Agent Has the Same Problem

Melies looks different from Win.sh, but the harness problem is similar.

A studio agent making videos is also operating under partial observability. It does not know whether a shot will render well until it tries. It does not know whether a character will remain consistent across scenes unless it tracks lineage. It does not know whether the user's taste changed unless feedback is captured and folded back into the project state.

The hidden state:

  • creative intent
  • style preference
  • character identity
  • continuity constraints
  • model quirks
  • asset licensing
  • render cost
  • client taste

The observations:

  • script
  • storyboards
  • generated shots
  • prompt outputs
  • user feedback
  • failed renders
  • reference images
  • asset metadata

The actions:

  • break script into scenes
  • generate shot prompts
  • choose model
  • render
  • compare continuity
  • revise
  • create alternate cut
  • ask for signoff
  • update the project bible

The reward:

  • coherent story
  • visual continuity
  • lower render waste
  • faster iteration
  • fewer subjective surprises
  • publishable output

So the Melies harness should eval:

  • script-to-shot decomposition
  • character continuity
  • location continuity
  • style adherence
  • prompt repair
  • asset provenance
  • cost and render-time ceilings
  • feedback incorporation
  • edit continuity
  • human signoff for subjective choices

Different domain, same principle. The harness converts a model into a working operator.

The Admin Page Matters

One surprisingly important lesson: humans need to see the harness.

If the harness is invisible, no one knows whether the agent is improving. You get anecdotes instead of telemetry. You remember the impressive run and forget the stuck ones. You feel progress without measuring it.

The admin surface should show:

  • installed harnesses
  • contract completeness
  • active workspaces
  • required integrations
  • skills loaded
  • latest eval results
  • pass rate
  • average score
  • total cost
  • cost per success
  • median task time
  • recovery rate
  • stuck-run rate
  • baseline deltas

This is why NanoCorp-style telemetry is the right instinct: cost per task, median task time, recovery rate. The exact numbers matter less than the dashboard shape. A harness release should have before/after evidence.

If cost goes down but recovery also goes down, you did not improve the agent. If pass rate goes up because the eval got easier, you did not improve the agent. If the agent completes more tasks by taking unsafe actions, you did not improve the agent.
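Those three failure modes translate directly into a release gate. A minimal sketch, assuming hypothetical metric names (`pass_rate`, `recovery_rate`, `unsafe_actions`) and a version string on the eval suite:

```python
def release_improved(before: dict, after: dict,
                     eval_version_before: str, eval_version_after: str):
    """Return (ok, reasons). A release counts as an improvement only if
    none of the blocking conditions below apply."""
    reasons = []
    if eval_version_before != eval_version_after:
        # A higher pass rate on a different (possibly easier) suite
        # proves nothing.
        reasons.append("eval suite changed; pass rates are not comparable")
    if after["pass_rate"] < before["pass_rate"]:
        reasons.append("pass rate dropped")
    if after["recovery_rate"] < before["recovery_rate"]:
        reasons.append("recovery rate dropped")
    if after["unsafe_actions"] > before["unsafe_actions"]:
        reasons.append("unsafe actions increased")
    return (not reasons, reasons)
```

A release that cuts cost per success but lowers recovery comes back as `(False, ["recovery rate dropped"])`, which is exactly the case the dashboard exists to catch.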

The harness dashboard is where taste meets accountability.

Self-Improvement Is a Loop, Not a Vibe

Everyone says "self-improving agents." Most mean "the agent writes a note after failing."

That is not enough.

A real self-improvement loop looks like this:

  1. Capture the trace.
  2. Capture the outcome.
  3. Capture human correction.
  4. Label the failure.
  5. Cluster repeated failures.
  6. Promote the cluster into an eval.
  7. Create a scoped improvement task.
  8. Change the prompt, tool, router, verifier, adapter, or docs.
  9. Run targeted evals.
  10. Run regression evals.
  11. Accept or reject the change.
  12. Update the baseline only after review.
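Steps 5, 6, and 11 are the ones most often skipped, so here is a sketch of just those. It assumes failures already carry a human-assigned label and clusters by exact label match, which is cruder than real clustering. The function names and the `min_count` threshold are hypothetical.

```python
from collections import Counter


def promote_clusters(labeled_failures: list, min_count: int = 3) -> list:
    """Turn repeated failure labels into eval candidates.

    labeled_failures: list of {"trace_id": str, "label": str}.
    """
    counts = Counter(f["label"] for f in labeled_failures)
    return [
        {"eval_name": f"regression::{label}",
         "example_traces": [f["trace_id"] for f in labeled_failures
                            if f["label"] == label]}
        for label, count in counts.most_common()
        if count >= min_count
    ]


def accept_change(target_before: float, target_after: float,
                  regression_before: float, regression_after: float,
                  human_approved: bool) -> bool:
    """Accept only if the target eval improved, the regression suite did
    not get worse, and a human reviewed the change."""
    return (target_after > target_before
            and regression_after >= regression_before
            and human_approved)
```

A one-off failure stays a note. A failure that repeats three times becomes a test the next version has to pass.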

is valuable because it shows this pattern in the wild: evidence becomes findings, findings become evals, evals become engineering tasks, tasks get verified.

That is the loop we need for agents that run companies or studios.

Not "the agent changed itself." More like: the harness found a repeated failure, proposed a change, proved the change improved the target eval without breaking the baseline, and a human accepted the new version.

Self-improvement without evals is just mutation.

The Product Boundary Moves

This is the strategic point.

In the GUI era, the product was the interface. You built screens for humans to click. The moat was usability, workflow, and distribution.

In the agent era, the product boundary moves down.

The agent does not need your dashboard. It needs your structured state, your tool surface, your auth model, your pricing, your error recovery, your evals, your trust guarantees, your default position inside other agents' skills.

This changes what software companies should build.

A normal SaaS product asks:

What should the user see?

An agent-native product asks:

What should the agent be allowed to know, decide, and do?

That is a harness question.

The companies that understand this will stop treating agents as chatbots on top of existing apps. They will build operating contracts, typed tools, machine-native auth, model routing, trace stores, eval suites, and improvement loops.

The companies that do not will keep adding AI sidebars to dashboards that agents do not want to use.

The Playbook

If I had to compress everything we learned into a build order, it would be this.

1. Define the operating contract. Identity, channels, capabilities, authority, routing, memory, output, budget, UI affordances.

2. Make context navigable. Stable prompt sections, deterministic tool ordering, config events, compaction policy, artifact budgets, project instructions as maps.

3. Make tools agent-native. Typed inputs, structured outputs, stable IDs, dry-run support, corrective error messages, independent verification.

4. Add a method layer. Repeatable domain workflows with typed inputs, explicit steps, model routes, structured outputs, validations, and audit traces.

5. Add trace capture before adding ambition. Model calls, tool calls, authority events, artifacts, checkpoints, failures, costs, duration, route IDs.

6. Build deterministic evals first. Required tools, forbidden tools, authority, cost ceiling, model route, checkpoint, redaction.

7. Add simulated worlds. A company simulator for Win.sh. A creative production simulator for Melies. The agent needs to act under uncertainty, not replay canned traces forever.

8. Add reviewer loops. Planner, executor, verifier, critic, and final approver roles. Measure whether extra agents improve quality per dollar.

9. Add rubric judges carefully. Calibrate against human labels. Score one dimension at a time. Track judge drift.

10. Build the admin view. Harnesses, contracts, methods, evals, costs, routes, recovery, context health, baseline deltas.

11. Close the self-improvement loop. Production trace → failure cluster → eval → scoped fix → regression suite → reviewed baseline.

12. Keep the core independent. The open-source harness should contain contracts, runners, metrics, logging, export. Product-specific prompts, traces, tools, and outcomes stay private.

13. Register the harness. Hugging Face now explicitly asks builders to

. This is a small signal, but an important one: harnesses are becoming a recognized category.

The Hard Part

The hard part is not making an agent impressive once.

The hard part is making it reliable enough that you can give it more authority.

That requires a different engineering culture. Less demo. More telemetry. Less prompt worship. More contracts. Less "the model is smart." More "the run passed because it observed, routed, acted, verified, checkpointed, and stayed inside authority."

The agentic economy will not be built by models alone.

It will be built by harnesses that turn models into accountable operators.

For Win.sh, that means a harness that can operate inside the POMDP of entrepreneurship: uncertain state, costly observations, risky actions, delayed rewards.

For Melies, it means a harness that can operate inside the POMDP of creative production: unstable outputs, continuity constraints, subjective taste, expensive iteration.

For every company building agents, it means the same thing:

Your prompt is not enough.

Your model choice is not enough.

Your tool list is not enough.

The harness is where the agent becomes real.

And eventually, the harness is the product.

Romain Simon

I'm just the human in the loop.

© 2026 Yuki Capital