Visual Intro

Agent=Model+HarnessThe model writes code. The harness decides everything else.

This visual explains the central course claim: the model is not the whole agent. A coding agent becomes useful when the harness gives the model a loop, tools, memory, safety boundaries, verification, routing, and a trace you can inspect.

Use this page before the Harness Tour. The tour gives you a real OpenHands task. This page gives you the mental model for what you are watching.

What Problem Are You Solving?

The common mistake is to treat agent quality as a model-only question. Better models matter, but the same model can behave very differently depending on the harness around it.

For a coding agent, the harness answers practical questions:

What context does the model see?
Which tools can it call?
Where does code execution happen?
What state persists across turns or sessions?
What risky actions need approval?
How is completion verified?
When should the system use a different model?
What trace can a human inspect when something fails?

The visual below scrolls through those questions one failure mode at a time.

Agent = Model + Harness · scroll the harness into view

One task. Eight versions of the agent around the same model.

Each scroll step adds one harness component because a real failure mode appears. Watch the console: the schematic grows, the telemetry tape types new lines, and the gauges fill. The point is controlled design based on trace evidence, not more machinery.

HARNESS CONSOLE

TASK · why is test_users.py failing on main?

01/08

SCHEMATICBare Model

no loop · no tools · no trace

TELEMETRY · live trace

00:00userwhy is test_users.py failing on main?
00:01modellikely a fixture or import issue (unverified)
00:02modelanswer returned (no observations gathered)

EVIDENCE5

DURABILITY0

CONTAINMENT0

VERIFICATION0

01Bare Model

Prompt in, text out. No way to inspect reality.

Failure mode: A single model call can explain ideas, but it cannot read files, run commands, or check tests. For real code work it is guessing.
Harness move: Start with the smallest useful system: a task, a model, an answer. Notice what it cannot do before adding anything.
Keep: If the answer must be proven, a bare model is not enough.

02ReAct Loop

Reason, act, observe, repeat. The trace becomes the unit of diagnosis.

Failure mode: Without a loop, the model commits to its first guess. Coding work needs to take a step, look, and try again.
Harness move: Wrap the model: pick an action, run it, append the observation, decide whether to continue or stop.
Keep: The trace, not the single response, becomes the artifact you read.

03Tools

Tools turn guesses into checked observations.

Failure mode: A loop without tools is a hamster wheel. The model needs shell, files, search, and tests so each step can touch reality.
Harness move: Expose a small tool surface first. Add retrieval or MCP only when the trace shows the model is starved for evidence.
Keep: Tool design is part of the prompt surface. Fewer, legible, tested tools beat a pile of plugins.

04Memory

Durable context stops the agent from re-discovering the same facts.

Failure mode: Every turn starts cold. Repo conventions, prior decisions, and trace summaries get re-derived from scratch and burn tokens.
Harness move: Curate stable knowledge: AGENTS.md, progress notes, trace summaries, condenser policy. Make memory inspectable.
Keep: Memory is only useful when it is curated. Dump-everything memory is noise.

05Safety

Boundaries before autonomy. The harness decides what runs automatically.

Failure mode: The same tools that make the agent useful can delete files, leak secrets, or burn budget. Safety cannot live only in a prompt.
Harness move: Workspace isolation, permission gates, command policy, budget limits, human approval for risky actions.
Keep: Blast radius is a design choice, not a vibe.

06Critic

Verification keeps 'looks done' from becoming the finish line.

Failure mode: Agents stop too early or grade their own work generously. Confidence is not evidence.
Harness move: Tests, rubrics, external review, trace checks, and result tables decide whether the change is kept.
Keep: The course loop: predict, run, inspect, measure, keep.

07Routing

The harness picks the model. It does not call one model forever.

Failure mode: Easy tasks should not pay for the strongest model. Hard tasks need a recovery path when the cheap path stalls.
Harness move: Route by task shape. Escalate on evidence. Keep a benchmark table that compares cost per solved task.
Keep: Model choice becomes a policy you can defend with traces, not a habit.

08Benchmark

A leaderboard row is an agent system, not a model score.

Failure mode: It is tempting to read a benchmark as a model ranking. Submissions are agent plus model: loop, tools, sandbox, verifier, stopping rule.
Harness move: Use Terminal-Bench as evidence that the full harness matters. Label every score with the agent that produced it.
Keep: Say 'agent + model'. Do not overclaim that the model alone earned the score.

How To Read The Visual

Start at the top and watch what changes in the diagram. The early steps are intentionally small. A bare model call is not wrong. It is just insufficient once the answer needs repo evidence, command output, tests, or recovery from mistakes.

The key move is the ReAct loop: the model reasons, asks to act, receives an observation, and updates the next step. Once that loop exists, the rest of the harness becomes visible. Tools determine what actions are possible. Memory determines what the agent does not need to rediscover. Safety decides what can run automatically. Critics and metrics decide whether the result is good enough to keep.

That is why the course asks you to predict, run, inspect, measure, and keep. You are not collecting traces for decoration. You are using traces to decide which harness component is load-bearing.

Where Terminal-Bench Fits

Terminal-Bench is useful evidence because it evaluates agents in real terminal environments. Its own README describes the benchmark as a task dataset plus an execution harness that connects a language model to a sandboxed terminal environment.

That makes it a good teaching example, with one caveat: public leaderboard rows are agent plus model submissions. A score is not only a property of the model. It also reflects the loop, tool interface, sandbox, retry behavior, stopping rule, and verifier.

Use Terminal-Bench as a reminder to ask better questions:

What agent was wrapped around the model?
What tools and terminal affordances did it have?
How did it decide when to stop?
What verifier judged success?
What would change if the harness changed but the model stayed fixed?

What Students Should Leave With

After this page, you should be able to look at a trace and name the harness component that shaped it. If a run fails, do not only ask for a bigger model. Ask which part of the harness failed to provide evidence, context, boundaries, or feedback.

Then move to P01: Agent Trace and practice reading the loop on a real OpenHands run.

References

Harness Engineering blog post — the longer argument behind the course
Harness engineering experiments repo — runnable measurements for retrieval, memory, loops, tools, and architecture
Google Research: ReAct
Anthropic: Building effective agents
OpenAI: Harness engineering
OpenHands Agent Server architecture
Terminal-Bench leaderboard

Visual Intro ​

What Problem Are You Solving? ​