
I watched Tejas Kumar's talk "Harnesses in AI: A Deep Dive" expecting another agents-are-the-future pitch. What I got was the cleanest articulation I've heard of something I'd been circling for months: the model is the easy part. The hard part, where the real engineering lives, is the AI agent harness around it: the scaffolding that turns a model call into something that can actually act. The line that stuck with me was his prediction that 2026 is going to be the Year of Harnesses, and I think he's right.
So instead of just nodding along, I opened an editor and followed his build — porting the poor man's harness he walks through into the stack I actually reach for, TypeScript and the Claude API. The design here is Tejas's, not mine; the point was to type it out by hand and feel where each piece earns its keep. The harness he builds drives a browser, so that's what I followed along with: point a real Chromium at Hacker News and pull the top few AI stories. Mundane is the point. The moment an agent touches the real world — a browser, a login, a page that shifts under it — the model stops being the hard part and the harness becomes everything.
Strip the buzzword and an AI agent harness is the scaffolding that turns a model call into an agent that can act. Tejas breaks it into pieces that map almost one-to-one onto what bites you in production: the agent loop (call, act, repeat), a tool registry (the catalog of actions the agent may take, plus the code behind each one), context management (what you feed back in on each turn), and guardrails (the limits that stop it running forever, lying, or doing something irreversible). The loop is the skeleton everyone shows off. The other three are why your agent works in a demo and falls over on a Tuesday.
Start with the registry, because Tejas is right that it's the real core — not the loop. A tool registry is two things kept in sync: the code that performs each action, and the description the model reads to decide when to call it. For a browser agent the actions are things like navigate, read the page, and click:
import Anthropic from "@anthropic-ai/sdk";
import { chromium, type Page } from "playwright";

const client = new Anthropic();

// The registry: every action the agent is allowed to take, as real code.
const registry: Record<string, (page: Page, input: any) => Promise<string>> = {
  navigate: async (page, { url }) => {
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return `Navigated to ${url}`;
  },
  read_page: async (page) => {
    // Don't dump the raw DOM; compress to visible text first (see guardrails).
    return compress(await page.innerText("body"));
  },
  click: async (page, { text }) => {
    await page.getByText(text, { exact: false }).first().click();
    return `Clicked "${text}"`;
  },
};

// The same registry, described for the model.
const tools: Anthropic.Tool[] = [
  {
    name: "navigate",
    description: "Open a URL in the browser.",
    input_schema: { type: "object", properties: { url: { type: "string" } }, required: ["url"] },
  },
  {
    name: "read_page",
    description: "Read the visible text of the current page.",
    input_schema: { type: "object", properties: {} },
  },
  {
    name: "click",
    description: "Click the first element matching the given text.",
    input_schema: { type: "object", properties: { text: { type: "string" } }, required: ["text"] },
  },
];
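Because the registry is really two lists that have to agree, a cheap startup check can catch drift before the model trips over it. This check is my own addition, not something from the talk. It's a minimal sketch: `findRegistryDrift` and `ToolSpec` are hypothetical names, and the sample entries are stand-ins rather than the real Playwright-backed tools.

```typescript
// Hypothetical helper (not from the talk): report any mismatch between the
// executable registry and the tool descriptions sent to the model.
type ToolSpec = { name: string };

function findRegistryDrift(
  registry: Record<string, unknown>,
  tools: ToolSpec[],
): { undescribed: string[]; unimplemented: string[] } {
  const implemented = Object.keys(registry);
  const described = tools.map((t) => t.name);
  return {
    // Code exists, but the model is never told about it.
    undescribed: implemented.filter((n) => described.indexOf(n) === -1).sort(),
    // The model can ask for it, but nothing would run.
    unimplemented: described.filter((n) => implemented.indexOf(n) === -1).sort(),
  };
}

// Stand-in entries: "click" has code but no description,
// and "scroll" has a description but no code.
const drift = findRegistryDrift(
  { navigate: () => "", read_page: () => "", click: () => "" },
  [{ name: "navigate" }, { name: "read_page" }, { name: "scroll" }],
);
console.log(drift);
```

I'd run something like this once at startup and refuse to start if either list is non-empty.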
Notice read_page doesn't return the raw DOM — it returns compressed text, which I'll come back to. With the registry defined, the loop is almost boring: ask the model, run whatever it asked for through the registry, feed the results back, repeat. The only thing it adds over a textbook loop is the line that matters most:
async function run(goal: string): Promise<string> {
  const browser = await chromium.launch();
  try {
    const page = await browser.newPage();
    const messages: Anthropic.MessageParam[] = [{ role: "user", content: goal }];
    const MAX_STEPS = 12; // the single most important guardrail

    for (let step = 0; step < MAX_STEPS; step++) {
      const res = await client.messages.create({
        model: "claude-opus-4-8",
        max_tokens: 16000,
        thinking: { type: "adaptive" },
        tools,
        messages,
      });

      // Anything other than a tool request ends the run. Checking only for
      // "end_turn" would let a "max_tokens" stop fall through and send an
      // empty tool-result message on the next turn.
      if (res.stop_reason !== "tool_use") {
        return res.content.find((b) => b.type === "text")?.text ?? "";
      }

      messages.push({ role: "assistant", content: res.content });
      const results: Anthropic.ToolResultBlockParam[] = [];
      for (const block of res.content) {
        if (block.type === "tool_use") {
          const fn = registry[block.name]; // dispatch through the registry
          try {
            const out = fn ? await fn(page, block.input) : `Unknown tool: ${block.name}`;
            results.push({ type: "tool_result", tool_use_id: block.id, content: out });
          } catch (err) {
            // A failed click or navigation is reported to the model, not fatal.
            const msg = err instanceof Error ? err.message : String(err);
            results.push({ type: "tool_result", tool_use_id: block.id, content: msg, is_error: true });
          }
        }
      }
      messages.push({ role: "user", content: results });
    }
    throw new Error(`Hit MAX_STEPS (${MAX_STEPS}) without finishing; probably stuck in a loop.`);
  } finally {
    await browser.close(); // don't leak a Chromium process on any exit path
  }
}

That MAX_STEPS cap is the highest-leverage line in the whole harness. Without it, a confused agent doesn't fail; it loops, clicking the same broken link forever and quietly draining your token budget. A hard step ceiling turns an infinite, expensive failure into a finite, cheap one you can see in the logs. It's the first guardrail I add to anything, before I even make it work.
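To see the cap doing its job in isolation, here's a minimal sketch with the model stubbed out. `runCapped`, `Turn`, and the `stuck` agent are hypothetical names of mine, not part of the harness from the talk; the stub just stands in for an agent that never decides it's finished.

```typescript
// Stand-in for one model turn: returns a final answer, or null to keep going.
type Turn = (step: number) => Promise<string | null>;

async function runCapped(turn: Turn, maxSteps: number): Promise<string> {
  for (let step = 0; step < maxSteps; step++) {
    const answer = await turn(step);
    if (answer !== null) return answer;
  }
  throw new Error(`Hit the step cap (${maxSteps}) without finishing.`);
}

// A "confused agent" that never finishes. Without the cap this would spin forever.
let calls = 0;
const stuck: Turn = async () => {
  calls++;
  return null;
};

const outcome = runCapped(stuck, 12).then(
  () => "finished",
  (err: Error) => `stopped after ${calls} calls: ${err.message}`,
);
outcome.then((msg) => console.log(msg));
```

The failure is loud, bounded, and costs exactly twelve turns, which is the whole argument for the cap.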
The loop runs now. That's where most tutorials stop, and it's where the real work starts. Two more guardrails turn this from a toy into something I'd leave running unattended.
The first is context compression. Hacker News is tiny; a real app page can run to a hundred thousand tokens of navigation, scripts, and tracking junk. Feed that raw to the model on every turn and you'll blow the context window and the budget in a handful of steps. So the harness compresses before it feeds: strip to visible text, collapse whitespace, cap the length. The model sees signal, not the DOM:
// A page's raw HTML can be 100k+ tokens of noise. Feed the model the
// visible text, capped, not the markup. Note that `max` counts characters,
// not tokens.
function compress(text: string, max = 4000): string {
  const clean = text.replace(/\s+/g, " ").trim();
  return clean.length > max ? clean.slice(0, max) + "\n…[truncated]" : clean;
}

The second guardrail surprised me the first time, and then I saw the problem constantly. Ask an agent to fetch the top three stories and it'll cheerfully reply "Done, here they are", sometimes with stories it never actually read, sometimes with ones it invented. It isn't malicious; it pattern-matches "task complete" and writes the closing line. Tejas's fix is the one that stuck with me most: don't trust the agent's own claim of success. Add a verify step that checks the claim against ground truth (the actual page) with a separate model call:
// The model will happily say "Done!" without actually reading the page.
// So don't trust the final text; verify it against ground truth.
async function verify(page: Page, claim: string): Promise<boolean> {
  const pageText = compress(await page.innerText("body"));
  const res = await client.messages.create({
    model: "claude-opus-4-8",
    max_tokens: 1024,
    messages: [{
      role: "user",
      content:
        `An agent claims it completed this task:\n${claim}\n\n` +
        `Here is the actual page text:\n${pageText}\n\n` +
        `Reply ONLY "PASS" if the claim is fully supported by the page, ` +
        `or "FAIL: <reason>".`,
    }],
  });
  const text = res.content.find((b) => b.type === "text")?.text ?? "";
  // Tolerate leading whitespace or lowercase in the verdict.
  return text.trim().toUpperCase().startsWith("PASS");
}

If verify fails, you feed the failure back into the loop ("your last answer wasn't supported by the page, try again") instead of returning it to the user. That single step is the difference between an agent that's confidently wrong and one you can actually rely on. And it's pure harness: the model didn't get more honest, the scaffolding around it got more skeptical.
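Here's roughly how I'd wire that retry, as a sketch rather than the way the talk does it. `runVerified`, `Attempt`, and `Check` are hypothetical names of mine: `Attempt` stands in for the agent loop and `Check` for a verify call. I'm also assuming the feedback is passed into a fresh attempt, where a fuller version would append it to the running message history. The agent and verifier below are stubs with a made-up headline, so the example runs without a browser or an API key.

```typescript
// Hypothetical types: `Attempt` stands in for the agent loop, `Check` for verify.
type Attempt = (feedback?: string) => Promise<string>;
type Check = (claim: string) => Promise<boolean>;

async function runVerified(attempt: Attempt, check: Check, maxAttempts = 3): Promise<string> {
  let feedback: string | undefined;
  for (let i = 0; i < maxAttempts; i++) {
    const claim = await attempt(feedback);
    if (await check(claim)) return claim; // only verified answers reach the user
    feedback = "Your last answer wasn't supported by the page. Re-read it and try again.";
  }
  throw new Error(`No verified answer after ${maxAttempts} attempts.`);
}

// Stubbed example: the first attempt invents a story, the retry reads the page.
const pageText = "1. Show HN: A tiny agent harness";
let attempts = 0;
const fakeAgent: Attempt = async (feedback) => {
  attempts++;
  return feedback ? "Show HN: A tiny agent harness" : "An invented headline";
};
const fakeVerify: Check = async (claim) => pageText.indexOf(claim) !== -1;

const verified = runVerified(fakeAgent, fakeVerify);
verified.then((answer) => console.log(`${answer} (attempts: ${attempts})`));
```

The retry count is capped for the same reason the step count is: a verifier that never passes shouldn't turn into an unbounded bill. In the real harness the check also has to run while the browser page is still open.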
The sharpest idea in the talk — and one I'd already learned the hard way — is this: do not let the model handle authentication. Logging in is everything a language model is bad at: exact selectors, hidden 2FA, CAPTCHAs, redirects — and worst of all, when it fails it'll often report that it succeeded. So you pull auth out of the model's hands entirely and inject a deterministic, programmatic login handler into the loop. When the agent hits a login wall, the harness logs in with plain code, then hands a ready, authenticated page back:
// Never let the model "log in." It hallucinates success, fumbles 2FA, and
// guesses at selectors. The harness logs in deterministically, then hands
// the loop a page that's already authenticated.
async function ensureLoggedIn(page: Page) {
  const needsLogin = page.url().includes("/login")
    || (await page.getByRole("button", { name: /sign in/i }).count()) > 0;
  if (!needsLogin) return;

  // Fail loudly on missing credentials instead of typing "undefined" into the form.
  const user = process.env.HN_USER;
  const pass = process.env.HN_PASS;
  if (!user || !pass) throw new Error("HN_USER and HN_PASS must be set");

  // Placeholder selectors: match these to the target site's actual login form.
  await page.fill("#username", user);
  await page.fill("#password", pass);
  await page.click("button[type=submit]");
  await page.waitForLoadState("networkidle");
}

The model never sees a credential and never makes a decision about auth. That's the whole philosophy in miniature: anything that has to be correct and repeatable, you do in code; you reserve the model for the genuinely fuzzy judgment in between. Anchor the agent to a deterministic environment, and let it improvise only where improvisation is actually the job.
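One way to inject that handler into the loop is to wrap the tools that can land on a login wall, so the check runs in plain code right after they do. This is my sketch, not necessarily how the talk wires it: `withAuth`, `Tool`, and the fake page below are hypothetical, and the fake page only exists so the example runs without a browser.

```typescript
type Tool<P> = (page: P, input: unknown) => Promise<string>;

// Wrap a tool so the harness's deterministic login check runs right after it.
function withAuth<P>(tool: Tool<P>, ensureLoggedIn: (page: P) => Promise<void>): Tool<P> {
  return async (page, input) => {
    const out = await tool(page, input);
    await ensureLoggedIn(page); // deterministic code, no model call
    return out;
  };
}

// Stubbed example with a fake page object.
type FakePage = { url: string; loggedIn: boolean };

const fakeNavigate: Tool<FakePage> = async (page, input) => {
  const { url } = input as { url: string };
  // Simulate a login wall: unauthenticated visits get redirected.
  page.url = page.loggedIn ? url : "/login";
  return `Navigated to ${url}`;
};

const fakeEnsureLoggedIn = async (page: FakePage): Promise<void> => {
  if (page.url.indexOf("/login") === -1) return;
  page.loggedIn = true; // stands in for filling the form from env credentials
  page.url = "/home";
};

const fakePage: FakePage = { url: "about:blank", loggedIn: false };
const guardedNavigate = withAuth(fakeNavigate, fakeEnsureLoggedIn);
const navResult = guardedNavigate(fakePage, { url: "/submit" });
navResult.then(() => console.log(fakePage));
```

In the registry that would mean wrapping navigate and click, something like `navigate: withAuth(navigateImpl, ensureLoggedIn)`, where `navigateImpl` is whatever function currently sits under that key. The tool descriptions the model reads don't change at all.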
It's tempting to read all this as hobbyist plumbing, but it's exactly why serious tools lean on harnesses. The harness is the control plane — the layer that decides what the model is allowed to see and do — and it's where every property an enterprise cares about gets enforced. Data security lives there: the model only ever sees what the harness hands it, so a deterministic auth handler and a compressed, scrubbed context are also your data-exposure boundary — and the containment layer for when an agent goes wrong, which I've watched happen firsthand. Reliability lives there: the verify step and the step cap are what let you put an agent in front of real users. And cost lives there: compression plus a hard ceiling are how you do more with less instead of watching a runaway loop drain a budget. The model is interchangeable. The harness is the product.
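The "scrubbed" part is the one piece I haven't shown, so here's the simplest version I can think of. It's an assumption of mine, not something from the talk: a hypothetical `scrub` helper that replaces known secret values (say, the credentials the harness loaded from the environment) before any tool output is handed to the model. It only does exact matches, so it won't catch secrets the harness doesn't already know about.

```typescript
// Hypothetical helper: replace known secret values before text reaches the model.
function scrub(text: string, secrets: Array<string | undefined>): string {
  let out = text;
  for (const secret of secrets) {
    if (!secret) continue; // skip unset or empty values
    out = out.split(secret).join("[redacted]");
  }
  return out;
}

// Made-up tool output and a made-up secret, purely for illustration.
const toolOutput = "Signed in as demo-user with password hunter2";
console.log(scrub(toolOutput, ["hunter2", undefined]));
// prints "Signed in as demo-user with password [redacted]"
```

The natural place to call it is wherever a tool result gets pushed into the message history, since that's the single chokepoint between the page and the model.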
Which brings me back to Tejas's prediction — the part of the talk I keep chewing on. His read: 2026 is the year harnesses become the actual unit of AI engineering — the thing teams build, share, and compete on — rather than an afterthought bolted around a model. I buy it, because it matches everything I've already shipped: my production MCP server is, underneath, a harness; my LangChain + Mastra PR-reviewer agent is mostly harness; and the time I tore a multi-agent setup back down to a single agent was a harness lesson, not a model one. The further-out call is the spicy one — 2027 as the year of self-aware, dynamic harnesses that rewrite themselves on the fly, generating new tools and adjusting their own guardrails mid-task. I'm not sure I believe the timeline, but the direction is obviously right. Today I hand-write the registry and the step cap; the moment a harness can safely extend its own registry, that's a genuinely different kind of system.
You don't need a framework to learn this — you need to follow a poor man's harness build once, by hand, and watch where it breaks. It won't break at the model. It'll break when the page is too big, when the agent claims a win it didn't earn, when it tries to log in and faceplants. Every one of those is a harness problem with a deterministic fix. The model under your agent is borrowed — you might even lose it on a random Friday. The harness is the part you build, the part you own, and — if Tejas is right — the part your whole job is about to be. Watch his talk for the full build and the framing — it's his design, and I'd point you there first. Any mistakes in how I ported it are mine.