Your AI Agent Fails 97.5% of Real Work. The Fix Isn’t Coding.

The agents are getting better. The people deploying them are not. AI agents can write code, generate designs, and close tickets with increasing proficiency, but they still hit a “memory wall” that makes them fundamentally incapable of doing real jobs. Recent research shows frontier agents complete only 2.5% of real freelance projects at acceptable quality, 75% of models break working software during long-term maintenance, and companies are discovering too late that the institutional context their senior employees carry is load-bearing infrastructure, not an invisible byproduct of work.

The fix is not better models or bigger context windows. It is human judgment, encoded into evaluations that run before, during, and after an agent acts. The companies that win the next few years will be the ones that treat eval design as a core competency for their most experienced people, not a chore delegated to juniors. The human role in an agentic world is what Nate Jones calls “contextual stewardship”: maintaining the mental model of your system, representing what you know in ways machines can use, and exercising judgment about when technically correct output is organizationally wrong.

The Memory Wall

AI agents are getting really good at doing their work. The capability trajectory is real and accelerating. But there is a memory wall. Agents still have short-term memories, especially measured against the arc of a real job. Tenure in tech software jobs averages somewhere between 18 months and two years. AI agent runs are measured in weeks at best, and most last an hour or two. The people who really hold institutional context and keep a business going often stay four to eight years or longer. That gap is one of the hardest problems in tech.

The combination of AI skills with AI context, tools, resources, workloads, and prompts is still brittle. It is still difficult to predict what changing one thing will do. This matters because AI tools are getting more powerful very quickly even as they remain brittle. Net result: deployed improperly, they are becoming more destructive, not less.

A mediocre tool that fails obviously is just annoying. A power tool that fails silently is very dangerous. And that is the world we are headed toward. The best tools we have for managing that danger are human brains crafting evaluations. Not better prompts, not bigger context windows per se, but human judgment about what matters, what’s fragile, and what the AI doesn’t know it doesn’t know.

When an Agent Demolished a Production Database

Two weeks ago, an AI coding agent wiped out a production database. 1.9 million rows of student data gone in seconds. The backups disappeared too. The agent never made a technical error. Every action was logically correct. It simply had no idea it was demolishing a live system because the knowledge that distinguished real infrastructure from temporary copies existed only in the engineer’s head.

Alexey Grigorev runs the DataTalks.Club course platform, a system that manages homework submissions, projects, and leaderboard entries across multiple courses spanning two and a half years. He was migrating a separate website to the cloud and decided to reuse his existing infrastructure setup to save a few bucks a month.

His AI coding agent was running the deployment. The first warning sign was that the agent started creating a long list of cloud resources that should not have existed. Alexey had recently moved to a new computer and hadn’t transferred his infrastructure configuration. The agent looked at the cloud, saw nothing it recognized, and assumed it was building from scratch. Pretty logical.

Alexey stopped the process, but some duplicate resources had already been created. Next step: he asked the agent to identify the duplicate resources and remove them. A very reasonable ask. But the agent decided on its own that instead of removing resources one at a time, it would be “cleaner and simpler” to demolish everything it had created in one shot. Also reasonable in isolation.

What Alexey didn’t realize was that the agent had quietly unpacked an archived configuration file from his old computer. Inside that archive were the definitions of his real production infrastructure. So when the agent ran a demolition command, it wasn’t clearing out temporary duplicates. It was destroying the production database, the networking layer, the application cluster, the load balancers, the host — everything.

The story does have a good ending. It took 24 hours, an emergency upgrade of his Amazon support plan, and a significant amount of luck to recover the data. Alexey immediately stripped the agent of all execution permissions and now reviews every infrastructure change personally.

The agent was competent. The agent was confident. And Alexey made a lot of reasonable asks — this is not a case of bad engineering decisions. He made asks that a lot of engineers would have made in this situation and just got unlucky. That is the point. The agent was wrong about which world it was operating in — production or not production — and it did not have the self-awareness to ask. The only thing that could have prevented this disaster was a human who understood the organizational context or an evaluation that encoded that context into a guardrail before the agent ever got as far as running that command. Neither existed, and Alexey had a really rough 24 hours as a result.

This, by the way, is the reason ElevenLabs is pushing AI insurance. Their agents are insured. We are going to see a lot more of that.

Three Studies That Reveal the Pattern

Alexey’s story is vivid, but it is not a cherry-picked anecdote. New studies confirm a persistent memory wall that undercuts agents’ ability to manage long-running tasks without an extraordinary amount of human judgment.

1. The Remote Labor Index (Scale AI and Center for AI Safety)

Scale AI and the Center for AI Safety tested frontier AI agents on 240 real freelance projects from Upwork — video production, architecture, 3D modeling, game development, data analysis. These were end-to-end projects.

Average project cost: $630
Average human completion time: 29 hours
Best agent success rate: 2.5% of projects completed at a quality a paying client would accept
Failure rate: 97.5% on real work

Here is the confusing part. A different benchmark, GDPval, built by OpenAI, shows the exact same class of models approaching expert-level quality and completing tasks a hundred times faster than humans. Both numbers are real. Both studies are real. The difference is that GDPval gives the model all the context it needs on purpose: the brief, the deliverable format, and what good looks like. The Remote Labor Index gives the model a client brief and some files and says “figure it out.”

The gap between these two benchmarks is the gap between “can AI do this task” and “can AI do a job.” Tasks come with context provided; AI is pretty good at them now. Jobs require you to bring your own. AI is not good at that yet.

If AI agents cannot figure out an Upwork task with any degree of reliability, there is no way we are rationally putting them in charge of entire jobs. There will be entire classes of human jobs invested in building infrastructure to support long-running agents — that is happening now, and tremendous speedups are possible. But that only happens when humans with really good brains spend a lot of time very thoughtfully figuring out how to do it. It does not happen magically when the CEO finds a LinkedIn post.

2. SWE-CI: The Software Maintenance Benchmark (Alibaba)

A team out of Alibaba built the first benchmark that measures what happens when AI maintains software over time instead of writing it fresh.

100 real codebases, each spanning an average of 233 days and 71 consecutive updates of actual development history
The agent has to evolve the codebase forward: adding features, fixing bugs, adapting to new requirements — the way real software gets built over months and years
75% of models tested broke previously working features during maintenance
Three out of four frontier models asked to maintain code over time actively make things worse
The benchmark punishes agents whose early decisions compound into technical debt later, and almost all of them do
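
To make “broke previously working features” concrete, here is a minimal sketch of a regression check. It is not the benchmark’s actual scoring method; the function name `find_regressions` and the test names are hypothetical, and it assumes test results are available as a mapping from test name to pass/fail.

```python
def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Return tests that passed before an update but fail (or are missing) after it."""
    return sorted(
        name
        for name, passed in before.items()
        if passed and not after.get(name, False)
    )


# Hypothetical test results keyed by test name (True means the test passed).
before_update = {"test_login": True, "test_export_csv": True, "test_search": False}
after_update = {
    "test_login": True,
    "test_export_csv": False,  # worked before the agent's change, broken after
    "test_search": True,
    "test_new_feature": True,
}

regressions = find_regressions(before_update, after_update)
print(regressions)
```

A test that disappears after the update is counted as a regression here, on the assumption that deleting a test should not be a way to make a failure go away.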

Writing code and maintaining code are fundamentally different skills. AI is really good at the former. AI is not very good at the latter. Most benchmarks today measure only the first, and the first is the basis for dramatic statements by Dario Amodei and others saying jobs are over or half of jobs are gone.

But if you still need a person to maintain the code, what are we doing here? However incredible the use cases agents demonstrate, in production they depend on humans to build the infrastructure around them. When the Cursor team set up their AI agent to recreate Excel or write a browser, they had to set the intent, deliberately experiment on the agent harness, and design the context, tools, sub-agents, and reporting structure. The agent gets credit for doing great work, but the humans should get credit for setting it up. Smart humans have to deploy and maintain this stuff.

3. The Harvard Seniority Paper

Hosseini Maasoum and Lichtinger studied 62 million American workers across 285,000 firms from 2015 to 2025.

At firms that adopted generative AI, junior employment dropped roughly 8% relative to non-adopters within a year and a half
Senior employment kept rising
The decline hit hardest in AI-exposed occupations
Critically, the drop was driven by slower hiring, not more firing

The conventional read is that AI replaces junior workers. The better read is that AI replaces task execution. Juniors are hired for tasks: debugging, document reviews, first drafts. AI does these tasks adequately in isolation. Seniors survive because they provide something different. They hold the mental model of the system. They know which parts are load-bearing. They know the decision history. They know the things nobody wrote down.

The Harvard data shows a labor market learning in real time that context is the scarce resource, not agentic coding execution.

The Context Gap Beyond Engineering

Software engineering is the easiest domain to measure, but this is not an engineering story. It just happens to start that way because engineering is easy to verify. The pattern Alexey experienced — a technically capable agent that is blind to the larger context of the problem — is about to repeat in every knowledge work domain where agents get deployed. And agents are getting deployed everywhere in 2026.

Legal: An agent reviewing contracts can parse clauses, flag risks, and compare against templates. What it cannot know is that a particular vendor has an informal understanding about payment terms negotiated over dinner three years ago, or that the company is in quiet acquisition talks and certain IP clauses are suddenly existential. The agent will review the contract competently and miss what matters, because what matters lives in the general counsel’s head, not in a document.
Marketing: Agents can build audiences, draft copy, and allocate budget. They cannot know the brand had a crisis in that market segment eight months ago and the tone needs to be completely different there. They cannot know the CMO made a promise to the CEO about a positioning shift that hasn’t been written down anywhere. The agents will execute a technically strong campaign that reopens a wound the organization spent months sealing.
Finance: AI can build technically perfect projections. It cannot know that certain numbers are politically dangerous internally, even if technically correct. It can’t read the room. It doesn’t know what the board cares about this quarter versus last unless you tell it.

In every case, the agent does the task well. In every case, the agent cannot know whether this is the right task done the right way at this moment in this organizational context. In every case, the human who holds that context is the difference between the agent creating value and the agent creating damage — in some cases, existential damage.

The Market Is Confused

The market is sending contradictory signals. On one hand, there is the SaaS apocalypse narrative. On the other hand:

Gartner predicts that by 2027, half the companies that cut staff for AI will rehire workers to perform similar functions, often under different job titles. Their survey of over 300 customer service leaders found only 20% had actually reduced headcount because of AI.
Forrester data is even sharper: 55% of employers say they regret AI-driven layoffs. There are very public case studies where companies have regretted laying off employees and subsequently rehired them.

CEOs are hearing that AI can do a lot of things, and that is true. It is absolutely transformative to the business. What they are not hearing soon enough is that really good humans are needed to make that happen. AI being incredibly great doesn’t mean you don’t need good people in the enterprise in those job functions.

This keeps coming back to the same agent problem. AI agents can write code but they can’t sustain code for eight months. It’s the memory wall again, extended across the enterprise to customer service, marketing, legal, and product. AI is really good at doing a task (a specific, well-scoped thing, as on GDPval) and really bad at doing the complete job, as the Remote Labor Index shows. That is a common pattern. It is not a solved problem.

Evaluations: The Bridge Between Human Knowledge and Machine Action

Right now, human judgment is our critical safeguard. The industry is struggling with this. An eval is a way of encoding human judgment into a test that runs before, during, and after an agent acts. It is the bridge between what the human knows and what the machine does.

A good eval would have caught Alexey’s disaster. Something as simple as: “Before destroying any cloud resource, verify it is not tagged as production.” Or: “Before any bulk infrastructure change, compare the current state file against the known production manifest.” These feel unglamorous, but they are the kind of thing a senior engineer knows how to check and an AI agent will never think to check on its own.
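
The two checks above can be sketched as a single guardrail that runs before any destroy command. This is an illustration under stated assumptions, not a reconstruction of Alexey’s setup: the `Resource` type, the `environment` tag, the resource names, and the idea of a hand-maintained production manifest are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    resource_id: str
    tags: dict[str, str] = field(default_factory=dict)


class DestroyBlocked(Exception):
    """Raised when a destroy plan includes something that looks like production."""


def check_destroy_plan(plan: list[Resource], production_manifest: set[str]) -> None:
    """Raise DestroyBlocked if any planned deletion is tagged or listed as production."""
    blocked = sorted(
        r.resource_id
        for r in plan
        if r.tags.get("environment") == "production"
        or r.resource_id in production_manifest
    )
    if blocked:
        raise DestroyBlocked(f"Refusing to destroy production resources: {blocked}")


# Hypothetical data: the manifest is maintained by a human who knows the system.
production_manifest = {"course-platform-db"}
plan = [
    Resource("duplicate-db-copy", {"environment": "scratch"}),
    Resource("course-platform-db"),  # untagged, but listed in the manifest
]

try:
    check_destroy_plan(plan, production_manifest)
except DestroyBlocked as exc:
    print(exc)
```

Checking both the tag and the manifest is deliberate: a tag can be missing on exactly the resource that matters, so the manifest acts as a second source of truth written down by someone who holds the context.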

The Problem With How Companies Write Evals Today

Most companies deploying AI agents don’t write evals at all. Those that do tend to write “vibes-based evals.” A junior person sits in front of a spreadsheet and writes a list of things they think constitute a good test set. Nobody asks about methodology. Nobody asks if the evals are actually good. Nobody asks if they can prevent real-world failures until it’s too late.

Most evals don’t test whether an output is safe for a specific environment. They don’t test if it’s appropriate for the organizational context. They don’t test alignment with decisions made six months ago. Companies that do write evals usually test only surface-level correctness and mistake it for complete correctness.

Evals Require Senior People

Evals are often treated as a chore and delegated to junior team members, but junior team members don’t have the context. You need senior people writing the evals. Don’t put an eval into production that tests agentic work and says “did the code compile?” Ask something better: “Does this change break something downstream that the test suite didn’t cover? Here are 16 examples, three counter-examples, and a repo to reference.”

The skill of writing great evaluations is the exact same skill that makes senior people valuable. You have to know what “right” looks like in your situation, not just in general. You have to understand the system well enough to anticipate where an agent will go wrong in ways the agent can’t anticipate for itself. That is contextual judgment, encoded into infrastructure, in ways that enable agents to succeed.

The Fear of Making Yourself Replaceable

Some senior people resist writing evals because they fear that if they encode their knowledge, they’ll be fired. Consider the Forrester data: 55% of employers regret their AI-driven layoffs. Any leader worth their salt needs to understand that the ability to write evals is an evolving skill based on evolving context. If a senior person writes an eval and then gets removed, the eval will not magically continue to work. It will go stale as the context changes, and the result will be something like what happened to Alexey, except at corporate scale.

The companies that win the next few years will treat eval design as a core competency for seniors — not a developer task, not an afterthought, not a chore, not something to throw at juniors, not something to do before firing people, but as a primary expression of ongoing institutional knowledge.

Contextual Stewardship: The Human Role in an Agentic World

The human role in an agentic world is contextual stewardship: maintaining the mental model of your system, representing what you know in ways machines can use, and exercising judgment about when technically correct output is organizationally wrong. This is not a technical skill. It is not about learning to code or mastering a particular AI tool. It is about becoming the person in your organization who holds the context that keeps the machines from going sideways.

1. Document Decisions, Not Just Outcomes

Most organizations track what happened but rarely capture why. They don’t capture the constraints, the trade-offs, or the context that made one choice better than another at that specific moment, let alone capture it in a useful, repeatable way. Decision context is the raw material that makes agents effective. Its absence is what makes them dangerous.

When Alexey’s agent destroyed his database, the critical missing piece wasn’t a fancier model. It was a record of which infrastructure was production and why.
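
As a rough sketch of what such a record could look like in machine-readable form, the snippet below stores a decision together with its reason and serializes it for an agent’s context. The `DecisionRecord` type, its field names, and the resource name are assumptions made for illustration; any format an agent can read would serve the same purpose.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionRecord:
    """One decision plus the reasoning an agent would otherwise never see."""
    subject: str   # what the decision is about
    decision: str  # what was decided
    reason: str    # why it was decided
    constraints: list[str] = field(default_factory=list)


def to_agent_context(records: list[DecisionRecord]) -> str:
    """Serialize records as JSON that can be placed in an agent's context."""
    return json.dumps([asdict(r) for r in records], indent=2)


# Hypothetical record; the resource name and wording are illustrative only.
records = [
    DecisionRecord(
        subject="course-platform-db",
        decision="Treat as production; never destroy or recreate automatically.",
        reason="Stores homework submissions, projects, and leaderboard entries.",
        constraints=["Bulk infrastructure changes need human review first."],
    )
]

print(to_agent_context(records))
```

The `reason` field is the part most organizations skip, and it is the part that lets a later reader, human or agent, judge whether the decision still applies.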

2. Develop System-Level Thinking

This applies not just to engineers but to everyone operating in a complex environment, which is increasingly all of us at work. Understand how the pieces of your organization connect. Know second-order consequences. Build a mental model that lets you evaluate whether an agent’s output is appropriate for this specific moment, not just correct in isolation.

The marketing lead who knows the brand’s wound history is doing system-level thinking. The general counsel who knows the unwritten relationship terms is doing system-level thinking. This is the senior skill the Harvard data says the market is paying for.

3. Invest in Writing Evaluations

This is the highest-leverage thing most people are not doing. You don’t need to be an engineer. You need to know your domain well enough to articulate: here are the things that must be true for this output to be safe and useful in our world.
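
For readers who do work with code, “the things that must be true” maps naturally onto a list of named checks run against an agent’s output. The sketch below assumes a marketing-copy scenario; the check names, the `LegacyPlan` product, and the disclaimer text are invented for illustration.

```python
from typing import Callable

# A check is a human-readable name paired with a predicate over the output text.
Check = tuple[str, Callable[[str], bool]]


def run_eval(output: str, checks: list[Check]) -> list[str]:
    """Return the names of checks the output fails; an empty list means it passed."""
    return [name for name, predicate in checks if not predicate(output)]


# Hypothetical domain rules written by someone who knows the business.
checks: list[Check] = [
    ("mentions no discontinued product", lambda text: "LegacyPlan" not in text),
    ("includes required disclaimer", lambda text: "Terms apply." in text),
    ("stays under 280 characters", lambda text: len(text) <= 280),
]

draft = "Switch to LegacyPlan today and save 20%!"
failures = run_eval(draft, checks)
print(failures)
```

The code is trivial on purpose. The value sits in the list of checks, which only someone with domain context can write.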

If you have Claude in your browser, you have an agent. If you have ChatGPT using your computer, you have an agent. You don’t need to be an engineer to ask: what are the checks that might catch a disaster before it happens? How can I communicate that context to the agent in a way that matters?

The ability to write an eval is the ability to scale your judgment across every agent your organization deploys.

The Asymmetry That Matters

AI capabilities are advancing quickly. The trajectory is real. But the capabilities are advancing in ways that don’t address long-term memory. Task execution — writing code, generating content, building models — is improving at a terrifying pace. Contextual understanding, the kind that prevents an agent from obliterating your database, is improving much more slowly.

OpenAI has made a bet on solving this with their Frontier system alongside AWS, but that is a bet, not a solution. It is not clear how it will be solved, and when it is solved, it is not clear everyone will want to give this kind of long-running context to OpenAI or any private company — because what we are talking about is the story of your company inside people’s heads.

The story is not “AI is overhyped.” It might be underhyped, honestly. The story is not “AI will replace everyone,” because there are real gaps. The story is that the gap between what agents can do and what agents understand is actually getting wider, because agents are getting more intelligent without getting better at memory. The humans who stretch to close that gap through judgment, context, and evals will become the most valuable people in their organizations. The humans who don’t will find themselves competing with machines on exactly the dimensions where machines are improving the fastest.

Gartner’s rehiring prediction is not about AI failing. It is about organizations discovering too late what their humans were actually providing. The task execution was visible. The contextual stewardship was invisible. You don’t realize invisible infrastructure is load-bearing until you remove it and something collapses.

Having been the person who horrifically deleted something in production earlier in his career — half an Oracle instance, thanks to terrible UX — Nate Jones knows the feeling in the pit of your stomach. He’s glad Alexey got his database back.

The agents are here. They work. They’re improving all the time. And that’s what makes it scary, because no agent, however much better it gets, will automatically become aware of the context that keeps your business alive. By default, it is blind to that context. Your job, whatever your title, whatever your domain, is to be the one who sees what they can’t, and then to write an eval that makes sure they never have to.
