Plausible Reality

The AI Layoff Story Is Too Convenient

Eloi Tay — Sun, 14 Jun 2026 20:01:01 GMT

The funniest thing about the “AI caused the layoffs” argument is that it makes big tech sound much tidier than it is.

It imagines a clean mechanical event. A model gets smarter. A worker becomes redundant. A dashboard updates. Somewhere, a finance executive presses a tasteful grey button labelled optimise, because apparently even dystopia has a design system.

That is not what the evidence suggests.

The wrong debate is whether big-tech layoffs are caused by AI or not caused by AI. That framing is too neat. It asks whether the robot committed the murder while ignoring the CFO standing next to the body holding a quarterly forecast.

I think the better reading is this: the first big wave of layoffs was mostly a correction for pandemic overhiring, higher interest rates, revenue pressure, and the end of free-money confidence. AI then became two things at once: a real productivity technology and a very expensive capital allocation problem. By 2024-2026, the story shifted from “we hired too many people” to “we need fewer people in some places because compute, infrastructure, and a smaller set of AI-critical workers now get first claim on the budget.”

That is not the same as “AI replaced everyone.” It is also not comforting.

The talent-hoarding hypothesis gets part of the story right. Big tech has treated talent as a strategic input before. It has not always behaved like a group of innocent employers wandering through the labour market carrying ergonomic keyboards and good intentions. But the idea that the whole sector is now deliberately depressing the labour market to retain top talent is harder to prove. The incentive is real. The direct evidence of intent is not.

The useful question is not “did AI cause the layoffs?” It is: what kind of labour becomes less valuable when capital gets tighter, AI tools get better, and compute becomes the new headcount?

That is the question the spreadsheet is already answering, usually before anyone has admitted there is a spreadsheet.

The wrong comparison

Start with the comparison being smuggled into the debate.

When people say “AI caused the layoffs,” they usually compare AI against a fantasy version of the old labour market: stable teams, sensible hiring, accurate roadmaps, managers allocating talent rationally, and job descriptions that reflected the work.

This version of the company exists mainly in onboarding decks and executive offsites. In practice, hiring during the pandemic often looked like a grocery run before a snowstorm. Teams hired because demand was up, competitors were hiring, capital was cheap, and the safest internal answer to “can we do this faster?” was “give me twelve more people.”

The official version was strategic growth. The real version was sometimes an arms race with laptop stickers.

Meta’s 2022 layoff message is unusually explicit about this. Zuckerberg said the company had increased investment after the pandemic accelerated online commerce, then acknowledged that he had expected the acceleration to be permanent and “got this wrong.” The company cut more than 11,000 people, about 13% of its workforce. A few months later, Meta announced another round: around 10,000 additional cuts and the closure of about 5,000 open roles under the “Year of Efficiency.”¹ ²

That is not an AI-substitution memo. It is an overexpansion memo, a revenue-pressure memo, a “we believed our own trend line” memo. The prose is corporate, but the plot is simple: the graph went up, everyone extrapolated, the graph stopped cooperating.

Microsoft’s January 2023 note made a similar move. It announced 10,000 job cuts while saying the company would continue hiring in key strategic areas. Amazon said its cuts would exceed 18,000 roles, with Stores and PXT heavily affected. Google announced about 12,000 cuts and framed them around a changed economic reality and sharper prioritisation.³

Again, the interesting part is composition. These firms were not uniformly collapsing. They were cutting here, hiring there, closing open roles, flattening layers, ending projects, and reallocating money. That is messier than the AI-doom story, which is why the AI-doom story travels better. It has fewer tabs.

The original sin was not that companies adopted AI. It was that they treated temporary demand, cheap capital, and competitive paranoia as if they were a permanent operating model.

Then the cost of pretending went up.

Perfection was never the benchmark

The other lazy debate is whether AI is “good enough” to replace workers.

This sounds sensible until you compare it with what actually happens at work.

Most companies do not require perfection from their current systems. They tolerate broken dashboards, stale documentation, meetings where everyone agrees and nothing changes, and roadmaps that have the structural integrity of a wet croissant. Then AI arrives and suddenly everyone becomes a Swiss watchmaker of epistemic purity.

The useful question is not whether AI is flawless. It is what flawed system it is being compared against.

There is evidence that generative AI can materially improve productivity in some workflows. In a large customer-support study, access to an AI assistant increased productivity for agents, with larger gains for less-experienced workers. In a controlled software-development experiment, developers using GitHub Copilot completed a coding task 55.8% faster than the control group.⁴ ⁵

These findings matter, but they do not mean “delete the team.” They mean the shape of the team changes.

Imagine a developer dropped into a legacy service called billing_sync_v2, which everyone privately knows is v4 because v3 was abandoned after the migration that nobody discusses. The developer asks an AI assistant to explain the code path, find likely duplication, draft a test, and propose a cleanup plan. The tool helps. It surfaces dependencies faster than a human spelunking through Slack archaeology. It also confidently misses the fact that a “temporary” integration from 2019 still powers one enterprise customer whose contract renewal pays for half the team’s snacks.

The AI did not replace engineering judgement. It moved the judgement to a different place.

Before, the scarce skill was often remembering where the bodies were buried. Now the scarce skill is knowing which generated explanation to distrust, which test to run, which log line matters, and which staff engineer to bother before the cleanup becomes a revenue event.

This is where the layoff story gets uncomfortable. If a company can get the same output from a smaller team by cancelling low-priority projects, using AI for first drafts, automating routine coordination, and leaning harder on senior reviewers, it may not need the same number of people. That is not science fiction. It is also not a clean one-worker-one-bot substitution.

The job does not disappear in a puff of model dust. The work gets redistributed, compressed, and hidden inside the remaining roles.

Which would be funny, except this is how burnout gets rebranded as leverage.

The spreadsheet is also hallucinating

A lot of the AI-layoff discourse assumes companies know what work people actually do.

That is generous.

In many organisations, work is not described by the org chart. It is described by the person everyone asks when the payment job fails at 11:40 p.m. The source of truth is not the architecture diagram. It is a pinned Slack message, three tribal-memory specialists, and a spreadsheet called Final_v7_REAL_final.xlsx sitting in a folder named “Q3 planning - new - approved - USE THIS.”

This matters because AI does not enter a clean system. It enters a messy one and makes the mess legible faster.

A product team, for example, might use AI to summarise hundreds of support tickets and customer calls. The output says customers are angry about onboarding. That is useful. Then the team checks usage data and discovers the people complaining are mostly prospects who never activated, while retained customers are quietly struggling with permissions. The model found the loud pain. The business needed the costly pain.
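The gap between loud pain and costly pain can be shown with a few lines of Python. This is a minimal sketch: the theme names, mention counts, and revenue figures are invented for illustration, and the field names are assumptions rather than anything from a real tool.

```python
from dataclasses import dataclass


@dataclass
class Theme:
    name: str
    mentions: int            # how often the theme shows up in tickets and calls
    revenue_at_risk: float   # hypothetical annual revenue tied to affected accounts


# Invented numbers, for illustration only.
themes = [
    Theme("onboarding", mentions=340, revenue_at_risk=20_000.0),
    Theme("permissions", mentions=45, revenue_at_risk=410_000.0),
    Theme("export formats", mentions=80, revenue_at_risk=35_000.0),
]


def rank(items, key):
    """Return theme names ordered from highest to lowest by the given key."""
    return [t.name for t in sorted(items, key=key, reverse=True)]


loudest = rank(themes, key=lambda t: t.mentions)
costliest = rank(themes, key=lambda t: t.revenue_at_risk)

print("By volume:", loudest)
print("By revenue at risk:", costliest)
```

The same data produces two different priority lists. A summary built only from tickets and calls sees the first ordering; the second needs usage and revenue data joined in by someone who knows where it lives.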

The operating lesson is boring and therefore valuable: AI is excellent at accelerating the first pass and dangerous when the first pass becomes the decision.

This is why “AI will replace managers” is usually the wrong frame. AI will replace some managerial theatre. It can summarise status updates, draft review language, cluster risks, write the first version of the reorg FAQ, and turn meeting sludge into something resembling decisions. Good. Some meeting sludge has had a long and undeserved career.

But the actual managerial work is deciding what trade-off the organisation is making, who is accountable, which signal is real, which team is overloaded, and which comfortable project needs to die. If a manager’s job was mostly forwarding ambiguity between calendars, yes, the outlook is poor. That was not management. That was packet switching with a lanyard.

AI does not remove the need for judgement. It removes excuses for not exercising it.

And that is exactly why headcount plans change.

Talent hoarding was real, just not in the tidy way

The talent-hoarding hypothesis is attractive because it explains a thing everyone in tech has seen.

A company hires people it does not quite know how to use. A team grows because another team grew. Requisitions become defensive weapons. Leaders collect headcount like medieval nobles collecting horses, except the horses have RSUs and opinions about Kubernetes.

There is a serious point underneath the comedy: in big tech, talent has often been treated as a scarce strategic input, not merely a cost line.

The strongest evidence is not vibes. It is legal history. In 2010, the US Department of Justice brought a case against Adobe, Apple, Google, Intel, Intuit, and Pixar over “no cold call” agreements that restricted recruiting competition. The DOJ argued that these agreements eliminated a significant form of competition for employees and deprived workers of information about better opportunities.⁶

That matters because it shows the instinct directly: do not just win talent; manage the market for talent.

Labour economics gives the mechanism. When workers have fewer outside options, employers have more bargaining power. Research on labour-market concentration finds that higher concentration is associated with lower posted wages. US antitrust agencies have also warned that wage-fixing and no-poach agreements can violate antitrust law and expose companies and executives to serious liability.⁷

So the broad idea is not paranoid. Labour markets are markets. Employers have incentives. Very large employers do not become monks simply because the cafeteria has kombucha.

But the clean version of the talent-hoarding theory breaks on the frontier-AI labour market.

If AI scaling had made talent-blocking broadly obsolete, you would expect the talent war to cool at the top. The opposite appears to have happened. Reuters reported in 2025 that top OpenAI researchers were receiving compensation packages above $10 million a year, that Google DeepMind had offered packages reaching around $20 million for some researchers, and that Meta had made extreme reported offers to recruit OpenAI employees. Reuters later described Meta’s AI hiring push as an intensification of Silicon Valley’s talent war.⁸

That is not de-escalation. That is an auction wearing a hoodie.

The better model is segmentation.

Frontier AI researchers, infrastructure specialists, and a small number of deployment leaders become more valuable. Generic surplus headcount, coordination-heavy roles, and work attached to low-priority projects become less protected. The labour market does not cool evenly. It develops weather systems.

One person gets a call about a nine-figure package. Another gets a calendar invite from HR with no agenda. Both are living in the same AI boom. That is the problem.

The market split in two

The most important thing AI is doing to tech labour is not simply replacing jobs. It is changing which jobs are defensible.

That distinction matters for builders because “can AI do this task?” is not the same question as “should this be a full-time role?”

Take a routine product-operations workflow. Someone exports customer feedback, cleans a spreadsheet, tags themes, creates a deck, schedules a readout, and spends forty minutes explaining that the top issue is “unclear onboarding,” which everyone already suspected because customers keep saying “the onboarding is unclear.” An AI system can now compress a lot of that work. It can read tickets, cluster themes, generate example quotes, draft the deck, and produce the first-pass recommendation.

It can also over-weight the most verbose customers, flatten important edge cases, and produce recommendations that sound strategically mature because the word “streamline” appears twelve times.

The remaining job is verification: compare the output with usage data, revenue impact, churn, support cost, and the product surface. In a well-run team, AI reduces the time spent assembling the artefact and increases the pressure to make a decision. In a badly run team, AI generates a better-looking artefact so the indecision can wear a blazer.

This is the practical labour-market shift. Work that is mostly assembling, reformatting, summarising, translating, routing, and lightly coordinating becomes more vulnerable. Work that involves accountable judgement, production ownership, customer trust, system design, regulatory risk, or hard verification becomes more valuable.

Official assessments from the IMF, ILO, and OECD have been cautious about treating AI exposure as automatic job destruction. The ILO’s work argues that most occupations contain tasks requiring human input and that transformation is the more likely effect than full automation. The OECD has similarly noted limited evidence so far of broad AI-led job loss, while still recognising high automation exposure in some occupations.⁹

That caution is important. So is the direction of travel.

Anthropic’s own usage research has found much higher automation rates in Claude Code conversations than in general Claude.ai conversations, and its Economic Index shows AI usage heavily concentrated in computer, mathematical, and API-driven workflows.¹⁰ Challenger, Gray & Christmas reported that AI was the leading cited reason for US job cuts in April 2026. Indeed’s Hiring Lab, meanwhile, has described AI-mentioned job postings as growing amid broader hiring weakness.¹¹

None of that proves AI is the single cause of big-tech layoffs. It does show why the latest phase feels different from 2023.

In 2023, the cleanest story was: we overhired and money got more expensive.

By 2026, the sharper story is: we still need people, but not as many of the same people, in the same places, doing the same coordination work, while data centres and frontier teams eat the budget first.

That is a colder story than “the robots took the jobs.” It is less cinematic. It is also more useful.

Verification is the work

The practical response for teams is not to argue about whether AI is good or bad in the abstract. That is a ritual with better formatting.

The practical response is to map where AI changes the verification burden.

Picture a CTO looking at a 120-person engineering organisation after a budget reset. The lazy version of the exercise is to ask, “Which roles can AI replace?” This produces theatre. Managers defend their teams. Finance asks for percentages. Someone says “10x engineer” and the room loses three IQ points.

The better workflow starts with work, not roles.

First, list the recurring workflows: incident response, customer escalation, release notes, test generation, data cleaning, roadmap synthesis, contract review, onboarding, internal tooling, migration planning. Then ask four questions for each workflow.

What does AI make faster?

What does AI make easier to fake?

What verification proves the output is safe?

Who is accountable when the output is wrong?

That small checklist does more than most AI strategy decks.

For incident response, AI may summarise logs and suggest likely causes. It may also hallucinate a causal chain and send the team toward the wrong service. Verification means checking metrics, traces, recent deploys, and rollback safety.

For code migration, AI may draft mechanical changes across a large codebase. It may also miss the weird customer-specific path that is not covered by tests because nobody budgeted for tests when the feature was launched during a sales emergency. Verification means test coverage, staged rollout, ownership, and a human who knows what “temporary exception” means in enterprise software. Usually it means permanent.

For performance reviews, AI may help an EM draft clearer feedback. It may also launder vague resentment into polished prose. Verification means anchoring feedback in observed behaviour, specific examples, and standards the person knew about before review season arrived like a legally compliant thundercloud.

For customer research, AI may summarise interviews quickly. It may also confuse frequency with importance. Verification means comparing themes against usage, revenue, retention, support burden, and product strategy.
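The four questions can be kept as a small record per workflow, so that unanswered ones are visible rather than implied. The sketch below is one possible shape in Python. The class and field names are hypothetical, and the "migration lead" owner is an invented placeholder; the other entries paraphrase the incident-response and code-migration examples above.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowReview:
    workflow: str
    faster: str = ""                  # what AI makes faster
    easier_to_fake: str = ""          # what AI makes easier to fake
    verification: list[str] = field(default_factory=list)  # checks that prove the output is safe
    accountable: str = ""             # who owns it when the output is wrong

    def gaps(self) -> list[str]:
        """Return the names of the questions that still have no answer."""
        answers = {
            "faster": self.faster,
            "easier_to_fake": self.easier_to_fake,
            "verification": self.verification,
            "accountable": self.accountable,
        }
        return [question for question, answer in answers.items() if not answer]


incident = WorkflowReview(
    workflow="incident response",
    faster="summarising logs and suggesting likely causes",
    easier_to_fake="a plausible causal chain pointing at the wrong service",
    verification=["metrics", "traces", "recent deploys", "rollback safety"],
    # accountable left empty on purpose: nobody has been named yet
)

migration = WorkflowReview(
    workflow="code migration",
    faster="drafting mechanical changes across a large codebase",
    easier_to_fake="coverage of customer-specific paths the tests never exercised",
    verification=["test coverage", "staged rollout", "ownership check"],
    accountable="migration lead",  # hypothetical owner
)

for review in (incident, migration):
    print(review.workflow, "-> unanswered:", review.gaps())
```

A workflow with an empty accountable field is the useful output here. It usually means the tool has been adopted before anyone agreed who answers for it.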

The lesson is not “never use AI.” The lesson is that AI shifts effort from production to judgement. It reduces some drafting and searching work. It increases the premium on people who know how to check, decide, own, and repair.

That is also the labour-market lesson.

If your role is mostly producing artefacts that no one verifies, you are exposed. If your role is owning outcomes that require verification across messy systems, you are more defensible. Preference is not performance. Busyness is not leverage. A calendar full of coordination is not the same as accountability, even if it has colour coding.

The annoying question

The AI-layoff story is convenient because it lets everyone avoid the harder conversation.

Executives can say technology changed the business. Workers can say executives used technology as cover. Commentators can pick a side and produce the required amount of concern. Everyone gets a clean villain.

The real picture is nastier.

Big tech overbuilt during a strange demand shock. Higher rates and lower patience made that overbuild expensive. AI created genuine productivity gains, genuine substitution pressure in some tasks, and a massive new appetite for compute. At the same time, the market for talent split: hotter at the frontier, colder across broad generic tech labour, most exposed in routine cognitive work.¹²

The talent-hoarding thesis is therefore partly right and partly too elegant. Yes, companies have treated labour strategically. Yes, weaker outside options help employers. Yes, a cooler market can reduce wage pressure and quits. But proving that recent layoffs were deliberately staged to depress the labour market requires evidence we do not have.

What we do have is enough.

We have companies cutting open roles while hiring in strategic areas. We have AI researchers priced like small sports franchises. We have managers being flattened while infrastructure budgets expand. We have software teams being asked to do more with tools that are powerful, unreliable, and improving. We have junior roles under pressure because the training path was always less robust than the mythology suggested.

The question is not whether AI replaces developers, PMs, analysts, recruiters, or managers.

The question is what remains worth a full-time person once the first draft is cheap, the second draft is suspicious, and the final answer still requires someone to own the consequences.

That is the annoying question for workers. It is also the annoying question for companies.

Because if the answer is “judgement,” then the next question is worse.

Who, exactly, in your organisation is being trained to exercise it?

Notes and References

1. Meta, “Mark Zuckerberg’s Message to Meta Employees,” About Meta, 2022. Used for Meta’s November 2022 layoff count, the 13% reduction, and the company’s explanation that pandemic-era acceleration had been overestimated.

2. Meta, “Update on Meta’s Year of Efficiency,” About Meta, 2023. Used for the second Meta layoff round, the planned reduction of around 10,000 employees, and the closure of about 5,000 open roles.

3. Microsoft, “Focusing on our short- and long-term opportunity,” Microsoft, 2023; Amazon, “Update from CEO Andy Jassy on role eliminations,” About Amazon, 2023; Google, “A difficult decision to set us up for the future,” Google, 2023. Used for official company explanations of 2023 layoffs and continued selective hiring or reprioritisation.

4. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, “Generative AI at Work,” NBER Working Paper, 2023; Quarterly Journal of Economics, 2025. Used for evidence that an AI assistant increased customer-support productivity, with larger gains for less-experienced workers.

5. Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer, “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot,” arXiv, 2023. Used for the controlled experiment in which developers using Copilot completed a coding task 55.8% faster.

6. US Department of Justice, “Justice Department Requires Six High-Tech Companies to Stop Entering into Anticompetitive Employee Solicitation Agreements,” 2010. Used for the historical “no cold call” agreements involving Adobe, Apple, Google, Intel, Intuit, and Pixar.

7. José Azar, Ioana Marinescu, and Marshall Steinbaum, “Labor Market Concentration,” NBER, 2017 / Journal of Human Resources, 2022; Federal Trade Commission and US Department of Justice, “Antitrust Guidelines for Business Activities Affecting Workers,” 2025. Used for the relationship between labour-market concentration and lower posted wages, and for legal treatment of wage-fixing and no-poach agreements.

8. Reuters, “OpenAI, Google and xAI battle for superstar AI talent, shelling out millions,” 2025; Reuters, “Sam Altman says Meta offered $100 million bonuses to OpenAI employees,” 2025; Reuters, “Zuckerberg’s Meta Superintelligence Labs poaches top AI talent in Silicon Valley,” 2025. Used for evidence that the frontier-AI talent war intensified rather than cooled.

9. IMF, “Gen-AI: Artificial Intelligence and the Future of Work,” 2024; International Labour Organization, “Generative AI and Jobs: A global analysis of potential effects on job quantity and quality,” 2023; International Labour Organization, “Generative AI and Jobs: A Refined Global Index of Occupational Exposure,” 2025; OECD, “Employment Outlook 2023: Artificial Intelligence and the Labour Market,” 2023. Used for the distinction between AI exposure, augmentation, transformation, and displacement.

10. Anthropic, “AI’s Impact on Software Development,” 2025; Anthropic, “Anthropic Economic Index report: Economic primitives,” 2026. Used for Claude Code automation-rate evidence and concentration of AI usage in computer, mathematical, API-driven, and office-administrative workflows.

11. Challenger, Gray & Christmas, “April Job Cuts Rise 38% from March; YTD Cuts Down 50%,” 2026; Indeed Hiring Lab, “January 2026 US Labor Market Update: Jobs Mentioning AI Are Growing Amid Broader Hiring Weakness,” 2026. Used for evidence that AI is increasingly cited in layoffs while AI-mentioned job postings grow amid broader hiring weakness.

12. Reuters, “Meta CEO Zuckerberg blames layoffs on capital spending, won’t rule out more job cuts,” 2026. Used for Meta’s reported framing of layoffs as a trade-off between compute infrastructure and workforce costs during rising AI capital expenditure.

The Benchmark Is Also Hallucinating

Eloi Tay — Thu, 11 Jun 2026 19:15:00 GMT

The wrong debate

The funniest thing about the phrase “the only trustable coding benchmark” is that it assumes trust is the kind of object you can put in a leaderboard column.

This is how engineering measurement usually goes wrong. Someone takes a useful instrument, asks it to become a religion, then acts surprised when it starts behaving like procurement theatre with a CSV export.

DeepSWE deserves attention. It is one of the more serious public attempts to measure long-horizon coding agents on repository-level work: multi-file changes, sparse prompts, behavioural verification, regression checks, and a fixed harness. Its public repository describes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, drawn from active open-source projects. Its homepage reports model pass rates, average cost, wall-clock time, and output tokens, rather than only one heroic score pretending to explain the universe. [2][3]

That is good. It is also not the same thing as being the only trustworthy benchmark for coding.

The wrong debate is whether DeepSWE is “the” benchmark. The useful question is what kind of work it forces a coding agent to do, what kind of mistakes it can see, and what kind of mistakes still walk past it wearing a lanyard.

Because coding is not one job. It is a cupboard full of jobs that engineering teams keep labelling “implementation” so the project plan looks tidy.

Sometimes coding means writing a small pure function from a docstring. Sometimes it means fixing a real issue in a Python repository. Sometimes it means changing seven files without breaking the two ancient tests that only pass on the laptop nobody is allowed to reboot. Sometimes it means reading the onboarding document, discovering it describes a product that died three quarters ago, and then quietly asking Darren why the payment service still depends on a cron job under his personal account.

No single benchmark captures all of that. A benchmark is a flashlight. DeepSWE is a better flashlight than many. It is not daylight.

Perfection was never the benchmark

Most benchmark arguments smuggle in the wrong comparison. They compare the benchmark against perfect knowledge, then declare it flawed because it is not omniscient.

Fine. It is flawed. So is the current system.

The current system is a staff engineer trying three coding agents on the same bug, a product manager reading a model card, a CTO staring at a leaderboard, and everyone pretending that “we should evaluate this properly” does not mean “someone should spend the next two Fridays building a private test harness while the roadmap looks at them like a disappointed parent.”

DeepSWE is useful because it improves the comparison we actually need. It does not ask whether a model can solve toy snippets in isolation. It asks whether an agent can enter a repository, interpret a natural request, edit multiple files, and pass behavioural tests that care about what the software does rather than whether the patch resembles the author’s preferred internal helper name. [3][4]

That matters because a large part of AI coding evaluation has been polluted by benchmarks that are public, old, narrow, overfit, saturated, or quietly measuring memory. HumanEval still matters for compact Python function synthesis, but it is not a proxy for changing a maintained product. MBPP still matters for entry-level Python programming problems, but nobody should choose an enterprise coding agent because it can write a small list manipulation function with a polite docstring. LiveCodeBench matters because it was designed around fresher contest-style coding problems and broader scenarios such as code generation, execution, test-output prediction, and self-repair. SWE-bench Verified matters historically and still offers a useful slice of GitHub issue resolution, but even OpenAI has argued that it no longer measures frontier coding progress cleanly because of contamination and test-design issues. [5][6][7][8][10]

This sounds like academic hair-splitting until you are the person approving the tool budget.

A leaderboard says Agent A is better than Agent B. In a meeting, this becomes “we have evidence.” By the time it reaches finance, it becomes “the market has validated this.” By the time it reaches engineering, it becomes “please replace the migration estimate with vibes and a screenshot.”

The serious point underneath the comedy is simple: benchmarks are not guilty because they are incomplete. They become dangerous when people forget what they are incomplete about.

What DeepSWE actually tests

DeepSWE is strongest where older coding benchmarks have been visibly creaking: long-horizon, repository-level work under ambiguity.

The tasks are not copied from existing public pull requests, according to Datacurve. They are authored as original tasks against active open-source repositories. The reference solution is held out. The prompt is what the agent sees. The verifier is executable. The intended target is observable behaviour, not a secret implementation shape. [3][4]

That design choice matters more than the score.

Imagine an engineer using an AI agent to add cancellation support to a JavaScript engine. The weak version of the task is: “change this exact function, add this exact guard, return this exact value.” The stronger version is: “make cancellation stop all queued work associated with an evaluation without breaking unrelated jobs.” The first task measures patch obedience. The second starts measuring engineering.

Datacurve’s own qualitative audit gives an example in this shape: an agent mostly built cancellation support but missed one important path where Promise callbacks could still be queued after cancellation. That is the kind of failure a shallow benchmark can miss and a behaviour-first repository task is more likely to expose. [4]

This is why DeepSWE is interesting. It is not merely harder. Harder is easy. You can make any benchmark harder by writing a prompt that sounds like it was produced by a compliance department and a dungeon master trapped in the same elevator.

The useful part is that DeepSWE changes what counts as evidence. It rewards the agent for reading the repository, preserving behaviour, interpreting a shorter request, making a coherent multi-file change, and not treating the tests as decorative furniture.

That is close to how teams want coding agents to behave. Not magic. Not autonomous product ownership. Just a competent assistant that can enter a codebase without immediately deciding that all abstractions are optional and every edge case is a lifestyle choice.

The current system is also hallucinating

The phrase “AI hallucination” has become too convenient. It makes the model sound uniquely irrational while the surrounding organisation stands there holding three inconsistent Jira tickets and a Confluence page last updated during the previous monarchy.

Benchmarks hallucinate too, just more formally.

They hallucinate when a test suite checks one implementation detail and calls it correctness. They hallucinate when a public task rewards a model for having seen the answer before. They hallucinate when a prompt tells agents not to write tests, then the leaderboard acts surprised that test-writing behaviour disappears. They hallucinate when a single pass rate quietly hides cost, time, token spend, and review burden.

DeepSWE tries to reduce several of these failure modes. Its maintainers report original tasks, programmatic verifiers, regression checks, and a fixed mini-swe-agent harness. They also report a verifier audit where DeepSWE had a 0.3% false-positive rate and 1.1% false-negative rate on the reviewed sample, compared with 8.5% and 24.0% for SWE-Bench Pro. That is a striking claim. It is also a maintainer-reported claim using an LLM-based analyser as part of the audit, so the right reaction is not worship. It is “excellent, now replicate it.” [4]

This is the adult position, which makes it unpopular in rooms that prefer a winner.

DeepSWE may be a better measurement tool for one important construct. It may expose differences that older leaderboards compress. Its current public page reports a large spread between model configurations, with GPT-5.5 [xhigh] at 70% +/- 4% Pass@1 and several other frontier configurations materially lower. That is useful signal, especially compared with benchmarks where top models cluster into one polite blob. [2]

But signal is not truth. Signal is evidence under a measurement design.

A thermometer can tell you the room temperature. It cannot tell you whether the building has bad insulation, whether finance approved the heating bill, or whether Darren opened the window because the server closet smells like hot plastic. This is not a criticism of thermometers. It is a criticism of people who use one reading to explain the entire office.

Verification is the work

The operating lesson for builders is not “use DeepSWE” or “ignore DeepSWE.” It is that verification has become the real work.

Take a normal product team. They are considering a coding agent for repository maintenance. The demo is impressive. The agent changes three files, explains itself nicely, and produces a summary with the confidence of someone who has never had to be on-call. The team runs the tests. Most pass. The agent apologises for the one that fails and suggests a patch. Everyone has a small spiritual experience.

Then the staff engineer asks three dull questions, which are usually the important ones:

Did it change the right behaviour, or just satisfy the visible tests?

Did it preserve unrelated behaviour, or only avoid breaking the happy path?

Did it reduce future maintenance, or did it create a clever little shrine to today’s prompt?

This is where DeepSWE is useful as a pattern, not just as a leaderboard. Behavioural verifiers, regression checks, task-specific tests, cost reporting, time reporting, and trajectory audits are not benchmark luxuries. They are what serious AI-assisted engineering increasingly needs inside the company.

The workflow is boring and therefore likely to be correct. Pick real internal tasks. Freeze the base commit. Give the agent the same prompt a developer would actually write. Let it work in the same harness you intend to use. Run existing tests, targeted behavioural tests, static analysis, security checks, and review. Track pass rate, review time, rework, cost, latency, rollback risk, and the specific failure mode.
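The bookkeeping for that workflow can be sketched in a few lines of Python. Everything here is hypothetical: the AgentRun record, its field names, and the failure-mode labels are placeholders for whatever a team actually tracks, not a format used by DeepSWE or any vendor.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class AgentRun:
    """One agent attempt at one internal task (illustrative fields)."""
    task_id: str
    base_commit: str          # frozen commit the agent started from
    passed: bool              # existing tests and behavioural tests all green
    cost_usd: float
    wall_minutes: float
    review_minutes: float     # human time spent checking the patch
    failure_mode: Optional[str] = None  # filled in after reading the failure


def summarise(runs):
    """Aggregate runs into the numbers a single pass rate would hide."""
    if not runs:
        raise ValueError("no runs to summarise")
    failures = Counter(
        r.failure_mode or "unlabelled" for r in runs if not r.passed
    )
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_cost_usd": mean(r.cost_usd for r in runs),
        "mean_wall_minutes": mean(r.wall_minutes for r in runs),
        "mean_review_minutes": mean(r.review_minutes for r in runs),
        "failure_modes": dict(failures),
    }


# Invented example data.
runs = [
    AgentRun("T-1", "abc123", True, 1.0, 12.0, 10.0),
    AgentRun("T-2", "abc123", True, 2.0, 20.0, 25.0),
    AgentRun("T-3", "abc123", False, 3.0, 31.0, 40.0, "missed requirement"),
]
report = summarise(runs)
print(report)
```

The "unlabelled" bucket is deliberate. If it is the largest failure mode, nobody has read the failures yet.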

Then do the annoying part: read the failures.

The failures are the product. A model that fails by missing one parallel requirement is different from one that breaks integration. A model that writes tests and catches its own mistake is different from one that emits a majestic patch and then leaves before the bill arrives. A model that succeeds only when the answer is in git history is not solving your problem. It is doing archaeology with an API key.

This is why a private eval suite matters. DeepSWE can tell you something about frontier agent capability on open-source repository tasks. It cannot tell you whether the agent understands your strangler-fig migration, your feature flag conventions, your audit trail requirements, or the part of the billing service that everyone says is “temporary” in the same tone archaeologists use for pottery shards.

The product wrapper still matters

One of DeepSWE’s best choices is also one of its limitations: it fixes the harness.

Datacurve runs models through mini-swe-agent for consistency. That reduces scaffolding noise. It means the leaderboard is trying to compare model capability rather than the hidden advantages of one vendor’s product environment, prompt, editing tool, or carefully massaged workflow. [2][4]

This is clean. In practice, nobody buys “model capability” in a jar.

Teams use Claude Code, Codex CLI, Cursor, Gemini CLI, internal wrappers, security scanners, branch protection, code review bots, ticket systems, and whatever script Michael wrote during the incident because it was 2 a.m. and “temporary” had once again become architecture.

A fixed harness answers one question: how do these models behave when forced through the same narrow pipe?

A native product answers a different question: how does the whole system perform when developers actually use it?

Both questions matter. Confusing them is how teams buy a model because it wins a benchmark, then discover that the expensive part is not generation. It is integration, review, permissions, context packaging, security approval, developer trust, and the small matter of whether the thing can open the right files without turning the repository into a choose-your-own-adventure novel.

This is also why preference is not performance. RealHumanEval found that benchmark improvements can translate into programmer productivity gains, but not proportionally; it also found that programmer preferences did not necessarily track actual performance. [9]

That should bother anyone who relies on demos. Demos are preference machines. They are designed to make the tool feel obvious. Production is a performance machine. It is designed to punish every missing edge case you clapped for earlier.

What teams should actually do

Use DeepSWE as a serious external signal. Do not use it as a solitary purchasing ritual.

For researchers, it belongs near the centre of the long-horizon coding-agent stack. Pair it with function-level benchmarks, freshness-oriented benchmarks, and quality-oriented evaluations for security, maintainability, and efficiency. Report the harness, model version, effort mode, cost, time, token use, and uncertainty. Otherwise the score is not a result. It is a badge.
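Reporting uncertainty does not need heavy machinery. As a minimal sketch, the Wilson score interval below puts a rough 95% range around a pass rate. The task counts are hypothetical, and this is one common choice of interval, not necessarily how any leaderboard computes its own error bars.

```python
import math


def wilson_interval(passes, n, z=1.96):
    """Wilson score interval for a pass rate; z=1.96 is roughly 95% coverage."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= passes <= n:
        raise ValueError("passes must be between 0 and n")
    p = passes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical suite: 14 of 20 tasks passed on the first attempt.
low, high = wilson_interval(14, 20)
print(f"Pass@1 = {14 / 20:.0%}, 95% interval ~ {low:.0%} to {high:.0%}")
```

On twenty tasks, a 70% pass rate is compatible with anything from roughly 48% to 85%. That is worth knowing before anyone declares a winner on a difference of five points.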

For practitioners, the minimum sensible evaluation looks more like a release checklist than a leaderboard screenshot.

Start with the public evidence: DeepSWE for long-horizon repository work; HumanEval or MBPP for small synthesis primitives; LiveCodeBench for fresher algorithmic coding; SWE-bench-style tasks for issue resolution; human-in-the-loop studies or internal pilots for productivity. Then add your private work: five to twenty representative repository tasks, held-out behavioural tests, review by your engineers, and explicit measurement of how much effort the agent saves or moves downstream.

Downstream is where lies go to become tickets.

Do not only ask “did the agent pass?” Ask who had to verify it, what they checked, what they missed, how long it took, and whether the resulting code is something you want to maintain after the agent has gone off to generate a haiku about root cause analysis.

The uncomfortable conclusion is that good AI coding evaluation looks less like model ranking and more like engineering management. It requires boring infrastructure, realistic tasks, clear ownership, review discipline, and enough humility to admit when the tool is useful but not safe unattended.

That is not a strategy deck sentence. That is the work.

The annoying question

DeepSWE is not the only trustworthy coding benchmark. The claim that it is would be too strong and, more importantly, too small.

The better claim is sharper: DeepSWE is one of the most useful current public benchmarks for long-horizon, behaviour-verified, repository-level coding agents. It measures a slice of work that teams actually care about and that older leaderboards often blur. It exposes failures that look like real engineering failures: missed requirements, integration mistakes, regressions, weak verification, and suspicious help from public history.

But it does not measure coding in general. It does not fully measure native product ergonomics, enterprise context, security, maintainability, review quality, private-codebase behaviour, or whether a developer actually gets their afternoon back. It does not turn judgment into a number. It gives judgment something better to inspect.

This is the useful frame.

The point of DeepSWE is not that we have finally found the one true benchmark. The point is that a serious benchmark makes the rest of our evaluation habits look lazy.

That is why the annoying question for teams is not “Which model won?”

It is: “What would our own benchmark reveal about the way we already ship software?”

Most organisations will not like the answer. Which is usually how you know the benchmark is starting to work.

Notes and References

1. Uploaded report, “DeepSWE and the Trustworthiness of Coding Benchmarks,” 2026. Used for the source framing that DeepSWE is a strong current public benchmark for long-horizon repository-level coding-agent evaluation, but not the only trustworthy benchmark for all coding work.

2. Datacurve, “DeepSWE,” Datacurve, 2026. Used for the current public leaderboard snapshot, including the 30 May 2026 update, Pass@1, cost, time, output-token fields, and the statement that all models are run on mini-swe-agent for consistency.

3. Datacurve, “datacurve-ai/deep-swe,” GitHub, 2026. Used for the public repository description: 113 tasks across TypeScript, Go, Python, JavaScript, and Rust; task format; held-out reference solution; and program-based behavioural verifiers.

4. Datacurve, “DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks,” Datacurve Blog, 2026. Used for task-construction methodology, original-task claims, verifier design, false-positive and false-negative audit numbers, harness discussion, qualitative failure taxonomy, and examples of missed requirements and test-writing behaviour.

5. SWE-bench, “SWE-bench Verified,” SWE-bench, 2026. Used for the claim that SWE-bench Verified is a human-filtered subset of 500 SWE-bench instances for evaluating coding agents and language models.

6. OpenAI, “HumanEval: Hand-Written Evaluation Set,” GitHub, 2021. Used for HumanEval as a function-level code-generation benchmark and evaluation harness associated with “Evaluating Large Language Models Trained on Code.”

7. Google Research, “Mostly Basic Python Problems Dataset,” GitHub, 2021. Used for MBPP as around 1,000 crowd-sourced entry-level Python programming problems with task descriptions, solutions, and automated tests.

8. Naman Jain et al., “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code,” arXiv, 2024. Used for LiveCodeBench as a continuously updated coding benchmark with scenarios beyond code generation, including self-repair, code execution, and test-output prediction.

9. Hussein Mozannar et al., “The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers,” OpenReview, 2024. Used for the human-in-the-loop finding that benchmark gains can improve productivity but not proportionally, and that programmer preference signals do not necessarily correlate with actual performance.

10. OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities,” OpenAI, 2026. Used for the contamination and test-design concerns around SWE-bench Verified and the broader argument that public coding benchmarks can stop measuring genuine frontier progress cleanly.

The Moat Moved Up the Stack

Eloi Tay — Tue, 09 Jun 2026 19:11:59 GMT

Every few months, the same argument comes back wearing a new jacket.

An open model lands near the top of a public leaderboard. Someone posts a chart. The conclusion arrives five seconds later: the frontier labs are cooked.

I think that conclusion is too neat. It confuses a model with a product, and a product with a business.

The wrong debate is whether an open model can match a closed model on a public test. Of course it can, in slices, and sometimes very impressively. The useful question is what actually has to be copied before a customer can move serious work without caring who made the underlying model.

My answer is simple: the moat has moved up the stack.

Benchmark parity still matters. It pressures prices. It embarrasses marketing decks. It makes lazy product managers nervous, which is a public service. But it is no longer the cleanest map of competitive power. The high-value AI product is a runtime system wrapped around a model: tool orchestration, retrieval, policy enforcement, private evaluations, monitoring, enterprise controls, distribution, pricing, and trust.

You can scrape answers. You can fine-tune a student. You can copy vibes. You cannot cheaply scrape the machinery that makes the thing work on Monday morning inside a company with permissions, angry engineers, compliance reviews, and audit logs.

* * *

Benchmark parity is a bad trophy

The 2026 AI Index gives us the more boring, more useful version of the story. It reports that top-model performance is converging, that the leaders are packed more tightly together, and that competition is shifting toward cost, reliability, and domain-specific performance. It also reports that the open-versus-closed gap, after narrowing in 2024, widened again, with the top closed model ahead of the top open model by 3.3 percentage points as of March 2026. [1]

That is not a victory lap for closed models. It is a warning against over-reading any single chart.

Public benchmarks are useful instruments, but they are increasingly noisy instruments. Stanford’s report points to fast benchmark saturation, invalid-question rates as high as 42 percent on some widely used evaluations, and concerns that leaderboard standing can partly reflect adaptation to the platform rather than general capability. [1]

This sounds obvious, but teams miss it all the time: when the test is unstable, parity on the test is not a durable economic fact.

LiveBench exists because static benchmarks are vulnerable to contamination and obsolescence. Its designers release new questions regularly and use objective ground-truth answers to reduce dependence on LLM judges. [2] Humanity’s Last Exam exists for a similar reason: older academic tests were becoming too easy for the strongest systems, so the measurement target had to move. [23]

The deeper problem is that benchmark capability is jagged. A model can look equivalent on a math set and then behave differently on long-context retrieval, tool use, refusal calibration, latency, consistency, or enterprise permissions. Lost in the Middle showed that language models can perform worse when relevant information appears in the middle of a long context, even when the same information is easier to use at the beginning or end. [3] That kind of failure rarely shows up in the leaderboard screenshot that gets circulated on social media.

Preference is not performance. A public arena vote is not the same thing as a production system surviving a month of messy workflows.

The product is not the model

The easiest way to misunderstand the frontier labs is to keep imagining that they sell a single static model. That was closer to true in the early chatbot phase. It is much less true now.

OpenAI’s Responses API is a good example. The official docs describe a stateful interface for agent-like applications, with built-in tools such as web search, file search, computer use, Code Interpreter, remote MCP support, and reasoning summaries. [4] [5] A leaderboard score does not recreate that product surface. A copied answer style does not reproduce the server-side tool loop, preserved context, encrypted reasoning items, or enterprise controls around the API.

Google’s Gemini stack is similarly not just a model endpoint. The developer docs cover grounding with Google Search, context caching, structured outputs, built-in tool combinations, and file or function calling. [11] [12] Google’s Gemini 3.1 Pro model card describes internal safety evaluations, human red teaming, and continuous testing under its Frontier Safety Framework. [14] Again, the interesting thing is not only the model. It is the placement of the model inside search, cloud, developer tools, enterprise surfaces, and a safety-release process.

Anthropic’s moat is different again. It is not only that Claude answers well. Anthropic has built a safety- and developer-oriented deployment regime around tool use, prompt caching, web search, system-card discipline, and usage restrictions. [7] [8] It also now talks publicly about detecting industrial-scale distillation attempts through account behavior, request metadata, traffic patterns, and capability-targeted prompts. [10]

Meta is the useful counterexample. Meta deliberately weakens the ‘weights secrecy’ moat by releasing Llama models openly. But that does not mean Meta has no moat. Its strategy is ecosystem capture: developer mindshare, distribution, open weights, safety tools, partners, and the ability to keep reinvesting at huge scale. Meta’s own Llama 3.1 announcement says the license changes allow developers to use outputs from Llama models, including the 405B model, to improve other models. [15] That is not a lab forgetting to defend itself. That is a lab choosing a different layer to defend.

So when someone says, ‘the open model caught up,’ the first answer should be: caught up to what? The static score? The assistant product? The agent runtime? The compliance posture? The distribution channel? The cost curve?

The phrase ‘model moat’ is now too small. The product moat is the thing.

Distillation is real. It is just narrower than the panic

The anti-hype version of this argument is not that extraction is fake. It is very real.

The model-stealing literature goes back years. Tramèr and colleagues showed in 2016 that black-box prediction APIs could be copied with surprising fidelity in several supervised-learning settings. [17] Later work such as Knockoff Nets showed that attackers could train functional substitutes for vision models by querying a target and using the responses as labels. [18] Wallace and colleagues showed that commercial machine-translation systems could be imitated through query access, and Xu and colleagues showed that NLP API imitators could sometimes outperform the target on transferred domains. [19] [20]

Those papers matter because they kill the comforting story that an API boundary is automatically a wall. It is not. An API can be a teacher.

Recent LLM work makes the concern sharper. LoRD studies locality-reinforced extraction for aligned LLMs and argues that alignment-aware extraction can be more query-efficient than naive imitation. [21] A 2026 trace-inversion paper goes further: it reports that a student model fine-tuned on synthetic reasoning traces inferred from a commercial model’s answers and summaries improved substantially on MATH500 and JEEBench compared with fine-tuning on answers and summaries alone. [22]

That is not nothing. It is also not instant cloning.

The same research is revealing about the limits. These attacks are usually task-targeted. They depend on query budgets, public corpora, surrogate models, filtering, relabeling, and offline fine-tuning. They transfer slices of behavior. They do not copy the teacher’s full internal training mixture, hidden routing, safety monitors, enterprise permission model, tool environment, private launch gates, or customer trust surface.

Anthropic’s February 2026 post is important here, but it should be read carefully. Anthropic says it detected industrial-scale campaigns by DeepSeek, Moonshot AI, and MiniMax involving millions of exchanges through fraudulent accounts, with prompts targeting reasoning, coding, tool use, and agentic behavior. [10] That is a serious provider-side claim. It is evidence that labs treat capability siphoning as operationally real. It is not independent proof that any one actor obtained complete product equivalence.

The realistic threat is not a movie scene where someone steals the whole frontier model overnight. The realistic threat is continuous siphoning: cheaper followers copying selected behaviors, compressing commodity margins, and shortening the distance between leader and fast follower in specific domains.

That is dangerous enough. We do not need to exaggerate it.

Where the moat actually moved

First, inference-time systems. The best products now do more than answer text. They call tools, retrieve files, cache context, ground claims, execute code, remember workflow state, and route around failure. The visible output is the last inch of the system, not the system itself.

Second, private evaluation and release gates. Public benchmarks tell the world what everyone can see. Serious deployment depends on the tests customers never see: regression suites, abuse probes, internal red-team cases, latency targets, tool-use evals, refusal calibration, long-context retrieval checks, and domain-specific failure modes. A scraper can sample outputs. It cannot easily infer the hidden eval set that shaped the release decision.
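A release gate of this kind is easy to sketch, which is part of why it is easy to underestimate. The Python below is a hypothetical illustration, not any lab's actual process: the metric names, thresholds, and the Gate type are all invented. The value is in the private eval data feeding it, not in the twenty lines of logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One release criterion over a named private-eval metric."""
    metric: str
    threshold: float
    higher_is_better: bool = True


def failed_gates(metrics, gates):
    """Return the metric names whose gates are not met.

    A metric missing from the results counts as a failure, so a
    skipped eval cannot silently pass the release check.
    """
    failed = []
    for gate in gates:
        value = metrics.get(gate.metric)
        if value is None:
            failed.append(gate.metric)
        elif gate.higher_is_better and value < gate.threshold:
            failed.append(gate.metric)
        elif not gate.higher_is_better and value > gate.threshold:
            failed.append(gate.metric)
    return failed


# Invented gates and results for a candidate release.
gates = [
    Gate("regression_pass_rate", 0.98),
    Gate("tool_use_success", 0.90),
    Gate("p95_latency_seconds", 8.0, higher_is_better=False),
    Gate("long_context_recall", 0.85),
]
candidate = {
    "regression_pass_rate": 0.99,
    "tool_use_success": 0.87,
    "p95_latency_seconds": 6.5,
}
blocked_by = failed_gates(candidate, gates)
print("ship" if not blocked_by else f"blocked by: {blocked_by}")
```

A scraper sees only the outputs of releases that cleared every gate. It never sees the candidates that were blocked, or why.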

Third, trust and liability. Enterprises do not only buy token prediction. They buy contractual posture, privacy defaults, account controls, admin consoles, SSO, retention settings, auditability, support, incident handling, and someone to blame when things break. These are boring moats. Boring moats are excellent because competitors underestimate them.

Fourth, distribution. Google has Search, Workspace-adjacent surfaces, Vertex AI, AI Studio, and developer channels. OpenAI has consumer distribution, API adoption, and an agent platform. Anthropic has a strong developer and enterprise wedge around Claude. Meta has the open ecosystem. These are not identical strategies, which is exactly the point. Competition has fragmented across layers.

Fifth, legal friction. Terms of service are not magic shields, but they raise the cost of extraction done openly. OpenAI, Google, and Anthropic all restrict using their services or outputs to build competing models or to reverse engineer or extract the underlying system. [6] [9] [13] A determined attacker can violate terms. A large commercial buyer cannot comfortably build its AI strategy on violating them.

This is why the phrase ‘open-source killed the moat’ is too blunt. Open models absolutely compress the commodity layer. They make generic summarization, basic coding help, formatting, translation, and many benchmark-facing demos cheaper. They also force incumbents to justify premium pricing with reliability and integration rather than mystique.

But the moat did not disappear. It moved from the static artifact to the operating system around the artifact.

A concrete workflow: the enterprise coding assistant

Take a boring example: a company wants an internal coding assistant for a large, ugly codebase. This is not a demo repo. It has legacy services, half-migrated libraries, secret configuration patterns, flaky tests, permission boundaries, and a few sacred files no one is supposed to touch unless the staff engineer has had coffee.

A distiller can query the public model with coding tasks and collect good answers. That helps. It may produce a cheaper model that writes decent patches for common problems. It may even learn some of the teacher’s style: clearer explanations, better decomposition, more cautious refactors.

But the production assistant has to do other things.

It has to retrieve the right files, not just plausible files. It has to respect permissions. It has to know when a failing test is a real signal and when the test suite is lying because some fixture was written during a previous geological era. It has to call a code interpreter or build system. It has to preserve context across a multi-step edit. It has to explain changes in the format the team actually reviews. It has to avoid leaking secrets into logs. It has to produce traces that compliance and security people can inspect. It has to be monitored for abuse, prompt injection, data exfiltration, and suspicious usage patterns.
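Even the first two requirements, retrieving the right files and respecting permissions, involve machinery that never appears in the answer text. Here is a deliberately naive Python sketch, assuming a hypothetical repository layout and per-user list of readable directories. The term-overlap scoring is a stand-in for a real retriever; the point is only that the permission check runs before ranking.

```python
from pathlib import PurePosixPath


def is_allowed(path, allowed_roots):
    """True if path sits under at least one directory the caller may read.

    Uses PurePosixPath.is_relative_to, available from Python 3.9.
    """
    p = PurePosixPath(path)
    return any(p.is_relative_to(root) for root in allowed_roots)


def retrieve(query_terms, files, allowed_roots, k=3):
    """Rank readable files by naive term overlap.

    files maps a repo-relative path to its text. Filtering happens
    before scoring, so restricted content never reaches the ranking
    step, the prompt, or the logs.
    """
    terms = {t.lower() for t in query_terms}
    scored = []
    for path, text in files.items():
        if not is_allowed(path, allowed_roots):
            continue
        score = len(terms & set(text.lower().split()))
        if score:
            scored.append((score, path))
    # Highest score first; path breaks ties so results are deterministic.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [path for _, path in scored[:k]]


# Invented repository contents and permissions.
files = {
    "services/billing/invoice.py": "invoice tax rounding",
    "services/auth/secrets.py": "invoice tax token",
    "docs/readme.md": "hello world",
}
hits = retrieve(["invoice", "tax"], files, ["services/billing", "docs"])
print(hits)
```

The restricted file matches the query just as well as the permitted one. A distilled model that learned to write good patches has learned nothing about which of the two it is allowed to open.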

The answer text is the visible residue of that workflow. It is not the workflow.

This is the practical distinction that keeps getting lost. If the job is ‘write a quick helper function,’ model parity matters a lot. If the job is ‘operate safely inside our production engineering system,’ the model is only one component. You still need connectors, permissions, retrieval, evals, logging, rollout controls, and trust.

That is why open models can be both extremely important and insufficient to erase frontier-lab advantage. Both things can be true. Annoying, I know.

The boring prediction

My forecast is a barbell.

At one end, open models and fast followers keep commoditizing broad-purpose inference. The price of generic text generation falls. The quality of local and self-hosted models improves. More companies decide that good enough is not only good enough, but preferable because it is cheaper, controllable, and closer to their data boundary.

At the other end, frontier labs push harder into high-trust, high-integration, high-liability surfaces: coding agents, enterprise workflows, search-grounded systems, multimodal orchestration, regulated deployments, private eval advantage, and products where failures are expensive enough that customers pay for the surrounding machinery.

Distillation accelerates the first end of the barbell. It narrows the commodity layer and makes fast-following easier. It also makes providers expose less reasoning signal, watermark more, fingerprint more traffic, verify more accounts, and separate public demo behavior from high-value production behavior. That is the natural defensive response when the attack is not theft of raw weights, but capability siphoning through outputs.

The stronger labs should not be complacent. Open models will keep cutting into margins. Customers will get better at asking why a premium model is worth the premium. Some products that look defensible today will become wrapper dust tomorrow.

But the stronger critics should also be more precise. ‘The open model matched a benchmark’ is not the same as ‘the frontier product has no moat.’ It is a claim about one layer, often the most visible layer, and visibility is not the same as importance.

What actually matters is whether the hard-to-copy parts of the system are where customers feel the pain.

So the useful question is not whether open models will catch frontier labs on more charts. They will. The useful question is whether you are buying intelligence, or buying the machinery that makes intelligence useful when the chart is gone and the work starts.

* * *

Notes and References

1. Stanford Institute for Human-Centered AI, 2026 AI Index Report, Technical Performance chapter. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance

2. Colin White et al., LiveBench: A Challenging, Contamination-Free LLM Benchmark, ICLR 2025 / LiveBench project. https://github.com/LiveBench/LiveBench

3. Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, Transactions of the Association for Computational Linguistics, 2024. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long

4. OpenAI, New tools and features in the Responses API, 2025. https://openai.com/index/new-tools-and-features-in-the-responses-api/

5. OpenAI API documentation, Responses API overview and migration guidance. https://developers.openai.com/api/docs/guides/migrate-to-responses

6. OpenAI, Terms of Use, restrictions on reverse engineering, extraction, rate-limit bypass, and using output to develop competing models. https://openai.com/policies/row-terms-of-use/

7. Anthropic documentation, Prompt caching. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

8. Anthropic API release notes, tool use, web search, and platform features. https://docs.anthropic.com/en/release-notes/api

9. Anthropic Help Center, Can I use my Outputs to train an AI model?, March 2026. https://support.claude.com/en/articles/12326764-can-i-use-my-outputs-to-train-an-ai-model

10. Anthropic, Detecting and preventing distillation attacks, February 2026. https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

11. Google AI for Developers, Grounding with Google Search. https://ai.google.dev/gemini-api/docs/google-search

12. Google AI for Developers, Context caching. https://ai.google.dev/gemini-api/docs/caching

13. Google AI for Developers, Gemini API Additional Terms of Service, effective March 23, 2026. https://ai.google.dev/gemini-api/terms

14. Google DeepMind, Gemini 3.1 Pro Model Card, February 2026. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf

15. Meta, Introducing Llama 3.1: Our most capable models to date, July 2024. https://ai.meta.com/blog/meta-llama-3-1/

16. The Llama Team, The Llama 3 Herd of Models, 2024. https://arxiv.org/abs/2407.21783

17. Florian Tramèr et al., Stealing Machine Learning Models via Prediction APIs, USENIX Security, 2016. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer

18. Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz, Knockoff Nets: Stealing Functionality of Black-Box Models, CVPR 2019. https://openaccess.thecvf.com/content_CVPR_2019/html/Orekondy_Knockoff_Nets_Stealing_Functionality_of_Black-Box_Models_CVPR_2019_paper.html

19. Eric Wallace et al., Imitation Attacks and Defenses for Black-box Machine Translation Systems, EMNLP 2020. https://aclanthology.org/2020.emnlp-main.446/

20. Yifei Xu et al., Student Surpasses Teacher: Imitation Attacks for Black-Box NLP APIs, COLING 2022. https://aclanthology.org/2022.coling-1.252/

21. Zi Liang et al., Yes, My LoRD: Guiding Language Model Extraction with Locality Reinforced Distillation, 2024. https://arxiv.org/abs/2409.02718

22. Tingwei Zhang, John X. Morris, and Vitaly Shmatikov, How to Steal Reasoning Without Reasoning Traces, arXiv, March 2026. https://arxiv.org/abs/2603.07267

23. Long Phan et al., Humanity’s Last Exam, arXiv, 2025/2026. https://arxiv.org/abs/2501.14249

We Keep Promoting the Wrong Kind of Leader

Eloi Tay — Sun, 07 Jun 2026 19:10:10 GMT