The Benchmark Is Also Hallucinating

DeepSWE is a serious test of coding agents. The silly bit is pretending any one leaderboard can tell you whether software work is safe.

Jun 11, 2026

The wrong debate

The funniest thing about the phrase “the only trustable coding benchmark” is that it assumes trust is the kind of object you can put in a leaderboard column.

This is how engineering measurement usually goes wrong. Someone takes a useful instrument, asks it to become a religion, then acts surprised when it starts behaving like procurement theatre with a CSV export.

DeepSWE deserves attention. It is one of the more serious public attempts to measure long-horizon coding agents on repository-level work: multi-file changes, sparse prompts, behavioural verification, regression checks, and a fixed harness. Its public repository describes 113 tasks across TypeScript, Go, Python, JavaScript, and Rust, drawn from active open-source projects. Its homepage reports model pass rates, average cost, wall-clock time, and output tokens, rather than only one heroic score pretending to explain the universe. [2][3]

That is good. It is also not the same thing as being the only trustworthy benchmark for coding.

The wrong debate is whether DeepSWE is “the” benchmark. The useful question is what kind of work it forces a coding agent to do, what kind of mistakes it can see, and what kind of mistakes still walk past it wearing a lanyard.

Because coding is not one job. It is a cupboard full of jobs that engineering teams keep labelling “implementation” so the project plan looks tidy.

Sometimes coding means writing a small pure function from a docstring. Sometimes it means fixing a real issue in a Python repository. Sometimes it means changing seven files without breaking the two ancient tests that only pass on the laptop nobody is allowed to reboot. Sometimes it means reading the onboarding document, discovering it describes a product that died three quarters ago, and then quietly asking Darren why the payment service still depends on a cron job under his personal account.

No single benchmark captures all of that. A benchmark is a flashlight. DeepSWE is a better flashlight than many. It is not daylight.

Perfection was never the benchmark

Most benchmark arguments smuggle in the wrong comparison. They compare the benchmark against perfect knowledge, then declare it flawed because it is not omniscient.

Fine. It is flawed. So is the current system.

The current system is a staff engineer trying three coding agents on the same bug, a product manager reading a model card, a CTO staring at a leaderboard, and everyone pretending that “we should evaluate this properly” does not mean “someone should spend the next two Fridays building a private test harness while the roadmap looks at them like a disappointed parent.”

DeepSWE is useful because it improves the comparison we actually need. It does not ask whether a model can solve toy snippets in isolation. It asks whether an agent can enter a repository, interpret a natural request, edit multiple files, and pass behavioural tests that care about what the software does rather than whether the patch resembles the author’s preferred internal helper name. [3][4]

That matters because a large part of AI coding evaluation has been polluted by benchmarks that are public, old, narrow, overfit, saturated, or quietly measuring memory. HumanEval still matters for compact Python function synthesis, but it is not a proxy for changing a maintained product. MBPP still matters for entry-level Python programming problems, but nobody should choose an enterprise coding agent because it can write a small list manipulation function with a polite docstring. LiveCodeBench matters because it was designed around fresher contest-style coding problems and broader scenarios such as code generation, execution, test-output prediction, and self-repair. SWE-bench Verified matters historically and still offers a useful slice of GitHub issue resolution, but even OpenAI has argued that it no longer measures frontier coding progress cleanly because of contamination and test-design issues. [5][6][7][8][10]

This sounds like academic hair-splitting until you are the person approving the tool budget.

A leaderboard says Agent A is better than Agent B. In a meeting, this becomes “we have evidence.” By the time it reaches finance, it becomes “the market has validated this.” By the time it reaches engineering, it becomes “please replace the migration estimate with vibes and a screenshot.”

The serious point underneath the comedy is simple: benchmarks are not guilty because they are incomplete. They become dangerous when people forget what they are incomplete about.

What DeepSWE actually tests

DeepSWE is strongest where older coding benchmarks have been visibly creaking: long-horizon, repository-level work under ambiguity.

The tasks are not copied from existing public pull requests, according to Datacurve. They are authored as original tasks against active open-source repositories. The reference solution is held out. The prompt is what the agent sees. The verifier is executable. The intended target is observable behaviour, not a secret implementation shape. [3][4]

That design choice matters more than the score.

Imagine an engineer using an AI agent to add cancellation support to a JavaScript engine. The weak version of the task is: “change this exact function, add this exact guard, return this exact value.” The stronger version is: “make cancellation stop all queued work associated with an evaluation without breaking unrelated jobs.” The first task measures patch obedience. The second starts measuring engineering.

Datacurve’s own qualitative audit gives an example in this shape: an agent mostly built cancellation support but missed one important path where Promise callbacks could still be queued after cancellation. That is the kind of failure a shallow benchmark can miss and a behaviour-first repository task is more likely to expose. [4]

This is why DeepSWE is interesting. It is not merely harder. Harder is easy. You can make any benchmark harder by writing a prompt that sounds like it was produced by a compliance department and a dungeon master trapped in the same elevator.

The useful part is that DeepSWE changes what counts as evidence. It rewards the agent for reading the repository, preserving behaviour, interpreting a shorter request, making a coherent multi-file change, and not treating the tests as decorative furniture.

That is close to how teams want coding agents to behave. Not magic. Not autonomous product ownership. Just a competent assistant that can enter a codebase without immediately deciding that all abstractions are optional and every edge case is a lifestyle choice.

The current system is also hallucinating

The phrase “AI hallucination” has become too convenient. It makes the model sound uniquely irrational while the surrounding organisation stands there holding three inconsistent Jira tickets and a Confluence page last updated during the previous monarchy.

Benchmarks hallucinate too, just more formally.

They hallucinate when a test suite checks one implementation detail and calls it correctness. They hallucinate when a public task rewards a model for having seen the answer before. They hallucinate when a prompt tells agents not to write tests, then the leaderboard acts surprised that test-writing behaviour disappears. They hallucinate when a single pass rate quietly hides cost, time, token spend, and review burden.

DeepSWE tries to reduce several of these failure modes. Its maintainers report original tasks, programmatic verifiers, regression checks, and a fixed mini-swe-agent harness. They also report a verifier audit where DeepSWE had a 0.3% false-positive rate and 1.1% false-negative rate on the reviewed sample, compared with 8.5% and 24.0% for SWE-Bench Pro. That is a striking claim. It is also a maintainer-reported claim using an LLM-based analyser as part of the audit, so the right reaction is not worship. It is “excellent, now replicate it.” [4]
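For anyone tempted to take “now replicate it” literally, here is a minimal sketch of how verifier error rates could be computed from a hand-reviewed sample. Everything in it is an assumption for illustration: the `AuditRow` type, its field names, the toy data, and the choice of denominators (false positives over patches judged incorrect, false negatives over patches judged correct). Datacurve's audit may define the rates differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRow:
    """One reviewed patch: what the verifier said vs. what a reviewer concluded."""
    verifier_passed: bool   # verdict from the automated verifier
    actually_correct: bool  # verdict from independent review (hypothetical label)


def verifier_error_rates(rows: list[AuditRow]) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate).

    Assumed definitions:
      false positive = verifier passed a patch the reviewer judged incorrect
      false negative = verifier failed a patch the reviewer judged correct
    """
    incorrect = [r for r in rows if not r.actually_correct]
    correct = [r for r in rows if r.actually_correct]
    fp = sum(r.verifier_passed for r in incorrect)
    fn = sum(not r.verifier_passed for r in correct)
    fpr = fp / len(incorrect) if incorrect else 0.0
    fnr = fn / len(correct) if correct else 0.0
    return fpr, fnr


# Toy data, not DeepSWE's: 4 incorrect patches (1 wrongly passed),
# 6 correct patches (2 wrongly failed).
sample = (
    [AuditRow(True, False)] + [AuditRow(False, False)] * 3
    + [AuditRow(False, True)] * 2 + [AuditRow(True, True)] * 4
)
fpr, fnr = verifier_error_rates(sample)
print(f"false-positive rate: {fpr:.2f}, false-negative rate: {fnr:.2f}")
```

The arithmetic is trivial. The expensive part is the `actually_correct` column, which someone has to produce by reading patches.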

This is the adult position, which makes it unpopular in rooms that prefer a winner.

DeepSWE may be a better measurement tool for one important construct. It may expose differences that older leaderboards compress. Its current public page reports a large spread between model configurations, with GPT-5.5 [xhigh] at 70% ± 4% Pass@1 and several other frontier configurations materially lower. That is useful signal, especially compared with benchmarks where top models cluster into one polite blob. [2]

But signal is not truth. Signal is evidence under a measurement design.
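A rough way to put a number on “evidence under a measurement design”: if each of the 113 tasks is treated as an independent pass/fail trial, a 70% pass rate carries a binomial standard error of roughly four points. That independence assumption is made here purely for illustration, and the sketch does not claim to reproduce how the leaderboard computes its own interval.

```python
import math


def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate, treating tasks as independent trials."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


# 113 tasks is the figure from the public repository; 0.70 is the reported Pass@1.
se = binomial_standard_error(0.70, 113)
print(f"standard error: {se:.3f}")  # about 0.043
```

On a suite this size, a gap of a few points between two configurations is hard to distinguish from noise, while a gap of twenty points is not.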

A thermometer can tell you the room temperature. It cannot tell you whether the building has bad insulation, whether finance approved the heating bill, or whether Darren opened the window because the server closet smells like hot plastic. This is not a criticism of thermometers. It is a criticism of people who use one reading to explain the entire office.

Verification is the work

The operating lesson for builders is not “use DeepSWE” or “ignore DeepSWE.” It is that verification has become the real work.

Take a normal product team. They are considering a coding agent for repository maintenance. The demo is impressive. The agent changes three files, explains itself nicely, and produces a summary with the confidence of someone who has never had to be on-call. The team runs the tests. Most pass. The agent apologises for the one that fails and suggests a patch. Everyone has a small spiritual experience.

Then the staff engineer asks three dull questions, which are usually the important ones:

Did it change the right behaviour, or just satisfy the visible tests?

Did it preserve unrelated behaviour, or only avoid breaking the happy path?

Did it reduce future maintenance, or did it create a clever little shrine to today’s prompt?

This is where DeepSWE is useful as a pattern, not just as a leaderboard. Behavioural verifiers, regression checks, task-specific tests, cost reporting, time reporting, and trajectory audits are not benchmark luxuries. They are what serious AI-assisted engineering increasingly needs inside the company.

The workflow is boring and therefore likely to be correct. Pick real internal tasks. Freeze the base commit. Give the agent the same prompt a developer would actually write. Let it work in the same harness you intend to use. Run existing tests, targeted behavioural tests, static analysis, security checks, and review. Track pass rate, review time, rework, cost, latency, rollback risk, and the specific failure mode.

Then do the annoying part: read the failures.

The failures are the product. A model that fails by missing one parallel requirement is different from one that breaks integration. A model that writes tests and catches its own mistake is different from one that emits a majestic patch and then leaves before the bill arrives. A model that succeeds only when the answer is in git history is not solving your problem. It is doing archaeology with an API key.
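The bookkeeping behind that workflow can be small. Below is a minimal sketch, assuming one record per agent attempt; the `EvalRun` fields, the failure-mode labels, the task IDs, and the numbers are all hypothetical and would need to match whatever your team actually tracks.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass(frozen=True)
class EvalRun:
    """One agent attempt at one internal task (all field names are illustrative)."""
    task_id: str
    base_commit: str                    # frozen commit the agent started from
    passed: bool                        # existing and targeted behavioural tests all green
    review_minutes: float               # human time spent verifying the patch
    cost_usd: float
    failure_mode: Optional[str] = None  # e.g. "missed_requirement", "regression"


def summarise(runs: list[EvalRun]) -> dict:
    """Aggregate the numbers worth tracking, plus a tally of failure modes."""
    if not runs:
        raise ValueError("no runs to summarise")
    failures = Counter(r.failure_mode or "unlabelled" for r in runs if not r.passed)
    return {
        "pass_rate": sum(r.passed for r in runs) / len(runs),
        "mean_review_minutes": mean([r.review_minutes for r in runs]),
        "total_cost_usd": sum(r.cost_usd for r in runs),
        "failure_modes": dict(failures),
    }


# Made-up runs for illustration only.
runs = [
    EvalRun("billing-001", "abc123", True, 12.0, 1.40),
    EvalRun("billing-002", "abc123", False, 35.0, 2.10, "missed_requirement"),
    EvalRun("flags-001", "def456", False, 20.0, 0.90, "regression"),
    EvalRun("flags-002", "def456", True, 8.0, 0.60),
]
report = summarise(runs)
print(report)
```

The `failure_modes` tally is the field that earns its keep: two agents with the same pass rate can fail in very different ways, and only a human reading the patch can fill in that label.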

This is why a private eval suite matters. DeepSWE can tell you something about frontier agent capability on open-source repository tasks. It cannot tell you whether the agent understands your strangler-fig migration, your feature flag conventions, your audit trail requirements, or the part of the billing service that everyone says is “temporary” in the same tone archaeologists use for pottery shards.

The product wrapper still matters

One of DeepSWE’s best choices is also one of its limitations: it fixes the harness.

Datacurve runs models through mini-swe-agent for consistency. That reduces scaffolding noise. It means the leaderboard is trying to compare model capability rather than the hidden advantages of one vendor’s product environment, prompt, editing tool, or carefully massaged workflow. [2][4]

This is clean. In practice, nobody buys “model capability” in a jar.

Teams use Claude Code, Codex CLI, Cursor, Gemini CLI, internal wrappers, security scanners, branch protection, code review bots, ticket systems, and whatever script Michael wrote during the incident because it was 2 a.m. and “temporary” had once again become architecture.

A fixed harness answers one question: how do these models behave when forced through the same narrow pipe?

A native product answers a different question: how does the whole system perform when developers actually use it?

Both questions matter. Confusing them is how teams buy a model because it wins a benchmark, then discover that the expensive part is not generation. It is integration, review, permissions, context packaging, security approval, developer trust, and the small matter of whether the thing can open the right files without turning the repository into a choose-your-own-adventure novel.

This is also why preference is not performance. RealHumanEval found that benchmark improvements can translate into programmer productivity gains, but not proportionally; it also found that programmer preferences did not necessarily track actual performance. [9]

That should bother anyone who relies on demos. Demos are preference machines. They are designed to make the tool feel obvious. Production is a performance machine. It is designed to punish every missing edge case you clapped for earlier.

What teams should actually do

Use DeepSWE as a serious external signal. Do not use it as a solitary purchasing ritual.

For researchers, it belongs near the centre of the long-horizon coding-agent stack. Pair it with function-level benchmarks, freshness-oriented benchmarks, and quality-oriented evaluations for security, maintainability, and efficiency. Report the harness, model version, effort mode, cost, time, token use, and uncertainty. Otherwise the score is not a result. It is a badge.

For practitioners, the minimum sensible evaluation looks more like a release checklist than a leaderboard screenshot.

Start with the public evidence: DeepSWE for long-horizon repository work; HumanEval or MBPP for small synthesis primitives; LiveCodeBench for fresher algorithmic coding; SWE-bench-style tasks for issue resolution; human-in-the-loop studies or internal pilots for productivity. Then add your private work: five to twenty representative repository tasks, held-out behavioural tests, review by your engineers, and explicit measurement of how much effort the agent saves or moves downstream.

Downstream is where lies go to become tickets.

Do not only ask “did the agent pass?” Ask who had to verify it, what they checked, what they missed, how long it took, and whether the resulting code is something you want to maintain after the agent has gone off to generate a haiku about root cause analysis.

The uncomfortable conclusion is that good AI coding evaluation looks less like model ranking and more like engineering management. It requires boring infrastructure, realistic tasks, clear ownership, review discipline, and enough humility to admit when the tool is useful but not safe unattended.

That is not a strategy deck sentence. That is the work.

The annoying question

DeepSWE is not the only trustable coding benchmark. That claim is too strong and, more importantly, too small.

The better claim is sharper: DeepSWE is one of the most useful current public benchmarks for long-horizon, behaviour-verified, repository-level coding agents. It measures a slice of work that teams actually care about and that older leaderboards often blur. It exposes failures that look like real engineering failures: missed requirements, integration mistakes, regressions, weak verification, and suspicious help from public history.

But it does not measure coding in general. It does not fully measure native product ergonomics, enterprise context, security, maintainability, review quality, private-codebase behaviour, or whether a developer actually gets their afternoon back. It does not turn judgment into a number. It gives judgment something better to inspect.

This is the useful frame.

The point of DeepSWE is not that we have finally found the one true benchmark. The point is that a serious benchmark makes the rest of our evaluation habits look lazy.

That is why the annoying question for teams is not “Which model won?”

It is: “What would our own benchmark reveal about the way we already ship software?”

Most organisations will not like the answer. Which is usually how you know the benchmark is starting to work.

Notes and References

1. Uploaded report, “DeepSWE and the Trustworthiness of Coding Benchmarks,” 2026. Used for the source framing that DeepSWE is a strong current public benchmark for long-horizon repository-level coding-agent evaluation, but not the only trustworthy benchmark for all coding work.

2. Datacurve, “DeepSWE,” Datacurve, 2026. Used for the current public leaderboard snapshot, including the 30 May 2026 update, Pass@1, cost, time, output-token fields, and the statement that all models are run on mini-swe-agent for consistency.

3. Datacurve, “datacurve-ai/deep-swe,” GitHub, 2026. Used for the public repository description: 113 tasks across TypeScript, Go, Python, JavaScript, and Rust; task format; held-out reference solution; and program-based behavioural verifiers.

4. Datacurve, “DeepSWE: Measuring frontier coding agents on original, long-horizon engineering tasks,” Datacurve Blog, 2026. Used for task-construction methodology, original-task claims, verifier design, false-positive and false-negative audit numbers, harness discussion, qualitative failure taxonomy, and examples of missed requirements and test-writing behaviour.

5. SWE-bench, “SWE-bench Verified,” SWE-bench, 2026. Used for the claim that SWE-bench Verified is a human-filtered subset of 500 SWE-bench instances for evaluating coding agents and language models.

6. OpenAI, “HumanEval: Hand-Written Evaluation Set,” GitHub, 2021. Used for HumanEval as a function-level code-generation benchmark and evaluation harness associated with “Evaluating Large Language Models Trained on Code.”

7. Google Research, “Mostly Basic Python Problems Dataset,” GitHub, 2021. Used for MBPP as around 1,000 crowd-sourced entry-level Python programming problems with task descriptions, solutions, and automated tests.

8. Naman Jain et al., “LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code,” arXiv, 2024. Used for LiveCodeBench as a continuously updated coding benchmark with scenarios beyond code generation, including self-repair, code execution, and test-output prediction.

9. Hussein Mozannar et al., “The RealHumanEval: Evaluating Large Language Models’ Abilities to Support Programmers,” OpenReview, 2024. Used for the human-in-the-loop finding that benchmark gains can improve productivity but not proportionally, and that programmer preference signals do not necessarily correlate with actual performance.

10. OpenAI, “Why SWE-bench Verified no longer measures frontier coding capabilities,” OpenAI, 2026. Used for the contamination and test-design concerns around SWE-bench Verified and the broader argument that public coding benchmarks can stop measuring genuine frontier progress cleanly.
