We Keep Promoting the Wrong Kind of Leader

Jun 07, 2026

Every few months, the same argument comes back wearing a new jacket.

An open model lands near the top of a public leaderboard. Someone posts a chart. The conclusion arrives five seconds later: the frontier labs are cooked.

I think that conclusion is too neat. It confuses a model with a product, and a product with a business.

The wrong debate is whether an open model can match a closed model on a public test. Of course it can, in slices, and sometimes very impressively. The useful question is what actually has to be copied before a customer can move serious work without caring who made the underlying model.

My answer is simple: the moat has moved up the stack.

Benchmark parity still matters. It pressures prices. It embarrasses marketing decks. It makes lazy product managers nervous, which is a public service. But it is no longer the cleanest map of competitive power. The high-value AI product is a runtime system wrapped around a model: tool orchestration, retrieval, policy enforcement, private evaluations, monitoring, enterprise controls, distribution, pricing, and trust.

You can scrape answers. You can fine-tune a student. You can copy vibes. You cannot cheaply scrape the machinery that makes the thing work on Monday morning inside a company with permissions, angry engineers, compliance reviews, and audit logs.

* * *

Benchmark parity is a bad trophy

Stanford’s 2026 AI Index gives us the more boring, more useful version of the story. It reports that top-model performance is converging, that the leaders are packed more tightly together, and that competition is shifting toward cost, reliability, and domain-specific performance. It also reports that the open-versus-closed gap, after narrowing in 2024, widened again, with the top closed model ahead of the top open model by 3.3 percentage points as of March 2026. [1]

That is not a victory lap for closed models. It is a warning against over-reading any single chart.

Public benchmarks are useful instruments, but they are increasingly noisy instruments. Stanford’s report points to fast benchmark saturation, invalid-question rates as high as 42 percent on some widely used evaluations, and concerns that leaderboard standing can partly reflect adaptation to the platform rather than general capability. [1]

This sounds obvious, but teams miss it all the time: when the test is unstable, parity on the test is not a durable economic fact.

LiveBench exists because static benchmarks are vulnerable to contamination and obsolescence. Its designers release new questions regularly and use objective ground-truth answers to reduce dependence on LLM judges. [2] Humanity’s Last Exam exists for a similar reason: older academic tests were becoming too easy for the strongest systems, so the measurement target had to move. [23]
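To make the idea concrete, here is a minimal sketch of contamination-aware scoring: only questions released after a model's training cutoff count, and answers are checked against ground truth rather than by an LLM judge. This is not LiveBench's actual code or scoring rule. The `Question` type, the `fresh_accuracy` function, and the exact-match normalization are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List


@dataclass(frozen=True)
class Question:
    prompt: str
    answer: str      # objective ground-truth answer
    released: date   # when the question first became public


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting does not matter."""
    return " ".join(text.strip().lower().split())


def fresh_accuracy(questions: List[Question],
                   predictions: Dict[str, str],
                   training_cutoff: date) -> float:
    """Exact-match accuracy on questions released after the training cutoff.

    Hypothetical helper: older questions are skipped because the model may
    have seen them during training.
    """
    fresh = [q for q in questions if q.released > training_cutoff]
    if not fresh:
        raise ValueError("no questions newer than the training cutoff")
    correct = sum(
        normalize(predictions.get(q.prompt, "")) == normalize(q.answer)
        for q in fresh
    )
    return correct / len(fresh)


questions = [
    Question("Q1", "Paris", date(2025, 1, 10)),   # older than the cutoff, ignored
    Question("Q2", "42", date(2026, 2, 1)),
    Question("Q3", "blue whale", date(2026, 3, 1)),
]
predictions = {"Q1": "Paris", "Q2": " 42 ", "Q3": "elephant"}
accuracy = fresh_accuracy(questions, predictions, training_cutoff=date(2025, 12, 31))
```

The design point is the filter, not the scorer: a score on the fresh subset can only be compared with another score computed against the same cutoff.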

The deeper problem is that benchmark capability is jagged. A model can look equivalent on a math set and then behave differently on long-context retrieval, tool use, refusal calibration, latency, consistency, or enterprise permissions. Lost in the Middle showed that language models can perform worse when relevant information appears in the middle of a long context, even when the same information is easier to use at the beginning or end. [3] That kind of failure rarely shows up in the leaderboard screenshot that gets circulated on social media.
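A position sweep is the simplest way to see this kind of failure for yourself. The sketch below places one relevant document at every slot among distractors and records whether the answer comes back. It is not the paper's protocol. The `ask_model` callable is a hypothetical stand-in for whatever client you use, and `toy_edge_reader` is a deliberately crude fake model that only looks at the first and last documents.

```python
from typing import Callable, Dict, List


def build_context(needle: str, distractors: List[str], position: int) -> str:
    """Insert the relevant document at `position` among the distractors."""
    docs = distractors[:position] + [needle] + distractors[position:]
    return "\n".join(f"[doc {i}] {d}" for i, d in enumerate(docs))


def position_sweep(ask_model: Callable[[str, str], str],
                   needle: str,
                   answer: str,
                   question: str,
                   distractors: List[str]) -> Dict[int, bool]:
    """Return, for each insert position, whether the reply contains the answer."""
    results = {}
    for pos in range(len(distractors) + 1):
        context = build_context(needle, distractors, pos)
        reply = ask_model(context, question)
        results[pos] = answer.lower() in reply.lower()
    return results


def toy_edge_reader(context: str, question: str) -> str:
    """Fake model (assumption for the demo): only 'reads' the first and last docs."""
    lines = context.split("\n")
    return lines[0] + " " + lines[-1]


distractors = [f"Filler note number {i}." for i in range(4)]
sweep = position_sweep(
    toy_edge_reader,
    needle="The deploy key rotates every 17 days.",
    answer="17 days",
    question="How often does the deploy key rotate?",
    distractors=distractors,
)
```

With a real model the result is a curve rather than a hard cutoff, and it takes many needles and questions before the shape means anything.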

Preference is not performance. A public arena vote is not the same thing as a production system surviving a month of messy workflows.

The product is not the model

The easiest way to misunderstand the frontier labs is to keep imagining that they sell a single static model. That was closer to true in the early chatbot phase. It is much less true now.

OpenAI’s Responses API is a good example. The official docs describe a stateful interface for agent-like applications, with built-in tools such as web search, file search, computer use, Code Interpreter, remote MCP support, and reasoning summaries. [4] [5] A leaderboard score does not recreate that product surface. A copied answer style does not reproduce the server-side tool loop, preserved context, encrypted reasoning items, or enterprise controls around the API.

Google’s Gemini stack is similarly not just a model endpoint. The developer docs cover grounding with Google Search, context caching, structured outputs, built-in tool combinations, file inputs, and function calling. [11] [12] Google’s Gemini 3.1 Pro model card describes internal safety evaluations, human red teaming, and continuous testing under its Frontier Safety Framework. [14] Again, the interesting thing is not only the model. It is the placement of the model inside search, cloud, developer tools, enterprise surfaces, and a safety-release process.

Anthropic’s moat is different again. It is not only that Claude answers well. Anthropic has built a safety- and developer-oriented deployment regime around tool use, prompt caching, web search, system-card discipline, and usage restrictions. [7] [8] It also now talks publicly about detecting industrial-scale distillation attempts through account behavior, request metadata, traffic patterns, and capability-targeted prompts. [10]

Meta is the useful counterexample. Meta deliberately weakens the ‘weights secrecy’ moat by releasing Llama models openly. But that does not mean Meta has no moat. Its strategy is ecosystem capture: developer mindshare, distribution, open weights, safety tools, partners, and the ability to keep reinvesting at huge scale. Meta’s own Llama 3.1 announcement says the license changes allow developers to use outputs from Llama models, including the 405B model, to improve other models. [15] That is not a lab forgetting to defend itself. That is a lab choosing a different layer to defend.

So when someone says, ‘the open model caught up,’ the first answer should be: caught up to what? The static score? The assistant product? The agent runtime? The compliance posture? The distribution channel? The cost curve?

The phrase ‘model moat’ is now too small. The product moat is the thing.

Distillation is real. It is just narrower than the panic

The anti-hype version of this argument is not that extraction is fake. It is very real.

The model-stealing literature goes back years. Tramèr and colleagues showed in 2016 that black-box prediction APIs could be copied with surprising fidelity in several supervised-learning settings. [17] Later work such as Knockoff Nets showed that attackers could train functional substitutes for vision models by querying a target and using the responses as labels. [18] Wallace and colleagues showed that commercial machine-translation systems could be imitated through query access, and Xu and colleagues showed that NLP API imitators could sometimes outperform the target on transferred domains. [19] [20]
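The core loop in this line of work is small enough to sketch: send queries, keep the returned labels, fit a substitute, then measure how often the substitute agrees with the target. The toy below is not the method of any cited paper. The target is an invented one-dimensional threshold classifier, and `make_target`, `extract_substitute`, and `fidelity` are hypothetical names chosen for illustration.

```python
from typing import Callable

Classifier = Callable[[float], int]


def make_target(threshold: float) -> Classifier:
    """Hidden model behind an API: the caller only ever sees the 0/1 label."""
    return lambda x: 1 if x >= threshold else 0


def extract_substitute(query_api: Classifier, budget: int) -> Classifier:
    """Spend `budget` queries (at least 2) on an even grid over [0, 1] and fit a threshold."""
    xs = [i / (budget - 1) for i in range(budget)]
    labels = [query_api(x) for x in xs]  # the API's answers become training labels
    highest_zero = max((x for x, y in zip(xs, labels) if y == 0), default=0.0)
    lowest_one = min((x for x, y in zip(xs, labels) if y == 1), default=1.0)
    estimate = (highest_zero + lowest_one) / 2  # midpoint of the observed label flip
    return lambda x: 1 if x >= estimate else 0


def fidelity(target: Classifier, substitute: Classifier, samples: int = 1000) -> float:
    """Fraction of evenly spaced test points where the two models agree."""
    points = [i / (samples - 1) for i in range(samples)]
    agree = sum(target(x) == substitute(x) for x in points)
    return agree / samples


target = make_target(0.37)            # the attacker never sees 0.37 directly
student = extract_substitute(target, budget=50)
score = fidelity(target, student)
```

Fifty queries are enough here only because the target has a single parameter. Real attacks pay far more per unit of fidelity, which is why query budgets are central to this literature.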

Those papers matter because they kill the comforting story that an API boundary is automatically a wall. It is not. An API can be a teacher.

Recent LLM work makes the concern sharper. LoRD studies locality-reinforced extraction for aligned LLMs and argues that alignment-aware extraction can be more query-efficient than naive imitation. [21] A 2026 trace-inversion paper goes further. Its authors infer synthetic reasoning traces from a commercial model’s answers and summaries, fine-tune a student on those traces, and report substantially better results on MATH500 and JEEBench than fine-tuning on the answers and summaries alone. [22]

That is not nothing. It is also not instant cloning.

The same research is revealing about the limits. These attacks are usually task-targeted. They depend on query budgets, public corpora, surrogate models, filtering, relabeling, and offline fine-tuning. They transfer slices of behavior. They do not copy the teacher’s full internal training mixture, hidden routing, safety monitors, enterprise permission model, tool environment, private launch gates, or customer trust surface.

Anthropic’s February 2026 post is important here, but it should be read carefully. Anthropic says it detected industrial-scale campaigns by DeepSeek, Moonshot AI, and MiniMax involving millions of exchanges through fraudulent accounts, with prompts targeting reasoning, coding, tool use, and agentic behavior. [10] That is a serious provider-side claim. It is evidence that labs treat capability siphoning as operationally real. It is not independent proof that any one actor obtained complete product equivalence.

The realistic threat is not a movie scene where someone steals the whole frontier model overnight. The realistic threat is continuous siphoning: cheaper followers copying selected behaviors, compressing commodity margins, and shortening the distance between leader and fast follower in specific domains.

That is dangerous enough. We do not need to exaggerate it.

Where the moat actually moved

First, inference-time systems. The best products now do more than answer text. They call tools, retrieve files, cache context, ground claims, execute code, remember workflow state, and route around failure. The visible output is the last inch of the system, not the system itself.

Second, private evaluation and release gates. Public benchmarks tell the world what everyone can see. Serious deployment depends on the tests customers never see: regression suites, abuse probes, internal red-team cases, latency targets, tool-use evals, refusal calibration, long-context retrieval checks, and domain-specific failure modes. A scraper can sample outputs. It cannot easily infer the hidden eval set that shaped the release decision.
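A release gate of this kind can be sketched in a few lines: every private metric needs an absolute floor and a limit on how far it may regress from the shipped baseline. The metric names, thresholds, and scores below are invented for illustration and do not describe any lab's actual process. The sketch also assumes every metric is "higher is better".

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Gate:
    metric: str
    minimum: float          # absolute floor the candidate must reach
    max_regression: float   # largest allowed drop versus the shipped baseline


def release_blockers(gates: List[Gate],
                     baseline: Dict[str, float],
                     candidate: Dict[str, float]) -> List[str]:
    """Return human-readable reasons to block the release (empty list = ship)."""
    blockers = []
    for gate in gates:
        score = candidate.get(gate.metric)
        if score is None:
            blockers.append(f"{gate.metric}: missing result")
            continue
        if score < gate.minimum:
            blockers.append(
                f"{gate.metric}: {score:.2f} below floor {gate.minimum:.2f}"
            )
        # A metric with no baseline entry cannot regress, so default to `score`.
        drop = baseline.get(gate.metric, score) - score
        if drop > gate.max_regression:
            blockers.append(f"{gate.metric}: regressed by {drop:.2f}")
    return blockers


# Hypothetical private suite: names and numbers are assumptions.
GATES = [
    Gate("tool_use_success", minimum=0.90, max_regression=0.02),
    Gate("refusal_calibration", minimum=0.85, max_regression=0.03),
    Gate("long_context_recall", minimum=0.80, max_regression=0.05),
]
baseline = {"tool_use_success": 0.93, "refusal_calibration": 0.88, "long_context_recall": 0.84}
candidate = {"tool_use_success": 0.95, "refusal_calibration": 0.84, "long_context_recall": 0.83}

blockers = release_blockers(GATES, baseline, candidate)
```

In this example the candidate improves on tool use and still fails the gate because refusal calibration slipped. A scraper sampling outputs never sees that decision.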

Third, trust and liability. Enterprises do not only buy token prediction. They buy contractual posture, privacy defaults, account controls, admin consoles, SSO, retention settings, auditability, support, incident handling, and someone to blame when things break. These are boring moats. Boring moats are excellent because competitors underestimate them.

Fourth, distribution. Google has Search, Workspace-adjacent surfaces, Vertex AI, AI Studio, and developer channels. OpenAI has consumer distribution, API adoption, and an agent platform. Anthropic has a strong developer and enterprise wedge around Claude. Meta has the open ecosystem. These are not identical strategies, which is exactly the point. Competition has fragmented across layers.

Fifth, legal friction. Terms of service are not magic shields, but they raise the cost of extraction done openly. OpenAI, Google, and Anthropic all restrict using their services or outputs to build competing models or to reverse engineer or extract the underlying system. [6] [9] [13] A determined attacker can violate terms. A large commercial buyer cannot comfortably build its AI strategy on violating them.

This is why the phrase ‘open-source killed the moat’ is too blunt. Open models absolutely compress the commodity layer. They make generic summarization, basic coding help, formatting, translation, and many benchmark-facing demos cheaper. They also force incumbents to justify premium pricing with reliability and integration rather than mystique.

But the moat did not disappear. It moved from the static artifact to the operating system around the artifact.

A concrete workflow: the enterprise coding assistant

Take a boring example: a company wants an internal coding assistant for a large, ugly codebase. This is not a demo repo. It has legacy services, half-migrated libraries, secret configuration patterns, flaky tests, permission boundaries, and a few sacred files no one is supposed to touch unless the staff engineer has had coffee.

A distiller can query the public model with coding tasks and collect good answers. That helps. It may produce a cheaper model that writes decent patches for common problems. It may even learn some of the teacher’s style: clearer explanations, better decomposition, more cautious refactors.

But the production assistant has to do other things.

It has to retrieve the right files, not just plausible files. It has to respect permissions. It has to know when a failing test is a real signal and when the test suite is lying because some fixture was written during a previous geological era. It has to call a code interpreter or build system. It has to preserve context across a multi-step edit. It has to explain changes in the format the team actually reviews. It has to avoid leaking secrets into logs. It has to produce traces that compliance and security people can inspect. It has to be monitored for abuse, prompt injection, data exfiltration, and suspicious usage patterns.
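Two of those requirements, respecting permissions and keeping secrets out of logs, are easy to sketch. The code below filters documents by group membership before ranking them, then redacts secret-looking values before anything reaches the audit trail. It is an illustrative toy, not a production design. The `Doc` type, the keyword-overlap ranking, the regex, and the file paths are all assumptions.

```python
import re
from dataclasses import dataclass
from typing import Dict, FrozenSet, List

# Assumption: secrets look like "api_key=...", "token: ...", or "password = ...".
SECRET_PATTERN = re.compile(r"(?i)\b(api[_-]?key|token|password)\s*[:=]\s*\S+")


@dataclass(frozen=True)
class Doc:
    path: str
    text: str
    allowed_groups: FrozenSet[str]


def redact(text: str) -> str:
    """Replace secret-looking values so they never reach the logs."""
    return SECRET_PATTERN.sub(lambda m: f"{m.group(1)}=[REDACTED]", text)


def retrieve(query: str, docs: List[Doc], user_groups: FrozenSet[str],
             audit_log: List[Dict[str, object]], k: int = 2) -> List[Doc]:
    """Permission check first, ranking second, audit entry last."""
    terms = set(query.lower().split())
    visible = [d for d in docs if d.allowed_groups & user_groups]
    scored = [(len(terms & set(d.text.lower().split())), d) for d in visible]
    ranked = sorted((s for s in scored if s[0] > 0),
                    key=lambda s: (-s[0], s[1].path))
    hits = [d for _, d in ranked[:k]]
    audit_log.append({"query": redact(query), "paths": [d.path for d in hits]})
    return hits


docs = [
    Doc("services/billing/retry.py", "retry payment with backoff", frozenset({"eng"})),
    Doc("secrets/prod.env", "payment api_key=abc123 retry", frozenset({"sre"})),
    Doc("docs/onboarding.md", "welcome to the team", frozenset({"eng", "sre"})),
]
audit_log: List[Dict[str, object]] = []
hits = retrieve("payment retry password=hunter2", docs, frozenset({"eng"}), audit_log)
```

The order of operations is the point. Filtering after ranking would let a restricted file influence which results appear, and logging before redaction would put the secret in the audit trail.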

The answer text is the visible residue of that workflow. It is not the workflow.

This is the practical distinction that keeps getting lost. If the job is ‘write a quick helper function,’ model parity matters a lot. If the job is ‘operate safely inside our production engineering system,’ the model is only one component. You still need connectors, permissions, retrieval, evals, logging, rollout controls, and trust.

That is why open models can be both extremely important and insufficient to erase frontier-lab advantage. Both things can be true. Annoying, I know.

The boring prediction

My forecast is a barbell.

At one end, open models and fast followers keep commoditizing broad-purpose inference. The price of generic text generation falls. The quality of local and self-hosted models improves. More companies decide that good enough is not only good enough, but preferable because it is cheaper, controllable, and closer to their data boundary.

At the other end, frontier labs push harder into high-trust, high-integration, high-liability surfaces: coding agents, enterprise workflows, search-grounded systems, multimodal orchestration, regulated deployments, private eval advantage, and products where failures are expensive enough that customers pay for the surrounding machinery.

Distillation accelerates the first end of the barbell. It narrows the commodity layer and makes fast-following easier. It also makes providers expose less reasoning signal, watermark more, fingerprint more traffic, verify more accounts, and separate public demo behavior from high-value production behavior. That is the natural defensive response when the attack is not theft of raw weights, but capability siphoning through outputs.

The stronger labs should not be complacent. Open models will keep cutting into margins. Customers will get better at asking why a premium model is worth the premium. Some products that look defensible today will become wrapper dust tomorrow.

But the stronger critics should also be more precise. ‘The open model matched a benchmark’ is not the same as ‘the frontier product has no moat.’ It is a claim about one layer, often the most visible layer, and visibility is not the same as importance.

What actually matters is whether the hard-to-copy parts of the system are where customers feel the pain.

So the useful question is not whether open models will catch frontier labs on more charts. They will. The useful question is whether you are buying intelligence, or buying the machinery that makes intelligence useful when the chart is gone and the work starts.

* * *

Notes and References

1. Stanford Institute for Human-Centered AI, 2026 AI Index Report, Technical Performance chapter. https://hai.stanford.edu/ai-index/2026-ai-index-report/technical-performance

2. Colin White et al., LiveBench: A Challenging, Contamination-Free LLM Benchmark, ICLR 2025 / LiveBench project. https://github.com/LiveBench/LiveBench

3. Nelson F. Liu et al., Lost in the Middle: How Language Models Use Long Contexts, Transactions of the Association for Computational Linguistics, 2024. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long

4. OpenAI, New tools and features in the Responses API, 2025. https://openai.com/index/new-tools-and-features-in-the-responses-api/

5. OpenAI API documentation, Responses API overview and migration guidance. https://developers.openai.com/api/docs/guides/migrate-to-responses

6. OpenAI, Terms of Use, restrictions on reverse engineering, extraction, rate-limit bypass, and using output to develop competing models. https://openai.com/policies/row-terms-of-use/

7. Anthropic documentation, Prompt caching. https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

8. Anthropic API release notes, tool use, web search, and platform features. https://docs.anthropic.com/en/release-notes/api

9. Anthropic Help Center, Can I use my Outputs to train an AI model?, March 2026. https://support.claude.com/en/articles/12326764-can-i-use-my-outputs-to-train-an-ai-model

10. Anthropic, Detecting and preventing distillation attacks, February 2026. https://www.anthropic.com/news/detecting-and-preventing-distillation-attacks

11. Google AI for Developers, Grounding with Google Search. https://ai.google.dev/gemini-api/docs/google-search

12. Google AI for Developers, Context caching. https://ai.google.dev/gemini-api/docs/caching

13. Google AI for Developers, Gemini API Additional Terms of Service, effective March 23, 2026. https://ai.google.dev/gemini-api/terms

14. Google DeepMind, Gemini 3.1 Pro Model Card, February 2026. https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-1-Pro-Model-Card.pdf

15. Meta, Introducing Llama 3.1: Our most capable models to date, July 2024. https://ai.meta.com/blog/meta-llama-3-1/

16. The Llama Team, The Llama 3 Herd of Models, 2024. https://arxiv.org/abs/2407.21783

17. Florian Tramèr et al., Stealing Machine Learning Models via Prediction APIs, USENIX Security, 2016. https://www.usenix.org/conference/usenixsecurity16/technical-sessions/presentation/tramer

18. Tribhuvanesh Orekondy, Bernt Schiele, and Mario Fritz, Knockoff Nets: Stealing Functionality of Black-Box Models, CVPR 2019. https://openaccess.thecvf.com/content_CVPR_2019/html/Orekondy_Knockoff_Nets_Stealing_Functionality_of_Black-Box_Models_CVPR_2019_paper.html

19. Eric Wallace et al., Imitation Attacks and Defenses for Black-box Machine Translation Systems, EMNLP 2020. https://aclanthology.org/2020.emnlp-main.446/

20. Yifei Xu et al., Student Surpasses Teacher: Imitation Attacks for Black-Box NLP APIs, COLING 2022. https://aclanthology.org/2022.coling-1.252/

21. Zi Liang et al., Yes, My LoRD: Guiding Language Model Extraction with Locality Reinforced Distillation, 2024. https://arxiv.org/abs/2409.02718

22. Tingwei Zhang, John X. Morris, and Vitaly Shmatikov, How to Steal Reasoning Without Reasoning Traces, arXiv, March 2026. https://arxiv.org/abs/2603.07267

23. Long Phan et al., Humanity’s Last Exam, arXiv, 2025/2026. https://arxiv.org/abs/2501.14249
