AI Is Cheap. Good Ideas Still Aren't.
The useful question is not whether AI saves hours. It is whether it lets teams buy more serious attempts without lowering the bar.
Most AI ROI debates start with a payroll spreadsheet.
How many people can this replace? How many hours can this shave? How much output can we squeeze from the same org chart?
I think that is the wrong debate.
The useful question is not whether AI can make average work cheaper. It can. The useful question is whether it makes serious attempts cheap enough that teams can afford to search better.
That difference sounds small. It is not.
Treat AI as cheap labor and you get a faster content mill, a larger backlog of plausible nonsense, and a new category of review debt. Treat AI as cheap ideation and critique and you can raise the quality frontier. You get more candidate designs, more test cases, more product angles, more copy variants, more user stories, more failure modes, and more shots at finding the one option worth shipping.
The problem is usually not generation.
It is selection.
* * *
Cheap means marginal, not magical
When people say AI is cheap, they often smuggle in a much larger claim: cheap to deploy, cheap to govern, cheap to integrate, cheap to trust.
That is false.
AI is cheap in a narrower and more interesting way. It lowers the marginal cost of one more attempt.
Stanford HAI’s 2025 AI Index reported that the inference cost for a system performing at GPT-3.5 level dropped more than 280-fold between November 2022 and October 2024. That is the part of the economics that matters for ideation. One more draft, summary, test case, sketch, critique, or scenario is suddenly much cheaper than it used to be. [1]
But the model call is not the program cost.
The total cost still includes workflow redesign, data access, retrieval, security review, evaluation harnesses, legal controls, vendor management, and the human time needed to decide whether the output is any good. The productivity J-curve literature makes this point in a less fashionable but more useful way: general-purpose technologies tend to require complementary intangible investments before their benefits show up cleanly. [2]
This is why some AI pilots feel miraculous and some feel like expensive theater.
The difference is not just model quality. It is whether the organization built the surrounding machinery that turns cheap attempts into better outcomes.
Cheap attempts without selection are noise.
Cheap attempts with disciplined selection are leverage.
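Here is a toy version of that claim. The sketch below is a simulation with made-up distributions, not data: the candidate quality scores and the evaluator noise are both assumptions. It compares what you end up shipping when you pick the best of N attempts using a sharp evaluator versus a noisy one.

```python
import random
import statistics

def selected_quality(n_attempts: int, eval_noise: float, trials: int = 10_000) -> float:
    """Average true quality of the attempt a noisy evaluator picks."""
    picked = []
    for _ in range(trials):
        quality = [random.gauss(0, 1) for _ in range(n_attempts)]   # true quality, hidden
        score = [q + random.gauss(0, eval_noise) for q in quality]  # what the evaluator sees
        best = max(range(n_attempts), key=lambda i: score[i])
        picked.append(quality[best])
    return statistics.mean(picked)

for n in (1, 3, 10, 30):
    sharp = selected_quality(n, eval_noise=0.2)  # disciplined selection
    blur = selected_quality(n, eval_noise=3.0)   # selection barely better than chance
    print(f"N={n:>2}  sharp evaluator: {sharp:+.2f}  noisy evaluator: {blur:+.2f}")
```

With a sharp evaluator, the quality of the selected attempt climbs steadily as N grows. With a noisy evaluator, it climbs far more slowly. More attempts only pay off if the selection signal is real.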
* * *
The thing AI buys is search
A lot of teams still ask the small question: can AI write this memo?
The better question is: how many versions of the memo, with how many frames, from how many stakeholder perspectives, can I now afford to inspect before making a decision?
That is the strategic move. AI changes the search budget.
In a professional writing experiment published in Science, ChatGPT reduced task time by 40% and raised output quality by 18%. That is not only a speed story. It is a quality story, because the tool made drafting and revision cheaper inside a bounded task. [3]
The Boston Consulting Group field experiment is even more useful because it exposes both the upside and the trap. In the experiment, 758 consultants worked on realistic consulting tasks. For tasks inside the model’s capability frontier, consultants using AI completed 12.2% more tasks, worked 25.1% faster, and produced work rated more than 40% higher in quality than the control group. For one task outside the frontier, however, consultants using AI were 19 percentage points less likely to produce the correct answer. [4]
That is the whole AI strategy problem in miniature.
Inside the frontier, AI buys more search. Outside the frontier, AI buys false confidence at scale.
The same pattern shows up elsewhere. In customer support, a study of 5,179 agents found that access to a generative AI assistant increased productivity by 14% on average, with a 34% improvement for novice and low-skilled workers. That matters because AI did not simply replace expertise. It helped redistribute some of the practices of stronger workers to people who had less experience. [5]
In product innovation work at Procter & Gamble, a field experiment with 776 professionals found that individuals using AI matched the performance of two-person teams without AI. It also found that AI helped R&D and commercial professionals produce more balanced solutions across functional boundaries. [6]
That is not a headcount story. It is a collaboration story.
The best version of this argument is not that AI makes experts unnecessary. It is that AI makes a broader idea pool available before experts spend their scarce time.
You can see the same logic in creative work. In a Science Advances study, writers with access to AI-generated ideas produced stories judged more creative, better written, and more enjoyable. But the collective diversity of the stories fell. [7]
That last sentence is the tax.
AI can raise the floor while narrowing the room.
In R&D, the search argument becomes more literal. Google DeepMind’s GNoME work, published in Nature, reported 381,000 new crystal structures on the updated convex hull, meaning they are predicted to be stable. The point is not that a scientist typed faster. The point is that AI changed the scale of candidate exploration. [8]
This is why “AI as automation” is too small a frame.
For many teams, the practical advantage is not doing the same thing with fewer people. It is making it economically reasonable to try more things, compare them, and find the better one.
* * *
The jagged frontier is where ROI gets embarrassed
The lazy pro-AI story says the tool makes everyone faster.
The lazy anti-AI story says the tool is mostly slop.
Both are too convenient.
What actually matters is task fit. The frontier is jagged. A model may be excellent at drafting a first-pass policy, mediocre at designing the policy’s exception system, and dangerous when asked to infer the legal status of an edge case without grounding.
This sounds obvious, but teams miss it all the time. They run one impressive demo, generalize from it, and then act surprised when a different workflow collapses under review.
Software is a good place to be humble. AI coding tools can be valuable for bounded tasks, scaffolding, test generation, and unfamiliar syntax. They can also slow down experts working inside large, mature codebases, where context, judgment, and verification dominate the work.
METR’s early-2025 randomized trial found that experienced open-source developers working in their own repositories took 19% longer when AI tools were allowed. METR later noted that a follow-up experiment in late 2025 and early 2026 was harder to interpret: many developers declined to participate in work without AI, task selection changed, and the later data provided only weak evidence of a speedup rather than a clean estimate. [10]
The lesson is not “AI slows developers.”
The lesson is that measurement without context is a prank.
The advertising evidence has the same jagged shape. In the Pairit field experiment, human-AI teams produced 50% more ads per worker and higher-quality text, while human-human teams produced higher-quality images. The AI-assisted outputs also became more homogeneous. [9]
So yes, AI can produce more. Sometimes it produces better. Sometimes it produces faster sameness.
Preference is not performance.
A user liking the tool is not the same as the work improving. A team feeling faster is not the same as the release being safer. A pile of polished drafts is not the same as a better decision.
AI often moves cost from generation to review. That is fine, as long as you admit it.
The question is not, “Did we produce more?”
The question is, “Did the extra attempts improve the selected output after evaluation, risk controls, and downstream consequences?”
That is a much harder question. It is also the only one worth asking.
* * *
A workflow: generate broadly, critique narrowly, approve explicitly
Here is the pattern I would use for most quality-led AI pilots.
Do not start by buying the fanciest model for everyone.
Start with one workflow where more variants plausibly improve the outcome and where quality can be measured. Product onboarding. Campaign copy. Support macros. Test generation. Architecture decision records. Internal research briefs. Design concepts. Customer email drafts.
Then run the workflow like this.
First, define the quality metric before anyone touches the model.
For a marketing team, that might be click-through rate, conversion rate, cost per acquisition, brand compliance, and creative diversity. For software, it might be unit-test pass rate, review churn, revert rate, defect escape, maintainability, cycle time, and developer satisfaction. For product design, it might be feasibility, novelty, variety across solution families, user value, and downstream validation success. Developer productivity frameworks such as SPACE explicitly warn against reducing productivity to a single activity metric, and design research similarly distinguishes quantity, quality, novelty, and variety in ideation. [11] [12]
Second, generate too many options on purpose.
Ask for 30 candidate onboarding flows, not three. Ask for failure modes across novice users, power users, support teams, legal reviewers, and abuse cases. Ask for boring ideas, risky ideas, high-variance ideas, and ideas optimized for constraints the team usually ignores.
This is where cheap models, templates, and reusable prompts make sense. You are not looking for the final answer. You are buying breadth.
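As a sketch of what buying breadth can look like in practice (the `call_model` function is a placeholder for whatever client your stack uses, and the prompt scaffold is one possible pattern, not a prescription):

```python
from itertools import product

# Placeholder for your actual model client; assumed, not a real API.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your provider's client")

SEGMENTS = ["novice user", "power user", "support agent", "legal reviewer", "abuse case"]
STANCES = ["boring and safe", "risky and high-variance",
           "optimized for constraints we usually ignore"]

TEMPLATE = (
    "Propose one onboarding flow for a {segment}, written to be {stance}. "
    "State the single assumption that must hold for this flow to win."
)

# 5 segments x 3 stances x 2 rounds = 30 candidates: breadth on purpose.
candidates = [
    call_model(TEMPLATE.format(segment=s, stance=t))
    for s, t in product(SEGMENTS, STANCES)
    for _ in range(2)
]
```

Nothing here is clever. The only decision being made is to vary segment and stance deliberately instead of asking the same question thirty times.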
Third, force disagreement.
Have the model critique its own outputs against the rubric. Have a second model argue against the first. Have humans mark the outputs as promising, wrong, duplicative, risky, or surprisingly useful. Ask what assumptions would need to be true for each option to win.
This is not because models are reliable judges. They are not. It is because structured critique turns a pile of drafts into a map of trade-offs.
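A minimal shape for that critique pass, again with a placeholder model call and an assumed rubric:

```python
# Placeholder for your actual model client; assumed, not a real API.
def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your provider's client")

RUBRIC = ["friction", "clarity", "implementation cost", "abuse risk", "activation impact"]

def critique(candidate: str) -> str:
    """Score one candidate against the rubric, then argue against it."""
    criteria = ", ".join(RUBRIC)
    return call_model(
        f"Score this option 1-5 on each of: {criteria}. "
        f"Then state the strongest argument AGAINST it.\n\n{candidate}"
    )

def map_tradeoffs(candidates: list[str]) -> list[tuple[str, str]]:
    # The output is not a verdict; it is a map of trade-offs for humans to read.
    return [(c, critique(c)) for c in candidates]
```

The scores are not trustworthy and do not need to be. The point is that every candidate arrives at human review with its weaknesses already named.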
Fourth, ground the shortlist.
This is where teams should spend money. Use retrieval against approved sources. Run tests. Check facts. Ask legal or security to review only the shortlisted options that actually matter. Use stronger models for difficult reasoning, not for every stray brainstorm. Put humans at the points where judgment, accountability, and taste matter.
Fifth, approve explicitly.
The final artifact should include a short decision log: what was selected, why it was selected, what was rejected, what evidence was used, what risks remain, and who owns the outcome.
That log is not bureaucracy. It is institutional memory.
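It also does not need tooling. Something this small works, as long as it is written at decision time; the field names and values below are illustrative, not a standard:

```python
# Illustrative values only; the fields are one suggestion, not a schema.
decision_log = {
    "selected": "variant-17: segmented onboarding with deferred signup",
    "why": "highest activation-impact score on the rubric; lowest abuse risk of the shortlist",
    "rejected": {"variant-03": "legal risk", "variant-22": "near-duplicate of 17"},
    "evidence": ["rubric scores", "two moderated user sessions", "legal review of data flows"],
    "open_risks": ["deferred signup may reduce email capture"],
    "owner": "activation PM",
}
```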
A concrete example: suppose a product team wants to improve trial activation.
The weak AI workflow asks a chatbot for onboarding ideas and ships the prettiest one.
The useful workflow asks for 40 onboarding variants across user segments, clusters them into themes, scores them against friction, clarity, implementation cost, abuse risk, and likely activation impact, then selects three for testing. Engineering uses AI to generate edge cases and test plans. Support uses AI to draft macro changes. Legal reviews only the two flows that touch sensitive data. Product writes down why the winner was chosen.
No one in that workflow worships the model.
They use it to make exploration cheap, then spend human attention where it compounds.
* * *
Measurement is the moat
AI programs fail quietly when they measure activity and call it value.
Number of prompts is not value. Number of drafts is not value. Tokens consumed is not value. Hours allegedly saved is not value unless the saved hours become something better.
The target is quality-adjusted output.
That phrase is ugly, but it is the right one.
A quality-led ROI model should include four terms. What was the value of the quality uplift? What additional experiments became possible? What time savings were actually redeployed to higher-value work? What risk-related losses were avoided?
Then subtract the total AI program cost, not just the model invoice.
The total cost includes seats, tokens, integration, retrieval, observability, evaluations, security, compliance, human review, training, vendor management, and the time spent cleaning up confident mistakes.
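Made concrete, with deliberately invented numbers (every figure below is an assumption for illustration, not a benchmark):

```python
# Quality-led ROI: four value terms minus the total program cost.
# All numbers are illustrative assumptions, not benchmarks.
quality_uplift_value  = 180_000  # e.g., conversion lift from better-selected copy
new_experiments_value =  60_000  # experiments previously too expensive to run
redeployed_time_value =  90_000  # saved hours actually moved to higher-value work
avoided_risk_losses   =  40_000  # incidents caught in review that would have shipped

program_cost = sum([
    50_000,   # seats and tokens
    120_000,  # integration, retrieval, observability
    45_000,   # evaluations, security, compliance
    80_000,   # human review, training, vendor management, cleanup
])

net_value = (quality_uplift_value + new_experiments_value
             + redeployed_time_value + avoided_risk_losses) - program_cost
print(f"Net value: ${net_value:,}")  # Net value: $75,000
```

The model is trivial. Filling it with honest numbers is the hard part.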
The failure mode is predictable. A team runs a pilot, sees that people feel faster, buys more licenses, and declares victory before measuring whether selected outputs improved.
This is why the most useful AI metrics are often boring.
For software, count fewer escaped defects, fewer reverts, faster review of low-risk changes, better test coverage, and less review fatigue. For marketing, count more meaningful experiments and better conversion, not just more copy. For product, count whether the team found ideas that survived user evidence. For internal writing, count whether decisions got clearer and faster, not whether documents got longer.
The best AI deployment will often look less like a productivity dashboard and more like an experiment system.
Baseline. Generate. Select. Test. Learn. Repeat.
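If you want that loop as a code shape rather than a slogan (every method here is a placeholder you would implement for your own workflow, not an API):

```python
# The experiment loop as structure, not tooling.
def run_cycle(workflow, n_candidates: int = 30):
    baseline = workflow.measure_baseline()        # Baseline.
    candidates = workflow.generate(n_candidates)  # Generate.
    shortlist = workflow.select(candidates)       # Select.
    results = workflow.test(shortlist)            # Test.
    workflow.record(baseline, results)            # Learn.
    return results                                # Then call it again. Repeat.
```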
* * *
Governance is not a vibes tax
A lot of builders treat governance as the department of no.
That is lazy.
Risk-based governance is how you keep cheap attempts from becoming expensive mistakes.
NIST’s Generative AI Profile for the AI Risk Management Framework identifies risks that are directly relevant to this kind of workflow: data privacy, harmful bias and homogenization, intellectual property, information integrity, and the need for pre-deployment and ongoing evaluation. [13]
Those are not abstract enterprise concerns. They show up inside ordinary work.
A model drafts customer copy that subtly changes a claim. A sales deck includes an invented customer reference. A code assistant suggests a dependency with a license problem. A team uses the same model and prompt pattern until every campaign sounds like it came from the same mildly caffeinated intern. A product researcher uploads sensitive notes into a tool that should not have seen them.
Cheap ideation increases surface area.
Governance should therefore be embedded in the workflow, not bolted on at the end.
That means a use-case inventory. Clear data boundaries. Approved source retrieval. Human review thresholds. Provenance for generated assets. Evaluation logs. Incident reporting. Vendor review. Periodic checks for output sameness. A rule that high-stakes uses need stronger evidence than “the model sounded confident.”
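The sameness check, at least, is cheap to prototype. A crude version using only the standard library (the 0.6 threshold is an arbitrary assumption to tune against your own corpus):

```python
from itertools import combinations

def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts: 0.0 (disjoint) to 1.0 (identical)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def sameness_report(outputs: list[str], threshold: float = 0.6):
    """Pairs of outputs whose overlap exceeds the (assumed) threshold."""
    return [
        (i, j, round(s, 2))
        for (i, a), (j, b) in combinations(enumerate(outputs), 2)
        if (s := jaccard(a, b)) >= threshold
    ]
```

A real deployment would use embeddings, but even word overlap catches the same-prompt, same-voice failure early.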
Copyright is part of this too. The U.S. Copyright Office’s 2025 report on AI and copyrightability concluded that purely AI-generated material is not copyrightable, that prompts alone are unlikely to provide sufficient control, and that human-authored expression, creative selection, arrangement, or modification can matter. [14]
The practical point is simple: if the output matters, document the human contribution.
Not because paperwork is fun.
Because provenance becomes part of quality.
* * *
The uncomfortable team question
If AI makes attempts cheap, then the role of the team changes.
People do less blank-page production. They do more framing, delegation, evaluation, verification, and integration.
That sounds tidy. In practice, it is politically weird.
The person who used to be rewarded for producing the first draft may now be rewarded for writing the best rubric. The senior engineer may spend less time typing boilerplate and more time identifying which generated patch is subtly wrong. The product manager may become more valuable by asking better questions than by writing longer strategy docs. The designer may become the guardian of taste against the beige flood.
This is why AI adoption is an organization design problem disguised as a tool rollout.
DORA’s 2025 research on AI-assisted software development put it cleanly: AI acts as an amplifier, magnifying an organization’s existing strengths and weaknesses. The greatest returns come not from the tools alone, but from improving the underlying system. [15]
That matches the evidence everywhere else.
AI helps when the workflow knows what good means. It hurts when the workflow has no taste, no metrics, no review capacity, and no one willing to say no to polished garbage.
The wrong debate is whether AI replaces people.
The sharper debate is which human work becomes more important when generation gets cheap.
My answer: framing, taste, domain judgment, experimental discipline, and accountability.
AI is not cheap intelligence.
It is cheap attempts.
The teams that benefit will not be the ones that produce the most synthetic output. They will be the ones that use cheap attempts to find better options, then have the discipline to reject most of them.
So here is the slightly annoying question.
Where is your team still behaving as if serious attempts are expensive?
And is that prudence, or just an old cost structure wearing a strategy badge?
* * *
Notes and References
1. Stanford Institute for Human-Centered Artificial Intelligence, The 2025 AI Index Report, 2025. https://hai.stanford.edu/ai-index/2025-ai-index-report
2. Erik Brynjolfsson, Daniel Rock, and Chad Syverson, “The Productivity J-Curve: How Intangibles Complement General Purpose Technologies,” American Economic Journal: Macroeconomics, 2021. https://www.aeaweb.org/articles?id=10.1257/mac.20180386
3. Shakked Noy and Whitney Zhang, “Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,” Science, 2023. https://www.science.org/doi/10.1126/science.adh2586
4. Fabrizio Dell’Acqua et al., “Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality,” Harvard Business School / SSRN, 2023. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321
5. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, “Generative AI at Work,” NBER Working Paper 31161, 2023; later published in The Quarterly Journal of Economics, 2025. https://www.nber.org/papers/w31161
6. Fabrizio Dell’Acqua et al., “The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise,” NBER Working Paper 33641, 2025. https://www.nber.org/papers/w33641
7. Anil R. Doshi and Oliver P. Hauser, “Generative AI Enhances Individual Creativity but Reduces the Collective Diversity of Novel Content,” Science Advances, 2024. https://www.science.org/doi/10.1126/sciadv.adn5290
8. Amil Merchant et al., “Scaling Deep Learning for Materials Discovery,” Nature, 2023. https://www.nature.com/articles/s41586-023-06735-9
9. Harang Ju and Sinan Aral, “Collaborating with AI Agents: A Field Experiment on Teamwork, Productivity, and Performance,” arXiv, version updated 2026. https://arxiv.org/abs/2503.18238
10. Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,” METR, 2025; and Joel Becker et al., “We are Changing our Developer Productivity Experiment Design,” METR, 2026. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ and https://metr.org/blog/2026-02-24-uplift-update/
11. Nicole Forsgren et al., “The SPACE of Developer Productivity,” ACM Queue, 2021. https://queue.acm.org/detail.cfm?id=3454124
12. Scarlett R. Miller et al., “How Should We Measure Creativity in Engineering Design? A Comparison of Social Science and Engineering Approaches,” Journal of Mechanical Design, 2021. https://asmedigitalcollection.asme.org/mechanicaldesign/article/143/3/031404/1090546/How-Should-We-Measure-Creativity-in-Engineering
13. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
14. U.S. Copyright Office, Copyright and Artificial Intelligence, Part 2: Copyrightability, 2025. https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-2-Copyrightability-Report.pdf
15. DORA, State of AI-assisted Software Development, 2025. https://dora.dev/dora-report-2025/

