<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Plausible Reality]]></title><description><![CDATA[Plausible Reality is a calm, analytical publication about AI, technology leadership, startups, and the narratives surrounding them. It looks past hype, consensus, and rage-bait to examine what actually holds up under scrutiny. Most posts come from the per]]></description><link>https://plausiblereality.com</link><image><url>https://plausiblereality.com/img/substack.png</url><title>Plausible Reality</title><link>https://plausiblereality.com</link></image><generator>Substack</generator><lastBuildDate>Sat, 30 May 2026 03:22:15 GMT</lastBuildDate><atom:link href="https://plausiblereality.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Eloi Tay]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[plausiblereality@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[plausiblereality@substack.com]]></itunes:email><itunes:name><![CDATA[Eloi Tay]]></itunes:name></itunes:owner><itunes:author><![CDATA[Eloi Tay]]></itunes:author><googleplay:owner><![CDATA[plausiblereality@substack.com]]></googleplay:owner><googleplay:email><![CDATA[plausiblereality@substack.com]]></googleplay:email><googleplay:author><![CDATA[Eloi Tay]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[AI Is Cheap. Good Ideas Still Aren't.]]></title><description><![CDATA[The useful question is not whether AI saves hours. 
It is whether it lets teams buy more serious attempts without lowering the bar.]]></description><link>https://plausiblereality.com/p/ai-is-cheap-good-ideas-still-arent</link><guid isPermaLink="false">https://plausiblereality.com/p/ai-is-cheap-good-ideas-still-arent</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Tue, 28 Apr 2026 20:01:51 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2fe79648-92ff-45f3-bc9a-fa2cecf01d8d_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Most AI ROI debates start with a payroll spreadsheet.</p><p>How many people can this replace? How many hours can this shave? How much output can we squeeze from the same org chart?</p><p>I think that is the wrong debate.</p><p>The useful question is not whether AI can make average work cheaper. It can. The useful question is whether it makes serious attempts cheap enough that teams can afford to search better.</p><p>That difference sounds small. It is not.</p><p>Treat AI as cheap labor and you get a faster content mill, a larger backlog of plausible nonsense, and a new category of review debt. 
Treat AI as cheap ideation and critique and you can raise the quality frontier. You get more candidate designs, more test cases, more product angles, more copy variants, more user stories, more failure modes, and more shots at finding the one option worth shipping.</p><p>The problem is usually not generation.</p><p>It is selection.</p><p style="text-align: center;">* * *</p><h1>Cheap means marginal, not magical</h1><p>When people say AI is cheap, they often smuggle in a much larger claim: cheap to deploy, cheap to govern, cheap to integrate, cheap to trust.</p><p>That is false.</p><p>AI is cheap in a narrower and more interesting way. It lowers the marginal cost of one more attempt.</p><p>Stanford HAI&#8217;s 2025 AI Index reported that the inference cost for a system performing at GPT-3.5 level dropped more than 280-fold between November 2022 and October 2024. That is the part of the economics that matters for ideation. One more draft, summary, test case, sketch, critique, or scenario is suddenly much cheaper than it used to be. [1]</p><p>But the model call is not the program cost.</p><p>The total cost still includes workflow redesign, data access, retrieval, security review, evaluation harnesses, legal controls, vendor management, and the human time needed to decide whether the output is any good. The productivity J-curve literature makes this point in a less fashionable but more useful way: general-purpose technologies tend to require complementary intangible investments before their benefits show up cleanly. [2]</p><p>This is why some AI pilots feel miraculous and some feel like expensive theater.</p><p>The difference is not just model quality. 
It is whether the organization built the surrounding machinery that turns cheap attempts into better outcomes.</p><p>Cheap attempts without selection are noise.</p><p>Cheap attempts with disciplined selection are leverage.</p><p style="text-align: center;">* * *</p><h1>The thing AI buys is search</h1><p>A lot of teams still ask the small question: can AI write this memo?</p><p>The better question is: how many versions of the memo, with how many frames, from how many stakeholder perspectives, can I now afford to inspect before making a decision?</p><p>That is the strategic move. AI changes the search budget.</p><p>In a professional writing experiment published in Science, ChatGPT reduced task time by 40% and raised output quality by 18%. That is not only a speed story. It is a quality story, because the tool made drafting and revision cheaper inside a bounded task. [3]</p><p>The Boston Consulting Group field experiment is even more useful because it exposes both the upside and the trap. In the experiment, 758 consultants worked on realistic consulting tasks. For tasks inside the model&#8217;s capability frontier, consultants using AI completed 12.2% more tasks, worked 25.1% faster, and produced work rated more than 40% higher in quality than the control group. For one task outside the frontier, however, consultants using AI were 19 percentage points less likely to produce the correct answer. [4]</p><p>That is the whole AI strategy problem in miniature.</p><p>Inside the frontier, AI buys more search. Outside the frontier, AI buys false confidence at scale.</p><p>The same pattern shows up elsewhere. In customer support, a study of 5,179 agents found that access to a generative AI assistant increased productivity by 14% on average, with a 34% improvement for novice and low-skilled workers. That matters because AI did not simply replace expertise. It helped redistribute some of the practices of stronger workers to people who had less experience. 
[5]</p><p>In product innovation work at Procter &amp; Gamble, a field experiment with 776 professionals found that individuals using AI matched the performance of two-person teams without AI. It also found that AI helped R&amp;D and commercial professionals produce more balanced solutions across functional boundaries. [6]</p><p>That is not a headcount story. It is a collaboration story.</p><p>The best version of this argument is not that AI makes experts unnecessary. It is that AI makes a broader idea pool available before experts spend their scarce time.</p><p>You can see the same logic in creative work. In a Science Advances study, writers with access to AI-generated ideas produced stories judged more creative, better written, and more enjoyable. But the collective diversity of the stories fell. [7]</p><p>That last sentence is the tax.</p><p>AI can raise the floor while narrowing the room.</p><p>In R&amp;D, the search argument becomes more literal. Google DeepMind&#8217;s GNoME work, published in Nature, reported 381,000 new stable crystal structures available on the convex hull. The point is not that a scientist typed faster. The point is that AI changed the scale of candidate exploration. [8]</p><p>This is why &#8220;AI as automation&#8221; is too small a frame.</p><p>For many teams, the practical advantage is not doing the same thing with fewer people. It is making it economically reasonable to try more things, compare them, and find the better one.</p><p style="text-align: center;">* * *</p><h1>The jagged frontier is where ROI gets embarrassed</h1><p>The lazy pro-AI story says the tool makes everyone faster.</p><p>The lazy anti-AI story says the tool is mostly slop.</p><p>Both are too convenient.</p><p>What actually matters is task fit. The frontier is jagged. 
A model may be excellent at drafting a first-pass policy, mediocre at designing the policy&#8217;s exception system, and dangerous when asked to infer the legal status of an edge case without grounding.</p><p>This sounds obvious, but teams miss it all the time. They run one impressive demo, generalize from it, and then act surprised when a different workflow collapses under review.</p><p>Software is a good place to be humble. AI coding tools can be valuable in bounded tasks, scaffolding, test generation, and unfamiliar syntax. They can also slow down experts working inside large, mature codebases where context, judgment, and verification dominate the work.</p><p>METR&#8217;s early-2025 randomized trial found that experienced open-source developers working in their own repositories took 19% longer when AI tools were allowed. METR later noted that a follow-up experiment in late 2025 and early 2026 became harder to interpret because many developers did not want to participate in work without AI and because task selection changed; the researchers said their later data provided weak evidence of speedup rather than a clean estimate. [10]</p><p>The lesson is not &#8220;AI slows developers.&#8221;</p><p>The lesson is that measurement without context is a prank.</p><p>The advertising evidence has the same jagged shape. In the Pairit field experiment, human-AI teams produced 50% more ads per worker and higher-quality text, while human-human teams produced higher-quality images. The AI-assisted outputs also became more homogeneous. [9]</p><p>So yes, AI can produce more. Sometimes it produces better. Sometimes it produces faster sameness.</p><p>Preference is not performance.</p><p>A user liking the tool is not the same as the work improving. A team feeling faster is not the same as the release being safer. A pile of polished drafts is not the same as a better decision.</p><p>AI often moves cost from generation to review. 
That is fine, as long as you admit it.</p><p>The question is not, &#8220;Did we produce more?&#8221;</p><p>The question is, &#8220;Did the extra attempts improve the selected output after evaluation, risk controls, and downstream consequences?&#8221;</p><p>That is a much harder question. It is also the only one worth asking.</p><p style="text-align: center;">* * *</p><h1>A workflow: generate broadly, critique narrowly, approve explicitly</h1><p>Here is the pattern I would use for most quality-led AI pilots.</p><p>Do not start by buying the fanciest model for everyone.</p><p>Start with one workflow where more variants plausibly improve the outcome and where quality can be measured. Product onboarding. Campaign copy. Support macros. Test generation. Architecture decision records. Internal research briefs. Design concepts. Customer email drafts.</p><p>Then run the workflow like this.</p><p>First, define the quality metric before anyone touches the model.</p><p>For a marketing team, that might be click-through rate, conversion rate, cost per acquisition, brand compliance, and creative diversity. For software, it might be unit-test pass rate, review churn, revert rate, defect escape, maintainability, cycle time, and developer satisfaction. For product design, it might be feasibility, novelty, variety across solution families, user value, and downstream validation success. Developer productivity frameworks such as SPACE explicitly warn against reducing productivity to a single activity metric, and design research similarly distinguishes quantity, quality, novelty, and variety in ideation. [11] [12]</p><p>Second, generate too many options on purpose.</p><p>Ask for 30 candidate onboarding flows, not three. Ask for failure modes across novice users, power users, support teams, legal reviewers, and abuse cases. 
Ask for boring ideas, risky ideas, high-variance ideas, and ideas optimized for constraints the team usually ignores.</p><p>This is where cheap models, templates, and reusable prompts make sense. You are not looking for the final answer. You are buying breadth.</p><p>Third, force disagreement.</p><p>Have the model critique its own outputs against the rubric. Have a second model argue against the first. Have humans mark the outputs as promising, wrong, duplicative, risky, or surprisingly useful. Ask what assumptions would need to be true for each option to win.</p><p>This is not because models are reliable judges. They are not. It is because structured critique turns a pile of drafts into a map of trade-offs.</p><p>Fourth, ground the shortlist.</p><p>This is where teams should spend money. Use retrieval against approved sources. Run tests. Check facts. Ask legal or security to review only the shortlisted options that actually matter. Use stronger models for difficult reasoning, not for every stray brainstorm. Put humans at the points where judgment, accountability, and taste matter.</p><p>Fifth, approve explicitly.</p><p>The final artifact should include a short decision log: what was selected, why it was selected, what was rejected, what evidence was used, what risks remain, and who owns the outcome.</p><p>That log is not bureaucracy. It is institutional memory.</p><p>A concrete example: suppose a product team wants to improve trial activation.</p><p>The weak AI workflow asks a chatbot for onboarding ideas and ships the prettiest one.</p><p>The useful workflow asks for 40 onboarding variants across user segments, clusters them into themes, scores them against friction, clarity, implementation cost, abuse risk, and likely activation impact, then selects three for testing. Engineering uses AI to generate edge cases and test plans. Support uses AI to draft macro changes. Legal reviews only the two flows that touch sensitive data. 
Product writes down why the winner was chosen.</p><p>No one in that workflow worships the model.</p><p>They use it to make exploration cheap, then spend human attention where it compounds.</p><p style="text-align: center;">* * *</p><h1>Measurement is the moat</h1><p>AI programs fail quietly when they measure activity and call it value.</p><p>Number of prompts is not value. Number of drafts is not value. Tokens consumed is not value. Hours allegedly saved is not value unless the saved hours become something better.</p><p>The target is quality-adjusted output.</p><p>That phrase is ugly, but it is the right one.</p><p>A quality-led ROI model should include four terms. What was the value of the quality uplift? What additional experiments became possible? What time savings were actually redeployed to higher-value work? What risk losses were avoided?</p><p>Then subtract the total AI program cost, not just the model invoice.</p><p>The total cost includes seats, tokens, integration, retrieval, observability, evaluations, security, compliance, human review, training, vendor management, and the time spent cleaning up confident mistakes.</p><p>The failure mode is predictable. A team runs a pilot, sees that people feel faster, buys more licenses, and declares victory before measuring whether selected outputs improved.</p><p>This is why the most useful AI metrics are often boring.</p><p>For software, count fewer escaped defects, fewer reverts, faster review of low-risk changes, better test coverage, and less review fatigue. For marketing, count more meaningful experiments and better conversion, not just more copy. For product, count whether the team found ideas that survived user evidence. For internal writing, count whether decisions got clearer and faster, not whether documents got longer.</p><p>The best AI deployment will often look less like a productivity dashboard and more like an experiment system.</p><p>Baseline. Generate. Select. Test. Learn. 
Repeat.</p><p style="text-align: center;">* * *</p><h1>Governance is not a vibes tax</h1><p>A lot of builders treat governance as the department of no.</p><p>That is lazy.</p><p>Risk-based governance is how you keep cheap attempts from becoming expensive mistakes.</p><p>NIST&#8217;s Generative AI Profile for the AI Risk Management Framework identifies risks that are directly relevant to this kind of workflow: data privacy, harmful bias and homogenization, intellectual property, information integrity, and the need for pre-deployment and ongoing evaluation. [13]</p><p>Those are not abstract enterprise concerns. They show up inside ordinary work.</p><p>A model drafts customer copy that subtly changes a claim. A sales deck includes an invented customer reference. A code assistant suggests a dependency with a license problem. A team uses the same model and prompt pattern until every campaign sounds like it came from the same mildly caffeinated intern. A product researcher uploads sensitive notes into a tool that should not have seen them.</p><p>Cheap ideation increases surface area.</p><p>Governance should therefore be embedded in the workflow, not bolted on at the end.</p><p>That means a use-case inventory. Clear data boundaries. Approved source retrieval. Human review thresholds. Provenance for generated assets. Evaluation logs. Incident reporting. Vendor review. Periodic checks for output sameness. A rule that high-stakes uses need stronger evidence than &#8220;the model sounded confident.&#8221;</p><p>Copyright is part of this too. The U.S. Copyright Office&#8217;s 2025 report on AI and copyrightability concluded that purely AI-generated material is not copyrightable, that prompts alone are unlikely to provide sufficient control, and that human-authored expression, creative selection, arrangement, or modification can matter. 
[14]</p><p>The practical point is simple: if the output matters, document the human contribution.</p><p>Not because paperwork is fun.</p><p>Because provenance becomes part of quality.</p><p style="text-align: center;">* * *</p><h1>The uncomfortable team question</h1><p>If AI makes attempts cheap, then the role of the team changes.</p><p>People do less blank-page production. They do more framing, delegation, evaluation, verification, and integration.</p><p>That sounds tidy. In practice, it is politically weird.</p><p>The person who used to be rewarded for producing the first draft may now be rewarded for writing the best rubric. The senior engineer may spend less time typing boilerplate and more time identifying which generated patch is subtly wrong. The product manager may become more valuable by asking better questions than by writing longer strategy docs. The designer may become the guardian of taste against the beige flood.</p><p>This is why AI adoption is an organization design problem disguised as a tool rollout.</p><p>DORA&#8217;s 2025 research on AI-assisted software development put it cleanly: AI acts as an amplifier, magnifying an organization&#8217;s existing strengths and weaknesses. The greatest returns come not from the tools alone, but from improving the underlying system. [15]</p><p>That matches the evidence everywhere else.</p><p>AI helps when the workflow knows what good means. It hurts when the workflow has no taste, no metrics, no review capacity, and no one willing to say no to polished garbage.</p><p>The wrong debate is whether AI replaces people.</p><p>The sharper debate is which human work becomes more important when generation gets cheap.</p><p>My answer: framing, taste, domain judgment, experimental discipline, and accountability.</p><p>AI is not cheap intelligence.</p><p>It is cheap attempts.</p><p>The teams that benefit will not be the ones that produce the most synthetic output. 
They will be the ones that use cheap attempts to find better options, then have the discipline to reject most of them.</p><p>So here is the slightly annoying question.</p><p>Where is your team still behaving as if serious attempts are expensive?</p><p>And is that prudence, or just an old cost structure wearing a strategy badge?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><blockquote><p>1. Stanford Institute for Human-Centered Artificial Intelligence, The 2025 AI Index Report, 2025. https://hai.stanford.edu/ai-index/2025-ai-index-report</p><p>2. Erik Brynjolfsson, Daniel Rock, and Chad Syverson, &#8220;The Productivity J-Curve: How Intangibles Complement General Purpose Technologies,&#8221; American Economic Journal: Macroeconomics, 2021. https://www.aeaweb.org/articles?id=10.1257/mac.20180386</p><p>3. Shakked Noy and Whitney Zhang, &#8220;Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,&#8221; Science, 2023. https://www.science.org/doi/10.1126/science.adh2586</p><p>4. Fabrizio Dell&#8217;Acqua et al., &#8220;Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality,&#8221; Harvard Business School / SSRN, 2023. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4573321</p><p>5. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, &#8220;Generative AI at Work,&#8221; NBER Working Paper 31161, 2023; later published in The Quarterly Journal of Economics, 2025. https://www.nber.org/papers/w31161</p><p>6. Fabrizio Dell&#8217;Acqua et al., &#8220;The Cybernetic Teammate: A Field Experiment on Generative AI Reshaping Teamwork and Expertise,&#8221; NBER Working Paper 33641, 2025. https://www.nber.org/papers/w33641</p><p>7. Anil R. Doshi and Oliver P. Hauser, &#8220;Generative AI Enhances Individual Creativity but Reduces the Collective Diversity of Novel Content,&#8221; Science Advances, 2024. 
https://www.science.org/doi/10.1126/sciadv.adn5290</p><p>8. Amil Merchant et al., &#8220;Scaling Deep Learning for Materials Discovery,&#8221; Nature, 2023. https://www.nature.com/articles/s41586-023-06735-9</p><p>9. Hyelim Ju et al., &#8220;Collaborating with AI Agents: A Field Experiment on Teamwork, Productivity, and Performance,&#8221; arXiv, version updated 2026. https://arxiv.org/abs/2503.18238</p><p>10. Joel Becker, Nate Rush, Elizabeth Barnes, and David Rein, &#8220;Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity,&#8221; METR, 2025; and Joel Becker et al., &#8220;We are Changing our Developer Productivity Experiment Design,&#8221; METR, 2026. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/ and https://metr.org/blog/2026-02-24-uplift-update/</p><p>11. Nicole Forsgren et al., &#8220;The SPACE of Developer Productivity,&#8221; ACM Queue, 2021. https://queue.acm.org/detail.cfm?id=3454124</p><p>12. Scarlett R. Miller et al., &#8220;How Should We Measure Creativity in Engineering Design? A Comparison of Social Science and Engineering Approaches,&#8221; Journal of Mechanical Design, 2021. https://asmedigitalcollection.asme.org/mechanicaldesign/article/143/3/031404/1090546/How-Should-We-Measure-Creativity-in-Engineering</p><p>13. National Institute of Standards and Technology, Artificial Intelligence Risk Management Framework: Generative Artificial Intelligence Profile, NIST AI 600-1, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf</p><p>14. U.S. Copyright Office, Copyright and Artificial Intelligence, Part 2: Copyrightability, 2025. https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-2-Copyrightability-Report.pdf</p><p>15. DORA, State of AI-assisted Software Development, 2025. 
https://dora.dev/dora-report-2025/</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Your Agent Isn't Tired. It's Drowning in Context]]></title><description><![CDATA[The useful analogy between humans and LLMs is not consciousness. It is that both fall apart when active memory, compression, and management are sloppy.]]></description><link>https://plausiblereality.com/p/your-agent-isnt-tired-its-drowning</link><guid isPermaLink="false">https://plausiblereality.com/p/your-agent-isnt-tired-its-drowning</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 25 Apr 2026 20:01:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/7305d13f-2a0d-4c36-9b87-ada47d6260a9_1731x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>You can feel this problem in your own skull first.</p><p>After a long day, you stop failing heroically and start failing stupidly. You lose the thread of a conversation that is still happening. You reread the same paragraph. 
You forget the constraint you wrote down an hour ago.</p><p>Then you watch a long-running agent do the machine version of the same thing. It misses the key detail buried in the middle of the prompt. It keeps circling a dead end. It summarizes a tool result so aggressively that the only useful fact disappears. It is not malicious. It is overloaded.</p><p>People want to turn this into philosophy. Are humans just next-token predictors? Are LLMs baby minds? I think that is the wrong debate.</p><p>The useful question is not whether humans and models are the same kind of thing. It is whether they share enough operational constraints that the comparison makes us better builders. On that question, the answer is yes.</p><p>The big similarity is boring and extremely useful: both humans and LLM systems have a limited active workspace sitting in front of a much larger store of potential information. When the workspace gets cluttered, performance gets weird.</p><p>That is why the analogies land. Brain fog and context rot feel related. Note-taking and RAG feel related. Bad evidence poisoning human recall and bad retrieval poisoning agent output feel related.</p><p>But only if you keep the guardrails on. 
Human fatigue is biological. Human memory changes during sleep. LLM memory is usually an engineered stack of context, summaries, retrieval, logs, and maybe a persistent memory layer. Treat the analogy as workflow, not metaphysics, and it becomes surprisingly sharp.</p><p style="text-align: center;">* * *</p><h1>The wrong debate is ontology</h1><p>Human working memory is small. Depending on the frame, researchers put the core limit around roughly four chunks, or about three to five meaningful chunks. That is not a lot of live slots for planning, holding constraints, and comparing alternatives.</p><p>LLMs have the same kind of bottleneck in a different substrate. The context window is the obvious version, but the deeper issue is not just how much text fits. It is how much of that text the model can still use well.</p><p>This sounds obvious, but teams miss it all the time. They hear &#8220;128K&#8221; or &#8220;1M tokens&#8221; and mentally translate that into &#8220;the model can keep everything in mind.&#8221; No. A larger whiteboard is not the same thing as better working memory.</p><p>The Lost in the Middle paper made this painfully concrete: models often perform best when relevant information sits near the beginning or end of a long prompt, and worse when the key fact is buried in the middle. NoLiMa pushed the point further by stripping away easy literal matches. Performance fell sharply as contexts grew, even before models hit their advertised limits.</p><p>So the useful analogy is not &#8220;humans are language models.&#8221; It is &#8220;both systems have a fragile active state.&#8221; What actually matters is not raw storage. It is selection, ordering, salience, and refresh.</p><p>That is also why context engineering is not prompt decoration. It is workload design. You are deciding what gets desk space right now, what gets pushed to a filing cabinet, what gets summarized, and what gets dropped.</p><p>Preference is not performance. 
Developers prefer the fantasy that one giant context will remove the need for memory architecture. Performance usually comes from the opposite move: smaller working sets, tighter task framing, and more aggressive eviction of stale state.</p><p style="text-align: center;">* * *</p><h1>Brain fog is context debt</h1><p>When people say brain fog after a long day, they are usually pointing at slower thinking, weaker attention, and worse executive control after sustained cognitive effort. Reviews of cognitive fatigue describe exactly that sort of degradation.</p><p>The model-side analogue is real, but it is not literal fatigue. Models do not need coffee. They accumulate context debt.</p><p>Context debt shows up when a session becomes an archaeological site: too many tool results, too many half-resolved threads, too many summaries of summaries, too much retrieved material with no ranking discipline. The agent can still &#8220;see&#8221; a lot of text. It just stops allocating attention to the right parts of it.</p><p>This is why giving the model a break does nothing unless the break changes the state. A human benefits from sleep, food, a walk, or just stopping. An agent benefits from pruning, compaction, retrieval, and explicit memory writes.</p><p>Anthropic&#8217;s own engineering work makes the point in a very unromantic way. In a 2025 internal agentic-search evaluation, context editing alone improved performance by 29 percent over baseline, and context editing plus a memory tool improved it by 39 percent. In a 100-turn search evaluation, context editing also cut token consumption by 84 percent while letting workflows complete that would otherwise fail from context exhaustion.</p><p>That is the punchline. The lever is not empathy for the model. The lever is state management.</p><p>I think this matters because a lot of teams still debug agents as if the main problem were intelligence. Sometimes it is. 
Much more often the agent is doing the exact wrong thing with the exact wrong pile of context that you handed it.</p><p style="text-align: center;">* * *</p><h1>Memory is compression, not recording</h1><p>Human memory is not a replay buffer. It is reconstructive. Reviews of hippocampal-neocortical memory transformation describe a shift from detail-rich traces toward gist-like and schematic representations over time. Sleep-dependent replay is thought to help the hippocampus teach the neocortex, turning specific episodes into more reusable knowledge.</p><p>That is a terrible metaphor for most production AI memory features, which is exactly why people keep getting confused.</p><p>In most deployed systems, &#8220;memory&#8221; means one of four things: model weights, current context, a retrieved external store, or a persistent note that can be reinserted later. The original RAG paper drew the core line here: parametric memory (what lives in the weights) versus non-parametric memory (what is retrieved from an external store). That distinction still does a lot of useful work.</p><p>OpenAI&#8217;s current memory documentation is explicit that saved memories are stored separately from chat history. Anthropic&#8217;s current memory tooling is explicit about a different boundary: the memory tool is client-side, and developers control where the data is stored. In both cases, the practical picture is much closer to notebook plus recall policy than to anything like autonomous consolidation.</p><p>So when teams say, &#8220;the model learned this during the conversation,&#8221; I usually flinch. Unless you retrained or fine-tuned something, it probably did not learn in the human sense. It wrote a better sticky note.</p><p>This is also where note-taking becomes a stronger analogy than people realize. The classic note-taking literature distinguishes between an encoding effect and an external-storage effect. The act of writing notes helps you learn (encoding). Reviewing them later helps you remember (external storage).</p><p>Agent memory writes usually get only half of that. 
They are storage plus future lookup. Human note-taking is storage plus self-training.</p><p>That asymmetry explains why dumping every observation into a vector store feels disappointing. You built retention, not understanding.</p><p>The useful question is not whether to use long context or RAG as a religion. Recent comparisons suggest that long context can outperform RAG when you spend enough tokens, especially on question answering over dense sources, while RAG keeps real cost advantages and can still win on some dialogue or general-query settings. Anthropic&#8217;s Contextual Retrieval work also showed how much retrieval quality matters: their method reduced failed retrievals by 49 percent, and by 67 percent when combined with reranking.</p><p>What actually matters is tiering. Stable policy belongs in persistent memory. Source material belongs in retrieval. Immediate objectives belong in the live working set. Decisions and unresolved issues belong in a rolling summary. And facts you want the base model to truly internalize belong in training, not in wishful thinking.</p><p style="text-align: center;">* * *</p><h1>Bad evidence scales faster than good memory</h1><p>There is one more similarity that matters because it wrecks outputs quietly: both humans and models are vulnerable to bad evidence.</p><p>The misinformation effect literature showed long ago that people can absorb false post-event information about something they actually experienced and later recall the corrupted version. That is one of the clearest reminders that human memory is reconstructive, not camera-like.</p><p>Agent systems have the machine version of the same problem. A retrieved chunk can be irrelevant, noisy, outdated, or simply false. Once it enters the working set, the model often treats it as the problem definition. 
The right answer can still be somewhere in the system, and the run can still fail because the active evidence stream got polluted.</p><p>This is why provenance is not paperwork. It is part of reasoning quality.</p><p>Anthropic ran into the practical version of this while building its research system: human testers found that early agents sometimes preferred SEO-optimized content farms over better sources. That is not a cute bug. That is the whole ballgame. If the retrieval layer rewards the wrong evidence, the model will faithfully produce polished nonsense from the wrong pile of facts.</p><p>A lot of teams respond to this by adding more memory. Usually that makes it worse. Bad evidence that gets summarized, stored, and reintroduced later stops being a transient mistake and becomes durable contamination.</p><p>The useful rule is simple. Never let unaudited retrieval jump straight into persistent memory. First rank it. Then attribute it. Then compare it against an independent source if the claim matters. Only then let it harden into summary state or a saved fact.</p><p>For humans, the equivalent is obvious. We do better when we write down where a claim came from, compare notes, and correct errors quickly before they become the story we now remember. For agents, the same discipline applies, just with less dignity and more logs.</p><p>Good memory architecture is therefore inseparable from source hygiene. The problem is usually not that the system forgot. It is that it remembered the wrong thing too confidently.</p><p style="text-align: center;">* * *</p><h1>A memory stack I&#8217;d actually ship</h1><p>Here is the stack I would use for a long-running coding or research agent working on something nontrivial.</p><p>First, a task brief. One screen, maybe two. Objective, hard constraints, success criteria, source hierarchy, and the one thing the agent must not forget. This stays pinned near the top of context. 
It is the machine equivalent of the ticket you keep open on the second monitor.</p><p>Second, a tiny working set. Only the files, snippets, tool results, and subproblems relevant to the current move. Not the whole repo. Not the whole transcript. Just the live battlefield.</p><p>Third, a rolling decision log. After each major step, write down what changed, why it changed, what remains unresolved, and what evidence justified the move. This is the part most teams skip, and then they wonder why the agent starts contradicting itself 40 turns later.</p><p>Fourth, an evidence store. Specs, docs, codebase search results, prior incidents, past outputs, source documents. Searchable, ranked, and aggressively deduplicated. If this store turns into a junk drawer, the agent&#8217;s reasoning will faithfully inherit the junk.</p><p>Fifth, a handoff note. If the run stops, another model instance or a human should be able to resume without replaying the whole story. Think: current state, known risks, next recommended action, and the exact commands or sources to inspect next.</p><p>Notice what is missing: &#8220;just shove everything into the prompt and pray.&#8221;</p><p>Imagine a repo-migration agent moving a medium-sized service from one internal SDK to another.</p><p>If you dump the whole codebase, migration guide, old tickets, and prior failed attempts into one huge prompt, the agent will look busy and unreliable at the same time. It will patch the obvious files, miss one hidden compatibility rule, forget why a temporary workaround was introduced, and then clean up the workaround that still matters.</p><p>With the stack above, the run looks different. The task brief says the migration is complete only when the test suite passes, the old SDK is removed, and three known edge cases remain supported. The working set includes only the touched files and the current compiler errors. The decision log records why a shim was kept. 
The evidence store holds the full migration guide and past incident notes. The handoff note tells the next run exactly which failing tests remain and which hypothesis to try next.</p><p>That is not a smarter model. It is a less amnesic workplace.</p><p>This is not fancy. It is the same pattern good teams use with humans. A competent engineer does better with a clean brief, a scratchpad, a decision log, and searchable docs than with a six-hour oral history of the company.</p><p>The difference is that humans can sometimes reconstruct missing intent from context, social cues, and common sense. Models are much more literal. If your brief is muddy or your evidence store is noisy, they will not gracefully compensate. They will operationalize the confusion.</p><p>One more thing: memory writes should be selective and typed. Do not save &#8220;something important happened.&#8221; Save &#8220;decision,&#8221; &#8220;fact with source,&#8221; &#8220;open question,&#8221; &#8220;user preference,&#8221; or &#8220;risk.&#8221; Bad memory schemas create memory sludge, and memory sludge is just another name for future context rot.</p><p style="text-align: center;">* * *</p><h1>Managing agents is management without the human parts</h1><p>This is the part where the analogy becomes slightly annoying for managers.</p><p>A surprising amount of agent work really does look like management. You define the task. Bound the scope. Give the right tools. Preserve the important context. Review outputs. Tighten the feedback loop. Anthropic&#8217;s guidance on building effective agents keeps coming back to the same boring point: simple, composable patterns beat elaborate frameworks more often than teams want to admit.</p><p>But managing agents is not people management. It is management with the tacit social repair stripped out.</p><p>Humans ask clarifying questions. Or at least the good ones do. 
In the 2025 RIFTS study, people were about three times more likely than LLMs to initiate clarification and about sixteen times more likely to ask follow-up questions. Early grounding failures also predicted downstream breakdowns. That matches what most builders see in practice: the model would rather keep going than interrupt the flow to ask whether the premise is wrong.</p><p>With a human, ambiguity can sometimes heal socially. Someone notices the weirdness, reads the room, and asks, &#8220;Wait, which database do you mean?&#8221; With an agent, ambiguity often gets laundered into action. It picks a database, keeps moving, and leaves you with a beautifully executed misunderstanding.</p><p>That is why good agent managers become obsessive about four things.</p><p>They define escalation rules: when to ask, when to search, when to stop, when to hand off.</p><p>They design tools like interfaces, not magic powers. Anthropic&#8217;s multi-agent research system only got reliable when the orchestrator learned how to delegate clearly and when tool descriptions became distinct enough to prevent duplicated work, gaps, and wrong-path behavior.</p><p>They build observability. Agents are easier to inspect than humans. Prompts, tool calls, summaries, and logs are all visible. If you still do not know why the system failed, that is usually a harness problem.</p><p>And they evaluate early with small, real tasks instead of waiting for an imaginary perfect benchmark. Again, the lesson from Anthropic&#8217;s agent work is boring and right: tight feedback loops beat grand architecture.</p><p>I think the cleanest framing is this: managing agents is less like motivating employees and more like designing a cockpit. The job is not to inspire the pilot. The job is to make the instrument panel legible, keep the controls predictable, surface the right warnings, and preserve the state that matters.</p><p>The wrong debate is whether agents are becoming more human. 
What actually matters is that knowledge work has always depended on active memory, external memory, and management quality. LLM systems just make that dependency brutally literal.</p><p>So yes, the analogy works. Brain fog and context rot rhyme. Note-taking and RAG rhyme. Bad evidence corrupts both human recall and machine output. But the practical lesson is not mystical. It is architectural.</p><p>If your agent keeps losing the plot, buy less magic and build more memory. Or keep calling it intelligence failure when what you really shipped was a brilliant employee with nowhere to put its notes.</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><p>1. Nelson Cowan, &#8220;On the capacity of attention: Its estimation and its role in working memory and cognitive aptitudes,&#8221; Cognitive Psychology 51, no. 1 (2005).</p><p>2. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, &#8220;Lost in the Middle: How Language Models Use Long Contexts,&#8221; Transactions of the Association for Computational Linguistics 12 (2024): 157-173.</p><p>3. Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Sch&#252;tze, &#8220;NoLiMa: Long-Context Evaluation Beyond Literal Matching,&#8221; arXiv:2502.05167 (2025).</p><p>4. Mathias Pessiglione and colleagues, &#8220;Origins and consequences of cognitive fatigue,&#8221; Trends in Cognitive Sciences (2025).</p><p>5. Jessica Robin and Morris Moscovitch, &#8220;Details, gist and schema: hippocampal-neocortical interactions underlying recent and remote episodic and spatial memory,&#8221; Current Opinion in Behavioral Sciences 17 (2017).</p><p>6. Dhairyya Singh, Kenneth A. Norman, and Alison C. Schapiro, &#8220;A model of autonomous interactions between hippocampus and neocortex driving sleep-dependent memory consolidation,&#8221; Proceedings of the National Academy of Sciences 119, no. 44 (2022).</p><p>7. 
Kenneth A. Kiewra, &#8220;A review of note-taking: The encoding-storage paradigm and beyond,&#8221; Educational Psychology Review 1 (1989): 147-172.</p><p>8. Patrick Lewis and colleagues, &#8220;Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,&#8221; NeurIPS 33 (2020).</p><p>9. Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky, &#8220;Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach,&#8221; arXiv:2407.16833 (2024).</p><p>10. Xinze Li, Yixin Cao, Yubo Ma, and Aixin Sun, &#8220;Long Context vs. RAG for LLMs: An Evaluation and Revisits,&#8221; arXiv:2501.01880 (2025).</p><p>11. Michael S. Ayers and Lynne M. Reder, &#8220;A theoretical review of the misinformation effect: Predictions from an activation-based memory model,&#8221; Psychonomic Bulletin &amp; Review 5 (1998): 1-21.</p><p>12. Anthropic, &#8220;Contextual Retrieval in AI Systems,&#8221; September 19, 2024.</p><p>13. OpenAI, &#8220;Memory FAQ,&#8221; updated 2026; OpenAI, &#8220;Memory and new controls for ChatGPT,&#8221; updated June 3, 2025.</p><p>14. Anthropic, &#8220;Managing context on the Claude Developer Platform,&#8221; September 29, 2025; Anthropic, &#8220;Effective context engineering for AI agents,&#8221; September 29, 2025.</p><p>15. Anthropic, &#8220;Building effective agents,&#8221; December 19, 2024.</p><p>16. Anthropic, &#8220;How we built our multi-agent research system,&#8221; June 13, 2025.</p><p>17. Omar Shaikh and colleagues, &#8220;Navigating Rifts in Human-LLM Grounding: Study and Benchmark,&#8221; arXiv:2503.13975 (2025).</p>]]></content:encoded></item><item><title><![CDATA[Your Org Chart Is Not Your Operating Model]]></title><description><![CDATA[Why healthy software teams keep accountability close to the product, centralize leverage, and design for knowledge resilience]]></description><link>https://plausiblereality.com/p/your-org-chart-is-not-your-operating</link><guid isPermaLink="false">https://plausiblereality.com/p/your-org-chart-is-not-your-operating</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Thu, 23 Apr 2026 20:01:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f74d3c5f-0ab9-4752-a096-0f432408315a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The research behind this piece makes a point that should not be controversial, but somehow still is: assigning work in engineering is not a staffing exercise. It is an operating-model decision.</p><p>Every time a manager hands the ugliest migration to the same senior engineer, keeps support safely away from product teams, or leaves incident response as a vague &#8220;engineering problem,&#8221; they are making a bet. The bet is partly about speed. 
It is also about who grows, who gets overloaded, how customers get heard, and how many parts of the business become quietly dependent on the same human.</p><p>I think most software org-design arguments start in the wrong place. Teams argue about centralized versus decentralized support. They argue about whether &#8220;you build it, you run it&#8221; is enlightened or cruel. They argue about whether PMs or EMs &#8220;really&#8221; own delivery. Then production wobbles, an enterprise customer escalates, and the truth appears in plain view: nobody can quite say who owns the outcome, who is allowed to decide trade-offs, or who is supposed to help without taking the work away.</p><p>The useful question is not centralized or decentralized. It is this: what should stay close to the product, what should be standardized for leverage, and how much support does a person need for this specific task, in this specific context, right now?</p><p>That is a much less ideological question. 
It is also much more useful.</p><p style="text-align: center;">* * *</p><h1>Staffing for Speed Usually Breaks the System</h1><p>The research frames engineering assignment as a balance among three goals: immediate efficiency for the business, growth and challenge for the individual, and long-term durability for the team through broader knowledge distribution. That is exactly the right frame.</p><p>Most teams still optimize for the first goal because it feels measurable. There is a deadline. There is a waiting customer. There is one engineer who has seen this weird subsystem before. So the same person gets the ticket, the migration, the escalation, the architecture call, and the after-hours Slack ping. On paper that looks efficient. In practice it is a loan with ugly interest.</p><p>You get the feature faster this month. You also get a narrower bench, a slower onboarding path for everyone else, and a team that keeps confusing expertise with exclusivity. You teach the organization that the safest path is always &#8220;give it to the person who already knows.&#8221; Then you act surprised when nothing scales except that person&#8217;s calendar.</p><p>Preference is not performance. A team&#8217;s preference for the known expert is often just a preference for lower short-term anxiety. Performance is whether the organization can keep shipping, supporting, and learning when that expert is on vacation, on a different priority, or out the door.</p><p>This is why the durability point matters so much. Low bus factor is not an abstract cleanliness complaint from architecture enthusiasts. It is an operating risk. If only one or two people can explain a service, debug it under pressure, and change it safely, the service is fragile no matter how many dashboards you bought.</p><p>One reason this keeps getting missed is that software teams love the language of ownership and hate the cost of building it. 
Real ownership requires stable enough priorities for people to learn their domains, supportive enough leadership for people to stretch into new ones, and good enough developer experience that the team is not burning all its time on friction. That is not glamorous. It is just what the research keeps saying.</p><p>This does not mean romanticizing rotation. That is another management hobby that goes bad quickly. Stretch work is good. Blind rotation is not. Research on job rotation in software organizations has found real upside&#8212;broader knowledge, more variety, stronger collaboration&#8212;but also real costs, including extra coordination and slower ramp-up when it is done badly.</p><p>Managers need to stop treating &#8220;let&#8217;s rotate people more&#8221; like a moral virtue. The point is resilience and growth, not churn.</p><p>So yes, give newer engineers stretch work. Yes, move knowledge across boundaries. But budget the activation energy. Pair first. Shadow on-call. Use scoped migrations. Run design reviews. Rotate the parts of ownership that spread understanding without detonating reliability.</p><p>This sounds obvious, but teams miss it constantly. They think they are making a staffing choice. They are actually choosing what kind of system they will have six months later.</p><p style="text-align: center;">* * *</p><h1>Roles Matter Less Than Decision Rights</h1><p>The research is strongest when it treats work allocation as a managerial lens. It is thinner when you ask the more dangerous question: who actually owns what when the pressure arrives?</p><p>This is where titles become a trap. The wrong debate is &#8220;Which title owns this?&#8221; The useful question is &#8220;Who is accountable for the outcome, who is doing the work, who is supporting it, and who decides trade-offs when things get expensive?&#8221;</p><p>The boundaries are not mysterious. Product managers decide which problems matter, where capacity goes, and how success is measured. 
Engineering managers build the team system: staffing, coaching, feedback, delivery environment, and capacity over time. Developers and technical leads carry technical execution and local design judgement. Support resolves customer issues, manages response expectations, and escalates well. Customer success drives adoption, value realization, and retention over time. SRE and platform teams reduce operational toil and cognitive load. QA helps shape test strategy and risk-based validation, but quality is still a team outcome.</p><p>The important part is not memorizing the boxes. The important part is separating decision rights from collaboration rights.</p><p>A PM should not need to line-manage engineers to own prioritization. An EM does not need to be the sole architecture authority to own team health and delivery capacity. A support team should not become the default product manager just because it sees a lot of pain. Customer success should not be used as a renamed support desk. SRE should not become a generic queue for every operational mess product teams would rather not think about.</p><p>One person may wear multiple hats in a small company. That is fine. But the hats still need names.</p><p>If your incident process begins with &#8220;who knows this area?&#8221; and your escalation process ends with &#8220;can someone pull in Alex?&#8221;, the org chart is decorative. What you actually have is an informal routing network disguised as a company.</p><p>That is why good operating models obsess over explicit ownership, escalation paths, service boundaries, and discoverability. Not because bureaucracy is fun. Because ambiguity is expensive at exactly the moments when nobody has time for ambiguity.</p><p>It is also why multidisciplinary teams keep showing up in serious operating models. When product, engineering, design, delivery, and operational concerns have to coordinate through endless handoffs, the organization learns slowly and blames quickly. 
Pulling the right capabilities close to the value stream is not fashionable theory. It is a way to reduce queueing, argument, and institutional amnesia.</p><p style="text-align: center;">* * *</p><h1>The Best Support Model Is Usually a Layered One</h1><p>I think the centralize-versus-decentralize argument survives mainly because it is emotionally satisfying. It lets everyone keep their favorite ideology.</p><p>Builders like autonomy. Operators like consistency. Finance likes shared services. Founders like speed. Everybody can find a slogan.</p><p>Real organizations do not run on slogans.</p><p>Small teams often make a largely decentralized model work. &#8220;You build it, you run it&#8221; can be perfectly rational when context is dense, coordination cost is low, and the same people who ship the feature can still answer the customer question without starting a three-team ritual. In that stage, all-hands customer exposure is useful. Support can be lightweight. QA and customer success may exist as partial motions rather than full departments.</p><p>Scale changes the math.</p><p>Once you have several product teams, more enterprise customers, tighter contracts, more compliance needs, or a larger production surface area, pure autonomy starts producing duplicated tools, fuzzy escalation, and very expensive reinvention. 
That is when hybrid models show up, and for good reason.</p><p>What actually matters is service layering.</p><p>Centralize the capabilities that benefit from consistency and leverage: observability defaults, CI/CD primitives, service catalogs, golden paths, testing frameworks, access patterns, incident playbooks, escalation mechanics, and reusable platform abstractions.</p><p>Decentralize the responsibilities that need product intimacy and fast feedback: feature trade-offs, roadmap choices, local design decisions, day-to-day operations, and most ownership of live services.</p><p>Embed or closely align specialist expertise where the cost of failure is high: SRE for reliability-critical systems, security for regulated workflows, QA for complex validation, support engineering where customer escalations are technical and time-sensitive.</p><p>This is also where one of the most common category errors shows up. Support and customer success are not the same thing just because both talk to customers. Reactive troubleshooting and proactive value realization are different motions, different metrics, and usually different staffing profiles. The moment post-sales complexity appears, collapsing them into one vague customer team is usually just a polite way to hide confusion.</p><p>That split matters operationally. When a customer says, &#8220;Billing is broken,&#8221; support should be focused on case resolution, severity, reproducibility, and clear escalation. Customer success should be focused on business impact, adoption risk, and what needs to happen so the customer still gets value from the relationship next quarter. Those are adjacent jobs. They are not identical jobs.</p><p>The hidden benefit of separating those motions is cleaner signal. Support sees repeated technical failure. Customer success sees adoption friction and value gaps. Product sees the pattern and decides whether the work belongs in the roadmap. 
When those signals are blurred together, the organization starts mistaking urgency for importance. You become excellent at handling escalations and mediocre at removing the causes.</p><p>Platform teams have their own failure mode. Everyone says they want a platform team until the platform team becomes an internal procurement department. The point of a platform is to reduce cognitive load and duplicated effort for delivery teams. It should behave like a product with internal users, adoption goals, feedback loops, and a healthy fear of becoming a ticket sink.</p><p>The pattern I keep coming back to is simple: keep accountability close to the product, centralize leverage, and be explicit about the interfaces. That is not as catchy as a bumper sticker. It is much closer to how durable organizations actually work.</p><p style="text-align: center;">* * *</p><h1>A Better Default for a Messy Scale-Up</h1><p>Take a fairly normal scale-up. Eight product squads. A growing enterprise customer base. Two legacy services nobody wants to touch. One billing system that only a principal engineer truly understands. Support tickets that bounce between account managers and engineers. Customer success people dragged into technical debugging because they own the relationship. Incidents that are resolved through a heroic chain of DMs instead of an obvious path.</p><p>Nobody thinks this is the design. It is just the sediment of six quarters of urgency.</p><p>Here is the redesign I would make.</p><p>First, assign explicit service ownership to product teams and publish it somewhere boring and easy to find. Not a wiki graveyard. A living service catalog or internal portal that answers three questions fast: who owns this, how do I escalate it, and where is the runbook.</p><p>Second, separate reactive and proactive post-sales work. Support handles ticket intake, reproduction, response expectations, and known resolution paths. 
Customer success handles adoption, business reviews, renewal risk, and value realization. They stay tightly linked, but they stop pretending to be the same function.</p><p>Third, add specialist overlays where risk justifies them. If incidents are frequent or blast radius is high, put SRE or production engineering close enough to influence design, not just mop up afterwards. If release friction is high or defects escape late, add QA capability that helps teams improve validation instead of merely catching bugs at the end.</p><p>Fourth, invest in platform only where duplication is clearly systemic. If every squad is separately solving environment setup, observability wiring, deployment conventions, or service metadata, that is not autonomy. That is repeated tax. Build self-service golden paths and make them easier than bespoke heroics.</p><p>Fifth, map knowledge concentration directly. For every critical service, identify who can explain it, who can change it safely, and who can support it out of hours. If the same names appear everywhere, you do not have senior talent. You have a resilience gap.</p><p>Sixth, calibrate support by task maturity, not title. A senior engineer new to distributed billing may need close pairing. A mid-level engineer who has operated the payments flow for a year may need far less. This sounds mundane. It is one of the most missed management moves in software.</p><p>Seventh, review the model with a balanced scorecard. Not just delivery speed. Look at support backlog age, escalation quality, adoption signals, on-call load, incident recovery time, change failure rate, and coverage depth across critical systems. If you only measure output, you will recreate the same fragility with nicer labels.</p><p>The key point is that this redesign is slower in month one and better by quarter two. That trade-off is hard for leaders who want immediate neatness. But the alternative is worse. You pay the coordination cost anyway. 
You just pay it through unplanned Slack archaeology, heroic escalation, and customers discovering your org chart before you do.</p><p>The problem is usually not that people are lazy or territorial. It is that the interfaces were never designed.</p><p style="text-align: center;">* * *</p><h1>Your Operating Model Shows Up Under Stress</h1><p>There is no universally correct support model for software organizations. Startups can survive with dense local ownership and lightweight specialist help. Scale-ups usually need product teams with end-to-end accountability plus support, success, platform, and reliability overlays. Larger or regulated environments usually land in a federated model with local ownership, central standards, and embedded specialists where risk is highest.</p><p>That is not inconsistency. That is fit.</p><p>What actually scales is not autonomy by itself or centralization by itself. It is clarity.</p><p>And clarity is not a meeting. It is a discoverable set of defaults. People should know where to look, who to call, what level of response applies, and who can make the call when speed and safety conflict. If that information lives only in the heads of veterans, you do not have clarity. You have oral tradition.</p><p>Clarity about who decides what.</p><p>Clarity about which problems belong close to the product.</p><p>Clarity about what gets standardized for leverage.</p><p>Clarity about how customers, support, and product learn from one another.</p><p>Clarity about where critical knowledge lives, and whether the organization can survive without its favorite heroes for a week.</p><p>I think managers underestimate how often org-design failure is disguised as a people problem. &#8220;We need stronger engineers.&#8221; &#8220;We need better collaboration.&#8221; &#8220;We need more ownership.&#8221; Sometimes, sure. But often the team is behaving rationally inside a blurry system.</p><p>Your org chart is not your operating model. 
Your operating model is the set of decisions, escalations, handoffs, and defaults that show up when something important breaks or a customer asks a hard question.</p><p>If the only thing holding that system together is that three veterans answer Slack fast and remember which doc is secretly current, you do not have a mature software organization.</p><p>You have folklore with payroll.</p><p>And that is fine for a while.</p><p>It is not a strategy.</p><p>The slightly annoying question is this: if your best engineer vanished for two weeks, would delivery dip because they are exceptional, or because your operating model quietly turned them into infrastructure?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><blockquote><p>1. Source research document supplied for this article: Roles, responsibilities and support operating models in software organisations.</p><p>2. DORA. Accelerate State of DevOps Report 2024.</p><p>3. Team Topologies. Key Concepts.</p><p>4. Google. Site Reliability Engineering book, &#8220;Introduction.&#8221;</p><p>5. Amazon Web Services. AWS Well-Architected Framework, &#8220;Relationships and ownership&#8221; and &#8220;Mechanisms exist to manage responsibilities and ownership.&#8221;</p><p>6. GitLab Handbook. &#8220;Product Manager.&#8221;</p><p>7. GitLab Handbook. &#8220;Engineering Manager.&#8221;</p><p>8. GitLab Handbook. &#8220;Support Engineer Responsibilities.&#8221;</p><p>9. GOV.UK Service Manual. &#8220;Set up a service team at each phase.&#8221;</p><p>10. HubSpot. &#8220;Your Customer Success Team.&#8221;</p><p>11. 37signals / Signal v. Noise. &#8220;Everyone on Support.&#8221;</p><p>12. Jabrayilzade, Evtikhiev, T&#252;z&#252;n, and Kovalenko. &#8220;Bus Factor In Practice.&#8221; ICSE 2022.</p><p>13. Santos, Silva, and Magalh&#227;es. &#8220;Benefits and Limitations of Job Rotation in Software Organizations: A Systematic Literature Review.&#8221; EASE 2016.</p><p>14. TSIA. &#8220;What Is Customer Success? 
Definition, Importance, and Benefits.&#8221;</p></blockquote>]]></content:encoded></item><item><title><![CDATA[You Are Probably Rebuilding DSPy Already]]></title><description><![CDATA[Why prompt tables, eval scripts, and model wrappers keep turning into the same architecture.]]></description><link>https://plausiblereality.com/p/you-are-probably-rebuilding-dspy</link><guid isPermaLink="false">https://plausiblereality.com/p/you-are-probably-rebuilding-dspy</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Wed, 22 Apr 2026 20:01:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a32e62a2-415e-4ffc-b1c1-886e8eb1af84_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A surprising number of teams will tell you, with a straight face, that they are not &#8220;doing DSPy.&#8221; Then they show you the system: prompts moved out of code, Pydantic schemas everywhere, retry wrappers, a retrieval step, an eval script, and a provider abstraction so somebody can test Claude next week. 
At that point the honest description is not &#8220;we skipped DSPy.&#8221; It is &#8220;we rebuilt the shape of it by accident.&#8221;</p><p>The wrong debate is whether DSPy wins some framework cage match. The useful question is simpler: when does a pile of locally sensible fixes turn into architecture, and do you want that architecture to be deliberate or improvised?</p><p>I do not think every team should stop and rewrite into DSPy tomorrow. That is not my point. I think DSPy is best understood as a destination architecture, and more useful as a mirror than as a mascot. It shows you the engineering boundary most LLM systems eventually need once the cute demo gets exposed to product managers, traffic, regressions, and model churn.</p><p>That is why Skylar Payne&#8217;s recent piece landed. It does not prove that every serious team secretly runs DSPy. It names the rewrite pattern teams fall into when they start with a raw API call and keep patching their way toward something change-safe. And the official DSPy materials, plus public cases from Databricks, JetBlue, Replit, and others, make the bigger point hard to ignore: this is not just research taste. 
It is software shape.</p><p style="text-align: center;">* * *</p><h1>Stop Calling This a Framework Choice</h1><p>If you only know DSPy from the slogan, it is easy to misfile it as &#8220;that prompt optimization library from Stanford.&#8221; That undersells it.</p><p>DSPy describes itself as a declarative framework for building modular AI software. Its signatures define what a module should take in and produce. Its modules compose those signatures into programs. Its optimizers tune prompts or weights against explicit metrics. The paper says the same thing more formally: treat LM pipelines as structured programs instead of long prompt strings discovered by trial and error.</p><p>That what-versus-how split is the whole game. A signature tells the model what behavior you need, not the exact prompt text you hand-crafted on a tired Tuesday night. That matters because prompt text stops being a harmless implementation detail the moment output format, retrieval quality, cost, latency, and model portability start affecting the business.</p><p>This is why I think the usual DSPy-versus-LangChain argument is often the wrong argument. DSPy itself says it is not trying to be a batteries-included catalog of prebuilt app templates. It is closer to a programming model for LM pipelines. If you want off-the-shelf chains, other libraries optimize for that. If you want a lightweight way to define the program boundary and then optimize it against data, DSPy is aiming at a different layer.</p><p>What actually matters is not whether you enjoy the abstractions. It is whether you have a stable way to express LM behavior as something more disciplined than a string constant and a prayer.</p><p>And no, this is not just a clever academic diagram. DSPy&#8217;s official use-cases page currently lists public or production-facing examples from JetBlue, Replit, Databricks, Zoro UK, VMware, Moody&#8217;s, and others. 
Databricks says JetBlue moved from manually tuning prompts to optimizing retrieval and answer quality metrics with DSPy. Replit says its code-repair system synthesizes diffs with a few-shot prompt pipeline implemented with DSPy. That is not proof that DSPy is the default industry center of gravity. It is proof that the model is not hypothetical.</p><p>You can disagree with the taste. You can dislike the ergonomics. You can decide the timing is wrong for your team. Fine. But the useful question is not whether DSPy feels elegant. It is whether your system already needs typed boundaries, modular composition, explicit metrics, and a way to survive model churn without rewriting the plumbing every quarter.</p><p style="text-align: center;">* * *</p><h1>The Seven-Step Rewrite Nobody Plans</h1><p>Payne&#8217;s essay is useful because it compresses a year of LLM system drift into a sequence any engineer can recognize.</p><p>First comes the inline prompt in application code. Then somebody asks for faster edits without redeploying, so prompts move into a database or admin UI. Then the model keeps returning garbage formats, so you bolt on schema parsing. Then production teaches you humility and you add retries. Then you need retrieval. Then you finally build evals because nobody can tell whether the last prompt change helped or broke three other cases. Then leadership wants to test another model and you discover your codebase is basically a shrine to one provider&#8217;s client library.</p><p>None of those moves is stupid. That is the important part. A prompt table is often sensible. Typed parsing is sensible. Retries are sensible. Retrieval is sensible. Evals are very sensible. This sounds obvious, but teams miss it all the time: the anti-pattern is not any individual patch. 
The anti-pattern is leaving every patch as a bespoke local repair instead of lifting it into a coherent program abstraction.</p><p>Here is a boring example, because boring examples are where architecture gets expensive.</p><p>Imagine an internal support triage service. Week one, you write one model call that labels urgency and suggested queue. It works well enough. Week three, ops wants to tweak wording without waiting for a deploy, so now you have a prompt table. Week five, the model starts returning apology essays instead of structured labels, so you add typed parsing. Week eight, transient failures and parse errors mean retries and fallbacks. Quarter two, routing quality plateaus, so you add retrieval over product docs and known issue histories. Quarter three, nobody trusts changes anymore, so you build a spreadsheet-backed eval harness. Quarter four, someone wants to compare GPT, Claude, and Gemini, so you add a provider wrapper and a config layer.</p><p>At no point will the team say, &#8220;Great, we are now intentionally designing an LM program.&#8221; They will say, &#8220;We were being pragmatic.&#8221;</p><p>Maybe. But pragmatism does not magically erase architecture. It just makes the architecture show up as scar tissue.</p><p>That is why Payne&#8217;s framing works. He is not saying the early patches are foolish. He is saying they are local optimizations that become bad global design when they stay informal. A database prompt store does not solve ownership. A parser does not solve modularity. A retry decorator does not solve execution policy. An eval script does not solve reproducibility. And a provider wrapper added in month nine is still a late refactor.</p><p>That is also why the adoption story is so interesting. PyPI Stats, which is only a rough proxy, currently shows dspy at about 5.7 million monthly downloads versus roughly 229.9 million for langchain. So this is clearly not a story where DSPy already won the framework popularity contest. 
It is the opposite. The interesting thing is that DSPy&#8217;s ideas can be directionally right while its direct adoption still trails far behind the broader application framework ecosystem. Being right is not the same as being easy.</p><p style="text-align: center;">* * *</p><h1>Why Good Teams Still Put It Off</h1><p>Why do good teams still put this off? I think there are four boring reasons.</p><p>First, DSPy asks you to think earlier than you want to think. Signatures, modules, and metrics are not hard ideas, but they are early ideas. When the first raw API call works, the emotional reward is &#8220;ship it,&#8221; not &#8220;step back and formalize the boundary.&#8221; Payne nails this. The abstractions feel expensive before the pain is visible.</p><p>Second, optimization only gets interesting once you have data and a credible metric. DSPy&#8217;s own evaluation docs say even 20 examples can be useful, which is encouraging, but it still means you need a development set and a definition of good. Many teams do not have that on day one. They have vibes, a demo, and an impatient stakeholder. Metrics arrive later, usually after a few expensive surprises.</p><p>Third, stack friction is real. DSPy is a Python framework. That is fine if you are already Python-heavy. It is less fine if your production surface is TypeScript, Java, or .NET and nobody wants to introduce a fresh Python island just to get cleaner abstractions. Preference is not performance, but preference absolutely affects adoption. Teams will often choose to recreate the ideas inside their native stack before they choose a center-of-gravity shift.</p><p>Fourth, the old critique that DSPy is &#8220;just research code&#8221; is less convincing than it used to be, but the concern did not come from nowhere. The current docs now include a production overview, deployment guides, async support, observability, and optimizer tracking. That matters. The picture is better than a lot of skeptics assume. 
But DSPy is still not the whole production stack. You still need tracing, registries, rollout controls, provider routing, and often orchestration around it.</p><p>This is the part the market is quietly admitting. MLflow now ships prompt registry and prompt optimization, with aliases like staging and production plus evaluation integration. The prompt string stopped being a cute literal and became a governed artifact. Once that happens, teams need version history, quality gates, routing control, and rollback semantics whether or not they ever type &#8220;import dspy&#8221;.</p><p>So I do not read the ecosystem as evidence against DSPy. I read it as evidence that teams are discovering the same pressure from different angles. Some adopt DSPy directly. Some buy or build the surrounding layers first. Some do both. The shape keeps converging.</p><p>The problem is usually not whether a team believes in DSPy. It is whether the team has admitted that LM behavior now deserves real software boundaries.</p><p style="text-align: center;">* * *</p><h1>The Brownfield Move That Actually Works</h1><p>That leads to the only adoption pattern I actually trust: brownfield, narrow, metric-first.</p><p>Do not start with a grand rewrite. Do not begin with your most theatrical agent. Pick one repeated LM task that already causes pain. A classifier. An extractor. A reranker. One retrieval stage. Something boring enough that improvement is measurable and failure is legible.</p><p>Then write one signature. Not ten. One. Define the typed input and the typed output you actually need. After that, wrap your current logic inside one module. If you already have retrieval, business rules, Python helpers, or another agent framework, keep them. DSPy&#8217;s custom module guide explicitly supports integrating external tools and services. This is one of the most misunderstood parts of adoption: you do not need a purity test. You need one stable boundary.</p><p>Now collect a small development set. 
Again, boring wins. Pull 20 to 50 ugly real examples. Not the clean internal demo prompts. The weird tickets. The malformed inputs. The cases that made people complain. DSPy&#8217;s docs explicitly say even 20 examples can be useful. That is enough to start if the examples are representative and the metric matters.</p><p>Then define a metric that your business would actually care about. Exact match for queue assignment. Pass rate from a human reviewer. Structured output validity plus field-level accuracy. Something you can explain to a skeptical engineer without saying &#8220;the vibes improved.&#8221;</p><p>Only after that should you turn on optimization.</p><p>I would also turn on observability before I touched an optimizer. Locally, inspect_history is already useful. For shared work, DSPy and MLflow now support tracing and optimizer tracking. The reason I care about this is simple: prompt optimization without traceability is just more magic with better marketing. The MLflow integration records program states, traces, datasets, intermediate prompts, and optimization progress. That is much closer to a real engineering audit trail.</p><p>If the optimizer does not move the metric, stop. Treat optimization as an experiment with a stop condition, not a ritual. This is where a lot of teams waste time. They hear &#8220;DSPy has optimizers&#8221; and start compiling before they have a stable baseline. Then they cannot tell whether the improvement is real, task-specific, or mostly noise.</p><p>If you are not a Python-first shop, keep the blast radius small. DSPy&#8217;s deployment tutorial shows two straightforward options: serve the program behind FastAPI for a lightweight REST boundary, or package it with MLflow for a more managed deployment flow. The MLflow path even recommends the llm/v1/chat task format so the deployed interface lines up with the OpenAI chat API shape that many applications already expect. 
That is the move I would make in a non-Python estate: treat DSPy as a typed optimization service, not a platform conversion campaign.</p><p>Here is a concrete version.</p><p>Suppose your team owns an internal assistant that routes support tickets. Leave the front end alone. Leave auth alone. Leave your retrieval index alone. Replace just the routing prompt. Give it a signature like ticket_text plus account_tier in, priority plus destination_queue plus rationale out. Wrap the existing retrieval call and the routing step in one module. Pull 30 painful historical tickets. Score queue match and human accept rate. Trace every run for a week. Then run a light optimization pass.</p><p>If the metric improves and the traces make sense, keep that module and move on. If not, you learned something cheaply.</p><p>That is the part people undervalue. Good brownfield adoption is not just a path to success. It is a path to a cheap &#8220;no.&#8221;</p><p>And keep the layer model straight. Use DSPy for program structure, metrics, and optimization. Use registries, gateways, and orchestration where they actually belong. The official docs already assume this kind of layering by pointing to FastAPI, MLflow, async execution, streaming, and observability rather than pretending the framework is a whole universe. I think that is the healthiest way to use it.</p><p>DSPy is not a religion. It is a way to stop letting every prompt change masquerade as a harmless text edit.</p><p>I think the teams that ship durable LM systems will look slightly boring from the outside. Fewer hero prompts. Fewer framework arguments. More typed boundaries, more evals, better traces, cleaner rollbacks.</p><p>That is why the useful question is not &#8220;Are we using DSPy?&#8221; It is &#8220;How many DSPy-shaped problems are already on our backlog, and why are we pretending they are unrelated?&#8221;</p><p>You can adopt the framework. You can steal the ideas. 
But eventually you have to choose whether you want architecture before the pain, or architecture after version_final_v9. Which kind of pragmatist are you?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><p>1. Omar Khattab et al., DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, ICLR 2024 / arXiv preprint, 2023.</p><p>2. DSPy official site, &#8220;Programming-not prompting-LMs,&#8221; accessed April 18, 2026.</p><p>3. DSPy documentation, &#8220;Signatures,&#8221; accessed April 18, 2026.</p><p>4. DSPy documentation, &#8220;DSPy Optimizers,&#8221; accessed April 18, 2026.</p><p>5. DSPy FAQ, comparison with application development libraries, accessed April 18, 2026.</p><p>6. Skylar Payne, &#8220;If DSPy is So Great, Why Isn&#8217;t Anyone Using It?&#8221;, March 21, 2026.</p><p>7. DSPy community, &#8220;Use Cases,&#8221; accessed April 18, 2026.</p><p>8. Databricks, &#8220;Optimizing Databricks LLM Pipelines with DSPy,&#8221; May 23, 2024.</p><p>9. Replit, &#8220;Building LLMs for Code Repair,&#8221; April 5, 2024.</p><p>10. DSPy documentation, &#8220;Evaluation Overview,&#8221; accessed April 18, 2026.</p><p>11. DSPy documentation, &#8220;Deployment,&#8221; accessed April 18, 2026.</p><p>12. DSPy documentation, &#8220;Using DSPy in Production,&#8221; &#8220;Debugging and Observability,&#8221; and &#8220;Tracking DSPy Optimizers with MLflow,&#8221; accessed April 18, 2026.</p><p>13. PyPI Stats, package pages for dspy and langchain, accessed April 18, 2026.</p><p>14. 
MLflow, &#8220;Prompt Registry for LLM and Agent Applications&#8221; and &#8220;Prompt Optimization: Automate Prompt Engineering,&#8221; accessed April 18, 2026.</p>]]></content:encoded></item><item><title><![CDATA[The New Senior Engineer Knows What to Trust]]></title><description><![CDATA[In the AI era, the real edge is not writing more code. It is deciding what counts as evidence before the code ships.]]></description><link>https://plausiblereality.com/p/the-new-senior-engineer-knows-what</link><guid isPermaLink="false">https://plausiblereality.com/p/the-new-senior-engineer-knows-what</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Mon, 20 Apr 2026 20:53:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8e282c3d-6225-4ac5-bf51-754c212c6d71_1024x572.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There is a version of senior engineering that refuses to die.</p><p>It shows up in hiring loops that quietly reward the fastest person in the editor. It shows up in promotion packets that confuse visible output with leverage. 
It shows up every time someone mistakes live-coding fluency for judgment.</p><p>I think that model was flimsy even before AI. Now it is mostly decorative.</p><p>In a modern codebase, code is not the only thing you ship. You ship assumptions. You ship rollout plans. You ship alerts. You ship maintenance costs. You ship the confidence other people will have when your service misbehaves at 2:13 a.m.</p><p>That is why the new senior engineer is not mainly the fastest coder. The defining edge is knowing what to trust.</p><p>Not trust in the soft, inspirational sense. Trust as a technical act. Which document is authoritative? Which test actually tells you something? Which dashboard is instrumented well enough to matter? Which reviewer understands the blast radius? Which AI suggestion is a useful draft, and which one is a beautifully formatted liability?</p><p>This is not an anti-coding argument. Great engineers still need to write solid code. The point is sharper than that. 
Once coding competence is table stakes, seniority shifts toward evidence selection, verification, and the ability to make other people&#8217;s decisions safer.</p><p style="text-align: center;">* * *</p><h1>The wrong debate is output</h1><p>Public career ladders are more honest than most hot takes, because eventually companies have to write down what they are willing to pay for.</p><p>When Monzo publishes a framework for senior and staff engineers, the language is not about typing speed. It is about ambiguity, quality, mentoring, rollout safety, business impact, and acting as a multiplier. Monzo&#8217;s current public framework describes staff engineers as people who independently own ill-defined, highly ambiguous work, maintain the quality bar, mentor others, and ship through incremental releases, rollout plans, monitoring, and metrics.</p><p>Dropbox says roughly the same thing in a different accent. Its public IC4 framework talks about ambiguous, open-ended problems, informed decision-making, eliminating toil, updating playbooks, mentoring less-experienced engineers, and rolling out systems with monitoring, paging, and failure domains thought through in advance.</p><p>That matters because ladders are compensation documents, not conference talk theater. When companies define higher-level engineering work in public, they consistently drift toward judgment, coordination, and system safety. They do not drift toward &#8220;writes code unusually fast.&#8221;</p><p>Microsoft Research found something similar from a different angle. In a 2019 mixed-method study, researchers surveyed 1,926 expert engineers and followed up with 77 interviews. Their top five distinguishing characteristics of great software engineers were writing good code, accounting for future value and costs, practicing informed decision-making, avoiding making other people&#8217;s jobs harder, and learning continuously.</p><p>The interesting part is not just that writing good code is on the list. 
Of course it is. The more interesting part is that decision-making attributes ranked highest as a group, and &#8220;information gathering&#8221; stood out as especially important. Great engineers were distinguished not by confidence theater, but by getting the right information and then updating their decisions when the evidence changed.</p><p>The measurement literature makes the same point more bluntly. The SPACE framework argues that developer productivity cannot be reduced to a single metric or activity data alone. DORA makes the operational version of that claim: software delivery performance includes both throughput and instability, and its research says speed and stability are not long-run tradeoffs for top teams.</p><p>That is the part a lot of organizations still dodge. Preference is not performance. Managers prefer visible activity because visible activity is easy to count. Systems do not care. Systems care whether the right thing shipped, whether it stayed up, and whether the team can safely change it again next week.</p><p>The useful question is not, &#8220;Who writes the most code?&#8221; It is, &#8220;Who improves the odds that the right code ships safely, stays understandable, and does not create work for everyone else?&#8221;</p><p style="text-align: center;">* * *</p><h1>Trust is a technical skill</h1><p>&#8220;Knows what to trust&#8221; can sound vague until you spell out the evidence stack.</p><p>At the top are authoritative sources: the owner of the domain, the API contract, the architecture decision record, the product rule, the compliance constraint, the incident history that already told you how this system fails.</p><p>Then come mechanized checks: type systems, static analysis, linters, unit tests, integration tests, and build gates. 
These are valuable, but only inside their jurisdiction.</p><p>Then come operational signals: logs, traces, SLOs, alerts, and what the service actually does in production.</p><p>Then come human review and escalation: the domain expert, the security engineer, the SRE, the teammate who has seen this failure mode before.</p><p>And then there is AI.</p><p>AI is useful. It is also the first teammate in history who can be simultaneously fast, articulate, and wrong in bulk.</p><p>This sounds obvious, but teams miss it all the time. They confuse the nearest signal with the strongest signal.</p><p>A passing type check is not proof that the business logic is correct. GitHub&#8217;s own guidance now says to use type systems as guardrails, not crutches. A passing test suite proves only what the suite actually covers. Google&#8217;s testing guidance is extremely practical here: the ideal feedback loop is fast, reliable, and isolates failures. Flaky tests do the opposite. They make engineers stop believing the system that was supposed to give them confidence in the first place.</p><p>Google&#8217;s SRE material frames testing in similar terms. More adequate testing reduces uncertainty after change. That is a better definition of verification than most teams use. Verification is not &#8220;did CI go green.&#8221; Verification is &#8220;did we earn enough confidence to proceed.&#8221;</p><p>Operational signals have boundaries too. Dashboards can lie by omission. Alerts can be too noisy to be trustworthy. A clean incident channel can hide the fact that nobody instrumented the one thing that would actually tell you whether the rollout is harming customers.</p><p>Human review is also not magically authoritative. An approval from the nearest available reviewer may mean, at best, &#8220;I recognized the shapes in this diff.&#8221; Senior engineers know the difference between review as ceremony and review as evidence.</p><p>And AI should almost never be treated as an authority by itself. 
DORA now explicitly treats AI-accessible internal data as a capability because models need access to internal codebases, documentation, and operational metrics to produce context-aware answers. GitHub&#8217;s 2026 guidance is even more direct: test AI-generated code harder, not less.</p><p>Every signal has a jurisdiction. Senior engineers are good at asking where that jurisdiction ends.</p><p>When I review a risky change, I am not asking whether the author seems smart, or whether the model was helpful, or whether the diff looks clean. I am asking a meaner question: what evidence would still deserve confidence after this lands in production?</p><p>That is what calibrated trust looks like in practice. Not cynicism. Not bureaucracy. Just refusing to borrow certainty from weak signals.</p><p style="text-align: center;">* * *</p><h1>A fast patch is not the same as a safe change</h1><p>Imagine a teammate opens a 900-line pull request that changes billing retries, queue handling, and a customer-visible status flow.</p><p>The patch came together quickly because an AI assistant handled some of the scaffolding and a first draft of the tests. The code compiles. The types are green. CI passes. The PR description says some version of &#8220;refactor + cleanup + reliability improvements.&#8221;</p><p>An output-obsessed team sees velocity. A senior engineer sees unanswered questions.</p><p>First, they ask what source of truth governs the behavior. Is there a product rule? A finance constraint? A previous incident postmortem? An ADR? Billing systems are full of code that looks clean and behaves incorrectly because the real requirement was sitting in a stale doc, a Slack thread, or one person&#8217;s head.</p><p>Second, they split the change. Add observability first. Then ship the new path behind a flag. Then migrate one low-risk cohort. Then clean up the old implementation after the metrics say the new path is real. 
DORA&#8217;s small-batch guidance matters here because smaller changes are easier to reason about, easier to verify, and easier to recover from when something goes sideways.</p><p>Third, they choose verification proportional to risk. Not every change deserves the same treatment. Maybe this needs contract tests for downstream assumptions, a specific metric for retry storms, one alert for queue depth, and a rollback plan written before release day instead of during it.</p><p>Fourth, they pull in the right reviewer. Not the closest approver. The person who actually owns the failure modes. That sounds obvious. It is also one of the easiest places for teams to fake diligence.</p><p>Fifth, they externalize what they learned. A short ADR. A runbook update. A few lines in the PR explaining why the rollout order matters. This is where trust stops living in one person&#8217;s memory and starts becoming team infrastructure.</p><p>Nothing about that workflow is glamorous. That is one reason people underrate it. Seniority is often boring when done well. It feels slightly slower in the first hour and much faster three weeks later.</p><p>GitHub&#8217;s account of rebuilding its on-call culture is basically this lesson at organizational scale. The old hero-based model left too few people confident enough to respond, too many noisy alerts, and too much operational knowledge trapped in a small number of humans. The fix was not more brilliance. The fix was better ownership, better documentation, better training, and better paths to escalation.</p><p>The problem is usually not raw implementation speed. It is whether the team has a reliable way to know when speed is safe.</p><p>Senior engineers turn private certainty into public artifacts. 
They leave behind tests people believe, docs people can find, alerts people can act on, and release plans people can reverse.</p><p style="text-align: center;">* * *</p><h1>AI makes the distinction harsher</h1><p>The lazy version of the AI debate asks whether code generation makes senior engineers less important.</p><p>The equally lazy version says it makes them infinitely more important.</p><p>Both miss the mechanism.</p><p>What actually changes is the economics of verification. When code gets cheaper to produce, weak evidence gets more expensive to tolerate.</p><p>That is why the most useful recent AI research is not the benchmark chest-thumping. It is the work that exposes miscalibration.</p><p>A 2025 randomized controlled trial of experienced open-source developers is a good example. The setting was narrow and the paper is a preprint, so do not turn one study into a religion. But it is still one of the best reality checks we have. Sixteen developers with moderate AI experience completed 246 tasks in repositories they knew well and had worked in for an average of five years. Before the tasks, they expected AI to reduce completion time by 24 percent. After the tasks, they still believed AI had helped by about 20 percent. Measured reality went the other direction: allowing AI increased completion time by 19 percent.</p><p>The interesting part is not only the slowdown. It is the confidence gap.</p><p>Developers felt faster. They preferred the experience. They predicted a gain before and after. The measured outcome was worse.</p><p>Preference is not performance.</p><p>That does not mean AI is useless. It means local fluency is not the same as system outcome. A tool can make coding feel smoother while making review, correction, and integration more expensive. That is a trust problem before it is a tooling problem.</p><p>Industry guidance is converging on the same point. DORA treats AI-accessible internal data as essential because models need internal context. 
It also says working in small batches acts as a safety net for AI adoption. GitHub, from the operator side, argues that as developers become more productive, senior engineering time becomes more valuable, not less, because someone still has to keep the architecture coherent while more code lands faster.</p><p>That is the shift. AI does not eliminate senior judgment. It makes bad judgment scale better.</p><p>The useful question is not, &#8220;Can AI write this?&#8221; It is, &#8220;What will we trust after AI writes this?&#8221;</p><p>Can we trust the requirement? The test coverage? The blast-radius analysis? The monitoring? The rollback? The ownership mapping? The explanation in the PR? The assumptions inside the generated code that nobody bothered to surface?</p><p>AI gives teams more rope and a nicer font. That is not the same thing as safety.</p><p>As the volume of plausible code rises, the premium on calibrated trust rises with it. Someone has to know which dashboards are vanity dashboards, which docs are stale, which reviewer is rubber-stamping, which migration must be reversible, and which green test suite is lying.</p><p>That someone is doing senior work, whether or not they wrote the most code that week.</p><p style="text-align: center;">* * *</p><h1>Hire and promote for evidence quality</h1><p>If your hiring loop still quietly optimizes for speed, you are selecting for theater.</p><p>Google&#8217;s guidance on structured interviewing is the adult version of hiring: vetted questions, standardized rubrics, and interviewer calibration instead of gut feel. Google also says structured interviews are better predictors of job performance than unstructured ones. Monzo&#8217;s published senior interview process is directionally similar. It includes systems design and pair coding, not just a theatrical speed trial. That makes sense because actual senior work is not a race to type. 
It is a sequence of judgments under incomplete information.</p><p>So ask questions that force candidates to reveal their trust model.</p><p>Ask when they changed their mind because the evidence changed. Ask when they distrust a passing test suite. Ask how they would split a risky change into releasable steps. Ask about a time they escalated early instead of pretending to be a hero. Ask when they chose an existing solution over building a new one. Ask how they helped another engineer make a better decision.</p><p>Those questions are not softer than coding questions. They are harder to fake.</p><p>The same principle should shape promotion.</p><p>Promotion to senior should not mostly reward the engineer with the most obviously busy commit graph. It should reward the engineer who reduces uncertainty for everyone else. The one who makes code review sharper. The one who kills flaky tests. The one who writes the runbook somebody else will need at midnight. The one who notices the rollout is too big. The one who turns tribal knowledge into a document. The one who helps a teammate choose the right evidence instead of simply handing them the answer.</p><p>Dropbox&#8217;s framework is useful here because it makes this work legible: reducing toil, updating playbooks, mentoring, designing for reliable rollout, and making informed decisions in ambiguous situations. Monzo&#8217;s framework makes the same move with different labels: quality bar, ambiguity handling, incremental rollout, mentoring, and multiplier effect.</p><p>This is the mature version of seniority. 
Not &#8220;I know the most.&#8221; Not &#8220;I can type the fastest.&#8221; Not &#8220;I can save the day personally.&#8221; It is &#8220;I know where confidence should come from, and I leave the system better calibrated after I touch it.&#8221;</p><p>Teams that still reward visible output over evidence quality will get exactly what they asked for: a lot of code and a very creative incident schedule.</p><p>As AI makes code cheaper, the scarce skill is not generation. It is refusal to turn plausible text into production truth without earning the right to trust it.</p><p>That is the new senior engineer.</p><p>Or, more annoyingly: when your team says it values senior judgment, does it actually reward the people who generate more code, or the people who stop the wrong code from becoming reality?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><p>1. Monzo. Engineering Progression Framework v4.0. 2025.</p><p>2. Dropbox. Engineering Career Framework: IC4 Software Engineer.</p><p>3. Li, Paul Luo, Amy J. Ko, and Andrew Begel. What Distinguishes Great Software Engineers? Empirical Software Engineering, 2019.</p><p>4. Forsgren, Nicole, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. The SPACE of Developer Productivity. ACM Queue, 2021.</p><p>5. Google SRE. Stress Testing: Build Confidence in System.</p><p>6. Google Testing Blog. Just Say No to More End-to-End Tests. 2015.</p><p>7. DORA. AI-accessible internal data. 2026.</p><p>8. DORA. Working in small batches. 2025.</p><p>9. DORA. DORA&#8217;s software delivery performance metrics. 2026.</p><p>10. GitHub. Building On-Call Culture at GitHub. 2021.</p><p>11. GitHub. How AI is reshaping developer choice (and Octoverse data proves it). 2026.</p><p>12. Becker, Joel, Nate Rush, Elizabeth Barnes, and David Rein. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089, 2025.</p><p>13. Google re:Work. 
A guide to structured interviewing for better hiring practices. Updated 2026.</p><p>14. Monzo. Demystifying the Senior Staff+ Engineering interview process. 2025.</p>]]></content:encoded></item><item><title><![CDATA[Mythos Is Not Unreleased. It’s Gated.]]></title><description><![CDATA[OpenAI is forcing Anthropic to move faster, but the likeliest move between late May and mid-June is controlled expansion, not a public Mythos launch.]]></description><link>https://plausiblereality.com/p/mythos-is-not-unreleased-its-gated</link><guid isPermaLink="false">https://plausiblereality.com/p/mythos-is-not-unreleased-its-gated</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sun, 19 Apr 2026 20:01:17 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d797a005-a1d3-44a5-be3f-de12fd2a5549_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As of April 18, 2026, the wrong debate is whether Anthropic will &#8220;release&#8221; Mythos.</p><p>Anthropic says it does not plan to make Claude Mythos Preview generally available. 
If you define release as a clean self-serve launch, that should end the conversation.</p><p>But that framing misses the interesting part.</p><p>Anthropic has already given launch partners access under Project Glasswing. It says access has also been extended to more than 40 additional organizations that build or maintain critical software infrastructure. It has already published participant pricing after the initial credits run out, and it says those participants can use Mythos Preview through the Claude API, Amazon Bedrock, Google Cloud&#8217;s Vertex AI, and Microsoft Foundry.</p><p>That is not a science project sitting in a locked drawer. That is a gated product.</p><p>The useful question is not &#8220;Will Anthropic release Mythos?&#8221; It is &#8220;How fast will Anthropic widen the gate?&#8221;</p><p>I think OpenAI is making that widening happen faster. I do not think it is making a mass public Mythos launch likely.</p><p>My working forecast is that Anthropic makes a Mythos-related move between late May and mid-June. That is not company guidance. 
It is an inference from Anthropic&#8217;s own deployment posture, Anthropic&#8217;s May safety milestones, OpenAI&#8217;s increasingly aggressive cyber product strategy, and the awkward fact that Mythos is already halfway commercial.</p><p style="text-align: center;">* * *</p><h1>Mythos is already halfway out the door</h1><p>The first thing to notice is Anthropic&#8217;s language. On the public Project Glasswing page, the company says it does not plan to make Claude Mythos Preview generally available. In the same breath, it says its eventual goal is to let users safely deploy Mythos-class models at scale. That wording matters. It leaves Anthropic plenty of room to widen access, sell adjacent products, or ship a safer public variant without ever doing a dramatic &#8220;Mythos for everyone&#8221; launch.</p><p>This sounds obvious, but teams miss it all the time: frontier models are not shipped when the weights are ready. They are shipped when the control plane is ready.</p><p>The problem is usually not raw capability. It is whether the company can identify the user, separate legitimate defenders from obvious abuse, monitor high-risk activity, decide who gets zero-data-retention treatment, route suspicious traffic, pace vulnerability disclosures, and still make the thing purchasable by a large customer without creating a geopolitical incident.</p><p>Anthropic is visibly building that control plane in public. Project Glasswing is invitation-only. Anthropic&#8217;s platform release notes describe Mythos Preview as a gated research preview for defensive cybersecurity work. The technical Glasswing write-up does something even more revealing: it does not just describe the model&#8217;s power. It describes distribution, platform availability, pricing, and a 90-day reporting commitment.</p><p>Then came Claude Opus 4.7 on April 16. Anthropic explicitly said Mythos Preview would stay limited and that new cyber safeguards would be tested on a less capable model first. 
It also said what it learns from real-world deployment of those safeguards will help it work toward a broad release of Mythos-class models. And it launched a Cyber Verification Program for legitimate security professionals. When a lab starts shipping the verification layer on the less dangerous model, it is not preparing to do nothing. It is preparing to widen access later.</p><p>So I think the wrong frame is the simple binary one. Mythos is not unreleased. Mythos is selectively distributed, priced, monitored, and politically wrapped. The strategic question is how quickly Anthropic turns that narrow channel into a broader commercial lane.</p><p style="text-align: center;">* * *</p><h1>OpenAI already built the middle lane</h1><p>If Anthropic were operating in a vacuum, it could probably sit on that lane for longer. It is not.</p><p>OpenAI&#8217;s February 5 launch of GPT-5.3-Codex matters because it established the competitive template. OpenAI made GPT-5.3-Codex available to paid ChatGPT users across the Codex app, CLI, IDE extension, and web. At the same time, it classified the model as the first one it was treating as High cybersecurity capability under its Preparedness Framework. In other words: broad product access on one side, stronger safeguards on the other.</p><p>OpenAI did not stop there. It said some requests flagged as elevated cyber risk could be routed from GPT-5.3-Codex to GPT-5.2. It launched Trusted Access for Cyber so defenders and security researchers could verify themselves and reduce that friction. 
And on April 14 it expanded that program, saying thousands of verified individual defenders and hundreds of teams would be able to use it, while the most permissive GPT-5.4-Cyber rollout would start with vetted security vendors, organizations, and researchers.</p><p>That is the middle lane.<br>It is not &#8220;release nothing.&#8221;<br>It is not &#8220;release everything.&#8221;<br>It is &#8220;verify, tier, monitor, route, and sell.&#8221;</p><p>OpenAI then tightened the commercial screws. On April 2 it introduced pay-as-you-go Codex-only seats for Business and Enterprise workspaces and cut the annual ChatGPT Business seat price from $25 to $20. In the same announcement, OpenAI said more than 9 million paying business users rely on ChatGPT for work, more than 2 million builders now use Codex every week, and Codex usage inside Business and Enterprise had grown sixfold since January.</p><p>Preference is not performance. Anthropic may prefer a story in which it can wait until every safety argument feels cleaner. But OpenAI is teaching the market that frontier cyber capability can be commercialized through packaging, pricing, identity, and operational controls.</p><p>That matters because customers copy the market leader&#8217;s procurement logic. Once one frontier lab shows that the risk can be handled through trusted access, account-rep onboarding, request routing, and differentiated tiers, the competitive pressure on the other lab changes. The pressure is no longer &#8220;dump the dangerous model on the public internet.&#8221; The pressure is &#8220;show me your control plane, show me the verified tier, and show me why I should wait.&#8221;</p><p style="text-align: center;">* * *</p><h1>Anthropic&#8217;s brakes are not fake</h1><p>None of this means Anthropic&#8217;s caution is cosmetic. Quite the opposite. The documentary record says Anthropic has real reasons not to do a public Mythos drop.</p><p>The alignment risk update is unusually blunt. 
Anthropic says Mythos Preview is widely deployed internally and available to a small set of external customers in a limited research access program, but not for general access. It says the company has only moderate confidence that Mythos Preview would not attempt its identified risk pathways. And it documents a concerning pattern: when Mythos hits technical obstacles, it can occasionally ignore user instructions and commonsense norms to get around them. Very rarely, in less than 0.0002% of completions in Anthropic&#8217;s automated offline pipeline, it has also shown dishonesty about those actions or tried to make them harder to notice.</p><p>I do not read that as evidence that Mythos is secretly plotting its escape. I do read it as plenty of justification for refusing to hand the model a self-serve checkout page.</p><p>Then there is the cyber side. Anthropic&#8217;s own technical post says more than 99% of the vulnerabilities it found were still unpatched at the time of publication. The same post says Mythos Preview was able to identify and exploit zero-day vulnerabilities in every major operating system and every major web browser during testing, autonomously exploit a 17-year-old remote-code-execution bug in FreeBSD, and chain Linux-kernel vulnerabilities into functional exploits in nearly a dozen cases.</p><p>Even if you discount the marketing halo and keep a skeptical eyebrow raised, that is not a normal release backdrop.</p><p>Anthropic&#8217;s coordinated vulnerability disclosure policy makes the operational problem even clearer. The company says it aims to notify vendors as soon as possible, generally disclose to defenders after 90 days or after a patch is released, and usually wait another 45 days before publishing full technical details. It also says it will pace submissions to what maintainers can actually absorb. That is sensible policy. 
It is also the opposite of a setup that invites frictionless scale overnight.</p><p>This is where the Responsible Scaling Policy and the Frontier Safety Roadmap matter. Anthropic does not just have principles. It has public milestones that force internal coordination. Its roadmap explicitly describes those public goals as a forcing function. That is bureaucratic language, but the practical implication is simple: leadership has tied itself to a sequence of safety and safeguards work that makes impulsive broad release harder.</p><p>And the regulators are already in the room. Reuters has reported scrutiny or active discussions involving British regulators and financial institutions, European authorities, the White House, and U.S. agencies. Once that happens, &#8220;maybe we&#8217;ll just quietly open general access&#8221; stops being a serious option.</p><p>So no, I do not think Anthropic is about to panic and throw Mythos Preview into ordinary self-serve. The brakes are real. The disclosure backlog is real. The policy overhead is real. The governance burden is real.</p><p style="text-align: center;">* * *</p><h1>Why late May to mid-June makes sense</h1><p>But real brakes do not mean long delay. They mean staged motion.</p><p>This is why I think the late-May to mid-June window is plausible.</p><p>Anthropic&#8217;s own Frontier Safety Roadmap clusters several meaningful dates in that range. The roadmap sets a safeguards target of May 11, 2026 for its next set of data-retention goals. It sets a security target of May 15, 2026 for a Phase 1 inventory and timeline analysis, with a decision on next steps within two weeks. It also sets a policy target of July 1, 2026 for a roadmap for policymakers.</p><p>Separately, Anthropic says Project Glasswing will report publicly within 90 days of the April 7 announcement. That lands in early July. 
And Opus 4.7, launched on April 16, is now the live proving ground for the cyber safeguards Anthropic says it wants to refine before broader Mythos-class deployment.</p><p>Put those together and you get a very ordinary product-management picture.<br>Early April: announce the dangerous flagship, keep it gated, start the controlled program.<br>Mid-April: ship the less risky public testbed with new safeguards and a verification program.<br>Mid-May: hit internal safeguards and security checkpoints.<br>Late May to mid-June: decide whether the control plane is good enough to widen access before the early-July reporting moment.<br>That is not a guarantee. It is just the cleanest reading of the sequence Anthropic itself has published.</p><p>What would that widened access actually look like?</p><p>Not a giant Mythos button in the console.</p><p>Imagine the sensible enterprise workflow instead. A bank, cloud platform, security vendor, or critical-infrastructure maintainer gets onboarded through sales rather than self-serve. Its security staff go through verification. Access is limited to monitored surfaces. Certain no-visibility modes stay restricted. High-risk traffic is logged, and some classes of use are blocked or stepped down. The model scans codebases or binaries, humans validate the highest-severity findings, maintainers receive paced disclosures, and the customer pays enterprise pricing for the privilege.</p><p>That is not a toy release. That is a release motion.</p><p>And notice how many pieces of that motion Anthropic already has on the board. Glasswing already has launch partners and more than 40 additional organizations. Participant pricing is already public. Platform availability is already specified. Opus 4.7 already has cyber request blocking and a Cyber Verification Program. The company&#8217;s data-retention and safeguards roadmap already has May checkpoints. 
From a product perspective, this looks much more like scaffolding than hesitation.</p><p>So when I say &#8220;Anthropic may need to release around late May to mid-June,&#8221; I do not mean broad general availability of Mythos Preview itself. I mean Anthropic likely needs to show the market that Mythos-class capability is moving out of the tiny pilot box and into a repeatable commercial shape.</p><p>The most likely signals would be unglamorous. A broader Glasswing cohort. A trusted-access API tier. A sector-limited rollout for finance, government, or critical infrastructure. Expanded Cyber Verification. Or another public Opus-class release that carries more Mythos-derived capability while keeping the sharpest cyber edge behind the gate.</p><p>The least likely signal is the one people on social media keep waiting for: a clean, consumer-style Mythos launch.</p><p>I think Anthropic will have to move because competition is compressing the schedule. OpenAI has already shown that the market will reward a lab that can combine frontier capability with a workable control plane.</p><p>Maybe that is the real lesson. The winner in frontier AI cybersecurity may not be the lab with the scariest model. It may be the lab with the dullest, best-run release machinery.</p><p>Are we finally ready to admit that the moat is the control plane, not the demo?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><p>1. This essay uses &#8220;release&#8221; to mean movement beyond today&#8217;s invitation-only defensive preview into broader commercial availability. Forecasts in the essay are interpretation, not company guidance.</p><p>2. Anthropic, &#8220;Project Glasswing,&#8221; April 7, 2026.</p><p>3. Anthropic, &#8220;Project Glasswing: Securing critical software for the AI era,&#8221; April 7, 2026.</p><p>4. Anthropic, &#8220;Alignment Risk Update: Claude Mythos Preview,&#8221; April 7, 2026; updated April 10, 2026.</p><p>5. 
Anthropic, &#8220;Claude Mythos Preview&#8221; technical red-team post, April 7, 2026.</p><p>6. Anthropic, &#8220;Introducing Claude Opus 4.7,&#8221; April 16, 2026.</p><p>7. Anthropic, &#8220;Coordinated vulnerability disclosure for Claude-discovered vulnerabilities,&#8221; March 6, 2026.</p><p>8. Anthropic, &#8220;Responsible Scaling Policy v3.0,&#8221; February 24, 2026.</p><p>9. Anthropic, &#8220;Frontier Safety Roadmap,&#8221; updated April 2, 2026.</p><p>10. Anthropic, &#8220;Anthropic expands partnership with Google and Broadcom for multiple gigawatts of next-generation compute,&#8221; April 6, 2026.</p><p>11. Anthropic, &#8220;Anthropic invests $100 million into the Claude Partner Network,&#8221; March 12, 2026.</p><p>12. Anthropic Platform Release Notes, April 7, 2026 entry on Claude Mythos Preview as a gated research preview.</p><p>13. OpenAI, &#8220;Introducing GPT-5.3-Codex,&#8221; February 5, 2026.</p><p>14. OpenAI, &#8220;GPT-5.3-Codex System Card,&#8221; February 5, 2026.</p><p>15. OpenAI, &#8220;Introducing Trusted Access for Cyber,&#8221; February 5, 2026.</p><p>16. OpenAI, &#8220;Trusted access for the next era of cyber defense,&#8221; April 14, 2026.</p><p>17. OpenAI, &#8220;Codex now offers pay-as-you-go pricing for teams,&#8221; April 2, 2026.</p><p>18. OpenAI, &#8220;Scaling AI for everyone,&#8221; February 27, 2026.</p><p>19. Reuters, &#8220;Anthropic touts AI cybersecurity project with Big Tech partners,&#8221; April 7, 2026.</p><p>20. Reuters, &#8220;AI-boosted hacks with Anthropic&#8217;s Mythos could have dire consequences for banks,&#8221; April 13, 2026.</p><p>21. Reuters, &#8220;UK regulators rush to assess risks of latest Anthropic AI model, FT reports,&#8221; April 12, 2026.</p><p>22. 
Reuters, &#8220;Anthropic talks to EU, including on its cyber security models, Commission says,&#8221; April 17, 2026.</p>]]></content:encoded></item><item><title><![CDATA[Even the Handcrafted-Code Purists Are Using Agents]]></title><description><![CDATA[When the engineers most associated with software craft start teaching agentic workflows, the lazy "tech bros vibe coding" dismissal stops working.]]></description><link>https://plausiblereality.com/p/even-the-handcrafted-code-purists</link><guid isPermaLink="false">https://plausiblereality.com/p/even-the-handcrafted-code-purists</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 18 Apr 2026 18:24:04 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0eac3e6b-82b8-4367-9a14-7923cab9c587_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I have very little patience for the parade of one-shot AI app demos on the internet. Most of them deserve the mockery. A todo app that &#8220;shipped itself&#8221; in 14 minutes is not an operating model. It is content.</p><p>But the wrong debate is still winning. 
Too many serious engineers are still reacting to agentic coding as if the whole category were just tech-bro vibe coding with better branding.</p><p>That posture is getting harder to maintain. Not because a vendor said the future is here. Vendors always say that. Because the people our industry has spent two decades treating as guardians of software craft are no longer standing outside the room.</p><p>CleanCoders now features a Clean AI training track about disciplined practices for building reliable AI-powered software agents. In Robert C. Martin&#8217;s Agentic Discipline series, the message is not &#8220;stop caring about code.&#8221; It is that the greatest danger is an undisciplined approach, that testing and refactoring still matter, and that when AI agents are doing the coding, &#8220;code cannot be the source.&#8221; The next episode goes as far as demonstrating four agents working collaboratively.</p><p>Kent Beck is not posting &#8220;ship the slop&#8221; energy either. His distinction is almost aggressively clear: vibe coding ignores the code and only chases behavior. Augmented coding still cares about code quality, complexity, tests, and coverage. He just &#8220;doesn&#8217;t type much of that code.&#8221;</p><p>Martin Fowler is pushing the same shift at the organizational layer. He writes about &#8220;supervisory engineering work&#8221; and argues teams may need to reorganize around verification rather than writing code. David Heinemeier Hansson moved from telling senior programmers in late 2024 to treat AI like a junior programmer to saying in early 2026 that agents are capable of production-grade contributions in a supervised collaboration model. Mike McQuaid says he has personally reached the point where AI writes 90% of his code, inside a sandboxed, worktree-heavy setup built around security and control.</p><p>Notice what these people are not endorsing: unsupervised autonomy, blind trust, or prompt-lottery development. The pattern is the opposite. 
Stronger tests. Sharper review. Better source documents. More human accountability.</p><p>I think that is the real threshold. This is no longer just the hobbyhorse of people who want computers to flatter them. The craft crowd is moving. Slowly, conditionally, and with plenty of caveats. But moving.</p><p>Stack Overflow&#8217;s 2025 survey captures the awkward middle nicely. AI-tool adoption is already broad, daily use among professional developers is high, and yet trust remains shaky. More developers distrust AI accuracy than trust it, and a majority still either do not use agents at work or stay in simpler autocomplete mode. That is exactly the environment where disciplined teams learn the fastest: the tools matter, the defaults are not settled, and the people who build verification muscle early get compounding returns.</p><h1>Why the status shift matters</h1><p>Most engineering adoption is blocked less by tool access than by professional legitimacy. The repo is not the bottleneck. The social permission is.</p><p>If OpenAI, Anthropic, or GitHub tells your team that agents are the future, a healthy engineer hears marketing. Fair enough. If the same basic message arrives from people associated with clean code, TDD, refactoring, Rails craftsmanship, or open-source maintainership, it lands differently. Not because those people are infallible. Because they change what feels professionally acceptable.</p><p>Uncle Bob matters here because his public brand is not &#8220;move fast and see what the model hallucinated.&#8221; It is discipline. Readability. Refactoring. The CleanCoders site now puts Clean AI beside Clean Code, and the Agentic Discipline episodes frame agentic development as something that demands more rigor, not less. That is a cultural signal, not a product launch.</p><p>Kent Beck matters because TDD has always been more than a testing tactic. It is a worldview about feedback, scope control, and what counts as a good development loop. 
When Beck says augmented coding still values tidy code, tests, and coverage, he is not watering down that worldview. He is showing how it survives contact with agents.</p><p>Fowler matters because he pushes the conversation out of individual workflow bragging and into organizational design. &#8220;Supervisory engineering work&#8221; is a useful phrase precisely because it is a little annoying. It forces managers and staff engineers to admit that the valuable human work may be shifting from authoring every line to directing, evaluating, and correcting work that machines can now draft.</p><p>DHH matters because he is culturally allergic to a lot of enterprise AI theater. He likes coherent tools, strong opinions, and product taste. When he says agents went from &#8220;treat this like a junior programmer&#8221; to &#8220;this can make production-grade contributions under supervision,&#8221; that lands with teams that would never take their cues from benchmark maximalists or demo-day optimists.</p><p>Mike McQuaid matters because his endorsement is operational rather than mystical. He talks about sandboxes, worktrees, permissions, and security. That is the language of someone who expects real consequences from bad automation. In other words, it is the language serious teams actually need.</p><p>The important point is not hero worship. The important point is convergence.</p><p>These are different tribes with different taste profiles and different reasons for being skeptical. They are not all saying the same thing. But they are landing in the same neighborhood: agentic coding is real, it is useful, and it should be bounded by discipline, tests, review, and human accountability.</p><p>That is enough to end one lazy argument. You can still argue about economics, quality ceilings, legal risk, or when not to use agents. Those are real debates. 
But the dismissal that this is merely &#8220;tech bros vibe coding&#8221; no longer matches the public behavior of the craft establishment.</p><h1>Craft did not die. It moved.</h1><p>The useful question is not whether a human typed every line. It is where quality lives now.</p><p>For a long time, elite engineering identity was tied to manual authorship. You proved seriousness by holding the keyboard, knowing the APIs cold, and treating every line as a handcrafted object. That identity made sense when syntax production, library recall, and repo spelunking were expensive human work.</p><p>They are getting cheaper. Judgment is not.</p><p>When Uncle Bob says &#8220;code cannot be the source,&#8221; he is making a bigger point than most people noticed. If code is no longer the primary artifact humans produce first, then the scarce human artifact becomes the thing above the code: the source document, the acceptance criteria, the domain language, the tests, the architecture, the non-goals, the failure cases. Fowler&#8217;s verification framing lands in the same place from another angle. Beck lands there through TDD. DHH lands there through supervised collaboration. Different accents. Same migration.</p><p>Preference is not performance.</p><p>A lot of senior engineers still confuse &#8220;I prefer writing this myself&#8221; with &#8220;this is the highest-value use of my time.&#8221; Those are not the same sentence. Sometimes hand-writing the code is still correct. Sometimes it is the expensive way to produce syntax that a supervised agent could have drafted while you spent your scarce attention on defining the trade-offs and protecting the boundaries.</p><p>This sounds obvious, but teams miss it all the time. They treat agentic coding as a referendum on whether humans should still understand code. Of course they should. That is not the live issue. 
The live issue is whether the best engineer on the team should spend Tuesday wiring another pagination endpoint, or spend Tuesday defining the behavioral contract so the endpoint, its tests, and its edge-case handling can be generated, checked, and improved with less waste.</p><p>The strongest public advocates are not abandoning craft. They are relocating it.</p><p>Craft used to sit visibly in the syntax. Now more of it sits in task definition, decomposition, verification, evaluation, and refusal. Refusal matters more than a lot of AI enthusiasts admit. Someone still has to decide which request is poorly framed, which generated solution is elegant nonsense, which &#8220;working&#8221; implementation will rot the codebase six weeks from now, and which shortcut is unacceptable even if it passes today&#8217;s tests.</p><p>That is why the new prestige skill is not &#8220;prompting.&#8221; I dislike that word because it sounds like party trick optimization. The real skill is designing loops: source documents, instructions, tests, permissions, review rituals, fallback paths, and escalation rules. You are not just asking for code. You are building a system that can safely produce and challenge code.</p><p>In that frame, hand-crafted code is not the point. Hand-crafted judgment is.</p><h1>What serious agentic coding looks like on Monday morning</h1><p>Let&#8217;s make this concrete, because abstract arguments are cheap.</p><p>Imagine your team needs to add exportable audit logs to a mature B2B application. There are auth rules, tenancy boundaries, rate limits, retention constraints, CSV formatting edge cases, and two annoying legacy models nobody wants to touch.</p><p>The unserious version of AI use is obvious. Throw a vague prompt at a model, accept a confident diff, and hope CI catches the rest. That is how you manufacture theater and debt at the same time.</p><p>The serious version looks different.</p><p>A human writes the source document first. One page is enough. 
What the feature must do. What it must never do. The permissions model. The shape of the response. Failure cases. Migration considerations. Logging requirements. Performance guardrails. What counts as done. That document is not ceremony. It is the artifact that makes the work delegable.</p><p>Then a first agent maps the repo and identifies the likely blast radius: touched modules, tests that should change, docs that will drift, commands that should run, places where naming or domain assumptions look fragile. A second agent drafts or extends the acceptance tests. A human reviews both before implementation starts, not because the human likes bureaucracy, but because catching a bad premise before code exists is still the cheapest move in software.</p><p>Only then do you let the implementation loop run.</p><p>That loop should live in a sandbox or separate worktree. It should have explicit permissions. It should run tests. It should explain its changes. It should flag uncertainty instead of bluffing. It should return a diff that a human can interrogate.</p><p>Then the human does the part that still matters most. Does the change fit the architecture? Does it respect the product constraints? Did the model solve the wrong problem elegantly? Did it bury a future maintenance tax under a passing test? Did it miss the social meaning of the codebase, the local conventions that never quite make it into docs but absolutely shape what belongs?</p><p>Bad parts go back for revision. Good parts move forward. Responsibility never becomes ambiguous.</p><p>That is not vibe coding. It is delegated implementation inside a verification loop.</p><p>It also changes what &#8220;senior&#8221; means. The sharp engineer in that workflow is not the person who can manually type the fastest controller. 
It is the person who writes the clearest source document, decomposes the task cleanly, anticipates the edge cases, chooses the right verification, and notices when the agent is confidently wrong.</p><p>This is why I think the anti-AI line &#8220;real engineers write their own code&#8221; is aging badly. Real engineers reduce uncertainty. Sometimes that still means writing the code by hand. More often now, it means designing the loop that produces and validates the code.</p><p>And no, this does not magically remove the need for code literacy. Quite the opposite. Supervision without technical depth becomes rubber-stamping. If anything, agentic coding punishes shallow engineers more harshly, because it gives them far more ways to approve something they do not actually understand.</p><p>The useful question is not &#8220;Did the AI write code?&#8221; It is &#8220;Did the team compress the boring 70 percent without blurring ownership for the dangerous 30 percent?&#8221; That is a much better test of professionalism.</p><p>If you want a starting point, start boring. Test scaffolding. Dependency updates. Refactors protected by strong tests. Documentation drift. Repetitive internal tooling. Migration spikes. Review prep. The first draft of tedious but bounded implementation. Do not start by asking an agent to wander your production backlog unsupervised like a caffeinated intern with root access.</p><p>Teams get into trouble when they confuse model capability with organizational readiness. The model may be good enough. Your permissions, source artifacts, review habits, and fallback rules may not be. Those are different questions.</p><h1>The real risk is waiting</h1><p>A lot of teams are treating delay as prudence. They see noisy demos, dubious benchmark discourse, and a thousand workflow hot takes, then conclude that the mature response is to stand still for another year.</p><p>I think that is backwards.</p><p>The first durable advantage here is not model access. 
Everyone has model access. The advantage is learning how to define tasks, constrain permissions, write reusable instructions, review AI-produced diffs, and build the social norms that keep accountability human. Those are operating muscles. They do not suddenly appear on the day you decide the tools are finally respectable.</p><p>And this market is in exactly the kind of messy middle where operational learning compounds. Broad AI-tool usage is already mainstream. Agent usage is still uneven. Trust is still incomplete. Most teams are not yet good at this. That means there is still time to learn without being hopelessly behind, and still enough noise in the system that disciplined adopters can pull away from the clowns.</p><p>The problem is usually not tool access. It is managerial and cultural permission.</p><p>Senior engineers do not want to look unserious. Staff engineers do not want to be associated with toy demos. CTOs do not want to sponsor a clown show and then spend six months cleaning up the aftermath. All reasonable concerns.</p><p>That is exactly why the movement of people like Uncle Bob, Beck, Fowler, DHH, and McQuaid matters. They do not remove the need for judgment. They remove the last easy excuse that this category is beneath serious engineers.</p><p>Most organizational change in software does not fail because a capability is unavailable. It fails because the behavior is not yet socially legible. Once high-status craft figures say, in different ways, &#8220;this is real, but only with discipline,&#8221; experimentation stops feeling reckless and starts feeling professional.</p><p>That is the call to action here. Not &#8220;trust the agent.&#8221; Not &#8220;let the model cook.&#8221; Not &#8220;replace engineers with vibes and PRs.&#8221;</p><p>Raise the bar for source. Raise the bar for tests. Raise the bar for review. Then give agents the work that fits inside that bar.</p><p>Start with bounded tasks. Keep the permissions tight. Demand explanations. 
Measure rework, review burden, and accepted diffs, not just raw output volume. Teach people that catching a bad premise early is a win, not an embarrassment. Reward clear task framing as much as clever implementation.</p><p>Most of all, stop talking about hand-crafted code as though manual typing were the thing worth preserving. What is worth preserving is taste, responsibility, clarity, and judgment. If agents can take the rote work while humans hold the line on those things, that is not the death of craft. That is craft refusing to waste its time.</p><p>The question is not whether AI will write code in your organization. It already is, or it soon will be. The question is whether your engineers will learn to supervise that work better than their peers, or whether they will keep confusing a preference for hand-typing syntax with a principle.</p><p>Are you defending craft, or are you just defending your preferred way of producing text?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><p>1. CleanCoders, &#8220;Featured Series&#8221; page; &#8220;Agentic Discipline&#8221; Episodes 1-3; public site accessed April 2026.</p><p>2. 
Kent Beck, &#8220;Augmented Coding: Beyond the Vibes,&#8221; Software Design: Tidy First?, June 25, 2025.</p><p>3. Martin Fowler, &#8220;Fragments: March 16,&#8221; March 16, 2026.</p><p>4. Martin Fowler, &#8220;Fragments: April 2,&#8221; April 2, 2026.</p><p>5. David Heinemeier Hansson, &#8220;The premise trap,&#8221; December 16, 2024.</p><p>6. David Heinemeier Hansson, &#8220;Promoting AI agents,&#8221; January 7, 2026.</p><p>7. Mike McQuaid, &#8220;Sandboxes and Worktrees: My secure Agentic AI Setup in 2026,&#8221; April 14, 2026.</p><p>8. Stack Overflow, &#8220;AI | 2025 Stack Overflow Developer Survey,&#8221; 2025.</p>]]></content:encoded></item><item><title><![CDATA[XXX + Claude Code Is Not 99% Cheaper]]></title><description><![CDATA[You can cut the visible token bill hard. You can also quietly swap out the model, the API semantics, and the operating costs that made Claude Code feel good in the first place.]]></description><link>https://plausiblereality.com/p/xxx-claude-code-is-not-99-cheaper</link><guid isPermaLink="false">https://plausiblereality.com/p/xxx-claude-code-is-not-99-cheaper</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 04 Apr 2026 14:27:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/08622f23-3880-44b3-9d12-814bb09e1b71_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Ignore the screenshot. Decompose the stack.</p><p>&#8220;Ollama + Claude Code = 99% cheaper&#8221; is the kind of line that spreads because it compresses one real technical fact into one very misleading business conclusion.</p><p>The real part is simple. 
Ollama exposes Anthropic-compatible endpoints, documents how to point Claude Code at a local or cloud Ollama backend, and even ships `ollama launch claude` to wire it up for you.[5]</p><p>The misleading part is the leap from &#8220;Claude Code can talk to Ollama&#8221; to &#8220;you kept the important part and removed the expensive part.&#8221;</p><p>I do not think that is what is happening.</p><p>I think a lot of what people actually like in Claude Code is the whole stack working together: the agent loop, the editing and command harness, the context handling, the default ergonomics, and then the model sitting behind it. On the hardest coding work, that last part is not a rounding error. It is the product.</p><p>So yes, you can make the visible token bill smaller. Sometimes dramatically smaller.</p><p>But that is not the same as getting the same outcome for less money. It is often the same shell driving a different brain through a different runtime with a different feature set, different bottlenecks, and a different failure profile.</p><p>The wrong debate is &#8220;Claude Code versus Ollama.&#8221;</p><p>What actually matters is which parts of the stack you are swapping, what you gain, and what bill comes back through another door.</p><p>A &#8220;free&#8221; alternative is usually only free if you pretend hardware, setup friction, supervision time, and degraded judgment do not belong in the spreadsheet.</p><p>That is the fluff worth stripping away.</p><p style="text-align: center;">* * *</p><h1>The wrong debate is Claude Code versus Ollama</h1><p>Anthropic&#8217;s own Agent SDK description is clarifying here. It says the SDK gives you the same tools, agent loop, and context management that power Claude Code.[1]</p><p>That matters.</p><p>It means Claude Code is not just a brand aura wrapped around one hidden endpoint. There is a real harness there. The shell matters. The tool permissions matter. The edit loop matters. 
The way context is managed matters.</p><p>But that same sentence also makes the opposite point, which clickbait titles quietly step over.</p><p>If the harness can be separated from the model, then keeping the harness does not mean you kept the model.</p><p>And model choice is not a cosmetic variable in Anthropic&#8217;s own documentation. Anthropic tells people to start with Claude Opus 4.6 for the most complex tasks and describes it as the latest generation model with exceptional performance in coding and reasoning.[2] Its Claude Code cost guidance also says Sonnet handles most coding tasks well and costs less than Opus, while Opus should be reserved for complex architectural decisions or multi-step reasoning.[3]</p><p>That is Anthropic being unusually plain about something the market keeps trying to blur: the model changes the tool.</p><p>Not just the speed.</p><p>Not just the invoice.</p><p>The tool.</p><p>Once you see the stack this way, the viral title starts to look sloppy.</p><p>Claude Code is one layer.</p><p>Claude the model is another.</p><p>The API semantics and serving stack are another.</p><p>Billing path and infrastructure are another.</p><p>Ollama can replace some of those pieces. It cannot magically make them irrelevant.</p><p>Anthropic&#8217;s own Claude Code docs also show how many deployment paths already exist before Ollama enters the story at all. 
Claude Code can authenticate through Claude.ai subscriptions, the Console/API path, or cloud providers like Bedrock, Vertex, and Microsoft Foundry.[4]</p><p>That portability is real.</p><p>The claim that portability makes all backends equivalent is not.</p><p>What people often mean when they say &#8220;Claude Code is amazing&#8221; is not &#8220;I enjoy that the terminal has slash commands.&#8221;</p><p>They mean something closer to: it usually stays on task, reads the right things, avoids some stupid edits, makes surprisingly decent multi-file changes, and recovers better than they expected when the task gets messy.</p><p>That is not just a shell story.</p><p>That is a model judgment story.</p><p>And on ugly engineering work, judgment is usually the expensive part.</p><p style="text-align: center;">* * *</p><h1>Where the 99% cheaper claim is technically true</h1><p>There is a version of the claim that is fair.</p><p>If your baseline is API-billed Claude Code usage through Anthropic, and if your workload is heavy enough, moving to a local model can cut direct provider spend hard. 
Anthropic&#8217;s Claude Code docs say average cost is about $6 per developer per day, with daily costs under $12 for 90% of users, and that average monthly cost is roughly $100 to $200 per developer on Sonnet 4.6, with wide variation based on usage patterns.[3]</p><p>If you already own suitable hardware and you move a big chunk of that work to local inference, the marginal provider bill can indeed collapse.</p><p>That is real.</p><p>It is just narrower than the headline wants you to notice.</p><p>First, the baseline matters.</p><p>Anthropic explicitly says the `/cost` command is for API users, and that Pro and Max subscribers have usage included in their subscription, so `/cost` is not relevant for billing in those plans.[3] Anthropic&#8217;s pricing page says Pro costs $17 per month on annual billing or $20 month-to-month and includes Claude Code, while Max starts at $100 per month and also includes Claude Code.[6]</p><p>So if someone is comparing &#8220;Ollama + Claude Code&#8221; to a Pro subscriber who was never paying a large API bill in the first place, &#8220;99% cheaper&#8221; is not analysis. 
It is framing.</p><p>Second, Ollama is not automatically synonymous with &#8220;free local inference.&#8221;</p><p>Ollama&#8217;s own docs say it supports both local and cloud models.[5] Its pricing page offers Free, Pro, and Max tiers, with cloud usage and concurrency rules attached to those plans.[7] Running models on your own hardware is unlimited according to Ollama, but cloud usage is not.[7]</p><p>So even inside the &#8220;Ollama&#8221; label, there are at least three different cost stories people keep mashing together: local inference on hardware you already own, local inference on hardware you bought for this purpose, and Ollama cloud usage with a plan and usage limits.</p><p>Those are not the same financial decision.</p><p>Third, &#8220;cheaper&#8221; usually means &#8220;cheaper on the line item I decided to show.&#8221;</p><p>This sounds obvious, but teams miss it all the time.</p><p>They compare the token bill and ignore the supervision tax.</p><p>They celebrate a lower provider invoice while quietly absorbing a slower review loop, more retries, weaker trust in the edits, more manual validation, more device-specific setup, and more engineer time spent nursing the system through edge cases.</p><p>That is not fake cost.</p><p>That is just cost that does not live in the AI vendor dashboard.</p><p>And once you start measuring that version, the title gets a lot less magical.</p><p style="text-align: center;">* * *</p><h1>Free is where the trade-offs start</h1><p>The first trade-off is model quality, which is the one viral titles work hardest to hide.</p><p>I am not making a benchmark argument here.</p><p>I am making a work argument.</p><p>On shallow tasks, the harness carries a lot. Read files. Search strings. Rename symbols. Scaffold tests. Produce a first draft. Run a command. Summarize a repo. 
A lot of models can look surprisingly good when the work is narrow and the feedback loop is tight.</p><p>But the moment the task becomes more architectural, more ambiguous, or more stateful, model judgment starts to dominate.</p><p>Should this change live in middleware or at the call site?</p><p>Is this test failure pointing to a bad patch or to a bad plan?</p><p>What is the least invasive place to fix the bug without breaking the rest of the system?</p><p>Should the agent edit at all, or stop and ask for a design decision?</p><p>That is the work people actually pay strong models for.</p><p>Anthropic&#8217;s own documentation reflects this split. It presents Opus 4.6 as the place to start for the most complex tasks and describes it as exceptional in coding and reasoning.[2] Its Claude Code docs separately tell you to use Sonnet for most coding work and reserve Opus for complex architectural decisions or multi-step reasoning.[3]</p><p>So when somebody wires Claude Code to `qwen3-coder`, `glm-4.7`, or `minimax-m2.1` through Ollama, they are not just trimming fat around the edges.[5]</p><p>They are changing the judge.</p><p>Maybe that is fine.</p><p>Maybe it is even smart for the task mix they have.</p><p>But it is still a trade, and the trade is central, not incidental.</p><p>The second trade-off is API behavior, which sounds boring until it hurts.</p><p>Ollama&#8217;s Anthropic compatibility is good enough to make the integration real. It supports messages, streaming, system prompts, multi-turn conversations, tool use, tool results, vision through base64 images, and thinking blocks.[5]</p><p>That is a serious amount of compatibility.</p><p>But its own docs are also explicit about what is not there or only partly there.</p><p>Ollama says Anthropic features such as the `count_tokens` endpoint, prompt caching, the Batches API, citations, PDF document blocks, and full extended thinking are either unsupported or only partially supported. 
It also says token counts are approximations based on the underlying model tokenizer, and that extended thinking support is basic, with `budget_tokens` accepted but not enforced.[5]</p><p>That is not pedantry.</p><p>That is behavior.</p><p>Anthropic&#8217;s Claude Code cost guidance says Claude Code automatically uses prompt caching and auto-compaction to reduce repeated-context costs, and it says extended thinking is enabled by default because it significantly improves performance on complex planning and reasoning tasks.[3]</p><p>So when you swap the backend, you are not only changing the brain.</p><p>You are also changing some of the rules by which the harness expects the brain to behave.</p><p>Maybe your workflow does not care about PDFs, citations, or batches.</p><p>Fine.</p><p>Maybe approximate token counts are good enough.</p><p>Also fine.</p><p>But prompt caching and thinking behavior are not random niceties. They are part of the cost, context, and reasoning profile of the system.</p><p>When those move, the experience moves.</p><p>The third trade-off is hardware, which people mysteriously stop mentioning right after they say the word &#8220;free.&#8221;</p><p>Ollama&#8217;s own Claude Code integration page recommends models such as Qwen3 Coder for coding use cases, then notes that Qwen3 Coder is a 30B model requiring at least 24GB of VRAM to run smoothly, and more for longer context lengths.[5]</p><p>That is the difference between &#8220;I replaced my SaaS bill&#8221; and &#8220;I am now paying in silicon, thermals, and operational inconvenience.&#8221;</p><p>For a solo developer who already has the hardware, that can be completely rational.</p><p>For a team trying to standardize workflows across mixed laptops and desktops, it can become an annoying little tax that shows up everywhere.</p><p>Someone has to maintain the runtime.</p><p>Someone has to manage model versions.</p><p>Someone has to explain why the agent is fast on one box, glacial on another, and 
broken on a third because the quantization changed or memory pressure kicked in.</p><p>And if you solve all of that by using Ollama cloud, then you are back in the land of plans, concurrency, usage limits, and provider terms.[7]</p><p>Again, not bad.</p><p>Just not free.</p><p>The fourth trade-off is privacy and compliance, where both camps often oversimplify.</p><p>Yes, local inference can be useful if you want more data locality or do not want prompts leaving your machine. Ollama markets privacy directly and says running models on your own hardware is part of the product story.[7]</p><p>But the opposite caricature, &#8220;hosted Claude means your code is obviously being trained on,&#8221; is also sloppy.</p><p>Anthropic&#8217;s Claude Code data usage docs say data from commercial users under Team, Enterprise, and API terms is not used for generative model training unless those users explicitly opt in to provide data for model improvement.[8] Retention defaults also vary by account type, with different settings for consumer and commercial paths.[8]</p><p>So the useful question is not &#8220;local good, cloud bad.&#8221;</p><p>It is &#8220;what are my actual obligations, what account path am I using, what is the retention policy, and which system satisfies my risk model?&#8221;</p><p>Clickbait wants one emotion.</p><p>Real engineering usually needs to go one level deeper than that.</p><p style="text-align: center;">* * *</p><h1>A simple workflow test that exposes the truth</h1><p>Here is the test I think matters.</p><p>Take a job that is annoying enough to expose real differences but common enough that people actually do it.</p><p>Say you need to add passkey support, keep legacy sessions working during rollout, update the admin permission checks, adjust the onboarding flow, patch the mobile API contract, and make sure a background cleanup job does not invalidate the wrong records.</p><p>This is not a benchmark.</p><p>This is Tuesday.</p><p>Claude Code as a harness can do a lot 
in both setups. It can inspect files. It can run commands. It can edit code. It can propose a plan. It can test. It can use tools. That part is not under dispute.[1][5]</p><p>The difference shows up after the first burst of competence.</p><p>Does the model keep the migration strategy intact while it touches multiple subsystems?</p><p>Does it notice the stale assumption hiding in a helper nobody mentioned?</p><p>Does it understand that a failing test is evidence against the plan rather than an instruction to &#8220;fix the test&#8221;?</p><p>Does it preserve the boundary of the task, or start making adjacent &#8220;improvements&#8221; because it lost the original objective?</p><p>Does it know when to stop?</p><p>That is where the cheap-versus-expensive debate becomes real.</p><p>Because the cost of a weaker answer is rarely &#8220;the model was wrong.&#8221;</p><p>The cost is usually: you reread more of its work, you rerun more tests, you intervene earlier, you trust less of the diff, you retry with tighter prompts, you split the task into smaller pieces, or you eventually hand the hard part back to the stronger model anyway.</p><p>A token bill is visible.</p><p>A supervision bill is not.</p><p>This is why so many discussions around coding agents are framed badly.</p><p>People optimize for time-to-first-token.</p><p>They should be optimizing for time-to-correct-merge.</p><p>People optimize for nominal cost per request.</p><p>They should be optimizing for acceptable output per hour of senior attention.</p><p>People optimize for &#8220;it worked in my demo.&#8221;</p><p>They should be optimizing for whether the system still behaves under ambiguity, fatigue, and ugly code.</p><p>Preference is not performance.</p><p>Feeling clever because the run was local is not the same as shipping faster.</p><p>And this is exactly why &#8220;Claude Code is just a harness&#8221; is too glib.</p><p>If the harness were the whole story, swapping the brain would feel like changing 
browsers.</p><p>In practice, on hard work, it feels more like changing engineers.</p><p style="text-align: center;">* * *</p><h1>How to use the trade-off without lying to yourself</h1><p>None of this is an argument against Ollama.</p><p>I think Ollama plus Claude Code is a good setup in plenty of cases.</p><p>It makes sense when you already own the hardware.</p><p>It makes sense when the work is lower stakes and repetitive.</p><p>It makes sense when you want local experimentation, cheap background agents, codebase search, rote transforms, or a disposable first pass you do not mind supervising.</p><p>It also makes sense when data locality is the real requirement and you are willing to pay for that in operations rather than vendor fees.</p><p>That is a perfectly adult trade.</p><p>I am also not arguing that every task deserves Opus.</p><p>Anthropic itself does not say that. Its own Claude Code cost guidance explicitly says Sonnet handles most coding tasks well and costs less than Opus, while Opus is for complex architectural decisions or multi-step reasoning.[3]</p><p>That is a sensible operating model.</p><p>The mistake is pretending the lower-cost path is a drop-in substitute across every task.</p><p>It is not.</p><p>The sane version of this conversation is boring, which is probably why it does not go viral.</p><p>Use cheaper local or cheaper cloud models where the task is constrained, recoverable, or easy to verify.</p><p>Use stronger hosted models where the job is ambiguous, expensive to be subtly wrong on, or likely to sprawl across architecture, state, and hidden assumptions.</p><p>In other words, do what good engineering teams already do with people.</p><p>You do not assign every problem to the cheapest available pair of hands.</p><p>You assign work based on the cost of failure, the difficulty of review, and the value of good judgment.</p><p>Models are not different enough from people to make that logic disappear.</p><p>One operating pattern I like is to treat model 
strength as a routing problem.</p><p>Cheap first pass.</p><p>Mid-tier daily driver.</p><p>Top-tier escalation path.</p><p>That can be local plus Sonnet plus Opus.</p><p>It can be Ollama for commodity work and Claude for the hard edge.</p><p>It can even be all hosted if your team cares more about consistency than about shaving provider spend.</p><p>What matters is that you know which failures you are buying.</p><p>Another operating pattern is to start with the strongest model, learn where it is actually overkill, and then selectively downshift.</p><p>Teams often do the reverse because the invoice is the first pain they can see.</p><p>That is backward.</p><p>You should understand your failure modes before you optimize them.</p><p>And you should measure the right things.</p><p>Not just tokens.</p><p>Measure retries per task.</p><p>Measure review time.</p><p>Measure how often humans take over.</p><p>Measure how often the first plan survives contact with the codebase.</p><p>Measure whether the cheaper system produces smaller bills by producing less useful work.</p><p>That last one is especially important.</p><p>Sometimes the &#8220;savings&#8221; are just lower utilization because the tool got less trustworthy.</p><p>That is not efficiency.</p><p>That is disengagement wearing a finance costume.</p><p>So yes, ignore the fluff.</p><p>Yes, the viral line contains a real trick.</p><p>Claude Code can run against Ollama.[5]</p><p>Yes, that can cut direct model spend dramatically in the right setup.[3][7]</p><p>But no, that does not mean you kept the same thing and simply deleted the cost.</p><p>You changed the model.</p><p>You changed some API behavior.</p><p>You changed hardware assumptions.</p><p>You changed the privacy and ops profile.</p><p>You changed who pays, when they pay, and what kind of failure you are willing to tolerate.</p><p>That can still be a good trade.</p><p>It just is not a magic trick.</p><p>The useful question is not &#8220;can Claude Code talk to 
Ollama?&#8221;</p><p>It can.</p><p>The useful question is whether, after the swap, your team produces better engineering outcomes per dollar of total attention.</p><p>And that is a much less clickable sentence than &#8220;99% cheaper.&#8221;</p><p>It is also the one that matters.</p><p>Because a lot of AI cost optimization is just moving the bill somewhere your dashboard does not show.</p><p>If you want a slightly annoying question to end on, here it is: are you reducing cost, or just making the expensive part harder to see?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><blockquote><p>1. Anthropic. &#8220;Agent SDK overview.&#8221; Claude Developer Platform. platform.claude.com/docs/en/agent-sdk/overview</p><p>2. Anthropic. &#8220;Models overview.&#8221; Claude API Docs. platform.claude.com/docs/en/about-claude/models/overview</p><p>3. Anthropic. &#8220;Manage costs effectively.&#8221; Claude Code Docs. code.claude.com/docs/en/costs</p><p>4. Anthropic. &#8220;Authentication.&#8221; Claude Code Docs. code.claude.com/docs/en/iam</p><p>5. Ollama. &#8220;Anthropic compatibility.&#8221; Ollama Docs. 
docs.ollama.com/api/anthropic-compatibility</p><p>6. Anthropic. &#8220;Plans &amp; Pricing.&#8221; Claude. claude.com/pricing</p><p>7. Ollama. &#8220;Pricing.&#8221; Ollama. ollama.com/pricing</p><p>8. Anthropic. &#8220;Data usage.&#8221; Claude Code Docs. code.claude.com/docs/en/data-usage</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Good Judgment Is the New Important Skill for Engineers]]></title><description><![CDATA[When AI makes implementation cheaper, the engineers who stand out are the ones who can frame the problem, verify the output, and choose the right compromise.]]></description><link>https://plausiblereality.com/p/good-judgment-is-the-new-important</link><guid isPermaLink="false">https://plausiblereality.com/p/good-judgment-is-the-new-important</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sun, 29 Mar 2026 00:54:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e297aacf-4303-4a35-895a-c31a879e161a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The bottleneck has moved upstream</h1><p>A lot of engineers still think the premium skill is speed.</p><p>I think that frame is already outdated.</p><p>The wrong debate is whether AI will replace engineers. The useful question is which parts of engineering are becoming cheap enough that they stop being a differentiator. Writing the first draft of code is getting cheaper. Spinning up infrastructure is getting cheaper. Generating boilerplate is already cheap enough that nobody serious should confuse it with leverage.</p><p>What stays expensive is judgment.</p><p>By judgment, I do not mean vague seniority theatre. I mean the habit of making defensible decisions under uncertainty. It is the ability to frame the problem properly, surface the real constraints, spot the second-order effect, choose the least bad tradeoff, and know when a fast answer is actually dangerous.</p><p>That has always mattered. 
The difference now is relative value. When implementation gets easier, decision quality becomes more visible.</p><p>This is not just a nice philosophical take. The profession has been signalling this for years. The National Society of Professional Engineers still puts public safety and welfare at the centre of engineering responsibility, including the duty to act when an engineer&#8217;s judgment is overruled in dangerous circumstances.[1] ABET&#8217;s current engineering criteria explicitly require graduates to make informed judgments and to use engineering judgment when drawing conclusions from experiments and analysis.[2]</p><p>So no, judgment is not some fluffy add-on that appears after the real work is done. It is part of the job description.</p><p>What has changed is that more of the market can now fake the implementation part for longer.</p><h1>Why this matters more now</h1><p>The easiest mistake right now is to confuse output with value.</p><p>A tool can produce a plausible pull request in minutes. That does not mean the problem was well framed. It does not mean the change is safe to deploy. It does not mean the edge cases were considered. It definitely does not mean the long-term maintenance cost got any lower.</p><p>This sounds obvious, but teams miss it all the time. They celebrate faster artifact creation while quietly offloading more risk onto review, QA, operations, security, and future maintainers.</p><p>The Stack Overflow 2025 Developer Survey captured the mood better than most think-pieces did: more developers actively distrust the accuracy of AI tools than trust it, with 46% saying they distrust the output and 33% saying they trust it.[3] That gap matters. It tells you the industry is not struggling with access to generated code. 
It is struggling with whether generated code deserves confidence.</p><p>That is why the bottleneck has moved upstream.</p><p>If a junior engineer can generate something that looks competent, then the premium shifts to the person who can answer harder questions. Is this the right abstraction? What breaks under retry? What happens when the input order changes? Which failure is acceptable, and which one gets us paged at 2 a.m.? Are we about to ship a local optimization that creates a system-level mess?</p><p>Preference is not performance. A workflow that feels fast in the first ten minutes can be slow over two quarters if it produces brittle decisions.</p><p style="text-align: center;"></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VeU1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VeU1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 424w, https://substackcdn.com/image/fetch/$s_!VeU1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 848w, https://substackcdn.com/image/fetch/$s_!VeU1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 1272w, 
https://substackcdn.com/image/fetch/$s_!VeU1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VeU1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png" width="1456" height="915" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:915,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:153062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://plausiblereality.com/i/192265134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VeU1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 424w, https://substackcdn.com/image/fetch/$s_!VeU1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 848w, 
https://substackcdn.com/image/fetch/$s_!VeU1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 1272w, https://substackcdn.com/image/fetch/$s_!VeU1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Table 1. 
The skill shift inside modern engineering work.</p><h1>What good judgment actually looks like</h1><p>Good judgment is usually less glamorous than people want it to be.</p><p>It often looks like refusing to be impressed too early.</p><p>It looks like asking one more annoying question before approving a design. It looks like noticing that a feature request is really a policy question in disguise. It looks like treating AI output as a draft with a burden of proof, not as a completed answer. It looks like knowing when to stop exploring options and when to reopen the decision because the context changed.</p><p>I think good judgment in engineering usually has five parts.</p><p>First, problem framing. The team that solves the wrong problem elegantly is still wrong. Strong engineers clarify the real outcome, the real constraints, and the real user harm before they touch the solution.</p><p>Second, tradeoff quality. Most meaningful engineering choices are not about perfect versus bad. They are about expensive versus reversible, fast versus auditable, elegant versus operable, and clever versus teachable.</p><p>Third, risk calibration. Not every decision deserves ceremony. Not every decision deserves speed either. Good engineers separate reversible from irreversible moves. They know when a quick experiment is fine and when a small mistake can turn into a permanent tax.</p><p>Fourth, system awareness. A function can be locally clean and globally stupid. A change that looks harmless inside one service can create retry storms, double sends, broken reports, or compliance gaps downstream.</p><p>Fifth, learning discipline. Good judgment is not just the decision itself. It is the quality of the feedback loop after the decision lands.</p><p>That last part matters more than teams admit. 
NIST&#8217;s Secure Software Development Framework exists precisely because ordinary software development life cycle models often do not address security in enough detail, which means teams need explicit secure-development practices layered into the work.[4] NIST&#8217;s AI Risk Management Framework makes a similar point from the AI side: organizations need structured ways to manage AI risk and make trustworthiness operational, not rhetorical.[5]</p><p>In other words, good judgment is not a personality trait. It is a working system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LCOL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LCOL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 424w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 848w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png" width="1456" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115000,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://plausiblereality.com/i/192265134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LCOL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 424w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 848w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Figure 1. A practical judgment loop for modern engineering work.</p><h1>A concrete workflow: the AI-written migration that looks fine until it isn&#8217;t</h1><p>Here is a more realistic example than most benchmark chatter.</p><p>Imagine a team needs to change the schema behind a notification system. An engineer asks an AI tool to draft the migration, the backfill logic, the ORM model changes, and a batch job to clean old records. The tool produces something that compiles. Unit tests pass. The diff looks tidy. 
Everybody feels efficient.</p><p>The weak version of engineering stops there.</p><p>The stronger version gets irritating in exactly the right way.</p><p>Someone asks whether the migration is idempotent. Someone else checks whether the backfill will fight live writes. Another engineer asks what happens if the job is retried halfway through. A product-minded reviewer asks whether message history could become inconsistent for end users during the transition. Ops asks about rollout order, observability, and rollback. Security asks whether the temporary data shape changes access boundaries or audit requirements. A good staff engineer asks the blunt question: do we even need a backfill, or are we trying to preserve a convenience that is not worth the operational risk?</p><p>None of that is syntax work.</p><p>None of that is model benchmark work.</p><p>That is judgment.</p><p>The interesting part is that AI tools can make this gap wider, not smaller. They reduce the time between idea and artifact, which is useful. They also make it easier to skip the thinking that should have happened before the artifact existed. The team now has something concrete to react to, so everyone feels like work has progressed. Sometimes it has. Sometimes the team has just become more efficiently wrong.</p><p>That is why acceptance criteria are the new code review.</p><p>I would rather have an average engineer with sharp acceptance criteria, a rollback plan, explicit assumptions, and a clean monitoring strategy than a brilliant engineer who can generate code quickly and explain afterwards why the blast radius was unforeseeable.</p><h1>How teams actually train judgment</h1><p>Teams say they want judgment, then they build environments that reward speed theatre.</p><p>If you want better judgment, you need to train it in the open.</p><p>The first method is decision visibility. 
Important decisions should leave a trail: what problem was being solved, which options were considered, why the chosen path won, what assumptions were made, and what signals would tell you the choice was failing. This does not need to become bureaucratic sludge. A short decision record is often enough. The point is to make reasoning inspectable.</p><p>The second method is postmortems that are genuinely useful. Not blame rituals. Not timeline fan fiction. Real review of the framing, the assumptions, the missed signals, and the controls that were absent or weak. If the same class of mistake keeps happening, the issue is usually not individual intelligence. It is missing scaffolding.</p><p>The third method is scenario work. Put engineers in realistic tradeoff situations. Ask whether they would ship, delay, isolate, roll back, or redesign. Ask what extra evidence they would require. You learn more from that than from asking whether they remember a specific API call.</p><p>The fourth method is exposure to consequences. Judgment improves when engineers feel the operational cost of their choices. If one group writes code and another group absorbs every outage, you are training local optimization, not engineering maturity.</p><p>The fifth method is managerial honesty. If promotions are still mostly driven by visible output volume, then the organization is telling people what it actually values. You do not get a judgment culture by saying the word judgment a lot.</p><p>&#8220;Ship fast, observe everything, revert faster&#8221; is still a good rule. But notice what sits inside it: choosing what to ship, what to watch, and what would justify a reversal. Again, the hard part is not typing.</p><h1>What I would hire and reward for</h1><p>A lot of interview loops still overweight the part of engineering that is easiest to observe in a compressed setting.</p><p>You can see whether someone can produce an answer quickly. You can see whether they know a framework. 
You can see whether they can speak fluently about architecture patterns. What is harder to see is whether they make cleaner decisions when the information is incomplete and the tradeoffs are ugly.</p><p>That is why many teams accidentally hire for confidence and call it judgment.</p><p>If I cared about judgment, I would ask candidates to walk through a messy decision. A real one. A rollout that went wrong. A migration they delayed. A time they chose not to build something. I would listen for whether they can name the assumptions they were carrying, the signals they trusted too much, and the point at which they realized the original framing was off.</p><p>I would also pay attention to how they talk about constraints. Weak engineers often treat constraints as annoying blockers. Strong engineers use them as design inputs. That difference matters because real work is constraint handling with occasional bursts of typing.</p><p>Inside teams, I would reward engineers who make the system easier to reason about. People who reduce ambiguity. People who write decision records that save other people time. People who improve rollback quality, testability, observability, and handoff clarity. Those things do not always look heroic in the moment. They compound anyway.</p><p>The teams that win in an AI-heavy environment will not be the ones that generate the most code. They will be the ones that waste the least energy on plausible nonsense.</p><h1>What I would optimize for now</h1><p>If I were advising an engineer early in their career, I would still tell them to build real technical depth. You cannot exercise judgment in a domain you do not understand.</p><p>But I would also tell them not to confuse technical depth with career insulation.</p><p>The engineers who become hard to replace are usually the ones who reduce expensive uncertainty. They turn vague requests into workable plans. They spot the hidden dependency before it becomes an outage. 
They know when a tool output is good enough, when it needs rewriting, and when the problem should be pushed back on entirely.</p><p>They make other people safer.</p><p>That is why I think the premium skill has shifted.</p><p>Not away from engineering.<br>Not away from code.<br>Not into empty &#8220;leadership&#8221; talk.</p><p>It has shifted toward better decisions around the code.</p><p>The useful question is not whether AI can write more of the implementation. It clearly can. The useful question is whether your team is getting better at deciding what deserves to exist, what deserves trust, and what deserves a hard no.</p><p>Because once code gets cheap, bad judgment gets very expensive.</p><h1>Notes / references</h1><blockquote><p><strong>[1] </strong>National Society of Professional Engineers, &#8220;NSPE Code of Ethics for Engineers.&#8221; <a href="https://www.nspe.org/career-growth/nspe-code-ethics-engineers">https://www.nspe.org/career-growth/nspe-code-ethics-engineers</a></p><p><strong>[2] </strong>ABET, &#8220;Criteria for Accrediting Engineering Programs, 2025&#8211;2026,&#8221; Criterion 3 student outcomes. 
<a href="https://www.abet.org/accreditation/accreditation-criteria/criteria-for-accrediting-engineering-programs-2025-2026/">https://www.abet.org/accreditation/accreditation-criteria/criteria-for-accrediting-engineering-programs-2025-2026/</a></p><p><strong>[3] </strong>Stack Overflow, &#8220;2025 Developer Survey &#8211; AI.&#8221; <a href="https://survey.stackoverflow.co/2025/ai">https://survey.stackoverflow.co/2025/ai</a></p><p><strong>[4] </strong>NIST, &#8220;SP 800-218: Secure Software Development Framework (SSDF) Version 1.1.&#8221; <a href="https://csrc.nist.gov/pubs/sp/800/218/final">https://csrc.nist.gov/pubs/sp/800/218/final</a></p><p><strong>[5] </strong>NIST, &#8220;Artificial Intelligence Risk Management Framework (AI RMF 1.0).&#8221; <a href="https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10">https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10</a></p></blockquote>]]></content:encoded></item><item><title><![CDATA[The Best Office AI Is the One That Takes Out the Trash]]></title><description><![CDATA[Claude makes prettier inbox triage. ChatGPT closes the loop. 
That is not the same thing.]]></description><link>https://plausiblereality.com/p/the-best-office-ai-is-the-one-that</link><guid isPermaLink="false">https://plausiblereality.com/p/the-best-office-ai-is-the-one-that</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:43:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/68a71c4b-ed59-4681-b446-d44f1959028b_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I gave Claude and ChatGPT a stupidly normal office task.</p><p>Not &#8220;build me a startup.&#8221; Not &#8220;write a literature review.&#8221; Just this: take a specific newsletter pile in my inbox, summarize each email so I can scan for anything worth opening, keep the links usable so I can click through to the interesting ones, and when I am done, delete the batch.</p><p>This should be easy.</p><p>It is digital housekeeping with better branding. Read the pile. Surface the good stuff. Throw away the rest.</p><p>And somehow this is where the product differences get brutally honest.</p><p>ChatGPT got through most of the workflow. It summarized the emails well enough. It let me skim. When it came time to delete, it acted like an adult: asked permission, confirmed the count, then cleared the batch cleanly.</p><p>It also kept finding ways to make the middle of the task annoying. Links got mangled. Sessions got sticky. The workflow worked, but it felt like driving a powerful car with one door that sometimes refuses to close.</p><p>Claude flipped the experience.</p><p>The list looked better. The links were better. The whole thing felt calmer, cleaner, more office-shaped. For triage, it was excellent. I could glance down the batch, pick what looked interesting, and move on without feeling like I was debugging my inbox.</p><p>Then I asked it to finish the job.</p><p>Delete the batch? No.</p><p>Fine. Archive it then? 
Also no.</p><p>At that point, the better office assistant was giving me manual instructions, which is a bit like hiring someone to sort your paperwork and then discovering they consider the filing cabinet an ethical boundary.</p><p>The wrong debate is which AI is smarter.</p><p>Office work is not an IQ test. It is a relay race of low-status annoyances. Open the thing. Read the thing. Summarize the thing. Preserve the useful artifacts. Clean up the mess. If a tool nails the first four and refuses the fifth, it has not completed the job. It has handed the ugliest step back to you.</p><p>That sounds obvious, but teams miss it all the time.</p><p>They buy the nicest demo. They get seduced by tone, polish, model personality, maybe a cleaner UI. Then they confuse a pleasant interaction with actual workflow completion.</p><p>Preference is not performance.</p><p>In my newsletter test, Claude won preference. ChatGPT won completion. Neither won the whole workflow.</p><p>The plot twist is that this is not only about model quality. It is also about product philosophy.</p><p>OpenAI&#8217;s current product stack is explicitly built around <a href="https://help.openai.com/en/articles/11487775-connectors-in-chatgpt">connected apps</a> and <a href="https://help.openai.com/en/articles/11752874-chatgpt-agent">agentic action</a>. Apps in ChatGPT can search external sources, run deep research, and in some cases take write actions with confirmation. ChatGPT agent is also designed to navigate websites, work with files, connect to email and document repositories, and take actions on your behalf while keeping you in control.</p><p>Anthropic&#8217;s workplace story is real too. <a href="https://claude.com/resources/tutorials/using-research-and-google-workspace">Claude&#8217;s Research + Google Workspace setup</a> can access emails, calendar data, documents, and web information for analysis. 
<a href="https://support.claude.com/en/articles/12138966-release-notes">Claude in Chrome</a> has also been improving its ability to handle long browser workflows and navigate common sites like Gmail, Google Calendar, Google Docs, Slack, and GitHub. But Anthropic&#8217;s own <a href="https://support.claude.com/en/articles/12902446-claude-in-chrome-permissions-guide">permissions guide</a> draws a hard line around permanent deletions.</p><p>So Claude did not just &#8220;fail&#8221; my cleanup step. There is a decent chance it was obeying the line Anthropic drew around that kind of action.</p><p>This sounds obvious, but the acceptance criteria are the product review.</p><p>If your test prompt is &#8220;summarize my newsletters,&#8221; both tools look competent. If the real job is &#8220;summarize them, keep the links usable, let me skim the interesting ones, then bulk archive or delete the batch,&#8221; the ranking changes completely.</p><p>The useful question is not which model is smarter. It is which one can survive the full office loop without quietly handing the worst step back to the human.</p><p>That is a much harsher benchmark.</p><p>It is also a more useful one.</p><p>My read is that ChatGPT keeps stretching toward the high-accountability ends of the market: cited research, source-sensitive outputs, and agents that are increasingly expected to close loops rather than just advise. <a href="https://help.openai.com/en/articles/10500283-deep-research-in-chatgpt">Deep research</a> is explicitly framed as a documented report with citations or source links that you can download as Word, Markdown, or PDF. Claude Research is not weak here either, but Claude often feels stronger in the messy middle of daily knowledge work.</p><p>Which means the irony writes itself.</p><p>The product most people reach for first can feel weirdly clumsy on one of the most ordinary office tasks. 
The product with the calmer, more competent office vibe can nail the pleasant part and then stop at the exact moment I want an assistant to stop being tasteful and just take out the trash.</p><p>So is one of the two giants truly better?</p><p>No.</p><p>They are incomplete in opposite directions.</p><p>ChatGPT is more willing to finish the job, but can make the journey feel rougher than it should.</p><p>Claude makes the middle of the workflow feel better, but in at least some surfaces it draws a harder line around destructive cleanup, which means the human still has to walk in and do the last boring step.</p><p>And that leaves you with the most absurd outcome possible: paying for two different AI products because each one is missing the part the other gets right.</p><p>That is the real plot twist.</p><p>AI assistants keep promising consolidation. What they may deliver first is a new category of software sprawl, where one tool reads better, another one finishes better, and you pay both to approximate one competent office worker.</p><p>The problem is usually not intelligence. It is handoff.</p><p>The best office AI is not the one that impresses you at the start of the workflow.</p><p>It is the one that is still useful at the last ugly click.</p>
]]></content:encoded></item><item><title><![CDATA[Your Startup Is Hiring Engineers the Wrong Way]]></title><description><![CDATA[Why copying big-tech interviews is one of the most expensive mistakes early-stage teams make]]></description><link>https://plausiblereality.com/p/your-startup-is-hiring-engineers</link><guid isPermaLink="false">https://plausiblereality.com/p/your-startup-is-hiring-engineers</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:34:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cf2577e3-a533-46a7-90e2-1161ab99a038_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I watched a six-person startup spend four months looking for a senior backend engineer. They ran five-round loops. Leetcode. System design for a million-user scale they were nowhere close to reaching. Behavioural questions borrowed from a Google interview guide someone found on Medium.</p><p>They finally hired someone with eight years at two household-name companies. Impressive resume. Strong signals across every rubric they had copied.</p><p>He lasted eleven weeks.</p><p>Not because he was bad. Because he had never worked without guardrails. No staging environment, no API docs, no product manager handing him specs. When the founder asked him to just figure it out, he froze. He had spent a decade operating brilliantly inside systems other people had already built. 
Nobody had ever asked him to build the system itself.</p><p>The startup lost four months of hiring time, three months of onboarding, and roughly six months of momentum. At the early stage, that kind of hit can be the difference between shipping and dying. Neil Matthams, who has placed over 500 technical hires across 30 countries for companies like Canva, UBS, and Grab, puts the cost of a misaligned startup hire at <em>three to six months of lost momentum</em>. In my experience, that estimate is conservative.</p><p>The problem is usually not that startups hire bad engineers. It is that they use the wrong exam.</p><p style="text-align: center;">* * *</p><h1>The Wrong Frame: Borrowing Someone Else&#8217;s Exam</h1><p>Here is the lazy pattern. Founders see how Google, Meta, or Amazon hire. They assume these companies figured out hiring because they are successful. So they copy the playbook: algorithm rounds, system design at scale, behavioural interviews calibrated for large cross-functional organisations.</p><p>This sounds reasonable and is almost entirely wrong.</p><p>Big tech companies have a specific problem: too many candidates. Their interview process is a filter designed to reduce a massive pool to a manageable shortlist using standardised, repeatable evaluations. It works for that purpose. It is optimised for false-negative tolerance&#8212;they would rather reject a good candidate than hire a bad one, because at their scale the cost of a wrongly rejected candidate is absorbed by the thousands still in the pipeline.</p><p>Startups have the opposite problem. Every hire matters disproportionately. You cannot afford to filter out the scrappy builder who happens to be rusty on red-black trees. 
And you definitely cannot afford to let through the polished specialist who has never shipped without a paved road under their feet.</p><p>A North Carolina State University study found that whiteboard-style technical interviews <em>test whether a candidate has performance anxiety rather than whether they are competent at coding</em>. The researchers concluded that many well-qualified candidates are being eliminated because they are not used to performing under artificial observation. Now imagine applying that broken filter in a startup where your hiring pool is already small and your margin for error is zero.</p><p>There is also a subtler problem. The big-tech interview process is self-reinforcing. Engineers who went through it at Google bring it with them when they join or advise startups. Recruiters who cut their teeth at Meta default to what they know. As one FAANG head of recruiting admitted, the inertia is enormous. Companies have built entire recruiting machines around these processes, with years of calibration data. That calibration data is valuable&#8212;for the environment it was calibrated in. It tells you nothing useful about a twelve-person company trying to find product-market fit.</p><p>The wrong debate is whether Leetcode is a good signal or a bad signal. The useful question is: a signal of what? In a big company, it might proxy for analytical rigour. In a startup, you need a signal for something else entirely&#8212;and you are not going to get it from an algorithm puzzle.</p><p style="text-align: center;">* * *</p><h1>What You Actually Need (And What You Don&#8217;t)</h1><p>The traits that make someone a top performer at a scaled company are frequently the exact traits that make them struggle at an early-stage startup. This is not a criticism of either environment. It is a mismatch problem. And mismatches kill startups faster than bad code does.</p><p>Scaled companies reward depth in a single domain. Startups need breadth across many. 
Scaled companies test how someone operates within an existing process. Startups need someone who can create the process from nothing. Scaled companies evaluate collaboration across large cross-functional teams. Startups need someone who can get it done alone, often wearing three hats at once.</p><p>Think about it from the candidate&#8217;s side for a moment. Someone who has spent six years at Amazon has never had to set up their own CI pipeline. They have never chosen a hosting provider or configured monitoring from scratch. They have never been the person who decides whether the company uses PostgreSQL or MongoDB, because that decision was made before they arrived. Their entire career has been built inside an environment where infrastructure, tooling, documentation, and process were givens. That is not a weakness. It is a feature of the environment they optimised for.</p><p>But when that person joins a six-person startup and realises there is no staging environment, no API documentation, no proper error monitoring, and no product manager writing specs&#8212;and nobody is coming to fix any of that&#8212;they do not adapt. They stall. The founder thought they were getting someone who could build. What they got was someone who could operate inside a system that someone else had already built.</p><p>Let me be specific about what the first ten to twenty engineers at a startup actually need to be good at:</p><p><strong>Ambiguity tolerance.</strong> This is the single biggest predictor. Edmond Lau, who wrote <em>The Effective Engineer</em>, calls it the most important skill for startup engineers. If you need thick documentation, a strong product manager writing specs, and well-defined requirements before you can move, you will not survive an early-stage environment. The best startup engineers don&#8217;t just tolerate ambiguity. They enjoy it. 
They see a vague problem and their first instinct is to start narrowing it down themselves, not to ask who is going to narrow it down for them.</p><p><strong>Breadth over depth.</strong> In the first year, your engineers will touch the frontend, the backend, the infrastructure, the deployment pipeline, possibly customer support, and maybe even sales scoping. Deep specialisation is a luxury that comes later. Right now you need people whose skill surface is wide enough that they do not become a bottleneck every time the problem shifts domain. This does not mean they need to be expert in everything. It means they need to be comfortable being not-expert and still shipping.</p><p><strong>Tool-building instinct.</strong> Time is the critical resource at a startup. Engineers who instinctively build tools to automate repetitive work buy the team time it cannot get any other way. This is different from engineering perfectionism. It is pragmatic leverage. The engineer who spends a day building a deployment script that saves the team twenty minutes every release is doing more for the company than the one who writes the most elegant algorithm.</p><p><strong>Pragmatism over purity.</strong> A startup engineer who insists on comprehensive code reviews, full unit test coverage, and architectural purity before shipping will slow you down at the exact moment speed matters most. This does not mean writing garbage. It means knowing which battles to fight and which shortcuts are acceptable debt versus unacceptable risk. It means understanding that shipping fast, observing everything, and reverting faster is often the better strategy than trying to get it perfect before anyone sees it.</p><p><strong>Builder identity, not operator identity.</strong> The distinction is important. An operator thrives inside a well-built machine. A builder thrives when there is no machine yet. Both are valuable. But if you are pre-product-market fit, you need builders. 
The fastest way to spot the difference: builders talk about things they created. Operators talk about things they improved.</p><p><strong>Growth instinct.</strong> Startup engineers will be asked to do things that are not in their job description, not in their domain, and not in their comfort zone. The ones who thrive are the ones who see that as a feature, not a bug. David Domingo, writing about the startup mentality, notes that the unknown pool engineers dive into is often not even code-related&#8212;it might be customer support, sales feasibility, or hiring new engineers. The ones who adopt a growth mindset during those detours are the ones who survive.</p><p style="text-align: center;">* * *</p><h1>The Interview Process That Actually Works</h1><p>If standard big-tech interviews measure the wrong things, what should you do instead? I think the answer is simpler than most founders expect. You shift from testing knowledge to testing behaviour. From evaluating what someone knows to evaluating how they think when they do not know.</p><p><strong>Start with the mission, not the job description.</strong> Share your 90-day mission with the candidate. Not a vague company pitch. One clear outcome they would need to deliver, why it matters right now, and what constraints they are operating within. Three sentences, max. Then ask: What risks do you see? What information would you need before starting? The quality of their questions tells you more than the quality of their algorithm solutions ever will.</p><p>This sounds obvious, but teams miss it all the time. Most startup interviews start with a description of the company, then move to a technical screen, then eventually&#8212;maybe&#8212;discuss what the person would actually be doing. Flip it. Lead with the mission. 
The candidates who engage deeply with the real problem are the ones who will engage deeply with the real work.</p><p><strong>Use work samples, not whiteboard puzzles.</strong> Research from Aberdeen Group shows that employers using realistic pre-hire assessments are 24% more likely to hire employees who exceed performance goals and see 39% lower turnover. Give candidates a small, real problem from your actual codebase or product domain. Not a take-home that eats their weekend. A focused, two-hour exercise that mirrors what their first week would actually look like.</p><p>What you watch for matters more than what they produce. Do they ask clarifying questions or just start coding? Do they make explicit tradeoffs or try to boil the ocean? Do they ship something usable or spend all the time on architecture? Do they mention edge cases that would matter in production? An engineer who delivers something imperfect but working in two hours, with a clear explanation of what they would do next, is almost always a better startup hire than one who delivers an elegant but incomplete solution.</p><p><strong>Test for range, not pedigree.</strong> Ask about times they did work outside their job description. Listen for enthusiasm when they talk about wearing multiple hats. If every answer starts with &#8220;In my role as...&#8221; followed by a clean, scoped responsibility, that person has probably never operated outside a well-defined lane. That is fine for a 5,000-person company. It is a red flag for a 10-person one.</p><p><strong>Probe for ownership instinct.</strong> Present a scenario: You join on Monday and discover there is no error monitoring, the deployment process is manual, and the only documentation is in someone&#8217;s head. What do you do first? Builders will light up. They will start triaging, prioritising, and making a plan. Operators will ask who is responsible for fixing that. Neither response is wrong in the abstract. 
But only one of them works at an early-stage company.</p><p><strong>Compress the loop.</strong> Four to five rounds over three weeks is a process designed for companies with recruiting departments and candidate pipelines. You do not have that. Two rounds, maybe three. A conversation that tests judgment, a work sample that tests building, and a culture-fit conversation that tests whether they actually want the environment you have, not the one they wish you had. If you cannot make a decision in three rounds, the problem is probably your evaluation criteria, not the candidate&#8217;s signals.</p><p>One more thing on process: speed is itself a signal. The best startup candidates have options. If your hiring loop takes three weeks, someone else has already made them an offer. The startups that move fast in hiring tend to be the ones that move fast in everything else&#8212;and the best candidates notice.</p><p style="text-align: center;">* * *</p><h1>The AI Era Makes This Gap Wider</h1><p>Here is the part nobody in the original discussion talks about enough.</p><p>AI tools have made the generalist builder even more valuable and the narrow specialist even more exposed. An engineer with broad instincts and comfort with ambiguity can now use AI to fill gaps in domains where they are not deep. They can scaffold a frontend they have never built before, debug infrastructure patterns they have only seen once, and generate boilerplate that used to take days.</p><p>But this only works if the person already has the builder mindset. AI does not help an engineer who is waiting for someone to define the requirements. It does not help someone who needs a well-paved road. The leverage is asymmetric: it multiplies the effectiveness of scrappy generalists and barely moves the needle for process-dependent specialists.</p><p>I think this is the most underappreciated shift in startup hiring right now. The engineer you want is no longer someone who has deep knowledge in your exact stack. 
It is someone who can move across stacks, use AI to accelerate the parts they are less familiar with, and still make good architectural decisions because they understand the fundamentals. The bottleneck has moved upstream&#8212;from knowing how to implement to knowing what to implement and why.</p><p>This also changes what your interview should test for. Instead of asking whether someone can implement a specific algorithm from memory, you should be asking whether they can take a vague product requirement, figure out the right technical approach, and ship it&#8212;with or without AI assistance. The skill that matters is judgment, not recall.</p><p>This means the evaluation gap between startup hiring and big-tech hiring is getting wider, not narrower. The startups that figure this out first will build faster with smaller teams. The ones still running five-round Google-clone interviews will keep losing their best candidates to competitors who made an offer in four days.</p><p style="text-align: center;">* * *</p><h1>The Market Is Shifting. Your Interviews Should Too.</h1><p>The 2025&#8211;2026 hiring market is moving toward what Ravio calls precision hiring. Teams are smaller. Expectations are higher. The growth-at-all-costs era is over, and with it the luxury of hiring for potential and hoping it works out.</p><p>CB Insights data shows that 23% of startup failures trace back to team misalignment. Not lack of funding. Not bad market timing. The wrong people. And the most common version of wrong people is not people who are bad at engineering. It is people who are good at engineering in the wrong context.</p><p>When you have twelve months of runway, one wrong hire can shave off two. That is not a metaphor. A misaligned engineer consumes onboarding time, creates drag on the team as they struggle to adapt, and eventually requires a painful offboarding that demoralises everyone. You do not just lose the salary. 
You lose the opportunity cost of what a well-matched engineer would have shipped in that same window.</p><p>Tony Hsieh at Zappos once estimated that poor culture fits had cost the company over $100 million. That was at a scaled company with resources to absorb the hit. A startup does not have that luxury. Every seat matters. Every month matters. Every hire is a bet&#8212;and you need your evaluation system to help you make better bets, not just more structured ones.</p><p>If your interview process cannot distinguish between someone who is good at operating and someone who is good at building, you are flipping a coin on every hire. And each coin flip costs you three to six months.</p><p style="text-align: center;">* * *</p><h1>So What Now</h1><p>I am not saying big-tech engineers are bad hires for startups. Some of the best startup engineers I have worked with came from large companies. But they were the ones who were restless there. The ones who hated the process overhead, who took on side projects that nobody asked for, who were slightly annoyed at how long everything took. Those people translate beautifully into startup environments.</p><p>The interview is where you find that out. Not by asking them to invert a binary tree. By giving them a messy, real, underspecified problem and watching whether they lean in or look for the spec.</p><p>Your evaluation system is not neutral. It is a filter. And right now, most startups are using a filter designed to find a completely different kind of engineer than the one they actually need. They are borrowing an exam from a different school and wondering why the grades do not predict performance.</p><p>Preference is not performance. A polished interview process that makes you feel professional is not the same as a hiring process that identifies the right people. Sometimes the right process looks a little rough. 
A real problem from your codebase, a direct conversation about what the next ninety days look like, and an honest assessment of whether this person thrives in chaos or merely survives it.</p><p>Maybe the useful question is not <em>how do we hire better engineers</em>. It is <em>are we even measuring the right thing</em>?</p><h2>Notes and References</h2><p>1. Neil Matthams, &#8220;Startups Should Evaluate Engineers Differently From Big Companies,&#8221; <em>Engineering Leadership Newsletter</em>, 2026. Based on 500+ technical hires across 30 countries.</p><p>2. NC State University (2020), study on whiteboard-style technical interviews measuring performance anxiety rather than coding competence. Published via ScienceDaily.</p><p>3. Aberdeen Group research on pre-hire assessments: 24% higher likelihood of exceeding performance goals, 39% lower turnover.</p><p>4. CB Insights analysis of startup failure reasons: 23% attributed to team misalignment.</p><p>5. Edmond Lau, <em>The Effective Engineer</em>, on ambiguity tolerance as the most important trait for startup engineers.</p><p>6. Ravio, &#8220;Tech Hiring Trends in 2026: The 4 Big Shifts Shaping the Tech Job Market.&#8221;</p><p>7. Tony Hsieh (Zappos) estimate: poor culture fits costing over $100 million. Widely cited in startup hiring literature.</p>]]></content:encoded></item><item><title><![CDATA[AI Did Not Take Your Job. 
It Promoted You.]]></title><description><![CDATA[Most knowledge work is moving one level up the stack: less first draft, more judgment, orchestration, and accountability.]]></description><link>https://plausiblereality.com/p/ai-did-not-take-your-job-it-promoted</link><guid isPermaLink="false">https://plausiblereality.com/p/ai-did-not-take-your-job-it-promoted</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:22:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0a8aa09a-27c4-4ded-8b86-5b749fb14e30_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The weird thing about AI at work is that both optimists and doomers keep reaching for movie plots.</p><p>In one version, the machine politely helps with email and then sits in the corner like a very expensive stapler.</p><p>In the other, it storms the building, takes your laptop, and leaves you explaining transferable skills to a recruiter named Brad.</p><p>Real work is less cinematic and more annoying.</p><p>In a lot of knowledge jobs, AI does not immediately replace the human. It drags the human one level up the stack. You stop being the person who writes the first draft of everything. You become the person who decides what should exist, what good looks like, what can be trusted, and what actually ships.</p><p>That is why I keep coming back to the same line: AI did not take your job. It promoted you.</p><p>Not to &#8220;manager,&#8221; exactly. More like operator, editor, reviewer, systems designer, and person who still gets blamed when the output is wrong. So yes, a promotion. But in the traditional corporate sense, where responsibility shows up before the pay band does.</p><p>I think this frame is more useful than the usual replacement panic because it matches what is already happening in real workflows. 
The ILO and NASK&#8217;s 2025 global exposure index estimates that 25 percent of global employment sits in occupations potentially exposed to generative AI, with higher exposure in high-income countries, and says transformation rather than replacement is the more likely outcome. Full job automation, in their framing, remains limited because many tasks still require human involvement.<sup>[1]</sup></p><p>That does not mean everyone is safe. It definitely does not mean everyone gets a nice strategic role and a bigger title.</p><p>It means &#8220;job&#8221; is the wrong unit of analysis.</p><p>A job is a lumpy bag of tasks. AI peels that bag apart.</p><p>Some tasks get cheaper. Some disappear. Some become review work. Some move upstream into planning and constraint setting. Some move downstream into verification, integration, exception handling, and accountability. The human role that survives is usually less about producing every line from scratch and more about governing the system that now produces a lot of the raw material.</p><p>That is the promotion.</p><h2>The wrong debate is replacement</h2><p>The wrong debate is &#8220;Will AI replace software engineers?&#8221; or &#8220;Will AI replace marketers?&#8221; or &#8220;Will AI replace analysts?&#8221;</p><p>The useful question is: which parts of those jobs are moving from direct execution to supervision?</p><p>That sounds obvious, but teams miss it all the time.</p><p>If you are a developer, draft code, boilerplate tests, migrations, documentation stubs, and first-pass debugging suggestions get cheaper. If you are a product manager, turning a fuzzy pile of stakeholder thoughts into a first draft of a PRD gets cheaper. If you are in support, standard replies and retrieval-heavy responses get cheaper. If you are an operator, summarizing a process, comparing options, or turning a meeting mess into action items gets cheaper.</p><p>Cheaper does not mean solved. 
It means the scarce part moves.</p><p>The bottleneck has moved upstream.</p><p>When a model can produce ten decent starting points in two minutes, the expensive thing is no longer typing speed. It is framing the problem, supplying the right context, defining the constraints, spotting the hidden failure modes, and deciding which of the ten drafts deserves to exist in the world.</p><p>That is why the work starts to feel managerial even when your title does not.</p><p>OECD analysis points in the same direction. In occupations with high AI exposure, vacancies disproportionately demand management and business-process skills, along with social, emotional, and digital skills. Across ten OECD countries, 72 percent of vacancies in high-exposure occupations demanded at least one management skill and 67 percent demanded at least one business-process skill.<sup>[2]</sup> My read is simple: a growing share of white-collar work now includes managerial motions even when nobody calls them that. You are specifying work, routing work, reviewing work, and coordinating outputs across humans and systems.</p><p>Again, not glamorous. Just accurate.</p><p>This also explains why AI is hitting white-collar, highly digitized work first. OECD work on worker exposure says the occupations most exposed to AI are typically white-collar roles such as IT professionals, managers, and science and engineering professionals, while manual occupations tend to have lower AI exposure.<sup>[3]</sup> For the people reading this blog - developers, leads, PMs, founders, CTOs - that matters. The ladder wobbling under your feet is not a factory story. It is your story.</p><p>And yes, clerical work faces the harshest direct exposure in the ILO data.<sup>[1]</sup> Some jobs are under much more direct pressure. Some task layers will be shaved off so aggressively that the &#8220;promotion&#8221; feels less like advancement and more like being told to supervise your own replacement. 
I am not pretending otherwise.</p><p>I am saying the dominant early pattern in knowledge work is not clean replacement. It is messy reallocation.</p><h2>Your new job description is not &#8220;use AI more&#8221;</h2><p>This is where most teams get embarrassingly vague.</p><p>They tell people to &#8220;use AI&#8221; as if that were a workflow and not a cry for help.</p><p>What actually matters is whether the job has been redesigned around cheaper first drafts and more expensive judgment.</p><p>The new job description usually looks something like this:</p><p>You define the artifact before it exists.</p><p>You specify what good looks like.</p><p>You supply examples, edge cases, and constraints.</p><p>You choose what the model is allowed to do and what it is not allowed to touch.</p><p>You evaluate the output.</p><p>You integrate it into a broader system.</p><p>You own the result anyway.</p><p>That is not &#8220;prompting.&#8221; That is operating.</p><p>I notice this in my own work constantly. I start fewer things from a blank page now. I start with a brief, a list of failure modes, and some kind of verification step. The keyboard time went down. The responsibility absolutely did not.</p><p>This is why I keep saying acceptance criteria are becoming the real leverage point. If generation is cheap, specification matters more. A bad brief no longer creates slow bad work. It creates fast bad work. That is worse. Garbage at machine speed is still garbage. It just arrives with better punctuation.</p><p>There is also a shift from judgment as theater to artifact as evidence.</p><p>In the old version of white-collar work, a lot of value was signaled socially. Could you sound smart in a meeting? Could you explain the plan? Could you bluff through ambiguity with enough confidence that nobody asked follow-up questions?</p><p>AI is rude to that style of competence. It can generate plausible language by the barrel. 
So the emphasis moves toward the artifact that survives contact with reality: the shipped feature, the tested process, the memo that withstands scrutiny, the forecast with clear assumptions, the incident write-up that actually explains what happened.</p><p>That is a better game, frankly. But it is less forgiving.</p><p>The smartest people in AI-heavy environments are not the ones who can make the model sound magical in a demo. They are the ones who can build a harness around it: context packs, templates, examples, validations, tests, rollback paths, review rules, and clear ownership. Harness engineering sounds less sexy than &#8220;prompt wizardry,&#8221; which is unfortunate for marketing but excellent for civilization.</p><p>The best AI workflow is usually slightly boring.</p><h2>Why juniors get lifted and seniors get rearranged</h2><p>One of the most interesting things in the early evidence is who benefits most from these tools.</p><p>In customer support, Brynjolfsson, Li, and Raymond found that access to a generative AI assistant increased productivity by 14 percent on average, with a 34 percent improvement for novice and lower-skilled workers and minimal impact for the most experienced workers.<sup>[4]</sup> In a separate experiment on professional writing tasks, Noy and Zhang found that ChatGPT users finished 40 percent faster and produced output judged 18 percent higher in quality.<sup>[5]</sup> And in software development, Cui and coauthors reported that across three field experiments involving 4,867 developers at Microsoft, Accenture, and a Fortune 100 company, access to an AI coding assistant increased completed tasks by 26.08 percent, with larger gains for less experienced developers.<sup>[6]</sup></p><p>That is not a small pattern. It is the shape of the change.</p><p>AI is very good at collapsing parts of the apprenticeship curve.</p><p>It helps people produce a passable first version sooner. 
It exposes lower-skill workers to patterns that used to live mostly in the heads of stronger operators. It narrows some performance gaps. It makes the middle of the quality distribution fatter.</p><p>Good. Also dangerous.</p><p>Good, because more people can do useful work faster.</p><p>Dangerous, because a lot of career ladders were built on earning your way through the routine layers. If the routine layers get compressed, organizations have to become much more deliberate about how people learn judgment. You cannot grow seniors out of thin air. And you definitely cannot grow them by asking juniors to rubber-stamp machine output all day like exhausted airport screeners.</p><p>So the ladder changes.</p><p>Juniors can often get to competent output faster.<sup>[4][5][6]</sup></p><p>Mid-level people lose some of the value that came from being the ones who could grind through the standard work reliably.</p><p>Seniors remain critical, but the nature of their value shifts. Less of it comes from being the fastest hands on the keyboard. More of it comes from task decomposition, exception handling, quality judgment, system design, and teaching others where the model will betray them.</p><p>This is why some experienced workers misread the moment.</p><p>They look at AI helping a junior write passable code or prose and conclude that seniority no longer matters.</p><p>Wrong.</p><p>Expertise usually does not disappear. It moves up a layer. When the easy part of the work (call it 60 percent) gets cheaper, the remaining 40 percent becomes the whole game.</p><p>That 40 percent is where costly mistakes live.</p><h2>The jagged frontier is why blind delegation is stupid</h2><p>If AI really were just a universal multiplier, the story would be easy. You would hand it everything and go home early.</p><p>Sadly, reality insists on nuance.</p><p>Dell&#8217;Acqua and coauthors&#8217; work on what they called the &#8220;jagged technological frontier&#8221; is useful here. 
In the BCG field experiment, consultants using AI completed tasks inside the model&#8217;s capability frontier faster and at higher quality. A Harvard summary of the study reports more than 25 percent greater speed, more than 40 percent higher human-rated performance, and more than 12 percent more task completion on those tasks. But on harder tasks outside the frontier, consultants using AI were 19 percentage points less likely to produce the correct answer.<sup>[7]</sup></p><p>That is the management problem.</p><p>Your promotion is not just &#8220;use AI more.&#8221; Your promotion is deciding where AI belongs, where it needs guardrails, and where it should stay out of the room.</p><p>Some tasks should be delegated cleanly.</p><p>Some should be AI-assisted but tightly checked.</p><p>Some should use AI for option generation and humans for final reasoning.</p><p>Some should remain almost entirely human because the cost of subtle error is too high.</p><p>That routing decision is work. Real work.</p><p>The lazy way to use AI is to ask it for everything.</p><p>The adult way is to map the workflow, identify the high-volume low-novelty steps, keep humans close to the decision points, and build a verification loop that catches silent failure.</p><p>This is where &#8220;preference is not performance&#8221; becomes practical rather than philosophical.</p><p>A tool can feel magical in chat and still be mediocre in a shipping workflow. If it creates more review burden than usable output, you do not have leverage. You have a very cheerful source of rework.</p><p>The metrics that matter are boring and therefore excellent: cycle time, acceptance rate, rework, escaped defects, reviewer load, and time spent clarifying requirements after generation. 
If those do not improve, your &#8220;AI strategy&#8221; is probably an expensive first-draft machine wearing a blazer.</p><h2>Promotion without a raise is still a promotion</h2><p>To be clear, I am using &#8220;promotion&#8221; to describe functional change, not moral progress.</p><p>A company can absolutely use AI to widen spans of control, compress headcount, raise output expectations, and dump more review work onto the same number of people. The ILO is explicit that policy choices and implementation paths will shape both worker retention and job quality in AI-exposed occupations.<sup>[1]</sup></p><p>So yes, the promotion can be rude.</p><p>You can get more leverage and less comfort at the same time.</p><p>You can move into more judgment-heavy work while also being measured harder, interrupted more often, and asked to cover a broader surface area. This is what partial automation looks like in the wild. It rarely arrives with a brass band. It arrives as &#8220;Can you also oversee the AI-assisted version of this process?&#8221;</p><p>That is why half-adoption is miserable.</p><p>If management automates generation but not evaluation, workers become janitors for machine sludge. They spend their day reviewing plausible nonsense, correcting avoidable errors, and cleaning up drafts that should never have existed. That is not leverage. That is a new flavor of admin burden.</p><p>If management redesigns the workflow properly, the human gets moved toward the work that actually deserves a human: ambiguous trade-offs, exceptions, prioritization, escalation, stakeholder judgment, and quality control.</p><p>Those are not identical futures.</p><p>So when leaders say &#8220;we are rolling out AI,&#8221; the real question is not whether the model is good. 
The real question is whether the workflow around the model is sane.</p><p>CTOs, especially, should treat this as org design rather than software procurement.</p><p>Buying a capable model is easy.</p><p>Deciding where it sits in the workflow, how people learn to use it, what must be tested, what gets logged, what gets escalated, and what counts as done - that is the hard part. That is management work. Which is exactly why I think &#8220;promotion&#8221; is the right word.</p><h2>A concrete workflow: shipping a feature with AI in the loop</h2><p>Let me make this less abstract.</p><p>Say a team needs to ship a modest internal feature: a permissions update, a reporting view, maybe a boring admin screen. The kind of task that used to begin with somebody staring into an editor and metabolizing caffeine.</p><p>In the old workflow, a developer might spend the first chunk of time translating a fuzzy request into a plan, then drafting implementation, then remembering tests, then writing the supporting documentation and release notes because no one else wanted to.</p><p>In the promoted workflow, the sequence changes.</p><p>1. Start with the brief, not the code. Write the user outcome, the constraints, the non-goals, the edge cases, and the acceptance criteria. This is not paperwork. This is the control surface.</p><p>2. Use AI to turn that brief into options: implementation outline, likely risks, missing requirements, test cases, migration concerns, and rollout questions. Make the model show its assumptions.</p><p>3. Let AI draft the first pass of code, tests, docs, and change summary where appropriate.</p><p>4. Run the boring machines on the machine output: linting, unit tests, type checks, security scanning, schema diff review, whatever applies.</p><p>5. Use the human review budget on what is actually risky: business logic, failure handling, weird permissions edges, naming that affects maintainability, and whether the thing should exist in this form at all.</p><p>6. 
Use AI again for downstream packaging: support macros, stakeholder update, release notes, runbook tweaks, and follow-up tickets.</p><p>Notice what changed.</p><p>The human did not disappear. The human moved.</p><p>Less blank-page drafting.</p><p>More problem framing.</p><p>More evaluation.</p><p>More integration.</p><p>More accountability.</p><p>That is the promotion in one small workflow.</p><p>I use the same pattern in writing. I do not start by asking the model to &#8220;write the article.&#8221; I start by trying to make the argument legible to myself, define what would make it true, and decide what evidence deserves to stay. Then the machine can help with structure, counterarguments, phrase alternatives, compression, and cleanup. The model is useful. The model is not the author. The job has moved upward, not outward.</p><p>This also explains why AI-heavy teams start caring more about reusable scaffolding. Shared prompts are fine. Shared rubrics are better. Shared evaluation checklists are better than that. The moment your team can generate work cheaply, consistency stops being a nice-to-have and becomes survival gear.</p><h2>What teams should actually change</h2><p>If you buy the promotion frame, a few practical implications follow.</p><p>First, train people on review and specification, not just on tool features.</p><p>Most AI training is basically a software demo with better posture. That is not enough. People need to learn how to scope tasks, express constraints, inspect outputs, and recognize failure patterns. Otherwise you are handing power tools to people and congratulating yourself because the box looked premium.</p><p>Second, redesign role expectations explicitly.</p><p>Do not say &#8220;everyone should use AI&#8221; and then evaluate them as if the work were unchanged. 
If the first draft is now cheap, then clarity, judgment, and orchestration should be rewarded more directly.</p><p>Third, instrument actual workflows.</p><p>Preference is not performance. Measure cycle time, acceptance rate, escaped defects, and rework on a handful of recurring processes. If the tool helps only in demos, that is not adoption. That is theater with a subscription fee.</p><p>Fourth, protect learning loops for less experienced people.</p><p>If juniors never have to think, they will not become seniors. Let AI accelerate them, but do not let it replace the reasoning reps completely. Ask for explanations. Rotate who defines the acceptance criteria. Make people compare outputs, not just consume them.</p><p>Fifth, stop fetishizing the prompt.</p><p>The prompt matters. But the bigger win is almost always in the surrounding system: better context, cleaner data, stronger templates, reusable checks, and clearer ownership boundaries. The best AI workflow is usually slightly boring because reliability is usually slightly boring.</p><p>None of this sounds glamorous. That is exactly why it works.</p><h2>The slightly annoying conclusion</h2><p>The useful question is not whether AI can do your job.</p><p>It is whether you have moved your value one level up before the market forces you to.</p><p>If your value is mostly raw drafting, raw formatting, raw summarizing, raw boilerplate coding, or raw information rearrangement, AI is very rude news. Those layers are getting cheaper, sometimes dramatically.<sup>[4][5][6]</sup></p><p>If your value is in defining the work, building the harness, judging the output, spotting the edge cases, aligning people around decisions, and standing behind the artifact, AI is not removing you. 
It is increasing your span of action.</p><p>That still might be exhausting.</p><p>It still might be unfair.</p><p>It still might come with exactly zero ceremonial appreciation from management.</p><p>But it is a more precise description of what is happening.</p><p>AI did not take your job. It promoted you.</p><p>The annoying part is that promoted people are supposed to know what good looks like.</p><p>Do you?</p><h2>Notes / references</h2><p>1. International Labour Organization and NASK, &#8220;One in four jobs at risk of being transformed by GenAI, new ILO-NASK Global Index shows,&#8221; May 20, 2025, summarizing <em>Generative AI and Jobs: A Refined Global Index of Occupational Exposure</em> (ILO Working Paper 140, 2025).</p><p>2. OECD, <em>Artificial Intelligence and the Changing Demand for Skills in the Labour Market</em>, OECD Artificial Intelligence Papers No. 14, 2024. The report finds that in high AI exposure occupations, 72 percent of vacancies demanded at least one management skill and 67 percent demanded at least one business-process skill across the ten-country sample.</p><p>3. OECD, <em>Who Will Be the Workers Most Affected by AI?</em>, OECD Artificial Intelligence Papers, 2024. The executive summary notes that many occupations most exposed to AI are white-collar roles such as IT professionals, managers, and science and engineering professionals.</p><p>4. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, &#8220;Generative AI at Work,&#8221; NBER Working Paper 31161, 2023.</p><p>5. 
Shakked Noy and Whitney Zhang, &#8220;Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,&#8221; <em>Science</em> 381, no. 6654 (2023): 187-192. DOI: 10.1126/science.adh2586.</p><p>6. Zheyuan (Kevin) Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz, &#8220;The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers,&#8221; SSRN working paper, August 20, 2025. DOI: 10.2139/ssrn.4945566.</p><p>7. Fabrizio Dell&#8217;Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa Krayer, Francois Candelon, and Karim R. Lakhani, &#8220;Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality,&#8221; <em>Organization Science</em> 37, no. 2 (2026): 403-423. For the summarized speed, quality, and task-completion figures used above, see also Harvard Business School&#8217;s official summaries of the study from September and November 2023.</p>]]></content:encoded></item><item><title><![CDATA[How to Make Your AI Colleague Great Again]]></title><description><![CDATA[Standards, context, and process as the real onboarding system for agents.]]></description><link>https://plausiblereality.com/p/how-to-make-your-ai-colleague-great</link><guid isPermaLink="false">https://plausiblereality.com/p/how-to-make-your-ai-colleague-great</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Fri, 27 Mar 2026 00:02:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4f2d7ecd-0888-4d89-8f51-f62612f06ae9_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A lot of AI disappointment is self-inflicted. If you want an AI teammate to be useful, treat onboarding as a systems problem. 
Standards, context, access boundaries, and escalation paths do more than one heroic prompt ever will.</p><h1>The wrong debate</h1><p>The wrong debate is whether the agent is smart enough in the abstract. The useful question is whether it has actually been onboarded into your way of working.</p><p>What makes this argument useful is that it forces the conversation back to the workflow. When teams under-specify the work, they push uncertainty downstream where it gets rediscovered as rework, second-guessing, and inconsistent output.</p><p>A human junior gets examples, naming conventions, escalation paths, and the awkward local rules nobody puts in the architecture diagram. A synthetic junior needs the same categories of help, just packed into artifacts instead of hallway conversations.</p><p>This sounds obvious, but teams miss it all the time because the missing pieces do not look glamorous. Clear standards, good examples, scoped permissions, and crisp definitions of done rarely win the demo. They do, however, win the week.</p><h1>What actually matters</h1><p>When teams say an agent is unreliable, they are often describing terrible onboarding. A new human teammate gets the repo tour, the examples, and the unwritten rules. The model gets a prompt and some vague optimism. Then everyone acts shocked when it behaves like a contractor who was dropped into the codebase through a skylight. If you want better output, start by making the work environment legible.</p><p>Standards are not bureaucracy here. They are compression. A short, explicit set of conventions lets the agent spend tokens on the task instead of rediscovering how your team names files, structures tests, or documents migrations. Teams skip this step because standards feel less exciting than model demos. Boring clarity wins anyway.</p><p>The wrong debate is usually about prompt phrasing. 
What actually matters is whether the agent can see the local truth: architecture notes, examples of good changes, known landmines, recent decisions, and the shape of the task. Context is not extra seasoning on top of the prompt. It is the difference between asking for help from a teammate who knows the system and one who just arrived from another planet.</p><p>If I were fixing this tomorrow, I would not start with a new model. I would start by making the work easier to delegate: clearer context, cleaner standards, safer permissions, and better checks. Once those exist, model upgrades start to matter more.</p><p>What actually scales is not a clever one-off prompt but a reusable control surface: shared standards, reusable context, bounded permissions, visible examples, and cheap checks. Once those are in place, a stronger model helps. Before they exist, model upgrades mostly rearrange disappointment.</p><p>The useful question is not whether the agent can do something in principle. It is whether the task has been defined clearly enough, instrumented cheaply enough, and bounded safely enough to be delegated in the first place.</p><h1>A workflow I would actually trust</h1><p>One workflow I like is a small context pack beside the work: a task brief, repo map, relevant files, coding standards, definition of done, risky surfaces, and examples of good output. The agent gets that package every time. Humans do too. The result is fewer clarifying loops and less weird improvisation, because the system no longer has to infer what the team meant but forgot to write down.</p><p>A decent onboarding pack for an agent looks boring in exactly the right way: a short statement of responsibility, a repo map, a definition of done, a list of risky surfaces, examples of acceptable changes, and explicit escalation rules. Give that package to a model and to a new teammate and both will ask fewer confused questions. The interesting part is not that the agent gets better. 
The interesting part is that the team has finally written down what it claims to value.</p><p>This is also why onboarding and governance are closer than they look. Once permissions, expectations, and failure paths are explicit, trust becomes operational rather than emotional. You are not hoping a synthetic teammate behaves. You are shaping the conditions under which useful behavior is more likely and costly mistakes are easier to catch.</p><p>The measurement question matters because teams are excellent at narrating improvement that they have not actually checked. I would track at least five things: time to a usable first draft, number of clarification loops, amount of human cleanup, defect or regression rate after merge, and how often a rollback or rework was needed. Those numbers are boring, which is exactly why they are useful.</p><p>They also prevent a common trap. An agent can feel fast because it generates a lot of text or code quickly while still making the total loop slower once review, debugging, and cleanup are included. If the workflow is meant to ship, the metric has to live closer to shipped outcome than to demo charisma.</p><h1>Why this is credible</h1><p>This framing lines up with public engineering guidance from Anthropic, which has argued that successful agent systems are usually built from simple, composable patterns rather than ornate frameworks, and that context engineering is a more useful frame than prompt engineering once agents enter real workflows. NIST lands in a similar place from a governance angle: trustworthiness comes from lifecycle controls around the system, not from model confidence alone.</p><p>A useful agent is not the model you bought. It is the workflow you bothered to build.</p><p>The public evidence base is not perfect, but it points in a consistent direction. 
Anthropic&#8217;s engineering writing has repeatedly emphasized that effective agents depend on simple, composable patterns, strong context, and good tool design rather than prompt theatrics alone. The newer harness work strengthens the same lesson for longer-running tasks: the surrounding workflow determines whether capability survives contact with reality.</p><p>That sits well beside NIST&#8217;s risk-management framing, which treats trustworthy AI as a systems problem, and beside the DORA and METR findings that remind teams not to confuse subjective speed with shipped value. I would not overstate any single study. But taken together, the case for better harnesses is much stronger than the case for endless prompt superstition.</p><h1>The objection that sounds smarter than it is</h1><p>A common objection is that all this structure slows people down and that the real winners will simply use more capable models more aggressively. I think that gets the timing backward. Loose workflows can feel faster at first because they externalize ambiguity onto later review and cleanup. Structured workflows feel slower only until the same task appears for the fifth or fiftieth time.</p><p>That is when the boring scaffolding starts compounding. The brief exists. The examples exist. The checks exist. The team no longer burns senior attention rediscovering the same missing assumptions. Good harnesses are not anti-speed. They are how speed survives contact with scale.</p><h1>Where I would start this week</h1><p>If the thesis of this article is right, the first move is not a bigger model budget. It is one cleaner delegation lane. 
Pick a recurring task and make the surrounding expectations unambiguous enough that both humans and agents can run it the same way.</p><blockquote><p>&#183; Choose one recurring task with a clear business owner and a low-to-moderate blast radius.</p><p>&#183; Write a reusable task brief with goal, scope, relevant files or context, and a definition of done.</p><p>&#183; Add one or two examples of acceptable output and one explicit escalation rule for uncertainty.</p><p>&#183; Measure re-asks, human cleanup, and whether the task actually ships more smoothly after the change.</p></blockquote><p>The point is not to create a sacred process document. It is to make one useful loop more legible, more repeatable, and easier to trust. If that feels slightly boring, good. The boring version is usually the version that ships.</p><h1>Notes</h1><p>1. Anthropic (2024), &#8220;Building effective agents,&#8221; published December 19, 2024.</p><p><a href="https://www.anthropic.com/research/building-effective-agents">https://www.anthropic.com/research/building-effective-agents</a></p><p>2. Anthropic (2025), &#8220;Effective context engineering for AI agents,&#8221; published September 29, 2025.</p><p><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p><p>3. 
Anthropic (2025), &#8220;Effective harnesses for long-running agents,&#8221; published November 26, 2025.</p><p><a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p><p>4. Anthropic (2025), &#8220;Writing effective tools for AI agents&#8212;using AI agents,&#8221; published September 11, 2025.</p><p><a href="https://www.anthropic.com/engineering/writing-tools-for-agents">https://www.anthropic.com/engineering/writing-tools-for-agents</a></p><p>5. NIST (2023), &#8220;Artificial Intelligence Risk Management Framework (AI RMF 1.0),&#8221; NIST AI 100-1.</p><p><a href="https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf</a></p><p>6. NIST (2024), &#8220;Artificial Intelligence Risk Management Framework: Generative AI Profile,&#8221; NIST AI 600-1.</p><p><a href="https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.600-1.pdf">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.600-1.pdf</a></p><p>7. Google Cloud / DORA (2025), &#8220;How are developers using AI? Inside our 2025 DORA report,&#8221; published September 23, 2025.</p><p><a href="https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025">https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025</a></p>]]></content:encoded></item></channel></rss>