<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Plausible Reality]]></title><description><![CDATA[Plausible Reality is a calm, analytical publication about AI, technology leadership, startups, and the narratives surrounding them. It looks past hype, consensus, and rage-bait to examine what actually holds up under scrutiny. Most posts come from the per]]></description><link>https://plausiblereality.com</link><image><url>https://plausiblereality.com/img/substack.png</url><title>Plausible Reality</title><link>https://plausiblereality.com</link></image><generator>Substack</generator><lastBuildDate>Mon, 13 Apr 2026 23:45:16 GMT</lastBuildDate><atom:link href="https://plausiblereality.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Eloi Tay]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[plausiblereality@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[plausiblereality@substack.com]]></itunes:email><itunes:name><![CDATA[Eloi Tay]]></itunes:name></itunes:owner><itunes:author><![CDATA[Eloi Tay]]></itunes:author><googleplay:owner><![CDATA[plausiblereality@substack.com]]></googleplay:owner><googleplay:email><![CDATA[plausiblereality@substack.com]]></googleplay:email><googleplay:author><![CDATA[Eloi Tay]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Ollama + Claude Code Is Not 99% Cheaper]]></title><description><![CDATA[You can cut the visible token bill hard. 
You can also quietly swap out the model, the API semantics, and the operating costs that made Claude Code feel good in the first place.]]></description><link>https://plausiblereality.com/p/xxx-claude-code-is-not-99-cheaper</link><guid isPermaLink="false">https://plausiblereality.com/p/xxx-claude-code-is-not-99-cheaper</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 04 Apr 2026 14:27:46 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/08622f23-3880-44b3-9d12-814bb09e1b71_1536x1024.png" length="0" type="image/png"/><content:encoded><![CDATA[<p>Ignore the screenshot. Decompose the stack.</p><p>&#8220;Ollama + Claude Code = 99% cheaper&#8221; is the kind of line that spreads because it compresses one real technical fact into one very misleading business conclusion.</p><p>The real part is simple. Ollama exposes Anthropic-compatible endpoints, documents how to point Claude Code at a local or cloud Ollama backend, and even ships `ollama launch claude` to wire it up for you.[5]</p><p>The misleading part is the leap from &#8220;Claude Code can talk to Ollama&#8221; to &#8220;you kept the important part and removed the expensive part.&#8221;</p><p>I do not think that is what is happening.</p><p>I think a lot of what people actually like in Claude Code is the whole stack working together: the agent loop, the editing and command harness, the context handling, the default ergonomics, and then the model sitting behind it. On the hardest coding work, that last part is not a rounding error. It is the product.</p><p>So yes, you can make the visible token bill smaller. Sometimes dramatically smaller.</p><p>But that is not the same as getting the same outcome for less money. 
It is often the same shell driving a different brain through a different runtime with a different feature set, different bottlenecks, and a different failure profile.</p><p>The wrong debate is &#8220;Claude Code versus Ollama.&#8221;</p><p>What actually matters is which parts of the stack you are swapping, what you gain, and what bill comes back through another door.</p><p>A &#8220;free&#8221; alternative is usually only free if you pretend hardware, setup friction, supervision time, and degraded judgment do not belong in the spreadsheet.</p><p>That is the fluff worth stripping away.</p><p style="text-align: center;">* * *</p><h1>The wrong debate is Claude Code versus Ollama</h1><p>Anthropic&#8217;s own Agent SDK description is clarifying here. It says the SDK gives you the same tools, agent loop, and context management that power Claude Code.[1]</p><p>That matters.</p><p>It means Claude Code is not just a brand aura wrapped around one hidden endpoint. There is a real harness there. The shell matters. The tool permissions matter. The edit loop matters. The way context is managed matters.</p><p>But that same sentence also makes the opposite point, which clickbait titles quietly step over.</p><p>If the harness can be separated from the model, then keeping the harness does not mean you kept the model.</p><p>And model choice is not a cosmetic variable in Anthropic&#8217;s own documentation. 
Anthropic tells people to start with Claude Opus 4.6 for the most complex tasks and describes it as the latest generation model with exceptional performance in coding and reasoning.[2] Its Claude Code cost guidance also says Sonnet handles most coding tasks well and costs less than Opus, while Opus should be reserved for complex architectural decisions or multi-step reasoning.[3]</p><p>That is Anthropic being unusually plain about something the market keeps trying to blur: the model changes the tool.</p><p>Not just the speed.</p><p>Not just the invoice.</p><p>The tool.</p><p>Once you see the stack this way, the viral title starts to look sloppy.</p><p>Claude Code is one layer.</p><p>Claude the model is another.</p><p>The API semantics and serving stack are another.</p><p>Billing path and infrastructure are another.</p><p>Ollama can replace some of those pieces. It cannot magically make them irrelevant.</p><p>Anthropic&#8217;s own Claude Code docs also show how many deployment paths already exist before Ollama enters the story at all. 
Claude Code can authenticate through Claude.ai subscriptions, the Console/API path, or cloud providers like Bedrock, Vertex, and Microsoft Foundry.[4]</p><p>That portability is real.</p><p>The claim that portability makes all backends equivalent is not.</p><p>What people often mean when they say &#8220;Claude Code is amazing&#8221; is not &#8220;I enjoy that the terminal has slash commands.&#8221;</p><p>They mean something closer to: it usually stays on task, reads the right things, avoids some stupid edits, makes surprisingly decent multi-file changes, and recovers better than they expected when the task gets messy.</p><p>That is not just a shell story.</p><p>That is a model judgment story.</p><p>And on ugly engineering work, judgment is usually the expensive part.</p><p style="text-align: center;">* * *</p><h1>Where the 99% cheaper claim is technically true</h1><p>There is a version of the claim that is fair.</p><p>If your baseline is API-billed Claude Code usage through Anthropic, and if your workload is heavy enough, moving to a local model can cut direct provider spend hard. 
Anthropic&#8217;s Claude Code docs say average cost is about $6 per developer per day, with daily costs under $12 for 90% of users, and that average monthly cost is roughly $100 to $200 per developer on Sonnet 4.6, with wide variation based on usage patterns.[3]</p><p>If you already own suitable hardware and you move a big chunk of that work to local inference, the marginal provider bill can indeed collapse.</p><p>That is real.</p><p>It is just narrower than the headline wants you to notice.</p><p>First, the baseline matters.</p><p>Anthropic explicitly says the `/cost` command is for API users, and that Pro and Max subscribers have usage included in their subscription, so `/cost` is not relevant for billing in those plans.[3] Anthropic&#8217;s pricing page says Pro costs $17 per month on annual billing or $20 month-to-month and includes Claude Code, while Max starts at $100 per month and also includes Claude Code.[6]</p><p>So if someone is comparing &#8220;Ollama + Claude Code&#8221; to a Pro subscriber who was never paying a large API bill in the first place, &#8220;99% cheaper&#8221; is not analysis. 
It is framing.</p><p>Second, Ollama is not automatically synonymous with &#8220;free local inference.&#8221;</p><p>Ollama&#8217;s own docs say it supports both local and cloud models.[5] Its pricing page offers Free, Pro, and Max tiers, with cloud usage and concurrency rules attached to those plans.[7] Running models on your own hardware is unlimited according to Ollama, but cloud usage is not.[7]</p><p>So even inside the &#8220;Ollama&#8221; label, there are at least three different cost stories people keep mashing together: local inference on hardware you already own, local inference on hardware you bought for this purpose, and Ollama cloud usage with a plan and usage limits.</p><p>Those are not the same financial decision.</p><p>Third, &#8220;cheaper&#8221; usually means &#8220;cheaper on the line item I decided to show.&#8221;</p><p>This sounds obvious, but teams miss it all the time.</p><p>They compare the token bill and ignore the supervision tax.</p><p>They celebrate a lower provider invoice while quietly absorbing a slower review loop, more retries, weaker trust in the edits, more manual validation, more device-specific setup, and more engineer time spent nursing the system through edge cases.</p><p>That is not fake cost.</p><p>That is just cost that does not live in the AI vendor dashboard.</p><p>And once you start measuring that version, the title gets a lot less magical.</p><p style="text-align: center;">* * *</p><h1>Free is where the trade-offs start</h1><p>The first trade-off is model quality, which is the one viral titles work hardest to hide.</p><p>I am not making a benchmark argument here.</p><p>I am making a work argument.</p><p>On shallow tasks, the harness carries a lot. Read files. Search strings. Rename symbols. Scaffold tests. Produce a first draft. Run a command. Summarize a repo. 
A lot of models can look surprisingly good when the work is narrow and the feedback loop is tight.</p><p>But the moment the task becomes more architectural, more ambiguous, or more stateful, model judgment starts to dominate.</p><p>Should this change live in middleware or at the call site?</p><p>Is this test failure pointing to a bad patch or to a bad plan?</p><p>What is the least invasive place to fix the bug without breaking the rest of the system?</p><p>Should the agent edit at all, or stop and ask for a design decision?</p><p>That is the work people actually pay strong models for.</p><p>Anthropic&#8217;s own documentation reflects this split. It presents Opus 4.6 as the place to start for the most complex tasks and describes it as exceptional in coding and reasoning.[2] Its Claude Code docs separately tell you to use Sonnet for most coding work and reserve Opus for complex architectural decisions or multi-step reasoning.[3]</p><p>So when somebody wires Claude Code to `qwen3-coder`, `glm-4.7`, or `minimax-m2.1` through Ollama, they are not just trimming fat around the edges.[5]</p><p>They are changing the judge.</p><p>Maybe that is fine.</p><p>Maybe it is even smart for the task mix they have.</p><p>But it is still a trade, and the trade is central, not incidental.</p><p>The second trade-off is API behavior, which sounds boring until it hurts.</p><p>Ollama&#8217;s Anthropic compatibility is good enough to make the integration real. It supports messages, streaming, system prompts, multi-turn conversations, tool use, tool results, vision through base64 images, and thinking blocks.[5]</p><p>That is a serious amount of compatibility.</p><p>But its own docs are also explicit about what is not there or only partly there.</p><p>Ollama says Anthropic features such as the `count_tokens` endpoint, prompt caching, the Batches API, citations, PDF document blocks, and fully supported extended thinking are not supported or only partially supported. 
It also says token counts are approximations based on the underlying model tokenizer, and that extended thinking support is basic, with `budget_tokens` accepted but not enforced.[5]</p><p>That is not pedantry.</p><p>That is behavior.</p><p>Anthropic&#8217;s Claude Code cost guidance says Claude Code automatically uses prompt caching and auto-compaction to reduce repeated-context costs, and it says extended thinking is enabled by default because it significantly improves performance on complex planning and reasoning tasks.[3]</p><p>So when you swap the backend, you are not only changing the brain.</p><p>You are also changing some of the rules by which the harness expects the brain to behave.</p><p>Maybe your workflow does not care about PDFs, citations, or batches.</p><p>Fine.</p><p>Maybe approximate token counts are good enough.</p><p>Also fine.</p><p>But prompt caching and thinking behavior are not random niceties. They are part of the cost, context, and reasoning profile of the system.</p><p>When those move, the experience moves.</p><p>The third trade-off is hardware, which people mysteriously stop mentioning right after they say the word &#8220;free.&#8221;</p><p>Ollama&#8217;s own Claude Code integration page recommends models such as Qwen3 Coder for coding use cases, then notes that Qwen3 Coder is a 30B model requiring at least 24GB of VRAM to run smoothly, and more for longer context lengths.[5]</p><p>That is the difference between &#8220;I replaced my SaaS bill&#8221; and &#8220;I am now paying in silicon, thermals, and operational inconvenience.&#8221;</p><p>For a solo developer who already has the hardware, that can be completely rational.</p><p>For a team trying to standardize workflows across mixed laptops and desktops, it can become an annoying little tax that shows up everywhere.</p><p>Someone has to maintain the runtime.</p><p>Someone has to manage model versions.</p><p>Someone has to explain why the agent is fast on one box, glacial on another, and 
broken on a third because the quantization changed or memory pressure kicked in.</p><p>And if you solve all of that by using Ollama cloud, then you are back in the land of plans, concurrency, usage limits, and provider terms.[7]</p><p>Again, not bad.</p><p>Just not free.</p><p>The fourth trade-off is privacy and compliance, where both camps often oversimplify.</p><p>Yes, local inference can be useful if you want more data locality or do not want prompts leaving your machine. Ollama markets privacy directly and says running models on your own hardware is part of the product story.[7]</p><p>But the opposite caricature&#8212;&#8220;hosted Claude means your code is obviously being trained on&#8221;&#8212;is also sloppy.</p><p>Anthropic&#8217;s Claude Code data usage docs say commercial users under Team, Enterprise, and API terms are not used for generative model training unless they explicitly opt in to provide data for model improvement.[8] Retention defaults also vary by account type, with different settings for consumer and commercial paths.[8]</p><p>So the useful question is not &#8220;local good, cloud bad.&#8221;</p><p>It is &#8220;what are my actual obligations, what account path am I using, what is the retention policy, and which system satisfies my risk model?&#8221;</p><p>Clickbait wants one emotion.</p><p>Real engineering usually needs one level deeper than that.</p><p style="text-align: center;">* * *</p><h1>A simple workflow test that exposes the truth</h1><p>Here is the test I think matters.</p><p>Take a job that is annoying enough to expose real differences but common enough that people actually do it.</p><p>Say you need to add passkey support, keep legacy sessions working during rollout, update the admin permission checks, adjust the onboarding flow, patch the mobile API contract, and make sure a background cleanup job does not invalidate the wrong records.</p><p>This is not a benchmark.</p><p>This is Tuesday.</p><p>Claude Code as a harness can do a lot 
in both setups. It can inspect files. It can run commands. It can edit code. It can propose a plan. It can test. It can use tools. That part is not under dispute.[1][5]</p><p>The difference shows up after the first burst of competence.</p><p>Does the model keep the migration strategy intact while it touches multiple subsystems?</p><p>Does it notice the stale assumption hiding in a helper nobody mentioned?</p><p>Does it understand that a failing test is evidence against the plan rather than an instruction to &#8220;fix the test&#8221;?</p><p>Does it preserve the boundary of the task, or start making adjacent &#8220;improvements&#8221; because it lost the original objective?</p><p>Does it know when to stop?</p><p>That is where the cheap-versus-expensive debate becomes real.</p><p>Because the cost of a weaker answer is rarely &#8220;the model was wrong.&#8221;</p><p>The cost is usually: you reread more of its work, you rerun more tests, you intervene earlier, you trust less of the diff, you retry with tighter prompts, you split the task into smaller pieces, or you eventually hand the hard part back to the stronger model anyway.</p><p>A token bill is visible.</p><p>A supervision bill is not.</p><p>This is why so many discussions around coding agents are framed badly.</p><p>People optimize for time-to-first-token.</p><p>They should be optimizing for time-to-correct-merge.</p><p>People optimize for nominal cost per request.</p><p>They should be optimizing for acceptable output per hour of senior attention.</p><p>People optimize for &#8220;it worked in my demo.&#8221;</p><p>They should be optimizing for whether the system still behaves under ambiguity, fatigue, and ugly code.</p><p>Preference is not performance.</p><p>Feeling clever because the run was local is not the same as shipping faster.</p><p>And this is exactly why &#8220;Claude Code is just a harness&#8221; is too glib.</p><p>If the harness were the whole story, swapping the brain would feel like changing 
browsers.</p><p>In practice, on hard work, it feels more like changing engineers.</p><p style="text-align: center;">* * *</p><h1>How to use the trade-off without lying to yourself</h1><p>None of this is an argument against Ollama.</p><p>I think Ollama plus Claude Code is a good setup in plenty of cases.</p><p>It makes sense when you already own the hardware.</p><p>It makes sense when the work is lower stakes and repetitive.</p><p>It makes sense when you want local experimentation, cheap background agents, codebase search, rote transforms, or a disposable first pass you do not mind supervising.</p><p>It also makes sense when data locality is the real requirement and you are willing to pay for that in operations rather than vendor fees.</p><p>That is a perfectly adult trade.</p><p>I am also not arguing that every task deserves Opus.</p><p>Anthropic itself does not say that. Its own Claude Code cost guidance explicitly says Sonnet handles most coding tasks well and costs less than Opus, while Opus is for complex architectural decisions or multi-step reasoning.[3]</p><p>That is a sensible operating model.</p><p>The mistake is pretending the lower-cost path is a drop-in substitute across every task.</p><p>It is not.</p><p>The sane version of this conversation is boring, which is probably why it does not go viral.</p><p>Use cheaper local or cheaper cloud models where the task is constrained, recoverable, or easy to verify.</p><p>Use stronger hosted models where the job is ambiguous, expensive to be subtly wrong on, or likely to sprawl across architecture, state, and hidden assumptions.</p><p>In other words, do what good engineering teams already do with people.</p><p>You do not assign every problem to the cheapest available pair of hands.</p><p>You assign work based on the cost of failure, the difficulty of review, and the value of good judgment.</p><p>Models are not different enough to make that logic disappear.</p><p>One operating pattern I like is to treat model 
strength as a routing problem.</p><p>Cheap first pass.</p><p>Mid-tier daily driver.</p><p>Top-tier escalation path.</p><p>That can be local plus Sonnet plus Opus.</p><p>It can be Ollama for commodity work and Claude for the hard edge.</p><p>It can even be all hosted if your team cares more about consistency than about shaving provider spend.</p><p>What matters is that you know which failures you are buying.</p><p>Another operating pattern is to start with the strongest model, learn where it is actually overkill, and then selectively downshift.</p><p>Teams often do the reverse because the invoice is the first pain they can see.</p><p>That is backward.</p><p>You should understand your failure modes before you optimize them.</p><p>And you should measure the right things.</p><p>Not just tokens.</p><p>Measure retries per task.</p><p>Measure review time.</p><p>Measure how often humans take over.</p><p>Measure how often the first plan survives contact with the codebase.</p><p>Measure whether the cheaper system produces smaller bills by producing less useful work.</p><p>That last one is especially important.</p><p>Sometimes the &#8220;savings&#8221; are just lower utilization because the tool got less trustworthy.</p><p>That is not efficiency.</p><p>That is disengagement wearing a finance costume.</p><p>So yes, ignore the fluff.</p><p>Yes, the viral line contains a real trick.</p><p>Claude Code can run against Ollama.[5]</p><p>Yes, that can cut direct model spend dramatically in the right setup.[3][7]</p><p>But no, that does not mean you kept the same thing and simply deleted the cost.</p><p>You changed the model.</p><p>You changed some API behavior.</p><p>You changed hardware assumptions.</p><p>You changed the privacy and ops profile.</p><p>You changed who pays, when they pay, and what kind of failure you are willing to tolerate.</p><p>That can still be a good trade.</p><p>It just is not a magic trick.</p><p>The useful question is not &#8220;can Claude Code talk to 
Ollama?&#8221;</p><p>It can.</p><p>The useful question is whether, after the swap, your team produces better engineering outcomes per dollar of total attention.</p><p>And that is a much less clickable sentence than &#8220;99% cheaper.&#8221;</p><p>It is also the one that matters.</p><p>Because a lot of AI cost optimization is just moving the bill somewhere your dashboard does not show.</p><p>If you want a slightly annoying question to end on, here it is: are you reducing cost, or just making the expensive part harder to see?</p><p style="text-align: center;">* * *</p><h2>Notes and References</h2><blockquote><p>1. Anthropic. &#8220;Agent SDK overview.&#8221; Claude Developer Platform. platform.claude.com/docs/en/agent-sdk/overview</p><p>2. Anthropic. &#8220;Models overview.&#8221; Claude API Docs. platform.claude.com/docs/en/about-claude/models/overview</p><p>3. Anthropic. &#8220;Manage costs effectively.&#8221; Claude Code Docs. code.claude.com/docs/en/costs</p><p>4. Anthropic. &#8220;Authentication.&#8221; Claude Code Docs. code.claude.com/docs/en/iam</p><p>5. Ollama. &#8220;Anthropic compatibility.&#8221; Ollama Docs. 
docs.ollama.com/api/anthropic-compatibility</p><p>6. Anthropic. &#8220;Plans &amp; Pricing.&#8221; Claude. claude.com/pricing</p><p>7. Ollama. &#8220;Pricing.&#8221; Ollama. ollama.com/pricing</p><p>8. Anthropic. &#8220;Data usage.&#8221; Claude Code Docs. code.claude.com/docs/en/data-usage</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Good Judgment Is the New Important Skill for Engineers]]></title><description><![CDATA[When AI makes implementation cheaper, the engineers who stand out are the ones who can frame the problem, verify the output, and choose the right compromise.]]></description><link>https://plausiblereality.com/p/good-judgment-is-the-new-important</link><guid isPermaLink="false">https://plausiblereality.com/p/good-judgment-is-the-new-important</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sun, 29 Mar 2026 00:54:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e297aacf-4303-4a35-895a-c31a879e161a_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>The bottleneck has moved upstream</h1><p>A lot of engineers still think the premium skill is speed.</p><p>I think that frame is already outdated.</p><p>The wrong debate is whether AI will replace engineers. The useful question is which parts of engineering are becoming cheap enough that they stop being a differentiator. Writing the first draft of code is getting cheaper. Spinning up infrastructure is getting cheaper. Generating boilerplate is already cheap enough that nobody serious should confuse it with leverage.</p><p>What stays expensive is judgment.</p><p>By judgment, I do not mean vague seniority theatre. I mean the habit of making defensible decisions under uncertainty. It is the ability to frame the problem properly, surface the real constraints, spot the second-order effect, choose the least bad tradeoff, and know when a fast answer is actually dangerous.</p><p>That has always mattered. 
The difference now is relative value. When implementation gets easier, decision quality becomes more visible.</p><p>This is not just a nice philosophical take. The profession has been signalling this for years. The National Society of Professional Engineers still puts public safety and welfare at the centre of engineering responsibility, including the duty to act when an engineer&#8217;s judgment is overruled in dangerous circumstances.[1] ABET&#8217;s current engineering criteria explicitly require graduates to make informed judgments and to use engineering judgment when drawing conclusions from experiments and analysis.[2]</p><p>So no, judgment is not some fluffy add-on that appears after the real work is done. It is part of the job description.</p><p>What has changed is that more of the market can now fake the implementation part for longer.</p><h1>Why this matters more now</h1><p>The easiest mistake right now is to confuse output with value.</p><p>A tool can produce a plausible pull request in minutes. That does not mean the problem was well framed. It does not mean the change is safe to deploy. It does not mean the edge cases were considered. It definitely does not mean the long-term maintenance cost got any lower.</p><p>This sounds obvious, but teams miss it all the time. They celebrate faster artifact creation while quietly offloading more risk onto review, QA, operations, security, and future maintainers.</p><p>The Stack Overflow 2025 Developer Survey captured the mood better than most think-pieces did: more developers actively distrust the accuracy of AI tools than trust it, with 46% saying they distrust the output and 33% saying they trust it.[3] That gap matters. It tells you the industry is not struggling with access to generated code. 
It is struggling with whether generated code deserves confidence.</p><p>That is why the bottleneck has moved upstream.</p><p>If a junior engineer can generate something that looks competent, then the premium shifts to the person who can answer harder questions. Is this the right abstraction? What breaks under retry? What happens when the input order changes? Which failure is acceptable, and which one gets us paged at 2 a.m.? Are we about to ship a local optimization that creates a system-level mess?</p><p>Preference is not performance. A workflow that feels fast in the first ten minutes can be slow over two quarters if it produces brittle decisions.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!VeU1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0abe69f7-05ba-40e8-9def-bf9d085c720e_2100x1320.png" width="1456" height="915" alt="" loading="lazy"></figure></div><p>Table 1. 
The skill shift inside modern engineering work.</p><h1>What good judgment actually looks like</h1><p>Good judgment is usually less glamorous than people want it to be.</p><p>It often looks like refusing to be impressed too early.</p><p>It looks like asking one more annoying question before approving a design. It looks like noticing that a feature request is really a policy question in disguise. It looks like treating AI output as a draft with a burden of proof, not as a completed answer. It looks like knowing when to stop exploring options and when to reopen the decision because the context changed.</p><p>I think good judgment in engineering usually has five parts.</p><p>First, problem framing. The team that solves the wrong problem elegantly is still wrong. Strong engineers clarify the real outcome, the real constraints, and the real user harm before they touch the solution.</p><p>Second, tradeoff quality. Most meaningful engineering choices are not about perfect versus bad. They are about expensive versus reversible, fast versus auditable, elegant versus operable, and clever versus teachable.</p><p>Third, risk calibration. Not every decision deserves ceremony. Not every decision deserves speed either. Good engineers separate reversible from irreversible moves. They know when a quick experiment is fine and when a small mistake can turn into a permanent tax.</p><p>Fourth, system awareness. A function can be locally clean and globally stupid. A change that looks harmless inside one service can create retry storms, double sends, broken reports, or compliance gaps downstream.</p><p>Fifth, learning discipline. Good judgment is not just the decision itself. It is the quality of the feedback loop after the decision lands.</p><p>That last part matters more than teams admit. 
NIST&#8217;s Secure Software Development Framework exists precisely because ordinary software development life cycle models often do not address security in enough detail, which means teams need explicit secure-development practices layered into the work.[4] NIST&#8217;s AI Risk Management Framework makes a similar point from the AI side: organizations need structured ways to manage AI risk and make trustworthiness operational, not rhetorical.[5]</p><p>In other words, good judgment is not a personality trait. It is a working system.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LCOL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LCOL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 424w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 848w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png" width="1456" height="910" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:910,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:115000,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://plausiblereality.com/i/192265134?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LCOL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 424w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 848w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1272w, https://substackcdn.com/image/fetch/$s_!LCOL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9909d152-40ab-462f-ac85-28a6509920ba_2400x1500.png 1456w" 
sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Figure 1. A practical judgment loop for modern engineering work.</p><h1>A concrete workflow: the AI-written migration that looks fine until it isn&#8217;t</h1><p>Here is a more realistic example than most benchmark chatter.</p><p>Imagine a team needs to change the schema behind a notification system. An engineer asks an AI tool to draft the migration, the backfill logic, the ORM model changes, and a batch job to clean old records. The tool produces something that compiles. Unit tests pass. The diff looks tidy. 
Everybody feels efficient.</p><p>The weak version of engineering stops there.</p><p>The stronger version gets irritating in exactly the right way.</p><p>Someone asks whether the migration is idempotent. Someone else checks whether the backfill will fight live writes. Another engineer asks what happens if the job is retried halfway through. A product-minded reviewer asks whether message history could become inconsistent for end users during the transition. Ops asks about rollout order, observability, and rollback. Security asks whether the temporary data shape changes access boundaries or audit requirements. A good staff engineer asks the blunt question: do we even need a backfill, or are we trying to preserve a convenience that is not worth the operational risk?</p><p>None of that is syntax work.</p><p>None of that is model benchmark work.</p><p>That is judgment.</p><p>The interesting part is that AI tools can make this gap wider, not smaller. They reduce the time between idea and artifact, which is useful. They also make it easier to skip the thinking that should have happened before the artifact existed. The team now has something concrete to react to, so everyone feels like work has progressed. Sometimes it has. Sometimes the team has just become more efficiently wrong.</p><p>That is why acceptance criteria are the new code review.</p><p>I would rather have an average engineer with sharp acceptance criteria, a rollback plan, explicit assumptions, and a clean monitoring strategy than a brilliant engineer who can generate code quickly and explain afterwards why the blast radius was unforeseeable.</p><h1>How teams actually train judgment</h1><p>Teams say they want judgment, then they build environments that reward speed theatre.</p><p>If you want better judgment, you need to train it in the open.</p><p>The first method is decision visibility. 
Important decisions should leave a trail: what problem was being solved, which options were considered, why the chosen path won, what assumptions were made, and what signals would tell you the choice was failing. This does not need to become bureaucratic sludge. A short decision record is often enough. The point is to make reasoning inspectable.</p><p>The second method is postmortems that are genuinely useful. Not blame rituals. Not timeline fan fiction. Real review of the framing, the assumptions, the missed signals, and the controls that were absent or weak. If the same class of mistake keeps happening, the issue is usually not individual intelligence. It is missing scaffolding.</p><p>The third method is scenario work. Put engineers in realistic tradeoff situations. Ask whether they would ship, delay, isolate, roll back, or redesign. Ask what extra evidence they would require. You learn more from that than from asking whether they remember a specific API call.</p><p>The fourth method is exposure to consequences. Judgment improves when engineers feel the operational cost of their choices. If one group writes code and another group absorbs every outage, you are training local optimization, not engineering maturity.</p><p>The fifth method is managerial honesty. If promotions are still mostly driven by visible output volume, then the organization is telling people what it actually values. You do not get a judgment culture by saying the word judgment a lot.</p><p>&#8220;Ship fast, observe everything, revert faster&#8221; is still a good rule. But notice what sits inside it: choosing what to ship, what to watch, and what would justify a reversal. Again, the hard part is not typing.</p><h1>What I would hire and reward for</h1><p>A lot of interview loops still overweight the part of engineering that is easiest to observe in a compressed setting.</p><p>You can see whether someone can produce an answer quickly. You can see whether they know a framework. 
You can see whether they can speak fluently about architecture patterns. What is harder to see is whether they make cleaner decisions when the information is incomplete and the tradeoffs are ugly.</p><p>That is why many teams accidentally hire for confidence and call it judgment.</p><p>If I cared about judgment, I would ask candidates to walk through a messy decision. A real one. A rollout that went wrong. A migration they delayed. A time they chose not to build something. I would listen for whether they can name the assumptions they were carrying, the signals they trusted too much, and the point at which they realized the original framing was off.</p><p>I would also pay attention to how they talk about constraints. Weak engineers often treat constraints as annoying blockers. Strong engineers use them as design inputs. That difference matters because real work is constraint handling with occasional bursts of typing.</p><p>Inside teams, I would reward engineers who make the system easier to reason about. People who reduce ambiguity. People who write decision records that save other people time. People who improve rollback quality, testability, observability, and handoff clarity. Those things do not always look heroic in the moment. They compound anyway.</p><p>The teams that win in an AI-heavy environment will not be the ones that generate the most code. They will be the ones that waste the least energy on plausible nonsense.</p><h1>What I would optimize for now</h1><p>If I were advising an engineer early in their career, I would still tell them to build real technical depth. You cannot exercise judgment in a domain you do not understand.</p><p>But I would also tell them not to confuse technical depth with career insulation.</p><p>The engineers who become hard to replace are usually the ones who reduce expensive uncertainty. They turn vague requests into workable plans. They spot the hidden dependency before it becomes an outage. 
They know when a tool output is good enough, when it needs rewriting, and when the problem should be pushed back on entirely.</p><p>They make other people safer.</p><p>That is why I think the premium skill has shifted.</p><p>Not away from engineering.<br>Not away from code.<br>Not into empty &#8220;leadership&#8221; talk.</p><p>It has shifted toward better decisions around the code.</p><p>The useful question is not whether AI can write more of the implementation. It clearly can. The useful question is whether your team is getting better at deciding what deserves to exist, what deserves trust, and what deserves a hard no.</p><p>Because once code gets cheap, bad judgment gets very expensive.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://plausiblereality.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Plausible Reality! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><h1>Notes / references</h1><blockquote><p><strong>[1] </strong>National Society of Professional Engineers, &#8220;NSPE Code of Ethics for Engineers.&#8221; <a href="https://www.nspe.org/career-growth/nspe-code-ethics-engineers">https://www.nspe.org/career-growth/nspe-code-ethics-engineers</a></p><p><strong>[2] </strong>ABET, &#8220;Criteria for Accrediting Engineering Programs, 2025&#8211;2026,&#8221; Criterion 3 student outcomes. 
<a href="https://www.abet.org/accreditation/accreditation-criteria/criteria-for-accrediting-engineering-programs-2025-2026/">https://www.abet.org/accreditation/accreditation-criteria/criteria-for-accrediting-engineering-programs-2025-2026/</a></p><p><strong>[3] </strong>Stack Overflow, &#8220;2025 Developer Survey &#8211; AI.&#8221; <a href="https://survey.stackoverflow.co/2025/ai">https://survey.stackoverflow.co/2025/ai</a></p><p><strong>[4] </strong>NIST, &#8220;SP 800-218: Secure Software Development Framework (SSDF) Version 1.1.&#8221; <a href="https://csrc.nist.gov/pubs/sp/800/218/final">https://csrc.nist.gov/pubs/sp/800/218/final</a></p><p><strong>[5] </strong>NIST, &#8220;Artificial Intelligence Risk Management Framework (AI RMF 1.0).&#8221; <a href="https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10">https://www.nist.gov/publications/artificial-intelligence-risk-management-framework-ai-rmf-10</a></p></blockquote>]]></content:encoded></item><item><title><![CDATA[The Best Office AI Is the One That Takes Out the Trash]]></title><description><![CDATA[Claude makes prettier inbox triage. ChatGPT closes the loop. 
That is not the same thing.]]></description><link>https://plausiblereality.com/p/the-best-office-ai-is-the-one-that</link><guid isPermaLink="false">https://plausiblereality.com/p/the-best-office-ai-is-the-one-that</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:43:02 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/68a71c4b-ed59-4681-b446-d44f1959028b_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I gave Claude and ChatGPT a stupidly normal office task.</p><p>Not &#8220;build me a startup.&#8221; Not &#8220;write a literature review.&#8221; Just this: take a specific newsletter pile in my inbox, summarize each email so I can scan for anything worth opening, keep the links usable so I can click through to the interesting ones, and when I am done, delete the batch.</p><p>This should be easy.</p><p>It is digital housekeeping with better branding. Read the pile. Surface the good stuff. Throw away the rest.</p><p>And somehow this is where the product differences get brutally honest.</p><p>ChatGPT got through most of the workflow. It summarized the emails well enough. It let me skim. When it came time to delete, it acted like an adult: asked permission, confirmed the count, then cleared the batch cleanly.</p><p>It also kept finding ways to make the middle of the task annoying. Links got mangled. Sessions got sticky. The workflow worked, but it felt like driving a powerful car with one door that sometimes refuses to close.</p><p>Claude flipped the experience.</p><p>The list looked better. The links were better. The whole thing felt calmer, cleaner, more office-shaped. For triage, it was excellent. I could glance down the batch, pick what looked interesting, and move on without feeling like I was debugging my inbox.</p><p>Then I asked it to finish the job.</p><p>Delete the batch? No.</p><p>Fine. Archive it then? 
Also no.</p><p>At that point, the better office assistant was giving me manual instructions, which is a bit like hiring someone to sort your paperwork and then discovering they consider the filing cabinet an ethical boundary.</p><p>The wrong debate is which AI is smarter.</p><p>Office work is not an IQ test. It is a relay race of low-status annoyances. Open the thing. Read the thing. Summarize the thing. Preserve the useful artifacts. Clean up the mess. If a tool nails the first four and refuses the fifth, it has not completed the job. It has handed the ugliest step back to you.</p><p>That sounds obvious, but teams miss it all the time.</p><p>They buy the nicest demo. They get seduced by tone, polish, model personality, maybe a cleaner UI. Then they confuse a pleasant interaction with actual workflow completion.</p><p>Preference is not performance.</p><p>In my newsletter test, Claude won preference. ChatGPT won completion. Neither won the whole workflow.</p><p>The plot twist is that this is not only about model quality. It is also about product philosophy.</p><p>OpenAI&#8217;s current product stack is explicitly built around <a href="https://help.openai.com/en/articles/11487775-connectors-in-chatgpt">connected apps</a> and <a href="https://help.openai.com/en/articles/11752874-chatgpt-agent">agentic action</a>. Apps in ChatGPT can search external sources, run deep research, and in some cases take write actions with confirmation. ChatGPT agent is also designed to navigate websites, work with files, connect to email and document repositories, and take actions on your behalf while keeping you in control.</p><p>Anthropic&#8217;s workplace story is real too. <a href="https://claude.com/resources/tutorials/using-research-and-google-workspace">Claude&#8217;s Research + Google Workspace setup</a> can access emails, calendar data, documents, and web information for analysis. 
<a href="https://support.claude.com/en/articles/12138966-release-notes">Claude in Chrome</a> has also been improving its ability to handle long browser workflows and navigate common sites like Gmail, Google Calendar, Google Docs, Slack, and GitHub. But Anthropic&#8217;s own <a href="https://support.claude.com/en/articles/12902446-claude-in-chrome-permissions-guide">permissions guide</a> draws a hard line around permanent deletions.</p><p>So Claude did not just &#8220;fail&#8221; my cleanup step. There is a decent chance it was obeying the line Anthropic drew around that kind of action.</p><p>This sounds obvious, but the acceptance criteria are the product review.</p><p>If your test prompt is &#8220;summarize my newsletters,&#8221; both tools look competent. If the real job is &#8220;summarize them, keep the links usable, let me skim the interesting ones, then bulk archive or delete the batch,&#8221; the ranking changes completely.</p><p>The useful question is not which model is smarter. It is which one can survive the full office loop without quietly handing the worst step back to the human.</p><p>That is a much harsher benchmark.</p><p>It is also a more useful one.</p><p>My read is that ChatGPT keeps stretching toward the high-accountability ends of the market: cited research, source-sensitive outputs, and agents that are increasingly expected to close loops rather than just advise. <a href="https://help.openai.com/en/articles/10500283-deep-research-in-chatgpt">Deep research</a> is explicitly framed as a documented report with citations or source links that you can download as Word, Markdown, or PDF. Claude Research is not weak here either, but Claude often feels stronger in the messy middle of daily knowledge work.</p><p>Which means the irony writes itself.</p><p>The product most people reach for first can feel weirdly clumsy on one of the most ordinary office tasks. 
The product with the calmer, more competent office vibe can nail the pleasant part and then stop at the exact moment I want an assistant to stop being tasteful and just take out the trash.</p><p>So is one of the two giants truly better?</p><p>No.</p><p>They are incomplete in opposite directions.</p><p>ChatGPT is more willing to finish the job, but can make the journey feel rougher than it should.</p><p>Claude makes the middle of the workflow feel better, but in at least some surfaces it draws a harder line around destructive cleanup, which means the human still has to walk in and do the last boring step.</p><p>And that leaves you with the most absurd outcome possible: paying for two different AI products because each one is missing the part the other gets right.</p><p>That is the real plot twist.</p><p>AI assistants keep promising consolidation. What they may deliver first is a new category of software sprawl, where one tool reads better, another one finishes better, and you pay both to approximate one competent office worker.</p><p>The problem is usually not intelligence. It is handoff.</p><p>The best office AI is not the one that impresses you at the start of the workflow.</p><p>It is the one that is still useful at the last ugly click.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://plausiblereality.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Plausible Reality! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Your Startup Is Hiring Engineers the Wrong Way]]></title><description><![CDATA[Why copying big-tech interviews is one of the most expensive mistakes early-stage teams make]]></description><link>https://plausiblereality.com/p/your-startup-is-hiring-engineers</link><guid isPermaLink="false">https://plausiblereality.com/p/your-startup-is-hiring-engineers</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:34:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cf2577e3-a533-46a7-90e2-1161ab99a038_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I watched a six-person startup spend four months looking for a senior backend engineer. They ran five-round loops. Leetcode. System design for a million-user scale they were nowhere close to reaching. Behavioural questions borrowed from a Google interview guide someone found on Medium.</p><p>They finally hired someone with eight years at two household-name companies. Impressive resume. Strong signals across every rubric they had copied.</p><p>He lasted eleven weeks.</p><p>Not because he was bad. Because he had never worked without guardrails. No staging environment, no API docs, no product manager handing him specs. When the founder asked him to just figure it out, he froze. He had spent a decade operating brilliantly inside systems other people had already built. 
Nobody had ever asked him to build the system itself.</p><p>The startup lost four months of hiring time, three months of onboarding, and roughly six months of momentum. At the early stage, that kind of hit can be the difference between shipping and dying. Neil Matthams, who has placed over 500 technical hires across 30 countries for companies like Canva, UBS, and Grab, puts the cost of a misaligned startup hire at <em>three to six months of lost momentum</em>. In my experience, that estimate is conservative.</p><p>The problem is usually not that startups hire bad engineers. It is that they use the wrong exam.</p><p style="text-align: center;">* * *</p><h1>The Wrong Frame: Borrowing Someone Else&#8217;s Exam</h1><p>Here is the lazy pattern. Founders see how Google, Meta, or Amazon hire. They assume these companies figured out hiring because they are successful. So they copy the playbook: algorithm rounds, system design at scale, behavioural interviews calibrated for large cross-functional organisations.</p><p>This sounds reasonable and is almost entirely wrong.</p><p>Big tech companies have a specific problem: too many candidates. Their interview process is a filter designed to reduce a massive pool to a manageable shortlist using standardised, repeatable evaluations. It works for that purpose. It is optimised for false-negative tolerance&#8212;they would rather reject a good candidate than hire a bad one, because at their scale the cost of a single bad hire is absorbed by thousands of good ones.</p><p>Startups have the opposite problem. Every hire matters disproportionately. You cannot afford to filter out the scrappy builder who happens to be rusty on red-black trees. 
And you definitely cannot afford to let through the polished specialist who has never shipped without a paved road under their feet.</p><p>A North Carolina State University study found that whiteboard-style technical interviews <em>test whether a candidate has performance anxiety rather than whether they are competent at coding</em>. The researchers concluded that many well-qualified candidates are being eliminated because they are not used to performing under artificial observation. Now imagine applying that broken filter in a startup where your hiring pool is already small and your margin for error is zero.</p><p>There is also a subtler problem. The big-tech interview process is self-reinforcing. Engineers who went through it at Google bring it with them when they join or advise startups. Recruiters who cut their teeth at Meta default to what they know. As one FAANG head of recruiting admitted, the inertia is enormous. Companies have built entire recruiting machines around these processes, with years of calibration data. That calibration data is valuable&#8212;for the environment it was calibrated in. It tells you nothing useful about a twelve-person company trying to find product-market fit.</p><p>The wrong debate is whether Leetcode is a good signal or a bad signal. The useful question is: a signal of what? In a big company, it might proxy for analytical rigour. In a startup, you need a signal for something else entirely&#8212;and you are not going to get it from an algorithm puzzle.</p><p style="text-align: center;">* * *</p><h1>What You Actually Need (And What You Don&#8217;t)</h1><p>The traits that make someone a top performer at a scaled company are frequently the exact traits that make them struggle at an early-stage startup. This is not a criticism of either environment. It is a mismatch problem. And mismatches kill startups faster than bad code does.</p><p>Scaled companies reward depth in a single domain. Startups need breadth across many. 
Scaled companies test how someone operates within an existing process. Startups need someone who can create the process from nothing. Scaled companies evaluate collaboration across large cross-functional teams. Startups need someone who can get it done alone, often wearing three hats at once.</p><p>Think about it from the candidate&#8217;s side for a moment. Someone who has spent six years at Amazon has never had to set up their own CI pipeline. They have never chosen a hosting provider or configured monitoring from scratch. They have never been the person who decides whether the company uses PostgreSQL or MongoDB, because that decision was made before they arrived. Their entire career has been built inside an environment where infrastructure, tooling, documentation, and process were givens. That is not a weakness. It is a feature of the environment they optimised for.</p><p>But when that person joins a six-person startup and realises there is no staging environment, no API documentation, no proper error monitoring, and no product manager writing specs&#8212;and nobody is coming to fix any of that&#8212;they do not adapt. They stall. The founder thought they were getting someone who could build. What they got was someone who could operate inside a system that someone else had already built.</p><p>Let me be specific about what the first ten to twenty engineers at a startup actually need to be good at:</p><p><strong>Ambiguity tolerance.</strong> This is the single biggest predictor. Edmond Lau, who literally wrote <em>The Effective Engineer</em>, calls it the most important skill for startup engineers. If you need thick documentation, a strong product manager writing specs, and well-defined requirements before you can move, you will not survive an early-stage environment. The best startup engineers don&#8217;t just tolerate ambiguity. They enjoy it. 
They see a vague problem and their first instinct is to start narrowing it down themselves, not to ask who is going to narrow it down for them.</p><p><strong>Breadth over depth.</strong> In the first year, your engineers will touch the frontend, the backend, the infrastructure, the deployment pipeline, possibly customer support, and maybe even sales scoping. Deep specialisation is a luxury that comes later. Right now you need people whose skill surface is wide enough that they do not become a bottleneck every time the problem shifts domain. This does not mean they need to be expert in everything. It means they need to be comfortable being not-expert and still shipping.</p><p><strong>Tool-building instinct.</strong> Time is the critical resource at a startup. Engineers who instinctively build tools to automate repetitive work buy the team time it cannot get any other way. This is different from engineering perfectionism. It is pragmatic leverage. The engineer who spends a day building a deployment script that saves the team twenty minutes every release is doing more for the company than the one who writes the most elegant algorithm.</p><p><strong>Pragmatism over purity.</strong> A startup engineer who insists on comprehensive code reviews, full unit test coverage, and architectural purity before shipping will slow you down at the exact moment speed matters most. This does not mean writing garbage. It means knowing which battles to fight and which shortcuts are acceptable debt versus unacceptable risk. It means understanding that shipping fast, observing everything, and reverting faster is often the better strategy than trying to get it perfect before anyone sees it.</p><p><strong>Builder identity, not operator identity.</strong> The distinction is important. An operator thrives inside a well-built machine. A builder thrives when there is no machine yet. Both are valuable. But if you are pre-product-market fit, you need builders. 
The fastest way to spot the difference: builders talk about things they created. Operators talk about things they improved.</p><p><strong>Growth instinct.</strong> Startup engineers will be asked to do things that are not in their job description, not in their domain, and not in their comfort zone. The ones who thrive are the ones who see that as a feature, not a bug. David Domingo, writing about the startup mentality, notes that the unknown pool that engineers dive into is often not even code-related&#8212;it might be customer support, sales feasibility, or hiring new engineers. The ones who adopt a growth mindset during those detours are the ones who survive.</p><p style="text-align: center;">* * *</p><h1>The Interview Process That Actually Works</h1><p>If standard big-tech interviews measure the wrong things, what should you do instead? I think the answer is simpler than most founders expect. You shift from testing knowledge to testing behaviour. From evaluating what someone knows to evaluating how they think when they do not know.</p><p><strong>Start with the mission, not the job description.</strong> Share your 90-day mission with the candidate. Not a vague company pitch. One clear outcome they would need to deliver, why it matters right now, and what constraints they are operating within. Three sentences, max. Then ask: What risks do you see? What information would you need before starting? The quality of their questions tells you more than the quality of their algorithm solutions ever will.</p><p>This sounds obvious, but teams miss it all the time. Most startup interviews start with a description of the company, then move to a technical screen, then eventually&#8212;maybe&#8212;discuss what the person would actually be doing. Flip it. Lead with the mission. 
The candidates who engage deeply with the real problem are the ones who will engage deeply with the real work.</p><p><strong>Use work samples, not whiteboard puzzles.</strong> Research from Aberdeen Group shows that employers using realistic pre-hire assessments are 24% more likely to hire employees who exceed performance goals and see 39% lower turnover. Give candidates a small, real problem from your actual codebase or product domain. Not a take-home that eats their weekend. A focused, two-hour exercise that mirrors what their first week would actually look like.</p><p>What you watch for matters more than what they produce. Do they ask clarifying questions or just start coding? Do they make explicit tradeoffs or try to boil the ocean? Do they ship something usable or spend all the time on architecture? Do they mention edge cases that would matter in production? An engineer who delivers something imperfect but working in two hours, with a clear explanation of what they would do next, is almost always a better startup hire than one who delivers an elegant but incomplete solution.</p><p><strong>Test for range, not pedigree.</strong> Ask about times they did work outside their job description. Listen for enthusiasm when they talk about wearing multiple hats. If every answer starts with &#8220;In my role as...&#8221; followed by a clean, scoped responsibility, that person has probably never operated outside a well-defined lane. That is fine for a 5,000-person company. It is a red flag for a 10-person one.</p><p><strong>Probe for ownership instinct.</strong> Present a scenario: You join on Monday and discover there is no error monitoring, the deployment process is manual, and the only documentation is in someone&#8217;s head. What do you do first? Builders will light up. They will start triaging, prioritising, and making a plan. Operators will ask who is responsible for fixing that. Neither response is wrong in the abstract. 
But only one of them works at an early-stage company.</p><p><strong>Compress the loop.</strong> Four to five rounds over three weeks is a process designed for companies with recruiting departments and candidate pipelines. You do not have that. Two rounds, maybe three. A conversation that tests judgment, a work sample that tests building, and a culture-fit conversation that tests whether they actually want the environment you have, not the one they wish you had. If you cannot make a decision in three rounds, the problem is probably your evaluation criteria, not the candidate&#8217;s signals.</p><p>One more thing on process: speed is itself a signal. The best startup candidates have options. If your hiring loop takes three weeks, someone else has already made them an offer. The startups that move fast in hiring tend to be the ones that move fast in everything else&#8212;and the best candidates notice.</p><p style="text-align: center;">* * *</p><h1>The AI Era Makes This Gap Wider</h1><p>Here is the part nobody in the original discussion talks about enough.</p><p>AI tools have made the generalist builder even more valuable and the narrow specialist even more exposed. An engineer with broad instincts and comfort with ambiguity can now use AI to fill gaps in domains where they are not deep. They can scaffold a frontend they have never built before, debug infrastructure patterns they have only seen once, and generate boilerplate that used to take days.</p><p>But this only works if the person already has the builder mindset. AI does not help an engineer who is waiting for someone to define the requirements. It does not help someone who needs a well-paved road. The leverage is asymmetric: it multiplies the effectiveness of scrappy generalists and barely moves the needle for process-dependent specialists.</p><p>I think this is the most underappreciated shift in startup hiring right now. The engineer you want is no longer someone who has deep knowledge in your exact stack. 
It is someone who can move across stacks, use AI to accelerate the parts they are less familiar with, and still make good architectural decisions because they understand the fundamentals. The bottleneck has moved upstream&#8212;from knowing how to implement to knowing what to implement and why.</p><p>This also changes what your interview should test for. Instead of asking whether someone can implement a specific algorithm from memory, you should be asking whether they can take a vague product requirement, figure out the right technical approach, and ship it&#8212;with or without AI assistance. The skill that matters is judgment, not recall.</p><p>This means the evaluation gap between startup hiring and big-tech hiring is getting wider, not narrower. The startups that figure this out first will build faster with smaller teams. The ones still running five-round Google-clone interviews will keep losing their best candidates to competitors who made an offer in four days.</p><p style="text-align: center;">* * *</p><h1>The Market Is Shifting. Your Interviews Should Too.</h1><p>The 2025&#8211;2026 hiring market is moving toward what Ravio calls precision hiring. Teams are smaller. Expectations are higher. The growth-at-all-costs era is over, and with it the luxury of hiring for potential and hoping it works out.</p><p>CB Insights data shows that 23% of startup failures trace back to team misalignment. Not lack of funding. Not bad market timing. The wrong people. And the most common version of wrong people is not people who are bad at engineering. It is people who are good at engineering in the wrong context.</p><p>When you have twelve months of runway, one wrong hire can shave off two. That is not a metaphor. A misaligned engineer consumes onboarding time, creates drag on the team as they struggle to adapt, and eventually requires a painful offboarding that demoralises everyone. You do not just lose the salary. 
You lose the opportunity cost of what a well-matched engineer would have shipped in that same window.</p><p>Tony Hsieh at Zappos once estimated that poor culture fits had cost the company over $100 million. That was at a scaled company with resources to absorb the hit. A startup does not have that luxury. Every seat matters. Every month matters. Every hire is a bet&#8212;and you need your evaluation system to help you make better bets, not just more structured ones.</p><p>If your interview process cannot distinguish between someone who is good at operating and someone who is good at building, you are flipping a coin on every hire. And each coin flip costs you three to six months.</p><p style="text-align: center;">* * *</p><h1>So What Now</h1><p>I am not saying big-tech engineers are bad hires for startups. Some of the best startup engineers I have worked with came from large companies. But they were the ones who were restless there. The ones who hated the process overhead, who took on side projects that nobody asked for, who were slightly annoyed at how long everything took. Those people translate beautifully into startup environments.</p><p>The interview is where you find that out. Not by asking them to invert a binary tree. By giving them a messy, real, underspecified problem and watching whether they lean in or look for the spec.</p><p>Your evaluation system is not neutral. It is a filter. And right now, most startups are using a filter designed to find a completely different kind of engineer than the one they actually need. They are borrowing an exam from a different school and wondering why the grades do not predict performance.</p><p>Preference is not performance. A polished interview process that makes you feel professional is not the same as a hiring process that identifies the right people. Sometimes the right process looks a little rough. 
A real problem from your codebase, a direct conversation about what the next ninety days look like, and an honest assessment of whether this person thrives in chaos or merely survives it.</p><p>Maybe the useful question is not <em>how do we hire better engineers</em>. It is <em>are we even measuring the right thing</em>?</p><p style="text-align: center;"></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://plausiblereality.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://plausiblereality.com/subscribe?"><span>Subscribe now</span></a></p><p style="text-align: center;"></p><p style="text-align: center;"></p><h2>Notes and References</h2><p>1. Neil Matthams, &#8220;Startups Should Evaluate Engineers Differently From Big Companies,&#8221; <em>Engineering Leadership Newsletter</em>, 2026. Based on 500+ technical hires across 30 countries.</p><p>2. NC State University (2020), study on whiteboard-style technical interviews measuring performance anxiety rather than coding competence. Published via ScienceDaily.</p><p>3. Aberdeen Group research on pre-hire assessments: 24% higher likelihood of exceeding performance goals, 39% lower turnover.</p><p>4. CB Insights analysis of startup failure reasons: 23% attributed to team misalignment.</p><p>5. Edmond Lau, <em>The Effective Engineer</em>, on ambiguity tolerance as the most important trait for startup engineers.</p><p>6. Ravio, &#8220;Tech Hiring Trends in 2026: The 4 Big Shifts Shaping the Tech Job Market.&#8221;</p><p>7. Tony Hsieh (Zappos) estimate: poor culture fits costing over $100 million. Widely cited in startup hiring literature.</p>]]></content:encoded></item><item><title><![CDATA[AI Did Not Take Your Job. 
It Promoted You.]]></title><description><![CDATA[Most knowledge work is moving one level up the stack: less first draft, more judgment, orchestration, and accountability.]]></description><link>https://plausiblereality.com/p/ai-did-not-take-your-job-it-promoted</link><guid isPermaLink="false">https://plausiblereality.com/p/ai-did-not-take-your-job-it-promoted</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Sat, 28 Mar 2026 23:22:35 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0a8aa09a-27c4-4ded-8b86-5b749fb14e30_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The weird thing about AI at work is that both optimists and doomers keep reaching for movie plots.</p><p>In one version, the machine politely helps with email and then sits in the corner like a very expensive stapler.</p><p>In the other, it storms the building, takes your laptop, and leaves you explaining transferable skills to a recruiter named Brad.</p><p>Real work is less cinematic and more annoying.</p><p>In a lot of knowledge jobs, AI does not immediately replace the human. It drags the human one level up the stack. You stop being the person who writes the first draft of everything. You become the person who decides what should exist, what good looks like, what can be trusted, and what actually ships.</p><p>That is why I keep coming back to the same line: AI did not take your job. It promoted you.</p><p>Not to &#8220;manager,&#8221; exactly. More like operator, editor, reviewer, systems designer, and person who still gets blamed when the output is wrong. So yes, a promotion. But in the traditional corporate sense, where responsibility shows up before the pay band does.</p><p>I think this frame is more useful than the usual replacement panic because it matches what is already happening in real workflows. 
The ILO and NASK&#8217;s 2025 global exposure index estimates that 25 percent of global employment sits in occupations potentially exposed to generative AI, with higher exposure in high-income countries, and says transformation rather than replacement is the more likely outcome. Full job automation, in their framing, remains limited because many tasks still require human involvement.<sup>[1]</sup></p><p>That does not mean everyone is safe. It definitely does not mean everyone gets a nice strategic role and a bigger title.</p><p>It means &#8220;job&#8221; is the wrong unit of analysis.</p><p>A job is a lumpy bag of tasks. AI peels that bag apart.</p><p>Some tasks get cheaper. Some disappear. Some become review work. Some move upstream into planning and constraint setting. Some move downstream into verification, integration, exception handling, and accountability. The human role that survives is usually less about producing every line from scratch and more about governing the system that now produces a lot of the raw material.</p><p>That is the promotion.</p><h2>The wrong debate is replacement</h2><p>The wrong debate is &#8220;Will AI replace software engineers?&#8221; or &#8220;Will AI replace marketers?&#8221; or &#8220;Will AI replace analysts?&#8221;</p><p>The useful question is: which parts of those jobs are moving from direct execution to supervision?</p><p>That sounds obvious, but teams miss it all the time.</p><p>If you are a developer, draft code, boilerplate tests, migrations, documentation stubs, and first-pass debugging suggestions get cheaper. If you are a product manager, turning a fuzzy pile of stakeholder thoughts into a first draft of a PRD gets cheaper. If you are in support, standard replies and retrieval-heavy responses get cheaper. If you are an operator, summarizing a process, comparing options, or turning a meeting mess into action items gets cheaper.</p><p>Cheaper does not mean solved. 
It means the scarce part moves.</p><p>The bottleneck has moved upstream.</p><p>When a model can produce ten decent starting points in two minutes, the expensive thing is no longer typing speed. It is framing the problem, supplying the right context, defining the constraints, spotting the hidden failure modes, and deciding which of the ten drafts deserves to exist in the world.</p><p>That is why the work starts to feel managerial even when your title does not.</p><p>OECD analysis points in the same direction. In occupations with high AI exposure, vacancies disproportionately demand management and business-process skills, along with social, emotional, and digital skills. Across ten OECD countries, 72 percent of vacancies in high-exposure occupations demanded at least one management skill and 67 percent demanded at least one business-process skill.<sup>[2]</sup> My read is simple: a growing share of white-collar work now includes managerial motions even when nobody calls them that. You are specifying work, routing work, reviewing work, and coordinating outputs across humans and systems.</p><p>Again, not glamorous. Just accurate.</p><p>This also explains why AI is hitting white-collar, highly digitized work first. OECD work on worker exposure says the occupations most exposed to AI are typically white-collar roles such as IT professionals, managers, and science and engineering professionals, while manual occupations tend to have lower AI exposure.<sup>[3]</sup> For the people reading this blog - developers, leads, PMs, founders, CTOs - that matters. The ladder wobbling under your feet is not a factory story. It is your story.</p><p>And yes, clerical work faces the harshest direct exposure in the ILO data.<sup>[1]</sup> Some jobs are under much more direct pressure. Some task layers will be shaved off so aggressively that the &#8220;promotion&#8221; feels less like advancement and more like being told to supervise your own replacement. 
I am not pretending otherwise.</p><p>I am saying the dominant early pattern in knowledge work is not clean replacement. It is messy reallocation.</p><h2>Your new job description is not &#8220;use AI more&#8221;</h2><p>This is where most teams get embarrassingly vague.</p><p>They tell people to &#8220;use AI&#8221; as if that were a workflow and not a cry for help.</p><p>What actually matters is whether the job has been redesigned around cheaper first drafts and more expensive judgment.</p><p>The new job description usually looks something like this:</p><p>You define the artifact before it exists.</p><p>You specify what good looks like.</p><p>You supply examples, edge cases, and constraints.</p><p>You choose what the model is allowed to do and what it is not allowed to touch.</p><p>You evaluate the output.</p><p>You integrate it into a broader system.</p><p>You own the result anyway.</p><p>That is not &#8220;prompting.&#8221; That is operating.</p><p>I notice this in my own work constantly. I start fewer things from a blank page now. I start with a brief, a list of failure modes, and some kind of verification step. The keyboard time went down. The responsibility absolutely did not.</p><p>This is why I keep saying acceptance criteria are becoming the real leverage point. If generation is cheap, specification matters more. A bad brief no longer creates slow bad work. It creates fast bad work. That is worse. Garbage at machine speed is still garbage. It just arrives with better punctuation.</p><p>There is also a shift from judgment as theater to artifact as evidence.</p><p>In the old version of white-collar work, a lot of value was signaled socially. Could you sound smart in a meeting? Could you explain the plan? Could you bluff through ambiguity with enough confidence that nobody asked follow-up questions?</p><p>AI is rude to that style of competence. It can generate plausible language by the barrel. 
So the emphasis moves toward the artifact that survives contact with reality: the shipped feature, the tested process, the memo that withstands scrutiny, the forecast with clear assumptions, the incident write-up that actually explains what happened.</p><p>That is a better game, frankly. But it is less forgiving.</p><p>The smartest people in AI-heavy environments are not the ones who can make the model sound magical in a demo. They are the ones who can build a harness around it: context packs, templates, examples, validations, tests, rollback paths, review rules, and clear ownership. Harness engineering sounds less sexy than &#8220;prompt wizardry,&#8221; which is unfortunate for marketing but excellent for civilization.</p><p>The best AI workflow is usually slightly boring.</p><h2>Why juniors get lifted and seniors get rearranged</h2><p>One of the most interesting things in the early evidence is who benefits most from these tools.</p><p>In customer support, Brynjolfsson, Li, and Raymond found that access to a generative AI assistant increased productivity by 14 percent on average, with a 34 percent improvement for novice and lower-skilled workers and minimal impact for the most experienced workers.<sup>[4]</sup> In a separate experiment on professional writing tasks, Noy and Zhang found that ChatGPT users finished 40 percent faster and produced output judged 18 percent higher in quality.<sup>[5]</sup> And in software development, Cui and coauthors reported that across three field experiments involving 4,867 developers at Microsoft, Accenture, and a Fortune 100 company, access to an AI coding assistant increased completed tasks by 26.08 percent, with larger gains for less experienced developers.<sup>[6]</sup></p><p>That is not a small pattern. It is the shape of the change.</p><p>AI is very good at collapsing parts of the apprenticeship curve.</p><p>It helps people produce a passable first version sooner. 
It exposes lower-skill workers to patterns that used to live mostly in the heads of stronger operators. It narrows some performance gaps. It makes the middle of the quality distribution fatter.</p><p>Good. Also dangerous.</p><p>Good, because more people can do useful work faster.</p><p>Dangerous, because a lot of career ladders were built on earning your way through the routine layers. If the routine layers get compressed, organizations have to become much more deliberate about how people learn judgment. You cannot grow seniors out of thin air. And you definitely cannot grow them by asking juniors to rubber-stamp machine output all day like exhausted airport screeners.</p><p>So the ladder changes.</p><p>Juniors can often get to competent output faster.<sup>[4][5][6]</sup></p><p>Mids lose some of the value that came from being the person who could grind through the standard work reliably.</p><p>Seniors remain critical, but the nature of their value shifts. Less of it comes from being the fastest hands on the keyboard. More of it comes from task decomposition, exception handling, quality judgment, system design, and teaching others where the model will betray them.</p><p>This is why some experienced workers misread the moment.</p><p>They look at AI helping a junior write passable code or prose and conclude that seniority no longer matters.</p><p>Wrong.</p><p>The problem is usually not that expertise disappears. It is that expertise moves up a layer. When the easy 60 percent gets cheaper, the remaining 40 percent becomes the whole game.</p><p>That 40 percent is where costly mistakes live.</p><h2>The jagged frontier is why blind delegation is stupid</h2><p>If AI really were just a universal multiplier, the story would be easy. You would hand it everything and go home early.</p><p>Sadly, reality insists on nuance.</p><p>Dell&#8217;Acqua and coauthors&#8217; work on what they called the &#8220;jagged technological frontier&#8221; is useful here. 
In the BCG field experiment, consultants using AI completed tasks inside the model&#8217;s capability frontier faster and at higher quality. A Harvard summary of the study reports more than 25 percent greater speed, more than 40 percent higher human-rated performance, and more than 12 percent more task completion on those tasks. But on harder tasks outside the frontier, consultants using AI were 19 percentage points less likely to produce the correct answer.<sup>[7]</sup></p><p>That is the management problem.</p><p>Your promotion is not just &#8220;use AI more.&#8221; Your promotion is deciding where AI belongs, where it needs guardrails, and where it should stay out of the room.</p><p>Some tasks should be delegated cleanly.</p><p>Some should be AI-assisted but tightly checked.</p><p>Some should use AI for option generation and humans for final reasoning.</p><p>Some should remain almost entirely human because the cost of subtle error is too high.</p><p>That routing decision is work. Real work.</p><p>The lazy way to use AI is to ask it for everything.</p><p>The adult way is to map the workflow, identify the high-volume low-novelty steps, keep humans close to the decision points, and build a verification loop that catches silent failure.</p><p>This is where &#8220;preference is not performance&#8221; becomes practical rather than philosophical.</p><p>A tool can feel magical in chat and still be mediocre in a shipping workflow. If it creates more review burden than usable output, you do not have leverage. You have a very cheerful source of rework.</p><p>The metrics that matter are boring and therefore excellent: cycle time, acceptance rate, rework, escaped defects, reviewer load, and time spent clarifying requirements after generation. 
If those do not improve, your &#8220;AI strategy&#8221; is probably an expensive first-draft machine wearing a blazer.</p><h2>Promotion without a raise is still a promotion</h2><p>To be clear, I am using &#8220;promotion&#8221; to describe functional change, not moral progress.</p><p>A company can absolutely use AI to widen spans of control, compress headcount, raise output expectations, and dump more review work onto the same number of people. The ILO is explicit that policy choices and implementation paths will shape both worker retention and job quality in AI-exposed occupations.<sup>[1]</sup></p><p>So yes, the promotion can be rude.</p><p>You can get more leverage and less comfort at the same time.</p><p>You can move into more judgment-heavy work while also being measured harder, interrupted more often, and asked to cover a broader surface area. This is what partial automation looks like in the wild. It rarely arrives with a brass band. It arrives as &#8220;Can you also oversee the AI-assisted version of this process?&#8221;</p><p>That is why half-adoption is miserable.</p><p>If management automates generation but not evaluation, workers become janitors for machine sludge. They spend their day reviewing plausible nonsense, correcting avoidable errors, and cleaning up drafts that should never have existed. That is not leverage. That is a new flavor of admin burden.</p><p>If management redesigns the workflow properly, the human gets moved toward the work that actually deserves a human: ambiguous trade-offs, exceptions, prioritization, escalation, stakeholder judgment, and quality control.</p><p>Those are not identical futures.</p><p>So when leaders say &#8220;we are rolling out AI,&#8221; the real question is not whether the model is good. 
The real question is whether the workflow around the model is sane.</p><p>CTOs, especially, should treat this as org design rather than software procurement.</p><p>Buying a capable model is easy.</p><p>Deciding where it sits in the workflow, how people learn to use it, what must be tested, what gets logged, what gets escalated, and what counts as done - that is the hard part. That is management work. Which is exactly why I think &#8220;promotion&#8221; is the right word.</p><h2>A concrete workflow: shipping a feature with AI in the loop</h2><p>Let me make this less abstract.</p><p>Say a team needs to ship a modest internal feature: a permissions update, a reporting view, maybe a boring admin screen. The kind of task that used to begin with somebody staring into an editor and metabolizing caffeine.</p><p>In the old workflow, a developer might spend the first chunk of time translating a fuzzy request into a plan, then drafting implementation, then remembering tests, then writing the supporting documentation and release notes because no one else wanted to.</p><p>In the promoted workflow, the sequence changes.</p><p>1. Start with the brief, not the code. Write the user outcome, the constraints, the non-goals, the edge cases, and the acceptance criteria. This is not paperwork. This is the control surface.</p><p>2. Use AI to turn that brief into options: implementation outline, likely risks, missing requirements, test cases, migration concerns, and rollout questions. Make the model show its assumptions.</p><p>3. Let AI draft the first pass of code, tests, docs, and change summary where appropriate.</p><p>4. Run the boring machines on the machine output: linting, unit tests, type checks, security scanning, schema diff review, whatever applies.</p><p>5. Use the human review budget on what is actually risky: business logic, failure handling, weird permissions edges, naming that affects maintainability, and whether the thing should exist in this form at all.</p><p>6. 
Use AI again for downstream packaging: support macros, stakeholder update, release notes, runbook tweaks, and follow-up tickets.</p><p>Notice what changed.</p><p>The human did not disappear. The human moved.</p><p>Less blank-page drafting.</p><p>More problem framing.</p><p>More evaluation.</p><p>More integration.</p><p>More accountability.</p><p>That is the promotion in one small workflow.</p><p>I use the same pattern in writing. I do not start by asking the model to &#8220;write the article.&#8221; I start by trying to make the argument legible to myself, define what would make it true, and decide what evidence deserves to stay. Then the machine can help with structure, counterarguments, phrase alternatives, compression, and cleanup. The model is useful. The model is not the author. The job has moved upward, not outward.</p><p>This also explains why AI-heavy teams start caring more about reusable scaffolding. Shared prompts are fine. Shared rubrics are better. Shared evaluation checklists are better than that. The moment your team can generate work cheaply, consistency stops being a nice-to-have and becomes survival gear.</p><h2>What teams should actually change</h2><p>If you buy the promotion frame, a few practical implications follow.</p><p>First, train people on review and specification, not just on tool features.</p><p>Most AI training is basically a software demo with better posture. That is not enough. People need to learn how to scope tasks, express constraints, inspect outputs, and recognize failure patterns. Otherwise you are handing power tools to people and congratulating yourself because the box looked premium.</p><p>Second, redesign role expectations explicitly.</p><p>Do not say &#8220;everyone should use AI&#8221; and then evaluate them as if the work were unchanged. 
If the first draft is now cheap, then clarity, judgment, and orchestration should be rewarded more directly.</p><p>Third, instrument actual workflows.</p><p>Preference is not performance. Measure cycle time, acceptance rate, escaped defects, and rework on a handful of recurring processes. If the tool helps only in demos, that is not adoption. That is theater with a subscription fee.</p><p>Fourth, protect learning loops for less experienced people.</p><p>If juniors never have to think, they will not become seniors. Let AI accelerate them, but do not let it replace the reasoning reps completely. Ask for explanations. Rotate who defines the acceptance criteria. Make people compare outputs, not just consume them.</p><p>Fifth, stop fetishizing the prompt.</p><p>The prompt matters. But the bigger win is almost always in the surrounding system: better context, cleaner data, stronger templates, reusable checks, and clearer ownership boundaries. The best AI workflow is usually slightly boring because reliability is usually slightly boring.</p><p>None of this sounds glamorous. That is exactly why it works.</p><h2>The slightly annoying conclusion</h2><p>The useful question is not whether AI can do your job.</p><p>It is whether you have moved your value one level up before the market forces you to.</p><p>If your value is mostly raw drafting, raw formatting, raw summarizing, raw boilerplate coding, or raw information rearrangement, AI is very rude news. Those layers are getting cheaper, sometimes dramatically.<sup>[4][5][6]</sup></p><p>If your value is in defining the work, building the harness, judging the output, spotting the edge cases, aligning people around decisions, and standing behind the artifact, AI is not removing you. 
It is increasing your span of action.</p><p>That still might be exhausting.</p><p>It still might be unfair.</p><p>It still might come with exactly zero ceremonial appreciation from management.</p><p>But it is a more precise description of what is happening.</p><p>AI did not take your job. It promoted you.</p><p>The annoying part is that promoted people are supposed to know what good looks like.</p><p>Do you?</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://plausiblereality.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://plausiblereality.com/subscribe?"><span>Subscribe now</span></a></p><p></p><h2>Notes / references</h2><p>1. International Labour Organization and NASK, &#8220;One in four jobs at risk of being transformed by GenAI, new ILO-NASK Global Index shows,&#8221; May 20, 2025, summarizing <em>Generative AI and Jobs: A Refined Global Index of Occupational Exposure</em> (ILO Working Paper 140, 2025).</p><p>2. OECD, <em>Artificial Intelligence and the Changing Demand for Skills in the Labour Market</em>, OECD Artificial Intelligence Papers No. 14, 2024. The report finds that in high AI exposure occupations, 72 percent of vacancies demanded at least one management skill and 67 percent demanded at least one business-process skill across the ten-country sample.</p><p>3. OECD, <em>Who Will Be the Workers Most Affected by AI?</em>, OECD Artificial Intelligence Papers, 2024. The executive summary notes that many occupations most exposed to AI are white-collar roles such as IT professionals, managers, and science and engineering professionals.</p><p>4. Erik Brynjolfsson, Danielle Li, and Lindsey R. Raymond, &#8220;Generative AI at Work,&#8221; NBER Working Paper 31161, 2023.</p><p>5. 
Shakked Noy and Whitney Zhang, &#8220;Experimental Evidence on the Productivity Effects of Generative Artificial Intelligence,&#8221; <em>Science</em> 381, no. 6654 (2023): 187-192. DOI: 10.1126/science.adh2586.</p><p>6. Zheyuan (Kevin) Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, and Tobias Salz, &#8220;The Effects of Generative AI on High-Skilled Work: Evidence from Three Field Experiments with Software Developers,&#8221; SSRN working paper, August 20, 2025. DOI: 10.2139/ssrn.4945566.</p><p>7. Fabrizio Dell&#8217;Acqua, Edward McFowland III, Ethan Mollick, Hila Lifshitz-Assaf, Katherine C. Kellogg, Saran Rajendran, Lisa Krayer, Francois Candelon, and Karim R. Lakhani, &#8220;Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of Artificial Intelligence on Knowledge Worker Productivity and Quality,&#8221; <em>Organization Science</em> 37, no. 2 (2026): 403-423. For the summarized speed, quality, and task-completion figures used above, see also Harvard Business School&#8217;s official summaries of the study from September and November 2023.</p>]]></content:encoded></item><item><title><![CDATA[How to Make Your AI Colleague Great Again]]></title><description><![CDATA[Standards, context, and process as the real onboarding system for agents.]]></description><link>https://plausiblereality.com/p/how-to-make-your-ai-colleague-great</link><guid isPermaLink="false">https://plausiblereality.com/p/how-to-make-your-ai-colleague-great</guid><dc:creator><![CDATA[Eloi Tay]]></dc:creator><pubDate>Fri, 27 Mar 2026 00:02:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4f2d7ecd-0888-4d89-8f51-f62612f06ae9_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>A lot of AI disappointment is self-inflicted. If you want an AI teammate to be useful, treat onboarding as a systems problem. 
Standards, context, access boundaries, and escalation paths do more than one heroic prompt ever will.</p><h1>The wrong debate</h1><p>The wrong debate is whether the agent is smart enough in the abstract. The useful question is whether it has actually been onboarded into your way of working.</p><p>What makes this argument useful is that it forces the conversation back to the workflow. When teams under-specify the work, they push uncertainty downstream, where it gets rediscovered as rework, second-guessing, and inconsistent output.</p><p>A human junior gets examples, naming conventions, escalation paths, and the awkward local rules nobody puts in the architecture diagram. A synthetic junior needs the same categories of help, just packed into artifacts instead of hallway conversations.</p><p>This sounds obvious, but teams miss it all the time because the missing pieces do not look glamorous. A clear standard, a good example, a scoped permission, or a crisp definition of done rarely wins the demo. They do, however, win the week.</p><h1>What actually matters</h1><p>When teams say an agent is unreliable, a lot of the time they are describing terrible onboarding. The human newcomer gets a repo tour, worked examples, and the awkward bits nobody writes down until they have to. The model gets a prompt and some vague optimism. Then everyone acts shocked when it behaves like a contractor who was dropped into the codebase through a skylight. If you want better output, start by making the work environment legible.</p><p>Standards are not bureaucracy here. They are compression. A short, explicit set of conventions lets the agent spend tokens on the task instead of rediscovering how your team names files, structures tests, or documents migrations. Standards feel less exciting than model demos. Boring clarity wins anyway.</p><p>Prompt phrasing is usually the wrong place to look. 
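</p><p>To make the compression point concrete, here is a minimal sketch of how a team might package conventions and context into one artifact the agent receives with every task. This is a hypothetical illustration, not a standard: the file names, conventions, and the function itself are assumptions.</p>

```python
# Hypothetical sketch: standards and context shipped as a reusable
# artifact instead of ad hoc prompt phrasing. All names are illustrative.

CONVENTIONS = """\
- Tests live in tests/, named test_<module>.py
- Every migration gets a one-line note in docs/migrations.md
- Never touch payments/ without an explicit ticket reference
"""

def build_context_pack(task_brief: str, relevant_files: dict[str, str]) -> str:
    """Assemble the one legible package an agent (or a new human) receives."""
    sections = [
        "## Team conventions\n" + CONVENTIONS,
        "## Task\n" + task_brief,
    ]
    # Attach the local truth the task depends on, file by file.
    for path, snippet in relevant_files.items():
        sections.append(f"## Context: {path}\n{snippet}")
    return "\n\n".join(sections)

pack = build_context_pack(
    "Add an index to speed up the orders-by-customer query.",
    {"db/schema.sql": "CREATE TABLE orders (id INTEGER, customer_id INTEGER);"},
)
```

<p>The point of the sketch is only that the conventions are written once and attached every time, instead of being re-litigated prompt by prompt.</p><p>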
What actually matters is whether the agent can see the local truth: architecture notes, examples of good changes, known landmines, recent decisions, and the shape of the task. Context is not extra seasoning on top of the prompt. It is the difference between asking for help from a teammate who knows the system and one who just arrived from another planet.</p><p>If I were fixing this tomorrow, I would not start with a new model. I would start by making the work easier to delegate through a reusable control surface: shared standards, reusable context, bounded permissions, visible examples, and cheap checks. Once those are in place, a stronger model helps. Before they exist, model upgrades mostly rearrange disappointment.</p><p>The useful question is not whether the agent can do something in principle. It is whether the task has been defined clearly enough, instrumented cheaply enough, and bounded safely enough to be delegated in the first place.</p><h1>A workflow I would actually trust</h1><p>One workflow I like is a small context pack beside the work: a task brief, a short statement of responsibility, a repo map, relevant files, coding standards, a definition of done, a list of risky surfaces, examples of acceptable output, and explicit escalation rules. The agent gets that package every time. Humans do too. The result is fewer clarifying loops and less weird improvisation, because the system no longer has to infer what the team meant but forgot to write down.</p><p>The pack looks boring in exactly the right way. Hand it to a model and to a new teammate and both will ask fewer confused questions. The interesting part is not that the agent gets better. 
The interesting part is that the team has finally written down what it claims to value.</p><p>This is also why onboarding and governance are closer than they look. Once permissions, expectations, and failure paths are explicit, trust becomes operational rather than emotional. You are not hoping a synthetic teammate behaves. You are shaping the conditions under which useful behavior is more likely and costly mistakes are easier to catch.</p><p>The measurement question matters because teams are excellent at narrating improvement that they have not actually checked. I would track at least five things: time to a usable first draft, number of clarification loops, amount of human cleanup, defect or regression rate after merge, and how often a rollback or rework was needed. Those numbers are boring, which is exactly why they are useful.</p><p>They also prevent a common trap. An agent can feel fast because it generates a lot of text or code quickly while still making the total loop slower once review, debugging, and cleanup are included. If the workflow is meant to ship, the metric has to live closer to shipped outcome than to demo charisma.</p><h1>Why this is credible</h1><p>This framing lines up with public engineering guidance from Anthropic, which has argued that successful agent systems are usually built from simple, composable patterns rather than ornate frameworks, and that context engineering is a more useful frame than prompt engineering once agents enter real workflows. NIST lands in a similar place from a governance angle: trustworthiness comes from lifecycle controls around the system, not from model confidence alone.</p><p>A useful agent is not the model you bought. It is the workflow you bothered to build.</p><p>The public evidence base is not perfect, but it points in a consistent direction. 
Anthropic&#8217;s engineering writing has repeatedly emphasized that effective agents depend on simple, composable patterns, strong context, and good tool design rather than prompt theatrics alone. The newer harness work strengthens the same lesson for longer-running tasks: the surrounding workflow determines whether capability survives contact with reality.</p><p>That sits well beside NIST&#8217;s risk-management framing, which treats trustworthy AI as a systems problem, and beside the DORA and METR findings that remind teams not to confuse subjective speed with shipped value. I would not overstate any single study. But taken together, the case for better harnesses is much stronger than the case for endless prompt superstition.</p><h1>The objection that sounds smarter than it is</h1><p>A common objection is that all this structure slows people down and that the real winners will simply use more capable models more aggressively. I think that gets the timing backward. Loose workflows can feel faster at first because they externalize ambiguity onto later review and cleanup. Structured workflows feel slower only until the same task appears for the fifth or fiftieth time.</p><p>That is when the boring scaffolding starts compounding. The brief exists. The examples exist. The checks exist. The team no longer burns senior attention rediscovering the same missing assumptions. Good harnesses are not anti-speed. They are how speed survives contact with scale.</p><h1>Where I would start this week</h1><p>If the thesis of this article is right, the first move is not a bigger model budget. It is one cleaner delegation lane. 
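</p><p>One way to make that lane tangible is to write the brief down as a small structured artifact that both humans and agents receive. The sketch below is a hypothetical shape in Python; the field names and example values are assumptions, not an established schema.</p>

```python
# Illustrative only: one recurring task captured as a reusable brief.
from dataclasses import dataclass, field

@dataclass
class TaskBrief:
    goal: str
    scope: list[str]                  # files or areas that are in bounds
    definition_of_done: list[str]     # checkable completion criteria
    escalation_rule: str              # what to do when uncertain
    examples: list[str] = field(default_factory=list)  # known-good outputs

brief = TaskBrief(
    goal="Triage incoming bug reports into priority buckets.",
    scope=["support/inbox", "docs/triage-rubric.md"],
    definition_of_done=[
        "Every report labeled P1-P4 with a one-line rationale",
        "Anything touching billing escalated, never auto-labeled",
    ],
    escalation_rule="If confidence is low, stop and ask the task owner.",
)
```

<p>The exact format matters far less than the habit: the same brief, handed to everyone, every time the task recurs.</p><p>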
Pick a recurring task and make the surrounding expectations unambiguous enough that both humans and agents can run it the same way.</p><blockquote><p>&#183; Choose one recurring task with a clear business owner and a low-to-moderate blast radius.</p><p>&#183; Write a reusable task brief with goal, scope, relevant files or context, and a definition of done.</p><p>&#183; Add one or two examples of acceptable output and one explicit escalation rule for uncertainty.</p><p>&#183; Measure re-asks, human cleanup, and whether the task actually ships more smoothly after the change.</p></blockquote><p>The point is not to create a sacred process document. It is to make one useful loop more legible, more repeatable, and easier to trust. If that feels slightly boring, good. The boring version is usually the version that ships.</p><p></p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://plausiblereality.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://plausiblereality.com/subscribe?"><span>Subscribe now</span></a></p><p></p><h1>Notes</h1><p>1. Anthropic (2024), &#8220;Building effective agents,&#8221; published December 19, 2024.</p><p><a href="https://www.anthropic.com/research/building-effective-agents">https://www.anthropic.com/research/building-effective-agents</a></p><p>2. Anthropic (2025), &#8220;Effective context engineering for AI agents,&#8221; published September 29, 2025.</p><p><a href="https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents">https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents</a></p><p>3. 
Anthropic (2025), &#8220;Effective harnesses for long-running agents,&#8221; published November 26, 2025.</p><p><a href="https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents">https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents</a></p><p>4. Anthropic (2025), &#8220;Writing effective tools for AI agents&#8212;using AI agents,&#8221; published September 11, 2025.</p><p><a href="https://www.anthropic.com/engineering/writing-tools-for-agents">https://www.anthropic.com/engineering/writing-tools-for-agents</a></p><p>5. NIST (2023), &#8220;Artificial Intelligence Risk Management Framework (AI RMF 1.0),&#8221; NIST AI 100-1.</p><p><a href="https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf</a></p><p>6. NIST (2024), &#8220;Artificial Intelligence Risk Management Framework: Generative AI Profile,&#8221; NIST AI 600-1.</p><p><a href="https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.600-1.pdf">https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.600-1.pdf</a></p><p>7. Google Cloud / DORA (2025), &#8220;How are developers using AI? Inside our 2025 DORA report,&#8221; published September 23, 2025.</p><p><a href="https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025">https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025</a></p>]]></content:encoded></item></channel></rss>