The New Senior Engineer Knows What to Trust
In the AI era, the real edge is not writing more code. It is deciding what counts as evidence before the code ships.
There is a version of senior engineering that refuses to die.
It shows up in hiring loops that quietly reward the fastest person in the editor. It shows up in promotion packets that confuse visible output with leverage. It shows up every time someone mistakes live-coding fluency for judgment.
I think that model was flimsy even before AI. Now it is mostly decorative.
In a modern codebase, code is not the only thing you ship. You ship assumptions. You ship rollout plans. You ship alerts. You ship maintenance costs. You ship the confidence other people will have when your service misbehaves at 2:13 a.m.
That is why the new senior engineer is not mainly the fastest coder. The defining edge is knowing what to trust.
Not trust in the soft, inspirational sense. Trust as a technical act. Which document is authoritative? Which test actually tells you something? Which dashboard is instrumented well enough to matter? Which reviewer understands the blast radius? Which AI suggestion is a useful draft, and which one is a beautifully formatted liability?
This is not an anti-coding argument. Great engineers still need to write solid code. The point is sharper than that. Once coding competence is table stakes, seniority shifts toward evidence selection, verification, and the ability to make other people’s decisions safer.
* * *
The wrong debate is output
Public career ladders are more honest than most hot takes, because eventually companies have to write down what they are willing to pay for.
When Monzo publishes a framework for senior and staff engineers, the language is not about typing speed. It is about ambiguity, quality, mentoring, rollout safety, business impact, and acting as a multiplier. The current public version describes staff engineers as people who independently own ill-defined, highly ambiguous work, maintain the quality bar, mentor others, and ship through incremental releases, rollout plans, monitoring, and metrics.
Dropbox says roughly the same thing in a different accent. Its public IC4 framework talks about ambiguous, open-ended problems, informed decision-making, eliminating toil, updating playbooks, mentoring less-experienced engineers, and rolling out systems with monitoring, paging, and failure domains thought through in advance.
That matters because ladders are compensation documents, not conference talk theater. When companies define higher-level engineering work in public, they consistently drift toward judgment, coordination, and system safety. They do not drift toward “writes code unusually fast.”
Microsoft Research found something similar from a different angle. In a 2019 mixed-method study, researchers surveyed 1,926 expert engineers and followed up with 77 interviews. Their top five distinguishing characteristics of great software engineers were writing good code, accounting for future value and costs, practicing informed decision-making, avoiding making other people’s jobs harder, and learning continuously.
The interesting part is not just that writing good code is on the list. Of course it is. The more interesting part is that decision-making attributes ranked highest as a group, and “information gathering” stood out as especially important. Great engineers were distinguished not by confidence theater, but by getting the right information and then updating their decisions when the evidence changed.
The measurement literature makes the same point more bluntly. The SPACE framework argues that developer productivity cannot be reduced to a single metric or activity data alone. DORA makes the operational version of that claim: software delivery performance includes both throughput and instability, and its research says speed and stability are not long-run tradeoffs for top teams.
That is the part a lot of organizations still dodge. Preference is not performance. Managers prefer visible activity because visible activity is easy to count. Systems do not care. Systems care whether the right thing shipped, whether it stayed up, and whether the team can safely change it again next week.
The useful question is not, “Who writes the most code?” It is, “Who improves the odds that the right code ships safely, stays understandable, and does not create work for everyone else?”
* * *
Trust is a technical skill
“Knows what to trust” can sound vague until you spell out the evidence stack.
At the top are authoritative sources: the owner of the domain, the API contract, the architecture decision record, the product rule, the compliance constraint, the incident history that already told you how this system fails.
Then come mechanized checks: type systems, static analysis, linters, unit tests, integration tests, and build gates. These are valuable, but only inside their jurisdiction.
Then come operational signals: logs, traces, SLOs, alerts, and what the service actually does in production.
Then come human review and escalation: the domain expert, the security engineer, the SRE, the teammate who has seen this failure mode before.
And then there is AI.
AI is useful. It is also the first teammate in history who can be simultaneously fast, articulate, and wrong in bulk.
This sounds obvious, but teams miss it all the time. They confuse the nearest signal with the strongest signal.
A passing type check is not proof that the business logic is correct. GitHub’s own guidance now says to use type systems as guardrails, not crutches. A passing test suite proves only what the suite actually covers. Google’s testing guidance is extremely practical here: the ideal feedback loop is fast, reliable, and isolates failures. Flaky tests do the opposite. They make engineers stop believing the system that was supposed to give them confidence in the first place.
Google’s SRE material frames testing in similar terms: more thorough testing means less uncertainty after a change. That is a better definition of verification than most teams use. Verification is not “did CI go green.” Verification is “did we earn enough confidence to proceed.”
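The jurisdiction point is easy to demonstrate. A function can satisfy the type checker completely while encoding the wrong business rule, because the real constraint lives outside the type system's reach. A minimal Python sketch, with an invented retry policy and invented names:

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    amount_cents: int
    retry_count: int


# Hypothetical: the engineer hardcoded 5, but the real constraint,
# buried in a finance doc, caps billing retries at 3.
MAX_RETRIES = 5


def should_retry(invoice: Invoice) -> bool:
    # Fully typed, fully green in CI, and still wrong: no type
    # checker can know what the finance team actually agreed to.
    return invoice.retry_count < MAX_RETRIES
```

The code passes every mechanized check. Only an authoritative source, the product rule or the person who owns it, can tell you the constant is wrong. That is the difference between the nearest signal and the strongest one.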
Operational signals have boundaries too. Dashboards can lie by omission. Alerts can be too noisy to be trustworthy. A clean incident channel can hide the fact that nobody instrumented the one thing that would actually tell you whether the rollout is harming customers.
Human review is also not magically authoritative. An approval from the nearest available reviewer may mean, at best, “I recognized the shapes in this diff.” Senior engineers know the difference between review as ceremony and review as evidence.
And AI should almost never be treated as an authority by itself. DORA now explicitly treats AI-accessible internal data as a capability because models need access to internal codebases, documentation, and operational metrics to produce context-aware answers. GitHub’s 2026 guidance is even more direct: test AI-generated code more rigorously, not less.
Every signal has a jurisdiction. Senior engineers are good at asking where that jurisdiction ends.
When I review a risky change, I am not asking whether the author seems smart, or whether the model was helpful, or whether the diff looks clean. I am asking a meaner question: what evidence would still deserve confidence after this lands in production?
That is what calibrated trust looks like in practice. Not cynicism. Not bureaucracy. Just refusing to borrow certainty from weak signals.
* * *
A fast patch is not the same as a safe change
Imagine a teammate opens a 900-line pull request that changes billing retries, queue handling, and a customer-visible status flow.
The patch came together quickly because an AI assistant handled some of the scaffolding and a first draft of the tests. The code compiles. The types are green. CI passes. The PR description says some version of “refactor + cleanup + reliability improvements.”
An output-obsessed team sees velocity. A senior engineer sees unanswered questions.
First, they ask what source of truth governs the behavior. Is there a product rule? A finance constraint? A previous incident postmortem? An ADR? Billing systems are full of code that looks clean and behaves incorrectly because the real requirement was sitting in a stale doc, a Slack thread, or one person’s head.
Second, they split the change. Add observability first. Then ship the new path behind a flag. Then migrate one low-risk cohort. Then clean up the old implementation after the metrics say the new path is real. DORA’s small-batch guidance matters here because smaller changes are easier to reason about, easier to verify, and easier to recover from when something goes sideways.
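The flag-then-cohort step is mundane to implement, which is part of the argument: the hard part is deciding to do it, not writing it. A sketch of deterministic percentage rollout, assuming a hypothetical flag name and helper functions:

```python
import hashlib


def in_rollout(user_id: str, flag: str, percent: int) -> bool:
    """Deterministic percentage rollout: hash the user into 100 buckets.

    The same user always lands in the same bucket for a given flag,
    so widening from 1% to 10% only ever adds users, never flaps them.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent


def handle_retry(user_id: str, legacy_path, new_path):
    # Hypothetical flag name; start at 1% after observability ships,
    # widen only when the metrics say the new path is real.
    if in_rollout(user_id, "billing-retry-v2", percent=1):
        return new_path()
    return legacy_path()
```

The design choice worth noticing is determinism: hashing instead of random sampling means a rollout decision is reproducible during an incident, which is exactly when you need to know which path a given customer took.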
Third, they choose verification proportional to risk. Not every change deserves the same treatment. Maybe this needs contract tests for downstream assumptions, a specific metric for retry storms, one alert for queue depth, and a rollback plan written before release day instead of during it.
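The queue-depth alert is a good example of verification sized to risk, because the naive version pages on every spike and trains people to ignore it. A sketch of a sustained-threshold check, with the threshold and window values invented for illustration:

```python
def queue_depth_alert(depth_samples: list[int],
                      threshold: int = 10_000,
                      sustained: int = 3) -> bool:
    """Fire only when depth exceeds the threshold for `sustained`
    consecutive samples, so a single spike does not page anyone."""
    run = 0
    for depth in depth_samples:
        run = run + 1 if depth > threshold else 0
        if run >= sustained:
            return True
    return False
```

An alert that only fires on sustained breach is less sensitive by design. That is the trade being made explicitly here: a slightly slower page in exchange for a page people still believe.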
Fourth, they pull in the right reviewer. Not the closest approver. The person who actually owns the failure modes. That sounds obvious. It is also one of the easiest places for teams to fake diligence.
Fifth, they externalize what they learned. A short ADR. A runbook update. A few lines in the PR explaining why the rollout order matters. This is where trust stops living in one person’s memory and starts becoming team infrastructure.
Nothing about that workflow is glamorous. That is one reason people underrate it. Seniority is often boring when done well. It feels slightly slower in the first hour and much faster three weeks later.
GitHub’s account of rebuilding its on-call culture is basically this lesson at organizational scale. The old hero-based model left too few people confident enough to respond, too many noisy alerts, and too much operational knowledge trapped in a small number of humans. The fix was not more brilliance. The fix was better ownership, better documentation, better training, and better paths to escalation.
The problem is usually not raw implementation speed. It is whether the team has a reliable way to know when speed is safe.
Senior engineers turn private certainty into public artifacts. They leave behind tests people believe, docs people can find, alerts people can act on, and release plans people can reverse.
* * *
AI makes the distinction harsher
The lazy version of the AI debate asks whether code generation makes senior engineers less important.
The equally lazy version says it makes them infinitely more important.
Both miss the mechanism.
What actually changes is the economics of verification. When code gets cheaper to produce, weak evidence gets more expensive to tolerate.
That is why the most useful recent AI research is not the benchmark chest-thumping. It is the work that exposes miscalibration.
A 2025 randomized controlled trial of experienced open-source developers is a good example. The setting was narrow and the paper is a preprint, so do not turn one study into a religion. But it is still one of the best reality checks we have. Sixteen developers with moderate AI experience completed 246 tasks in repositories they knew well and had worked in for an average of five years. Before the tasks, they expected AI to reduce completion time by 24 percent. After the tasks, they still believed AI had helped by about 20 percent. Measured reality went the other direction: allowing AI increased completion time by 19 percent.
The interesting part is not only the slowdown. It is the confidence gap.
Developers felt faster. They preferred the experience. They predicted a gain before and after. The measured outcome was worse.
Preference is not performance.
That does not mean AI is useless. It means local fluency is not the same as system outcome. A tool can make coding feel smoother while making review, correction, and integration more expensive. That is a trust problem before it is a tooling problem.
Industry guidance is converging on the same point. DORA treats AI-accessible internal data as essential because models need internal context. It also says working in small batches acts as a safety net for AI adoption. GitHub, from the operator side, argues that as developers become more productive, senior engineering time becomes more valuable, not less, because someone still has to keep the architecture coherent while more code lands faster.
That is the shift. AI does not eliminate senior judgment. It makes bad judgment scale better.
The useful question is not, “Can AI write this?” It is, “What will we trust after AI writes this?”
Can we trust the requirement? The test coverage? The blast-radius analysis? The monitoring? The rollback? The ownership mapping? The explanation in the PR? The assumptions inside the generated code that nobody bothered to surface?
AI gives teams more rope and a nicer font. That is not the same thing as safety.
As the volume of plausible code rises, the premium on calibrated trust rises with it. Someone has to know which dashboards are vanity dashboards, which docs are stale, which reviewer is rubber-stamping, which migration must be reversible, and which green test suite is lying.
That someone is doing senior work, whether or not they wrote the most code that week.
* * *
Hire and promote for evidence quality
If your hiring loop still quietly optimizes for speed, you are selecting for theater.
Google’s guidance on structured interviewing is the adult version of hiring: vetted questions, standardized rubrics, and interviewer calibration instead of gut feel. Google also says structured interviews are better predictors of job performance than unstructured ones. Monzo’s published senior interview process is directionally similar. It includes systems design and pair coding, not just a theatrical speed trial. That makes sense because actual senior work is not a race to type. It is a sequence of judgments under incomplete information.
So ask questions that force candidates to reveal their trust model.
Ask when they changed their mind because the evidence changed. Ask when they distrust a passing test suite. Ask how they would split a risky change into releasable steps. Ask about a time they escalated early instead of pretending to be a hero. Ask when they chose an existing solution over building a new one. Ask how they helped another engineer make a better decision.
Those questions are not softer than coding questions. They are harder to fake.
The same principle should shape promotion.
Promotion to senior should not mostly reward the engineer with the most obviously busy commit graph. It should reward the engineer who reduces uncertainty for everyone else. The one who makes code review sharper. The one who kills flaky tests. The one who writes the runbook somebody else will need at midnight. The one who notices the rollout is too big. The one who turns tribal knowledge into a document. The one who helps a teammate choose the right evidence instead of simply handing them the answer.
Dropbox’s framework is useful here because it makes this work legible: reducing toil, updating playbooks, mentoring, designing for reliable rollout, and making informed decisions in ambiguous situations. Monzo’s framework makes the same move with different labels: quality bar, ambiguity handling, incremental rollout, mentoring, and multiplier effect.
This is the mature version of seniority. Not “I know the most.” Not “I can type the fastest.” Not “I can save the day personally.” It is “I know where confidence should come from, and I leave the system better calibrated after I touch it.”
Teams that still reward visible output over evidence quality will get exactly what they asked for: a lot of code and a very creative incident schedule.
As AI makes code cheaper, the scarce skill is not generation. It is refusal to turn plausible text into production truth without earning the right to trust it.
That is the new senior engineer.
Or, more annoyingly: when your team says it values senior judgment, does it actually reward the people who generate more code, or the people who stop the wrong code from becoming reality?
* * *
Notes and References
1. Monzo. Engineering Progression Framework v4.0. 2025.
2. Dropbox. Engineering Career Framework: IC4 Software Engineer.
3. Li, Paul Luo, Amy J. Ko, and Andrew Begel. What Distinguishes Great Software Engineers? Empirical Software Engineering, 2019.
4. Forsgren, Nicole, Margaret-Anne Storey, Chandra Maddila, Thomas Zimmermann, Brian Houck, and Jenna Butler. The SPACE of Developer Productivity. ACM Queue, 2021.
5. Google SRE. Stress Testing: Build Confidence in System.
6. Google Testing Blog. Just Say No to More End-to-End Tests. 2015.
7. DORA. AI-accessible internal data. 2026.
8. DORA. Working in small batches. 2025.
9. DORA. DORA’s software delivery performance metrics. 2026.
10. GitHub. Building On-Call Culture at GitHub. 2021.
11. GitHub. How AI is reshaping developer choice (and Octoverse data proves it). 2026.
12. Becker, Joel, Nate Rush, Elizabeth Barnes, and David Rein. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. arXiv:2507.09089, 2025.
13. Google re:Work. A guide to structured interviewing for better hiring practices. Updated 2026.
14. Monzo. Demystifying the Senior Staff+ Engineering interview process. 2025.

