How to Make Your AI Colleague Great Again
Standards, context, and process as the real onboarding system for agents.
A lot of AI disappointment is self-inflicted. If you want an AI teammate to be useful, treat onboarding as a systems problem. Standards, context, access boundaries, and escalation paths do more than one heroic prompt ever will.
The wrong debate
The wrong debate is whether the agent is smart enough in the abstract. The useful question is whether it has actually been onboarded into your way of working.
What makes this argument useful is that it forces the conversation back to the workflow. When teams under-specify the work, they push uncertainty downstream where it gets rediscovered as rework, second-guessing, and inconsistent output.
A human junior gets examples, naming conventions, escalation paths, and the awkward local rules nobody puts in the architecture diagram. A synthetic junior needs the same categories of help, just packed into artifacts instead of hallway conversations.
This sounds obvious, but teams miss it all the time because the missing pieces do not look glamorous. A clear standard, a good example, a scoped permission, or a crisp definition of done rarely wins the demo. They do, however, win the week.
What actually matters
When teams say an agent is unreliable, a lot of the time they are describing terrible onboarding. A human teammate gets a repo tour, examples, naming conventions, escalation rules, and the awkward bits nobody writes down until they have to. The model gets a prompt and some vague optimism. Then everyone acts shocked when it behaves like a contractor who was dropped into the codebase through a skylight. If you want better output, start by making the work environment legible.
Standards are not bureaucracy here. They are compression. A short, explicit set of conventions lets the agent spend tokens on the task instead of rediscovering how your team names files, structures tests, or documents migrations. This sounds obvious, but teams miss it all the time because standards feel less exciting than model demos. Boring clarity wins anyway.
The wrong debate is usually about phrasing. What actually matters is whether the agent can see the local truth: architecture notes, examples of good changes, known landmines, recent decisions, and the shape of the task. Context is not extra seasoning on top of the prompt. It is the difference between asking for help from a teammate who knows the system and one who just arrived from another planet.
If I were fixing this tomorrow, I would not start with a new model. I would start by making the work easier to delegate: clearer context, cleaner standards, safer permissions, and better checks. Once those exist, model upgrades start to matter more.
What actually scales is not a clever one-off prompt but a reusable control surface: shared standards, reusable context, bounded permissions, visible examples, and cheap checks. Once those are in place, a stronger model helps. Before they exist, model upgrades mostly rearrange disappointment.
The useful question is not whether the agent can do something in principle. It is whether the task has been defined clearly enough, instrumented cheaply enough, and bounded safely enough to be delegated in the first place.
A workflow I would actually trust
One workflow I like is a small context pack beside the work: a task brief, repo map, relevant files, coding standards, definition of done, risky surfaces, and examples of good output. The agent gets that package every time. Humans do too. The result is fewer clarifying loops and less weird improvisation, because the system no longer has to infer what the team meant but forgot to write down.
Written out, that pack looks boring in exactly the right way: a short statement of responsibility, a repo map, a definition of done, a list of risky surfaces, examples of acceptable changes, and explicit escalation rules. Give the same package to a model and to a new teammate and both will ask fewer confused questions. The interesting part is not that the agent gets better. The interesting part is that the team has finally written down what it claims to value.
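To make that concrete, here is a minimal sketch of what such a pack could look like if you kept it as data beside the repo. Everything in it is an assumption for illustration: the field names, the rendering format, and the single as_brief method are one possible shape, not a prescribed one.

```python
from dataclasses import dataclass


def _bullets(items: list[str]) -> str:
    """Format a list of short strings as a plain bulleted block."""
    return "\n".join(f"- {item}" for item in items)


@dataclass
class OnboardingPack:
    """Hypothetical context pack handed to an agent (or a new teammate) with every task."""

    responsibility: str            # short statement of what this lane owns
    repo_map: str                  # where the relevant code and docs live
    standards: list[str]           # naming, testing, and migration conventions
    definition_of_done: list[str]  # what "finished" means for work in this lane
    risky_surfaces: list[str]      # areas where mistakes are expensive
    example_changes: list[str]     # paths or links to known-good prior work
    escalation_rule: str           # when to stop and ask a human

    def as_brief(self, task: str) -> str:
        """Render the pack plus one concrete task into the brief everyone receives."""
        return "\n\n".join([
            f"Task: {task}",
            f"Responsibility: {self.responsibility}",
            f"Repo map: {self.repo_map}",
            "Standards:\n" + _bullets(self.standards),
            "Definition of done:\n" + _bullets(self.definition_of_done),
            "Risky surfaces:\n" + _bullets(self.risky_surfaces),
            "Examples of acceptable changes:\n" + _bullets(self.example_changes),
            f"Escalation rule: {self.escalation_rule}",
        ])
```

The render step is the part worth copying: humans and agents receive the same brief, so the pack doubles as the documentation the team claims to have.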
This is also why onboarding and governance are closer than they look. Once permissions, expectations, and failure paths are explicit, trust becomes operational rather than emotional. You are not hoping a synthetic teammate behaves. You are shaping the conditions under which useful behavior is more likely and costly mistakes are easier to catch.
The measurement question matters because teams are excellent at narrating improvement that they have not actually checked. I would track at least five things: time to a usable first draft, number of clarification loops, amount of human cleanup, defect or regression rate after merge, and how often a rollback or rework was needed. Those numbers are boring, which is exactly why they are useful.
They also prevent a common trap. An agent can feel fast because it generates a lot of text or code quickly while still making the total loop slower once review, debugging, and cleanup are included. If the workflow is meant to ship, the metric has to live closer to shipped outcome than to demo charisma.
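If you want those numbers tracked rather than narrated, something this small is enough to start with. The metric names, units, and summary logic below are assumptions, a sketch of the bookkeeping rather than a reporting standard.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class DelegationRecord:
    """One delegated task, measured over the whole loop rather than the first draft."""

    minutes_to_first_usable_draft: float
    clarification_loops: int
    human_cleanup_minutes: float
    regressions_after_merge: int
    needed_rollback_or_rework: bool


def summarize(records: list[DelegationRecord]) -> dict[str, float]:
    """Boring aggregate view of whether delegation is actually getting cheaper."""
    return {
        "avg_minutes_to_first_usable_draft": mean(r.minutes_to_first_usable_draft for r in records),
        "avg_clarification_loops": mean(r.clarification_loops for r in records),
        "avg_human_cleanup_minutes": mean(r.human_cleanup_minutes for r in records),
        "avg_regressions_after_merge": mean(r.regressions_after_merge for r in records),
        "rollback_or_rework_rate": mean(1.0 if r.needed_rollback_or_rework else 0.0 for r in records),
    }
```

Append one record per delegated task and watch whether the aggregates move. That is the entire measurement system, and it is enough to catch the trap described above.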
Why this is credible
This framing lines up with public engineering guidance from Anthropic, which has argued that successful agent systems are usually built from simple, composable patterns rather than ornate frameworks, and that context engineering is a more useful frame than prompt engineering once agents enter real workflows. NIST lands in a similar place from a governance angle: trustworthiness comes from lifecycle controls around the system, not from model confidence alone.
A useful agent is not the model you bought. It is the workflow you bothered to build.
The public evidence base is not perfect, but it points in a consistent direction. Anthropic’s engineering writing has repeatedly emphasized that effective agents depend on simple, composable patterns, strong context, and good tool design rather than prompt theatrics alone. The newer harness work strengthens the same lesson for longer-running tasks: the surrounding workflow determines whether capability survives contact with reality.
That sits well beside NIST’s risk-management framing, which treats trustworthy AI as a systems problem, and beside the DORA and METR findings that remind teams not to confuse subjective speed with shipped value. I would not overstate any single study. But taken together, the case for better harnesses is much stronger than the case for endless prompt superstition.
The objection that sounds smarter than it is
A common objection is that all this structure slows people down and that the real winners will simply use more capable models more aggressively. I think that gets the timing backward. Loose workflows can feel faster at first because they externalize ambiguity onto later review and cleanup. Structured workflows feel slower only until the same task appears for the fifth or fiftieth time.
That is when the boring scaffolding starts compounding. The brief exists. The examples exist. The checks exist. The team no longer burns senior attention rediscovering the same missing assumptions. Good harnesses are not anti-speed. They are how speed survives contact with scale.
Where I would start this week
If the thesis of this article is right, the first move is not a bigger model budget. It is one cleaner delegation lane. Pick a recurring task and make the surrounding expectations unambiguous enough that both humans and agents can run it the same way.
· Choose one recurring task with a clear business owner and a low-to-moderate blast radius.
· Write a reusable task brief with goal, scope, relevant files or context, and a definition of done (a worked sketch follows this list).
· Add one or two examples of acceptable output and one explicit escalation rule for uncertainty.
· Measure re-asks, human cleanup, and whether the task actually ships more smoothly after the change.
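As a worked example of the checklist above, here is roughly what the first lane could look like once it is written down. The task, file paths, example links, and threshold are invented placeholders; the point is only that every field is explicit and checkable.

```python
# A hypothetical first delegation lane, written down in one place.
# Task, paths, example links, and the escalation threshold are invented placeholders.
task_brief = {
    "task": "Add a nullable column plus migration to the orders table",
    "goal": "Ship the schema change without downtime on the main service",
    "scope": ["migrations/", "models/orders.py"],
    "definition_of_done": [
        "Migration applies and rolls back cleanly in staging",
        "Model change covered by a test",
        "Changelog entry follows the team template",
    ],
    "acceptable_examples": ["link to a prior migration change", "link to a second one"],
}


def should_escalate(touches_risky_surface: bool, self_reported_confidence: float) -> bool:
    """The one explicit escalation rule: stop and ask a human instead of guessing."""
    return touches_risky_surface or self_reported_confidence < 0.7  # threshold is an assumption
```

Measuring the last bullet then amounts to appending one record per run, as in the earlier sketch, and watching whether re-asks and cleanup actually fall.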
The point is not to create a sacred process document. It is to make one useful loop more legible, more repeatable, and easier to trust. If that feels slightly boring, good. The boring version is usually the version that ships.
Notes
1. Anthropic (2024), “Building effective agents,” published December 19, 2024.
https://www.anthropic.com/research/building-effective-agents
2. Anthropic (2025), “Effective context engineering for AI agents,” published September 29, 2025.
https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
3. Anthropic (2025), “Effective harnesses for long-running agents,” published November 26, 2025.
https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
4. Anthropic (2025), “Writing effective tools for AI agents—using AI agents,” published September 11, 2025.
https://www.anthropic.com/engineering/writing-tools-for-agents
5. NIST (2023), “Artificial Intelligence Risk Management Framework (AI RMF 1.0),” NIST AI 100-1.
https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
6. NIST (2024), “Artificial Intelligence Risk Management Framework: Generative AI Profile,” NIST AI 600-1.
https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.600-1.pdf
7. Google Cloud / DORA (2025), “How are developers using AI? Inside our 2025 DORA report,” published September 23, 2025.
https://blog.google/innovation-and-ai/technology/developers-tools/dora-report-2025

