Your Agent Isn't Tired. It's Drowning in Context
The useful analogy between humans and LLMs is not consciousness. It is that both fall apart when active memory, compression, and management are sloppy.
You can feel this problem in your own skull first.
After a long day, you stop failing heroically and start failing stupidly. You lose the thread of a conversation that is still happening. You reread the same paragraph. You forget the constraint you wrote down an hour ago.
Then you watch a long-running agent do the machine version of the same thing. It misses the key detail buried in the middle of the prompt. It keeps circling a dead end. It summarizes a tool result so aggressively that the only useful fact disappears. It is not malicious. It is overloaded.
People want to turn this into philosophy. Are humans just next-token predictors? Are LLMs baby minds? I think that is the wrong debate.
The useful question is not whether humans and models are the same kind of thing. It is whether they share enough operational constraints that the comparison makes us better builders. On that question, the answer is yes.
The big similarity is boring and extremely useful: both humans and LLM systems have a limited active workspace sitting in front of a much larger store of potential information. When the workspace gets cluttered, performance gets weird.
That is why the analogies land. Brain fog and context rot feel related. Note-taking and RAG feel related. Bad evidence poisoning human recall and bad retrieval poisoning agent output feel related.
But only if you keep the guardrails on. Human fatigue is biological. Human memory changes during sleep. LLM memory is usually an engineered stack of context, summaries, retrieval, logs, and maybe a persistent memory layer. Treat the analogy as workflow, not metaphysics, and it becomes surprisingly sharp.
* * *
The wrong debate is ontology
Human working memory is small. Depending on the framework, researchers put the core limit at around four chunks, or roughly three to five meaningful items. That is not a lot of live slots for planning, holding constraints, and comparing alternatives.
LLMs have the same kind of bottleneck in a different substrate. The context window is the obvious version, but the deeper issue is not just how much text fits. It is how much of that text the model can still use well.
This sounds obvious, but teams miss it all the time. They hear “128K” or “1M tokens” and mentally translate that into “the model can keep everything in mind.” No. A larger whiteboard is not the same thing as better working memory.
The Lost in the Middle paper made this painfully concrete: models often perform best when relevant information sits near the beginning or end of a long prompt, and worse when the key fact is buried in the middle. NoLiMa pushed the point further by stripping away easy literal matches. Performance fell sharply as contexts grew, even before models hit their advertised limits.
So the useful analogy is not “humans are language models.” It is “both systems have a fragile active state.” What actually matters is not raw storage. It is selection, ordering, salience, and refresh.
That is also why context engineering is not prompt decoration. It is workload design. You are deciding what gets desk space right now, what gets pushed to a filing cabinet, what gets summarized, and what gets dropped.
Preference is not performance. Developers prefer the fantasy that one giant context will remove the need for memory architecture. Performance usually comes from the opposite move: smaller working sets, tighter task framing, and more aggressive eviction of stale state.
* * *
Brain fog is context debt
When people say “brain fog” after a long day, they are usually pointing at slower thinking, weaker attention, and worse executive control under sustained cognitive effort. Reviews of cognitive fatigue describe exactly that sort of degradation.
The model-side analogue is real, but it is not literal fatigue. Models do not need coffee. They accumulate context debt.
Context debt shows up when a session becomes an archaeological site: too many tool results, too many half-resolved threads, too many summaries of summaries, too much retrieved material with no ranking discipline. The agent can still “see” a lot of text. It just stops allocating attention to the right parts of it.
This is why giving the model a break does nothing unless the break changes the state. A human benefits from sleep, food, a walk, or just stopping. An agent benefits from pruning, compaction, retrieval, and explicit memory writes.
Anthropic’s own engineering work makes the point in a very unromantic way. In a 2025 internal agentic-search evaluation, context editing alone improved performance by 29 percent over baseline, and context editing plus a memory tool improved it by 39 percent. In a 100-turn search evaluation, context editing also cut token consumption by 84 percent while letting workflows complete that would otherwise fail from context exhaustion.
That is the punchline. The lever is not empathy for the model. The lever is state management.
I think this matters because a lot of teams still debug agents as if the main problem were intelligence. Sometimes it is. Much more often the agent is doing the exact wrong thing with the exact wrong pile of context that you handed it.
* * *
Memory is compression, not recording
Human memory is not a replay buffer. It is reconstructive. Reviews of hippocampal-neocortical memory transformation describe a shift from detail-rich traces toward gist-like and schematic representations over time. Sleep-dependent replay is thought to help the hippocampus teach the neocortex, turning specific episodes into more reusable knowledge.
That is a terrible metaphor for most production AI memory features, which is exactly why people keep getting confused.
In most deployed systems, “memory” means one of four things: model weights, current context, a retrieved external store, or a persistent note that can be reinserted later. The original RAG paper called this split parametric memory versus non-parametric memory. That distinction still does a lot of useful work.
OpenAI’s current memory documentation is explicit that saved memories are stored separately from chat history. Anthropic’s current memory tooling is explicit in the other direction: the memory tool is client-side, and developers control where the data is stored. In both cases, the practical picture is much closer to notebook plus recall policy than to anything like autonomous consolidation.
So when teams say, “the model learned this during the conversation,” I usually flinch. Unless you retrained or fine-tuned something, it probably did not learn in the human sense. It wrote a better sticky note.
This is also where note-taking becomes a stronger analogy than people realize. The classic note-taking literature distinguishes between an encoding effect and an external-storage effect. Writing notes can help you learn. Reviewing notes later also helps you remember.
Agent memory writes usually get only half of that. They are storage plus future lookup. Human note-taking is storage plus self-training.
That asymmetry explains why dumping every observation into a vector store feels disappointing. You built retention, not understanding.
The useful question is not whether to use long context or RAG as a religion. Recent comparisons suggest that long context can outperform RAG when you spend enough tokens, especially on question answering over dense sources, while RAG keeps real cost advantages and can still win on some dialogue or general-query settings. Anthropic’s Contextual Retrieval work also showed how much retrieval quality matters: their method reduced failed retrievals by 49 percent, and by 67 percent when combined with reranking.
What actually matters is tiering. Stable policy belongs in persistent memory. Source material belongs in retrieval. Immediate objectives belong in the live working set. Decisions and unresolved issues belong in a rolling summary. And facts you want the base model to truly internalize belong in training, not in wishful thinking.
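That tiering rule can be sketched as a simple router. Everything here is illustrative: the tier names, item kinds, and `route` function are assumptions for the sketch, not any particular framework’s API:

```python
from dataclasses import dataclass
from enum import Enum, auto

class Tier(Enum):
    """Where a piece of information should live. Names are illustrative."""
    PERSISTENT = auto()       # stable policy, saved across sessions
    RETRIEVAL = auto()        # source material, fetched on demand
    WORKING_SET = auto()      # immediate objectives, live in context
    ROLLING_SUMMARY = auto()  # decisions and unresolved issues
    TRAINING = auto()         # facts the base model should internalize

@dataclass
class Item:
    text: str
    kind: str  # "policy" | "source" | "objective" | "decision" | "core_fact"

def route(item: Item) -> Tier:
    """Map item kinds to tiers, following the tiering rule in the text."""
    return {
        "policy": Tier.PERSISTENT,
        "source": Tier.RETRIEVAL,
        "objective": Tier.WORKING_SET,
        "decision": Tier.ROLLING_SUMMARY,
        "core_fact": Tier.TRAINING,
    }[item.kind]
```

The point of writing it down, even as a toy, is that the routing decision becomes explicit and reviewable instead of implicit in whatever happened to land in the prompt.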
* * *
Bad evidence scales faster than good memory
There is one more similarity that matters because it wrecks outputs quietly: both humans and models are vulnerable to bad evidence.
The misinformation effect literature showed long ago that people can absorb false post-event information about something they actually experienced and later recall the corrupted version. That is one of the clearest reminders that human memory is reconstructive, not camera-like.
Agent systems have the machine version of the same problem. A retrieved chunk can be irrelevant, noisy, outdated, or simply false. Once it enters the working set, the model often treats it as the problem definition. The right answer can still be somewhere in the system, and the run can still fail because the active evidence stream got polluted.
This is why provenance is not paperwork. It is part of reasoning quality.
Anthropic ran into the practical version of this while building its research system: human testers found that early agents sometimes preferred SEO-optimized content farms over better sources. That is not a cute bug. That is the whole ballgame. If the retrieval layer rewards the wrong evidence, the model will faithfully produce polished nonsense from the wrong pile of facts.
A lot of teams respond to this by adding more memory. Usually that makes it worse. Bad evidence that gets summarized, stored, and reintroduced later stops being a transient mistake and becomes durable contamination.
The useful rule is simple. Never let unaudited retrieval jump straight into persistent memory. First rank it. Then attribute it. Then compare it against an independent source if the claim matters. Only then let it harden into summary state or a saved fact.
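A minimal sketch of that gate, assuming hypothetical `Chunk`, `audit`, and `maybe_persist` names; the score threshold and corroboration hook are placeholders for whatever ranking and cross-checking your system actually uses:

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class Chunk:
    text: str
    source: str   # provenance: where this claim came from
    score: float  # retrieval or rerank score

def audit(chunk: Chunk,
          min_score: float,
          corroborate: Optional[Callable[[Chunk], bool]] = None) -> bool:
    """Gate a retrieved chunk before it may harden into persistent memory.
    1. Rank: drop low-scoring chunks.
    2. Attribute: refuse anything without provenance.
    3. Corroborate: if a check is supplied, the claim must pass it.
    """
    if chunk.score < min_score:
        return False
    if not chunk.source:
        return False
    if corroborate is not None and not corroborate(chunk):
        return False
    return True

def maybe_persist(chunk: Chunk, memory: List[Chunk], **kwargs) -> None:
    # Unaudited retrieval never jumps straight into persistent memory.
    if audit(chunk, **kwargs):
        memory.append(chunk)
```

The design choice that matters is that persistence is the last step, not a side effect of retrieval.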
For humans, the equivalent is obvious. We do better when we write down where a claim came from, compare notes, and correct errors quickly before they become the story we now remember. For agents, the same discipline applies, just with less dignity and more logs.
Good memory architecture is therefore inseparable from source hygiene. The problem is usually not that the system forgot. It is that it remembered the wrong thing too confidently.
* * *
A memory stack I’d actually ship
Here is the stack I would use for a long-running coding or research agent working on something nontrivial.
First, a task brief. One screen, maybe two. Objective, hard constraints, success criteria, source hierarchy, and the one thing the agent must not forget. This stays pinned near the top of context. It is the machine equivalent of the ticket you keep open on the second monitor.
Second, a tiny working set. Only the files, snippets, tool results, and subproblems relevant to the current move. Not the whole repo. Not the whole transcript. Just the live battlefield.
Third, a rolling decision log. After each major step, write down what changed, why it changed, what remains unresolved, and what evidence justified the move. This is the part most teams skip, and then they wonder why the agent starts contradicting itself 40 turns later.
Fourth, an evidence store. Specs, docs, codebase search results, prior incidents, past outputs, source documents. Searchable, ranked, and aggressively deduplicated. If this store turns into a junk drawer, the agent’s reasoning will faithfully inherit the junk.
Fifth, a handoff note. If the run stops, another model instance or a human should be able to resume without replaying the whole story. Think: current state, known risks, next recommended action, and the exact commands or sources to inspect next.
Notice what is missing: “just shove everything into the prompt and pray.”
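The five pieces of the stack can be written down as plain data structures. This is an illustrative shape, not a prescribed schema; every class and field name here is an assumption made for the sketch:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class TaskBrief:
    """One screen, pinned near the top of context."""
    objective: str
    hard_constraints: List[str]
    success_criteria: List[str]
    source_hierarchy: List[str]  # which sources outrank which
    must_not_forget: str         # the one pinned fact

@dataclass
class DecisionEntry:
    """One entry in the rolling decision log, written after each major step."""
    what_changed: str
    why: str
    unresolved: List[str]
    evidence: List[str]          # pointers into the evidence store

@dataclass
class HandoffNote:
    """Enough for another instance, or a human, to resume without a replay."""
    current_state: str
    known_risks: List[str]
    next_action: str
    inspect_next: List[str]      # exact commands or sources to look at

@dataclass
class AgentMemory:
    brief: TaskBrief
    working_set: List[str] = field(default_factory=list)        # only live items
    decision_log: List[DecisionEntry] = field(default_factory=list)
    evidence_store: Dict[str, str] = field(default_factory=dict)  # keyed by source
    handoff: Optional[HandoffNote] = None
```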
Imagine a repo-migration agent moving a medium-sized service from one internal SDK to another.
If you dump the whole codebase, migration guide, old tickets, and prior failed attempts into one huge prompt, the agent will look busy and unreliable at the same time. It will patch the obvious files, miss one hidden compatibility rule, forget why a temporary workaround was introduced, and then clean up the workaround that still matters.
With the stack above, the run looks different. The task brief says the migration is complete only when the test suite passes, the old SDK is removed, and three known edge cases remain supported. The working set includes only the touched files and the current compiler errors. The decision log records why a shim was kept. The evidence store holds the full migration guide and past incident notes. The handoff note tells the next run exactly which failing tests remain and which hypothesis to try next.
That is not a smarter model. It is a less amnesic workplace.
This is not fancy. It is the same pattern good teams use with humans. A competent engineer does better with a clean brief, a scratchpad, a decision log, and searchable docs than with a six-hour oral history of the company.
The difference is that humans can sometimes reconstruct missing intent from context, social cues, and common sense. Models are much more literal. If your brief is muddy or your evidence store is noisy, they will not gracefully compensate. They will operationalize the confusion.
One more thing: memory writes should be selective and typed. Do not save “something important happened.” Save “decision,” “fact with source,” “open question,” “user preference,” or “risk.” Bad memory schemas create memory sludge, and memory sludge is just another name for future context rot.
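One way to enforce selective, typed writes is to make the schema reject sludge at write time. A minimal sketch with hypothetical names; the five types mirror the list above, and the provenance rule is one plausible policy, not a standard:

```python
from dataclasses import dataclass
from typing import Optional

VALID_TYPES = {"decision", "fact", "open_question", "user_preference", "risk"}

@dataclass
class MemoryWrite:
    type: str
    text: str
    source: Optional[str] = None  # provenance, required for facts

    def __post_init__(self):
        # "Something important happened" is not a type; reject it.
        if self.type not in VALID_TYPES:
            raise ValueError(f"untyped memory write: {self.type!r}")
        # A fact without a source is future context rot; reject it too.
        if self.type == "fact" and not self.source:
            raise ValueError("facts must carry a source")
```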
* * *
Managing agents is management without the human parts
This is the part where the analogy becomes slightly annoying for managers.
A surprising amount of agent work really does look like management. You define the task. Bound the scope. Give the right tools. Preserve the important context. Review outputs. Tighten the feedback loop. Anthropic’s guidance on building effective agents keeps coming back to the same boring point: simple, composable patterns beat elaborate frameworks more often than teams want to admit.
But managing agents is not people management. It is management with the tacit social repair stripped out.
Humans ask clarifying questions. Or at least the good ones do. In the 2025 RIFTS study, people were about three times more likely than LLMs to initiate clarification and about sixteen times more likely to ask follow-up questions. Early grounding failures also predicted downstream breakdowns. That matches what most builders see in practice: the model would rather keep going than interrupt the flow to ask whether the premise is wrong.
With a human, ambiguity can sometimes heal socially. Someone notices the weirdness, reads the room, and asks, “Wait, which database do you mean?” With an agent, ambiguity often gets laundered into action. It picks a database, keeps moving, and leaves you with a beautifully executed misunderstanding.
That is why good agent managers become obsessive about four things.
They define escalation rules: when to ask, when to search, when to stop, when to hand off.
They design tools like interfaces, not magic powers. Anthropic’s multi-agent research system only became reliable when the orchestrator learned to delegate clearly and tool descriptions became distinct enough to prevent duplicated work, gaps, and wrong-path behavior.
They build observability. Agents are easier to inspect than humans. Prompts, tool calls, summaries, and logs are all visible. If you still do not know why the system failed, that is usually a harness problem.
And they evaluate early with small, real tasks instead of waiting for an imaginary perfect benchmark. Again, the lesson from Anthropic’s agent work is boring and right: tight feedback loops beat grand architecture.
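The first of those four habits, escalation rules, can be made explicit as a policy function. Everything here is illustrative: the signals, thresholds, and `Action` names are placeholders for whatever your harness actually measures:

```python
from enum import Enum, auto

class Action(Enum):
    ASK = auto()      # interrupt and request clarification
    SEARCH = auto()   # consult the evidence store before acting
    STOP = auto()     # halt rather than act on a bad premise
    HANDOFF = auto()  # write a handoff note and escalate
    PROCEED = auto()

def escalate(ambiguity: float, evidence_gap: bool,
             budget_left: int, irreversible: bool) -> Action:
    """Illustrative escalation policy; thresholds are placeholders."""
    if irreversible and ambiguity > 0.3:
        return Action.ASK        # never launder ambiguity into irreversible action
    if evidence_gap:
        return Action.SEARCH
    if budget_left <= 0:
        return Action.HANDOFF
    if ambiguity > 0.7:
        return Action.STOP
    return Action.PROCEED
```

The value is not the thresholds. It is that “when to ask, when to search, when to stop, when to hand off” stops being a vibe and becomes a testable rule.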
I think the cleanest framing is this: managing agents is less like motivating employees and more like designing a cockpit. The job is not to inspire the pilot. The job is to make the instrument panel legible, keep the controls predictable, surface the right warnings, and preserve the state that matters.
The wrong debate is whether agents are becoming more human. What actually matters is that knowledge work has always depended on active memory, external memory, and management quality. LLM systems just make that dependency brutally literal.
So yes, the analogy works. Brain fog and context rot rhyme. Note-taking and RAG rhyme. Bad evidence corrupts both human recall and machine output. But the practical lesson is not mystical. It is architectural.
If your agent keeps losing the plot, buy less magic and build more memory. Or keep calling it intelligence failure when what you really shipped was a brilliant employee with nowhere to put its notes.
* * *
Notes and References
1. Nelson Cowan, “On the capacity of attention: Its estimation and its role in working memory and cognitive aptitudes,” Cognitive Psychology 51, no. 1 (2005).
2. Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang, “Lost in the Middle: How Language Models Use Long Contexts,” Transactions of the Association for Computational Linguistics 12 (2024): 157-173.
3. Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, and Hinrich Schütze, “NoLiMa: Long-Context Evaluation Beyond Literal Matching,” arXiv:2502.05167 (2025).
4. Mathias Pessiglione and colleagues, “Origins and consequences of cognitive fatigue,” Trends in Cognitive Sciences (2025).
5. Jessica Robin and Morris Moscovitch, “Details, gist and schema: hippocampal-neocortical interactions underlying recent and remote episodic and spatial memory,” Current Opinion in Behavioral Sciences 17 (2017).
6. Dhairyya Singh, Kenneth A. Norman, and Alison C. Schapiro, “A model of autonomous interactions between hippocampus and neocortex driving sleep-dependent memory consolidation,” Proceedings of the National Academy of Sciences 119, no. 44 (2022).
7. Kenneth A. Kiewra, “A review of note-taking: The encoding-storage paradigm and beyond,” Educational Psychology Review 1 (1989): 147-172.
8. Patrick Lewis and colleagues, “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks,” NeurIPS 33 (2020).
9. Zhuowan Li, Cheng Li, Mingyang Zhang, Qiaozhu Mei, and Michael Bendersky, “Retrieval Augmented Generation or Long-Context LLMs? A Comprehensive Study and Hybrid Approach,” arXiv:2407.16833 (2024).
10. Xinze Li, Yixin Cao, Yubo Ma, and Aixin Sun, “Long Context vs. RAG for LLMs: An Evaluation and Revisits,” arXiv:2501.01880 (2025).
11. Michael S. Ayers and Lynne M. Reder, “A theoretical review of the misinformation effect: Predictions from an activation-based memory model,” Psychonomic Bulletin & Review 5 (1998): 1-21.
12. Anthropic, “Contextual Retrieval in AI Systems,” September 19, 2024.
13. OpenAI, “Memory FAQ,” updated 2026; OpenAI, “Memory and new controls for ChatGPT,” updated June 3, 2025.
14. Anthropic, “Managing context on the Claude Developer Platform,” September 29, 2025; Anthropic, “Effective context engineering for AI agents,” September 29, 2025.
15. Anthropic, “Building effective agents,” December 19, 2024.
16. Anthropic, “How we built our multi-agent research system,” June 13, 2025.
17. Omar Shaikh and colleagues, “Navigating Rifts in Human-LLM Grounding: Study and Benchmark,” arXiv:2503.13975 (2025).

