Free Tools to Help You Stretch Your Tokens

RTK cuts terminal noise before it enters context. Caveman cuts the agent’s visible chatter. The useful trick is knowing which tokens each one actually saves.

Jun 01, 2026

The new flex is not having the biggest context window. It is keeping garbage out of the model in the first place.

This sounds obvious, but teams miss it all the time. They debate which model is smarter, which coding agent has the better benchmark chart, and whether the paid tier gives them enough room. Meanwhile, their agent is quietly stuffing the context with directory listings, dependency trees, test runner confetti, repeated stack traces, polite throat-clearing, and five paragraphs of commentary where one line would do.

The wrong debate is: how do I buy more room?

The useful question is: why am I filling the room with junk?

That is where tools like RTK and Caveman are interesting. Not because they are magic. Not because their headline compression numbers should be tattooed on your sprint board. They are interesting because they attack two different leaks in the agent loop.

RTK compresses terminal output before it becomes model input. Caveman compresses what the agent says back to you, and its companion compression workflow can reduce recurring instruction files. One sits at the command boundary. The other sits at the response and memory boundary.

Together, they point to a better operating habit for AI-heavy development: treat tokens as working memory, not as disposable exhaust.

* * *

The token bill is not one thing

When people say “tokens,” they usually mean a vague unit of AI consumption. That vagueness is how bad workflows hide.

OpenAI’s token documentation separates usage into input tokens, output tokens, cached tokens, and, for some advanced models, reasoning tokens. Tokens are not the same as words; the rough English rule of thumb is about four characters per token, with plenty of variation by language and text shape. [1]
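To make the four-characters rule concrete, here is a minimal sketch of a back-of-envelope estimator. It is not a real tokenizer, the function name `estimate_tokens` and the sample strings are my own illustrations, and real counts will drift with language and text shape. It is only good for comparing the rough size of two pieces of text.

```python
import math

# Rough English rule of thumb from the paragraph above, not a real tokenizer.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Return a coarse token estimate based on character count alone."""
    if not text:
        return 0
    return math.ceil(len(text) / CHARS_PER_TOKEN)


# Hypothetical example: a wall of passing-test lines versus a one-line summary.
raw_log = "PASSED tests/test_auth.py::test_login\n" * 200
summary = "200 passed, 0 failed\n"

print("raw log    ~", estimate_tokens(raw_log), "tokens")
print("summary    ~", estimate_tokens(summary), "tokens")
```

For billing or hard limits, use the provider's own token counter. This estimate is for spotting order-of-magnitude waste.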

For reasoning models, there is a second trap: the thinking may not be visible, but it still matters. OpenAI’s reasoning-model documentation says reasoning tokens are not visible through the API, but they occupy context-window space and are billed as output tokens. It also notes that input and output tokens from each step carry over in a multi-step conversation, while reasoning tokens are discarded. [2]
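A small simulation shows how that carry-over behaves across turns. This is a simplified sketch of the accounting described above, not a provider's billing logic: the `Turn` fields and the `simulate` function are hypothetical names, and caching is ignored.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    input_tokens: int      # new user or tool text added this turn
    output_tokens: int     # visible reply
    reasoning_tokens: int  # hidden thinking, billed as output


def simulate(turns: list[Turn]) -> list[dict]:
    """Track what enters context and what is billed as output, turn by turn."""
    carried = 0  # visible input and output retained from earlier turns
    rows = []
    for turn in turns:
        context_in = carried + turn.input_tokens
        billed_output = turn.output_tokens + turn.reasoning_tokens
        rows.append({"context_in": context_in, "billed_output": billed_output})
        # Reasoning tokens are discarded; visible input and output carry over.
        carried = context_in + turn.output_tokens
    return rows


if __name__ == "__main__":
    session = [Turn(1000, 200, 3000), Turn(500, 100, 0)]
    for number, row in enumerate(simulate(session), start=1):
        print(number, row)
```

The point of the model is that trimming input and visible output pays off on every later turn, while reasoning cost is paid once per turn and has to be tuned separately.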

Claude Code has the same practical warning in its cost guidance: thinking tokens are billed as output tokens, and the default budget can be tens of thousands of tokens per request depending on the model. Anthropic’s guidance says simpler tasks can use lower effort, disabled thinking, or a lower thinking budget. [3]

That gives us the first useful split.

There are tokens you send into the model. There are tokens the model generates back. There are hidden reasoning tokens. There are tokens retained across the conversation. Different tools affect different buckets.

Preference is not performance. A terse answer feels cheaper, but it does not necessarily reduce hidden reasoning. A compressed terminal log reduces context, but it can also hide the one line you needed. A smaller instruction file helps every session, but only if the compressed version preserves the rules that actually matter.

So the short version is simple: RTK helps with CLI output before it pollutes context. Caveman helps with visible assistant output and repeated prose instructions. Neither should be described as a universal reasoning-token shrink ray.

That distinction matters because token hygiene is a systems problem, not a prompt-style problem.

* * *

RTK: cut terminal sludge before the agent reads it

AI coding agents love the terminal. They run tests. They inspect files. They check git status. They read logs. They list directories. They ask the shell for truth because the shell is where the project pushes back.

The problem is that most terminal output was designed for humans glancing at a screen, not for a language model carrying every character into context.

RTK, short for Rust Token Killer, is a CLI proxy that filters and compresses command output before it reaches the LLM context. Its README describes a single Rust binary with more than 100 supported commands and claims 60-90% token reduction on common development operations, with the caveat that actual savings vary by project size. [5]

The important part is not the exact percentage. The important part is the boundary.

Without a filter, the agent runs a command and receives raw output. With RTK, the agent runs the command through a proxy that can trim, group, truncate, and deduplicate before the response lands in the model’s working memory. The README describes four core strategies: smart filtering, grouping, truncation, and deduplication. [5]
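To show what those four strategies mean in practice, here is a toy sketch in Python. It is not RTK's implementation (RTK is a Rust binary with per-command logic); the `compress` function, the noise patterns, and the `path:line: message` format are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Assumed noise patterns: progress-style lines an agent rarely needs.
NOISE = re.compile(r"^(Downloading|Resolving|Compiling)\b|^\s*\d+%")
# Assumed diagnostic shape: "path:line: message".
LOCATED = re.compile(r"^(?P<path>[^\s:]+):(?P<line>\d+): (?P<msg>.+)$")


def compress(output: str, max_lines: int = 20) -> str:
    lines = [ln.rstrip() for ln in output.splitlines()]

    # 1. Filtering: drop blank lines and progress-style noise.
    lines = [ln for ln in lines if ln and not NOISE.match(ln)]

    # 2. Deduplication: keep the first occurrence, annotate with a repeat count.
    counts = Counter(lines)
    lines = [
        f"{ln} (x{counts[ln]})" if counts[ln] > 1 else ln
        for ln in dict.fromkeys(lines)  # dict keys preserve first-seen order
    ]

    # 3. Grouping: collect located diagnostics under their file.
    grouped: dict[str, list[str]] = {}
    other: list[str] = []
    for ln in lines:
        match = LOCATED.match(ln)
        if match:
            entry = f"  {match['line']}: {match['msg']}"
            grouped.setdefault(match["path"], []).append(entry)
        else:
            other.append(ln)
    result = other[:]
    for path, items in grouped.items():
        result.append(f"{path} ({len(items)})")
        result.extend(items)

    # 4. Truncation: cap the output and say how much was cut.
    if len(result) > max_lines:
        dropped = len(result) - max_lines
        result = result[:max_lines] + [f"... {dropped} more lines truncated"]
    return "\n".join(result)
```

Note the truncation step reports how many lines it dropped. A filter that cuts silently is the kind that hides the one line you needed.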

That is boring in the best possible way.

A successful git push does not need the model to read object enumeration, delta compression, thread counts, and progress bars. It needs to know whether the push worked, where it went, and whether anything unusual happened. RTK’s example turns a verbose git push into something like “ok main.” [5]

A failing test run does not need 200 lines of passing test noise. It needs the failing tests, their assertions, and enough location data to act. RTK’s test wrappers aim to show failures only for common runners such as pytest, cargo test, go test, Jest, Vitest, and Playwright. [5]

A directory listing does not need every permission bit unless permissions are the problem. A compact tree is often better than a full ls -la dump. A git status does not need the full decoration parade if the agent only needs modified, staged, and untracked files.

This is why RTK belongs early in the loop. It reduces the amount of machine chatter that becomes model input. It also makes later turns cleaner, because less irrelevant terminal text is retained in the conversation history.

There are sharp edges.

RTK’s auto-rewrite hook works by rewriting Bash commands to RTK equivalents. The README notes that Claude Code built-in tools such as Read, Grep, and Glob bypass that Bash hook, so those workflows are not automatically rewritten. On native Windows, RTK filters work but the hook does not auto-rewrite commands; WSL gets the fuller hook experience. [5]

That Windows note is not trivia. If your setup is PowerShell-first, you should assume explicit RTK commands are safer than relying on magic. Use rtk git status, rtk tsc, rtk pytest, rtk read, rtk grep. Make the compression visible enough that you know when it is active.

The discipline I would use is conservative: compress routine inspection, but keep a raw escape hatch. For exploratory file discovery, compressed output is usually fine. For build logs, failures-only is usually fine. For security review, incident response, flaky race conditions, or migration diffs, I want the full output available somewhere, even if the agent first sees the compressed view.
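One way to keep that escape hatch is to save the full output to disk and hand the agent a short view plus the path. The sketch below assumes a hypothetical wrapper, `run_with_escape_hatch`, built only on the Python standard library. It is not an RTK feature, and the tail-only view is a deliberately crude stand-in for real filtering.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_with_escape_hatch(cmd: list[str], keep: int = 5) -> str:
    """Run a command, save raw output to a file, return a short view."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    raw = proc.stdout + proc.stderr

    # Persist the full output so nothing is lost to compression.
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", prefix="raw-", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(raw)
        raw_path = Path(handle.name)

    lines = raw.splitlines()
    tail = lines[-keep:]  # crude first pass: failures often surface near the end
    header = f"exit={proc.returncode} lines={len(lines)} raw={raw_path}"
    return "\n".join([header, *tail])


if __name__ == "__main__":
    # Hypothetical noisy command: fifty lines of output from a child Python.
    noisy = [sys.executable, "-c", "for i in range(50): print(i)"]
    print(run_with_escape_hatch(noisy, keep=3))
```

The header line is the important part: exit code, total line count, and where the evidence lives if the compressed view turns out to be ambiguous.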

Token savings are useful. Losing the clue is expensive.

* * *

Caveman: make the agent stop narrating

Caveman attacks a different leak: the assistant’s habit of talking too much.

The Julius Brussee Caveman project is a skill/plugin pattern for AI coding agents that pushes replies into terse, compressed language. Its README describes triggers such as /caveman, “talk like caveman,” “caveman mode,” and “less tokens please.” It also includes intensity levels: lite, full, ultra, and Wenyan modes. [7]

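The README reports benchmark results across common coding tasks: an average of 1,214 normal output tokens against 294 Caveman tokens, with a reported average saving of 65% and a range of 22% to 87% across the listed prompts. The two averages alone would imply a saving closer to 76% (1 - 294/1,214), so the 65% figure is presumably a mean of per-prompt savings rather than the ratio of the averages. [7]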

Treat those numbers as project-reported measurements, not universal law. The more important claim is operational: less visible assistant prose means less output to pay for, less text to read, and less previous-assistant chatter carried into later context.

This is not just about being cute. Agent replies are often full of softeners: “Here’s what I found,” “The issue appears to be,” “Let me know if you’d like,” “We should probably consider,” and so on. In a normal chat, that can make the interaction feel nicer. In an engineering loop, it is often just a tax.

A good agent response during coding is usually closer to this:

`auth.ts: token expiry bug. Cause: refresh path skips clock skew guard. Fix: add 60s leeway before reject. Tests: auth-refresh.spec.ts failing before, passing after.`

That is not caveman because it is silly. It is caveman because it is a compact status packet.

The crucial correction: Caveman does not shrink hidden thinking tokens. The README says this directly: Caveman only affects output tokens; thinking and reasoning tokens are untouched. [7]

That matters because the phrase “compress thinking and output” is tempting but inaccurate. Caveman can make the agent’s visible answer shorter. It can make recurring memory or instruction files shorter through caveman-compress. It can reduce later context because previous visible assistant replies are shorter. But if the model is doing extended reasoning internally, that budget is controlled elsewhere.

In Claude Code, that “elsewhere” means settings such as effort level, model choice, thinking configuration, and thinking-token budget. Anthropic’s cost guidance is explicit about adjusting extended thinking for simpler tasks. [3]

Caveman also has an input-side trick worth separating from the response style. Its caveman-compress workflow rewrites memory files such as CLAUDE.md into compressed prose while keeping a human-readable original backup. The README’s sample table reports an average reduction from 898 to 481 tokens, or 46%, across example files. It says code blocks, URLs, file paths, commands, headings, dates, and version numbers pass through untouched while prose is compressed. [7]
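The pass-through idea is easy to sketch. The code below is a toy, not the caveman-compress algorithm: the filler-word list, the `compress_prose` name, and the choice to protect only fenced blocks, headings, inline code, and URLs are my assumptions. It shows the shape of "compress prose, leave machine-readable spans alone".

```python
import re

# Assumed filler list, purely illustrative.
FILLER = re.compile(r"\b(please|really|just|basically|very|simply)\s+", re.IGNORECASE)
# Inline code spans and URLs must never be edited.
PROTECTED = re.compile(r"(`[^`\n]+`|https?://\S+)")


def compress_prose(markdown: str) -> str:
    out: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("```"):
            in_fence = not in_fence
            out.append(line)
            continue
        if in_fence or stripped.startswith("#"):
            out.append(line)  # code blocks and headings pass through untouched
            continue
        # With one capturing group, re.split puts protected spans at odd indexes.
        parts = PROTECTED.split(line)
        parts[::2] = [FILLER.sub("", part) for part in parts[::2]]
        out.append("".join(parts))
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "# Just rules",
        "Please always run `just test` before pushing, see https://example.com/just now.",
        "```",
        "please just keep me",
        "```",
    ])
    print(compress_prose(sample))
```

Whatever tool does the real compression, diff its output against the backup before trusting it. A dropped "never" costs more than the tokens it saved.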

That is more interesting than the meme suggests.

Claude Code documentation says CLAUDE.md loads into context at session start. If that file contains detailed instructions for workflows you are not using, those tokens are present anyway. Anthropic recommends moving specialised instructions into skills because skills load on demand, and aiming to keep CLAUDE.md under 200 lines. [3]

So the stronger pattern is not “make every answer caveman forever.” The stronger pattern is: keep base context lean, load specialised instructions only when needed, and make routine status updates compact by default.

There is a trade-off. Over-compression can hide uncertainty. It can also make an agent sound more confident than it should, because hedging gets deleted along with filler. That means any caveman-style rule should preserve three things: blockers, confidence, and next action.

Cut ceremony. Keep doubt.
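If I wanted to enforce that rule rather than hope for it, I would make the reply a structure instead of a style. A minimal sketch follows, with hypothetical names (`StatusPacket`, `Confidence`) and fields taken from the three things above plus the example packet earlier in this section. Confidence and next action are required, so a terse reply cannot silently drop them.

```python
from dataclasses import dataclass, field
from enum import Enum


class Confidence(Enum):
    CONFIRMED = "confirmed"
    LIKELY = "likely"
    UNKNOWN = "unknown"


@dataclass
class StatusPacket:
    file: str
    cause: str
    fix: str
    tests: str
    confidence: Confidence  # required: hedging survives compression
    next_action: str        # required: the reader always knows what follows
    blockers: list[str] = field(default_factory=list)

    def render(self) -> str:
        parts = [
            f"{self.file}.",
            f"Cause ({self.confidence.value}): {self.cause}.",
            f"Fix: {self.fix}.",
            f"Tests: {self.tests}.",
        ]
        if self.blockers:
            parts.append("Blockers: " + "; ".join(self.blockers) + ".")
        parts.append(f"Next: {self.next_action}.")
        return " ".join(parts)


if __name__ == "__main__":
    packet = StatusPacket(
        file="auth.ts",
        cause="refresh path skips clock skew guard",
        fix="add 60s leeway before reject",
        tests="auth-refresh.spec.ts passes, full suite not run",
        confidence=Confidence.LIKELY,
        next_action="run full suite",
    )
    print(packet.render())
```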

* * *

A practical token-stretching workflow

Here is the workflow I would actually try for an AI coding setup. Nothing heroic. No dashboard. No spreadsheet theatre. Just fewer stupid tokens.

First, baseline the obvious leak. Run a normal 30-minute agent session and note where the agent reads too much. Not the total cost. The shape. Did it repeatedly call tree? Did it dump long test logs? Did it read whole files to find one function? Did it explain every tiny step in paragraphs?

Second, put RTK on the shell path and use it explicitly for high-noise commands. Start with status, diff, test, lint, grep, file read, logs, and dependency inspection. On WSL or Unix-like shells, experiment with the auto-rewrite hook. On native Windows, assume explicit commands until you have verified the hook path you actually use.

Third, give the agent a rule for escalation. Something like: use compressed output for routine inspection; request raw output when a failure is ambiguous, security-sensitive, or caused by timing, encoding, permissions, or environment state. The agent should know compression is a first pass, not the court transcript.

Fourth, turn on Caveman-style terse responses for routine engineering loops. Do not ask for a personality change. Ask for a reporting format. File. Cause. Fix. Test. Risk. Next step. That is enough for most turns.

Fifth, compress recurring instruction files only after you decide what belongs there. This is where people get it backward. They take a bloated CLAUDE.md, compress it, and call the job done. Better: delete stale rules, move workflow-specific guidance into skills, then compress the small durable core.

Sixth, tune reasoning separately. For planning, architecture, migrations, and unfamiliar bugs, keep higher reasoning. For “rename this field,” “fix lint,” “update tests,” or “summarise this diff,” lower effort may be enough. This is not a moral issue. It is matching compute to task shape.
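The escalation rule from the third step is simple enough to write down as code, which is a useful test of whether the rule is actually clear. This is a sketch under my own assumptions: the tag names and the `choose_view` function are hypothetical, and how a failure gets tagged is left to the agent or a human.

```python
# Triggers from the escalation rule: any of these means "show me the raw output".
RAW_TRIGGERS = frozenset(
    {"ambiguous", "security", "timing", "encoding", "permissions", "environment"}
)


def choose_view(failure_tags: set[str]) -> str:
    """Return "raw" when any escalation trigger is present, else "compressed"."""
    hits = failure_tags & RAW_TRIGGERS
    return "raw" if hits else "compressed"


if __name__ == "__main__":
    print(choose_view({"lint"}))
    print(choose_view({"flaky", "timing"}))
```

The same rule can live as one sentence in an instruction file. The code form just makes the trigger list explicit and easy to review.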

A concrete example:

The agent needs to fix a TypeScript auth bug. In the noisy version, it runs a full directory tree, cats several files, runs the full test suite, receives pages of test output, explains its plan in six paragraphs, edits one file, runs the full test suite again, then writes a novella about what changed.

In the cleaner version, it starts with rtk grep around the token expiry path, uses rtk read on the relevant files, runs the focused auth test through an RTK test wrapper, and reports in a terse packet: `auth expiry bug. Cause: refresh branch ignores leeway. Changed auth/session.ts. Focused test passes. Full suite not run.`

That last sentence is important. Compression should not mutate reality. If the full suite was not run, the agent must say so. Brevity is useful only when it preserves the edges.

For longer tasks, delegate noisy operations. Claude Code’s cost guidance says running tests, fetching docs, or processing logs can consume significant context, and recommends delegating verbose operations to subagents so only a summary returns to the main conversation. [3]

That pattern composes with RTK and Caveman. RTK trims the subagent’s tool output. The subagent returns a compact summary. Caveman keeps the main agent from turning that summary into a speech.

The shape is: raw world, compressed tool signal, compact agent status, explicit raw escape hatch.

That is the whole game.

* * *

Free tools still need adult supervision

The obvious failure mode is believing the compression number more than the work.

A tool can save 80% of tokens and still be the wrong tool for the moment. A directory listing compressed into a tree is helpful. A stack trace compressed until the causal frame disappears is not. A terse PR review can be excellent. A terse security review can become malpractice if it drops evidence.

So I would classify compression targets by risk.

Low risk: successful command confirmations, dependency list summaries, passing test noise, repeated log lines, boilerplate install output, directory discovery, generic status updates.

Medium risk: failing tests, compiler errors, diffs, migration logs, build errors, dependency conflicts. Compress first, but preserve enough line numbers, filenames, error codes, and command context to act.

High risk: security findings, data-loss operations, production incidents, permissions, auth, cryptography, billing, legal text, compliance evidence, and any bug where one missing line changes the diagnosis. For these, compression should be an index, not the evidence.
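Those three tiers translate naturally into a lookup. The sketch below is illustrative only: the keyword lists are a hypothetical and deliberately naive substring match, and a real setup would classify by command, path, or environment rather than by words in a description.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# What each tier allows, paraphrasing the classification above.
POLICY = {
    Risk.LOW: "compress freely",
    Risk.MEDIUM: "compress, but keep filenames, line numbers, error codes, command",
    Risk.HIGH: "send an index only, keep raw output as the evidence",
}

# Hypothetical keyword map, ordered so high-risk terms are checked first.
KEYWORDS = [
    (Risk.HIGH, ("security", "incident", "auth", "crypto", "billing",
                 "permission", "compliance", "data loss")),
    (Risk.MEDIUM, ("failed", "error", "diff", "migration", "conflict")),
]


def classify(description: str) -> Risk:
    """Return the highest matching risk tier, defaulting to LOW."""
    text = description.lower()
    for risk, words in KEYWORDS:
        if any(word in text for word in words):
            return risk
    return Risk.LOW


if __name__ == "__main__":
    for task in ("npm install output", "compiler error in build", "auth test failed"):
        tier = classify(task)
        print(f"{task}: {tier.value} -> {POLICY[tier]}")
```

Checking the high tier first matters: a failing auth test is both a test failure and an auth problem, and the stricter policy should win.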

The same applies to Caveman. I like terse agents. I do not like agents that delete caveats. The goal is not to sound primitive. The goal is to remove prose that does not carry state.

A good compressed engineering reply still includes uncertainty markers. `Likely cause` is different from `cause`. `Need full log` is different from `fixed`. `Focused test passes` is different from `all tests pass`. These are not filler words. They are control signals.

This is where the meme can accidentally teach the wrong lesson. The point is not to make the model dumber. The point is to make the channel cleaner.

There is some research support for brevity as more than aesthetics, but it should be handled carefully. A March 2026 arXiv preprint found that, on certain benchmark subsets, constraining large models to brief responses improved accuracy and reduced overelaboration errors. [8]

That does not mean every task wants fewer words. It means verbosity is not automatically intelligence. In engineering work, verbose can be worse than terse when it smears the state of the system across polite paragraphs.

The rule I would use: compress transport, not truth.

* * *

The better question

The useful question is not “Which AI plan gives me more tokens?”

It is “Where are my tokens leaking?”

Some leak through terminal output. Use RTK or the same idea: filter command output before the model sees it.

Some leak through assistant verbosity. Use Caveman or the same idea: make routine responses state-dense.

Some leak through oversized base instructions. Move specialised rules into on-demand skills. Keep the default context small.

Some leak through unnecessary reasoning. Lower effort when the task is simple. Raise it when the task deserves it.

Some leak through bad task framing. Ask for the file, function, test, or failure mode you care about instead of telling the agent to “look around.”

None of this requires a new religion. It requires a small amount of mechanical sympathy for how agent loops actually burn context.

The future of AI coding will probably include larger windows, better caching, smarter tool calling, and more automatic compression. Anthropic has already shipped token-efficient tool-use features for the API that reduce output token consumption for tool calls, with early users seeing an average reduction of 14% and up to 70% in some cases. [9]

But even if the tools get smarter, the habit still matters.

A bigger context window is useful. A cleaner context window is leverage.

So yes: free tools to help you stretch your tokens. RTK for terminal noise. Caveman for agent chatter and recurring prose. Keep the claim honest: RTK compresses CLI output before it becomes model input; Caveman compresses visible output and can compress memory prose, but it does not directly compress hidden thinking tokens.

The slightly annoying question is the one worth keeping on your desk:

Are you running out of tokens, or are you just giving the model too much trash to carry?

* * *

Notes and References

1. OpenAI Help Center, “What are tokens and how to count them?”, accessed April 27, 2026. https://help.openai.com/en/articles/4936856-what-are-tokens-and-how-to-count-them

2. OpenAI API documentation, “Reasoning models”, accessed April 27, 2026. https://developers.openai.com/api/docs/guides/reasoning

3. Anthropic Claude Code documentation, “Manage costs effectively”, accessed April 27, 2026. https://code.claude.com/docs/en/costs

4. Anthropic Claude API documentation, “Token counting”, accessed April 27, 2026. https://platform.claude.com/docs/en/build-with-claude/token-counting

5. rtk-ai/rtk GitHub README, accessed April 27, 2026. https://github.com/rtk-ai/rtk

6. RTK website, “Make your AI coding agent smarter”, accessed April 27, 2026. https://www.rtk-ai.app/

7. JuliusBrussee/caveman GitHub README, accessed April 27, 2026. https://github.com/JuliusBrussee/caveman

8. MD Azizul Hakim, “Brevity Constraints Reverse Performance Hierarchies in Language Models”, arXiv:2604.00025, submitted March 11, 2026. https://arxiv.org/abs/2604.00025

9. Anthropic, “Token-saving updates on the Anthropic API”, March 13, 2025, accessed April 27, 2026. https://claude.com/blog/token-saving-updates

10. A verification note: the article treats RTK and Caveman savings figures as project-reported measurements unless stated otherwise. I could verify the claims as published by the projects, but not independently reproduce their benchmark results during this pass.
