You Are Probably Rebuilding DSPy Already
Why prompt tables, eval scripts, and model wrappers keep turning into the same architecture.
A surprising number of teams will tell you, with a straight face, that they are not “doing DSPy.” Then they show you the system: prompts moved out of code, Pydantic schemas everywhere, retry wrappers, a retrieval step, an eval script, and a provider abstraction so somebody can test Claude next week. At that point the honest description is not “we skipped DSPy.” It is “we rebuilt the shape of it by accident.”
The wrong debate is whether DSPy wins some framework cage match. The useful question is simpler: when does a pile of locally sensible fixes turn into architecture, and do you want that architecture to be deliberate or improvised?
I do not think every team should stop and rewrite into DSPy tomorrow. That is not my point. I think DSPy is best understood as a destination architecture, and more useful as a mirror than as a mascot. It shows you the engineering boundary most LLM systems eventually need once the cute demo gets exposed to product managers, traffic, regressions, and model churn.
That is why Skylar Payne’s recent piece landed. It does not prove that every serious team secretly runs DSPy. It names the rewrite pattern teams fall into when they start with a raw API call and keep patching their way toward something change-safe. And the official DSPy materials, plus public cases from Databricks, JetBlue, Replit, and others, make the bigger point hard to ignore: this is not just research taste. It is software shape.
* * *
Stop Calling This a Framework Choice
If you only know DSPy from the slogan, it is easy to misfile it as “that prompt optimization library from Stanford.” That undersells it.
DSPy describes itself as a declarative framework for building modular AI software. Its signatures define what a module should take in and produce. Its modules compose those signatures into programs. Its optimizers tune prompts or weights against explicit metrics. The paper says the same thing more formally: treat LM pipelines as structured programs instead of long prompt strings discovered by trial and error.
That what-versus-how split is the whole game. A signature tells the model what behavior you need, not the exact prompt text you hand-crafted on a tired Tuesday night. That matters because prompt text stops being a harmless implementation detail the moment output format, retrieval quality, cost, latency, and model portability start affecting the business.
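The split can be sketched in a few lines. This is an illustrative stdlib-only sketch of the boundary idea, not DSPy's actual API (DSPy spells it with `dspy.Signature`, `InputField`, and `OutputField`); the names `TriageIn`, `TriageOut`, and `triage` are hypothetical, and the placeholder logic stands in for an LM call.

```python
from dataclasses import dataclass

@dataclass
class TriageIn:          # what the module takes in
    ticket_text: str

@dataclass
class TriageOut:         # what the module must produce
    urgency: str         # e.g. "low" | "medium" | "high"
    queue: str

def triage(inp: TriageIn) -> TriageOut:
    # The prompt text that turns a TriageIn into a TriageOut is an
    # implementation detail behind this boundary. It can be rewritten,
    # optimized, or regenerated without touching any caller.
    # Placeholder keyword check standing in for the LM call:
    urgent = "outage" in inp.ticket_text.lower()
    return TriageOut(urgency="high" if urgent else "low",
                     queue="incident" if urgent else "general")
```

Callers depend on the contract, not the prompt. That is the whole point: the string can churn while the boundary holds.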
This is why I think the usual DSPy-versus-LangChain argument is often the wrong argument. DSPy itself says it is not trying to be a batteries-included catalog of prebuilt app templates. It is closer to a programming model for LM pipelines. If you want off-the-shelf chains, other libraries optimize for that. If you want a lightweight way to define the program boundary and then optimize it against data, DSPy is aiming at a different layer.
What actually matters is not whether you enjoy the abstractions. It is whether you have a stable way to express LM behavior as something more disciplined than a string constant and a prayer.
And no, this is not just a clever academic diagram. DSPy’s official use-cases page currently lists public or production-facing examples from JetBlue, Replit, Databricks, Zoro UK, VMware, Moody’s, and others. Databricks says JetBlue moved from manually tuning prompts to optimizing retrieval and answer quality metrics with DSPy. Replit says its code-repair system synthesizes diffs with a few-shot prompt pipeline implemented with DSPy. That is not proof that DSPy is the default industry center of gravity. It is proof that the model is not hypothetical.
You can disagree with the taste. You can dislike the ergonomics. You can decide the timing is wrong for your team. Fine. But the useful question is not whether DSPy feels elegant. It is whether your system already needs typed boundaries, modular composition, explicit metrics, and a way to survive model churn without rewriting the plumbing every quarter.
* * *
The Seven-Step Rewrite Nobody Plans
Payne’s essay is useful because it compresses a year of LLM system drift into a sequence any engineer can recognize.
First comes the inline prompt in application code. Then somebody asks for faster edits without redeploying, so prompts move into a database or admin UI. Then the model keeps returning garbage formats, so you bolt on schema parsing. Then production teaches you humility and you add retries. Then you need retrieval. Then you finally build evals because nobody can tell whether the last prompt change helped or broke three other cases. Then leadership wants to test another model and you discover your codebase is basically a shrine to one provider’s client library.
None of those moves is stupid. That is the important part. A prompt table is often sensible. Typed parsing is sensible. Retries are sensible. Retrieval is sensible. Evals are very sensible. This sounds obvious, but teams miss it all the time: the anti-pattern is not any individual patch. The anti-pattern is leaving every patch as a bespoke local repair instead of lifting it into a coherent program abstraction.
Here is a boring example, because boring examples are where architecture gets expensive.
Imagine an internal support triage service. Week one, you write one model call that labels urgency and suggested queue. It works well enough. Week three, ops wants to tweak wording without waiting for a deploy, so now you have a prompt table. Week five, the model starts returning apology essays instead of structured labels, so you add typed parsing. Week eight, transient failures and parse errors mean retries and fallbacks. Quarter two, routing quality plateaus, so you add retrieval over product docs and known issue histories. Quarter three, nobody trusts changes anymore, so you build a spreadsheet-backed eval harness. Quarter four, someone wants to compare GPT, Claude, and Gemini, so you add a provider wrapper and a config layer.
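By week eight, the code around that one model call usually looks something like this. A stdlib sketch of the accumulated patches; `call_model` is a hypothetical stand-in for whatever provider SDK the team started with.

```python
import json
import time

def parse_labels(raw: str) -> dict:
    # Week five: the model sometimes returns apology essays instead of
    # structured labels, so parse defensively.
    data = json.loads(raw)
    if not {"urgency", "queue"} <= data.keys():
        raise ValueError(f"missing fields: {data}")
    return data

def triage_with_retries(call_model, ticket_text: str, attempts: int = 3) -> dict:
    # Week eight: transient failures and parse errors get a retry loop.
    last_err = None
    for _ in range(attempts):
        try:
            return parse_labels(call_model(ticket_text))
        except (ValueError, json.JSONDecodeError) as err:
            last_err = err
            time.sleep(0)  # backoff elided in this sketch
    raise RuntimeError(f"triage failed after {attempts} attempts") from last_err
```

Each function is locally sensible. None of it adds up to a program boundary: the schema, the retry policy, and the provider call are all fused into one bespoke repair.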
At no point will the team say, “Great, we are now intentionally designing an LM program.” They will say, “We were being pragmatic.”
Maybe. But pragmatism does not magically erase architecture. It just makes the architecture show up as scar tissue.
That is why Payne’s framing works. He is not saying the early patches are foolish. He is saying they are local optimizations that become bad global design when they stay informal. A database prompt store does not solve ownership. A parser does not solve modularity. A retry decorator does not solve execution policy. An eval script does not solve reproducibility. And a provider wrapper added in month nine is still a late refactor.
That is also why the adoption story is so interesting. PyPI Stats, which is only a rough proxy, currently shows dspy at about 5.7 million monthly downloads versus roughly 229.9 million for langchain. So this is clearly not a story where DSPy already won the framework popularity contest. It is the opposite. The interesting thing is that DSPy’s ideas can be directionally right while its direct adoption still trails far behind the broader application framework ecosystem. Being right is not the same as being easy.
* * *
Why Good Teams Still Put It Off
I think there are four boring reasons.
First, DSPy asks you to think earlier than you want to think. Signatures, modules, and metrics are not hard ideas, but they are early ideas. When the first raw API call works, the emotional reward is “ship it,” not “step back and formalize the boundary.” Payne nails this. The abstractions feel expensive before the pain is visible.
Second, optimization only gets interesting once you have data and a credible metric. DSPy’s own evaluation docs say even 20 examples can be useful, which is encouraging, but it still means you need a development set and a definition of good. Many teams do not have that on day one. They have vibes, a demo, and an impatient stakeholder. Metrics arrive later, usually after a few expensive surprises.
Third, stack friction is real. DSPy is a Python framework. That is fine if you are already Python-heavy. It is less fine if your production surface is TypeScript, Java, or .NET and nobody wants to introduce a fresh Python island just to get cleaner abstractions. Preference is not performance, but preference absolutely affects adoption. Teams will often choose to recreate the ideas inside their native stack before they choose a center-of-gravity shift.
Fourth, the old critique that DSPy is “just research code” is less convincing than it used to be, but the concern did not come from nowhere. The current docs now include a production overview, deployment guides, async support, observability, and optimizer tracking. That matters. The picture is better than a lot of skeptics assume. But DSPy is still not the whole production stack. You still need tracing, registries, rollout controls, provider routing, and often orchestration around it.
This is the part the market is quietly admitting. MLflow now ships prompt registry and prompt optimization, with aliases like staging and production plus evaluation integration. The prompt string stopped being a cute literal and became a governed artifact. Once that happens, teams need version history, quality gates, routing control, and rollback semantics whether or not they ever type “import dspy”.
So I do not read the ecosystem as evidence against DSPy. I read it as evidence that teams are discovering the same pressure from different angles. Some adopt DSPy directly. Some buy or build the surrounding layers first. Some do both. The shape keeps converging.
The problem is usually not whether a team believes in DSPy. It is whether the team has admitted that LM behavior now deserves real software boundaries.
* * *
The Brownfield Move That Actually Works
That leads to the only adoption pattern I actually trust: brownfield, narrow, metric-first.
Do not start with a grand rewrite. Do not begin with your most theatrical agent. Pick one repeated LM task that already causes pain. A classifier. An extractor. A reranker. One retrieval stage. Something boring enough that improvement is measurable and failure is legible.
Then write one signature. Not ten. One. Define the typed input and the typed output you actually need. After that, wrap your current logic inside one module. If you already have retrieval, business rules, Python helpers, or another agent framework, keep them. DSPy’s custom module guide explicitly supports integrating external tools and services. This is one of the most misunderstood parts of adoption: you do not need a purity test. You need one stable boundary.
Now collect a small development set. Again, boring wins. Pull 20 to 50 ugly real examples. Not the clean internal demo prompts. The weird tickets. The malformed inputs. The cases that made people complain. DSPy’s docs explicitly say even 20 examples can be useful. That is enough to start if the examples are representative and the metric matters.
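A development set does not need infrastructure; a JSONL file is enough to start. A minimal sketch, assuming a hypothetical file layout with `ticket_text` and `expected_queue` fields per line.

```python
import json

def load_devset(path: str, limit: int = 50) -> list[dict]:
    """Load up to `limit` labeled examples from a JSONL file."""
    examples = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            ex = json.loads(line)
            # Fail loudly on malformed examples; a dirty devset
            # quietly poisons every later comparison.
            assert {"ticket_text", "expected_queue"} <= ex.keys(), ex
            examples.append(ex)
            if len(examples) >= limit:
                break
    return examples
```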
Then define a metric that your business would actually care about. Exact match for queue assignment. Pass rate from a human reviewer. Structured output validity plus field-level accuracy. Something you can explain to a skeptical engineer without saying “the vibes improved.”
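Such a metric can be a dozen lines of plain Python. A sketch combining structured-output validity with exact queue match; the field names and queue set are hypothetical.

```python
VALID_QUEUES = {"incident", "billing", "general"}

def routing_metric(example: dict, prediction: dict) -> float:
    """Score one prediction against one labeled example."""
    # Invalid structure scores zero outright: a malformed answer is a
    # failure even if the text happens to mention the right queue.
    if prediction.get("destination_queue") not in VALID_QUEUES:
        return 0.0
    # Otherwise: exact match on queue assignment.
    return 1.0 if prediction["destination_queue"] == example["expected_queue"] else 0.0

def score(devset: list[dict], predictions: list[dict]) -> float:
    """Mean metric over the development set."""
    return sum(routing_metric(e, p) for e, p in zip(devset, predictions)) / len(devset)
```

A metric this explicit is what makes "did the last change help?" answerable with a number instead of an argument.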
Only after that should you turn on optimization.
I would also turn on observability before I touched an optimizer. Locally, inspect_history is already useful. For shared work, DSPy and MLflow now support tracing and optimizer tracking. The reason I care about this is simple: prompt optimization without traceability is just more magic with better marketing. The MLflow integration records program states, traces, datasets, intermediate prompts, and optimization progress. That is much closer to a real engineering audit trail.
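Even without MLflow, the minimum viable version is a trace record per call. A stdlib sketch of that idea, not the DSPy or MLflow tracing API; the fields are hypothetical.

```python
import json
import time

def record_trace(path: str, inputs: dict, outputs: dict, meta: dict) -> dict:
    """Append one trace record to a JSONL file and return it."""
    entry = {
        "ts": time.time(),
        "inputs": inputs,
        "outputs": outputs,
        "meta": meta,  # e.g. model name, prompt version, latency
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

It is crude, but it already answers the question an optimizer run raises: what exactly did the program see, and what exactly did it say?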
If the optimizer does not move the metric, stop. Treat optimization as an experiment with a stop condition, not a ritual. This is where a lot of teams waste time. They hear “DSPy has optimizers” and start compiling before they have a stable baseline. Then they cannot tell whether the improvement is real, task-specific, or mostly noise.
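The stop condition deserves to be written down before the first compile, not negotiated afterward. A minimal sketch; the 0.05 margin is a hypothetical judgment call, not a recommendation.

```python
def decide(baseline: float, optimized: float, min_gain: float = 0.05) -> str:
    """Keep the optimized program only if it clearly beats the baseline.

    Scores are fractions in [0, 1] from the same metric on the same devset.
    """
    if optimized - baseline >= min_gain:
        return "ship optimized program"
    return "stop: keep baseline, revisit data or metric"
```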
If you are not a Python-first shop, keep the blast radius small. DSPy’s deployment tutorial shows two straightforward options: serve the program behind FastAPI for a lightweight REST boundary, or package it with MLflow for a more managed deployment flow. The MLflow path even recommends the llm/v1/chat task format so the deployed interface lines up with the OpenAI chat API shape that many applications already expect. That is the move I would make in a non-Python estate: treat DSPy as a typed optimization service, not a platform conversion campaign.
Here is a concrete version.
Suppose your team owns an internal assistant that routes support tickets. Leave the front end alone. Leave auth alone. Leave your retrieval index alone. Replace just the routing prompt. Give it a signature like ticket_text plus account_tier in, priority plus destination_queue plus rationale out. Wrap the existing retrieval call and the routing step in one module. Pull 30 painful historical tickets. Score queue match and human accept rate. Trace every run for a week. Then run a light optimization pass.
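The routing boundary described above, pinned down as a typed contract. DSPy would express this as a signature; this stdlib sketch only fixes the shape, and the queue names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RoutingInput:
    ticket_text: str
    account_tier: str

@dataclass
class RoutingOutput:
    priority: str            # e.g. "p1" | "p2" | "p3"
    destination_queue: str
    rationale: str

def validate_output(out: RoutingOutput,
                    queues=frozenset({"incident", "billing", "general"})) -> bool:
    # One validity check, reused both when parsing model output and when
    # scoring it: the queue must exist and the rationale must be non-empty.
    return out.destination_queue in queues and bool(out.rationale.strip())
```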
If the metric improves and the traces make sense, keep that module and move on. If not, you learned something cheaply.
That is the part people undervalue. Good brownfield adoption is not just a path to success. It is a path to a cheap “no.”
And keep the layer model straight. Use DSPy for program structure, metrics, and optimization. Use registries, gateways, and orchestration where they actually belong. The official docs already assume this kind of layering by pointing to FastAPI, MLflow, async execution, streaming, and observability rather than pretending the framework is a whole universe. I think that is the healthiest way to use it.
DSPy is not a religion. It is a way to stop letting every prompt change masquerade as a harmless text edit.
I think the teams that ship durable LM systems will look slightly boring from the outside. Fewer hero prompts. Fewer framework arguments. More typed boundaries, more evals, better traces, cleaner rollbacks.
That is why the useful question is not “Are we using DSPy?” It is “How many DSPy-shaped problems are already on our backlog, and why are we pretending they are unrelated?”
You can adopt the framework. You can steal the ideas. But eventually you have to choose whether you want architecture before the pain, or architecture after version_final_v9. Which kind of pragmatist are you?
* * *
Notes and References
1. Omar Khattab et al., DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines, ICLR 2024 / arXiv preprint, 2023.
2. DSPy official site, “Programming-not prompting-LMs,” accessed April 18, 2026.
3. DSPy documentation, “Signatures,” accessed April 18, 2026.
4. DSPy documentation, “DSPy Optimizers,” accessed April 18, 2026.
5. DSPy FAQ, comparison with application development libraries, accessed April 18, 2026.
6. Skylar Payne, “If DSPy is So Great, Why Isn’t Anyone Using It?”, March 21, 2026.
7. DSPy community, “Use Cases,” accessed April 18, 2026.
8. Databricks, “Optimizing Databricks LLM Pipelines with DSPy,” May 23, 2024.
9. Replit, “Building LLMs for Code Repair,” April 5, 2024.
10. DSPy documentation, “Evaluation Overview,” accessed April 18, 2026.
11. DSPy documentation, “Deployment,” accessed April 18, 2026.
12. DSPy documentation, “Using DSPy in Production,” “Debugging and Observability,” and “Tracking DSPy Optimizers with MLflow,” accessed April 18, 2026.
13. PyPI Stats, package pages for dspy and langchain, accessed April 18, 2026.
14. MLflow, “Prompt Registry for LLM and Agent Applications” and “Prompt Optimization: Automate Prompt Engineering,” accessed April 18, 2026.

