Chapter 1 · Section 2

What is DSPy?

DSPy is a framework for programming—not prompting—foundation models like GPT-4, Claude, and others.

~15 min read

DSPy: Declarative Self-improving Language Programs

DSPy provides a systematic way to build LM-based applications that are modular, composable, and automatically optimizable.

📜 Historical Context: The Demonstrate-Search-Predict Paper

DSPy originated from the groundbreaking research paper "Demonstrate-Search-Predict: Composing Retrieval and Language Models for Knowledge-Intensive NLP" by Omar Khattab and colleagues at Stanford University.

The paper demonstrated that complex reasoning tasks could be decomposed into three systematic stages:

🎯

DEMONSTRATE

Learning from examples and demonstrations

🔍

SEARCH

Retrieving and synthesizing information from multiple sources

💡

PREDICT

Generating accurate outputs based on gathered evidence

This three-stage approach showed that by treating language model tasks as structured programs rather than mere prompts, we could achieve:

  • ✅ Better compositional generalization
  • ✅ More reliable multi-hop reasoning
  • ✅ Systematic optimization through weak supervision
  • ✅ Zero-shot transfer to new tasks

❌ The Problem: Manual Prompt Engineering

Before understanding DSPy, let's look at the traditional approach to working with LLMs.

Traditional Prompt Engineering

When you want an LLM to perform a task, you typically write a prompt:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Manual prompt for question answering
prompt = """
You are a knowledgeable assistant. Answer the following question accurately and concisely.

Question: What is the capital of France?

Provide your answer in a single sentence.
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

print(response.choices[0].message.content)

This works for simple cases, but scaling this approach leads to significant problems.

Problems with Manual Prompting

⚠️

Brittle and Hard to Maintain

Small changes can dramatically affect results. There's no systematic way to know which prompt version is best.

# Prompt for sentiment analysis
sentiment_prompt = """
Analyze the sentiment of this text and classify it as positive, negative, or neutral.
Be careful to consider context and sarcasm.
Respond with only the sentiment label.

Text: {text}
Sentiment:
"""

What if the model doesn't follow instructions? How do you handle edge cases?

⚠️

Doesn't Compose Well

Chaining multiple steps is manual and error-prone:

# (call_llm is a stand-in for any raw LLM API call)

# Step 1: Summarize
summary_prompt = f"Summarize this: {document}"
summary = call_llm(summary_prompt)

# Step 2: Extract entities
entity_prompt = f"Extract entities from: {summary}"
entities = call_llm(entity_prompt)

# Step 3: Classify
classification_prompt = f"Classify these entities: {entities}"
result = call_llm(classification_prompt)

The result: errors propagate between steps, there is no systematic way to optimize the chain, and debugging becomes a nightmare.

⚠️

No Systematic Optimization

The only way to improve is trial and error:

  • Try different phrasings manually
  • Add examples by hand
  • Test each variation
  • No guarantee of improvement

This is like training a neural network by manually adjusting weights!

✅ The Solution: DSPy

DSPy changes the game by letting you program with language models instead of prompting them.

Key Idea: Separate What from How

Instead of telling the model how to solve a task (via prompts), you tell it what to do (via signatures), and DSPy figures out how.

This shift brings structure and reproducibility to your AI development. Since your logic is defined in code (Python classes), it is version-controllable, testable, and modular—unlike a giant string of text that might break if you change one word.

Traditional prompting (imperative)

prompt = "You are an assistant. Answer questions. Question: {q}"

DSPy (declarative)

class QuestionAnswer(dspy.Signature):
    """Answer questions accurately."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

DSPy automatically creates the prompts for you!

What DSPy Provides

📝

Signatures

Task specifications that define what a task does, not how:

class Summarize(dspy.Signature):
    """Summarize the given text."""
    document: str = dspy.InputField()
    summary: str = dspy.OutputField(desc="concise summary in 2-3 sentences")

Like a type signature in programming—specifies inputs and outputs.

🧩

Modules

Reusable components that use signatures:

# Create a summarization module
summarizer = dspy.Predict(Summarize)

# Use it
result = summarizer(document="Long text here...")
print(result.summary)

Modules can be combined, extended, and optimized.

Optimizers

Automatically improve your programs:

# Optimize automatically
from dspy.teleprompt import BootstrapFewShot

# (your_metric, RAGPipeline, and your_data are placeholders for your
# own metric function, DSPy module, and training examples)
optimizer = BootstrapFewShot(metric=your_metric)
optimized_rag = optimizer.compile(RAGPipeline(), trainset=your_data)

DSPy learns better prompts, examples, and module compositions!
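The `metric` argument above is just a Python function you write: it takes a gold example and a prediction (plus an optional trace) and returns a score. A minimal exact-match sketch, with an illustrative `answer` field name:

```python
def exact_match(example, prediction, trace=None):
    """Score 1.0 when the predicted answer matches the gold answer."""
    gold = example.answer.strip().lower()
    pred = prediction.answer.strip().lower()
    return 1.0 if gold == pred else 0.0
```

Any callable with this shape works; richer metrics can check multiple fields or even call another LM as a judge.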

Core Concepts

📝

Signatures

Think of signatures as function declarations for LM tasks:

# Input -> Output specification
class TranslateToFrench(dspy.Signature):
    english_text: str = dspy.InputField()
    french_text: str = dspy.OutputField()

🧩

Modules

Pre-built and custom components:

  • dspy.Predict - Basic prediction
  • dspy.ChainOfThought - Step-by-step reasoning
  • dspy.ReAct - Agent-style reasoning with tools
  • Custom - Build your own!

Teleprompters (Optimizers)

Automatically improve your program:

  • BootstrapFewShot - Generate few-shot examples
  • MIPRO - Optimize instructions and demonstrations
  • KNNFewShot - Use similarity-based examples

Why DSPy Matters

🔧

Systematic Development

DSPy brings software engineering practices to LM applications: modularity, abstraction, reusability, and systematic testing.

Automatic Optimization

Instead of manually tweaking prompts, DSPy learns from your data, generates optimal prompts, and improves with more examples.

📈

Scalability

Build complex pipelines that chain multiple steps, handle errors gracefully, and scale to production.

📚

Research-Backed

Developed by the Stanford NLP group; the DSPy paper was published at ICLR 2024, with related work at other top venues. Proven effectiveness across tasks.

When to Use DSPy

✅ DSPy is ideal when you:

  • Build complex LM pipelines with multiple steps
  • Want to systematically improve performance
  • Need modularity and reusability
  • Have data for optimization
  • Value maintainability over quick hacks

❌ Consider alternatives when you:

  • Need a simple one-off query
  • Have zero data for optimization
  • Need very specific prompt control
  • Require guaranteed output formats

The DSPy Philosophy

Programming > Prompting

Traditional: Human writes prompt → LM executes → Human tweaks → Repeat

DSPy: Human defines task → DSPy optimizes → LM executes → System improves

Declarative > Imperative

Imperative: "Here's how to answer: First read the context, then..."

Declarative: "Given context and question, produce an answer"

Optimizable > Static

Static: Fixed prompts that require manual updates

Optimizable: Programs that improve automatically from data

📝 Summary

DSPy is:

  • A framework for programming foundation models
  • Based on signatures (task specs) and modules (components)
  • Designed for composition and optimization
  • Research-backed and production-ready

DSPy lets you:

  • Define what tasks do, not how
  • Build modular, composable pipelines
  • Automatically optimize from data
  • Scale to complex applications

Key Advantage: Instead of manually engineering prompts, you program at a higher level and let DSPy handle the prompt optimization automatically.