Chapter 5

COPRO: Chain-of-Thought Prompt Optimization

Discover the best instructions for your DSPy programs using evolutionary search and cost-aware optimization.

πŸ“‹ Prerequisites

  • Previous Section: BootstrapFewShot (Few-shot optimization)
  • Chapter 4: Evaluation (Metrics and validation)
  • Concept: Evolutionary algorithms (helpful but not required)
  • Knowledge: Understanding of prompt engineering basics

Introduction to COPRO

COPRO (Chain-of-thought PROmpt optimization) is an advanced DSPy optimizer that uses evolutionary search to discover and refine optimal instructions for your language model programs. Unlike BootstrapFewShot, which focuses on selecting good demonstrations, COPRO specifically targets instruction optimizationβ€”finding the best way to describe your task to the language model.

πŸ’‘

Core Innovation: Prompts that work well for humans may not be optimal for LMs. COPRO solves this by letting the LM generate, evaluate, and evolve its own instructions.

How It Works

  1. Generate Candidates: Uses an LM to propose diverse instruction variations.
  2. Evaluate: Tests each candidate against your metric on a training set.
  3. Evolve: Selects top performers and mutates them to create better versions.
  4. Converge: Repeats this process to find the optimal prompt.
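The four steps above can be sketched as a small loop in plain Python. This is a toy, greedy variant that keeps a single parent per generation, not COPRO's actual implementation; `propose_variants` and `score` are hypothetical stand-ins for the LM's mutation step and the metric evaluation:

```python
def optimize_instruction(seed, propose_variants, score, breadth=10, depth=3):
    """Toy generate/evaluate/evolve loop over instruction candidates.

    propose_variants(instruction, n) -> n mutated instruction strings
    score(instruction) -> float quality on the training set
    """
    best, best_score = seed, score(seed)
    for _ in range(depth):                           # generations
        candidates = propose_variants(best, breadth) # 1. generate
        scored = [(score(c), c) for c in candidates] # 2. evaluate
        top_score, top = max(scored)                 # 3. select the winner
        if top_score > best_score:                   # 4. keep it if it improves
            best, best_score = top, top_score
    return best, best_score
```

In the real optimizer, both the proposal step (an LM rewriting the instruction) and the scoring step (running your metric over the training set) are expensive, which is why the cost controls below matter.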

Cost-Aware Optimization

COPRO is designed to be efficient:

  • Adaptive Evaluation: Spends more compute on promising candidates.
  • Early Termination: Stops bad search paths quickly.
  • Budget Management: Respects your constraints on total optimization cost.
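One way to picture these savings is successive halving: score every candidate on a small sample first, then spend the larger evaluation budget only on the survivors. This is a generic sketch of the idea, not COPRO's internal code; `score_on` is a hypothetical helper that averages your metric over `n` training examples:

```python
def successive_halving(candidates, score_on, sample_sizes=(5, 20, 100)):
    """Evaluate candidates on growing samples, halving the pool each round.

    score_on(candidate, n) -> average metric score over n training examples.
    Weak candidates are dropped early, before the expensive large-sample runs.
    """
    pool = list(candidates)
    for n in sample_sizes:
        ranked = sorted(pool, key=lambda c: score_on(c, n), reverse=True)
        pool = ranked[: max(1, len(ranked) // 2)]  # keep the top half
    return pool[0]  # best surviving candidate
```

With 8 candidates and sample sizes (5, 20, 100), only one candidate ever pays for the full 100-example evaluation, instead of all 8.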

πŸ’» Basic Usage

Simple Classification Example

Here is how to use COPRO to optimize a sentiment classifier:

import dspy
from dspy.teleprompt import COPRO

# Configure the task LM first (the model name here is just an example)
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1. Define your signature and module
class SentimentClassifier(dspy.Signature):
    """Classify text sentiment."""
    text: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")

classifier = dspy.Predict(SentimentClassifier)

# 2. Define training data (20-50 examples recommended)
trainset = [
    dspy.Example(text="I love this!", sentiment="positive").with_inputs("text"),
    dspy.Example(text="Terrible service.", sentiment="negative").with_inputs("text"),
    # ... add more examples
]

# 3. Define metric
def sentiment_accuracy(example, pred, trace=None):
    return example.sentiment.lower() == pred.sentiment.lower()

# 4. Configure COPRO
copro = COPRO(
    metric=sentiment_accuracy,
    breadth=10,  # Candidates per generation
    depth=3      # Number of generations
)

# 5. Compile (eval_kwargs is forwarded to the internal Evaluate call)
optimized_classifier = copro.compile(
    classifier, trainset=trainset,
    eval_kwargs={"num_threads": 4, "display_progress": True},
)

# 6. Usage
result = optimized_classifier(text="This exceeded all expectations!")
print(result.sentiment)

βš™οΈ Advanced Configuration

Fine-tune COPRO's behavior for your specific needs.

| Parameter | Description | Default | Recommended for Reasoning |
|---|---|---|---|
| `breadth` | Candidates per generation | 10 | 15-20 |
| `depth` | Number of generations | 3 | 3-5 |
| `init_temperature` | Creativity for initial candidates | 1.4 | 1.5 |
| `prompt_model` | LM used to generate prompts | None (uses task model) | GPT-4 / stronger model |
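For budgeting, a rough rule of thumb: each generation evaluates `breadth` candidates, and each evaluation scores every training example, so total metric (and task-LM) calls grow as roughly breadth × depth × len(trainset). This is a back-of-the-envelope upper bound, not an exact accounting of the implementation:

```python
def estimate_metric_calls(breadth, depth, trainset_size):
    """Rough upper bound on metric evaluations for one COPRO-style run."""
    return breadth * depth * trainset_size

# Default settings (breadth=10, depth=3) on 50 examples:
print(estimate_metric_calls(10, 3, 50))   # 1500 metric calls
# The reasoning-oriented settings roughly triple the bill:
print(estimate_metric_calls(20, 5, 50))   # 5000 metric calls
```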

Using a Stronger Teacher Model

A common strategy is to use a powerful model (like GPT-4) to generate prompt candidates, but optimize them for a smaller, faster model (like GPT-3.5 or a local model).

# Use GPT-4 to propose instructions
prompt_generator = dspy.LM(model="openai/gpt-4")

# Optimize for GPT-3.5
target_model = dspy.LM(model="openai/gpt-3.5-turbo")
dspy.configure(lm=target_model)

copro = COPRO(
    metric=your_metric,
    breadth=12,
    depth=4,
    prompt_model=prompt_generator  # Teacher model
)

optimized_program = copro.compile(
    program, trainset=trainset, eval_kwargs={"num_threads": 4}
)

πŸ†š COPRO vs. Other Optimizers

| Feature | COPRO | BootstrapFewShot | MIPRO |
|---|---|---|---|
| Primary Target | Instructions | Demonstrations | Both |
| Method | Evolutionary Search | Bootstrap Sampling | Bayesian Optimization |
| Best For | Instruction-sensitive tasks, reasoning | Tasks with good examples | Complex pipelines, max performance |
| Speed | Medium | Fast | Slow |

Recommendation: Use COPRO when you have a tricky reasoning task where the exact wording of the prompt matters significantly, or when you have limited data for demonstrations.

🌍 Real-World Applications

1. Medical Triage

In high-stakes domains like healthcare, instruction precision is critical. COPRO can evolve prompts that strictly adhere to safety protocols.

# Metric penalizes dangerous errors heavily
def triage_metric(example, pred, trace=None):
    correct = pred.level == example.level
    # Critical penalty for under-triaging emergencies
    if example.level == "emergency" and pred.level != "emergency":
        return 0.0 
    return 1.0 if correct else 0.5

copro = COPRO(
    metric=triage_metric,
    breadth=15, 
    depth=5, 
    init_temperature=1.2 # Lower temp for stability
)
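To see how this metric steers the search, the snippet below re-defines it and scores a few hypothetical cases, with `SimpleNamespace` standing in for DSPy example and prediction objects:

```python
from types import SimpleNamespace

def triage_metric(example, pred, trace=None):
    correct = pred.level == example.level
    # Critical penalty for under-triaging emergencies
    if example.level == "emergency" and pred.level != "emergency":
        return 0.0
    return 1.0 if correct else 0.5

gold = SimpleNamespace(level="emergency")
assert triage_metric(gold, SimpleNamespace(level="emergency")) == 1.0  # exact match
assert triage_metric(gold, SimpleNamespace(level="urgent")) == 0.0     # dangerous miss
routine = SimpleNamespace(level="routine")
assert triage_metric(routine, SimpleNamespace(level="urgent")) == 0.5  # safe-side error
```

Because a dangerous miss scores 0.0 while a safe-side error still earns 0.5, evolution is pushed toward instructions that err on the side of caution.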

2. Legal Analysis

For tasks requiring specific vocabulary and structure, COPRO can find the "magic words" that align the model with legal standards.
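A hedged sketch of what such a metric might look like: a continuous score that rewards outputs containing required terms of art. The term list, field names (`analysis`, `conclusion`), and weighting are illustrative assumptions, not drawn from any legal standard:

```python
REQUIRED_TERMS = ("plaintiff", "defendant", "jurisdiction")  # illustrative only

def legal_structure_metric(example, pred, trace=None):
    """Fraction of required terms present, halved if the gold conclusion
    is missed. A toy weighting, purely for illustration."""
    text = pred.analysis.lower()
    coverage = sum(term in text for term in REQUIRED_TERMS) / len(REQUIRED_TERMS)
    matches_conclusion = example.conclusion.lower() in text
    return coverage if matches_conclusion else coverage * 0.5
```

A continuous metric like this gives COPRO a gradient to climb: an instruction that elicits two of the three required terms scores higher than one that elicits none, even before the conclusion is right.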

✨ Best Practices

  • Diverse Training Data: Ensure your 20-50 training examples cover various edge cases. If they are all simple, COPRO won't learn to handle complexity.
  • Meaningful Metrics: Your metric determines the direction of evolution. A binary (0/1) metric provides less signal than a continuous score (0.0-1.0) or a weighted multi-component metric.
  • Start Conservative: Begin with breadth=8, depth=2 to check for quick wins. If promising, expand to breadth=15, depth=5.
  • Combine Optimizers: You can optimize instructions with COPRO first, then use BootstrapFewShot on the result to add demonstrations.
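The signal difference between a binary and a continuous metric is easiest to see on a near-miss answer. The token-overlap scorer below is a toy assumption for illustration, not a recommended production metric:

```python
def binary_metric(gold, pred):
    return 1.0 if pred == gold else 0.0

def continuous_metric(gold, pred):
    """Token-overlap (Jaccard) score in [0, 1]: partial credit for near misses."""
    g, p = set(gold.lower().split()), set(pred.lower().split())
    return len(g & p) / len(g | p) if g | p else 1.0

gold = "the revenue grew 12 percent"
near_miss = "revenue grew 12"
# binary_metric: 0.0 -- a near miss looks identical to garbage.
# continuous_metric: 0.6 -- the optimizer can tell it is getting warmer.
```

Under the binary metric, two instructions that produce a near miss and a wild guess look equally bad, so evolution has nothing to climb; the continuous score ranks them, which is exactly the signal COPRO needs.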