🎯 Introduction
BootstrapFewShot is one of DSPy's most powerful optimizers. It automatically generates and selects high-quality few-shot examples to improve your program's performance. Instead of manually crafting demonstrations, you provide input-output pairs and a metric, and BootstrapFewShot searches for effective demonstrations for your specific task.
Key innovation: Weak supervision—train models without hand-labeled intermediate steps!
🔬 Weak Supervision
A key innovation from the Demonstrate-Search-Predict paper is weak supervision. In simple terms, this is like a teacher who doesn't check every line of your scratchpad math, but only grades the final answer. If you get the final answer right using your own method, the teacher says "Good job, do it that way again."
In DSPy, "weak supervision" allows you to train models without hand-labeled intermediate reasoning steps:
```python
# Traditional approach requires manually annotated reasoning
traditional_training = [
    dspy.Example(
        question="What is 15 * 23?",
        reasoning="Step 1: 15 * 20 = 300\nStep 2: 15 * 3 = 45\nStep 3: 300 + 45 = 345",
        answer="345",
    ),
    # ... many more with detailed reasoning
]

# With weak supervision, you only need:
weak_supervision_training = [
    dspy.Example(question="What is 15 * 23?", answer="345"),
    dspy.Example(question="What is 12 * 17?", answer="204"),
    # ... just input-output pairs!
]

# BootstrapFewShot will automatically generate the reasoning!
```
⚙️ How BootstrapFewShot Works
1. **Initial Generation**: Runs the unoptimized program (acting as its own teacher by default) over the training inputs to produce candidate demonstrations
2. **Quality Filtering**: Scores each candidate's trace with your metric
3. **Example Selection**: Keeps the demonstrations that pass the metric, up to `max_bootstrapped_demos`
4. **Iterative Refinement**: Repeats the process for up to `max_rounds` rounds to improve example quality
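Conceptually, the loop looks something like the sketch below. This is an illustration of the idea rather than DSPy's actual implementation; `teacher` here is simply the program used to generate traces (by default a copy of the student), and the field names follow the QA examples used throughout this guide.

```python
import dspy

def bootstrap_demos_sketch(student, teacher, trainset, metric, max_demos=4):
    """Illustrative sketch of the bootstrapping loop (not DSPy's real code)."""
    demos = []
    for example in trainset:
        # Run the teacher program on a training input.
        prediction = teacher(question=example.question)

        # Keep the demonstration only if the metric approves the output.
        # A real trace would also capture intermediate fields
        # (e.g. generated reasoning), which become part of the demo.
        if metric(example, prediction):
            demos.append(
                dspy.Example(question=example.question, answer=prediction.answer)
            )
        if len(demos) >= max_demos:
            break

    # Attach the surviving demonstrations to the student's predictors.
    for predictor in student.predictors():
        predictor.demos = demos
    return student
```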
📦 Basic Usage
```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# 1. Define your program
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.generate(question=question)

# 2. Define evaluation metric
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# 3. Prepare training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare").with_inputs("question"),
]

# 4. Create optimizer and compile
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_qa = optimizer.compile(SimpleQA(), trainset=trainset)

# 5. Use the compiled program
result = compiled_qa(question="What is 3+3?")
print(result.answer)  # Should be "6"
```
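Once compiled, the selected demonstrations are attached to the program's predictors, and the program can be saved so you don't have to re-run optimization. A minimal sketch (the file name is arbitrary; `demos` is the attribute DSPy stores few-shot examples on):

```python
# Inspect the demonstrations BootstrapFewShot attached to the predictor
for demo in compiled_qa.generate.demos:
    print(demo)

# Save the compiled program's state, then rebuild and reload it later
compiled_qa.save("compiled_qa.json")
reloaded = SimpleQA()
reloaded.load("compiled_qa.json")
```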
🎛️ BootstrapFewShot Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `metric` | Callable | `None` | Function used to decide whether a bootstrapped demonstration is kept; if omitted, every successful trace is accepted |
| `max_bootstrapped_demos` | int | 4 | Maximum number of model-generated demonstrations per predictor |
| `max_labeled_demos` | int | 16 | Maximum number of demonstrations taken directly from the labeled trainset |
| `max_rounds` | int | 1 | Number of bootstrapping rounds over the training data |
| `teacher_settings` | dict | `None` | Settings (e.g. a different LM) for the teacher program that generates demonstrations |
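Putting these together, an explicit configuration might look like the following (the values are reasonable starting points, not prescriptions):

```python
optimizer = BootstrapFewShot(
    metric=exact_match,         # keep a demo only if this returns True
    max_bootstrapped_demos=4,   # model-generated demos per predictor
    max_labeled_demos=4,        # demos copied verbatim from the trainset
    max_rounds=1,               # bootstrapping passes over the training data
)
```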
🧠 Using with Chain of Thought
```python
class CoTQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        result = self.generate(question=question)
        return dspy.Prediction(
            answer=result.answer,
            reasoning=result.rationale,  # Automatically generated!
        )

# Bootstrap with Chain of Thought
optimizer = BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=8,
    teacher_settings=dict(lm=dspy.settings.lm),
)
compiled_cot = optimizer.compile(CoTQA(), trainset=trainset)
```
The magic: The reasoning steps are automatically generated during bootstrapping—you only provide input-output pairs!
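You can see this for yourself by inspecting the demonstrations bootstrapped onto the compiled program; each one carries the reasoning the model generated. (A sketch: the reasoning field is named `rationale` in older DSPy releases and `reasoning` in newer ones, so it is read defensively here.)

```python
# Print the bootstrapped demonstrations, including generated reasoning
for predictor in compiled_cot.predictors():
    for demo in predictor.demos:
        print("Q:", demo.question)
        print("Reasoning:", getattr(demo, "rationale", None) or getattr(demo, "reasoning", None))
        print("A:", demo.answer)
        print("---")
```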
📊 Defining Metrics
Exact Match
```python
def exact_match_metric(example, pred, trace=None):
    """Simple exact string match."""
    return str(example.answer).lower() == str(pred.answer).lower()
```
Fuzzy Match
```python
def fuzzy_match(example, pred, trace=None):
    """Fuzzy matching with some tolerance."""
    from difflib import SequenceMatcher
    similarity = SequenceMatcher(None, example.answer, pred.answer).ratio()
    return similarity > 0.9
```
F1 Score for QA
```python
def qa_f1_metric(example, pred, trace=None):
    """Token-level F1 score for QA tasks."""
    from collections import Counter

    pred_tokens = str(pred.answer).lower().split()
    true_tokens = str(example.answer).lower().split()

    # Count overlapping tokens, respecting multiplicity
    common = Counter(pred_tokens) & Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```
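One detail worth knowing: DSPy passes the execution trace to your metric during bootstrapping (so `trace is not None`), while during ordinary evaluation `trace` is `None`. A common pattern, sketched below with an arbitrary 0.8 threshold, is to be strict when selecting demonstrations and graded when reporting scores:

```python
def graded_f1_metric(example, pred, trace=None):
    """Strict pass/fail while bootstrapping, graded F1 while evaluating."""
    score = qa_f1_metric(example, pred)
    if trace is not None:
        # Compiling: only accept high-quality demonstrations
        return score >= 0.8
    # Evaluating: report the raw F1 score
    return score
```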
✨ Benefits of Weak Supervision
- **Reduced Annotation Cost**: No need to write detailed reasoning chains; only final answers are required
- **Consistent Quality**: Generated reasoning follows consistent patterns
- **Rapid Prototyping**: Test new tasks with minimal data preparation
- **Better Coverage**: Generates diverse reasoning strategies automatically
⚠️ Common Pitfalls
Pitfall 1: Too Many Examples
```python
# Problem: Overfitting
optimizer = BootstrapFewShot(max_bootstrapped_demos=50)  # Too many!

# Solution: Use reasonable limits
optimizer = BootstrapFewShot(max_bootstrapped_demos=8)   # Better
```
Pitfall 2: Poor Metric
```python
# Problem: Metric doesn't reflect actual performance
def bad_metric(example, pred, trace=None):
    return len(pred.answer) > 10  # Wrong!

# Solution: Use meaningful metrics
def good_metric(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
```
Pitfall 3: Insufficient Diversity
```python
# Problem: All examples are similar
similar_examples = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+3?", answer="6").with_inputs("question"),
    # ... all simple math
]

# Solution: Include diverse examples
diverse_examples = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Explain photosynthesis", answer="...").with_inputs("question"),
]
```
📈 Evaluating Results
```python
# Compare baseline vs compiled
baseline = SimpleQA()
compiled = optimizer.compile(SimpleQA(), trainset=trainset)

# Evaluate both on a held-out test set: a list of dspy.Example objects
# built the same way as trainset but not used during compilation
def evaluate(program, testset):
    correct = 0
    for example in testset:
        pred = program(question=example.question)
        if pred.answer.lower() == example.answer.lower():
            correct += 1
    return correct / len(testset)

baseline_score = evaluate(baseline, testset)
compiled_score = evaluate(compiled, testset)

print(f"Baseline accuracy: {baseline_score:.2%}")
print(f"Compiled accuracy: {compiled_score:.2%}")
print(f"Improvement: {compiled_score - baseline_score:.2%}")
```