🎯 Introduction
BootstrapFewShot is one of DSPy's most powerful optimizers. It automatically generates and selects high-quality few-shot examples to improve your program's performance. Instead of manually crafting demonstrations, you provide input-output pairs and a metric, and BootstrapFewShot searches for effective demonstrations for your specific task.
Key innovation: Weak supervision—train models without hand-labeled intermediate steps!
🔬 Weak Supervision
A key innovation from the Demonstrate-Search-Predict paper is weak supervision. In simple terms, this is like a teacher who doesn't check every line of your scratchpad math, but only grades the final answer. If you get the final answer right using your own method, the teacher says "Good job, do it that way again."
In DSPy, "weak supervision" allows you to train models without hand-labeled intermediate reasoning steps:
```python
# Traditional approach requires manually annotated reasoning
traditional_training = [
    dspy.Example(
        question="What is 15 * 23?",
        reasoning="Step 1: 15 * 20 = 300\nStep 2: 15 * 3 = 45\nStep 3: 300 + 45 = 345",
        answer="345",
    ),
    # ... many more with detailed reasoning
]

# With weak supervision, you only need:
weak_supervision_training = [
    dspy.Example(question="What is 15 * 23?", answer="345"),
    dspy.Example(question="What is 12 * 17?", answer="204"),
    # ... just input-output pairs!
]

# BootstrapFewShot will automatically generate the reasoning!
```
⚙️ How BootstrapFewShot Works
1. **Initial Generation**: Runs the unoptimized program (acting as its own teacher by default) over the training inputs to produce candidate demonstrations
2. **Quality Filtering**: Scores each candidate's trace with your metric
3. **Example Selection**: Keeps the demonstrations that pass the metric, up to `max_bootstrapped_demos`
4. **Iterative Refinement**: Repeats the process for up to `max_rounds` rounds to improve example quality
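Conceptually, the loop looks something like the sketch below. This is an illustration of the idea rather than DSPy's actual implementation; `teacher` here is simply the program used to generate traces (by default a copy of the student), and the field names follow the QA examples used throughout this guide.

```python
import dspy

def bootstrap_demos_sketch(student, teacher, trainset, metric, max_demos=4):
    """Illustrative sketch of the bootstrapping loop (not DSPy's real code)."""
    demos = []
    for example in trainset:
        # Run the teacher program on a training input.
        prediction = teacher(question=example.question)

        # Keep the demonstration only if the metric approves the output.
        # A real trace would also capture intermediate fields
        # (e.g. generated reasoning), which become part of the demo.
        if metric(example, prediction):
            demos.append(
                dspy.Example(question=example.question, answer=prediction.answer)
            )
        if len(demos) >= max_demos:
            break

    # Attach the surviving demonstrations to the student's predictors.
    for predictor in student.predictors():
        predictor.demos = demos
    return student
```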
📦 Basic Usage
```python
import dspy
from dspy.teleprompt import BootstrapFewShot

# 1. Define your program
class SimpleQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.generate(question=question)

# 2. Define evaluation metric
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# 3. Prepare training data
trainset = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Who wrote Romeo and Juliet?", answer="William Shakespeare").with_inputs("question"),
]

# 4. Create optimizer and compile
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_qa = optimizer.compile(SimpleQA(), trainset=trainset)

# 5. Use the compiled program
result = compiled_qa(question="What is 3+3?")
print(result.answer)  # Should be "6"
```
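Once compiled, the selected demonstrations are attached to the program's predictors, and the program can be saved so you don't have to re-run optimization. A minimal sketch (the file name is arbitrary; `demos` is the attribute DSPy stores few-shot examples on):

```python
# Inspect the demonstrations BootstrapFewShot attached to the predictor
for demo in compiled_qa.generate.demos:
    print(demo)

# Save the compiled program's state, then rebuild and reload it later
compiled_qa.save("compiled_qa.json")
reloaded = SimpleQA()
reloaded.load("compiled_qa.json")
```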
🎛️ BootstrapFewShot Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `metric` | Callable | `None` | Function used to decide whether a bootstrapped demonstration is kept; if omitted, every successful trace is accepted |
| `max_bootstrapped_demos` | int | 4 | Maximum number of model-generated demonstrations per predictor |
| `max_labeled_demos` | int | 16 | Maximum number of demonstrations taken directly from the labeled trainset |
| `max_rounds` | int | 1 | Number of bootstrapping rounds over the training data |
| `teacher_settings` | dict | `None` | Settings (e.g. a different LM) for the teacher program that generates demonstrations |
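Putting these together, an explicit configuration might look like the following (the values are reasonable starting points, not prescriptions):

```python
optimizer = BootstrapFewShot(
    metric=exact_match,         # keep a demo only if this returns True
    max_bootstrapped_demos=4,   # model-generated demos per predictor
    max_labeled_demos=4,        # demos copied verbatim from the trainset
    max_rounds=1,               # bootstrapping passes over the training data
)
```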
🧠 Using with Chain of Thought
```python
class CoTQA(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        result = self.generate(question=question)
        return dspy.Prediction(
            answer=result.answer,
            reasoning=result.rationale,  # Automatically generated!
        )

# Bootstrap with Chain of Thought
optimizer = BootstrapFewShot(
    metric=exact_match,
    max_bootstrapped_demos=8,
    teacher_settings=dict(lm=dspy.settings.lm),
)
compiled_cot = optimizer.compile(CoTQA(), trainset=trainset)
```
The magic: The reasoning steps are automatically generated during bootstrapping—you only provide input-output pairs!
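You can see this for yourself by inspecting the demonstrations bootstrapped onto the compiled program; each one carries the reasoning the model generated. (A sketch: the reasoning field is named `rationale` in older DSPy releases and `reasoning` in newer ones, so it is read defensively here.)

```python
# Print the bootstrapped demonstrations, including generated reasoning
for predictor in compiled_cot.predictors():
    for demo in predictor.demos:
        print("Q:", demo.question)
        print("Reasoning:", getattr(demo, "rationale", None) or getattr(demo, "reasoning", None))
        print("A:", demo.answer)
        print("---")
```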
📊 Defining Metrics
Exact Match
```python
def exact_match_metric(example, pred, trace=None):
    """Simple exact string match."""
    return str(example.answer).lower() == str(pred.answer).lower()
```
Fuzzy Match
```python
def fuzzy_match(example, pred, trace=None):
    """Fuzzy matching with some tolerance."""
    from difflib import SequenceMatcher
    similarity = SequenceMatcher(None, example.answer, pred.answer).ratio()
    return similarity > 0.9
```
F1 Score for QA
```python
def qa_f1_metric(example, pred, trace=None):
    """Token-level F1 score for QA tasks."""
    from collections import Counter

    pred_tokens = str(pred.answer).lower().split()
    true_tokens = str(example.answer).lower().split()

    # Count overlapping tokens, respecting multiplicity
    common = Counter(pred_tokens) & Counter(true_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0

    precision = num_same / len(pred_tokens)
    recall = num_same / len(true_tokens)
    return 2 * precision * recall / (precision + recall)
```
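One detail worth knowing: DSPy passes the execution trace to your metric during bootstrapping (so `trace is not None`), while during ordinary evaluation `trace` is `None`. A common pattern, sketched below with an arbitrary 0.8 threshold, is to be strict when selecting demonstrations and graded when reporting scores:

```python
def graded_f1_metric(example, pred, trace=None):
    """Strict pass/fail while bootstrapping, graded F1 while evaluating."""
    score = qa_f1_metric(example, pred)
    if trace is not None:
        # Compiling: only accept high-quality demonstrations
        return score >= 0.8
    # Evaluating: report the raw F1 score
    return score
```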
✨ Benefits of Weak Supervision
- **Reduced Annotation Cost**: No need to write detailed reasoning chains; only final answers are required
- **Consistent Quality**: Generated reasoning follows consistent patterns
- **Rapid Prototyping**: Test new tasks with minimal data preparation
- **Better Coverage**: Generates diverse reasoning strategies automatically
⚠️ Common Pitfalls
Pitfall 1: Too Many Examples
```python
# Problem: Overfitting
optimizer = BootstrapFewShot(max_bootstrapped_demos=50)  # Too many!

# Solution: Use reasonable limits
optimizer = BootstrapFewShot(max_bootstrapped_demos=8)   # Better
```
Pitfall 2: Poor Metric
```python
# Problem: Metric doesn't reflect actual performance
def bad_metric(example, pred, trace=None):
    return len(pred.answer) > 10  # Wrong!

# Solution: Use meaningful metrics
def good_metric(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()
```
Pitfall 3: Insufficient Diversity
```python
# Problem: All examples are similar
similar_examples = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="What is 3+3?", answer="6").with_inputs("question"),
    # ... all simple math
]

# Solution: Include diverse examples
diverse_examples = [
    dspy.Example(question="What is 2+2?", answer="4").with_inputs("question"),
    dspy.Example(question="Capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="Explain photosynthesis", answer="...").with_inputs("question"),
]
```
📈 Evaluating Results
```python
# Compare baseline vs compiled
baseline = SimpleQA()
compiled = optimizer.compile(SimpleQA(), trainset=trainset)

# Evaluate both on a held-out test set: a list of dspy.Example objects
# built the same way as trainset but not used during compilation
def evaluate(program, testset):
    correct = 0
    for example in testset:
        pred = program(question=example.question)
        if pred.answer.lower() == example.answer.lower():
            correct += 1
    return correct / len(testset)

baseline_score = evaluate(baseline, testset)
compiled_score = evaluate(compiled, testset)

print(f"Baseline accuracy: {baseline_score:.2%}")
print(f"Compiled accuracy: {compiled_score:.2%}")
print(f"Improvement: {compiled_score - baseline_score:.2%}")
```