Dataset Curation
1. Ensure Representative Data
Your evaluation set should mirror the distribution of inputs your system will actually see. A dataset of only "easy" questions gives you a false sense of security.
import dspy

# BAD: Homogeneous data
data = [dspy.Example(q="2+2?", a="4"), dspy.Example(q="3+3?", a="6")]

# GOOD: Diverse data with edge cases
data = [
    dspy.Example(q="2+2?", a="4"),  # Simple
    dspy.Example(q="What is the meaning of lif", a="Likely typo..."),  # Edge case: malformed input
    dspy.Example(q="", a="Please provide input"),  # Empty input
]
2. Balance Your Dataset
Ensure key categories are equally represented to prevent bias towards majority classes.
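A rough way to check and enforce balance before evaluating (this sketch assumes each example carries a hypothetical category field; the helper name is illustrative):
import random
from collections import Counter

def balance_by_category(examples, seed=42):
    # Inspect the raw skew first (assumes a hypothetical `category` field)
    print(Counter(ex.category for ex in examples))

    # Group examples by category
    by_cat = {}
    for ex in examples:
        by_cat.setdefault(ex.category, []).append(ex)

    # Downsample every category to the size of the smallest one
    n = min(len(group) for group in by_cat.values())
    rng = random.Random(seed)
    balanced = [ex for group in by_cat.values() for ex in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced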
Metric Design
1. Measure What Matters
Don't use proxy metrics just because they are easy to compute. Length does not equal quality.
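For instance, a length-based proxy is trivial to compute but rewards verbosity rather than correctness. A minimal sketch of the contrast, assuming (as in the examples above) that predictions expose an answer field and that a simple containment check approximates success for the user:
# BAD: proxy metric that rewards long answers, not correct ones
def length_metric(example, pred, trace=None):
    return len(pred.answer) > 100

# BETTER: checks whether the gold answer actually shows up in the prediction
def contains_answer(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()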
2. Make Metrics Robust
Handle formatting variations (case sensitivity, punctuation) gracefully.
import re

def robust_metric(example, pred, trace=None):
    def normalize(text):
        # Lowercase, trim whitespace, and drop punctuation
        return re.sub(r"[^\w\s]", "", text.lower().strip())

    # Normalize before comparing
    return normalize(example.answer) == normalize(pred.answer)
Avoiding Data Leakage
Critical: Data leakage—where test data overlaps with training data—invalidates your entire evaluation.
Prevention Strategies
- Split by Date: Train on past data, test on future data.
- Deduplicate: Remove identical or near-identical examples before splitting.
- Verify Disjoint Sets: Programmatically check that len(set(train) & set(test)) == 0, i.e. that no example appears in both splits (see the sketch below).
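A minimal sketch of deduplication plus a disjointness check, assuming each dspy.Example exposes its question text as q (comparing on a normalized key avoids relying on Example objects hashing the way you expect):
def split_key(ex):
    # Normalized comparison key; assumes a `q` field holds the question text
    return ex.q.lower().strip()

def dedupe(examples):
    seen, unique = set(), []
    for ex in examples:
        k = split_key(ex)
        if k not in seen:
            seen.add(k)
            unique.append(ex)
    return unique

def assert_disjoint(train, test):
    overlap = {split_key(ex) for ex in train} & {split_key(ex) for ex in test}
    assert not overlap, f"Data leakage: {len(overlap)} examples appear in both splits"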
Reproducibility
1. Fix Random Seeds
import random

# Pin the seed so shuffles, samples, and splits are repeatable across runs
random.seed(42)
2. Version Control Data
Treat your datasets like code. Track versions to understand performance changes over time.
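One lightweight approach (the file name here is hypothetical) is to record a content hash of the dataset file next to every result you log, so any silent change to the data surfaces as a new version:
import hashlib

def dataset_version(path="eval_set.json"):
    # Hash the raw bytes of the dataset file; any edit produces a new version
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()[:12]

# Log this alongside your evaluation scores
print("dataset_version:", dataset_version())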
Checklist for Every Evaluation
✅ Representative Data
Does the data cover edge cases and real-world variety?
✅ Metric Validity
Does the metric actually measure success for the user?
✅ No Leakage
Are training and testing sets completely disjoint?
✅ Baselines
Did you compare against a simple baseline?