Chapter 4 · Section 6

Best Practices

Proven strategies for reliable, reproducible, and effective evaluation.


Dataset Curation

1. Ensure Representative Data

Your evaluation set should mirror the real-world distribution of inputs. A dataset of only "easy" questions gives you a false sense of security.

# BAD: Homogeneous data
data = [dspy.Example(q="2+2?", a="4"), dspy.Example(q="3+3?", a="6")]

# GOOD: Diverse data with edge cases
data = [
    dspy.Example(q="2+2?", a="4"),                  # Simple
    dspy.Example(q="What is the meaning of lif", a="Likely typo..."), # Edge case
    dspy.Example(q="", a="Please provide input")    # Empty input
]

2. Balance Your Dataset

Ensure key categories are equally represented to prevent bias towards majority classes.
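One simple way to do this is to downsample every category to the size of the smallest one. A minimal sketch, assuming each example carries a `category` field (the field name is illustrative):

```python
import random
from collections import defaultdict

def balance_by_category(examples, key=lambda ex: ex.category, seed=0):
    """Downsample every category to the size of the smallest one."""
    groups = defaultdict(list)
    for ex in examples:
        groups[key(ex)].append(ex)
    n = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    balanced = [ex for group in groups.values() for ex in rng.sample(group, n)]
    rng.shuffle(balanced)
    return balanced
```

Oversampling the minority categories is an alternative when you cannot afford to discard data.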

Metric Design

1. Measure What Matters

Don't use a proxy metric just because it is easy to compute. Response length does not equal quality.
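To make the contrast concrete, here is a sketch of a proxy metric next to one that tests the property users actually care about (function names and the length threshold are illustrative):

```python
# BAD: a proxy metric that rewards verbosity, not correctness.
def length_metric(example, pred, trace=None):
    return len(pred.answer) > 100

# BETTER: checks whether the gold answer actually appears in the prediction.
def contains_answer_metric(example, pred, trace=None):
    return example.answer.lower() in pred.answer.lower()
```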

2. Make Metrics Robust

Handle formatting variations (case sensitivity, punctuation) gracefully.

def robust_metric(example, pred, trace=None):
    # Normalize case, whitespace, and trailing punctuation before comparing
    def clean(text):
        return text.lower().strip().rstrip(".!?")
    return clean(example.answer) == clean(pred.answer)

Avoiding Data Leakage

⚠️ Critical: Data leakage, where test data overlaps with training data, invalidates your entire evaluation.

Prevention Strategies

  • Split by Date: Train on past data, test on future data.
  • Deduplicate: Remove identical or near-identical examples before splitting.
  • Verify Disjoint Sets: Programmatically check that len(set(train) & set(test)) == 0.
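The last two strategies can be combined into a small pre-split check. A sketch, assuming each example has a `q` field as in the snippets above:

```python
def deduplicate(examples, key=lambda ex: ex.q.strip().lower()):
    """Keep only the first occurrence of each normalized question."""
    seen, unique = set(), []
    for ex in examples:
        k = key(ex)
        if k not in seen:
            seen.add(k)
            unique.append(ex)
    return unique

def assert_disjoint(train, test, key=lambda ex: ex.q.strip().lower()):
    """Fail fast if any test question also appears in the training set."""
    overlap = {key(ex) for ex in train} & {key(ex) for ex in test}
    assert not overlap, f"Data leakage: {len(overlap)} shared example(s)"
```

Note that the check normalizes before comparing; exact set intersection alone misses near-duplicates that differ only in case or whitespace.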

Reproducibility

1. Fix Random Seeds

import random
random.seed(42)

# Seed every library you use, not just the stdlib
# (e.g., numpy.random.seed, torch.manual_seed).

2. Version Control Data

Treat your datasets like code. Track versions to understand performance changes over time.
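A lightweight way to do this without a full data-versioning tool is to log a content hash of the dataset alongside every evaluation run. A sketch, assuming each example's fields are accessible via `vars()` and are JSON-serializable:

```python
import hashlib
import json

def dataset_fingerprint(examples):
    """Short, stable hash of the dataset; record it with every eval result."""
    payload = json.dumps([vars(ex) for ex in examples], sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
```

If two runs report different scores but the same fingerprint, the data is not the culprit.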

Checklist for Every Evaluation

✅ Data Representation

Does the data cover edge cases and real-world variety?

✅ Metric Validity

Does the metric actually measure success for the user?

✅ No Leakage

Are training and testing sets completely disjoint?

✅ Baselines

Did you compare against a simple baseline?
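A concrete way to run the baseline check: score a trivial predictor with the same metric and harness you use for your real program. A sketch under illustrative names, using exact match as the metric:

```python
from types import SimpleNamespace

def exact_match(example, pred, trace=None):
    return example.a.strip().lower() == pred.answer.strip().lower()

def evaluate(program, dataset, metric):
    """Average metric score of a program over a dataset."""
    return sum(metric(ex, program(ex)) for ex in dataset) / len(dataset)

class MajorityBaseline:
    """Trivial baseline: always predict the most common training answer."""
    def __init__(self, trainset):
        answers = [ex.a for ex in trainset]
        self.answer = max(set(answers), key=answers.count)

    def __call__(self, example):
        return SimpleNamespace(answer=self.answer)
```

If your pipeline barely beats this baseline, suspect the metric or the data before crediting (or blaming) the model.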