Chapter 4 · Section 5

Evaluation Loops

Systematically assess module performance with parallel execution and detailed reporting.

The Evaluate Class

DSPy's Evaluate class is the workhorse for systematic testing: it runs every example in your dataset through your module, applies your metric to each prediction, and aggregates the results into a single score.

Basic Usage

import dspy

# Setup: assumes `devset` (a list of dspy.Example) and `accuracy`
# (a metric function) are already defined; see the sketch below.
evaluate = dspy.Evaluate(
    devset=devset,           # Data to test
    metric=accuracy,         # Metric function
    num_threads=8,           # Parallel threads
    display_progress=True    # Show progress bar
)

# Run the evaluation on any dspy.Module
score = evaluate(module)
print(f"Accuracy: {score}%")

Configuration Parameters

Parameter          Description                                    Default
devset             List of Examples to evaluate on                Required
metric             Function to score predictions                  Required
num_threads        Number of threads for parallel execution       1
display_progress   Show progress bar in terminal                  False
display_table      Print table of N results (or True for all)     False
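
The display_table option is easy to overlook. A brief sketch of how it might be used, assuming display_table=5 prints a summary table of the first 5 results after scoring (pass True to print every row):

# Print a results table for the first 5 examples after evaluation
evaluate = dspy.Evaluate(
    devset=devset,
    metric=accuracy,
    display_progress=True,
    display_table=5       # use True to print every row
)
evaluate(module)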

Parallel Evaluation

Running evaluations sequentially can be slow, because each example waits on a full LM call. Set num_threads to run examples in parallel and speed up the process significantly.

💡

Recommendation: Use 4-8 threads for standard API keys. Higher values might hit rate limits.

# Run up to 8 evaluations concurrently (roughly up to 8x faster)
evaluate = dspy.Evaluate(
    devset=large_devset,
    metric=metric,
    num_threads=8
)
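
If you want to confirm the speedup on your own provider and rate limits, a quick wall-clock comparison is enough. A rough sketch; actual gains depend on the API's concurrency limits:

import time

start = time.perf_counter()
score = evaluate(module)
elapsed = time.perf_counter() - start
print(f"Evaluated {len(large_devset)} examples in {elapsed:.1f}s -> {score}%")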

Analyzing Results

To get more than just the headline score, pass return_all_scores=True (to also receive the list of per-example scores) or return_outputs=True (to also receive each example's prediction alongside its score).

# Get detailed per-example results
evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    return_outputs=True
)

# With return_outputs=True, the call returns the overall score together
# with a list of (example, prediction, score) tuples
overall_score, results = evaluate(module)

for example, pred, score in results:
    print(f"Q: {example.question}")
    print(f"Pred: {pred.answer}")
    print(f"Score: {score}")
    print("---")

Evaluation Workflows

1. Development Loop

Quick check on a small subset during active coding.

# Check 5 examples quickly
evaluate(module, devset=devset[:5])

2. Pre-Commit Check

Ensure no regression before committing code.

score = evaluate(module)
assert score > 85.0, "Quality dropped below threshold!"
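
If the project already runs pytest, the same check can live in the test suite so it runs automatically. A minimal sketch; the file name and threshold are illustrative:

# test_quality.py (hypothetical regression test)
def test_no_quality_regression():
    score = evaluate(module)
    assert score > 85.0, f"Quality dropped to {score}%"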

3. MLflow Integration

Track experiments over time.

import mlflow

with mlflow.start_run():
    score = evaluate(module)
    mlflow.log_metric("accuracy", score)
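
It can also help to log the evaluation setup next to the score so runs stay comparable over time. A sketch with illustrative parameter names:

with mlflow.start_run():
    # Record the configuration that produced this score
    mlflow.log_param("devset_size", len(devset))
    mlflow.log_param("num_threads", 8)

    score = evaluate(module)
    mlflow.log_metric("accuracy", score)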