Chapter 4 · Section 5

Evaluation Loops

Systematically assess module performance with parallel execution and detailed reporting.

The Evaluate Class

DSPy's Evaluate class is the workhorse for systematic testing: it runs every example in your dataset through your module, applies your metric to each prediction, and aggregates the results into a single score.

Basic Usage

import dspy

# Setup: assumes `devset` (a list of dspy.Example) and `accuracy`
# (a metric function) are already defined; see the sketch below.
evaluate = dspy.Evaluate(
    devset=devset,           # Data to test
    metric=accuracy,         # Metric function
    num_threads=8,           # Parallel threads
    display_progress=True    # Show progress bar
)

# Run the evaluation on any dspy.Module
score = evaluate(module)
print(f"Accuracy: {score}%")

Configuration Parameters

Parameter          Description                                    Default
devset             List of Examples to evaluate on                Required
metric             Function to score predictions                  Required
num_threads        Number of threads for parallel execution       1
display_progress   Show progress bar in terminal                  False
display_table      Print table of N results (or True for all)     False
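
The display_table option is easy to overlook. A brief sketch of how it might be used, assuming display_table=5 prints a summary table of the first 5 results after scoring (pass True to print every row):

# Print a results table for the first 5 examples after evaluation
evaluate = dspy.Evaluate(
    devset=devset,
    metric=accuracy,
    display_progress=True,
    display_table=5       # use True to print every row
)
evaluate(module)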

Parallel Evaluation

Running evaluations sequentially can be slow, because each example waits on a full LM call. Set num_threads to run examples in parallel and speed up the process significantly.

💡

Recommendation: Use 4-8 threads for standard API keys. Higher values might hit rate limits.

# Run up to 8 evaluations concurrently (roughly up to 8x faster)
evaluate = dspy.Evaluate(
    devset=large_devset,
    metric=metric,
    num_threads=8
)
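
If you want to confirm the speedup on your own provider and rate limits, a quick wall-clock comparison is enough. A rough sketch; actual gains depend on the API's concurrency limits:

import time

start = time.perf_counter()
score = evaluate(module)
elapsed = time.perf_counter() - start
print(f"Evaluated {len(large_devset)} examples in {elapsed:.1f}s -> {score}%")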

Analyzing Results

To get more than just the headline score, pass return_all_scores=True (to also receive the list of per-example scores) or return_outputs=True (to also receive each example's prediction alongside its score).

# Get detailed per-example results
evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    return_outputs=True
)

# With return_outputs=True, the call returns the overall score together
# with a list of (example, prediction, score) tuples
overall_score, results = evaluate(module)

for example, pred, score in results:
    print(f"Q: {example.question}")
    print(f"Pred: {pred.answer}")
    print(f"Score: {score}")
    print("---")

Evaluation Workflows

1. Development Loop

Quick check on a small subset during active coding.

# Check 5 examples quickly
evaluate(module, devset=devset[:5])

2. Pre-Commit Check

Ensure no regression before committing code.

score = evaluate(module)
assert score > 85.0, "Quality dropped below threshold!"
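
If the project already runs pytest, the same check can live in the test suite so it runs automatically. A minimal sketch; the file name and threshold are illustrative:

# test_quality.py (hypothetical regression test)
def test_no_quality_regression():
    score = evaluate(module)
    assert score > 85.0, f"Quality dropped to {score}%"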

3. MLflow Integration

Track experiments over time.

import mlflow

with mlflow.start_run():
    score = evaluate(module)
    mlflow.log_metric("accuracy", score)
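
It can also help to log the evaluation setup next to the score so runs stay comparable over time. A sketch with illustrative parameter names:

with mlflow.start_run():
    # Record the configuration that produced this score
    mlflow.log_param("devset_size", len(devset))
    mlflow.log_param("num_threads", 8)

    score = evaluate(module)
    mlflow.log_metric("accuracy", score)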