# The Evaluate Class

DSPy's `Evaluate` class is the workhorse for systematic testing. It automates the process of running your dataset through your module and applying a metric to each prediction.
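The examples below assume a small devset of `dspy.Example` objects and a simple metric function. Here is a minimal sketch; the `question`/`answer` field names and the `accuracy` metric are illustrative placeholders, not part of the DSPy API:

```python
import dspy

# Tiny illustrative devset; .with_inputs() marks which fields are model inputs
devset = [
    dspy.Example(question="What is the capital of France?", answer="Paris").with_inputs("question"),
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

# Simple exact-match metric following DSPy's (example, prediction, trace) convention
def accuracy(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()
```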
## Basic Usage

```python
import dspy

# Set up the evaluator
evaluate = dspy.Evaluate(
    devset=devset,            # Data to test
    metric=accuracy,          # Metric function
    num_threads=8,            # Parallel threads
    display_progress=True,    # Show progress bar
)

# Run the evaluation
score = evaluate(module)
print(f"Accuracy: {score}%")
```
## Configuration Parameters

| Parameter | Description | Default |
|---|---|---|
| `devset` | List of `Example`s to evaluate on | Required |
| `metric` | Function to score predictions | Required |
| `num_threads` | Number of threads for parallel execution | `1` |
| `display_progress` | Show progress bar in terminal | `False` |
| `display_table` | Print table of N results (or `True` for all) | `False` |
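For a quick qualitative look at individual predictions, `display_table` prints the first N result rows inline. A minimal sketch, reusing the placeholder `devset` and `accuracy` from above:

```python
# Print a table of the first 5 predictions alongside their scores
evaluate = dspy.Evaluate(
    devset=devset,
    metric=accuracy,
    display_progress=True,
    display_table=5,   # or True to print every row
)
evaluate(module)
```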
## Parallel Evaluation

Running evaluations sequentially can be slow. Use `num_threads` to speed up the process significantly.

**Recommendation:** Use 4-8 threads for standard API keys. Higher values might hit rate limits.
```python
# Speed up evaluation 8x
evaluate = dspy.Evaluate(
    devset=large_devset,
    metric=metric,
    num_threads=8,
)
```
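To see what a particular thread count actually buys you under your own rate limits, a rough timing comparison is easy to run. A sketch using only the standard library, with the placeholder names from above:

```python
import time

# Compare wall-clock time at different thread counts
for threads in (1, 8):
    evaluate = dspy.Evaluate(devset=devset, metric=accuracy, num_threads=threads)
    start = time.time()
    evaluate(module)
    print(f"num_threads={threads}: {time.time() - start:.1f}s")
```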
## Analyzing Results

To get more than just a single score, use `return_all_scores=True` or `return_outputs=True`.
```python
# Get detailed per-example results
evaluate = dspy.Evaluate(
    devset=devset,
    metric=metric,
    return_outputs=True,
)

# With return_outputs=True the call also returns the per-example outputs.
# (Recent DSPy versions instead return a result object whose .results field
# holds the same (example, prediction, score) triples.)
score, results = evaluate(module)

for example, prediction, example_score in results:
    print(f"Q: {example.question}")
    print(f"Pred: {prediction.answer}")
    print(f"Score: {example_score}")
    print("---")
```
## Evaluation Workflows

### 1. Development Loop
Quick check on a small subset during active coding.
```python
# Check 5 examples quickly
evaluate(module, devset=devset[:5])
```
### 2. Pre-Commit Check
Ensure no regression before committing code.
```python
score = evaluate(module)
assert score > 85.0, "Quality dropped below threshold!"
```
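In CI, this gate typically lives in a test file. A minimal pytest-style sketch; `build_module` and `load_devset` are hypothetical helpers, and `accuracy` is the placeholder metric from earlier:

```python
# test_quality.py -- hypothetical regression gate run in CI
import dspy

def test_module_quality():
    module = build_module()   # hypothetical factory for your DSPy module
    devset = load_devset()    # hypothetical loader for the eval data
    evaluate = dspy.Evaluate(devset=devset, metric=accuracy, num_threads=8)
    score = evaluate(module)
    assert score > 85.0, f"Quality dropped to {score}%, below the 85% threshold"
```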
### 3. MLflow Integration
Track experiments over time.
```python
import mlflow

with mlflow.start_run():
    score = evaluate(module)
    mlflow.log_metric("accuracy", score)
```
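Logging the settings behind each score makes runs comparable later. A sketch using standard MLflow calls; the parameter names are illustrative:

```python
import mlflow

with mlflow.start_run():
    # Record the configuration that produced this score
    mlflow.log_param("num_threads", 8)
    mlflow.log_param("devset_size", len(devset))

    score = evaluate(module)
    mlflow.log_metric("accuracy", score)
```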