## Metric Function Anatomy

A DSPy metric is simply a Python function that evaluates a prediction's quality by comparing it to the ground truth.
```python
def metric(example, pred, trace=None):
    """
    Evaluate prediction quality.

    Args:
        example: The original Example with inputs AND expected outputs
        pred: The Prediction (module output) to evaluate
        trace: Optional trace info (used during optimization)

    Returns:
        bool or float: Score indicating quality (True/False or 0.0-1.0)
    """
    # Compare prediction to expected output
    return pred.answer == example.answer
```
### The Three Parameters

1. `example` (Ground Truth): Contains both the inputs sent to the model and the expected output labels.
2. `pred` (Prediction): The actual output generated by your DSPy module.
3. `trace` (Context): Indicates whether the metric is running during optimization (demo filtering) or evaluation (scoring).
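To make this concrete, here is a minimal sketch of what a metric sees when it is called. The field names (`question`, `answer`) are illustrative and depend on your own signature; `metric` refers to the function defined above.

```python
import dspy

# Ground truth: `question` is marked as the input, `answer` is the expected label
example = dspy.Example(
    question="What is the capital of France?",
    answer="Paris",
).with_inputs("question")

# A prediction, as a DSPy module would return it
pred = dspy.Prediction(answer="Paris")

# Outside optimization, trace is simply None
print(metric(example, pred))  # True
```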
## Built-in Metrics

DSPy provides several ready-to-use metrics for common tasks.

### Semantic F1

Measures semantic overlap between the predicted and gold answers using a language model.
```python
from dspy.evaluate import SemanticF1

# Initialize the metric (decompositional=True scores key ideas separately)
metric = SemanticF1(decompositional=True)

# Use: returns a score between 0.0 and 1.0
score = metric(example, pred)
```
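Any metric with this signature, built in or custom, can be passed to `dspy.Evaluate` to score an entire dev set. A brief sketch, assuming a `devset` (list of `dspy.Example`) and a compiled `program` already exist:

```python
import dspy
from dspy.evaluate import SemanticF1

metric = SemanticF1(decompositional=True)

# devset and program are assumed to be defined elsewhere
evaluator = dspy.Evaluate(
    devset=devset,
    metric=metric,
    num_threads=8,
    display_progress=True,
)

# Runs the program over devset and reports the aggregate metric score
results = evaluator(program)
```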
### Answer Correctness

```python
def answer_correctness(example, pred, trace=None):
    """Check whether the predicted and correct answers contain each other."""
    correct = example.answer.lower()
    predicted = pred.answer.lower()
    return correct in predicted or predicted in correct
```
## Creating Custom Metrics

### Simple Boolean Metrics

```python
def sentiment_accuracy(example, pred, trace=None):
    """Check if sentiment prediction matches ground truth."""
    return example.sentiment == pred.sentiment
```
### Composite Metrics
Combine multiple quality factors into a single score.
```python
def comprehensive_qa_metric(example, pred, trace=None):
    # 1. Correctness: the gold answer appears in the prediction
    correct = example.answer.lower() in pred.answer.lower()

    # 2. Completeness (simple length heuristic)
    complete = len(pred.answer) > 20

    # 3. No expressed uncertainty
    certain = "I don't know" not in pred.answer

    # Scoring logic: correctness is required, the other checks adjust the score
    if not correct:
        return 0.0
    if not complete:
        return 0.5
    if not certain:
        return 0.7
    return 1.0
```
## The Trace Parameter Deep Dive

The `trace` parameter is what enables DSPy's optimization: its value tells the metric whether it is filtering candidate demonstrations or measuring final performance.
| Mode | `trace` value | Goal | Return type |
|---|---|---|---|
| Optimization | not `None` (an object) | Select best demos | `bool` (True/False) |
| Evaluation | `None` | Measure performance | `float` (score) |
```python
def smart_metric(example, pred, trace=None):
    score = calculate_score(example, pred)  # your own scoring logic

    if trace is not None:
        # Optimization: be strict! Only accept high-scoring examples
        return score >= 0.9

    # Evaluation: return the actual score
    return score
```
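During compilation it is the optimizer, not you, that passes a non-`None` trace. A minimal sketch of wiring such a metric into an optimizer like `BootstrapFewShot`, assuming a `program` module and a `trainset` of examples are defined elsewhere:

```python
from dspy.teleprompt import BootstrapFewShot

optimizer = BootstrapFewShot(metric=smart_metric, max_bootstrapped_demos=4)

# While bootstrapping demonstrations, DSPy calls smart_metric with trace set,
# so only examples scoring >= 0.9 are kept as demos
compiled_program = optimizer.compile(program, trainset=trainset)
```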
## Specialized Metrics for Long-form Content
Evaluating long responses (like articles) requires more sophisticated metrics.
- Topic Coverage: Uses ROUGE scores to check if key concepts are covered.
- FactScore: Breaks text into atomic claims and verifies each against a knowledge source.
- Verifiability: Checks if claims are supported by citations (a rough version is sketched below).
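None of these exist as a single built-in function, so here is a deliberately rough sketch of a verifiability-style check. It only counts sentences that end with a bracketed citation marker such as `[1]`; the sentence splitting, regex, and 0.8 threshold are all illustrative choices, not DSPy APIs.

```python
import re

def verifiability(example, pred, trace=None):
    """Rough heuristic: fraction of sentences ending with a citation like [3]."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", pred.answer) if s.strip()]
    if not sentences:
        return 0.0

    cited = sum(1 for s in sentences if re.search(r"\[\d+\][.!?]?$", s))
    score = cited / len(sentences)

    if trace is not None:
        # Optimization: only accept responses where most sentences carry a citation
        return score >= 0.8
    return score
```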