Chapter 4 · Section 4

Defining Metrics

Master the art of measuring LLM performance with custom and built-in metrics.

Metric Function Anatomy

A DSPy metric is simply a Python function that evaluates a prediction by comparing it to the ground truth in an example.

def metric(example, pred, trace=None):
    """
    Evaluate prediction quality.

    Args:
        example: The original Example with inputs AND expected outputs
        pred: The Prediction (module output) to evaluate
        trace: Optional trace info (used during optimization)

    Returns:
        bool or float: Score indicating quality (True/False or 0.0-1.0)
    """
    # Compare prediction to expected output
    return pred.answer == example.answer

The Three Parameters

1. example (Ground Truth)

Contains both the inputs sent to the model and the expected output labels.

2. pred (Prediction)

The actual output generated by your DSPy module.

3. trace (Context)

None during evaluation (scoring); during optimization, DSPy passes the program's intermediate steps, and the metric typically acts as a strict filter.
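
For concreteness, here is a minimal sketch with a hypothetical QA example, showing what the metric defined above receives: a dspy.Example carrying both the input and the expected answer, and a Prediction standing in for a module's output.

import dspy

# Hypothetical QA example: "question" is marked as the input,
# "answer" remains the expected label.
example = dspy.Example(
    question="What is the capital of France?",
    answer="Paris",
).with_inputs("question")

# A Prediction standing in for a module's actual output.
pred = dspy.Prediction(answer="Paris")

print(metric(example, pred))  # True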

Built-in Metrics

DSPy provides ready-to-use metrics for common tasks, such as SemanticF1; simpler checks like answer correctness take only a few lines to write yourself.

Semantic F1

Measures semantic overlap between answers using a language model.

from dspy.evaluate import SemanticF1

# SemanticF1 judges overlap with the configured LM, so set one up first
# (e.g. dspy.configure(lm=...)).
# decompositional=True breaks each answer into individual key ideas
# before scoring precision and recall.
metric = SemanticF1(decompositional=True)

# Returns a float between 0.0 and 1.0
score = metric(example, pred)

Answer Correctness

def answer_correctness(example, pred, trace=None):
    """Case-insensitive containment check: passes if either answer contains the other."""
    correct = example.answer.lower()
    predicted = pred.answer.lower()
    return correct in predicted or predicted in correct

Creating Custom Metrics

Simple Boolean Metrics

def sentiment_accuracy(example, pred, trace=None):
    """Check if sentiment prediction matches ground truth."""
    return example.sentiment == pred.sentiment
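
A metric like this plugs straight into DSPy's Evaluate harness. A minimal sketch, assuming devset is a list of dspy.Example objects with a sentiment label and classify is the module being scored:

from dspy.evaluate import Evaluate

# Hypothetical setup: devset and classify are defined elsewhere.
evaluator = Evaluate(
    devset=devset,
    metric=sentiment_accuracy,
    num_threads=4,
    display_progress=True,
)

# Runs classify over the devset and averages the metric.
score = evaluator(classify)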

Composite Metrics

Combine multiple quality factors into a single score.

def comprehensive_qa_metric(example, pred, trace=None):
    """Tiered score: 0.0 if wrong, 0.5 if correct but too short,
    0.7 if correct but hedged, else 1.0."""
    # 1. Correctness
    correct = example.answer.lower() in pred.answer.lower()

    # 2. Completeness (length heuristic)
    complete = len(pred.answer) > 20

    # 3. No uncertainty
    certain = "I don't know" not in pred.answer

    # Scoring logic
    if not correct: return 0.0
    if not complete: return 0.5
    if not certain: return 0.7
    return 1.0
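
Tiered cutoffs like the one above are easy to read, but the same factors can also be blended into a weighted sum, which gives a smoother signal. A minimal sketch; the weights are arbitrary, illustrative choices:

def weighted_qa_metric(example, pred, trace=None):
    correct = example.answer.lower() in pred.answer.lower()
    complete = len(pred.answer) > 20
    certain = "I don't know" not in pred.answer

    # Illustrative weights; booleans count as 0 or 1 in the sum.
    return 0.6 * correct + 0.2 * complete + 0.2 * certain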

The Trace Parameter Deep Dive

The trace parameter is what lets a single metric serve two roles: a strict pass/fail filter during optimization and a graded score during evaluation.

Mode            trace value          Goal                  Return Type
Optimization    not None (object)    Select best demos     bool (True/False)
Evaluation      None                 Measure performance   float (score)

def smart_metric(example, pred, trace=None):
    # calculate_score is a placeholder for any scoring logic
    # that returns a float between 0.0 and 1.0.
    score = calculate_score(example, pred)

    if trace is not None:
        # Optimization: be strict! Only accept near-perfect examples
        return score >= 0.9

    # Evaluation: return the actual score
    return score
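
Metrics written this way plug directly into DSPy's optimizers, which call them with a non-None trace while bootstrapping demonstrations. A minimal sketch, assuming module is a DSPy program and trainset is a list of dspy.Example objects:

import dspy

# Hypothetical setup: `module` and `trainset` are defined elsewhere.
optimizer = dspy.BootstrapFewShot(metric=smart_metric)
compiled = optimizer.compile(module, trainset=trainset)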

Specialized Metrics for Long-form Content

Evaluating long responses (like articles) requires more sophisticated metrics.

  • Topic Coverage: Uses ROUGE scores to check if key concepts are covered.
  • FactScore: Breaks text into atomic claims and verifies each against a knowledge source.
  • Verifiability: Checks whether claims are supported by citations, as in the sketch below.
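
A minimal verifiability sketch, assuming citations appear as bracketed markers like [1] in pred.answer (a real metric would also check what each cited source actually says):

import re

def verifiability(example, pred, trace=None):
    """Fraction of sentences carrying at least one [n]-style citation."""
    # Rough sentence split on end punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", pred.answer) if s.strip()]
    if not sentences:
        return False if trace is not None else 0.0

    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    score = cited / len(sentences)

    # Strict filter during optimization, graded score during evaluation.
    if trace is not None:
        return score >= 0.8
    return score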