Chapter 4 · Section 4

Defining Metrics

Master the art of measuring LLM performance with custom and built-in metrics.

Metric Function Anatomy

A DSPy metric is simply a Python function that evaluates a prediction by comparing it to the ground truth in an example.

def metric(example, pred, trace=None):
    """
    Evaluate prediction quality.

    Args:
        example: The original Example with inputs AND expected outputs
        pred: The Prediction (module output) to evaluate
        trace: Optional trace info (used during optimization)

    Returns:
        bool or float: Score indicating quality (True/False or 0.0-1.0)
    """
    # Compare prediction to expected output
    return pred.answer == example.answer

The Three Parameters

1. example (Ground Truth)

Contains both the inputs sent to the model and the expected output labels.

2. pred (Prediction)

The actual output generated by your DSPy module.

3. trace (Context)

None during evaluation (scoring); during optimization, DSPy passes the program's intermediate steps, and the metric typically acts as a strict filter.
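
For concreteness, here is a minimal sketch with a hypothetical QA example, showing what the metric defined above receives: a dspy.Example carrying both the input and the expected answer, and a Prediction standing in for a module's output.

import dspy

# Hypothetical QA example: "question" is marked as the input,
# "answer" remains the expected label.
example = dspy.Example(
    question="What is the capital of France?",
    answer="Paris",
).with_inputs("question")

# A Prediction standing in for a module's actual output.
pred = dspy.Prediction(answer="Paris")

print(metric(example, pred))  # True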

Built-in Metrics

DSPy provides ready-to-use metrics for common tasks, such as SemanticF1; simpler checks like answer correctness take only a few lines to write yourself.

Semantic F1

Measures semantic overlap between answers using a language model.

from dspy.evaluate import SemanticF1

# SemanticF1 judges overlap with the configured LM, so set one up first
# (e.g. dspy.configure(lm=...)).
# decompositional=True breaks each answer into individual key ideas
# before scoring precision and recall.
metric = SemanticF1(decompositional=True)

# Returns a float between 0.0 and 1.0
score = metric(example, pred)

Answer Correctness

def answer_correctness(example, pred, trace=None):
    """Case-insensitive containment check: passes if either answer contains the other."""
    correct = example.answer.lower()
    predicted = pred.answer.lower()
    return correct in predicted or predicted in correct

Creating Custom Metrics

Simple Boolean Metrics

def sentiment_accuracy(example, pred, trace=None):
    """Check if sentiment prediction matches ground truth."""
    return example.sentiment == pred.sentiment
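
A metric like this plugs straight into DSPy's Evaluate harness. A minimal sketch, assuming devset is a list of dspy.Example objects with a sentiment label and classify is the module being scored:

from dspy.evaluate import Evaluate

# Hypothetical setup: devset and classify are defined elsewhere.
evaluator = Evaluate(
    devset=devset,
    metric=sentiment_accuracy,
    num_threads=4,
    display_progress=True,
)

# Runs classify over the devset and averages the metric.
score = evaluator(classify)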

Composite Metrics

Combine multiple quality factors into a single score.

def comprehensive_qa_metric(example, pred, trace=None):
    """Tiered score: 0.0 if wrong, 0.5 if correct but too short,
    0.7 if correct but hedged, else 1.0."""
    # 1. Correctness
    correct = example.answer.lower() in pred.answer.lower()

    # 2. Completeness (length heuristic)
    complete = len(pred.answer) > 20

    # 3. No uncertainty
    certain = "I don't know" not in pred.answer

    # Scoring logic
    if not correct: return 0.0
    if not complete: return 0.5
    if not certain: return 0.7
    return 1.0
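
Tiered cutoffs like the one above are easy to read, but the same factors can also be blended into a weighted sum, which gives a smoother signal. A minimal sketch; the weights are arbitrary, illustrative choices:

def weighted_qa_metric(example, pred, trace=None):
    correct = example.answer.lower() in pred.answer.lower()
    complete = len(pred.answer) > 20
    certain = "I don't know" not in pred.answer

    # Illustrative weights; booleans count as 0 or 1 in the sum.
    return 0.6 * correct + 0.2 * complete + 0.2 * certain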

The Trace Parameter Deep Dive

The trace parameter is what lets a single metric serve two roles: a strict pass/fail filter during optimization and a graded score during evaluation.

Mode            trace value          Goal                  Return Type
Optimization    not None (object)    Select best demos     bool (True/False)
Evaluation      None                 Measure performance   float (score)

def smart_metric(example, pred, trace=None):
    # calculate_score is a placeholder for any scoring logic
    # that returns a float between 0.0 and 1.0.
    score = calculate_score(example, pred)

    if trace is not None:
        # Optimization: be strict! Only accept near-perfect examples
        return score >= 0.9

    # Evaluation: return the actual score
    return score
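
Metrics written this way plug directly into DSPy's optimizers, which call them with a non-None trace while bootstrapping demonstrations. A minimal sketch, assuming module is a DSPy program and trainset is a list of dspy.Example objects:

import dspy

# Hypothetical setup: `module` and `trainset` are defined elsewhere.
optimizer = dspy.BootstrapFewShot(metric=smart_metric)
compiled = optimizer.compile(module, trainset=trainset)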

Specialized Metrics for Long-form Content

Evaluating long responses (like articles) requires more sophisticated metrics.

  • Topic Coverage: Uses ROUGE scores to check if key concepts are covered.
  • FactScore: Breaks text into atomic claims and verifies each against a knowledge source.
  • Verifiability: Checks whether claims are supported by citations, as in the sketch below.
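
A minimal verifiability sketch, assuming citations appear as bracketed markers like [1] in pred.answer (a real metric would also check what each cited source actually says):

import re

def verifiability(example, pred, trace=None):
    """Fraction of sentences carrying at least one [n]-style citation."""
    # Rough sentence split on end punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", pred.answer) if s.strip()]
    if not sentences:
        return False if trace is not None else 0.0

    cited = sum(1 for s in sentences if re.search(r"\[\d+\]", s))
    score = cited / len(sentences)

    # Strict filter during optimization, graded score during evaluation.
    if trace is not None:
        return score >= 0.8
    return score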