Chapter 4 · Section 7

Exercises

Put your evaluation skills to the test with these hands-on coding challenges.

Estimated time: ~2-3 hours · Hands-on

Exercise 1: Creating a Quality Dataset

Difficulty: ⭐⭐ Intermediate

Objective

Create a well-structured evaluation dataset for a sentiment analysis task.

Requirements

  • Create a dataset of at least 30 examples with fields: text, sentiment (pos/neg/neu), and confidence.
  • Ensure a balanced label distribution: at least 10 positive, 10 negative, and 5 neutral examples.
  • Include at least 5 edge cases (sarcasm, mixed sentiment, short text).
  • Shuffle with a fixed seed, then split the data into train (60%), dev (20%), and test (20%) sets.

Starter Code

import dspy
import random

def create_sentiment_dataset():
    """Returns: Tuple of (trainset, devset, testset)"""
    examples = []

    # TODO: Add examples (positive, negative, neutral, edge cases)
    # Hint: Use dspy.Example(...).with_inputs("text")
    
    # TODO: Shuffle with fixed seed
    
    # TODO: Split data
    
    return trainset, devset, testset
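For reference, here is one possible solution sketch. The example texts below are illustrative placeholders, abbreviated to a few per class; a full solution would include at least 30 examples with the required class balance and edge cases.

import dspy
import random

def create_sentiment_dataset():
    """Returns: Tuple of (trainset, devset, testset)"""
    # Raw (text, sentiment, confidence) triples. Abbreviated here;
    # a full solution needs 30+ examples with the required balance.
    raw = [
        # Positive (10+ in a full solution)
        ("I absolutely loved this product!", "pos", 0.95),
        ("Best purchase I've made all year.", "pos", 0.90),
        # Negative (10+ in a full solution)
        ("Terrible quality, it broke within a day.", "neg", 0.95),
        ("I want my money back.", "neg", 0.85),
        # Neutral (5+ in a full solution)
        ("The package arrived on Tuesday.", "neu", 0.80),
        # Edge cases (5+ in a full solution): sarcasm, mixed sentiment, short text
        ("Oh great, another update that breaks everything.", "neg", 0.60),
        ("The camera is superb but the battery life is awful.", "neu", 0.50),
        ("Meh.", "neu", 0.70),
    ]

    examples = [
        dspy.Example(text=text, sentiment=sentiment, confidence=confidence)
        .with_inputs("text")
        for text, sentiment, confidence in raw
    ]

    # Shuffle with a fixed seed so the split is reproducible
    random.Random(42).shuffle(examples)

    # 60/20/20 split
    n = len(examples)
    trainset = examples[: int(0.6 * n)]
    devset = examples[int(0.6 * n) : int(0.8 * n)]
    testset = examples[int(0.8 * n) :]

    return trainset, devset, testset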

Exercise 2: Designing a Custom Metric

Difficulty: ⭐⭐ Intermediate

Objective

Design a comprehensive metric for evaluating a question-answering system.

Requirements

Create a metric that combines:

  • Correctness (40%): Does the answer contain the expected info?
  • Completeness (30%): Are all key points addressed?
  • Conciseness (20%): Is the answer brief (10-100 words)?
  • Format (10%): No repeated words or odd punctuation?

The metric must handle the trace parameter correctly: when trace is not None (i.e., during optimization), return a strict boolean pass/fail rather than a float score.

Starter Code

def qa_quality_metric(example, pred, trace=None):
    # TODO: Implement sub-scores
    correctness = ...
    completeness = ...
    conciseness = ...
    format_score = ...
    
    final_score = (0.4 * correctness + 
                   0.3 * completeness + 
                   0.2 * conciseness + 
                   0.1 * format_score)

    if trace is not None:
        return final_score >= 0.7  # Stricter for optimization
        
    return final_score
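For reference, here is one possible implementation sketch. It assumes the reference answer lives in example.answer and the prediction in pred.answer; adapt the field names to your own signature, and treat the keyword-matching heuristics as a starting point rather than a definitive rubric.

import re

def qa_quality_metric(example, pred, trace=None):
    expected = example.answer.lower().strip()
    answer = (pred.answer or "").lower().strip()
    words = answer.split()

    # Correctness (40%): does the answer contain the expected info?
    correctness = 1.0 if expected in answer else 0.0

    # Completeness (30%): fraction of expected key terms present in the answer
    key_terms = set(expected.split())
    completeness = (
        sum(term in answer for term in key_terms) / len(key_terms)
        if key_terms else 0.0
    )

    # Conciseness (20%): full credit for 10-100 words, none otherwise
    conciseness = 1.0 if 10 <= len(words) <= 100 else 0.0

    # Format (10%): penalize immediate word repetition and doubled punctuation
    repeated = any(a == b for a, b in zip(words, words[1:]))
    odd_punct = bool(re.search(r"[!?.,]{2,}", answer))
    format_score = 0.0 if (repeated or odd_punct) else 1.0

    final_score = (0.4 * correctness +
                   0.3 * completeness +
                   0.2 * conciseness +
                   0.1 * format_score)

    if trace is not None:
        return final_score >= 0.7  # Stricter boolean for optimization

    return final_score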

Exercise 3: Systematic Evaluation

Difficulty: ⭐⭐⭐ Intermediate-Advanced

Objective

Build a function that runs evaluation and provides a detailed report, including error analysis.

Requirements

  • Function takes a module, dataset, and metric.
  • Runs evaluation and captures detailed results.
  • Categorizes errors (e.g., empty response, wrong answer).
  • Identifies the best and worst performing examples.
  • Returns a dictionary with aggregate score and analysis.

Starter Code

def comprehensive_evaluation(module, devset, metric):
    results = {
        'score': 0,
        'errors': [],
        'best_examples': [],
        'worst_examples': []
    }
    
    # TODO: Iterate, predict, score, and analyze
    
    return results
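For reference, here is one possible implementation sketch. It assumes the metric returns a float in [0, 1], that predictions expose an answer field, and that examples provide their input fields via example.inputs() (as dspy.Example does); the error categories are illustrative.

def comprehensive_evaluation(module, devset, metric):
    results = {
        'score': 0,
        'errors': [],
        'best_examples': [],
        'worst_examples': []
    }
    scored = []

    for example in devset:
        # Run the module on the example's input fields
        try:
            pred = module(**example.inputs())
        except Exception as exc:
            results['errors'].append(
                {'example': example, 'category': 'exception', 'detail': str(exc)})
            scored.append((example, None, 0.0))
            continue

        score = float(metric(example, pred))
        scored.append((example, pred, score))

        # Categorize failures (categories are illustrative)
        answer = (getattr(pred, 'answer', '') or '').strip()
        if not answer:
            results['errors'].append({'example': example, 'category': 'empty_response'})
        elif score == 0:
            results['errors'].append({'example': example, 'category': 'wrong_answer'})

    # Aggregate score and best/worst performers
    scored.sort(key=lambda item: item[2], reverse=True)
    results['score'] = sum(s for _, _, s in scored) / len(scored) if scored else 0
    results['best_examples'] = scored[:3]
    results['worst_examples'] = scored[-3:]

    return results

A typical call would be comprehensive_evaluation(my_module, devset, qa_quality_metric), reusing the metric from Exercise 2.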