Exercise 1: Creating a Quality Dataset
Difficulty: ⭐⭐ Intermediate
Objective
Create a well-structured evaluation dataset for a sentiment analysis task.
Requirements
- Create a dataset of at least 30 examples with fields:
text, sentiment (pos/neg/neu), and confidence.
- Ensure a balanced distribution: 10+ positive, 10+ negative, 5+ neutral.
- Include at least 5 edge cases (sarcasm, mixed sentiment, short text).
- Properly split data into train (60%), dev (20%), and test (20%).
Starter Code
import dspy
import random

def create_sentiment_dataset():
    """Returns: Tuple of (trainset, devset, testset)"""
    examples = []
    # TODO: Add examples (positive, negative, neutral, edge cases)
    # Hint: Use dspy.Example(...).with_inputs("text")
    # TODO: Shuffle with fixed seed
    # TODO: Split data
    return trainset, devset, testset
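A minimal sketch of the shuffle-and-split mechanics, using plain dicts as stand-ins for dspy.Example objects (with DSPy you would wrap each item as dspy.Example(text=..., sentiment=...).with_inputs("text")); the 60/20/20 fractions and seed value mirror the requirements above:

```python
import random

def split_dataset(examples, train_frac=0.6, dev_frac=0.2, seed=42):
    """Shuffle with a fixed seed, then slice into train/dev/test."""
    data = list(examples)
    random.Random(seed).shuffle(data)  # fixed seed -> reproducible splits
    n = len(data)
    n_train = int(n * train_frac)
    n_dev = int(n * dev_frac)
    trainset = data[:n_train]
    devset = data[n_train:n_train + n_dev]
    testset = data[n_train + n_dev:]  # remainder (~20%)
    return trainset, devset, testset

# Toy stand-ins for dspy.Example objects:
examples = [{"text": f"sample {i}", "sentiment": "pos"} for i in range(30)]
train, dev, test = split_dataset(examples)
```

With 30 examples this yields an 18/6/6 split; seeding a local random.Random instance (rather than the global random state) keeps the split reproducible without side effects elsewhere.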
Exercise 2: Designing a Custom Metric
Difficulty: ⭐⭐ Intermediate
Objective
Design a comprehensive metric for evaluating a question-answering system.
Requirements
Create a metric that combines:
- Correctness (40%): Does the answer contain the expected info?
- Completeness (30%): Are all key points addressed?
- Conciseness (20%): Is the answer brief (10-100 words)?
- Format (10%): No repeated words or odd punctuation?
The metric must handle the trace parameter correctly (returning a stricter boolean pass/fail during optimization).
Starter Code
def qa_quality_metric(example, pred, trace=None):
    # TODO: Implement sub-scores
    correctness = ...
    completeness = ...
    conciseness = ...
    format_score = ...
    final_score = (0.4 * correctness +
                   0.3 * completeness +
                   0.2 * conciseness +
                   0.1 * format_score)
    if trace is not None:
        return final_score >= 0.7  # Stricter for optimization
    return final_score
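One way the sub-scores might be filled in, using simple lexical heuristics: token overlap for correctness, key-point coverage for completeness, the 10-100 word window for conciseness, and an immediate-repeat check for format. The dict-style field access, the key_points field, and the 0.5 conciseness penalty are illustrative assumptions, not the reference solution (dspy predictions expose fields as attributes instead):

```python
def qa_quality_metric(example, pred, trace=None):
    # Assumes dict-like access; with dspy use pred.answer / example.answer.
    answer = pred["answer"]
    expected = example["answer"]
    key_points = example.get("key_points", [])

    ans_tokens = answer.lower().split()
    exp_tokens = set(expected.lower().split())

    # Correctness: fraction of expected tokens present in the answer.
    correctness = sum(t in ans_tokens for t in exp_tokens) / max(len(exp_tokens), 1)
    # Completeness: fraction of key points mentioned (fall back to correctness).
    completeness = (sum(kp.lower() in answer.lower() for kp in key_points)
                    / len(key_points)) if key_points else correctness
    # Conciseness: full credit inside the 10-100 word window, partial outside.
    conciseness = 1.0 if 10 <= len(ans_tokens) <= 100 else 0.5
    # Format: penalize immediately repeated words ("the the").
    repeated = any(a == b for a, b in zip(ans_tokens, ans_tokens[1:]))
    format_score = 0.0 if repeated else 1.0

    final_score = (0.4 * correctness + 0.3 * completeness +
                   0.2 * conciseness + 0.1 * format_score)
    if trace is not None:
        return final_score >= 0.7  # boolean pass/fail during optimization
    return final_score
```

In production you would likely replace the overlap heuristics with an LLM-as-judge or semantic-similarity check, but the weighting and trace-handling skeleton stays the same.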
Exercise 3: Systematic Evaluation
Difficulty: ⭐⭐⭐ Intermediate-Advanced
Objective
Build a function that runs evaluation and provides a detailed report, including error analysis.
Requirements
- Function takes a module, dataset, and metric.
- Runs evaluation and captures detailed results.
- Categorizes errors (e.g., empty response, wrong answer).
- Identifies the best and worst performing examples.
- Returns a dictionary with aggregate score and analysis.
Starter Code
def comprehensive_evaluation(module, devset, metric):
    results = {
        'score': 0,
        'errors': [],
        'best_examples': [],
        'worst_examples': []
    }
    # TODO: Iterate, predict, score, and analyze
    return results
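A possible shape for the evaluation loop, with duck-typed module and metric callables. The error categories ('exception', 'empty response', 'wrong answer'), the 0.5 wrong-answer cutoff, and the top-3 best/worst window are illustrative choices, not fixed by the exercise:

```python
def comprehensive_evaluation(module, devset, metric):
    results = {'score': 0, 'errors': [],
               'best_examples': [], 'worst_examples': []}
    scored = []
    for example in devset:
        try:
            pred = module(example)
        except Exception as exc:
            # Crashes count as score 0 and get their own error category.
            results['errors'].append({'example': example,
                                      'category': 'exception',
                                      'detail': str(exc)})
            scored.append((0.0, example))
            continue
        score = metric(example, pred)
        # Categorize failures (dict-like preds assumed; dspy uses pred.answer).
        if not pred.get('answer'):
            results['errors'].append({'example': example,
                                      'category': 'empty response'})
        elif score < 0.5:
            results['errors'].append({'example': example,
                                      'category': 'wrong answer'})
        scored.append((score, example))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    results['score'] = sum(s for s, _ in scored) / max(len(scored), 1)
    results['best_examples'] = [ex for _, ex in scored[:3]]
    results['worst_examples'] = [ex for _, ex in scored[-3:]]
    return results
```

Keeping per-example (score, example) pairs around, rather than only the aggregate, is what makes the best/worst ranking and error-category breakdown possible from a single pass.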