Chapter 4 · Section 3

Creating Datasets

Structure, load, and manage datasets using the DSPy Example class.

~30 min read

The Example Class

DSPy uses the Example class to represent individual data points for training and evaluation.

Basic Example Creation

import dspy

# Create a simple example
example = dspy.Example(
    question="What is the capital of France?",
    answer="Paris"
)

# Access fields
print(example.question)  # "What is the capital of France?"
print(example.answer)    # "Paris"

The with_inputs() Method

The with_inputs() method is critical: it tells DSPy which fields are inputs and which are expected outputs.

import dspy

# Create example and mark which fields are inputs
example = dspy.Example(
    question="What is the capital of France?",
    answer="Paris"
).with_inputs("question")

# Now DSPy knows:
# - "question" is an INPUT (given to the module)
# - "answer" is an OUTPUT (expected result for evaluation)

# Access input fields
print(example.inputs())  # an Example holding only the "question" field

💡 Tip: Always call with_inputs() immediately after creating an Example. Without it, DSPy optimizers and evaluators cannot tell which fields to pass to your program and which to treat as expected outputs.

Loading Datasets

From Python Dictionaries

raw_data = [
    {"q": "What is 2+2?", "a": "4"},
    {"q": "What is 3*3?", "a": "9"},
]

# Convert to DSPy Examples
dataset = [
    dspy.Example(question=item["q"], answer=item["a"]).with_inputs("question")
    for item in raw_data
]
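Raw records often use short or inconsistent keys, such as "q" and "a" above. The following is a minimal sketch of a helper that renames keys before the records are converted to Examples. The name rename_fields and the mapping are hypothetical illustrations, not part of DSPy.

```python
def rename_fields(records, mapping):
    """Return new dicts whose keys are renamed according to `mapping`.

    Keys that are not listed in `mapping` are kept unchanged.
    (Hypothetical helper for illustration; not a DSPy function.)
    """
    return [
        {mapping.get(key, key): value for key, value in record.items()}
        for record in records
    ]


raw_data = [
    {"q": "What is 2+2?", "a": "4"},
    {"q": "What is 3*3?", "a": "9"},
]

# Map the short source keys to the field names used in this section.
records = rename_fields(raw_data, {"q": "question", "a": "answer"})
print(records[0])
```

Each renamed record can then be unpacked into an Example, for instance dspy.Example(**record).with_inputs("question").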

From Hugging Face

from dspy.datasets import DataLoader

# Load from Hugging Face Hub
loader = DataLoader()
raw_data = loader.from_huggingface(
    dataset_name="squad",
    split="train",
    fields=("question", "context", "answers"),
    input_keys=("question", "context")
)

Train/Dev/Test Splits

Proper data splitting is essential for valid evaluation: examples used during optimization must not reappear in the set used for the final score.

Split       | Purpose                         | Usage
Training    | Optimize prompts/demonstrations | Used by optimizer
Development | Tune hyperparameters, iterate   | Used during development
Test        | Final unbiased evaluation       | Used once at the end

import random

# `data` is assumed to be a list of dspy.Example objects (at least 1,000 here)
# Shuffle in place with a fixed seed for reproducibility
random.Random(42).shuffle(data)

# Split into sets
trainset = data[:200]      # 200 for training
devset = data[200:500]     # 300 for development
testset = data[500:1000]   # 500 for testing
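
Plain slicing returns short or empty splits without complaint when the dataset is smaller than expected. The sketch below wraps the same shuffle-and-slice idea in a function that checks the size first and leaves the original list untouched. The function name split_dataset and its parameters are assumptions made for illustration, and integers stand in for dspy.Example objects.

```python
import random


def split_dataset(data, n_train, n_dev, n_test, seed=42):
    """Shuffle a copy of `data` and cut it into train/dev/test lists.

    Hypothetical helper for illustration; not part of DSPy.
    """
    needed = n_train + n_dev + n_test
    if len(data) < needed:
        raise ValueError(f"Need {needed} examples, got {len(data)}")

    shuffled = list(data)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible

    trainset = shuffled[:n_train]
    devset = shuffled[n_train:n_train + n_dev]
    testset = shuffled[n_train + n_dev:needed]
    return trainset, devset, testset


# Stand-in data: integers instead of dspy.Example objects.
data = list(range(1000))
trainset, devset, testset = split_dataset(data, 200, 300, 500)
print(len(trainset), len(devset), len(testset))  # 200 300 500
```

Because the three slices come from one shuffled copy, no item can appear in more than one split, provided the input itself contains no duplicates.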

Data Quality Checklist

✅ Check Required Fields

Ensure every example has the necessary input and output fields.

✅ Remove Duplicates

Remove duplicate examples so the same item cannot land in more than one split, which would leak data and bias the evaluation.

✅ Verify Inputs Marked

Double-check that with_inputs() has been called on every example.
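
The three checks above can be sketched as small helpers. The first two operate on raw dicts before conversion to Examples. The third is duck-typed and relies on the assumption that calling inputs() on an Example without with_inputs() raises ValueError; verify this against your DSPy version. All function names here are hypothetical, not DSPy APIs.

```python
def missing_fields(record, required):
    """Return the required field names that are absent or empty in `record`."""
    return [name for name in required if record.get(name) in (None, "")]


def drop_duplicates(records, key_fields):
    """Keep the first record for each distinct combination of `key_fields`."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(record.get(name) for name in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


def inputs_marked(example):
    """Return True if `example.inputs()` succeeds.

    Assumption: inputs() raises ValueError when with_inputs() was never called.
    """
    try:
        example.inputs()
    except ValueError:
        return False
    return True


records = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is 2+2?", "answer": "4"},   # exact duplicate
    {"question": "What is 3*3?", "answer": ""},    # empty answer
]

unique = drop_duplicates(records, key_fields=("question",))
incomplete = [r for r in unique if missing_fields(r, ("question", "answer"))]
print(len(unique), len(incomplete))  # 2 1
```

Deduplicating on the input fields only, as done here with "question", also catches records that repeat a question with a different answer; whether that is desirable depends on the dataset.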