Chapter 5

KNNFewShot Optimization

Dynamically select the most relevant demonstrations for each query using K-Nearest Neighbors.

Introduction

KNNFewShot is a dynamic DSPy optimizer that uses the K-Nearest Neighbors algorithm to select the most relevant examples for each query. Unlike BootstrapFewShot, which selects a static set of examples that works well on average, KNNFewShot tailors the demonstrations to each individual input.

Core Concept

  1. Embed: Convert all training examples to vector embeddings.
  2. Query: Embed the incoming user query.
  3. Search: Find the K most similar training examples in the vector space.
  4. Select: Use those specific examples as the few-shot demonstrations for that query.

Key Advantage: KNNFewShot is highly effective when your task is too broad to be covered by a single fixed set of 5-10 examples (e.g., open-domain QA, broad classification).
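To make the four steps concrete, here is a rough sketch of the retrieval logic in plain NumPy. This is illustrative rather than DSPy's internal code; encode stands in for any text-embedding function:

import numpy as np

def knn_demos(query, trainset, encode, k=3):
    # Steps 1-2: embed the indexed examples and the incoming query
    # (in practice the example embeddings are computed once and cached)
    index = np.stack([encode(ex.question) for ex in trainset])
    q = encode(query)
    # Step 3: rank by similarity (dot product; cosine if vectors are normalized)
    scores = index @ q
    top_k = np.argsort(-scores)[:k]
    # Step 4: return the K nearest examples to use as demonstrations
    return [trainset[i] for i in top_k]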

💻 Basic Usage

import dspy
from dspy.teleprompt import KNNFewShot
from sentence_transformers import SentenceTransformer

# 1. Define Program & Data
class QAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.Predict("question -> answer")
    
    def forward(self, question):
        return self.generate(question=question)

trainset = [
    dspy.Example(question="Capital of France?", answer="Paris", topic="geography").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare", topic="literature").with_inputs("question"),
    # ... Assume 100+ diverse examples
]

# 2. Configure Optimizer
# 'k' is the number of neighbors to retrieve; the trainset and an
# embedding function are supplied up front so the examples can be indexed
optimizer = KNNFewShot(
    k=3,
    trainset=trainset,
    vectorizer=dspy.Embedder(SentenceTransformer("all-MiniLM-L6-v2").encode),
)

# 3. Compile
# KNNFewShot doesn't "train" in the traditional sense;
# it indexes your training set.
compiled_qa = optimizer.compile(QAModule())

# 4. Infer
# For "Capital of Germany?", it will likely retrieve the "Capital of France" example
result = compiled_qa(question="What is the capital of Germany?")
print(result.answer)
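Note that compilation only builds the index; the real work happens at inference time. Each call embeds the query, retrieves its K nearest neighbors, and runs a BootstrapFewShot round over just those examples before answering, so expect higher per-query latency than with a statically compiled program.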

⚙️ Advanced Configuration

Custom Similarity Metrics

Similarity is determined entirely by the vectorizer you supply: neighbors are ranked by a dot product over the embeddings it produces. KNNFewShot does not take a standalone similarity function, so to change what "similar" means, swap in a different embedding model, e.g. a local SentenceTransformer:

from sentence_transformers import SentenceTransformer
import dspy

# Local encoder; its embedding space defines the similarity metric
encoder = SentenceTransformer('all-MiniLM-L6-v2')

optimizer = KNNFewShot(
    k=5,
    trainset=trainset,
    vectorizer=dspy.Embedder(encoder.encode),
)

Parameters

Parameter                  Type            Description
k                          int             Number of neighbors to retrieve for each query
trainset                   list[Example]   Examples to embed and index for retrieval
vectorizer                 dspy.Embedder   Embedding function applied to the examples and each query
**few_shot_bootstrap_args  keyword args    Forwarded to the BootstrapFewShot pass run over the neighbors
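The pass-through arguments let you tune the bootstrap that runs over each query's neighbors. A sketch, with values picked purely for illustration:

optimizer = KNNFewShot(
    k=3,
    trainset=trainset,
    vectorizer=dspy.Embedder(encoder.encode),
    # Forwarded to the per-query BootstrapFewShot pass:
    max_bootstrapped_demos=2,
    max_labeled_demos=4,
)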

✨ Best Practices & Pitfalls

1. Choose the Right 'k'

Too few neighbors (k=1) can anchor the model to a single similar example; too many (k=20) can dilute the context or exceed the model's context window. A range of 3-7 is usually a good starting point.
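One practical way to choose is to sweep a few values of k and score each compiled program on a held-out dev set. A minimal sketch, assuming a devset and a simple exact-match metric (both stand-ins for your own):

from dspy.evaluate import Evaluate

# Hypothetical exact-match metric over the answer field
def answer_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

evaluate = Evaluate(devset=devset, metric=answer_match, display_progress=False)

for k in (1, 3, 5, 7):
    candidate = KNNFewShot(
        k=k,
        trainset=trainset,
        vectorizer=dspy.Embedder(encoder.encode),
    ).compile(QAModule())
    print(f"k={k}: {evaluate(candidate)}")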

2. Data Hygiene

Your retrieval is only as good as the data you index. Ensure your trainset is:

  • Clean: No typos or bad formatting.
  • Diverse: Covers the full search space.
  • Normalized: Consistent capitalization and spacing improve similarity matching (see the sketch after this list).
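A minimal normalization pass before indexing might just trim and collapse whitespace; raw_trainset is a hypothetical name for the unprocessed examples:

import re

def normalize(text: str) -> str:
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

trainset = [
    dspy.Example(question=normalize(ex.question), answer=ex.answer).with_inputs("question")
    for ex in raw_trainset
]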

3. Avoid Self-Selection

If the indexed training set contains the exact query you are evaluating on, KNN will simply retrieve that example and its answer, inflating your scores. Keep your validation/test sets strictly disjoint from the indexed training set.
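A cheap safeguard is to drop any training example whose question appears verbatim in the evaluation data; devset is again a stand-in name:

# Questions reserved for evaluation
heldout = {ex.question for ex in devset}

# Index only examples that don't leak into the eval set
clean_trainset = [ex for ex in trainset if ex.question not in heldout]

optimizer = KNNFewShot(k=3, trainset=clean_trainset,
                       vectorizer=dspy.Embedder(encoder.encode))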