Introduction
KNNFewShot is a dynamic DSPy optimizer that uses the K-Nearest Neighbors algorithm to select the most relevant examples for each query. Unlike BootstrapFewShot, which selects a static set of examples that works well on average, KNNFewShot tailors the few-shot context to each individual input.
Core Concept
- Embed: Convert all training examples to vector embeddings.
- Query: Embed the incoming user query.
- Search: Find the K most similar training examples in the vector space.
- Select: Use those specific examples as the few-shot demonstrations for that query.
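To make the mechanics concrete, here is a rough sketch of the selection step (illustrative only, not DSPy's internal code), assuming a generic `embed` function that maps text to a vector and examples with a `question` field:

```python
import numpy as np

def select_demos(query, trainset, embed, k=3):
    """Illustrative KNN selection: rank training examples by cosine similarity."""
    # In practice the trainset vectors are computed once and cached as an index.
    train_vecs = np.array([embed(ex.question) for ex in trainset])
    query_vec = np.asarray(embed(query))

    # Cosine similarity between the query and every indexed example.
    sims = train_vecs @ query_vec / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )

    # The k most similar examples become that query's few-shot demonstrations.
    return [trainset[i] for i in np.argsort(sims)[::-1][:k]]
```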
Key Advantage: KNNFewShot shines when your task is too broad to be covered by a single fixed set of 5-10 examples (e.g., open-domain QA or broad classification).
💻 Basic Usage
```python
import dspy
from dspy.teleprompt import KNNFewShot

# Configure your LM first, e.g.:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# 1. Define Program & Data
class QAModule(dspy.Module):
    def __init__(self):
        super().__init__()
        self.generate = dspy.Predict("question -> answer")

    def forward(self, question):
        return self.generate(question=question)

trainset = [
    dspy.Example(question="Capital of France?", answer="Paris", topic="geography").with_inputs("question"),
    dspy.Example(question="Who wrote Hamlet?", answer="Shakespeare", topic="literature").with_inputs("question"),
    # ... Assume 100+ diverse examples
]

# 2. Configure Optimizer
# 'k' is the number of neighbors to retrieve. Recent DSPy versions take the
# trainset here (and may also require a vectorizer; see Advanced Configuration).
optimizer = KNNFewShot(k=3, trainset=trainset)

# 3. Compile
# KNNFewShot doesn't "train" in the traditional sense;
# it indexes your training set so neighbors can be retrieved per query.
compiled_qa = optimizer.compile(QAModule())

# 4. Infer
# For "What is the capital of Germany?", it will likely retrieve the
# "Capital of France?" example as a demonstration.
result = compiled_qa(question="What is the capital of Germany?")
print(result.answer)
```
⚙️ Advanced Configuration
Custom Similarity Metrics
KNNFewShot ranks neighbors by a dot product over the embeddings its vectorizer produces, so the vectorizer is what you swap to change the similarity space. In recent DSPy releases you can wrap any encoding callable (a local sentence-transformers model, for example) in `dspy.Embedder`:

```python
from sentence_transformers import SentenceTransformer

import dspy
from dspy.teleprompt import KNNFewShot

# Custom encoder: neighbors are scored by the dot product of these vectors,
# so the encoder fully determines what counts as "similar".
encoder = SentenceTransformer("all-MiniLM-L6-v2")

optimizer = KNNFewShot(
    k=5,
    trainset=trainset,  # the same indexed examples as above
    vectorizer=dspy.Embedder(encoder.encode),
)
```
Parameters
| Parameter | Type | Description |
|---|---|---|
| `k` | `int` | Number of neighbors to retrieve per query |
| `trainset` | `list[dspy.Example]` | Examples to index for retrieval |
| `vectorizer` | `Callable` / `dspy.Embedder` | Turns examples into vectors; can wrap a local encoder or an API embedding model (e.g., "text-embedding-ada-002") |
✨ Best Practices & Pitfalls
1. Choose the Right 'k'
Too few neighbors (k=1) risks overfitting to a single similar example. Too many (k=20) can dilute the context or exceed token limits. A range of 3-7 is usually a good starting point.
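Because compilation is just indexing, sweeping a few candidate values of k is cheap. A rough sketch using DSPy's evaluator (the `devset` and exact-match metric here are illustrative placeholders):

```python
from dspy.evaluate import Evaluate

# Hypothetical held-out devset; must be disjoint from the indexed trainset.
evaluator = Evaluate(
    devset=devset,
    metric=lambda ex, pred, trace=None: ex.answer == pred.answer,
)

for k in (1, 3, 5, 7):
    candidate = KNNFewShot(k=k, trainset=trainset).compile(QAModule())
    print(f"k={k}:", evaluator(candidate))
```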
2. Data Hygiene
Your retrieval is only as good as your index. Before compiling, make sure your trainset is:
- Clean: No typos or bad formatting.
- Diverse: Covers the full search space.
- Normalized: Consistent capitalization and spacing improve similarity matching (a minimal cleanup pass is sketched after this list).
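For example, a lightweight normalization pass before indexing (a sketch; `raw_trainset` is a hypothetical name for your uncleaned examples, and the rules should be adapted to your data):

```python
import re

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so near-duplicate phrasings embed alike."""
    return re.sub(r"\s+", " ", text.strip()).lower()

trainset = [
    dspy.Example(question=normalize(ex.question), answer=ex.answer).with_inputs("question")
    for ex in raw_trainset  # raw_trainset: your unnormalized examples (illustrative)
]
```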
3. Avoid Self-Selection
If your training set contains the exact query you are evaluating on, KNN may simply retrieve the answer and inflate your scores. Keep your validation/test sets disjoint from the indexed training set; a quick overlap check is sketched below.
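For instance (a sketch, assuming the question text uniquely identifies an example and a hypothetical `devset`):

```python
train_questions = {ex.question for ex in trainset}
leaked = [ex for ex in devset if ex.question in train_questions]

if leaked:
    print(f"Warning: {len(leaked)} eval examples also appear in the trainset.")
    devset = [ex for ex in devset if ex.question not in train_questions]
```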