Chapter 6

Extreme Multi-Label Classification

Scaling classification systems to handle millions of potential labels using efficient indexing, clustering, and DSPy.

Introduction to XML

Extreme Multi-Label Classification (XML) involves tagging data with labels from a massive set—potentially millions (e.g., Wikipedia tags, Amazon product categories). DSPy, combined with efficient retrieval strategies, makes this tractable.

Core Challenges

  • Scale: O(|Labels|) prediction is impossible.
  • Sparsity: Most labels appear rarely (long-tail distribution).
  • Ambiguity: With millions of labels, many will be semantically similar.

Architecture

1. Label Embeddings & Indexing

Instead of a classifier outputting 1M scores, we embed labels into a vector space and use Approximate Nearest Neighbor (ANN) search (like FAISS) to retrieve candidates.

2. Candidate Selection

We first narrow down the millions of labels to a manageable candidate set (e.g., top 50).

candidates = label_index.search(query_embedding, k=50)

3. DSPy Reranking/Classification

DSPy then acts as a sophisticated reranker or final classifier, looking only at the candidates.

class DSPyXMLClassifier(dspy.Module):
    def __init__(self):
        self.candidate_selector = dspy.ChainOfThought("text, candidates -> selected_labels")
    
    def forward(self, text, candidates):
        return self.candidate_selector(text=text, candidates=candidates)

Hierarchical Strategies

Organizing labels into trees (e.g., Cat -> Pet -> Animal) allows you to classify level-by-level, drastically reducing the search space at each step.