Entity Extraction | Chapter 6 | DSPy: The Comprehensive Guide

Introduction

Entity extraction, also called Named Entity Recognition (NER), turns unstructured text into structured data by identifying spans such as names, dates, and amounts and assigning each a type. It underpins applications like resume parsing, contract analysis, and medical record processing. DSPy provides modules that can be composed into extraction pipelines, from a single prediction step to multi-stage systems with validation and disambiguation.

Understanding Entity Extraction

Common Entity Types

Person: John Smith, Dr. Sarah Johnson
Organization: Google, Microsoft, Stanford University
Location: New York, 123 Main Street
Date/Time: January 15, 2024, 3:30 PM
Money: $50,000, €1.2 million

Real-World Applications

Resume Processing: Extract skills, experience, and education.
Contract Analysis: Identify parties, dates, and clauses.
Medical Records: Extract diagnoses, medications, and procedures.

Building Entity Extractors

Basic Entity Extractor

A simple module to extract specified entity types:

Python

import dspy

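class BasicEntityExtractor(dspy.Module):
    """Extract entities of the requested types from a piece of text."""

    def __init__(self, entity_types):
        super().__init__()
        self.entity_types = entity_types
        # A string signature lists plain field names. The allowed entity
        # types are passed as an input value at call time, not embedded
        # in the signature string.
        self.extract = dspy.Predict("text, entity_types -> entities")

    def forward(self, text):
        result = self.extract(
            text=text,
            entity_types=", ".join(self.entity_types)
        )
        # result.entities is the raw string returned by the LM. Until a
        # parser is added, entities and raw_output hold the same value.
        return dspy.Prediction(
            entities=result.entities,
            raw_output=result.entities
        )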
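
The extractor above leaves the entities field as whatever string the model returned. A minimal sketch of the parsing step follows. It assumes the model answers either with a JSON array of objects or with one "TYPE: value" pair per line; both formats, the helper name parse_entities, and the {"type", "text"} dict shape are hypothetical choices for illustration, not something DSPy prescribes.

```python
import json


def parse_entities(raw_output):
    """Parse raw LM output into a list of {"type", "text"} dicts.

    Assumed (hypothetical) formats: a JSON array of objects with "type" and
    "text" keys, or one "TYPE: value" pair per line.
    """
    raw_output = raw_output.strip()

    # First try JSON, in case the model produced a structured array.
    try:
        data = json.loads(raw_output)
        if isinstance(data, list):
            return [
                {"type": str(item["type"]).upper(), "text": str(item["text"])}
                for item in data
                if isinstance(item, dict) and "type" in item and "text" in item
            ]
    except json.JSONDecodeError:
        pass

    # Fall back to line-oriented "TYPE: value" output.
    entities = []
    for line in raw_output.splitlines():
        # Split on the first colon only, so values like "3:30 PM" survive.
        label, sep, value = line.partition(":")
        label = label.strip(" -*\t")
        value = value.strip()
        if sep and label and value:
            entities.append({"type": label.upper(), "text": value})
    return entities


print(parse_entities("PERSON: John Smith\nORGANIZATION: Google"))
```

Lines that match neither format are skipped rather than raising, which keeps a single malformed line from discarding the rest of the output.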

Advanced Entity Extractor with Context

Handling context, validation, and disambiguation:

Python

class AdvancedEntityExtractor(dspy.Module):
    def __init__(self, entity_types, context_window=100):
        super().__init__()
        self.entity_types = entity_types
        # Number of trailing characters of document_context to keep
        self.context_window = context_window

        # Entity types are supplied as an input value at call time
        self.find_entities = dspy.ChainOfThought(
            "text, context, entity_types -> entities_with_positions"
        )
        self.validate_entities = dspy.Predict(
            "entity, text_context -> is_valid, corrected_entity, confidence"
        )
        self.disambiguate = dspy.Predict(
            "entity, context, possible_meanings -> disambiguated_entity, reasoning"
        )

    def forward(self, text, document_context=None):
        # Use the tail of the surrounding document, or the text itself
        if document_context:
            context = document_context[-self.context_window:]
        else:
            context = text

        # Find entities
        extraction = self.find_entities(
            text=text,
            context=context,
            entity_types=", ".join(self.entity_types)
        )

        # Recent DSPy releases name the chain-of-thought field `reasoning`;
        # older releases used `rationale`.
        reasoning = getattr(extraction, "reasoning", None) or getattr(
            extraction, "rationale", None
        )

        # Logic to parse, validate, and disambiguate entities would follow...
        # Returns structured entities list
        return dspy.Prediction(
            entities=extraction.entities_with_positions,
            extraction_reasoning=reasoning
        )
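
The forward method above stops at the raw entities_with_positions output. One way the elided validation step could begin is to check that each extracted string actually occurs in the source text and to recompute its character offsets, since positions reported by a language model can be unreliable. The helper below is a hypothetical sketch: locate_entities and the {"type", "text"} dict shape are assumptions for illustration, not part of DSPy.

```python
def locate_entities(text, entities):
    """Attach character offsets to entities and drop those not found in text.

    `entities` is assumed to be a list of {"type", "text"} dicts.
    """
    located = []
    # Next search offset per surface string, so that repeated mentions
    # map to successive occurrences instead of the same one.
    search_from = {}
    for entity in entities:
        surface = entity["text"]
        start = text.find(surface, search_from.get(surface, 0))
        if start == -1:
            # Not present verbatim: possibly hallucinated or paraphrased.
            continue
        end = start + len(surface)
        search_from[surface] = end
        located.append({**entity, "start": start, "end": end})
    return located


sample = "Dr. Sarah Johnson joined Google. Google is based in California."
found = locate_entities(
    sample,
    [
        {"type": "PERSON", "text": "Sarah Johnson"},
        {"type": "ORGANIZATION", "text": "Google"},
        {"type": "ORGANIZATION", "text": "Google"},
        {"type": "ORGANIZATION", "text": "Microsoft"},  # not in the text
    ],
)
for item in found:
    print(item["type"], item["text"], item["start"], item["end"])
```

Exact string matching is deliberately strict. Entities that the model normalized (for example, by expanding an abbreviation) would be dropped here and would need a fuzzier comparison or the validate_entities predictor.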

Specialized Applications

Resume/CV Parser

Extracting structured data from resumes:

Python

class ResumeParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.contact_info = dspy.Predict(
            "resume_text -> name, email, phone, location, linkedin"
        )
        self.extract_sections = dspy.Predict(
            "resume_text -> work_experience, education, skills, certifications"
        )
        self.parse_experience = dspy.ChainOfThought(
            "experience_section -> detailed_experiences"
        )

    def forward(self, resume_text):
        contact = self.contact_info(resume_text=resume_text)
        sections = self.extract_sections(resume_text=resume_text)
        
        # Detailed parsing logic...

        return dspy.Prediction(
            contact_info=contact,
            sections=sections
        )
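
The contact fields returned by the parser above are free text from the model, so a light rule-based check can catch obviously malformed values before they reach downstream systems. The sketch below is an illustration under stated assumptions: validate_contact is a hypothetical helper, the contact fields are assumed to have been copied into a plain dict, the email pattern is intentionally simple rather than standards-complete, and the 7-digit lower bound for phone numbers is a heuristic (15 digits is the E.164 maximum).

```python
import re

# Simple email shape check: local part, "@", domain, dot, TLD of 2+ letters.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def validate_contact(contact):
    """Return a dict mapping each checked field to True or False.

    `contact` is assumed to be a dict with optional "email" and "phone" keys.
    """
    email = (contact.get("email") or "").strip()
    # Keep only digits so "+1 (555) 010-2030" and "15550102030" compare equal.
    phone_digits = re.sub(r"\D", "", contact.get("phone") or "")
    return {
        "email": bool(EMAIL_RE.match(email)),
        "phone": 7 <= len(phone_digits) <= 15,
    }


print(validate_contact({"email": "jane@example.com", "phone": "+1 (555) 010-2030"}))
print(validate_contact({"email": "jane@", "phone": "n/a"}))
```

Fields that fail the check could be set to None, flagged for review, or sent back through a corrective prompt, depending on how much noise the application tolerates.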

Best Practices

Ambiguity: Use surrounding context to disambiguate entities (e.g., "Apple" the fruit vs. Apple the company).
Validation: Use regex or rules to validate types with a predictable format, such as emails or dates.
Nested Entities: Decide up front how to handle overlapping entities (e.g., the organization "University of California" contains the location "California").
Optimization: Use BootstrapFewShot with a custom F1 metric to improve extraction performance.
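
The custom F1 metric mentioned in the last point can be sketched as follows. This is a minimal illustration, not a prescribed implementation: it assumes both the gold example and the prediction expose an entities attribute holding a list of {"type", "text"} dicts (a hypothetical shape), and it scores exact matches on type plus case-insensitive text. The (example, pred, trace=None) argument order follows the convention DSPy optimizers use when calling a metric.

```python
from types import SimpleNamespace


def entity_f1(example, pred, trace=None):
    """Entity-level F1 over (type, lowercased text) pairs.

    Assumes `example.entities` and `pred.entities` are lists of
    {"type", "text"} dicts (a hypothetical shape).
    """
    gold = {(e["type"], e["text"].lower()) for e in example.entities}
    predicted = {(e["type"], e["text"].lower()) for e in pred.entities}

    # Nothing to find and nothing predicted counts as a perfect score.
    if not gold and not predicted:
        return 1.0

    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Stand-ins for a gold example and a model prediction.
gold_example = SimpleNamespace(entities=[
    {"type": "PERSON", "text": "John Smith"},
    {"type": "ORGANIZATION", "text": "Google"},
])
prediction = SimpleNamespace(entities=[
    {"type": "PERSON", "text": "john smith"},
    {"type": "ORGANIZATION", "text": "Microsoft"},
])
print(entity_f1(gold_example, prediction))
```

A function of this shape could be supplied as the metric argument when constructing BootstrapFewShot. Because it compares sets, duplicate mentions of the same entity count once; a span-based variant would be needed if positions matter.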
