Chapter 6 · Section 5

Entity Extraction

Mine structured information from unstructured text with precision.


Introduction

Entity extraction, often called Named Entity Recognition (NER), turns unstructured text into structured data by finding spans such as names, organizations, and dates and assigning each a type. It underpins applications such as resume parsing, contract analysis, and medical record processing. DSPy provides signatures and modules for building extraction pipelines that can be composed, validated, and optimized.

Understanding Entity Extraction

Common Entity Types

  • Person: John Smith, Dr. Sarah Johnson
  • Organization: Google, Microsoft, Stanford University
  • Location: New York, 123 Main Street
  • Date/Time: January 15, 2024, 3:30 PM
  • Money: $50,000, €1.2 million
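One way to make these types concrete is a small record that stores each entity's text, label, and character offsets. The sketch below is illustrative only: the `Entity` class and the `ENTITY_TYPES` label names are assumptions introduced here, not part of DSPy.

```python
from dataclasses import dataclass

# Hypothetical label set mirroring the list above; real projects define their own.
ENTITY_TYPES = ("PERSON", "ORGANIZATION", "LOCATION", "DATE_TIME", "MONEY")


@dataclass(frozen=True)
class Entity:
    text: str   # surface form as it appears in the source text
    label: str  # one of ENTITY_TYPES
    start: int  # character offset of the first character
    end: int    # character offset one past the last character

    def __post_init__(self):
        if self.label not in ENTITY_TYPES:
            raise ValueError(f"unknown entity type: {self.label}")
        if not 0 <= self.start < self.end:
            raise ValueError("start must be non-negative and less than end")


sentence = "Dr. Sarah Johnson joined Google on January 15, 2024."
person = Entity("Dr. Sarah Johnson", "PERSON", 0, 17)
```

Storing offsets alongside the text lets downstream code check that `sentence[person.start:person.end]` reproduces the extracted string.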

Real-World Applications

  • Resume Processing: Extract skills, experience, and education.
  • Contract Analysis: Identify parties, dates, and clauses.
  • Medical Records: Extract diagnoses, medications, and procedures.

Building Entity Extractors

Basic Entity Extractor

A simple module to extract specified entity types:

Python
import dspy

class BasicEntityExtractor(dspy.Module):
    def __init__(self, entity_types):
        super().__init__()
        self.entity_types = entity_types
        # DSPy string signatures are comma-separated field names; the entity
        # types are passed as an input value at call time, not in the signature.
        self.extract = dspy.Predict("text, entity_types -> entities")

    def forward(self, text):
        result = self.extract(
            text=text,
            entity_types=", ".join(self.entity_types)
        )
        # result.entities is the LM's raw string output; parse it into a
        # structured list before using it downstream.
        return dspy.Prediction(
            entities=result.entities,
            raw_output=result.entities
        )
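The extractor above returns whatever string the language model produced, so a parsing step is usually needed. The following is a minimal sketch that assumes the model answers with one entity per line in a `TYPE: text` format. That output format and the `parse_entities` helper are assumptions made for illustration; DSPy does not guarantee this layout.

```python
def parse_entities(raw, allowed_types):
    """Parse lines like 'PERSON: John Smith' into (label, text) pairs.

    Lines without a colon, with an unknown label, or with empty text are skipped.
    """
    allowed = {t.upper() for t in allowed_types}
    entities = []
    for line in raw.splitlines():
        # Drop surrounding whitespace and any leading bullet characters.
        line = line.strip().lstrip("-•* ").strip()
        if ":" not in line:
            continue
        # Split on the first colon only, so values like "3:30 PM" stay intact.
        label, _, value = line.partition(":")
        label, value = label.strip().upper(), value.strip()
        if label in allowed and value:
            entities.append((label, value))
    return entities
```

Filtering on `allowed_types` drops labels that were never requested, which keeps unexpected model output from leaking into downstream code.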

Advanced Entity Extractor with Context

A module that adds document context and sets up predictors for validation and disambiguation:

Python
class AdvancedEntityExtractor(dspy.Module):
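    def __init__(self, entity_types, context_window=100):
        super().__init__()
        self.entity_types = entity_types
        # Number of trailing characters of document context to pass along.
        self.context_window = context_window

        # Entity types are supplied as an input value at call time; DSPy string
        # signatures only list field names.
        self.find_entities = dspy.ChainOfThought(
            "text, context, entity_types -> entities_with_positions"
        )
        self.validate_entities = dspy.Predict(
            "entity, text_context -> is_valid, corrected_entity, confidence"
        )
        self.disambiguate = dspy.Predict(
            "entity, context, possible_meanings -> disambiguated_entity, reasoning"
        )

    def forward(self, text, document_context=None):
        # Use the tail of the surrounding document if given, else the text itself.
        if document_context:
            context = document_context[-self.context_window:]
        else:
            context = text

        # Find entities
        extraction = self.find_entities(
            text=text,
            context=context,
            entity_types=", ".join(self.entity_types)
        )

        # Logic to parse, validate, and disambiguate entities would follow,
        # using self.validate_entities and self.disambiguate.

        # ChainOfThought exposes its reasoning as `reasoning` in recent DSPy
        # releases and as `rationale` in older ones; accept either.
        reasoning = (getattr(extraction, "reasoning", None)
                     or getattr(extraction, "rationale", None))
        return dspy.Prediction(
            entities=extraction.entities_with_positions,
            extraction_reasoning=reasoning
        )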
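Positions reported by a language model can be unreliable, so one simple validation step is to recompute them by searching for each entity string in the source text. The sketch below is one possible approach, not part of the module above; the `locate_entities` helper is hypothetical and only handles exact, verbatim matches.

```python
import re


def locate_entities(text, candidates):
    """Find every verbatim occurrence of each candidate string in text.

    Returns (found, missing): found is a list of (entity, start, end) tuples,
    missing lists candidates that never appear and may be hallucinated.
    """
    found, missing = [], []
    for candidate in candidates:
        if not candidate.strip():
            continue  # an empty pattern would match at every position
        matches = [
            (candidate, m.start(), m.end())
            for m in re.finditer(re.escape(candidate), text)
        ]
        if matches:
            found.extend(matches)
        else:
            missing.append(candidate)
    return found, missing
```

Candidates that land in `missing` are good inputs for a follow-up check, for example a call to a validation predictor, since they do not appear in the text as written.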

Specialized Applications

Resume/CV Parser

Extracting structured data from resumes:

Python
class ResumeParser(dspy.Module):
    def __init__(self):
        super().__init__()
        self.contact_info = dspy.Predict(
            "resume_text -> name, email, phone, location, linkedin"
        )
        self.extract_sections = dspy.Predict(
            "resume_text -> work_experience, education, skills, certifications"
        )
        self.parse_experience = dspy.ChainOfThought(
            "experience_section -> detailed_experiences"
        )

    def forward(self, resume_text):
        contact = self.contact_info(resume_text=resume_text)
        sections = self.extract_sections(resume_text=resume_text)
        
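        # Parse the work-experience section into detailed entries.
        experience = self.parse_experience(
            experience_section=sections.work_experience
        )

        return dspy.Prediction(
            contact_info=contact,
            sections=sections,
            detailed_experiences=experience.detailed_experiences
        )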
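Contact fields such as email and phone have predictable formats, so they can be checked with rules after extraction. The sketch below assumes the extracted fields have been copied into a plain dict. The `validate_contact` helper and both patterns are illustrative and deliberately loose; they are not a complete email or phone grammar.

```python
import re

# Loose format checks: these catch obvious extraction errors, not every invalid value.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")
PHONE_RE = re.compile(r"^\+?[0-9][0-9\s().-]{6,18}[0-9]$")


def validate_contact(fields):
    """Return a dict of booleans saying which contact fields look well-formed."""
    email = (fields.get("email") or "").strip()
    phone = (fields.get("phone") or "").strip()
    return {
        "email": bool(EMAIL_RE.match(email)),
        "phone": bool(PHONE_RE.match(phone)),
    }
```

Fields that fail the check can be blanked out or sent back to the model for a second attempt, depending on how strict the pipeline needs to be.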

Best Practices

  • Ambiguity: Use surrounding context to disambiguate entities (e.g., "Apple" the fruit vs. "Apple" the company).
  • Validation: Use regex or rules to validate types with a predictable format, such as emails or dates.
  • Nested Entities: Decide how to handle entities that contain other entities (e.g., the location "California" inside the organization "University of California").
  • Optimization: Use BootstrapFewShot with a custom F1 metric to improve extraction performance.
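A custom F1 metric for the optimization step might look like the sketch below. It assumes both the gold example and the prediction expose an `entities` attribute holding `(label, text)` pairs, and it scores exact matches only. The `BOOTSTRAP_THRESHOLD` value is an arbitrary placeholder; the `(example, pred, trace=None)` argument order follows the usual DSPy metric convention.

```python
from types import SimpleNamespace

# Arbitrary cut-off for accepting a bootstrapped demonstration; tune per task.
BOOTSTRAP_THRESHOLD = 0.8


def entity_f1(example, pred, trace=None):
    """Set-based F1 over (label, text) pairs, ignoring case and outer whitespace."""
    def normalize(entities):
        return {(label.upper(), text.strip().lower()) for label, text in entities}

    gold = normalize(example.entities)
    predicted = normalize(pred.entities)

    if not gold and not predicted:
        score = 1.0  # nothing to find and nothing predicted
    else:
        true_positives = len(gold & predicted)
        if true_positives == 0:
            score = 0.0
        else:
            precision = true_positives / len(predicted)
            recall = true_positives / len(gold)
            score = 2 * precision * recall / (precision + recall)

    # When a trace is supplied (as during bootstrapping), return a pass/fail
    # decision; otherwise return the raw score for evaluation.
    if trace is not None:
        return score >= BOOTSTRAP_THRESHOLD
    return score


# Stand-ins for a gold example and a prediction, for illustration only.
gold = SimpleNamespace(entities=[("PERSON", "John Smith"), ("ORGANIZATION", "Google")])
guess = SimpleNamespace(entities=[("person", "john smith"), ("LOCATION", "New York")])
score = entity_f1(gold, guess)
```

A function of this shape can be passed as the `metric` argument when constructing `BootstrapFewShot`, provided the extractor's output is parsed into `(label, text)` pairs first.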