Scientific Figure Caption Generation | Chapter 6

The Challenge

Writing captions for scientific figures is hard. A caption must be technically precise and must match the writing style of the paper (or journal). Generic image captioning models fail here because they lack both the paper's context and the relevant domain knowledge.

Two-Stage Pipeline

We address this with a two-stage pipeline:

Context-Aware Generation: A module retrieves relevant text from the paper (e.g., the paragraphs surrounding the figure reference) to understand what the figure shows.
Stylistic Refinement: A second module rewrites the draft caption to match the target author's style, using few-shot examples from their previous work.
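
The two stages can be sketched in plain Python without any model calls. This is a minimal illustration, not the chapter's implementation: the names `retrieve_figure_context`, `build_style_prompt`, and `StyleExample` are hypothetical, the retrieval is a simple regex match on figure mentions, and the refinement step only assembles the few-shot prompt that a language model would then complete.

```python
import re
from dataclasses import dataclass


@dataclass
class StyleExample:
    """One (draft caption, author's published caption) pair. Hypothetical structure."""
    draft: str
    final: str


def retrieve_figure_context(paragraphs, figure_number, window=0):
    """Return paragraphs that mention the figure, plus `window` neighbours on each side."""
    # Matches "Figure 3", "Fig. 3", and "Fig 3", but not "Figure 30".
    pattern = re.compile(rf"\bfig(?:ure|\.)?\s*{figure_number}\b", re.IGNORECASE)
    keep = set()
    for i, text in enumerate(paragraphs):
        if pattern.search(text):
            lo = max(0, i - window)
            hi = min(len(paragraphs), i + window + 1)
            keep.update(range(lo, hi))
    # Preserve the original document order.
    return [paragraphs[i] for i in sorted(keep)]


def build_style_prompt(draft_caption, examples):
    """Assemble a few-shot prompt asking a model to restyle `draft_caption`."""
    lines = ["Rewrite the draft caption in the style of the examples.", ""]
    for ex in examples:
        lines += [f"Draft: {ex.draft}", f"Final: {ex.final}", ""]
    # The open "Final:" slot is what the model is expected to fill in.
    lines += [f"Draft: {draft_caption}", "Final:"]
    return "\n".join(lines)
```

In a real pipeline the retrieved paragraphs would be passed to the captioning module as context, and the assembled prompt would be sent to the language model used for refinement.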

Stage 1: Category-Specific Optimization

We use DSPy's MIPROv2 optimizer to tune prompts separately for each figure type (e.g., bar charts, scatter plots, diagrams).

import dspy

def optimize_for_category(training_data, category):
    # Keep only the examples for this figure type (e.g. "bar_graph")
    category_data = [x for x in training_data if x.category == category]

    # Compile a module specialized for this category.
    # CategoryCaptionModule and caption_quality_metric are defined elsewhere.
    optimizer = dspy.MIPROv2(metric=caption_quality_metric)
    return optimizer.compile(CategoryCaptionModule(), trainset=category_data)
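
To apply the function above across all figure types, the training set has to be partitioned by category first, and inference needs a way to pick the right compiled module. The sketch below shows one way to do that in plain Python. `FigureExample`, `split_by_category`, `build_category_modules`, and `pick_module` are hypothetical names, and `optimize_fn` stands in for any callable with the same arguments as `optimize_for_category`; the minimum-size threshold and the fallback module are assumptions, not part of the original pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FigureExample:
    """Hypothetical training record: a figure type and its reference caption."""
    category: str
    caption: str


def split_by_category(examples, min_examples=1):
    """Group examples by `category`, dropping groups smaller than `min_examples`."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex.category].append(ex)
    return {cat: items for cat, items in groups.items() if len(items) >= min_examples}


def build_category_modules(examples, optimize_fn, min_examples=1):
    """Call `optimize_fn(category_examples, category)` once per category."""
    return {
        cat: optimize_fn(items, cat)
        for cat, items in split_by_category(examples, min_examples).items()
    }


def pick_module(modules, category, fallback):
    """Return the module for `category`, or `fallback` for unseen figure types."""
    return modules.get(category, fallback)
```

Skipping very small categories avoids spending optimizer budget on figure types with too few examples to tune against; those figures are routed to the fallback module instead.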

Results

The specialized pipeline improved both reported metrics:

ROUGE-1 Recall: +8.3%
Style Consistency: ~45% improvement in BLEU scores against author profile examples.
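
For readers unfamiliar with the first metric, ROUGE-1 recall is the fraction of unigrams in the reference caption that also appear in the generated caption, with repeated words counted at most as often as they occur in the candidate. The sketch below is a simplified version that assumes lowercase whitespace tokenization; standard ROUGE tooling typically adds its own tokenization and optional stemming, so its scores may differ. The function name `rouge1_recall` is illustrative.

```python
from collections import Counter


def rouge1_recall(reference, candidate):
    """Clipped unigram overlap divided by the number of reference unigrams."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Each reference token is matched at most as many times as it occurs in the candidate.
    overlap = sum(min(n, cand_counts[tok]) for tok, n in ref_counts.items())
    return overlap / total


reference = "accuracy of the model on the test set"
candidate = "model accuracy on the test set"
print(rouge1_recall(reference, candidate))  # 6 of 8 reference tokens matched -> 0.75
```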

Next: Retrieval-Augmented Guardrails
