Chapter 4

Human-Aligned Evaluation

Creating evaluation systems that reflect actual human priorities and quality requirements.

Overview

Traditional evaluation metrics like BERTScore, ROUGE, and BLEU often fail to capture what truly matters to human users, especially in complex, nuanced tasks. Human-aligned evaluation focuses on creating evaluation systems that reflect actual human priorities and domain-specific quality requirements.

This section explores how to bridge the gap between automated metrics and human judgment, drawing on techniques like weighted dimensions and granular feedback.

The Limitations of Standard Metrics

Why Off-the-Shelf Metrics Fail

# Example from clinical summarization
standard_metrics = {
    "bert_score": 87.19,  # High semantic similarity
    "rouge_2": 0.82,      # Good n-gram overlap
    # But missed critical clinical details!
}

# Human evaluation revealed:
# - Omitted key diagnoses
# - Missing treatment outcomes
# - Incomplete patient history
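
The gap is easy to reproduce with off-the-shelf scoring libraries. Below is a minimal sketch, assuming the rouge-score and bert-score Python packages and two illustrative texts, showing that a summary which drops a critical diagnosis can still score highly on both metrics:

# Minimal sketch: high metric scores despite a missing diagnosis.
# Assumes the `rouge-score` and `bert-score` packages; texts are illustrative.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = (
    "Patient admitted with chest pain. Diagnosed with NSTEMI and type 2 diabetes. "
    "Started on aspirin and insulin; discharged in stable condition."
)
candidate = (
    "Patient admitted with chest pain. Diagnosed with type 2 diabetes. "
    "Started on aspirin and insulin; discharged in stable condition."
)  # Omits the NSTEMI diagnosis: clinically critical, lexically minor.

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)
print(scorer.score(reference, candidate)["rouge2"].fmeasure)  # remains high

_, _, f1 = bert_score([candidate], [reference], lang="en")
print(f1.item())                                              # remains high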

Key Problems:

  • Context-blind: Metrics don't understand task-specific requirements
  • Surface-level: Focus on lexical overlap, not meaningful content
  • Poor correlation: Scores often correlate only weakly with actual human judgment (a quick check is sketched below)
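
To quantify the last point on your own task, compute a rank correlation between metric scores and human ratings on a small annotated sample. A minimal sketch, assuming scipy and purely illustrative paired scores:

# Minimal sketch: rank correlation between an automated metric and human ratings.
# Assumes scipy; the paired scores below are illustrative, not real data.
from scipy.stats import spearmanr

bert_scores   = [0.87, 0.91, 0.85, 0.90, 0.88, 0.86]  # automated metric
human_ratings = [2, 4, 1, 3, 5, 2]                     # 1-5 clinician ratings

rho, p_value = spearmanr(bert_scores, human_ratings)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.2f})")  # low rho => poor alignment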

Building Human-Aligned Evaluation Systems

1. Understand Your Quality Dimensions

class ClinicalQualityDimensions:
    """Quality dimensions for clinical summarization."""
    FACTUAL_ACCURACY = "Is all information correct?"
    CLINICAL_COMPLETENESS = "Are all critical findings included?"
    CONCISENESS = "Is it appropriately brief?"
    CLINICAL_RELEVANCE = "Is information clinically significant?"

    @classmethod
    def get_weights(cls, context="emergency"):
        # Different weights for different contexts
        if context == "emergency":
            return {
                cls.FACTUAL_ACCURACY: 0.5,
                cls.CLINICAL_COMPLETENESS: 0.3, # ...
            }
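
Once dimensions and context-specific weights are defined, an overall score is simply a weighted average of per-dimension scores. A minimal usage sketch (the per-dimension scores below are hypothetical judge outputs in [0, 1]):

# Minimal usage sketch: combine per-dimension scores with context weights.
# The per-dimension scores are hypothetical judge outputs in [0, 1].
dimension_scores = {
    ClinicalQualityDimensions.FACTUAL_ACCURACY: 0.9,
    ClinicalQualityDimensions.CLINICAL_COMPLETENESS: 0.6,
}

weights = ClinicalQualityDimensions.get_weights(context="emergency")
overall = sum(weights[dim] * dimension_scores[dim] for dim in weights)
overall /= sum(weights.values())  # normalise in case weights don't sum to 1
print(f"Weighted quality score: {overall:.2f}")

Normalising by the sum of the weights keeps the score on a [0, 1] scale even when weights for some dimensions are omitted or rebalanced.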

2. Human-Aligned LLM Judge

class HumanAlignedLLMJudge(dspy.Module):
    """LLM judge trained on human feedback patterns."""

    def __init__(self, quality_dimensions, weights=None):
        super().__init__()
        self.dimensions = quality_dimensions
        self.weights = weights
        
        # Create an evaluation signature that explains each dimension in detail
        self.evaluation_signature = dspy.Signature(
            # ... prompt definition ...
        )
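
One possible way to complete this module in DSPy is sketched below. The signature fields, the per-dimension scoring loop, and the weighted aggregation are assumptions for illustration, not the exact implementation used in the case study:

# Sketch of a possible completion, assuming a recent dspy release.
# Field names and aggregation logic are illustrative.
import dspy

class DimensionJudgement(dspy.Signature):
    """Rate the summary on one quality dimension from 0 (fails) to 10 (excellent)."""
    document = dspy.InputField(desc="Source clinical note")
    summary = dspy.InputField(desc="Candidate summary to evaluate")
    dimension = dspy.InputField(desc="Quality dimension and its guiding question")
    score = dspy.OutputField(desc="Integer score from 0 to 10")

class HumanAlignedLLMJudge(dspy.Module):
    """LLM judge that scores each dimension separately, then combines them."""

    def __init__(self, quality_dimensions, weights=None):
        super().__init__()
        self.dimensions = quality_dimensions
        self.weights = weights or {d: 1.0 for d in quality_dimensions}
        self.rate = dspy.ChainOfThought(DimensionJudgement)

    def forward(self, document, summary):
        scores = {}
        for dim in self.dimensions:
            result = self.rate(document=document, summary=summary, dimension=dim)
            scores[dim] = float(result.score) / 10.0  # assumes a bare numeric reply
        total = sum(self.weights[d] * scores[d] for d in self.dimensions)
        return total / sum(self.weights.values())

Scoring each dimension in a separate call keeps the judge's reasoning focused and makes per-dimension feedback available for error analysis, at the cost of more LLM calls per example.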

Case Study Results

Metric                Before Optimization    After Optimization    Improvement
BERTScore             87.19                  87.27                 +0.08
LLM Judge             53.90                  68.07                 +26.3%
Human Alignment (ρ)   0.14                   0.28                  2x improvement

Best Practices

  • Start with Clear Criteria: Define actionable, concrete quality criteria (e.g., rubrics with specific levels for each dimension).
  • Separate Training/Val: Ensure strict splits to prevent data leakage and overfitting to specific examples.
  • Version Control: Track your evaluation prompts and weights as they evolve (a config sketch follows below).
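
For the version-control point in particular, it helps to keep rubric text, dimension weights, and split definitions in one versioned artifact rather than scattered across code. A minimal sketch of such a config (file name, paths, and weight values are illustrative):

# Minimal sketch of a versioned evaluation config.
# File name, paths, and weight values are illustrative.
import json

eval_config = {
    "version": "v3",
    "context": "emergency",
    "weights": {
        "FACTUAL_ACCURACY": 0.5,
        "CLINICAL_COMPLETENESS": 0.3,
        "CONCISENESS": 0.1,
        "CLINICAL_RELEVANCE": 0.1,
    },
    "rubric_prompt": "rubrics/clinical_summary_v3.txt",
    "splits": {"train": "data/train_ids.json", "val": "data/val_ids.json"},
}

with open("eval_config_v3.json", "w") as f:
    json.dump(eval_config, f, indent=2)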