Chapter 5

Fine-Tuning Small LMs

Adapt small, efficient models like Llama 3 or Mistral so they match much larger models on specific tasks.

Introduction

While powerful models like GPT-4 are excellent for prototyping, they are often too slow or too expensive for production, or unsuitable where data privacy matters. Fine-tuning lets you take a smaller model (such as Mistral-7B or Llama-3-8B) and specialize it for your task, often achieving comparable performance at a fraction of the cost.

🤔 When to Fine-Tune?

| Scenario | Fine-Tuning is... | Why? |
| --- | --- | --- |
| High Volume / Real-time | ✅ Excellent | Low latency, low per-token cost. |
| Specialized Domain | ✅ Excellent | Models learn jargon and formats better than via prompting. |
| Complex Reasoning | ⚠️ Challenging | Small models struggle with deep reasoning even after fine-tuning. |
| Frequent Updates | ❌ Poor | Retraining is slow compared to updating a prompt. |

🛠️ Setting Up QLoRA

QLoRA (Quantized Low-Rank Adaptation) is the industry standard for efficient fine-tuning. It allows you to fine-tune a 7B model on a single consumer GPU.

# Prerequisites
!pip install torch transformers datasets accelerate peft bitsandbytes

import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def setup_qlora(model_name="mistralai/Mistral-7B-v0.1"):
    # Load the model in 4-bit NF4 (the QLoRA quantization scheme) to save memory.
    # The modern API passes a BitsAndBytesConfig; the bare load_in_4bit=True
    # kwarg is deprecated.
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token  # Mistral defines no pad token

    # Configure LoRA adapters
    lora_config = LoraConfig(
        r=16,                   # Rank: higher = more trainable parameters
        lora_alpha=32,          # Scaling factor applied to adapter updates
        target_modules=["q_proj", "v_proj"],  # Target attention projections
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )

    # Cast norms/embeddings for stable k-bit training, then attach the adapters
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    return model, tokenizer
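
Training from here is a standard causal-LM loop. Below is a minimal sketch using the Hugging Face `Trainer`; the `train.jsonl` file, its `"text"` field, and the hyperparameters are illustrative assumptions, not fixed requirements:

from datasets import load_dataset
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

model, tokenizer = setup_qlora()

# Hypothetical dataset: one JSON object per line with a "text" field
dataset = load_dataset("json", data_files="train.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="qlora-out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,   # LoRA tolerates higher LRs than full fine-tuning
        num_train_epochs=3,
        bf16=True,            # Match the bnb compute dtype above
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
model.save_pretrained("qlora-adapter")  # Saves only the small LoRA weights

Because only the adapter weights are saved, the artifact is typically a few hundred megabytes rather than the full 7B checkpoint.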

🔌 Integration with DSPy

Once you have a fine-tuned model, using it in DSPy is straightforward: wrap it in a small subclass of `dspy.LM`.

import dspy

class FineTunedLLM(dspy.LM):
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def __call__(self, prompt, **kwargs):
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.model.device)
        outputs = self.model.generate(**inputs, max_new_tokens=200)
        # Decode only the newly generated tokens, not the echoed prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        return [self.tokenizer.decode(new_tokens, skip_special_tokens=True)]

# Usage
ft_model = FineTunedLLM(my_finetuned_model, my_tokenizer)
dspy.configure(lm=ft_model)
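
With the LM configured, any DSPy module routes through the fine-tuned model. A quick sanity check using `dspy.Predict` (the signature string and input text below are made up for illustration):

# Any DSPy module now calls the fine-tuned model under the hood
summarize = dspy.Predict("document -> summary")
result = summarize(document="Quarterly revenue rose 12% on strong cloud demand...")
print(result.summary)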

✨ Synergy: Fine-Tuning + Prompt Optimization

The most powerful systems typically combine both approaches:

  1. Step 1: Fine-tune the model so it learns the basic format and domain constraints.
  2. Step 2: Prompt-optimize (using BootstrapFewShot or MIPRO) to find the few-shot examples that best steer the new model, as sketched below.
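
Here is a minimal sketch of Step 2 using DSPy's `BootstrapFewShot`; the classification task, the metric, and the tiny trainset are illustrative assumptions:

from dspy.teleprompt import BootstrapFewShot

# Tiny illustrative trainset; a real project would use hundreds of examples
trainset = [
    dspy.Example(ticket_text="App crashes on login", category="bug").with_inputs("ticket_text"),
    dspy.Example(ticket_text="How do I export my data?", category="question").with_inputs("ticket_text"),
]

classify = dspy.Predict("ticket_text -> category")

def exact_match(example, prediction, trace=None):
    # Simple string-match metric; swap in whatever fits your task
    return prediction.category.strip().lower() == example.category

# Bootstrap few-shot demos that steer the fine-tuned model
optimizer = BootstrapFewShot(metric=exact_match, max_bootstrapped_demos=4)
compiled_classifier = optimizer.compile(classify, trainset=trainset)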

Result: research suggests this combination can yield 2-26x improvements over baseline, unlocking capabilities that neither method achieves alone.