DSPy GEPA: Auto-Optimize AI Agent Prompts — ContentBuffer guide

Kodetra Technologies · 4 min read · Intermediate

Summary

Use DSPy GEPA to auto-evolve prompts with reflection and beat hand-tuned baselines.

Why GEPA, and why now

Most teams still tune agent prompts by hand: tweak the wording, run the eval, squint at the failures, repeat. GEPA (Genetic-Pareto) automates that loop. It is a reflective optimizer in DSPy that reads your program's actual execution traces, reasons about what went wrong in plain language, and proposes better instructions automatically.

It is not a toy. GEPA was accepted as an oral at ICLR 2026 and reportedly outperforms the older MIPROv2 optimizer by ~13% and reinforcement-style GRPO tuning by up to ~20%, while using up to 35x fewer rollouts. In one DSPy example it lifts a ChainOfThought program on MATH from 67% to 93% accuracy through instruction refinement alone. This guide gets you from zero to a working optimization run in about 15 minutes.

Prerequisites

  • Python 3.10 or newer
  • An LLM API key (this guide uses OpenAI, but any LiteLLM-supported provider works)
  • Basic familiarity with a DSPy Module and Signature
  • DSPy itself (GEPA ships inside DSPy as dspy.GEPA), installed with the command below
pip install -U dspy
export OPENAI_API_KEY=sk-...

Step 1 - Define a DSPy program

Start with an ordinary DSPy module. Here is a tiny classifier whose instruction we will let GEPA evolve. Notice we do not write a clever prompt — that is GEPA's job.

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))

class Classify(dspy.Signature):
    """Classify the sentiment of a sentence."""
    sentence: str = dspy.InputField()
    label: str = dspy.OutputField(desc="positive, negative, or neutral")

program = dspy.ChainOfThought(Classify)

# Sanity check before optimizing
print(program(sentence="The battery dies in an hour.").label)
# -> negative (typical output; the exact string depends on the model)
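
Before optimizing, it helps to record a baseline number so you can tell later whether GEPA actually improved anything. The sketch below is one minimal way to do that in plain Python. The names accuracy, always_negative, and devset are made up for illustration, and the stand-in predictor exists only so the snippet runs offline; with the program above you would pass something like lambda s: program(sentence=s).label instead.

```python
def accuracy(predict_label, examples):
    """Fraction of (sentence, gold_label) pairs that predict_label gets right."""
    if not examples:
        raise ValueError("examples must not be empty")
    hits = sum(
        predict_label(sentence).strip().lower() == gold.strip().lower()
        for sentence, gold in examples
    )
    return hits / len(examples)


# Hypothetical stand-in predictor so the sketch runs without an API key.
def always_negative(sentence):
    return "negative"


# Hypothetical held-out examples as (sentence, gold_label) pairs.
devset = [
    ("I love this!", "positive"),
    ("Worst purchase ever.", "negative"),
    ("The battery dies in an hour.", "negative"),
    ("It is fine, I guess.", "neutral"),
]

baseline = accuracy(always_negative, devset)
print(f"baseline accuracy: {baseline:.2f}")
```

Run the same function on the optimized program afterwards, on the same held-out examples, and compare the two numbers.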

Step 2 - Write a feedback metric

This is the heart of GEPA. A normal optimizer only sees a number. GEPA can also read textual feedback that explains why a prediction scored the way it did. Return a dspy.Prediction with both a score and a feedback string, and the reflection model uses that text to write smarter instructions.

def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = pred.label.strip().lower() == gold.label.strip().lower()
    score = 1.0 if correct else 0.0
    if correct:
        fb = "Correct label."
    else:
        fb = (f"Predicted '{pred.label}' but the gold label is "
              f"'{gold.label}'. Watch for sarcasm and mixed sentiment.")
    return dspy.Prediction(score=score, feedback=fb)

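GEPA may call the metric with the full signature metric(gold, pred, trace, pred_name, pred_trace); pred_name and pred_trace identify the predictor currently being optimized and its part of the trace, so you can tailor feedback to that step. If you return a plain float instead, GEPA still works, but it falls back to feedback like "This trajectory got a score of 0.0", which is far less useful. Always return real feedback when you can.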
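
Feedback is most useful when it names the kind of failure, not just the fact of one. As a rough sketch, the helper below separates a format failure (an answer outside the label set) from a content failure (a valid but wrong label). score_with_feedback and ALLOWED_LABELS are hypothetical names for this guide; the helper is plain Python, and inside the metric above you would wrap its result as dspy.Prediction(score=score, feedback=fb).

```python
ALLOWED_LABELS = {"positive", "negative", "neutral"}


def score_with_feedback(gold_label: str, pred_label: str) -> tuple[float, str]:
    """Return (score, feedback) for one prediction. Plain Python, no DSPy needed."""
    gold = gold_label.strip().lower()
    pred = pred_label.strip().lower()
    if pred == gold:
        return 1.0, "Correct label."
    if pred not in ALLOWED_LABELS:
        # Format failure: the model answered outside the label set.
        return 0.0, (
            f"Output '{pred_label}' is not one of {sorted(ALLOWED_LABELS)}. "
            "Answer with exactly one of those words."
        )
    # Content failure: a valid label, but the wrong one.
    return 0.0, f"Predicted '{pred}' but the gold label is '{gold}'."


print(score_with_feedback("negative", " Negative "))
print(score_with_feedback("neutral", "mixed"))
```

Keeping the scoring logic in a pure function like this also makes it easy to unit-test the metric without calling a model.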

Step 3 - Configure GEPA with a reflection LM

GEPA needs two models: the task model you already configured, and a strong reflection model that does the analysis and rewriting. The reflection_lm argument is required unless you supply a custom instruction proposer. You must also set exactly one budget knob: auto, max_full_evals, or max_metric_calls.

from dspy import GEPA

optimizer = GEPA(
    metric=metric,
    auto="light",                 # "light" | "medium" | "heavy"
    reflection_lm=dspy.LM(
        "openai/gpt-4o", temperature=1.0, max_tokens=8000
    ),
    track_stats=True,
)

Use auto="light" while iterating so runs stay cheap, then move to "medium" or "heavy" for a final pass. A higher temperature on the reflection model (1.0) is recommended — reflection benefits from creative rewrites.
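
The one-budget-knob rule is easy to trip over when the values come from a config file or CLI flags. As a rough sketch, a small pre-flight check like the one below can catch the mistake before you construct the optimizer. check_budget is a hypothetical helper written for this guide, not part of DSPy, and DSPy performs its own check regardless.

```python
def check_budget(auto=None, max_full_evals=None, max_metric_calls=None):
    """Return the name of the single budget option that was set, else raise ValueError."""
    knobs = {
        "auto": auto,
        "max_full_evals": max_full_evals,
        "max_metric_calls": max_metric_calls,
    }
    chosen = [name for name, value in knobs.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError(f"Set exactly one budget option, got: {chosen or 'none'}")
    return chosen[0]


print(check_budget(auto="light"))  # prints: auto

try:
    check_budget(auto="light", max_metric_calls=500)
except ValueError as err:
    print(err)
```

Failing early with a readable message is the only point here; the values themselves still go to GEPA unchanged.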

Step 4 - Compile and inspect the result

GEPA uses the trainset for reflective updates and a valset to track Pareto scores. If you omit the valset, it reuses the trainset for both.

trainset = [
    dspy.Example(sentence="I love this!", label="positive").with_inputs("sentence"),
    dspy.Example(sentence="It is fine, I guess.", label="neutral").with_inputs("sentence"),
    dspy.Example(sentence="Worst purchase ever.", label="negative").with_inputs("sentence"),
    # ... add 20-50 examples for a real run
]

optimized = optimizer.compile(program, trainset=trainset, valset=trainset)

# The evolved instruction GEPA discovered:
print(optimized.predict.signature.instructions)
optimized.save("classify_gepa.json")

Example output (abridged) — note how GEPA folded your feedback hints straight into the prompt:

Classify the sentiment of a sentence as positive, negative, or
neutral. Pay close attention to sarcasm and sentences that mix
praise with complaints; judge the dominant sentiment, and default
to neutral only when no clear polarity is present.
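
If "Pareto scores" sounds abstract, the toy sketch below shows the general idea in plain Python: instead of keeping only the candidate with the best average, track which candidates are best on at least one validation example. This is a simplified illustration with invented scores and a hypothetical pareto_wins helper, not DSPy's actual implementation, which may differ in its details.

```python
from collections import Counter


def pareto_wins(scores: dict[str, list[float]]) -> Counter:
    """Count, per candidate, how many examples it scores best on (ties count for all)."""
    n_examples = len(next(iter(scores.values())))
    wins = Counter()
    for i in range(n_examples):
        best = max(per_example[i] for per_example in scores.values())
        for name, per_example in scores.items():
            if per_example[i] == best:
                wins[name] += 1
    return wins


# Invented per-example validation scores for three candidate prompts.
scores = {
    "seed":   [1.0, 0.0, 1.0, 1.0],
    "cand_a": [1.0, 1.0, 0.0, 0.0],
    "cand_b": [0.5, 0.0, 0.5, 0.0],
}

wins = pareto_wins(scores)
frontier = sorted(wins)  # candidates that are best on at least one example
print(frontier, dict(wins))
```

Here cand_a has a lower average than seed, yet it stays on the frontier because it solves an example that seed misses. That diversity is what a per-example frontier is meant to preserve.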

Common pitfalls

  • Forgetting reflection_lm. It is mandatory and should be a strong model (e.g. gpt-4o / gpt-5 class), even if your task model is small.
  • Setting two budget knobs. Exactly one of auto, max_full_evals, or max_metric_calls may be set, or GEPA raises an assertion error.
  • Returning only a score. You lose GEPA's biggest advantage. Write feedback that names the failure.
  • Tiny trainsets. 3 examples won't generalize. Aim for 20-50+ so the Pareto frontier is meaningful.
  • Reusing trainset as valset on small data. Fine for a demo, but split them for honest validation scores.
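
For the last pitfall, a deterministic split is usually enough. The sketch below is one simple way to do it in plain Python; split_examples, val_fraction, and seed are hypothetical names chosen for this guide, and the string placeholders stand in for your dspy.Example objects.

```python
import random


def split_examples(examples, val_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` deterministically and return (trainset, valset)."""
    if len(examples) < 2:
        raise ValueError("Need at least two examples to split.")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example on each side of the split.
    n_val = max(1, int(len(shuffled) * val_fraction))
    n_val = min(n_val, len(shuffled) - 1)
    return shuffled[n_val:], shuffled[:n_val]


# Placeholders; in a real run these would be dspy.Example objects.
examples = [f"example-{i}" for i in range(10)]
trainset, valset = split_examples(examples)
print(len(trainset), len(valset))
```

Fixing the seed keeps the validation set stable across runs, so scores from different optimization passes stay comparable.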

Quick reference

Argument                            Purpose
metric                              Feedback fn returning Prediction(score, feedback)
reflection_lm                       Required strong model that rewrites prompts
auto                                Budget preset: light / medium / heavy
max_metric_calls                    Alternative explicit budget (pick one knob)
track_stats                         Expose detailed_results on the output program
compile(student, trainset, valset)  Runs optimization, returns new module

Next steps

You now have an optimizer that improves prompts from your own eval signal, with no manual prompt-fiddling. From here, point GEPA at a multi-step agent (it optimizes every named predictor), swap in a domain metric with richer feedback, or use track_best_outputs=True (alongside track_stats=True) to run GEPA as an inference-time search over a batch. To ship the result, rebuild the same module wherever you need it and call program.load("classify_gepa.json") on it.
