
DSPy GEPA: Auto-Optimize AI Agent Prompts
Summary
Use DSPy GEPA to auto-evolve prompts with reflection and beat hand-tuned baselines.
Why GEPA, and why now
Most teams still tune agent prompts by hand: tweak the wording, run the eval, squint at the failures, repeat. GEPA (Genetic-Pareto) automates that loop. It is a reflective optimizer in DSPy that reads your program's actual execution traces, reasons about what went wrong in plain language, and proposes better instructions automatically.
It is not a toy. GEPA was accepted as an oral at ICLR 2026 and reportedly outperforms the older MIPROv2 optimizer by ~13% and reinforcement-style GRPO tuning by up to ~20%, while using up to 35x fewer rollouts. In one DSPy example it lifts a ChainOfThought program on MATH from 67% to 93% accuracy through instruction refinement alone. This guide gets you from zero to a working optimization run in about 15 minutes.
Prerequisites
- Python 3.10 or newer
- An LLM API key (this guide uses OpenAI, but any LiteLLM-supported provider works)
- Basic familiarity with a DSPy Module and Signature
- pip install -U dspy (GEPA ships inside DSPy as dspy.GEPA)
pip install -U dspy
export OPENAI_API_KEY=sk-...
Step 1 - Define a DSPy program
Start with an ordinary DSPy module. Here is a tiny classifier whose instruction we will let GEPA evolve. Notice we do not write a clever prompt — that is GEPA's job.
import dspy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini", temperature=0.0))
class Classify(dspy.Signature):
    """Classify the sentiment of a sentence."""
    sentence: str = dspy.InputField()
    label: str = dspy.OutputField(desc="positive, negative, or neutral")
program = dspy.ChainOfThought(Classify)
# Sanity check before optimizing
print(program(sentence="The battery dies in an hour.").label)
# -> negative
Step 2 - Write a feedback metric
This is the heart of GEPA. A normal optimizer only sees a number. GEPA can also read textual feedback that explains why a prediction scored the way it did. Return a dspy.Prediction with both a score and a feedback string, and the reflection model uses that text to write smarter instructions.
def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    correct = pred.label.strip().lower() == gold.label.strip().lower()
    score = 1.0 if correct else 0.0
    if correct:
        fb = "Correct label."
    else:
        fb = (f"Predicted '{pred.label}' but the gold label is "
              f"'{gold.label}'. Watch for sarcasm and mixed sentiment.")
    return dspy.Prediction(score=score, feedback=fb)
The full signature GEPA may call is metric(gold, pred, trace, pred_name, pred_trace). If you return a plain float instead, GEPA still works — it just falls back to feedback like "This trajectory got a score of 0.0", which is far less useful. Always return real feedback when you can.
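Feedback is more useful to the reflection model when it names the kind of failure, not just the mismatch. The sketch below is one way to do that for the three-label task in this guide. The helper name score_with_feedback and the VALID_LABELS set are illustrative assumptions, not part of DSPy, and the code is plain Python so it runs without an API key.

```python
# Hypothetical helper, not part of DSPy: builds a (score, feedback) pair
# for the three-way sentiment task used in this guide.
VALID_LABELS = {"positive", "negative", "neutral"}


def score_with_feedback(gold_label: str, pred_label: str) -> tuple[float, str]:
    """Return (score, feedback) for one prediction."""
    gold = gold_label.strip().lower()
    pred = pred_label.strip().lower()

    if pred == gold:
        return 1.0, "Correct label."
    if pred not in VALID_LABELS:
        # Format failure: the model answered outside the allowed label set.
        return 0.0, (
            f"Output '{pred_label}' is not a valid label. "
            f"Answer with exactly one of: {', '.join(sorted(VALID_LABELS))}."
        )
    # Valid label, wrong class: say what was expected.
    return 0.0, f"Predicted '{pred}' but the gold label is '{gold}'."


print(score_with_feedback("negative", " Negative "))
print(score_with_feedback("neutral", "mixed"))
print(score_with_feedback("neutral", "positive"))
```

Inside the metric from this step, the pair would be wrapped as dspy.Prediction(score=score, feedback=fb). Separating format failures from wrong-class failures gives the reflection model two distinct things it can fix in the instruction.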
Step 3 - Configure GEPA with a reflection LM
GEPA needs two models: the task model you already configured, and a strong reflection model that does the analysis and rewriting. The reflection_lm argument is required. You must also set exactly one budget knob — auto, max_full_evals, or max_metric_calls.
from dspy import GEPA
optimizer = GEPA(
    metric=metric,
    auto="light",  # "light" | "medium" | "heavy"
    reflection_lm=dspy.LM(
        "openai/gpt-4o", temperature=1.0, max_tokens=8000
    ),
    track_stats=True,
)
Use auto="light" while iterating so runs stay cheap, then move to "medium" or "heavy" for a final pass. A higher temperature on the reflection model (1.0) is recommended — reflection benefits from creative rewrites.
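The "exactly one budget knob" rule from this step is easy to break when budget settings come from a config file or CLI flags. A small pre-flight check can catch that before any model call is made. This is an illustrative sketch only: check_budget is a hypothetical helper, it is not DSPy's own validation code, and the allowed auto values are taken from the comment in the snippet above.

```python
# Illustrative pre-flight check; this is NOT DSPy's own validation code.
def check_budget(auto=None, max_full_evals=None, max_metric_calls=None) -> str:
    """Return the name of the single budget knob that is set, else raise."""
    knobs = {
        "auto": auto,
        "max_full_evals": max_full_evals,
        "max_metric_calls": max_metric_calls,
    }
    chosen = [name for name, value in knobs.items() if value is not None]
    if len(chosen) != 1:
        raise ValueError(
            f"Set exactly one budget knob, got {len(chosen)}: {chosen or 'none'}"
        )
    if chosen[0] == "auto" and auto not in {"light", "medium", "heavy"}:
        raise ValueError(f"auto must be light, medium, or heavy, got {auto!r}")
    return chosen[0]


print(check_budget(auto="light"))
print(check_budget(max_metric_calls=300))
```

Running the check on parsed settings before building the optimizer turns a late assertion failure into an early, readable error.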
Step 4 - Compile and inspect the result
GEPA uses the trainset for reflective updates and a valset to track Pareto scores. If you omit the valset, it reuses the trainset for both.
trainset = [
    dspy.Example(sentence="I love this!", label="positive").with_inputs("sentence"),
    dspy.Example(sentence="It is fine, I guess.", label="neutral").with_inputs("sentence"),
    dspy.Example(sentence="Worst purchase ever.", label="negative").with_inputs("sentence"),
    # ... add 20-50 examples for a real run
]
optimized = optimizer.compile(program, trainset=trainset, valset=trainset)
# The evolved instruction GEPA discovered:
print(optimized.predict.signature.instructions)
optimized.save("classify_gepa.json")
Example output (abridged) — note how GEPA folded your feedback hints straight into the prompt:
Classify the sentiment of a sentence as positive, negative, or
neutral. Pay close attention to sarcasm and sentences that mix
praise with complaints; judge the dominant sentiment, and default
to neutral only when no clear polarity is present.
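Step 4 says the valset is used to track Pareto scores without showing what that means. As described in the GEPA paper, the idea is per-example: a candidate prompt stays in play if it achieves the best score on at least one validation example, even when its average is not the highest. The sketch below is a simplified illustration of that idea. It is not DSPy's implementation, pareto_front is a hypothetical name, and the scores are invented for the example.

```python
def pareto_front(scores: dict[str, list[float]]) -> set[str]:
    """Candidates that achieve the best score on at least one example.

    `scores` maps a candidate name to its per-example validation scores;
    all lists must have the same length.
    """
    lengths = {len(v) for v in scores.values()}
    if len(lengths) != 1:
        raise ValueError("every candidate needs one score per example")
    n_examples = lengths.pop()
    front: set[str] = set()
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        front.update(name for name, s in scores.items() if s[i] == best)
    return front


# Hypothetical per-example scores for three candidate instructions.
val_scores = {
    "seed":       [1.0, 0.0, 0.0, 1.0],
    "sarcasm_v1": [1.0, 1.0, 1.0, 0.0],
    "weak":       [0.0, 0.0, 0.0, 0.0],
}
print(sorted(pareto_front(val_scores)))  # ['sarcasm_v1', 'seed']
```

Here "seed" survives because it is the only candidate that gets the last example right, although "sarcasm_v1" has the better average. With only a handful of validation examples, ties are common and almost every candidate lands on the frontier, which is why a larger valset makes the selection more informative.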
Common pitfalls
- Forgetting reflection_lm. It is mandatory and should be a strong model (e.g. gpt-4o / gpt-5 class), even if your task model is small.
- Setting two budget knobs. Exactly one of auto, max_full_evals, or max_metric_calls may be set, or GEPA raises an assertion error.
- Returning only a score. You lose GEPA's biggest advantage. Write feedback that names the failure.
- Tiny trainsets. 3 examples won't generalize. Aim for 20-50+ so the Pareto frontier is meaningful.
- Reusing trainset as valset on small data. Fine for a demo, but split them for honest validation scores.
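For the last pitfall, a seeded shuffle-and-split is usually enough. The helper below is a minimal sketch: split_examples and its 30% default are assumptions for illustration, and plain strings stand in for dspy.Example objects so the snippet runs on its own.

```python
import random


def split_examples(examples, val_fraction=0.3, seed=0):
    """Shuffle a copy of `examples` and return (trainset, valset)."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be between 0 and 1")
    if len(examples) < 2:
        raise ValueError("need at least two examples to split")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one example on each side of the split.
    n_val = min(len(shuffled) - 1, max(1, round(len(shuffled) * val_fraction)))
    return shuffled[n_val:], shuffled[:n_val]


data = [f"example-{i}" for i in range(10)]  # stand-ins for dspy.Example objects
train, val = split_examples(data)
print(len(train), len(val))  # 7 3
```

The two lists can then be passed as optimizer.compile(program, trainset=train, valset=val). Fixing the seed keeps the split stable across runs, so score changes reflect the prompt and not a reshuffled valset.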
Quick reference
| Argument | Purpose |
|---|---|
| metric | Feedback fn returning Prediction(score, feedback) |
| reflection_lm | Required strong model that rewrites prompts |
| auto | Budget preset: light / medium / heavy |
| max_metric_calls | Alternative explicit budget (pick one knob) |
| track_stats | Expose detailed_results on the output program |
| compile(student, trainset, valset) | Runs optimization, returns new module |
Next steps
You now have an optimizer that improves prompts from your own eval signal, with no manual prompt-fiddling. From here, point GEPA at a multi-step agent (it optimizes every named predictor), swap in a domain metric with richer feedback, or set track_best_outputs=True (together with track_stats=True) to run GEPA as an inference-time search over a batch. To reuse the saved program elsewhere, rebuild the same module, for example dspy.ChainOfThought(Classify), call program.load("classify_gepa.json") on it, and ship it.