GRPO Fine-Tuning: Train Reasoning Into Small LLMs

Kodetra Technologies · April 30, 2026 · 9 min read · Intermediate

Summary

Use GRPO to teach a 0.5B model multi-step math reasoning end to end.

If you have spent any time in the LLM trenches lately, you have noticed the post-training conversation move on from supervised fine-tuning. The new center of gravity is reinforcement learning, and within it the technique that put DeepSeek-R1 on the map: Group Relative Policy Optimization (GRPO). It is the recipe behind many recent reasoning models that hold their own on math, code, and multi-step logic at a fraction of the size of the frontier giants.

The good news: you can run GRPO on a single consumer GPU. The better news: the Hugging Face trl library now ships a production-ready GRPOTrainer, and pairing it with Unsloth's 4-bit LoRA path lets you train models far larger than would otherwise fit on a 15 GB card. This guide walks through fine-tuning Qwen2.5-0.5B-Instruct on grade-school math problems so the model learns to think, not just imitate. By the end you will have a working reward function, a trained checkpoint, and a clear mental model of why GRPO works.

What you will learn

  • How GRPO differs from PPO and SFT, and when to reach for it.
  • How to design reward functions that actually shape behavior.
  • How to configure GRPOConfig for stable training on a small GPU.
  • How to evaluate a reasoning model and avoid the most common reward-hacking traps.

Prerequisites

  • Python 3.10+ and a CUDA-capable GPU with at least 12 GB VRAM (an RTX 4070 or a Colab A100 will do).
  • Comfort with PyTorch and the Hugging Face transformers ecosystem.
  • Basic understanding of supervised fine-tuning — you do not need RL theory, but knowing what a loss function does helps.

How GRPO works in 60 seconds

Classical PPO needs a separate value model the same size as your policy to estimate how good a given partial response is. That doubles memory and training cost. GRPO sidesteps the value model with a clever trick: for each prompt, generate a group of G completions (typically 4–16). Compute a reward for every completion. The advantage of any single completion is just its reward minus the group mean, divided by the group standard deviation.

In other words, the group itself is the baseline. Completions that beat the group average get pushed up, ones that fall short get pushed down. No critic network, no separate forward pass for value, half the memory. The same KL penalty against a frozen reference model that PPO uses keeps the policy from drifting into gibberish.
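
To make the arithmetic concrete, here is a minimal sketch of the group-relative advantage in plain PyTorch. This is not trl's internal implementation, just the idea:

import torch

def group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards: shape (G,), one scalar reward per completion in the group
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, G = 4 completions; only the second one earned the full correctness reward.
rewards = torch.tensor([0.5, 2.5, 0.0, 0.5])
print(group_advantages(rewards))  # the 2.5 completion gets a large positive advantage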

Why does this matter for reasoning? Because with the right reward function — one that scores final answer correctness, not next-token likelihood — the model learns to discover chains of thought that lead to right answers. That is something SFT cannot do: SFT can only mimic the chains it has seen.


Step 1 — Install the stack

We will use trl for the trainer, transformers and datasets for plumbing, and unsloth for the LoRA + 4-bit speedup that makes GRPO fit on small GPUs.

pip install --upgrade "trl>=0.16" "transformers>=4.45" datasets accelerate
pip install --upgrade "unsloth[colab-new]@git+https://github.com/unslothai/unsloth.git"
pip install vllm  # optional but recommended for fast generation

Step 2 — Load the model and the GSM8K dataset

GSM8K is the standard grade-school math benchmark — about 7,500 word problems whose answers are integers. It is small, the ground truth is unambiguous, and it is the same dataset DeepSeek used to demonstrate GRPO. We will format each example with an explicit reasoning section so the model learns to externalize its thinking.

from unsloth import FastLanguageModel
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen2.5-0.5B-Instruct",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,  # uses vLLM under the hood
)

# Add a small LoRA so we don't update the full model.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",
)

SYSTEM_PROMPT = (
    "You are a careful math tutor. Reason step by step inside <think> tags, "
    "then write the final integer answer inside <answer> tags."
)

def format_example(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["question"]},
        ],
        "answer": example["answer"].split("####")[-1].strip(),
    }

ds = load_dataset("openai/gsm8k", "main", split="train").map(format_example)

Two details matter here. First, we attach a LoRA adapter so only ~1% of the parameters are updated — full fine-tuning would not fit on a 12 GB GPU. Second, we strip the GSM8K answer string down to just the integer after ####; that is what our reward function will compare against.
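
For example, a raw GSM8K answer field ends with the #### marker, so the split leaves only the final integer (the record below is abridged for illustration):

raw_answer = "Natalia sold 48/2 = 24 clips in May.\nAltogether she sold 48 + 24 = 72 clips.\n#### 72"
print(raw_answer.split("####")[-1].strip())  # "72" -- the string the reward function compares against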

Step 3 — Design the reward functions

Reward design is where GRPO projects live or die. A single all-or-nothing correctness reward gives the model almost no signal early on — it generates 16 completions, all wrong, advantages collapse to zero, nothing learns. The fix is multi-reward shaping: combine a strong sparse reward for being right with cheap dense rewards for following the format. Each completion gets a weighted sum of all of them.

import re

ANSWER_RE = re.compile(r"<answer>\s*(-?\d+)\s*</answer>", re.IGNORECASE)
FORMAT_RE = re.compile(r"<think>.*?</think>\s*<answer>.*?</answer>", re.DOTALL)

def correctness_reward(prompts, completions, answer, **kwargs):
    """+2.0 if the integer in <answer> matches ground truth, else 0."""
    rewards = []
    for c, gt in zip(completions, answer):
        text = c[0]["content"] if isinstance(c, list) else c
        m = ANSWER_RE.search(text)
        rewards.append(2.0 if m and m.group(1) == gt else 0.0)
    return rewards

def format_reward(completions, **kwargs):
    """+0.5 if the model used the <think>...</think><answer>...</answer> shape."""
    rewards = []
    for c in completions:
        text = c[0]["content"] if isinstance(c, list) else c
        rewards.append(0.5 if FORMAT_RE.search(text) else 0.0)
    return rewards

def integer_reward(completions, **kwargs):
    """+0.25 if whatever is in <answer> at least parses as an integer."""
    rewards = []
    for c in completions:
        text = c[0]["content"] if isinstance(c, list) else c
        m = ANSWER_RE.search(text)
        rewards.append(0.25 if m else 0.0)
    return rewards

The pattern: a strong terminal reward (+2.0) for being right, plus small shaped rewards (+0.5, +0.25) that the model can collect even on wrong answers. Those shaped rewards are the rungs of the ladder. They keep advantages non-zero in the early epochs so the gradient never vanishes, then become irrelevant once the model is consistently producing well-formed responses.
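
Before training, it is worth sanity-checking the reward functions on a hand-written completion. The sample below is made up, but it uses the same nested message format the functions above expect:

# A well-formed, correct completion should collect all three rewards: 2.0 + 0.5 + 0.25.
sample = [[{"content": "<think>3 + 5 = 8, then 8 - 2 = 6.</think> <answer>6</answer>"}]]
ground_truth = ["6"]

print(correctness_reward(prompts=None, completions=sample, answer=ground_truth))  # [2.0]
print(format_reward(completions=sample))                                          # [0.5]
print(integer_reward(completions=sample))                                         # [0.25]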

Step 4 — Configure GRPOConfig

The single most important parameter is num_generations — your group size G. Smaller groups are cheaper but produce noisier advantages; larger groups give cleaner learning signal but blow up memory because every generation has to fit alongside the others. Eight is the sweet spot for a 0.5B model on a single GPU.

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    output_dir="qwen2.5-0.5b-grpo-gsm8k",
    learning_rate=5e-6,           # GRPO wants a much smaller LR than SFT
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    num_generations=8,            # group size G
    max_prompt_length=512,
    max_completion_length=256,
    num_train_epochs=1,
    beta=0.04,                    # KL penalty against the reference model
    use_vllm=True,
    bf16=True,
    logging_steps=5,
    save_steps=200,
    report_to="none",
)

trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    args=training_args,
    train_dataset=ds,
    reward_funcs=[correctness_reward, format_reward, integer_reward],
)
trainer.train()

beta is the KL coefficient. Crank it up if the model starts producing weird, repetitive, or off-distribution text; turn it down if the model refuses to explore. 0.04 is a reasonable default that rarely causes trouble.

Step 5 — Watch the model learn to reason

Run training and tail the logs. You will see something interesting in the reward curves: the format and integer rewards saturate first (within a few hundred steps the model always emits <answer> tags), and only then does the correctness reward start climbing. That is the model graduating from learning structure to learning substance.

Quick before/after sample on a held-out problem:

# Prompt: "Janet has 3 apples and buys 5 more, then gives 2 to a friend. How many does she have?"

# Before GRPO (raw Qwen2.5-0.5B-Instruct):
"Janet has 8 apples."   # wrong, no reasoning

# After 1 epoch of GRPO:
<think>
Janet starts with 3 apples.
She buys 5 more, so she has 3 + 5 = 8.
She gives 2 to a friend, so 8 - 2 = 6.
</think>
<answer>6</answer>

On GSM8K validation, this exact recipe takes Qwen2.5-0.5B from roughly 38% accuracy to the high 50s after a single epoch. That is not magic — it is the model discovering, through trial and error, that showing its work correlates with getting the answer right. GRPO does not teach the model new facts; it teaches it which of its own behaviors pay off.


Evaluating the trained model

Training rewards are not accuracy. A reward curve that climbs nicely can still hide a model that has learned to reward-hack — for example, by emitting <answer>0</answer> on every problem because that satisfies both format and integer rewards. Always evaluate end-to-end on a held-out split.

from datasets import load_dataset

eval_ds = load_dataset("openai/gsm8k", "main", split="test").select(range(200))

def extract_answer(text):
    m = ANSWER_RE.search(text)
    return m.group(1) if m else None

correct = 0
for ex in eval_ds:
    inputs = tokenizer.apply_chat_template(
        [{"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": ex["question"]}],
        add_generation_prompt=True,
        return_tensors="pt",
        return_dict=True,  # return a dict so it can be unpacked into generate()
    ).to(model.device)
    out = trainer.model.generate(**inputs, max_new_tokens=256, do_sample=False)  # greedy decoding
    pred = extract_answer(tokenizer.decode(out[0], skip_special_tokens=True))
    gold = ex["answer"].split("####")[-1].strip()
    if pred == gold:
        correct += 1

print(f"Accuracy: {correct / len(eval_ds):.1%}")

Run this against the base model first to get your baseline, then against your GRPO checkpoint. The delta is what GRPO actually bought you. If the delta is small or negative, your reward function is the suspect — not the trainer.

GRPO vs SFT: when to choose which

These two are not interchangeable. SFT is unbeatable when you have high-quality input/output pairs and you want the model to imitate them — style transfer, function calling, brand voice, refusing certain prompts. The model never has to decide what is good; you already told it.

GRPO shines exactly where SFT is weakest: tasks where the path matters more than the answer, and where you can grade the answer programmatically. Math word problems, code generation against unit tests, retrieval-augmented QA where you can check the citations, agentic tool use with measurable success conditions. In all of those, the demonstrations you would need for SFT are expensive to produce, but the grader is cheap.

A common 2026 recipe is to do both: a short SFT pass on a few thousand high-quality demonstrations to lock in format and basic competence, then GRPO to push reasoning quality. The SFT pass costs you a few dollars and saves you days of GRPO converging on what a good response even looks like.
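
A minimal sketch of that warm-up pass with trl's SFTTrainer follows. The demonstration dataset name is a placeholder; swap in whatever curated <think>/<answer> examples you have, formatted as a conversational "messages" column:

from trl import SFTConfig, SFTTrainer

# Hypothetical warm-up on a few thousand curated demonstrations.
demos = load_dataset("your-org/math-demos", split="train")  # placeholder dataset

sft_trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    args=SFTConfig(
        output_dir="qwen2.5-0.5b-sft-warmup",
        num_train_epochs=1,
        learning_rate=2e-5,  # SFT tolerates a much larger LR than GRPO
        per_device_train_batch_size=4,
    ),
    train_dataset=demos,
)
sft_trainer.train()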

Common pitfalls

  • Reward hacking via length. If format_reward only checks for <think> tags, the model will learn to produce empty or repetitive thoughts. Always pair format rewards with a correctness signal that carries a much larger weight — see the stricter format check sketched just after this list.
  • Group size of 2 or 4. Tempting on small GPUs, but advantage estimates become so noisy the model trains backwards. Reduce max_completion_length before reducing num_generations.
  • Forgetting to freeze the reference model. If you accidentally update the reference, KL collapses to zero and the policy drifts off into nonsense. trl handles this for you, but if you write a custom trainer, double-check.
  • Sparse rewards only. A single +1/0 correctness reward on a hard task gives you flat learning curves for thousands of steps. Always layer in cheap shaped rewards.
  • Skipping vLLM. Generation dominates GRPO wallclock — every step generates G completions. Without vLLM you will be at 80% generation, 20% optimization. With it, the ratio flips.
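
On the length-hacking point, here is one possible hedge: a stricter variant of the earlier format_reward that only pays out when the <think> block contains some actual content. It reuses FORMAT_RE from Step 3, and the ten-word threshold is arbitrary — tune it to your task:

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def strict_format_reward(completions, **kwargs):
    """+0.5 only if the response has the right shape AND a non-trivial <think> section."""
    rewards = []
    for c in completions:
        text = c[0]["content"] if isinstance(c, list) else c
        m = THINK_RE.search(text)
        has_shape = bool(FORMAT_RE.search(text))
        has_content = bool(m) and len(m.group(1).split()) >= 10
        rewards.append(0.5 if has_shape and has_content else 0.0)
    return rewards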

Quick reference: GRPO knobs

Parameter              | Typical value    | What it controls
num_generations        | 8                | Group size G; bigger = cleaner advantages, more VRAM
learning_rate          | 1e-6 to 5e-6     | Much smaller than SFT; GRPO updates are sharper
beta (KL coef)         | 0.02 to 0.1      | Higher = stays closer to the reference model
max_completion_length  | 256 to 1024      | Cap on generated tokens; the biggest VRAM lever
reward weights         | 2.0 / 0.5 / 0.25 | Strong terminal reward plus shaped helpers

Next steps

Once the GSM8K loop is working, the same structure transfers anywhere you can score a completion programmatically: code generation (run the unit tests), structured extraction (parse and check the JSON), tool use (did the agent succeed?). The reward function is the only thing that has to change.
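
As an illustration, a hypothetical reward for code generation might execute each completion against the problem's tests. The sketch below assumes a "tests" dataset column holding an assert-based test string, and it skips the sandboxing a real setup would need:

def unit_test_reward(completions, tests, **kwargs):
    """+2.0 if the generated code passes the problem's tests, else 0."""
    rewards = []
    for c, test_code in zip(completions, tests):
        code = c[0]["content"] if isinstance(c, list) else c
        try:
            namespace = {}
            exec(code, namespace)       # run the model's code...
            exec(test_code, namespace)  # ...then the asserts against it
            rewards.append(2.0)
        except Exception:
            rewards.append(0.0)
    return rewards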

If you want to push the technique further, two directions are worth your time. First, try GRPO on a 3B or 7B model with Unsloth — the same code works, you just need a 24 GB GPU and a longer training run. Second, read the DeepSeekMath paper that introduced GRPO and the DeepSeek-R1 paper that scaled it to o1-class reasoning. Both are short, both are clear, and both will change how you think about post-training.

Reasoning is not a special faculty you bolt on; it is a behavior the model can be rewarded into. GRPO is just the cleanest, cheapest way to do the rewarding.
