What is GRPO and how does it differ from PPO?

GRPO (Group Relative Policy Optimization) generates N completions per prompt, scores them with a reward function, and updates the model based on relative comparisons within each group. Unlike PPO, GRPO eliminates the need for a separate critic model, making it more memory-efficient and simpler to implement.

How do you design reward functions for structured scoring?

Combine multiple metrics into a multi-objective reward: per-answer exact match (0.3 weight), off-by-1 tolerance (0.3 weight), MAE-based continuous signal (0.25 weight), and format compliance (0.15 weight). Single-metric rewards are too sparse; the combined signal provides multiple learning pathways.

Why does a single-metric reward function fail in GRPO?

Single metrics like exact match produce sparse, binary signals: every completion gets 0 or 1, and most get 0. When a whole group scores the same, the group-relative advantage is zero and the policy has almost nothing to learn from. Multi-objective rewards spread signal across every completion so partial credit becomes a learning lever.

How do you tune the weights when combining multiple reward signals?

Start with 0.30 exact match, 0.30 off-by-1, 0.25 MAE, 0.15 format compliance. Run a few hundred GRPO steps and inspect which sub-reward is plateauing. Bump the lagging signal by 0.05–0.10, re-run, and iterate until average reward climbs steadily.

When should you skip GRPO and use SFT alone?

Skip GRPO when SFT alone hits your target accuracy. GRPO refines a policy that already learned the format and rough behavior. If SFT plateaus 5-15 points below target, GRPO earns the extra compute; if SFT is already at target, GRPO is wasted budget.

Designing Reward Functions for GRPO

How to teach an LLM to score better using multi-objective rewards

TL;DR

GRPO works best with multi-objective reward functions, not single metrics. Combine per-answer exact match, off-by-1 tolerance, MAE-based continuous signal, and format compliance — weighted to your use case. Use temperature 0.7, group size 4, and always run GRPO as a second stage after SFT. Single-metric rewards (like exact match alone) are too sparse for the model to learn from.

What GRPO actually does
Designing reward functions
The advantage calculation
When to use GRPO
What I'd do differently

Supervised fine-tuning gets you most of the way there. The model learns the format, follows instructions, produces reasonable outputs. Then it plateaus.

GRPO (Group Relative Policy Optimization), introduced by Shao et al. in the DeepSeekMath paper, is one way past that plateau. Instead of showing the model correct answers, you let it generate multiple outputs, score them with a reward function, and update the policy based on which outputs were better within each group. Unlike PPO, GRPO needs no separate critic model: the group's own mean reward serves as the baseline, which makes it more memory-efficient and simpler to implement.

The training loop is easy. Hugging Face TRL handles that. The reward function is where you actually have to think.

What GRPO actually does

Standard SFT: here's the input, here's the correct output, minimize cross-entropy.

GRPO:

1. For each prompt, generate N completions (typically 4-8).
2. Score each completion with a reward function.
3. Compute advantages: how much better or worse was each completion compared to the group mean?
4. Update the model to make better completions more likely.

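from trl import GRPOConfig, GRPOTrainer

config = GRPOConfig(
    output_dir="grpo-output",    # Checkpoints and logs go here (any path works)
    num_generations=4,           # Generate 4 completions per prompt
    temperature=0.7,             # Enough randomness for diversity
    max_completion_length=64,    # Cap on generated tokens per completion
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
)

# model, tokenizer, dataset, and reward_function are assumed to be defined already
trainer = GRPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_funcs=reward_function,  # Receives the batch of completions, returns one float each
)
trainer.train()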
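
One detail the snippet above leaves implicit is the shape of `reward_function`. As I understand TRL's interface, a custom reward function receives the batch of completions, plus any extra dataset columns as keyword arguments, and returns a list with one float per completion. The sketch below adapts a per-sample scorer to that shape. The `ground_truth` column name, the `parse_scores` helper, and the stand-in scorer are assumptions for illustration, and it assumes completions arrive as plain strings rather than chat messages.

```python
import re


def parse_scores(text):
    """Hypothetical parser: every run of digits becomes one predicted score."""
    return [int(tok) for tok in re.findall(r"\d+", text)]


def per_answer_accuracy(predictions, ground_truth):
    """Stand-in scorer; swap in your own combined reward."""
    if not predictions or len(predictions) != len(ground_truth):
        return 0.0
    hits = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return hits / len(ground_truth)


def reward_function(completions, ground_truth, **kwargs):
    """Return one float per completion, in the same order.

    Assumes the dataset has a `ground_truth` column (a list of ints per row)
    that is passed through as a keyword argument.
    """
    return [
        per_answer_accuracy(parse_scores(text), truth)
        for text, truth in zip(completions, ground_truth)
    ]


# Local smoke test with three completions for the same prompt
rewards = reward_function(
    completions=["3 4 2 5", "The answers are: 3 4 3 5", "no idea"],
    ground_truth=[[3, 4, 3, 5]] * 3,
)
print(rewards)
```

Testing the function like this, outside the trainer, is a cheap way to catch parsing bugs before they show up as a flat reward curve.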

The thing that makes GRPO work: it doesn't need an absolute reward signal. It only needs relative comparisons within each group. "Completion A is better than Completion B" is enough to learn from, even if neither is perfect. This group-relative approach is what distinguishes GRPO from other RL methods, and it is the RL algorithm DeepSeek-R1 was trained with.

Designing reward functions

Your reward function defines what "better" means. Get it wrong and the model optimizes for the wrong thing.

The single-metric trap

Say you're training a model to output numerical scores. The obvious reward: exact match accuracy.

def exact_match_reward(predictions, ground_truth):
    if predictions == ground_truth:
        return 1.0
    return 0.0

Problem: too sparse. Most outputs score 0. The model gets no signal for outputs that are almost right. A prediction of [3, 4, 2, 5] when the answer is [3, 4, 3, 5] gets the same reward as [0, 0, 0, 0]. That's not useful.

Building a multi-objective reward

Instead of one metric, combine several that capture different aspects of quality.

Exact match, but per-answer instead of all-or-nothing. A prediction that gets 3 out of 4 right scores 0.75, not 0:

def exact_match_reward(predictions, ground_truth):
    if not predictions or len(predictions) != len(ground_truth):
        return 0.0
    matches = sum(1 for p, g in zip(predictions, ground_truth) if p == g)
    return matches / len(ground_truth)

Off-by-1 tolerance, because a prediction of 4 when the answer is 3 is much better than predicting 0. Without this signal, the model has no incentive to be approximately right:

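def off_by_1_reward(predictions, ground_truth, exact_bonus=0.3):
    # Empty or mismatched inputs score 0 (zip would otherwise truncate silently)
    if not predictions or len(predictions) != len(ground_truth):
        return 0.0
    total = 0.0
    for p, g in zip(predictions, ground_truth):
        error = abs(p - g)
        if error == 0:
            total += 1.0 + exact_bonus  # Bonus for exact
        elif error == 1:
            total += 1.0               # Acceptable tolerance
        # Anything further off adds nothing
    # Normalize so an all-exact prediction scores 1.0
    max_possible = len(ground_truth) * (1.0 + exact_bonus)
    return total / max_possible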

MAE-based reward for a continuous error signal. Unlike exact match, small improvements in accuracy always increase the reward:

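def mae_reward(predictions, ground_truth, max_mae=5.0):
    # Guard against empty or mismatched inputs (avoids dividing by zero)
    if not predictions or len(predictions) != len(ground_truth):
        return 0.0
    mae = sum(abs(p - g) for p, g in zip(predictions, ground_truth)) / len(ground_truth)
    # Linear falloff: 1.0 at zero error, 0.0 once MAE reaches max_mae
    return max(0.0, 1.0 - mae / max_mae)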
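
To see how these signals differ on the same predictions, here is a small self-contained comparison. The function names are shortened stand-ins for the definitions above, and the three candidate predictions are made up for illustration.

```python
def all_or_nothing(pred, truth):
    return 1.0 if pred == truth else 0.0


def per_answer(pred, truth):
    return sum(p == g for p, g in zip(pred, truth)) / len(truth)


def off_by_1(pred, truth, exact_bonus=0.3):
    total = 0.0
    for p, g in zip(pred, truth):
        err = abs(p - g)
        if err == 0:
            total += 1.0 + exact_bonus
        elif err == 1:
            total += 1.0
    return total / (len(truth) * (1.0 + exact_bonus))


def mae_based(pred, truth, max_mae=5.0):
    mae = sum(abs(p - g) for p, g in zip(pred, truth)) / len(truth)
    return max(0.0, 1.0 - mae / max_mae)


truth = [3, 4, 3, 5]
candidates = {
    "near miss": [3, 4, 2, 5],   # one answer off by 1
    "far miss": [3, 4, 0, 5],    # one answer off by 3
    "all wrong": [0, 0, 0, 0],
}

metrics = (all_or_nothing, per_answer, off_by_1, mae_based)
table = {
    name: tuple(metric(pred, truth) for metric in metrics)
    for name, pred in candidates.items()
}

print(f"{'':10s} all/none  per-ans  off-by-1  mae")
for name, row in table.items():
    print(f"{name:10s} " + "  ".join(f"{value:7.3f}" for value in row))
```

On these inputs, per-answer exact match cannot tell the near miss from the far miss (both get 3 of 4 right), while the off-by-1 and MAE rewards rank the near miss higher. MAE is the only one that still gives the all-wrong answer partial credit.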

Format compliance, because the model sometimes outputs the right answers wrapped in extra text ("The answers are: 3 4 2 5"). This pushes toward clean, parseable output:

import re

def format_reward(predictions, ground_truth, raw_response):
    reward = 0.0
    # Correct number of outputs?
    if len(predictions) == len(ground_truth):
        reward += 0.5
    # Clean format? (only digits and whitespace)
    if re.match(r'^[\d\s]+$', raw_response.strip()):
        reward += 0.3
    # No extra text? (exactly the numbers joined by single spaces)
    if raw_response.strip() == " ".join(str(p) for p in predictions):
        reward += 0.2
    return reward
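
A quick way to sanity-check the three tiers is to run a few raw responses through the function. The sketch below repeats `format_reward` so it runs on its own; `parse_scores` is a hypothetical parser (the original parsing code isn't shown), and the sample responses are made up.

```python
import re


def parse_scores(raw_response):
    """Hypothetical parser: every run of digits becomes one prediction."""
    return [int(tok) for tok in re.findall(r"\d+", raw_response)]


def format_reward(predictions, ground_truth, raw_response):
    reward = 0.0
    if len(predictions) == len(ground_truth):
        reward += 0.5
    if re.match(r"^[\d\s]+$", raw_response.strip()):
        reward += 0.3
    if raw_response.strip() == " ".join(str(p) for p in predictions):
        reward += 0.2
    return reward


truth = [3, 4, 2, 5]
responses = [
    "3 4 2 5",                    # clean: all three checks pass
    "3  4\n2 5",                  # only digits and whitespace, but irregular spacing
    "The answers are: 3 4 2 5",   # right numbers wrapped in extra text
]

scores = [format_reward(parse_scores(raw), truth, raw) for raw in responses]
for raw, score in zip(responses, scores):
    print(f"{raw!r:30} -> {score:.1f}")
```

Note that the wrapped response still earns the 0.5 for having the right number of outputs, because this parser digs the numbers out of the surrounding text. If you want extra text to cost more, shift weight from the count check to the two format checks.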

Combining rewards

Weight the components and sum them:

from dataclasses import dataclass

@dataclass
class RewardConfig:
    exact_match_weight: float = 0.3
    off_by_1_weight: float = 0.3
    mae_weight: float = 0.25
    format_weight: float = 0.15

def combined_reward(predictions, ground_truth, raw_response, config):
    if not predictions or len(predictions) != len(ground_truth):
        return -0.5  # Heavy penalty for structural failure

    reward = (
        config.exact_match_weight * exact_match_reward(predictions, ground_truth) +
        config.off_by_1_weight * off_by_1_reward(predictions, ground_truth) +
        config.mae_weight * mae_reward(predictions, ground_truth) +
        config.format_weight * format_reward(predictions, ground_truth, raw_response)
    )

    # Bonus for perfect output
    if predictions == ground_truth:
        reward += 0.1

    return reward

The weights encode your priorities. If downstream systems can tolerate off-by-1 errors, weight exact match lower. If format parsing is what breaks your pipeline (it usually is), don't go below 0.15 for format weight.
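
When adjusting weights it is easy to drift away from a total of 1.0, which changes the reward scale between experiments. One option is to validate the sum and rebalance automatically. This is a sketch under my own assumptions: `CheckedRewardConfig` and `reweight` are hypothetical names, and keeping the weights summing to 1 is a convenience, not something GRPO requires.

```python
import math
from dataclasses import dataclass, fields


@dataclass
class CheckedRewardConfig:
    """Hypothetical config variant that rejects weights not summing to 1."""
    exact_match_weight: float = 0.3
    off_by_1_weight: float = 0.3
    mae_weight: float = 0.25
    format_weight: float = 0.15

    def __post_init__(self):
        total = sum(getattr(self, f.name) for f in fields(self))
        if not math.isclose(total, 1.0, abs_tol=1e-9):
            raise ValueError(f"reward weights sum to {total:.3f}, expected 1.0")


def reweight(config, name, delta):
    """Raise one weight by `delta` and shrink the others proportionally."""
    current = {f.name: getattr(config, f.name) for f in fields(config)}
    target = current[name] + delta
    rest = 1.0 - current[name]
    if rest <= 0.0 or not 0.0 < target < 1.0:
        raise ValueError(f"cannot move {name} to {target:.3f}")
    scale = (1.0 - target) / rest
    updated = {
        key: (target if key == name else value * scale)
        for key, value in current.items()
    }
    return CheckedRewardConfig(**updated)


base = CheckedRewardConfig()
tuned = reweight(base, "format_weight", 0.05)  # format 0.15 -> 0.20
print(tuned)
```

Rebalancing proportionally keeps the relative priorities among the untouched components intact, so each experiment changes one thing at a time.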

The advantage calculation

GRPO doesn't use raw rewards directly. It normalizes them within each group:

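import numpy as np

def compute_advantages(rewards, eps=1e-8):
    # A group of one has nothing to compare against
    if len(rewards) <= 1:
        return [0.0] * len(rewards)

    rewards = np.array(rewards)
    mean = np.mean(rewards)
    std = np.std(rewards)  # Population std (ddof=0)

    # Near-identical rewards: dividing by a tiny std would only amplify noise
    if std < 0.01:
        return [0.0] * len(rewards)  # No learning signal

    return ((rewards - mean) / (std + eps)).tolist()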
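
A worked example makes the normalization concrete. The sketch below is a standard-library version of the same calculation (`statistics.pstdev` is the population standard deviation, matching `np.std`'s default); `group_advantages` is my name for it and the reward values are made up.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8, min_std=0.01):
    """Standardize rewards within one group; zeros when there is no spread."""
    if len(rewards) <= 1:
        return [0.0] * len(rewards)
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma < min_std:
        return [0.0] * len(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([0.9, 0.6, 0.6, 0.3])        # one good, two average, one bad
shifted = group_advantages([10.9, 10.6, 10.6, 10.3])  # same ranking, offset by +10
flat = group_advantages([0.7, 0.7, 0.7, 0.7])         # every completion scored alike

for label, values in (("mixed", mixed), ("shifted", shifted), ("flat", flat)):
    print(f"{label:8s}", [round(v, 3) for v in values])
```

The mixed group comes out at roughly +1.41, 0, 0, -1.41: the best completion is pushed up, the worst pushed down, and the two average ones are left alone. The shifted group produces the same advantages, which is the "no absolute reward needed" property in action, and the flat group yields all zeros.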

Two edge cases to watch for.

If all 4 completions get the same score, std is 0 and advantages are undefined. Return zeros. There's nothing to learn from this group. This happens when the model has converged or when the prompt is trivially easy or impossibly hard.

If the model generates the same completion 4 times, GRPO can't compare alternatives. Check diversity and skip groups where fewer than half of the completions are unique:

def check_diversity(completions, min_unique_ratio=0.5):
    unique = set(completions)
    return len(unique) / len(completions) >= min_unique_ratio
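
Here is one way the check might be applied to a batch of prompt groups. This is a standalone filter, not a TRL hook, so where you call it depends on your own loop. Stripping whitespace before comparing and the `usable_groups` helper are my additions, and the completions are made up.

```python
def check_diversity(completions, min_unique_ratio=0.5):
    """True when enough completions are distinct (ignoring outer whitespace)."""
    if not completions:
        return False
    unique = {c.strip() for c in completions}
    return len(unique) / len(completions) >= min_unique_ratio


def usable_groups(groups, min_unique_ratio=0.5):
    """Keep only the prompts whose completions are diverse enough to compare."""
    return {
        prompt: completions
        for prompt, completions in groups.items()
        if check_diversity(completions, min_unique_ratio)
    }


groups = {
    "prompt-a": ["3 4 2 5", "3 4 3 5", "3 4 2 5", "2 4 3 5"],   # 3 of 4 unique
    "prompt-b": ["3 4 2 5", "3 4 2 5 ", "3 4 2 5", "3 4 2 5"],  # 1 unique after strip
}

kept = usable_groups(groups)
print(sorted(kept))
```

Logging how many groups get dropped per step is worth the extra line: a rising skip rate is an early sign that the temperature is too low or the policy has collapsed onto one answer.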

When to use GRPO

GRPO works best as a second stage after SFT:

Base Model → SFT (learn format + approximate answers) → GRPO (optimize accuracy)

Don't skip SFT. GRPO needs the model to already produce roughly correct outputs so it can compare quality among them. If the base model outputs nonsense, all completions score near zero and there's no relative signal to learn from.

Temperature is tricky. Too low and all completions are identical (no learning signal). Too high and completions are garbage (no useful comparisons). We used 0.7; anything from 0.6 to 0.9 works. The TRL GRPOTrainer documentation provides additional guidance on these hyperparameters.

Group size: more completions per prompt means better advantage estimates, but more compute. 4 is where we landed. The original DeepSeekMath paper used group sizes of 64 for math reasoning tasks, but for structured scoring with shorter outputs, 4 is enough for meaningful comparisons without training grinding to a halt.

What I'd do differently

Start with a binary reward. 1.0 if all answers are within tolerance, 0.0 otherwise. Simpler to debug. Only move to multi-objective if the binary version plateaus.
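
A minimal sketch of that binary starting point, assuming integer scores and a tolerance of 1 (the function name and the default tolerance are my choices):

```python
def within_tolerance_reward(predictions, ground_truth, tolerance=1):
    """1.0 only if every answer is within `tolerance` of the truth, else 0.0."""
    if not predictions or len(predictions) != len(ground_truth):
        return 0.0
    all_close = all(
        abs(p - g) <= tolerance for p, g in zip(predictions, ground_truth)
    )
    return 1.0 if all_close else 0.0


truth = [3, 4, 3, 5]
close_enough = within_tolerance_reward([3, 4, 2, 5], truth)  # one answer off by 1
too_far = within_tolerance_reward([3, 4, 0, 5], truth)       # one answer off by 3
wrong_length = within_tolerance_reward([3, 4, 3], truth)     # missing an answer
print(close_enough, too_far, wrong_length)
```

With only two possible values, a surprising reward is easy to trace back to a specific answer, which is the debugging advantage over a weighted sum.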

Log reward distributions during training. Track the mean and std of each component. If one component is always near 1.0, it's not providing useful signal. If one is always 0.0, the model can't optimize for it yet, so remove it temporarily.
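
A minimal sketch of that bookkeeping, assuming you already collect each component's per-completion values somewhere. The `history` dict, the 0.95 and 0.05 thresholds, and the status labels are placeholders of mine, and the numbers are made up.

```python
from statistics import mean, pstdev


def summarize_components(history, high=0.95, low=0.05):
    """Report mean/std per reward component and flag uninformative ones.

    `history` maps a component name to the values logged over recent steps.
    """
    report = {}
    for name, values in history.items():
        mu, sigma = mean(values), pstdev(values)
        if mu >= high:
            status = "saturated"    # always near 1.0: little signal left
        elif mu <= low:
            status = "dead"         # always near 0.0: model can't reach it yet
        else:
            status = "informative"
        report[name] = {"mean": mu, "std": sigma, "status": status}
    return report


history = {
    "format": [1.0, 1.0, 0.98, 1.0],
    "exact_match": [0.0, 0.0, 0.0, 0.05],
    "mae": [0.55, 0.62, 0.70, 0.66],
}

report = summarize_components(history)
for name, stats in report.items():
    print(f"{name:12s} mean={stats['mean']:.3f} std={stats['std']:.3f} {stats['status']}")
```

Checking the mean alone can mislead when a component is bimodal, so it helps to look at the std alongside the status flag before dropping anything.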

Don't trust the default weights. The "right" weights depend on your data distribution. Run a few training iterations with different combinations and evaluate on held-out data. The defaults I listed above are a starting point.

The reward function is the objective your model actually optimizes. Spend time on it. A model trained on a bad reward function will get very good at producing the wrong thing.