Generating More Varied Trajectories During Inference for GRPO

I have been working on using GRPO to train a Qwen3-4B Instruct model to (hopefully) get better at a domain-specific classification task. The reward functions focus on 3 sub-elements of the main classification task (i.e., the main classification is YES only when all 3 sub-elements are YES; otherwise NO).

In my last training run, I used num_generations = 32 (but only a per-device train batch size of 1) in the hopes that I would create enough diversity in the responses to push the gradients when calculating the advantage score for GRPO. However, my little model barely learned anything, and my signal was too quiet. I think the little model was a bit too deterministic when only being asked to output the yes/no sub-element answers. Ultimately, my trained model ended up slightly worse on the benchmark than the base model :sob:

To push the gradients a bit here, are there any tips on increasing the entropy of the responses when not using a reasoning model? I was thinking of requiring the model to include a one-sentence rationale for its answer, even if it is not graded during training, to increase token diversity, but I'm not sure this would be enough of a change.
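For concreteness, something like this is what I mean by the format change (a hypothetical illustration; the field names are placeholders, not my actual prompt):

```python
# Hypothetical illustration of the format change -- field names are placeholders.
CURRENT_FORMAT = (
    "main_classification: YES|NO\n"
    "sub_one: YES|NO\n"
    "sub_two: YES|NO\n"
    "sub_three: YES|NO"
)

# Same graded fields, but with an ungraded free-text line in front so the 32
# rollouts differ in more than a handful of near-deterministic YES/NO tokens.
RATIONALE_FORMAT = (
    "rationale: <one sentence, ignored by the reward functions>\n" + CURRENT_FORMAT
)
```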

Here’s a summary of the original training configs in my last failed run:

Sampling: 32 rollouts per prompt, using temperature=0.9, top_p=0.9, top_k=40.

Schedule: 2700 optimizer steps (approx. one pass over the 2700-sample subset), LR 2e-5 with a 20-step warmup, weight decay 0.01.

LoRA config: rank 16, alpha 32, dropout 0.05.
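For reference, here is roughly how that config maps onto TRL (a sketch only; I'm assuming a recent TRL release, and exact argument names can vary between versions):

```python
# Rough sketch of the failed run's setup in TRL terms -- not the exact script.
from peft import LoraConfig
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="qwen3-4b-grpo-run1",   # placeholder path
    num_generations=32,                # 32 rollouts per prompt
    per_device_train_batch_size=1,     # note: TRL generally wants the effective batch
                                       # (GPUs x per-device x grad accumulation)
                                       # to be divisible by num_generations
    temperature=0.9,
    top_p=0.9,
    top_k=40,
    max_steps=2700,                    # ~one pass over the 2700-sample subset
    learning_rate=2e-5,
    warmup_steps=20,
    weight_decay=0.01,
)

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
```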


Perhaps "reward variance per prompt" is what matters? GRPO computes the advantage within each prompt's group of rollouts, so if all 32 completions for a prompt earn the same reward, the advantages are all zero and there is no gradient to push.
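A minimal illustration of what I mean (toy numbers, not your actual rewards):

```python
# Toy illustration: GRPO normalizes rewards within each prompt's group of
# rollouts, so a group of identical rewards contributes zero gradient signal.
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    """Group-relative advantage: (r - mean) / (std + eps) over one prompt's rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_advantages(np.full(32, 1.0)))                  # all 32 rollouts agree -> all zeros
print(group_advantages(np.array([1.0] * 24 + [0.0] * 8)))  # mixed outcomes -> usable signal
```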


Thank you so much for the thoughtful reply!!!

I am going to incorporate your ideas into the next run. Using the token probabilities alone would probably help immensely.

  1. Original Reward Functions

For more context, my original reward functions in this failed run looked like this (a little chaotic, but it was my first time trying it):

Prompt / system directions: require YES/NO completions in the fixed main_classification; sub_one; sub_two; sub_three format, with explicit rules to set main_classification=YES only when all three sub-conditions are YES (and to force the other slots to NO when sub_one=NO).

Per-field weights using the verifiers / PI environment (each field scores 1.0 if the prediction matches the truth, otherwise 0.0, and is then weighted as below; a rough code sketch of the whole scoring scheme follows this list):

main_classification: 1.0

sub_one: 0.15

sub_two: 0.15

sub_three: 0.3

Consistency penalties (_PENALTY_VALUE = -1.0):

If all subfields come back False → additional -0.5 penalty

If main_classification=YES but any subfield is NO → -1.0.

If main_classification=NO but all subfields are YES → -1.0.

So…kind of a house of cards.
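In sketch form (simplified, not the exact verifiers / PI-environment code, and with the output parsing omitted), the scoring amounted to roughly this:

```python
# Rough sketch of the original reward -- pred/truth are dicts of "YES"/"NO" strings.
_PENALTY_VALUE = -1.0

_WEIGHTS = {
    "main_classification": 1.0,
    "sub_one": 0.15,
    "sub_two": 0.15,
    "sub_three": 0.3,
}

def original_reward(pred: dict, truth: dict) -> float:
    # Per-field accuracy: add the field's weight if the prediction matches the truth.
    score = sum(w for field, w in _WEIGHTS.items() if pred.get(field) == truth[field])

    subs = [pred.get(f) for f in ("sub_one", "sub_two", "sub_three")]

    # Extra penalty when every subfield comes back NO.
    if all(s == "NO" for s in subs):
        score += -0.5

    # Consistency penalties between the main label and the subfields.
    if pred.get("main_classification") == "YES" and any(s == "NO" for s in subs):
        score += _PENALTY_VALUE
    if pred.get("main_classification") == "NO" and all(s == "YES" for s in subs):
        score += _PENALTY_VALUE

    return score
```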

  2. What I was thinking of trying in round two

I was initially thinking of simplifying it by focusing only on the sub-elements and inferring the final binary classification from those responses, with straight +1/-1 scoring (instead of a weighted +1 or 0) and an additional +1 bonus if all three sub-elements are correct, so the reward range becomes [-3, +4]. The prompt would be simplified to include only one logic direction: if sub_one is NO, then all subs must be NO. That one is a quirk of my task and dataset.
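As a sketch (placeholder names again, not final code), the round-two reward would look something like:

```python
# Sketch of the proposed round-two reward: +1/-1 per sub-element plus a +1 bonus
# when all three are right, giving a range of [-3, +4]. Placeholder names only.
def simplified_reward(pred: dict, truth: dict) -> float:
    subs = ("sub_one", "sub_two", "sub_three")
    per_field = [1.0 if pred.get(f) == truth[f] else -1.0 for f in subs]
    bonus = 1.0 if all(s == 1.0 for s in per_field) else 0.0
    # main_classification is inferred from the sub-elements rather than scored separately.
    return sum(per_field) + bonus
```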

Instead of format penalties, I was thinking of doing a quick SFT run to teach the output format, although in the original run the model did seem to learn the format pretty quickly with the consistency penalty.
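Something like this minimal TRL SFT pass is what I have in mind for the format (a sketch, assuming a recent TRL version; the dataset contents and checkpoint name below are placeholders for my own):

```python
# Minimal format-teaching SFT sketch -- assumes a recent TRL release that accepts
# conversational datasets with a "messages" column; the data shown is a toy placeholder.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

format_dataset = Dataset.from_list([
    {"messages": [
        {"role": "user", "content": "<classification prompt goes here>"},
        {"role": "assistant", "content": (
            "main_classification: NO\nsub_one: NO\nsub_two: NO\nsub_three: NO"
        )},
    ]},
])

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B-Instruct-2507",   # assumption: the instruct checkpoint
    args=SFTConfig(
        output_dir="qwen3-4b-format-sft",  # placeholder path
        num_train_epochs=1,
        per_device_train_batch_size=4,
        learning_rate=1e-5,
    ),
    train_dataset=format_dataset,
)
trainer.train()
```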

I was not able to use Unsloth with my stack, and much like the folks who rent supercars for an hour in Miami or Las Vegas on vacation, I rented four H100s for this original experiment and most certainly did not properly utilize all that compute during training.
