πŸš€ ReTool Update β€” GRPO + Generation Reuse, Smarter Batching, and Beyond

Hey all,
Quick update on my ReTool project β€” a custom train loop for GRPO-style training with more efficient generation.


:new_button: Key components of the train loop

1. Separated generation from update
No more tying gradient updates directly to when I generate completions. This gives more control over reuse and lets me structure training around steps_per_generation instead of being stuck with the 1:1 PPO pattern.

2. Generation reuse
Instead of

1 completion β†’ 1 gradient update   # PPO style

we can do:

generate:   4 completions β†’ store
train:      4 updates on stored generations

This amortizes generation cost over several updates without losing the group-relative advantages that GRPO computes.
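As a rough sketch of what "generate once, train several times" looks like (all names here are illustrative stand-ins, not the project's actual API), including the group-relative advantage normalization GRPO relies on:

```python
import random

def generate_group(prompt, group_size=4):
    """Stand-in for sampling `group_size` completions and scoring them."""
    rewards = [random.random() for _ in range(group_size)]
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
    # GRPO-style group-relative advantage: normalize within the group.
    advantages = [(r - mean) / std for r in rewards]
    return {"prompt": prompt, "rewards": rewards, "advantages": advantages}

stored = [generate_group(p) for p in ["p0", "p1"]]  # generate: store once
for update in range(4):                             # train: 4 updates, no regeneration
    for group in stored:
        pass  # gradient step on group["advantages"] would go here
```

The key point is that the (expensive) sampling happens once per buffer, while the (cheap) gradient steps replay the stored groups.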

3. Mini & micro batching
Within each set of stored generations:

  • Mini-batch β†’ processes multiple groups together for efficiency.

  • Micro-batch β†’ splits further for gradient accumulation and memory safety.

This combo keeps GPU memory happy while still maintaining good throughput.
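One way to picture the mini/micro split (sizes and names here are assumptions for illustration, not the project's actual config):

```python
def split_batches(samples, mini_size, micro_size):
    """Yield mini-batches, each split into micro-batches for accumulation."""
    for i in range(0, len(samples), mini_size):
        mini = samples[i : i + mini_size]
        yield [mini[j : j + micro_size] for j in range(0, len(mini), micro_size)]

stored = list(range(16))  # 16 stored generations
for micros in split_batches(stored, mini_size=8, micro_size=2):
    for micro in micros:
        pass  # forward/backward on `micro` here; gradients accumulate
    # optimizer.step(); optimizer.zero_grad()  # once per mini-batch
```

One optimizer step per mini-batch, with the backward passes spread over micro-batches, is what keeps peak memory bounded while preserving the effective batch size.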


:hammer_and_wrench: The training loop now

  1. Check if it’s time to generate new completions.

  2. If yes, run _generate_and_score_completions (with code execution where applicable) and store results.

  3. Train on stored generations using _train_on_stored_generations (handles both mini & micro batching).

  4. Log, monitor, and adapt LR via scheduler.

The batching logic is modular, so you can swap in your own GRPO/PPO/other loss function β€” the infrastructure still works.
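The four steps above can be sketched as a skeleton with a pluggable loss. The method names mirror the post, but the bodies are stand-ins, not the real implementation:

```python
def train(total_steps, steps_per_generation, loss_fn):
    stored = None
    history = []
    for step in range(total_steps):
        if step % steps_per_generation == 0:           # 1. time to regenerate?
            stored = [("completion", 1.0)]             # 2. stand-in for _generate_and_score_completions
        loss = sum(loss_fn(adv) for _, adv in stored)  # 3. stand-in for _train_on_stored_generations
        history.append((step, loss))                   # 4. log; a real loop also steps the LR scheduler
    return history

# Swapping the loss leaves the infrastructure untouched:
grpo_like = lambda adv: -adv  # placeholder, not the real GRPO objective
history = train(total_steps=8, steps_per_generation=4, loss_fn=grpo_like)
```

Because only `loss_fn` knows about the objective, trading GRPO for PPO (or anything else) is a one-argument change.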


I also wrote a Medium post with more details and a few debugging war stories.

If you’re building anything similar, I’d love to swap notes! :hammer_and_wrench:
