Hey all,
Quick update on my ReTool project: a custom train loop for GRPO-style training with more efficient generation.
Key components of the train loop
1. Separated generation from update
No more tying gradient updates directly to when I generate completions. This gives more control over reuse and lets me structure training around steps_per_generation instead of being stuck with the 1:1 PPO pattern.
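Here's a minimal control-flow sketch of what I mean. The two helper functions are just placeholders so the snippet runs on its own, not the real generation/update code:

```python
# Placeholder for the real sampling call (e.g. vLLM or model.generate).
def generate_completions(prompt_batch):
    return [f"completion for {p}" for p in prompt_batch]

# Placeholder for one optimizer step on previously generated data.
def gradient_update(stored_completions):
    print(f"updating on {len(stored_completions)} completions")

steps_per_generation = 4
prompts = ["p1", "p2", "p3", "p4"]
stored = []

for step in range(8):
    if step % steps_per_generation == 0:
        stored = generate_completions(prompts)  # refresh only every N steps
    gradient_update(stored)                     # otherwise reuse the stored batch
```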
2. Generation reuse
Instead of:
1 completion → 1 gradient update   # PPO style
we can do:
generate: 4 completions → store
train: 4 updates on stored generations
This drops generation cost without losing the group advantage that GRPO gives.
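Roughly, the reuse looks like this. The `group_advantages` helper below is just an illustration of the standard GRPO group normalization, and the rewards are made up; the real scoring lives elsewhere in the trainer:

```python
import statistics

def group_advantages(rewards):
    # GRPO-style: advantage = (reward - group mean) / group std.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    return [(r - mean) / std for r in rewards]

# One generation pass: a single prompt group with 4 completions and their rewards.
stored = {
    "completions": ["c1", "c2", "c3", "c4"],
    "advantages": group_advantages([0.0, 1.0, 0.5, 1.0]),
}

# Several updates reuse the same stored group instead of regenerating each time.
for update in range(4):
    print(f"update {update}: advantages {stored['advantages']}")
```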
3. Mini & micro batching
Within each set of stored generations:
- Mini-batch: processes multiple groups together for efficiency.
- Micro-batch: splits further for gradient accumulation and memory safety.
This combo keeps GPU memory happy while still maintaining good throughput.
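Sketched with placeholder data and assumed batch sizes, the split looks like this (the "loss" is just arithmetic so the snippet runs standalone):

```python
stored = list(range(16))    # 16 stored completions (placeholder data)
mini_batch_size = 8         # completions consumed per optimizer step
micro_batch_size = 2        # completions per forward/backward pass

for i in range(0, len(stored), mini_batch_size):
    mini = stored[i:i + mini_batch_size]
    accumulated = 0.0
    for j in range(0, len(mini), micro_batch_size):
        micro = mini[j:j + micro_batch_size]
        # The forward/backward on `micro` would go here; losses accumulate.
        accumulated += sum(micro) / len(micro)
    # One optimizer step per mini-batch, after all micro-batches are accumulated.
    print(f"optimizer step on mini-batch {i // mini_batch_size}: loss {accumulated:.2f}")
```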
The training loop now
- Check if it's time to generate new completions.
- If yes, run _generate_and_score_completions (with code execution where applicable) and store the results.
- Train on stored generations using _train_on_stored_generations (handles both mini & micro batching).
- Log, monitor, and adapt the LR via the scheduler.
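Putting the pieces together, the skeleton looks roughly like this. The two method names are the real ones; the signatures, placeholder bodies, and print-based logging are just for illustration:

```python
class ReToolTrainerSketch:
    def __init__(self, steps_per_generation=4, total_steps=12):
        self.steps_per_generation = steps_per_generation
        self.total_steps = total_steps
        self.stored = None

    def _generate_and_score_completions(self):
        # Would run generation (plus code execution where applicable) and scoring.
        return {"completions": ["c1", "c2"], "advantages": [0.3, -0.3]}

    def _train_on_stored_generations(self, stored):
        # Would run the mini/micro-batched updates on the stored data.
        return 0.0  # placeholder loss

    def train(self):
        for step in range(self.total_steps):
            if step % self.steps_per_generation == 0:
                self.stored = self._generate_and_score_completions()
            loss = self._train_on_stored_generations(self.stored)
            # Logging and LR scheduling would hook in here.
            print(f"step {step}: loss {loss}")

ReToolTrainerSketch().train()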
The batching logic is modular, so you can swap in your own GRPO/PPO/other loss function; the infrastructure still works.
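For illustration, here is one way that interface could look. The callable-based design and the simplified surrogate loss are my shorthand here, not the exact code:

```python
from typing import Callable, Sequence

def grpo_style_loss(logprobs: Sequence[float], advantages: Sequence[float]) -> float:
    # Simplified policy-gradient-style surrogate: -mean(logprob * advantage).
    return -sum(lp * adv for lp, adv in zip(logprobs, advantages)) / len(advantages)

def train_micro_batch(logprobs, advantages, loss_fn: Callable) -> float:
    # The batching code only sees a callable, so the loss is swappable.
    return loss_fn(logprobs, advantages)

print(train_micro_batch([-1.2, -0.8], [0.5, -0.5], grpo_style_loss))
```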
I also wrote a Medium post with more details and a few debugging war stories.
If you're building anything similar, I'd love to swap notes!