🔧 Beyond Pretraining: A Visual Guide to Post-Training Techniques

Ever wondered when to use DPO vs PPO? Or why DeepSeek-R1 chose GRPO over other RL methods?

I created this visual guide to help navigate the post-training landscape - covering distillation, reward modeling, and RL techniques with practical decision frameworks.

What’s inside:

  • Decision tree for choosing between distillation (small models) vs RL (frontier models)
  • When to use PPO vs GRPO (hint: memory constraints matter! - see the sketch just after this list)
  • Reward type spectrum: rule-based vs subjective rewards
  • Real examples: SmolLM3 (APO only), Tulu3 (multi-approach), DeepSeek-R1 (GRPO)
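To make the PPO-vs-GRPO memory point concrete, here's a minimal sketch (plain PyTorch, function and variable names are my own) of the core idea behind GRPO: advantages come from normalizing rewards within a group of completions sampled for the same prompt, so there's no separate value/critic model to train and keep in memory the way PPO requires.

```python
import torch

def grpo_group_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Group-relative advantages in the spirit of GRPO.

    rewards: shape (num_prompts, group_size) - one scalar reward per sampled
    completion, with group_size completions drawn for each prompt.
    Each completion is scored against the mean/std of its own group,
    replacing the learned value baseline that PPO's critic provides.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each, rule-based 0/1 rewards
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_group_advantages(rewards))
```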

The guide synthesizes recent techniques like APO (Anchored Preference Optimization) and GRPO alongside classics like PPO and DPO, with visual frameworks to help you pick the right approach for your use case.
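For reference, the "classic" DPO objective that the newer preference methods build on fits in a few lines. This is a minimal sketch (function and argument names are mine, per-sequence log-probs assumed precomputed), not a drop-in from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-sequence log-probabilities (summed over
    tokens) for the chosen/rejected completion under the policy or the frozen
    reference model. beta controls how far the policy may drift from the reference.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between the implicit chosen and rejected rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Variants like APO keep this pairwise setup but change how the update is anchored relative to the reference model - the guide covers when that distinction matters.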

:link: Interactive Space

Free to use with attribution - perfect for talks, documentation, or just wrapping your head around this rapidly evolving space!

What post-training techniques have you found most effective? Always curious to hear what’s working in practice! :nerd_face:


Impressive visual guide, but just a heads up: things are moving so fast that a lot of these techniques are already outdated. Everyone's been too focused on the wrong things. I have already resolved the majority of the problems people are currently facing with AI tech simply by rewriting everything from scratch properly, as you can see from my screenshot. I'll be releasing these new gen-2 AI models on August 31. Training and whatnot is now obsolete. I don't train, I install.