Starting from REINFORCE, the original policy gradient algorithm, we will trace the evolution of policy gradient methods up to Group Relative Policy Optimization (GRPO), the algorithm used to train DeepSeek-R1.
This post skips the LLM side of things, less-related developments in RL, and most of the equations behind these algorithms, but it captures the essence and intuition of the RL timeline without wasting your time. This is all self-study, so feel free to send me any corrections or suggestions.