GenPO++: Generative Policy Optimization with Jacobian-free Likelihood Ratios

Ke Hu; Shutong Ding; Panxin Tao; Jingya Wang; Ye Shi


Ke Hu^1,2,*, Shutong Ding^1,2,*, Panxin Tao^1, Jingya Wang^1, Ye Shi^1,2,†

^1 ShanghaiTech University, ^2 InstAdapt

{huke2024, dingsht, taopx2022}@shanghaitech.edu.cn
{wangjingya, shiye}@shanghaitech.edu.cn

^* Equal contribution, ^† Corresponding author

Paper and code: coming soon

Abstract

Generative policies provide expressive and multimodal action distributions, making them attractive for reinforcement learning (RL) in complex continuous-control tasks. Among them, flow-based policies are especially appealing because they generate actions through deterministic transport maps. However, applying such generative policies to likelihood-based on-policy learning remains limited by the difficulty of evaluating the probability of executed actions. Existing flow-based RL methods either replace the true action-density ratio with approximate surrogates, which can introduce biased updates, or recover exact likelihoods through dummy-action augmentation, which enlarges the policy space and increases computation. In this work, we propose GenPO++, a reversible generative policy optimization framework that uses history states as auxiliary memory in a high-order reversible ODE solver, yielding exact inversion without changing the original action dimension. The resulting generative policy map has a log-determinant determined only by fixed solver coefficients, enabling exact and Jacobian-free likelihood-ratio computation. This design preserves the expressiveness of generative flow policies while avoiding both the bias of approximate action-density ratios and the overhead of dummy-action augmentation. We evaluate GenPO++ on large-scale simulated control, fine-tuning, and real-world robotic manipulation tasks, where it achieves performance competitive with or superior to state-of-the-art on-policy RL methods, while improving training stability and computational efficiency.
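To make the idea of a reversible solver with a coefficient-only log-determinant concrete, here is a minimal sketch. It is not the GenPO++ solver, which this page does not specify. It assumes a generic leapfrog-style two-step update, x_next = a * x_prev + b * h * f(x_cur, t), in which the previous state plays the role of the history memory. The names `velocity`, `forward`, `inverse`, `a`, and `b` are hypothetical, and `velocity` is a toy stand-in for a learned velocity field. The log-determinant tracked here is that of the joint map (x_prev, x_cur) -> (x_cur, x_next); how GenPO++ ties it to the action density is not covered by this sketch.

```python
import math


def velocity(x, t):
    # Hypothetical stand-in for a learned velocity field (not from the paper).
    return [math.tanh(xi) * (1.0 - t) + 0.5 * t for xi in x]


def forward(x_prev, x_cur, n_steps, h, a=1.0, b=2.0):
    """Integrate with x_next = a * x_prev + b * h * f(x_cur, t).

    Returns the final (history, current) pair and the accumulated
    log|det| of the joint map (x_prev, x_cur) -> (x_cur, x_next).
    """
    log_det = 0.0
    dim = len(x_cur)
    for k in range(n_steps):
        f = velocity(x_cur, k * h)
        x_next = [a * p + b * h * fi for p, fi in zip(x_prev, f)]
        x_prev, x_cur = x_cur, x_next
        # The step Jacobian is the block matrix [[0, I], [a*I, b*h*Df]].
        # Its |det| equals |a|**dim whatever Df is, so the Jacobian of the
        # velocity field is never evaluated.
        log_det += dim * math.log(abs(a))
    return x_prev, x_cur, log_det


def inverse(x_prev, x_cur, n_steps, h, a=1.0, b=2.0):
    """Undo `forward` step by step (exact up to floating-point rounding)."""
    for k in reversed(range(n_steps)):
        # Here x_prev holds the state at which the velocity was evaluated
        # during forward step k.
        f = velocity(x_prev, k * h)
        older = [(c - b * h * fi) / a for c, fi in zip(x_cur, f)]
        x_prev, x_cur = older, x_prev
    return x_prev, x_cur


# Round trip: start from a (history, current) pair, integrate, then invert.
z_hist, z0 = [0.3, -1.2], [0.1, 0.7]
hist_T, x_T, log_det = forward(z_hist, z0, n_steps=8, h=0.125)
rec_hist, rec_z0 = inverse(hist_T, x_T, n_steps=8, h=0.125)
```

With `a = 1` the accumulated log-determinant is zero, and for any other fixed `a` it equals `n_steps * dim * log|a|`. In both cases it depends only on the solver coefficients and not on the network Jacobian. Under these assumptions, a likelihood ratio between an old and a new policy would reduce to a ratio of base densities at the inverted latents, since the log-determinant terms are identical for both policies and cancel.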