Mingfei Sun, Benjamin Ellis, Anuj Mahajan, Sam Devlin, Katja Hofmann, Shimon Whiteson
Abstract: Trust Region Policy Optimization (TRPO) is an iterative method that simultaneously optimizes a surrogate objective and constrains state distribution shifts via a trust region over policies. This constraint on distribution shifts is claimed to be crucial to the monotonic improvement guarantee for policy updates. However, solving a trust-region-constrained optimization problem can be computationally intensive, as it requires many conjugate gradient steps and a large number of on-policy samples. In this paper, we take a different perspective: instead of constraining the shifts, we seek a quantity that is invariant to them and can be leveraged to update the policy with a performance guarantee. In particular, we show that the natural policy gradient (NPG) remains invariant under state distribution shifts when the softmax tabular parameterization is used for the policy. We then propose Least Squares Policy Optimization (LeSPO), a simple yet effective least squares optimization method that approximates NPG policy updates. Empirical results show that LeSPO outperforms TRPO and Proximal Policy Optimization in terms of policy performance and sample efficiency on both Mujoco and Atari domains.
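The invariance claim can be illustrated with a known closed form: under the softmax tabular parameterization, the NPG step reduces to an additive update on the logits proportional to the advantage, with no state-distribution term. The sketch below is illustrative only and is not the paper's LeSPO method; the step size, discount, and advantage values are assumptions, and `npg_softmax_step` is a hypothetical helper name.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax over action logits."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def npg_softmax_step(theta, advantages, eta=0.1, gamma=0.99):
    """One NPG step for a softmax tabular policy.

    The closed form theta[s, a] += eta / (1 - gamma) * A(s, a) contains no
    state-visitation-distribution term, which is the invariance to state
    distribution shifts referred to in the abstract.
    """
    return theta + eta / (1.0 - gamma) * advantages

# Toy illustration with hypothetical advantage estimates for a
# 2-state, 2-action problem (values assumed for demonstration).
theta = np.zeros((2, 2))
A = np.array([[1.0, -1.0],
              [0.5, -0.5]])
theta = npg_softmax_step(theta, A)
pi = softmax(theta)  # probability of the higher-advantage action rises
```

Because the update depends only on the advantages, two runs that visit states with different frequencies still apply the same per-state logit change, which is the property the paper exploits.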