
Anti-Martingale Proximal Policy Optimization.



In on-policy deep reinforcement learning (DRL), the sample data collected in one exploration process can be used to update the network parameters only once, so high sample efficiency is essential to accelerate training. The proposed method first derives a submartingale criterion from the equivalence between the optimal policy and a martingale, and then introduces an advanced value iteration (AVI) method that performs value iteration with high accuracy. On this foundation, an anti-martingale (AM) reinforcement learning framework is established to efficiently select the sample data that is most conducive to policy optimization. An AM proximal policy optimization (AMPPO) method, which combines the AM framework with proximal policy optimization (PPO), is then proposed to accelerate the value updates of states that satisfy the submartingale criterion. Experimental results on the MuJoCo platform show that AMPPO achieves better performance than several state-of-the-art DRL methods.
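As a rough illustration of the idea described above (not the authors' implementation, which is specified in the full paper), the sketch below combines the standard PPO clipped surrogate with a submartingale-style sample filter. The concrete filtering rule used here, keeping transitions whose state value is non-decreasing, and all function and variable names are assumptions introduced for illustration only.

```python
import numpy as np

def ppo_clip_loss(ratio, advantage, eps=0.2):
    """Standard PPO clipped surrogate objective (to be maximized)."""
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

def submartingale_mask(values, next_values):
    """Keep transitions whose state value does not decrease;
    a simplified stand-in for the paper's submartingale criterion."""
    return next_values >= values

# Toy batch: per-transition importance ratios, advantages, and values.
ratio     = np.array([1.1, 0.8, 1.3, 0.95])
advantage = np.array([0.5, -0.2, 1.0, 0.1])
values    = np.array([1.0, 2.0, 0.5, 1.5])
next_vals = np.array([1.2, 1.8, 0.9, 1.5])

# Select only the samples satisfying the (illustrative) criterion,
# then average the clipped surrogate over the retained transitions.
mask = submartingale_mask(values, next_vals)   # second sample is dropped
objective = ppo_clip_loss(ratio[mask], advantage[mask]).mean()
```

In this toy batch, the transition where the state value falls (2.0 to 1.8) is excluded before the PPO update, mirroring the idea of concentrating updates on samples consistent with the submartingale criterion.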

Keywords: policy optimization; martingale proximal; proximal policy; policy; anti martingale

Journal Title: IEEE Transactions on Cybernetics
Year Published: 2022

