D2 Actor Critic:
Diffusion Actor Meets Distributional Critic
We develop a new model-free reinforcement learning (RL) algorithm: D2AC. Motivated by recent advances in model-based RL, we make two adjustments to the typical actor-critic RL pipeline. First, we learn distributional critics with a novel fusion of distributional RL and clipped double Q-learning. Second, we use a diffusion model to parameterize the policy, and derive an efficient method for aligning the diffusion process with policy improvement. These changes are highly effective, yielding strong model-free policies on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. D2AC significantly closes the performance gap between model-free and model-based RL methods, while being at least four times faster to run than its model-based counterparts.
What is an Actor-Critic method?
Actor-critic is a class of methods for solving reinforcement learning problems, in which an agent interacts with an environment to maximize cumulative reward.
In actor-critic methods, there are only two components: (i) an actor that selects actions in the environment; (ii) a critic that estimates the expected return of taking a given action in a given state.
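For concreteness, here is a minimal actor-critic update written in PyTorch. The network sizes, dimensions, and the DDPG-style losses are illustrative placeholders, not D2AC's actual configuration:

```python
import torch
import torch.nn as nn

# Minimal actor-critic sketch (illustrative only; not D2AC's architecture).
obs_dim, act_dim = 17, 6  # placeholder dimensions

actor = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                      nn.Linear(256, act_dim), nn.Tanh())           # maps state -> action
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                       nn.Linear(256, 1))                           # maps (state, action) -> Q-value

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

def update(batch, gamma=0.99):
    # obs, act: (batch, dim); rew, done: float tensors of shape (batch, 1)
    obs, act, rew, next_obs, done = batch

    # Critic: regress Q(s, a) toward the one-step TD target.
    with torch.no_grad():
        next_act = actor(next_obs)
        target = rew + gamma * (1 - done) * critic(torch.cat([next_obs, next_act], -1))
    critic_loss = ((critic(torch.cat([obs, act], -1)) - target) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: adjust the policy to maximize the critic's value estimate.
    actor_loss = -critic(torch.cat([obs, actor(obs)], -1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```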
What is Model-Predictive Control?
Model-predictive control (MPC) learns a model of the environment and uses it to plan actions. MPC is sample-efficient but computationally expensive.
MPC has two intriguing properties that we aim to capture in model-free D2AC (see the sketch after this list):
(i) MPC models a distribution over future returns, rather than a single point estimate;
(ii) MPC iteratively refines its action proposals, much like denoising diffusion iteratively refines actions for policy improvement.
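A minimal sketch of how such a planner might look, assuming a learned dynamics model `model(obs, act)` and a reward function `reward_fn(obs, act)`; both are hypothetical placeholders, not part of D2AC. The population of rolled-out returns forms a sampled distribution rather than a point estimate, and the action proposals are refined over several iterations, CEM-style:

```python
import torch

def mpc_plan(model, reward_fn, obs, horizon=10, pop=256, elites=32, iters=5):
    """CEM-style MPC sketch (illustrative; `model` and `reward_fn` are assumed learned).

    model(obs, act) -> next_obs        (learned dynamics, batched)
    reward_fn(obs, act) -> reward      (learned or known reward, batched)
    """
    act_dim = 6  # placeholder
    mean = torch.zeros(horizon, act_dim)
    std = torch.ones(horizon, act_dim)

    for _ in range(iters):  # iterative refinement of action proposals
        # Sample a population of action sequences around the current proposal.
        seqs = mean + std * torch.randn(pop, horizon, act_dim)

        # Roll out each sequence through the learned model and sum rewards:
        # the population of returns is a sampled distribution, not a point estimate.
        returns = torch.zeros(pop)
        state = obs.expand(pop, -1)
        for t in range(horizon):
            returns = returns + reward_fn(state, seqs[:, t])
            state = model(state, seqs[:, t])

        # Refit the proposal distribution to the best (elite) sequences.
        elite_idx = returns.topk(elites).indices
        mean, std = seqs[elite_idx].mean(0), seqs[elite_idx].std(0)

    return mean[0]  # execute the first action of the refined plan
```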
What is D2 Actor Critic?
D2AC employs distributional critics that model full return distributions, and additionally applies clipped double Q-learning to make the value estimates more accurate.
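The precise fusion used by D2AC is detailed in the paper; the sketch below shows one plausible way to combine categorical (C51-style) return distributions with clipped double Q-learning, where the target is taken from whichever of two critics has the lower expected return. All names, dimensions, and hyperparameters here are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Illustrative sketch: two categorical ("C51"-style) critics combined with
# clipped double Q-learning. One plausible fusion, not necessarily D2AC's exact one.
N_ATOMS, V_MIN, V_MAX = 51, -100.0, 100.0
support = torch.linspace(V_MIN, V_MAX, N_ATOMS)  # fixed return atoms

def make_critic(obs_dim, act_dim):
    # Outputs logits over the N_ATOMS return atoms for a given (state, action).
    return nn.Sequential(nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                         nn.Linear(256, N_ATOMS))

critic1, critic2 = make_critic(17, 6), make_critic(17, 6)

def target_distribution(next_obs, next_act):
    """Clipped double Q over distributions: keep the distribution whose
    expected return is smaller, to curb overestimation."""
    x = torch.cat([next_obs, next_act], dim=-1)
    p1 = torch.softmax(critic1(x), dim=-1)
    p2 = torch.softmax(critic2(x), dim=-1)
    q1 = (p1 * support).sum(-1, keepdim=True)   # expected return under critic 1
    q2 = (p2 * support).sum(-1, keepdim=True)   # expected return under critic 2
    return torch.where(q1 <= q2, p1, p2)        # pick the more pessimistic distribution
```

In a full implementation, the selected distribution would then be shifted by the reward, scaled by the discount, and projected back onto the fixed support to form the critics' training target, as in standard categorical distributional RL.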
D2AC uses a denoising diffusion model as the policy network, and derives a novel, simplified policy improvement objective that supervises each step of the denoising process, so that action proposals can be iteratively refined at inference time.
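As a rough illustration of this inference-time refinement (not D2AC's exact parameterization or training loss), a DDPM-style sampler conditions a noise-prediction network on the state and the denoising step, and iteratively turns Gaussian noise into an action. The schedule, network, and dimensions below are placeholder assumptions:

```python
import torch
import torch.nn as nn

# Illustrative diffusion-policy sketch: a network predicts the noise to remove
# at each denoising step, conditioned on the state and the step index.
K = 5                                             # number of denoising steps
betas = torch.linspace(1e-4, 0.2, K)              # placeholder DDPM-style schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

obs_dim, act_dim = 17, 6                          # placeholder dimensions
eps_net = nn.Sequential(nn.Linear(obs_dim + act_dim + 1, 256), nn.ReLU(),
                        nn.Linear(256, act_dim))  # predicts the noise in the action

@torch.no_grad()
def sample_action(obs):
    """Start from Gaussian noise and iteratively refine it into an action."""
    a = torch.randn(act_dim)
    for k in reversed(range(K)):
        k_emb = torch.tensor([k / K])                       # step-index conditioning
        eps = eps_net(torch.cat([obs, a, k_emb]))           # predicted noise at this step
        a = (a - betas[k] / torch.sqrt(1 - alpha_bars[k]) * eps) / torch.sqrt(alphas[k])
        if k > 0:
            a = a + torch.sqrt(betas[k]) * torch.randn(act_dim)  # keep some stochasticity
    return torch.tanh(a)                                    # squash to the action range

action = sample_action(torch.randn(obs_dim))      # example call with a dummy state
```

In D2AC, each such denoising step is additionally supervised by the policy improvement objective derived from the critic, steering the refinement toward higher-value actions; that loss is not shown in this sampling-only sketch.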
Video of D2AC Policies
No model-based planning is performed at either training or inference.