research

Variance reduction for policy gradient with action-dependent factorized baselines

For developers building AI workflows that involve reinforcement learning, this research offers a concrete way to improve training stability and reduce computational costs, making RL more accessible for complex tasks.

OpenAI Blog·March 20, 2018·1 min read

What happened

OpenAI has published research on a variance reduction technique for policy gradient reinforcement learning. The method, called action-dependent factorized baselines, lowers the variance of the gradient estimate by giving each action dimension its own baseline, which can depend on the state and on the other action dimensions. Because the policy is assumed to factorize across action dimensions, subtracting these baselines leaves the gradient estimate unbiased, so training becomes more stable and sample-efficient. The benefit is largest for high-dimensional action spaces, such as in robotics or game playing. For builders, this can mean more reliable and faster convergence when training RL agents, potentially reducing computation time and improving policy quality. The work is part of ongoing efforts to make RL more practical for real-world applications.
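
To make the "lower variance without bias" claim concrete, the estimator can be sketched in LaTeX notation. This is an illustrative sketch, not a quotation from the research: it assumes the policy factorizes across the m action dimensions given the state, and the symbols (\pi_\theta, a^i, a^{-i}, b_i, Q^\pi, \eta) are notation chosen here for the example.

```latex
% Assumption: the policy factorizes across action dimensions given the state.
\pi_\theta(a \mid s) = \prod_{i=1}^{m} \pi_\theta(a^i \mid s)

% Policy gradient with a separate baseline b_i per action dimension.
% b_i may depend on the state and the other dimensions a^{-i}, but not on a^i.
\nabla_\theta \eta(\theta)
  = \mathbb{E}_{s,a}\!\left[ \sum_{i=1}^{m}
      \nabla_\theta \log \pi_\theta(a^i \mid s)
      \left( Q^{\pi}(s,a) - b_i(s, a^{-i}) \right) \right]

% Each baseline term vanishes in expectation over a^i, so no bias is introduced.
\mathbb{E}_{a^i}\!\left[ \nabla_\theta \log \pi_\theta(a^i \mid s)\, b_i(s, a^{-i}) \right]
  = b_i(s, a^{-i})\, \nabla_\theta \int \pi_\theta(a^i \mid s)\, da^i
  = b_i(s, a^{-i})\, \nabla_\theta 1
  = 0
```

The last line is the step that matters: since b_i does not depend on a^i, it can be pulled out of the expectation over a^i, and the remaining score-function term integrates to zero.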

Key takeaways

OpenAI proposed action-dependent factorized baselines to reduce variance in policy gradient methods.
The technique gives each action dimension its own baseline, which can depend on the state and on the other action dimensions.
It achieves lower variance without introducing bias, improving sample efficiency.
The method is especially beneficial for high-dimensional action spaces.
This research contributes to making reinforcement learning more stable and practical.
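
The takeaways above can be checked numerically on a toy problem. The following is a minimal sketch, not the setup used in the research: it assumes a one-step problem with a unit-variance Gaussian policy that is independent across dimensions, a hypothetical quadratic reward, and a per-dimension baseline equal to the reward averaged over that dimension's own action (one simple choice of action-dependent baseline, available in closed form here). The function names `gradient_samples` and `compare_baselines` are made up for this illustration.

```python
import numpy as np


def gradient_samples(a, theta, target, baseline):
    """Per-sample score-function gradient estimates for a one-step toy problem.

    Assumed policy: a ~ N(theta, I), independent across dimensions.
    Assumed reward: R(a) = -sum_i (a_i - target_i)^2.
    """
    score = a - theta                             # d/dtheta_i of log N(a_i; theta_i, 1)
    sq_err = (a - target) ** 2                    # per-dimension cost, shape (n, m)
    reward = -sq_err.sum(axis=1, keepdims=True)   # shape (n, 1)

    # E[(a_i - target_i)^2] under the policy, shape (m,)
    expected_sq_err = (theta - target) ** 2 + 1.0

    if baseline == "none":
        b = 0.0
    elif baseline == "constant":
        # Action-independent baseline: the expected reward E[R].
        b = -expected_sq_err.sum()
    elif baseline == "factorized":
        # Baseline for dimension i: reward averaged over a_i only,
        # with the other dimensions a_{-i} held at their sampled values.
        b = reward + sq_err - expected_sq_err     # shape (n, m)
    else:
        raise ValueError(f"unknown baseline: {baseline!r}")

    return score * (reward - b)                   # shape (n, m)


def compare_baselines(m=20, n_samples=100_000, seed=0):
    """Return the mean and summed per-dimension variance for each baseline."""
    rng = np.random.default_rng(seed)
    theta, target = np.zeros(m), np.ones(m)
    a = rng.normal(loc=theta, scale=1.0, size=(n_samples, m))
    results = {}
    for name in ("none", "constant", "factorized"):
        g = gradient_samples(a, theta, target, name)
        results[name] = {
            "mean": g.mean(axis=0),
            "total_variance": g.var(axis=0).sum(),
        }
    return results


if __name__ == "__main__":
    for name, stats in compare_baselines().items():
        print(f"{name:>10}: mean[0]={stats['mean'][0]:.3f}  "
              f"total variance={stats['total_variance']:.1f}")
```

In this toy setting the true gradient is -2 (theta - target) in every dimension, so all three estimators should agree on the mean while differing in variance. With the factorized baseline, the noise contributed by the other action dimensions cancels, which is why the gap over a constant baseline widens as the number of dimensions m grows.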