ORSO: Accelerating Reward Design via Online Reward Selection and Policy Optimization

MIT, ETH Zurich, Boston University, Broad Institute of MIT and Harvard

Abstract

Reward shaping is critical in reinforcement learning (RL), particularly for complex tasks where sparse rewards can hinder learning. However, choosing effective shaping rewards from a set of candidate reward functions in a computationally efficient manner remains an open challenge. We propose Online Reward Selection and Policy Optimization (ORSO), a novel approach that frames the selection of a shaping reward function as an online model selection problem. ORSO automatically identifies performant shaping reward functions without human intervention and comes with provable regret guarantees. We demonstrate ORSO's effectiveness across various continuous control tasks. Compared to prior approaches, ORSO substantially reduces the amount of data required to evaluate a shaping reward function, resulting in superior data efficiency and up to an 8× reduction in computational time. ORSO consistently identifies high-quality reward functions, outperforming prior methods by more than 50%, and on average finds policies as performant as those trained with reward functions manually engineered by domain experts.

ORSO Teaser
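At a high level, ORSO interleaves reward selection with policy optimization: in each round, an online selection algorithm picks one candidate shaping reward, the corresponding policy is trained for a short interval, and the observed task reward is fed back to the selector. The sketch below is a minimal, schematic rendition of that loop, not our released implementation; `make_learner`, `selector`, and the method names are illustrative placeholders, and the actual selection rules (e.g., D3RB, Exp3) and training details follow the paper.

# Schematic sketch of the ORSO loop (illustrative, not the released code).
# Assumptions: `make_learner(env, shaping=r)` returns an RL learner (e.g., PPO)
# trained on the task reward plus shaping reward r; `selector` is any online
# model selection / bandit algorithm over the K candidates.

def orso(reward_fns, make_learner, selector, env, num_rounds, steps_per_round):
    # One policy learner per candidate shaping reward function.
    learners = [make_learner(env, shaping=r) for r in reward_fns]

    for _ in range(num_rounds):
        i = selector.select()                       # pick a candidate reward function
        learners[i].train(steps=steps_per_round)    # short policy-optimization interval
        score = learners[i].evaluate_task_reward()  # progress measured on the task reward only
        selector.update(i, score)                   # feed the observation back to the selector

    # Return the policy that performs best on the task reward.
    return max(learners, key=lambda l: l.evaluate_task_reward())

The key point is that candidates are evaluated through short, interleaved training intervals rather than one full training run per reward function, which is where the reported data and compute savings come from.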

Key Results

ORSO Achieves 56% Higher Task Reward

ORSO achieves 56% higher task reward within a limited interaction budget compared to previous methods.

ORSO Main

Comparison of Selection Algorithms

We evaluate ORSO with multiple selection algorithms. D3RB and Exp3 consistently outperform human-designed reward functions, while the naive selection algorithm performs poorly.

ORSO Algorithm Comparison
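For concreteness, a minimal Exp3 selector that plugs into the loop sketched above could look like the following. This is the textbook Exp3 update over the K candidate reward functions, not our exact configuration; the exploration rate gamma and the assumption that scores are normalized to [0, 1] are illustrative choices.

import numpy as np

class Exp3Selector:
    """Textbook Exp3 over K arms, where each arm is a candidate shaping reward."""

    def __init__(self, num_arms, gamma=0.1, seed=0):
        self.num_arms = num_arms
        self.gamma = gamma                       # exploration rate (illustrative value)
        self.weights = np.ones(num_arms)
        self.rng = np.random.default_rng(seed)
        self._last_probs = None

    def _probs(self):
        w = self.weights / self.weights.sum()
        return (1.0 - self.gamma) * w + self.gamma / self.num_arms

    def select(self):
        self._last_probs = self._probs()
        return int(self.rng.choice(self.num_arms, p=self._last_probs))

    def update(self, arm, score):
        # `score` is assumed to be normalized to [0, 1] (e.g., normalized task reward).
        x_hat = score / self._last_probs[arm]    # importance-weighted reward estimate
        self.weights[arm] *= np.exp(self.gamma * x_hat / self.num_arms)

The importance weighting keeps the reward estimates unbiased even though only the selected learner is trained and evaluated in a given round; D3RB instead follows a regret-balancing rule, whose guarantee is stated in regret-balancing notation in the next section.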

ORSO Handles Large Reward Sets

ORSO handles large sets of reward functions. We test ORSO on Ant with K = 48 and K = 96.

ORSO Large Reward Set

Regret Guarantees

Under event \(\mathcal{E}\) and Assumption 4.2, with probability \(1 - \delta\), the regret of every learner \(i\) is bounded for all rounds \(T\) as

\[ \sum_{t=1}^{n_T^i} \mathrm{reg}(\pi_{(t)}^i) \leq 6 d_T^{i_\star} \sqrt{n_T^{i_\star} + 1} + 5c \sqrt{(n_T^{i_\star} + 1) \ln \frac{K \ln T}{\delta}}, \]

where \(d_T^{i_\star} = d_{\left(n_T^{i_\star}\right)}^{i_\star}\).
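To read the bound (the precise definitions are in the paper; the following is how we summarize the notation here): \(n_T^i\) counts the rounds in which learner \(i\), the policy trained with the \(i\)-th candidate reward, has been played up to round \(T\); \(d^{i}_{(n)}\) is its data-driven regret coefficient after its \(n\)-th play; \(i_\star\) denotes the best-performing learner; and \(K\) is the number of candidate reward functions. Treating \(d_T^{i_\star}\) and \(c\) as constants, the bound reads

\[ \sum_{t=1}^{n_T^i} \mathrm{reg}(\pi_{(t)}^i) = \tilde{O}\!\left( \sqrt{n_T^{i_\star} + 1} \right), \]

so every learner's cumulative regret scales like that of the best learner: rounds spent on poorly shaped reward functions do not blow up the overall regret.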

Additional Animations

ORSO (D3RB) Allegro Hand (B=15, K=16)

Naive Allegro Hand (B=15, K=16)

BibTeX

@inproceedings{zhang2025orso,
  title={{ORSO}: Accelerating Reward Design via Online Reward Selection and Policy Optimization},
  author={Chen Bo Calvin Zhang and Zhang-Wei Hong and Aldo Pacchiano and Pulkit Agrawal},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=0uRc3CfJIQ}
}