Key Results
ORSO Achieves 56% Higher Task Reward
Compared with prior reward-design approaches, ORSO achieves 56% higher task reward under the same limited environment-interaction budget.

Comparison of Selection Algorithms
We evaluate ORSO with multiple selection algorithms. D3RB and Exp3 consistently outperform human-designed reward functions, while the naive selection algorithm performs poorly.
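To make the comparison concrete, here is a minimal sketch of the Exp3 adversarial-bandit rule applied to reward-function selection, where each candidate reward function is one arm. The function names, the exploration parameter `gamma`, and the assumption that observed task reward is normalized to [0, 1] are illustrative choices, not details from the paper.

```python
import math
import random

def exp3_select(weights, gamma):
    """Sample an arm (reward function) from the Exp3 mixture distribution."""
    k = len(weights)
    total = sum(weights)
    # Mix the exponential-weights distribution with uniform exploration.
    probs = [(1 - gamma) * w / total + gamma / k for w in weights]
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r <= cum:
            return i, probs
    return k - 1, probs

def exp3_update(weights, probs, arm, reward, gamma):
    """Importance-weighted update for the chosen arm; reward assumed in [0, 1]."""
    k = len(weights)
    est = reward / probs[arm]  # unbiased estimate of the arm's reward
    weights[arm] *= math.exp(gamma * est / k)
```

A naive alternative (e.g., committing to one reward function after a fixed trial phase) lacks this adaptive reallocation, which is consistent with its poor performance in the evaluation.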

ORSO Handles Large Reward Sets
To show that ORSO handles large sets of reward functions, we test it on Ant with K = 48 and K = 96 candidate reward functions.

Regret Guarantees
Under event \(\mathcal{E}\) and Assumption 4.2, with probability \(1 - \delta\), the regret of every learner \(i\) is bounded for all rounds \(T\) as
\[ \sum_{t=1}^{n_T^i} \mathrm{reg}(\pi_{(t)}^i) \leq 6 d_T^{i_\star} \sqrt{n_T^{i_\star} + 1} + 5c \sqrt{(n_T^{i_\star} + 1) \ln \frac{K \ln T}{\delta}}, \]
where \(d_T^{i_\star} = d_{\left(n_T^{i_\star}\right)}^{i_\star}\).
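The bound suggests the shape of a regret-balancing selection rule: at each round, choose the learner whose estimated regret bound \(d^i \sqrt{n^i + 1}\) plus its confidence term is smallest. The sketch below is a schematic of that rule only; the variable names, the constant `c`, and the doubling of \(d^i\) elsewhere in the algorithm are assumptions for illustration, not the paper's exact procedure.

```python
import math

def balanced_select(n, d, c, delta, K, T):
    """Pick the learner minimizing its balanced regret-bound estimate.

    n[i]: number of rounds learner i has been selected so far
    d[i]: current regret coefficient for learner i
    The bound form mirrors d_i * sqrt(n_i + 1) plus a
    sqrt((n_i + 1) * ln(K ln T / delta)) confidence term.
    """
    def bound(i):
        conf = c * math.sqrt((n[i] + 1) * math.log(K * math.log(max(T, 2)) / delta))
        return d[i] * math.sqrt(n[i] + 1) + conf
    return min(range(len(n)), key=bound)
```

Under this rule, learners with fewer selections (smaller \(n^i\)) or smaller regret coefficients are preferred, which balances the cumulative regret estimates across learners.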